R Introduction Part 1

7 minute read

Published:

  • This is my lecture note written for undergraduate R programming course, which is a supplementary course of the Advanced Macroeconomics course.
  • Most figures and tables are extracted from R in Action.
  • All exercises are copied from my teacher Jing Fang’s R Workshop, you can find the data for exercises on the website of R Workshop.

What is R ?

R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which Chambers is a member. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. For more information, please see Wiki: R(programming language)

Why choose R?

  1. R is a GNU project and totally free for everyone.
  2. R runs on a wide array of platforms, like Windows, Unix and Mac OS. You can even run R programming on your iPhone or iPad!
  3. R can easily import data from various types of data sources, including text files, database management systems, statistical packages, and specialized data repositories. It can write data out to these systems as well.
  4. R contains advanced statistical routines not yet available in other packages. In fact, new methods become available for download on a weekly basis.
  5. R has state-of-the-art graphics capabilities (try the package: ggplot2).
  6. R has a global community of more than 2 million users and developers who voluntarily contribute their time and technical expertise to maintain, support and extend the R language and its environment, tools and infrastructure.

Comparing R with other mainstream softwares

  • Stata: Widely used in Economics field.
  • SAS: excels at coping with big data, the syntax is awkward.
  • Matlab: has strong ability of numerical value calculation and is incredible large.
  • SPSS: widely used in Business Management.
  • Eviews: is good at Time Series Analysis.

  ** Most important:** They are all commercial software, which means they are not free to individual user.

Installation of R

  • How to get R?
    R can be download from CRAN
    R Download Page
  • RStudio
    RStudio is an amazing IDE for R. I strongly recommend you to run R program on RStudio.
    RStudio can be download from RStuio Website
    RStudio Download Page

Packages of R

One feature of R is that most functions can be achieved by the installation of different packages. There are 6523 packages on the websit of CRAN and the number is still growing.

  • Install packages
    1.Install packages manually
    Download package from This Page, RStudio->Tools->Install Packages
    2.Install packages by using command install.package()
    (1) Use command install.packages(), then choose the package you need.
    (2) Use command install.packages("lmtest"), R will install the “lmtest” package and those supporting packages automatically.
    1. Load packages by using command library()
      Use library(lmtest) to load the “lmtest” package before you want to use this package.
      Note: You only need to install packages once, but you need to reload those needed packages every time after you restart RStudio.
    2. Update packages
      Use command update.packages() to update those packages you have installed.

Other notes on using R

  • R is a case sensitive language.
  • = and <- can both be used as assignment symbol.
    Personally, I do not think there is a big difference between them and I highly recommend using = instead of <-. For more information on the subtle difference between these two symbol, please see stackoverflow and 知乎.
  • We use # in R to indicate the following part is comment and should not be compiled.
  • R cannot read setwd("c:\myprogram"), you should use ‘setwd(“c:/myprogram”)’ or setwd("c:\\myprogram").
  • Please wisely use help() or ?the_name_of_funciton to read help documentation. The website Rseeker and Google are also really helpful.

Create dataset

A dataset is usually a rectangular array of data with rows representing observations
and columns representing variables. The following table provides an example of a hypothetical
patient dataset.
A Patient Dataset

  • How to construct a dataset?
    1.Import or type data into the data structures.
    2.Choose a type of data structures to store data.

  • Data input
    From the following figure, we can see that R can cope with different data formats.
    Different Data Types
    read.table() #This can be used to read txt file, the argument na.strings="." should be used to convert the missing value (“.” in txt file) to NA value.
    read.csv() #This can be used to read CSV file, highly recommended.
    load() #This can be used to load Rdata file.
    Note: You can find out more information on how to input other different types of data in the book R in Action (R语言实战).

  • Data structures
    R has a wide variety of objects for holding data, including scalars, vectors, matrices,
    arrays, data frames, and lists. They differ in terms of the type of data they can hold,
    how they are created, their structural complexity, and the notation used to identify and
    access individual elements. The figure shows a diagram of these data structures.
    Different Data Structures

1.Vectors
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical
data. The combine function c() is used to form the vector.

a=c(1,2,3,4)                #This is a numeric vector.  
b=c("one","two","three")    #This is a character vector.
c=c(TRUE,FALSE,TRUE)        #This is a logical vector.

2.Matrices
A matrix is a two-dimensional array where each element has the same mode (numeric,
character, or logical). Matrices are created with the matrix function.

x=matrix(1:20, nrow=5, ncol=4)  #This can create a 5*4 matrix using the number 1~20.

The comment operation symbols in matrix:
Dimensions of matrix x: dim(x)
Rows of matrix x: nrow(x)
Columns of matrix x: col(x)
Transpose of matrix x: t(x)
The value of the determinant of matrix x: det(x)
If the determinant is not 0, then we can get the inverse of matrix x: solve(x)
Eigenvalue and eigenvector: y=eigen(x), then y$val is eigenvalue, y$vec is eigenvector.
Multiplication of matrices: a %*% b
Arithmetic operations and power operation on every element of matrix: + - * / ^

3.Data frames
Personally, I think data frame is the most important and common data structure you will deal with in R. A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, etc.).

patientID= c(1, 2, 3, 4)
age = c(25, 34, 28, 52)
diabetes = c("Type1", "Type2", "Type1", "Type1")
status = c("Poor", "Improved", "Excellent", "Poor")
patientdata= data.frame(patientID, age, diabetes, status)
patientdata   #patientdata is a data frame.

The different ways to read data from data frame:

patientdata[1:2]
patientdata[c("diabetes", "status")]
patientdata$age

Useful mathematical functions and statistical functions

Mathematical Functions
Statistical Functions

Useful data operation functions

1.summary
summary is a generic function used to produce result summaries of the results of various model fitting functions.
summary(data$example) or summary(data)

2.sum
sum returns the sum of all the values present in its arguments.
sum(data$example, na.rm=TRUE)

3.nrow and ncol
nrow and ncol return the number of rows or columns present in x.

4.cbind and rbind
Take a sequence of vector, matrix or data-frame arguments and combine by columns or rows, respectively.

5.replace
replace replaces the values in x with indices given in list by those given in values.
data$V1=replace(data$V1,data$V1<5,0)
In this example, if the data in column V1 is less than 5, then those data will be converted to 0. This function is really useful when you want to convert your numerical variable to dummy variable.

6.subset
Return subsets of vectors, matrices or data frames which meet conditions.
subset(data, subset, select)

7.aggregate
Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
aggregate(data, by=list(data$variable), mean)
In this example, based on the column variable, R splits the data into subsets and computes mean of those subsets.

Reference books

Reference websites

Reference online courses

Tags: