R Introduction Part 2



  • These are my lecture notes for an undergraduate R programming course, which supplements the Advanced Macroeconomics course.
  • Most figures and tables are extracted from R in Action.
  • All exercises are copied from my teacher Jing Fang’s R Workshop; you can find the data for the exercises on the R Workshop website.

Regression with R

As you can see in the following table, the term regression can be confusing because there are so many specialized varieties. In this class we will only focus on OLS, Nonparametric Regression and Robust Regression.
Varieties of Regression Analysis

OLS regression

$\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_{1,i}+\dots+\hat{\beta_k}X_{k,i} \qquad i=1,\dots,n$
where $n$ is the number of observations and $k$ is the number of predictor variables. In this equation:
$\hat{Y_i}$ is the predicted value of the dependent variable for observation $i$ (specifically, it is the estimated mean of the $Y$ distribution, conditional on the set of predictor values).
$X_{j,i}$ is the $j^{th}$ predictor value for the $i^{th}$ observation.
$\hat{\beta_0}$ is the intercept (the predicted value of $Y$ when all the predictor variables equal 0).
$\hat{\beta_j}$ is the regression coefficient for the $j^{th}$ predictor (the slope representing the change in $Y$ for a unit change in $X_j$, holding the other predictors constant).
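
As a rough sketch of what the equation above means in practice, the coefficients can be computed directly from the normal equations and compared with the output of lm. The built-in mtcars data set and the choice of variables (mpg, wt, hp) are assumptions made purely for illustration; they are not the course data.

```r
# Illustrative data only: mtcars ships with R and is NOT the course data set.
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)  # design matrix with a column of ones
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y for the coefficient estimates
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# Fitted values: Y_hat_i = beta_0 + beta_1 * X_{1,i} + beta_2 * X_{2,i}
y_hat <- drop(X %*% beta_hat)

# The same model fitted with lm(), for comparison
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(beta_hat)
print(coef(fit))
```

Both printed vectors should agree up to numerical rounding, since lm solves the same least-squares problem (using a QR decomposition rather than the normal equations).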

To properly interpret the coefficients of the OLS model, the data must satisfy a number of statistical assumptions:

  • Normality: For fixed values of the independent variables, the dependent
    variable is normally distributed.
  • Independence: The $Y_i$ values are independent of each other.
  • Linearity: The dependent variable is linearly related to the independent
    variables.
  • Homoscedasticity: The variance of the dependent variable doesn’t vary with the
    levels of the independent variables.

lm(.) function

  • OLS regression can be conducted with the lm function.
  • Usage of lm:
    lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)

      Note: You can find a fuller explanation of each argument in the R help documentation (type ?lm).

  • An example:
    Suppose you want to fit the model $y=c+x_1+x_2+…+x_n$ by OLS, and the data are stored in a data frame named mydata.
    Type the following command in RStudio to tell R to run the OLS regression:
    reg <- lm(y ~ x1 + x2 + ... + xn, data = mydata)

      Then use summary(reg) to tell R to report the result.
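
To make the workflow above concrete, here is a minimal runnable sketch. It assumes the built-in mtcars data set as a stand-in for mydata, with mpg as the response and wt and hp as predictors; these names are illustrative and do not come from the course exercises.

```r
# mtcars stands in for mydata here (an assumption for illustration only)
mydata <- mtcars
reg <- lm(mpg ~ wt + hp, data = mydata)

# Full report: coefficients, standard errors, t-tests, R-squared, F-statistic
reg_summary <- summary(reg)
print(reg_summary)

# Individual pieces can also be pulled out of the fitted object
print(coef(reg))               # estimated coefficients
print(reg_summary$r.squared)   # R-squared
print(head(residuals(reg)))    # first few residuals
```

Storing the summary in an object is optional, but it makes it easy to reuse quantities such as the R-squared later on.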


The following table lists symbols that are commonly used in R regression formulas.
Symbols Commonly Used in Regression
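
As a hedged illustration of formula notation, the sketch below exercises a few standard R formula operators. It again assumes the built-in mtcars data set, and the model names f1 to f6 are hypothetical; the selection of operators is mine and may not match the table exactly.

```r
# ~ separates the response from the predictors; + adds a predictor
f1 <- lm(mpg ~ wt + hp, data = mtcars)

# : denotes an interaction term only
f2 <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# * expands to main effects plus their interaction, so f3 is the same model as f2
f3 <- lm(mpg ~ wt * hp, data = mtcars)

# I() protects arithmetic, so ^ here means "squared" rather than a formula operator
f4 <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# . stands for all other columns in the data frame; - removes a term
f5 <- lm(mpg ~ . - qsec, data = mtcars)

# - 1 drops the intercept
f6 <- lm(mpg ~ wt - 1, data = mtcars)

print(names(coef(f3)))
print(names(coef(f6)))
```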



Regression diagnostics

  • Test whether the coefficients in the regression model satisfy certain constraints (lht is provided by the car package).
    lht(reg, "the constraints you want to test")
  • Test whether some variables in your regression model are jointly significant.
    1. Use the anova function on two nested models
    anova(reg1, reg2)

      2. Use the waldtest function (provided by the lmtest package)


      Q: What is vcovHC? (Hint: Look for the answer in the R help documentation or on Google.)
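
The anova comparison for joint significance can be sketched as follows, using only base R. The data set (mtcars) and the specific restricted and unrestricted models are assumptions chosen for illustration. The F statistic is also recomputed by hand from the residual sums of squares to show what the test measures.

```r
# reg1: restricted model; reg2: unrestricted model with two extra predictors
# (mtcars and these variable choices are illustrative assumptions)
reg1 <- lm(mpg ~ wt, data = mtcars)
reg2 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# F-test of H0: the coefficients on hp and qsec are both zero
joint <- anova(reg1, reg2)
print(joint)

# The same F statistic computed by hand from residual sums of squares
rss1 <- sum(residuals(reg1)^2)   # restricted RSS
rss2 <- sum(residuals(reg2)^2)   # unrestricted RSS
q <- 2                           # number of restrictions being tested
f_manual <- ((rss1 - rss2) / q) / (rss2 / df.residual(reg2))
print(f_manual)
```

A small p-value in the anova table suggests that the extra variables are jointly significant.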




  • What is heteroskedasticity?
    One of our assumptions for linear regression is that the random error terms all have the same variance. Sometimes, however, a regression exhibits heteroskedasticity, which means the random error terms have different variances.

  • What may happen if heteroskedasticity does exist in our regression?
    In this situation, the usual standard errors are biased, so the t-test and F-test become unreliable and can be misleading: a coefficient may appear significant when it is not, or the reverse.

  • How to detect heteroskedasticity?

    1. Breusch-Pagan test


    2. White test

# This is one solution (bptest is provided by the lmtest package)
bptest(reg, ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = mydata)
#This is another one.
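
To make the idea behind both tests concrete, here is a by-hand sketch in base R. It is not necessarily the solution intended above. Each test regresses the squared residuals on a set of variables and uses n times the R-squared of that auxiliary regression as a chi-squared statistic. To my understanding this n·R² form corresponds to the studentized (Koenker) version of the Breusch-Pagan test. The mtcars data set and the variable names are assumptions for illustration.

```r
# Illustrative model on built-in data (not the course data)
reg <- lm(mpg ~ wt + hp, data = mtcars)
e2 <- residuals(reg)^2   # squared OLS residuals
n <- nobs(reg)

# Breusch-Pagan idea: regress e^2 on the original predictors
aux_bp <- lm(e2 ~ wt + hp, data = mtcars)
bp_stat <- n * summary(aux_bp)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)   # df = number of auxiliary regressors

# White idea: also include the squares and the cross product
aux_white <- lm(e2 ~ wt + hp + I(wt^2) + I(hp^2) + wt:hp, data = mtcars)
white_stat <- n * summary(aux_white)$r.squared
white_p <- pchisq(white_stat, df = 5, lower.tail = FALSE)

print(c(BP = bp_stat, p = bp_p))
print(c(White = white_stat, p = white_p))
```

A small p-value is evidence against the null hypothesis of constant error variance.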

  • How to get robust standard errors when heteroskedasticity appears?
# The first method (recommended)
coeftest(reg, vcov = vcovHC(reg, type = "HC0"))
# The second method
# The third method
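
For intuition about what the coeftest call with HC0 reports, the sketch below builds White's heteroskedasticity-consistent covariance matrix by hand in base R. It is an illustration rather than one of the methods listed above, and the data set (mtcars) and variable names are assumptions. The result should correspond to the HC0 estimator, though in practice the sandwich and lmtest packages are the more convenient route.

```r
# Illustrative model on built-in data (not the course data)
reg <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(reg)   # design matrix, including the intercept column
e <- residuals(reg)      # OLS residuals

# Sandwich formula: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
bread <- solve(crossprod(X))
meat <- crossprod(X * e)   # row i of X is scaled by e_i, so this sums e_i^2 * x_i x_i'
vcov_hc0 <- bread %*% meat %*% bread
robust_se <- sqrt(diag(vcov_hc0))

# Compare with the classical standard errors and redo the t-tests
classical_se <- sqrt(diag(vcov(reg)))
t_robust <- coef(reg) / robust_se
p_robust <- 2 * pt(abs(t_robust), df = df.residual(reg), lower.tail = FALSE)
print(cbind(estimate = coef(reg), classical_se, robust_se, t_robust, p_robust))
```

The coefficient estimates are unchanged; only the standard errors, and therefore the test statistics and p-values, differ.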