
Introduction

In this tutorial, you will learn to fit a linear regression model in R and to run some model diagnostics. It is important that you are familiar with the theoretical background, otherwise you may not understand the different steps. In fact, let us start with a quiz. If you have trouble with the quiz, I suggest revisiting the teaching videos and material of this (and possibly the last) session.

Quiz

Load and inspect data

We use some soil data to demonstrate how to conduct a linear regression analysis. We first load the data that is contained in the package vegan.

data("varechem")

Exercises on data inspection

Now execute the code below to obtain information about the data. Then modify the code to inspect the data and count the number of variables in the data set.
?varechem
head(varechem)
Quiz
Use str() to check how many of the variables are numeric and how many are factors (i.e. categorical variables).
Quiz

First model

We examine the relationship between K and S to answer the question whether we can accurately predict S from K. A linear model can be fitted with the R function lm().

mod_1 <- lm(S ~ K, data = varechem)

Let us plot the observations and the fitted model. We can create simple scatter plots with plot(var1 ~ var2), where you replace var1 and var2 with the variables of interest. The model is added to the figure by providing the object containing the fitted model to abline().

par(cex = 1.4, las = 1)
plot(S ~ K, data = varechem, ylab = "S mg/kg", xlab = "K mg/kg")
abline(mod_1, lwd = 2)

The model looks appropriate, i.e. there is no indication of a non-linear relationship. The fitted model results can be returned with the summary() function.

summary(mod_1)
## 
## Call:
## lm(formula = S ~ K, data = varechem)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4299 -4.8300 -0.3528  2.6097 15.1027 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.45423    3.60030   3.459  0.00223 ** 
## K            0.15183    0.02059   7.374 2.22e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.403 on 22 degrees of freedom
## Multiple R-squared:  0.7119, Adjusted R-squared:  0.6989 
## F-statistic: 54.38 on 1 and 22 DF,  p-value: 2.222e-07

You should be able to interpret the output if you paid attention in the lecture.

Quiz

Now that we have just talked about it, you are certainly able to answer the following question:

Quiz

Matrix calculation

In the lecture we discussed the matrix calculation for the regression model. Let’s see how this works in R. We use the matrix multiplication to derive the regression coefficients. Remember that the equation for the calculation of b is:
\(b = (X^T X)^{-1} X^T Y\)
Run the code below and inspect the resulting objects (inter, full, betas) to understand the different steps of the calculation.

# Create vector of 1s as intercept term 
inter <- rep(1, nrow(varechem))
inter
# Combine intercept term with values from X (K)
full <- cbind(inter, varechem$K)
full

# Implement the equation for the calculation of b in R:
betas <- solve(t(full) %*% full) %*% (t(full) %*% varechem$S)
betas
Compare these results with those of the linear model. The coefficients of a model can be extracted with coef().
coef(mod_1)

For those who want to understand the details of matrix operations in R: The transpose of a matrix is computed by t(), matrix multiplication is executed by %*% and the inverse of a matrix is computed with solve().
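
If you want to try these operators in isolation, here is a minimal, self-contained sketch on a small example matrix (the values are arbitrary and only serve for illustration):

# A small example matrix (arbitrary values)
A <- matrix(c(2, 1, 0, 3), nrow = 2)
t(A)           # transpose
A %*% t(A)     # matrix multiplication
solve(A)       # inverse
solve(A) %*% A # check: yields the identity matrix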

Confidence intervals

We discussed the concept of confidence intervals in the lecture and its application to evaluate the accuracy of the estimated mean and of a new observation. You should remember that the confidence interval (CI) for a parameter \(\theta\) is given as: \(\theta\) \(\pm\) value related to confidence level and probability distribution \(\times\) standard error related to \(\theta\).
Check your knowledge in the following quiz; if you are not confident, repeat the sections on the standard error and confidence intervals.
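
To make the formula concrete, the following sketch computes the 95% confidence interval for the slope of K in mod_1 by hand, using the t distribution and the standard error from the summary table, and compares it with the built-in confint() (which we use again below):

# Estimate and standard error of the slope of K
est <- coef(summary(mod_1))["K", "Estimate"]
se  <- coef(summary(mod_1))["K", "Std. Error"]
# Value related to the confidence level: t quantile with residual degrees of freedom
tval <- qt(0.975, df = df.residual(mod_1))
# Confidence interval: estimate +/- t value * standard error
c(lower = est - tval * se, upper = est + tval * se)
# Compare with the built-in function
confint(mod_1, "K", level = 0.95)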

Quiz

Calculation of regression bands

Calculation

Confidence intervals for the regression coefficients can be calculated using the function confint(), i.e. confint(mod_1) for the present case (execute this code in a code window). In the following, we focus on confidence intervals for the estimated mean and on prediction intervals. The point-wise calculation of confidence intervals for the estimated mean and of prediction intervals for a new observation can be done for each value along the X-axis, yielding so-called regression bands. This is done easily in R via the predict() function. Inspect the objects ci and pi: what do they contain?
ci <- predict(mod_1, interval = "confidence", level = 0.95)
pi <- predict(mod_1, interval = "prediction", level = 0.95)
Warning in predict.lm(mod_1, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses

Hint: Just type in the name of the objects.
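
If printing the full objects is too much output, you can restrict the inspection to the first rows; both objects contain the fitted value (fit) as well as the lower (lwr) and upper (upr) bounds:

head(ci)
head(pi)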

Plotting

Now let's plot the model with regression bands.

par(las = 1, cex = 1.8, mar = c(5,5,1,1))
plot(S ~ K, data = varechem, ylab = expression(paste("S (mg ", kg^-1,")")), xlab = expression(paste("K (mg ", kg^-1,")")))
abline(mod_1, lwd = 2)
lines(sort(varechem$K), ci[order(varechem$K) ,2], lty = 2, col = 'blue', lwd = 2)
lines(sort(varechem$K), ci[order(varechem$K) ,3], lty = 2, col = 'blue', lwd = 2)
lines(sort(varechem$K), pi[order(varechem$K) ,2], lty = 2, col = 'red', lwd = 2)
lines(sort(varechem$K), pi[order(varechem$K) ,3], lty = 2, col = 'red', lwd = 2)

Now compute and visualise the confidence bands for a confidence level of 0.68 (you can copy the code from above and modify it in the code editor). Do you expect narrower or wider bands?
ci2 <- predict(mod_1, interval = "confidence", level = 0.68)
par(las = 1, cex = 1.8, mar = c(5,5,1,1))
plot(S ~ K, data = varechem, ylab = expression(paste("S (mg ", kg^-1,")")), xlab = expression(paste("K (mg ", kg^-1,")")))
abline(mod_1, lwd = 2)
lines(sort(varechem$K), ci2[order(varechem$K) ,2], lty = 2, col = 'blue', lwd = 2)
lines(sort(varechem$K), ci2[order(varechem$K) ,3], lty = 2, col = 'blue', lwd = 2)

Let’s end the topic with a quiz.

Quiz

Model accuracy

We discussed the concepts of R2 and Root Mean Square Error (RMSE) in the lecture. We will not deal in detail with these model accuracy measures here, as they will play a more important role in the following sessions. Nevertheless, by now you should be able to answer the following quiz question.

Quiz

Both are provided when calling the summary() function, though the RMSE is called Residual standard error in the output.

summary(mod_1)
## 
## Call:
## lm(formula = S ~ K, data = varechem)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4299 -4.8300 -0.3528  2.6097 15.1027 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.45423    3.60030   3.459  0.00223 ** 
## K            0.15183    0.02059   7.374 2.22e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.403 on 22 degrees of freedom
## Multiple R-squared:  0.7119, Adjusted R-squared:  0.6989 
## F-statistic: 54.38 on 1 and 22 DF,  p-value: 2.222e-07

The R2 is approximately 0.71, i.e. approximately 71% of the variance in S is explained by K. Additional information is provided in the variance table that can be called using anova().

anova(mod_1)

It provides the degrees of freedom (Df), sum of squares (Sum Sq) for K (MSS in lecture) and for the residuals (RSS in lecture) as well as the MSE (Mean Sq for Residuals). At this stage, the F value and Pr(>F) are not of interest to us.
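
As a cross-check, both accuracy measures can be computed by hand from the residuals of mod_1 (a minimal sketch):

# Residual sum of squares and total sum of squares
rss <- sum(residuals(mod_1)^2)
tss <- sum((varechem$S - mean(varechem$S))^2)
# R squared: proportion of the variance in S explained by the model
1 - rss / tss
# Residual standard error (RMSE in the summary output):
# square root of RSS divided by the residual degrees of freedom
sqrt(rss / df.residual(mod_1))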

Model diagnostics I

Introduction and quiz

We extensively discussed model diagnostics in the lecture. To check whether you have the knowledge required to understand this section, do the quiz and revisit the lecture in case you don’t know the correct answers.

Quiz

Diagnostic plots

The diagnostic plots can simply be called using the plot() function, for example as follows:

par(mfrow = c(1, 4))
plot(mod_1)

The argument mfrow in the par() function controls the number of plots in a plotting device (e.g. the Plots window in R, or a graphics file). par(mfrow = c(1, 4)) means that 1 row and 4 columns are set for plotting, hence 4 plots are drawn row-wise in one plotting device. This is more practical for diagnostics of the linear model than plotting a single plot one after another, which would happen if you called plot(mod_1) without a prior call of par(mfrow = c(1, 4)). For example, it is easier to spot whether the same observations are responsible for deviations in multiple plots.

However, to reduce complexity, we first restrict plotting to the QQ plot. We have discussed the QQ plot in the previous session and in case you struggle with the interpretation and diagnosis, revisit this session.

par(mfrow = c(1, 1))
plot(mod_1, which = 2)

The QQ plot shows that the standardized residuals largely follow the theoretical quantiles. Observations 24 and 9 deviate a bit, and we can check whether this is unusual using the qreference() function.

library(DAAG)
qreference(rstandard(mod_1), nrep = 8, xlab = "Theoretical Quantiles")

Our observations display a pattern similar to those from a normal distribution that are displayed in the reference plots. Hence, there is not much to worry about regarding the deviation of the two observations. Remember as well the discussion in the lecture that the normal distribution assumption is the least relevant and that deviations are often associated with violations of other assumptions (as we will also see later).

The linearity assumption can be checked by plotting the model. We have already done that (see section First model), and the figure did not indicate non-linearity. Moreover, a violation of the linearity assumption would also show up in the residuals vs. fitted values plots. We will use these plots in the following to evaluate this and the variance homogeneity assumption (also called homoscedasticity) as well as to identify model outliers. Thus, we focus on two diagnostic plots related to residuals vs. fitted values in the following, but as stated above, you would typically use four plots simultaneously once you are familiar with the different diagnostic plots.

par(mfrow = c(1,2))
plot(mod_1, which = c(1,3))

The Residuals vs Fitted plot (left) and the Scale-Location plot (right) can be used to evaluate linearity and variance homogeneity as well as to identify model outliers. The right plot is similar to the left plot but uses standardized residuals, or more precisely the square root of their absolute values. Standardized residuals are obtained by dividing each residual by its estimated standard deviation. Hence, they show how many standard deviations a point is away from the fitted regression model. Again, we see no sign of non-linearity. Regarding model outliers, we discussed that if the assumption of normal distribution holds, approximately 5% of values would deviate by more than 2 (1.96 to be precise) standard deviations.

Consider the definition of the standardized residuals in the lecture, the right plot and the number of observations (24) to answer the following quiz questions.

Quiz
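
After answering, you can verify your expectation with a short computation on the standardized residuals (a minimal sketch):

# How many of the 24 observations deviate by more than 2 standard deviations?
sum(abs(rstandard(mod_1)) > 2)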

Regarding variance homogeneity, in both plots the variance is rather similar along the gradient of fitted values suggesting that heteroscedasticity (unequal variance) is not an issue and that the assumption of variance homogeneity is met. Consider the slide Model diagnostics: Variance homogeneity from the lecture, if you don’t know what to look for. In fact, let’s test your ability to spot deviations.

Quiz diagnostic plots

In the following, you will see several plots with a quiz question below that asks you to evaluate the linear model assumptions.

Quiz

Quiz

Quiz

Quiz

Quiz

The increasing variance, i.e. the violation of the variance homogeneity assumption, is most easily spotted in the Scale-Location plot. Note the axis scaling and that the values are square-root transformed in this plot. For example, the third plot displays a much stronger increase in variance along the fitted values than the first plot. Non-linearity is most easily spotted in the Residuals vs Fitted plot. The QQ plot is shown to demonstrate that variance heterogeneity and non-linearity also lead to deviations from the normal distribution.

If you are interested in additional plots and ways to examine residuals check out this blog.

Dealing with violated assumptions

Intro and heteroscedasticity

Let us start with a quiz related to dealing with violated model assumptions.

Quiz

We have discussed several ways to deal with violated assumptions in the lecture. Here, we will only briefly discuss ways to deal with heteroscedasticity and non-linearity. Regarding heteroscedasticity, so-called sandwich estimators of the variance-covariance matrix can be used to compute more accurate estimates of the standard errors. We illustrate this in the following example, where the linear model exhibits an increasing variance.

plot(y_l ~ n, xlab = "Explanatory variable", ylab = "Response", las = 1)
abline(mod_incvar)

par(mfrow = c(1,3))
plot(mod_incvar, which = c(1:3))
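
The objects n, y_l and mod_incvar are pre-loaded in the tutorial environment. If you want to reproduce a comparable example yourself, the following sketch simulates data with increasing variance (object names and parameter values are chosen for illustration and are not the data used here):

set.seed(42)
n_sim <- 1:100
# The error standard deviation grows with the explanatory variable (heteroscedasticity)
y_sim <- 1 + 0.9 * n_sim + rnorm(100, mean = 0, sd = 0.3 * n_sim)
mod_sim <- lm(y_sim ~ n_sim)
par(mfrow = c(1, 3))
plot(mod_sim, which = 1:3)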

We use the sandwich estimator function vcovHC() provided in the sandwich package to compute corrected standard errors.

library(sandwich)
library(lmtest)
# Calculate variance covariance matrix
corse_lm <- vcovHC(mod_incvar)
# Compare original and corrected standard errors
coeftest(mod_incvar)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.406015   2.941152   0.138   0.8903    
## n           0.958550   0.050563  18.957   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mod_incvar, vcov = corse_lm)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.406015   2.030991  0.1999   0.8418    
## n           0.958550   0.053612 17.8794   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When inspecting the regression coefficients and their standard errors with coeftest(), we note that the standard errors changed (decreased for the intercept, increased for n), whereas the regression coefficients remained the same. You can make the same observation when comparing the output of coeftest() with the variance-covariance matrix obtained from the sandwich estimator to that of the summary() function.
coeftest(mod_incvar, vcov = corse_lm)
summary(mod_incvar)

Non-linearity

Sometimes non-linearity arises due to a quadratic or higher-order polynomial relationship between the response and the explanatory variables. We can check this by adding transformed explanatory variables to the model, and we briefly illustrate this for second- and third-order polynomial terms. For details on this issue see Matloff (2017: 247ff) and Faraway (2015: 139f).

Consider the model below that certainly is inappropriate, based on visual inspection of the model and model diagnostics.

mod_nl1 <- lm(y_nl1 ~ n)
plot(y_nl1 ~ n, xlab = "Explanatory variable", ylab = "Response", las = 1)
abline(mod_nl1, lwd = 2, col = "red")

par(mfrow = c(1,3))
plot(mod_nl1, which = c(1:3))

summary(mod_nl1)
## 
## Call:
## lm(formula = y_nl1 ~ n)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10874  -3117   1048   3795   4789 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7424.32     603.09   12.31   <2e-16 ***
## n            -314.51      10.37  -30.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4233 on 198 degrees of freedom
## Multiple R-squared:  0.8229, Adjusted R-squared:  0.822 
## F-statistic: 920.2 on 1 and 198 DF,  p-value: < 2.2e-16

Note that the model has a high R2, despite the obvious violation of model assumptions. See what happens when you add higher-order terms such as n2 and n3. For example, n2 can be added by expanding the model formula with + I(n^2). In the R formula notation, + means that a term is added and I() means that the content within the brackets is evaluated mathematically (i.e. squaring n in our case). Hence, if you wanted, for example, to add the sum of two variables (e.g. var1 and var2) as a term to the model, you would use + I(var1 + var2). Note that + var1 + var2 would simply add both terms individually to the model instead of their sum. To plot a model containing polynomials, we need to use a different function than abline(), which is limited to plotting straight lines. We first use predict() to obtain the fitted y values from our lm object and then pass these to the lines() function.

mod_nl1b <- lm(y_nl1 ~ n + I(n^2))
plot(y_nl1 ~ n, xlab = "Explanatory variable", ylab = "Response", las = 1)
# Extract fitted values
fit_nl1 <- predict(mod_nl1b)
# Plot
lines(n, fit_nl1, lwd = 2, col = "red")
par(mfrow = c(1,3))
plot(mod_nl1b, which = c(1:3))
summary(mod_nl1b)
Hopefully, you still diagnose a violation of model assumptions. See what happens when you modify the code by adding n3 to the formula.
mod_nl1c <- lm(y_nl1 ~ n + I(n^2) + I(n^3))
plot(y_nl1 ~ n, xlab = "Explanatory variable", ylab = "Response", las = 1)
# Extract fitted values
fit_nl2 <- predict(mod_nl1c)
# Plot
lines(n, fit_nl2, lwd = 2, col = "red")
par(mfrow = c(1,3))
plot(mod_nl1c, which = c(1:3))
summary(mod_nl1c)

The polynomial terms that we added to the model are correlated, for example:

cor(n, n^2)
## [1] 0.9688545

This can create problems in linear models, an issue that we will discuss in depth later (called multicollinearity). However, in the example above, the data-generating process, which is often unknown, was: \(y = 1n + 0.5n^2 - 0.04n^3 + \epsilon\) with \(\epsilon \sim \text{N}(0, 2)\). Note that the estimated regression coefficients are very close to the coefficients used in data generation. To avoid correlations, you can use poly() to produce orthogonal, i.e. non-correlated, polynomials. See Faraway (2015: 139f) for details. As a final remark, you should only use polynomials if the model can be meaningfully interpreted.
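
As a brief sketch of how poly() avoids these correlations (using the objects y_nl1 and n from above): the orthogonal polynomial terms are uncorrelated by construction, while the fitted curve is the same as for the raw polynomials.

# Fit a third-order polynomial using orthogonal polynomials
mod_poly <- lm(y_nl1 ~ poly(n, 3))
# Correlations between the polynomial terms are (numerically) zero
round(cor(poly(n, 3)), 10)
# The fitted values are identical to those of the raw-polynomial model
all.equal(unname(fitted(mod_poly)), unname(fitted(mod_nl1c)))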

Model diagnostics II

Introduction and quiz

As a final topic, we deal with unusual observations. Again, we start with a quiz and you should revisit the course materials if you are not able to correctly answer the questions.

Quiz

Diagnostic plots for unusual observations

We have already discussed how to identify model outliers, i.e. unusual fitted values. Here, we focus on influential observations, i.e. observations with a high leverage and high deviation from the model, which translates to a high standardized residual. This, together with regions of high Cook’s distance, is displayed via the following code:

plot(mod_1, which = 5)

As long as no observation falls outside the boundaries of Cook's distance (the dashed lines), we do not need to worry about influential points. This is the case for our model. Note that, as discussed above, we would typically inspect the four diagnostic plots simultaneously by simply calling:

par(mfrow = c(1, 4))
plot(mod_1)
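
Cook's distances can also be inspected numerically (a minimal sketch; a value of 1 is a common rule-of-thumb threshold, other cut-offs such as 4/n are also used):

# Cook's distance for each observation and the largest value in the model
cooks_d <- cooks.distance(mod_1)
max(cooks_d)
# Any observation above the rule-of-thumb threshold of 1?
any(cooks_d > 1)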

Now let us imagine that one or a few points exhibited a high influence. We discussed in the lecture how to deal with this. The function dfbeta() gives the change in the model parameters when an observation is omitted from the model.

# First display original coefficients
coef(mod_1)
## (Intercept)           K 
##  12.4542279   0.1518294
# Inspect change in coefficients when removing the respective observation
dfbeta(mod_1)
##     (Intercept)             K
## 18 -0.117819293  3.486850e-04
## 15 -0.095063015 -1.252574e-04
## 24 -0.495820559  6.914500e-03
## 27  0.619223018 -5.842052e-03
## 23 -0.004665359 -7.167213e-05
## 19  0.066434001  2.124234e-04
## 22 -0.133822505 -4.278988e-04
## 16 -0.554691791  1.702198e-03
## 28  1.335023654 -1.099510e-02
## 13 -0.029897434  2.194759e-04
## 14  0.133787689 -3.241402e-04
## 20 -0.141566901  2.346729e-03
## 25 -0.334683375  6.414539e-04
## 7   0.186031165 -8.826389e-04
## 5  -1.249052696  6.350175e-03
## 6  -0.082927547  4.095212e-04
## 3   0.406785367 -1.954887e-03
## 4  -0.429496301  4.405688e-03
## 2   0.178182459  1.493101e-04
## 9   1.622091870 -5.869943e-03
## 12 -0.059604343  2.181440e-04
## 10 -0.173216920  2.017827e-03
## 11  0.244299883 -3.601539e-03
## 21 -0.927963673  4.003770e-03

For example, removal of observation 9 increases the intercept (b0) by 1.6 and decreases b1 by 0.006. Considering the original coefficients, these changes would be minor. For a publication or report, it is best practice to report the model with and without an influential observation (of course, only if there is one). Although observation 9 is not influential, let us imagine that it was and create a model without this observation. This is generally fairly easy if the row name, which is reported in the diagnostic plots, is identical to the row number. Unfortunately, this is not the case in our example, hence we identify the row number first using which():

rem_obs <- which(rownames(varechem) == "9") 
mod_2 <- lm(S ~ K, data = varechem[-rem_obs, ])

We can now plot both models to compare them visually and compare the regression coefficients, R2 and standard errors. You should actually be able to do this at this stage. Complete the code below and interpret the differences.

# Plot without observation
plot(S ~ K, data = varechem[-rem_obs, ])
# Add point in different colour
points(varechem$K[rem_obs], varechem$S[rem_obs], col = "red")
# Plot both models
abline(mod_1, lwd = 2)
abline(mod_2, lwd = 2, lty = 2)
# Compare summaries
summary(mod_1)
summary(mod_2)
# The difference between the coefficients is reported with dfbeta
dfbeta(mod_1)[rem_obs, ]

Hint: Come on, you just need to put both model objects as arguments into the abline() and summary() functions.

Helper functions for lm objects

We end with an overview of some helper functions that provide or extract useful information from an lm object. We illustrate these functions for the model fitted in this case study.

  • coef(mod_1) Regression coefficients
  • fitted(mod_1) Fitted values (estimates of mean y given x)
  • confint(mod_1) Confidence intervals for the regression parameters
  • residuals(mod_1) Model residuals (i.e. estimates of the error)
  • rstandard(mod_1) Standardized model residuals
  • anova(mod_1) Variance table with explained and residual variance, MSE etc.
  • str(mod_1) Structure of linear model object, inspect to see what can be extracted from model
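
As a small illustration of how these pieces fit together (a minimal sketch): the fitted values and residuals add up to the observed response, and coefficients and their confidence intervals can be combined into one table.

# Observed response = fitted values + residuals
all.equal(unname(fitted(mod_1) + residuals(mod_1)), varechem$S)
# Regression coefficients next to their 95% confidence intervals
cbind(estimate = coef(mod_1), confint(mod_1))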

Finally, for publication-style output (i.e. formatting) of the model, you can use the sjPlot package, which provides various functions (see the package help and vignettes). The output can be saved directly to a file (e.g. Word (.doc) or OpenDocument (.odt)) using the file argument, for example: tab_model(mod_1, file = "Filename.odt").

library(sjPlot)
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
tab_model(mod_1)
                   S
Predictors         Estimates   CI             p
(Intercept)        12.45       4.99 – 19.92   0.002
K                  0.15        0.11 – 0.19    <0.001
Observations       24
R2 / R2 adjusted   0.712 / 0.699

Linear regression tutorial

Ralf B. Schäfer

2025-01-23