Introduction
In this tutorial, you will learn to fit a linear regression model in R and to run some model diagnostics. It is important that you are familiar with the theoretical background; otherwise you may not understand the different steps. In fact, let us start with a quiz. If you have trouble with the quiz, I suggest revisiting the teaching videos and material of this (and possibly the last) session.
Load and inspect data
We use some soil data to demonstrate how to conduct a linear regression analysis. We first load the data, which is contained in the package vegan.
library(vegan)
data("varechem")
Exercises on data inspection
Now execute the code below to obtain information about the data. Then modify the code to inspect the data and count the number of variables in the data set.
?varechem
head(varechem)
str(varechem)
Use str() to check how many of the variables are numeric and how many are factors (i.e. categorical variables).
First model
We examine the relationship between K and S to answer the question whether we can accurately predict S from K. A linear model can be fitted with the R function lm().
mod_1 <- lm(S ~ K, data = varechem)
Let us plot the observations and the fitted model. We can create simple scatter plots with plot(var1 ~ var2), where you replace var1 and var2 with the variables of interest. The model is added to the figure by providing the object containing the fitted model to abline().
par(cex = 1.4, las = 1)
plot(S ~ K, data = varechem, ylab = "S mg/kg", xlab = "K mg/kg")
abline(mod_1, lwd = 2)
The model looks appropriate, i.e. there is no indication of a non-linear relationship. The fitted model results can be returned with the summary() function.
summary(mod_1)
##
## Call:
## lm(formula = S ~ K, data = varechem)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4299 -4.8300 -0.3528 2.6097 15.1027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.45423 3.60030 3.459 0.00223 **
## K 0.15183 0.02059 7.374 2.22e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.403 on 22 degrees of freedom
## Multiple R-squared: 0.7119, Adjusted R-squared: 0.6989
## F-statistic: 54.38 on 1 and 22 DF, p-value: 2.222e-07
You should be able to interpret the output if you paid attention in the lecture. Having just discussed it, you should also be able to answer the following question:
Matrix calculation
In the lecture we discussed the matrix calculation for the regression model. Let's see how this works in R. We use matrix multiplication to derive the regression coefficients. Remember that the equation for the calculation of b is:
\(b = (X^T X)^{-1} X^T Y\)
Add the resulting objects to the code below to understand the different
steps of the calculation.
# Create vector of 1s as intercept term
inter <- rep(1, nrow(varechem))
# Combine intercept term with values from X (K)
full <- cbind(inter, varechem$K)
# Implement the equation for the calculation of b in R:
betas <- solve(t(full) %*% full) %*% (t(full) %*% varechem$S)
# Create vector of 1s as intercept term
inter <- rep(1, nrow(varechem))
inter
# Combine intercept term with values from X (K)
full <- cbind(inter, varechem$K)
full
# Implement the equation for the calculation of b in R:
betas <- solve(t(full) %*% full) %*% (t(full) %*% varechem$S)
betas
Compare the result to the regression coefficients of the fitted model, which can be extracted with coef():
coef(mod_1)
For those who want to understand the details of matrix operations in R: the transpose of a matrix is computed by t(), matrix multiplication is executed by %*%, and the inverse of a matrix is computed with solve().
Confidence intervals
We discussed the concept of confidence intervals in the lecture, along with its application to evaluate the accuracy of the estimated mean and of a new observation. You should remember that the confidence interval (CI) for a parameter \(\theta\) is given as: \(\theta \pm\) (value related to the confidence level and probability distribution) \(\times\) (standard error related to \(\theta\)). Check your knowledge in the following quiz; if you are not confident, repeat the sections on the standard error and confidence intervals.
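As a concrete illustration, here is a minimal sketch that recomputes the 95% confidence interval for the slope of mod_1 by hand, using the t quantile and the standard error from the model summary. It should match the output of confint(), which is introduced in the next section.
est   <- coef(summary(mod_1))["K", "Estimate"]      # estimated slope
se    <- coef(summary(mod_1))["K", "Std. Error"]    # its standard error
tcrit <- qt(0.975, df = df.residual(mod_1))         # t quantile for a 95% CI
c(lower = est - tcrit * se, upper = est + tcrit * se)
confint(mod_1)["K", ]                               # should give the same interval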
Calculation of regression bands
Calculation
Confidence intervals for the regression coefficients can be calculated using the function confint(), i.e. confint(mod_1) for the present case (execute this code in a code window). In the following, we focus on confidence intervals for the estimated mean and on prediction intervals. The point-wise calculation of confidence intervals for the estimated mean and of prediction intervals for a new observation can be done for each value along the X-axis, yielding so-called regression bands. This is done easily in R via the predict() function. Inspect the objects ci and pi; what do they contain?
ci <- predict(mod_1, interval = "confidence", level = 0.95)
pi <- predict(mod_1, interval = "prediction", level = 0.95)
Warning in predict.lm(mod_1, interval = "prediction", level = 0.95): predictions on current data refer to future responses
Hint: Just type in the name of the objects.
Plotting
Now let's plot the model with the regression bands.
par(las = 1, cex = 1.8, mar = c(5,5,1,1))
plot(S ~ K, data = varechem, ylab = expression(paste("S (mg ", kg^-1,")")), xlab = expression(paste("K (mg ", kg^-1,")")))
abline(mod_1, lwd = 2)
lines(sort(varechem$K), ci[order(varechem$K) ,2], lty = 2, col = 'blue', lwd = 2)
lines(sort(varechem$K), ci[order(varechem$K) ,3], lty = 2, col = 'blue', lwd = 2)
lines(sort(varechem$K), pi[order(varechem$K) ,2], lty = 2, col = 'red', lwd = 2)
lines(sort(varechem$K), pi[order(varechem$K) ,3], lty = 2, col = 'red', lwd = 2)
To see how the confidence level affects the width of the band, repeat the calculation for a 68% confidence band (complete the code below) and plot it.
ci2 <- predict(mod_1 ...)
ci2 <- predict(mod_1, interval = "confidence", level = 0.68)
par(las = 1, cex = 1.8, mar = c(5,5,1,1))
plot(S ~ K, data = varechem, ylab = expression(paste("S (mg ", kg^-1,")")), xlab = expression(paste("K (mg ", kg^-1,")")))
abline(mod_1, lwd = 2)
lines(sort(varechem$K), ci2[order(varechem$K) ,2], lty = 2, col = 'blue', lwd = 2)
lines(sort(varechem$K), ci2[order(varechem$K) ,3], lty = 2, col = 'blue', lwd = 2)
Let’s end the topic with a quiz.
Model accuracy
We discussed the concepts of R2 and Root Mean Square Error (RMSE) in the lecture. We will not deal in detail with these model accuracy measures here, as they will play a more important role in following sessions. Nevertheless, by now you should be able to answer the following quiz question.
Both are provided when calling the summary() function, though the RMSE is called Residual standard error in the output.
summary(mod_1)
##
## Call:
## lm(formula = S ~ K, data = varechem)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4299 -4.8300 -0.3528 2.6097 15.1027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.45423 3.60030 3.459 0.00223 **
## K 0.15183 0.02059 7.374 2.22e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.403 on 22 degrees of freedom
## Multiple R-squared: 0.7119, Adjusted R-squared: 0.6989
## F-statistic: 54.38 on 1 and 22 DF, p-value: 2.222e-07
The R2 is approximately 0.71, i.e. approximately 71% of the variance in S is explained by K. Additional information is provided in the variance table that can be called using anova().
anova(mod_1)
It provides the degrees of freedom (Df), sum of squares (Sum Sq) for K (MSS in lecture) and for the residuals (RSS in lecture) as well as the MSE (Mean Sq for Residuals). At this stage, the F value and Pr(>F) are not of interest to us.
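To make the link to the lecture explicit, here is a small sketch that recomputes both accuracy measures directly from the residuals of mod_1.
rss <- sum(residuals(mod_1)^2)                   # residual sum of squares (RSS)
tss <- sum((varechem$S - mean(varechem$S))^2)    # total sum of squares
1 - rss / tss                                    # R2, approximately 0.71
sqrt(rss / df.residual(mod_1))                   # RMSE / residual standard error, approximately 6.4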
Model diagnostics I
Introduction and quiz
We extensively discussed model diagnostics in the lecture. To check whether you have the knowledge required to understand this section, do the quiz and revisit the lecture in case you don’t know the correct answers.
Diagnostic plots
The diagnostic plots can simply be called using the plot() function, for example as follows:
par(mfrow = c(1, 4))
plot(mod_1)
The argument mfrow in the par() function controls the number of plots in a plotting device (e.g. the Plots window in R, or a file). par(mfrow = c(1, 4)) means that 1 row and 4 columns are set for plotting, hence 4 plots are plotted row-wise in a plotting device. This is more practical for diagnostics of the linear model than plotting a single plot one after another, which would happen if you called plot(mod_1) without a prior call of par(mfrow = c(1, 4)). For example, it is easier to spot whether it is the same observations that are responsible for deviations in multiple plots.
However, to reduce complexity, we first restrict plotting to the QQ plot. We have discussed the QQ plot in the previous session and in case you struggle with the interpretation and diagnosis, revisit this session.
par(mfrow = c(1, 1))
plot(mod_1, which = 2)
The QQ plot shows that the standardized residuals largely follow the theoretical quantiles. Observations 24 and 9 deviate a bit, and we can check whether this is unusual using the qreference() function.
library(DAAG)
qreference(rstandard(mod_1), nrep = 8, xlab = "Theoretical Quantiles")
Our observations display a pattern similar to those drawn from a normal distribution that are shown in the reference plots. Hence, there is not much to worry about regarding the deviation of the two observations. Remember as well the discussion in the lecture that the normal distribution assumption is the least relevant one and that deviations are often associated with violations of other assumptions (as we will also see later).
The linearity assumption can be checked by plotting the model. We have already done that (see section First model), and the figure did not indicate non-linearity. Moreover, violation of the linearity assumption would be spotted in the residuals vs. fitted values plots. We will use these plots in the following to evaluate this and the variance homogeneity assumption (also called homoscedasticity), as well as to identify model outliers. Thus, we focus on two diagnostic plots related to residuals vs. fitted values in the following, but as stated above, you would typically use all four plots simultaneously once you are familiar with the different diagnostic plots.
par(mfrow = c(1,2))
plot(mod_1, which = c(1,3))
The Residuals vs Fitted plot (left) and Scale-Location plot (right) can be used to evaluate linearity and variance homogeneity as well as to identify model outliers. The right plot is similar to the left plot but uses standardized residuals, or more precisely the square root of the standardized residuals. Standardized residuals are obtained by dividing each residual by its standard deviation. Hence, they show how many standard deviations any point is away from the fitted regression model. Again, we see no sign of non-linearity. Regarding model outliers, we discussed that if the assumption of normal distribution holds, approximately 5% of values would deviate more than 2 (1.96 to be precise) standard deviations.
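If you want to see where these values come from, here is a minimal sketch: each raw residual is divided by its estimated standard deviation, which for a linear model also accounts for the leverage of the observation. The result matches R's built-in rstandard().
sigma_hat <- summary(mod_1)$sigma                      # residual standard error
h <- hatvalues(mod_1)                                  # leverage of each observation
std_res <- residuals(mod_1) / (sigma_hat * sqrt(1 - h))
all.equal(std_res, rstandard(mod_1))                   # TRUE: identical to the built-in helper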
Consider the definition of the standardized residuals in the lecture, the right plot and the number of observations (24) to answer the following quiz questions.
Regarding variance homogeneity, in both plots the variance is rather similar along the gradient of fitted values suggesting that heteroscedasticity (unequal variance) is not an issue and that the assumption of variance homogeneity is met. Consider the slide Model diagnostics: Variance homogeneity from the lecture, if you don’t know what to look for. In fact, let’s test your ability to spot deviations.
Quiz diagnostic plots
In the following, you will see several plots with a quiz question below that asks you to evaluate the linear model assumptions.
The increasing variance, i.e. the violation of the variance homogeneity assumption, is most easily spotted in the Scale-Location plot. Note the axis scaling and that the values are square-root transformed in this plot. For example, the third plot displays a much stronger increase in variance along the fitted values than the first plot. Non-linearity is most easily spotted in the Residuals vs Fitted plot. The QQ plot is shown to demonstrate that variance heterogeneity and non-linearity also affect the normal distribution.
If you are interested in additional plots and ways to examine residuals check out this blog.
Dealing with violated assumptions
Intro and heteroscedasticity
Let us start with a quiz related to dealing with violated model assumptions.
We have discussed several ways to deal with violated assumptions in the lecture; here, we will only briefly discuss ways to deal with heteroscedasticity and non-linearity. Regarding heteroscedasticity, so-called sandwich estimators of the variance-covariance matrix can be used to compute more accurate estimates of the standard errors. We illustrate this in the following example, where the linear model exhibits an increasing variance.
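The objects n, y_l and mod_incvar are created behind the scenes in this tutorial. The exact simulation is not shown, so the following data-generating step is only a hypothetical sketch with assumed names and coefficients; the numerical output further below stems from the original, unseen data and will differ.
set.seed(42)
n <- seq(1, 100, length.out = 200)                      # explanatory variable
y_l <- 1 + 1 * n + rnorm(200, mean = 0, sd = 0.3 * n)   # error spread grows with n (heteroscedasticity)
mod_incvar <- lm(y_l ~ n)                               # linear model with increasing variance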
plot(y_l ~ n, xlab = "Explanatory variable", ylab = "Response", las = 1)
abline(mod_incvar)
par(mfrow = c(1,3))
plot(mod_incvar, which = c(1:3))
We use the sandwich estimator function vcovHC() provided in the sandwich package to compute corrected standard errors.
library(sandwich)
library(lmtest)
# Calculate variance covariance matrix
corse_lm <- vcovHC(mod_incvar)
# Compare original and corrected standard errors
coeftest(mod_incvar)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.406015 2.941152 0.138 0.8903
## n 0.958550 0.050563 18.957 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mod_incvar, vcov = corse_lm)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.406015 2.030991 0.1999 0.8418
## n 0.958550 0.053612 17.8794 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When inspecting the regression coefficients and their standard errors with coeftest(), we note that the standard errors changed (decreased for the intercept, increased for n), whereas the regression coefficients remained the same. You can make the same observation when comparing the output of coeftest() for the variance-covariance matrix obtained with a sandwich estimator to that from the summary() function.
coeftest(mod_incvar, vcov = corse_lm)
summary(mod_incvar)
Non-linearity
Sometimes non-linearity arises due to a quadratic or higher-order polynomial relationship between the response and explanatory variables. We can check this by adding transformed explanatory variables to the model; we briefly illustrate this for second- and third-order polynomial terms. For details on this issue see Matloff (2017: 247ff) and Faraway (2015: 139f).
Consider the model below, which is certainly inappropriate based on visual inspection of the model and the model diagnostics.
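As before, the response y_nl1 is generated behind the scenes. A hypothetical simulation, reusing n from the sketch above and following the data-generating process quoted at the end of this section, could look like this (an assumption for illustration only, so your numbers will differ from the output shown below):
set.seed(1)
# third-order polynomial relationship plus normal noise
y_nl1 <- 1 * n + 0.5 * n^2 - 0.04 * n^3 + rnorm(length(n), mean = 0, sd = 2)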
mod_nl1 <- lm(y_nl1 ~ n)
plot(y_nl1 ~ n, xlab = "Explanatory variable", ylab = "Response", las = 1)
abline(mod_nl1, lwd = 2, col = "red")
par(mfrow = c(1,3))
plot(mod_nl1, which = c(1:3))
summary(mod_nl1)
##
## Call:
## lm(formula = y_nl1 ~ n)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10874 -3117 1048 3795 4789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7424.32 603.09 12.31 <2e-16 ***
## n -314.51 10.37 -30.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4233 on 198 degrees of freedom
## Multiple R-squared: 0.8229, Adjusted R-squared: 0.822
## F-statistic: 920.2 on 1 and 198 DF, p-value: < 2.2e-16
Note that the model has a high R2, despite the obvious violation of model assumptions. See what happens when you add higher-order terms such as n^2 and n^3. For example, n^2 can be added by expanding the model formula with + I(n^2). In the R formula notation, + means that a term is added and I() means that the content within the brackets is evaluated mathematically (i.e. squaring n in our case). Hence, if you wanted, for example, to add the sum of two variables (e.g. var1 and var2) as a term to the model, you would use + I(var1 + var2). Note that + var1 + var2 would simply add both terms individually to the model instead of the sum. To plot a model containing polynomials, we need to use a different function than abline(), which is limited to the plotting of straight lines. For plotting, we first use predict() to obtain the fitted y's from our lm object. Subsequently, we use these data in the lines() function.
mod_nl1b <- lm(y_nl1 ~ n + I(n^2))
plot(y_nl1 ~ n, xlab = "Explanatory variable", ylab = "Response", las = 1)
# Extract fitted values
fit_nl1 <- predict(mod_nl1b)
# Plot
lines(n, fit_nl1, lwd = 2, col = "red")
par(mfrow = c(1,3))
plot(mod_nl1b, which = c(1:3))
summary(mod_nl1b)
Now extend the model with a third-order term (complete the code below).
mod_nl1c <- lm(y_nl1 ~ n + I(n^2))
plot(y_nl1 ~ n, xlab = "Explanatory variable", ylab = "Response", las = 1)
# Extract fitted values
fit_nl2 <- predict(mod_nl1c)
# Plot
lines(n, fit_nl2, lwd = 2, col = "red")
par(mfrow = c(1,3))
plot(mod_nl1c, which = c(1:3))
summary(mod_nl1c)
mod_nl1c <- lm(y_nl1 ~ n + I(n^2) + I(n^3))
plot(y_nl1 ~ n, xlab = "Explanatory variable", ylab = "Response", las = 1)
# Extract fitted values
fit_nl2 <- predict(mod_nl1c)
# Plot
lines(n, fit_nl2, lwd = 2, col = "red")
par(mfrow = c(1,3))
plot(mod_nl1c, which = c(1:3))
summary(mod_nl1c)
The polynomial terms that we added to the model are correlated, for example:
cor(n, n^2)
## [1] 0.9688545
This can create problems in linear models, an issue that we will discuss in depth later (called multicollinearity). However, in the example above, the data-generating process, which is often unknown, was: \(y = 1 n + 0.5 n^2 - 0.04 n^3 + \epsilon\) with \(\epsilon \sim \text{N}(0, 2)\). Note that the estimated regression coefficients are very close to the coefficients used in data generation. To avoid correlations, you can use poly() to produce orthogonal, i.e. non-correlated, polynomials. See Faraway (2015: 139f) for details. As a final remark, you should only use polynomials if the model can be meaningfully interpreted.
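A minimal sketch of this alternative, using the third-order model mod_nl1c fitted above: the orthogonal-polynomial fit produces the same fitted values, but its polynomial terms are uncorrelated.
mod_nl1_poly <- lm(y_nl1 ~ poly(n, 3))               # orthogonal polynomials up to degree 3
all.equal(fitted(mod_nl1_poly), fitted(mod_nl1c))    # same fitted values as the raw-polynomial model
round(cor(poly(n, 3)), 10)                           # the polynomial columns are uncorrelated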
Model diagnostics II
Introduction and quiz
As a final topic, we deal with unusual observations. Again, we start with a quiz and you should revisit the course materials if you are not able to correctly answer the questions.
Diagnostic plots for unusual observations
We have already discussed how to identify model outliers, i.e. observations that deviate strongly from the fitted model. Here, we focus on influential observations, i.e. observations with a high leverage and a high deviation from the model, which translates to a high standardized residual. This, together with regions of high Cook's distance, is displayed via the following code:
plot(mod_1, which = 5)
As long as no observation falls outside the boundaries of Cook's distance, we do not need to worry about influential points. This is the case for our model. Note that, as discussed above, we would typically inspect four diagnostic plots simultaneously by simply calling:
par(mfrow = c(1, 4))
plot(mod_1)
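If you prefer numbers over plots, the quantities behind this diagnostic plot can also be extracted directly; a small sketch:
lev <- hatvalues(mod_1)                        # leverage of each observation
cd  <- cooks.distance(mod_1)                   # Cook's distance of each observation
head(sort(cd, decreasing = TRUE), 3)           # observations with the largest influence on the fit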
Now let us imagine that one or a few points exhibited a high influence. We discussed in the lecture how to deal with this. The function dfbeta() gives the change in the model parameters when an observation is omitted from the model.
# First display original coefficients
coef(mod_1)
## (Intercept) K
## 12.4542279 0.1518294
# Inspect change in coefficients when removing the respective observation
dfbeta(mod_1)
## (Intercept) K
## 18 -0.117819293 3.486850e-04
## 15 -0.095063015 -1.252574e-04
## 24 -0.495820559 6.914500e-03
## 27 0.619223018 -5.842052e-03
## 23 -0.004665359 -7.167213e-05
## 19 0.066434001 2.124234e-04
## 22 -0.133822505 -4.278988e-04
## 16 -0.554691791 1.702198e-03
## 28 1.335023654 -1.099510e-02
## 13 -0.029897434 2.194759e-04
## 14 0.133787689 -3.241402e-04
## 20 -0.141566901 2.346729e-03
## 25 -0.334683375 6.414539e-04
## 7 0.186031165 -8.826389e-04
## 5 -1.249052696 6.350175e-03
## 6 -0.082927547 4.095212e-04
## 3 0.406785367 -1.954887e-03
## 4 -0.429496301 4.405688e-03
## 2 0.178182459 1.493101e-04
## 9 1.622091870 -5.869943e-03
## 12 -0.059604343 2.181440e-04
## 10 -0.173216920 2.017827e-03
## 11 0.244299883 -3.601539e-03
## 21 -0.927963673 4.003770e-03
For example, removal of observation 9 increases the intercept (b0) by 1.6 and decreases b1 by 0.006. Considering the original coefficients, these changes would be minor. For a publication or report, it is best practice to report the model with and without an influential observation (of course, only if there are any). Although observation 9 is not influential, let us imagine that it was and create a model without this observation. This is generally fairly easy if the row name, which is reported in the diagnostic plots, is identical to the row number. Unfortunately, this is not the case in our example, hence we identify the row number first using which():
rem_obs <- which(rownames(varechem) == "9")
mod_2 <- lm(S ~ K, data = varechem[-rem_obs, ])
We can now plot both models to compare them visually and compare the regression coefficients, R2 and standard errors. You should actually be able to do this at this stage. Complete the code below and interpret the differences.
# Plot without observation
plot(S ~ K, data = varechem[-rem_obs, ])
# Add point in different colour
points(varechem$K[rem_obs], varechem$S[rem_obs], col = "red")
# Plot both models
abline()
abline()
# Compare summaries
summary()
summary()
# The difference between the coefficients is reported with dfbeta
dfbeta(mod_1)[rem_obs, ]
Hint: Come on, you just need to put both model objects as arguments into the abline() and summary() functions.
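If you want to check your answer, one possible completion is shown below (the colour for the second model line is an arbitrary choice).
# Plot without observation 9
plot(S ~ K, data = varechem[-rem_obs, ])
# Add the removed point in a different colour
points(varechem$K[rem_obs], varechem$S[rem_obs], col = "red")
# Plot both models
abline(mod_1)
abline(mod_2, col = "blue")
# Compare summaries
summary(mod_1)
summary(mod_2)
# The difference between the coefficients is reported with dfbeta
dfbeta(mod_1)[rem_obs, ]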
Helper functions for lm objects
We end with an overview of some helper functions that provide or extract useful information from an lm object. We illustrate these functions for the model fitted in this case study.
coef(mod_1): Regression coefficients
fitted(mod_1): Fitted values (estimates of mean y given x)
confint(mod_1): Confidence intervals for the regression parameters
residuals(mod_1): Model residuals (i.e. estimates of the error)
rstandard(mod_1): Standardized model residuals
anova(mod_1): Variance table with explained and residual variance, MSE etc.
str(mod_1): Structure of the linear model object; inspect to see what can be extracted from the model
Finally, for publication-style output (i.e. formatting) of the model you can use the sjPlot package, which provides various functions (see the package help and vignettes). The output can be saved directly to a file (e.g. Word (.doc) or OpenDocument (.odt)) using the file argument, for example: tab_model(mod_1, file = "Filename.odt").
library(sjPlot)
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
tab_model(mod_1)
Dependent variable: S
Predictors         Estimates   CI             p
(Intercept)        12.45       4.99 – 19.92   0.002
K                  0.15        0.11 – 0.19    <0.001
Observations       24
R2 / R2 adjusted   0.712 / 0.699