In the social sciences, regresion analysis is a popular tool to estimate relationships between a dependent variable and one or more independent variables. It is a way to find trends in data, quantify the impact of input variables, and make predictions for unseen data.
In this building block, we illustrate how to estimate a model, identify outliers, plot a trend line, and make predictions.
Linear regression (
lm) is suitable for a response variable that is numeric. For logical values (e.g., did a customer churn: yes/no), you need to estimate a logistic regression model (
glm). The code sample below estimates a model, checks the model assumptions, and shows the regression coefficients.
- Model transformations can be incorporated into the formula, for example:
formula = log(y) ~ I(x^2).
- The coefficients (
coefficients(mdl)), predictions for the original data set (
fitted(mdl)), and residuals (
residuals(mdl)) can be directly derived from the model object.
- A concrete example on how to evaluate model assumptions (mean residuals is 0, residuals are normally distributed, homskedascticiy) can be found here.
library(broom) # estimate linear regression model # to estimate a logistic regression model use: # glm(formula = y ~ x, data = data, family = binomial) mdl <- lm(formula = y ~ x, data = data) # check model assumptions autoplot( mdl, which = 1:3, nrow = 1, ncol = 3 ) # show regression coefficients summary(mdl)
Compute the leverage of your data records and influence on
mdl to identify potential outliers.
library(dplyr) leverage_influence <- mdl %>% augment() %>% select(y, x, leverage = .hat, cooks_dist = .cooksd) %>% arrange(desc(cooks_dist)) %>%
Plot Trend Line
Plot a scatter plot of two numeric variables and add a linear trend line on top of it.
library(ggplot2) ggplot(data = data, aes(x, y)) + geom_points() + geom_smooth(method = "lm", se = FALSE)
Given a linear regression model (
mdl), make predictions for unseen input data (
explanatory_data). Note that for multiple linear regression models, you need to pass an
explanatory_data object with multiple columns.
explanatory_data <- c(..., ..., ...) prediction_data <- explanatory_data %>% mutate( y = predict( mdl, explanatory_data, type = "response" ) ) # See the result prediction_data
Export Model Output
You can export your model output using
stargazer. This package will create a nicely-formatted regression table for you in a variety of formats. You can learn more about it here.
Convert regression coefficients of
mdl_2 into a HTML file that can be copied into a paper.
library(stargazer) stargazer(mdl_1, mdl_2, title = "Figure 1", column.labels = c("Model 1", "Model 2"), type="html", out="output.html" )
This tutorial outlines how to run, evaluate, and export regression model results for the
cars dataset. In particular, it analyzes the relationship between a car’s speed and the stop distance.