Regression is predictive and directional, describing how one variable affects or influences another, whereas correlation is symmetric and only measures the degree of relationship between the variables.1
A very good tutorial exists at https://computationalthinking.mit.edu/Spring21/linearmodel_datascience/.
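A quick way to see the asymmetry, sketched below with synthetic data (variable names and numbers are purely illustrative): the slope from regressing y on x differs from the slope from regressing x on y, while the correlation coefficient is the same either way.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=1.5, size=200)

# Regression is asymmetric: the slope of y on x differs from the slope of x on y
slope_yx = np.polyfit(x, y, 1)[0]  # regress y on x
slope_xy = np.polyfit(y, x, 1)[0]  # regress x on y
print(slope_yx, slope_xy)

# Correlation is symmetric: corr(x, y) == corr(y, x)
# (note that slope_yx * slope_xy equals the squared correlation)
print(np.corrcoef(x, y)[0, 1], np.corrcoef(y, x)[0, 1])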
Multiple linear regression
https://simon.cs.vt.edu/SoSci/converted/MRegression/ walks through a multiple linear regression example, replicated below. The same sample data is also used for testing in the ekman package.
Lecture notes on the regression model in matrix form: http://pages.stern.nyu.edu/~gsimon/B902301Page/CLASS02_24FEB10/MatrixModel.pdf
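As a quick companion to those notes, the sketch below solves the normal equations \(\mathbf{b} = (X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}\mathbf{y}\) directly with NumPy on synthetic data (illustrative only; a least-squares solver is numerically preferable to an explicit inverse).
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # column of ones + 2 predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically safer least-squares solve
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b, b_lstsq)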
On the relation between linear regression coefficients and partial correlations (see also the sketch after these links):
- https://stats.stackexchange.com/questions/76815/multiple-regression-or-partial-correlation-coefficient-and-relations-between-th
- https://stats.stackexchange.com/questions/92992/whats-the-difference-between-regression-coefficients-and-partial-regression-coe?rq=1
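A minimal sketch of that relation, with synthetic data and illustrative variable names: residualizing both y and x1 on x2 and correlating the residuals gives the partial correlation, while the slope between the same residuals recovers the multiple-regression coefficient on x1 (the Frisch-Waugh-Lovell result), so the two quantities share a sign but generally differ in magnitude.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x2 = rng.normal(size=100)
x1 = 0.6 * x2 + rng.normal(size=100)  # correlated predictors
y = 1.5 * x1 - 1.0 * x2 + rng.normal(size=100)

# Full multiple regression: coefficient on x1
X = sm.add_constant(np.column_stack([x1, x2]))
b1 = sm.OLS(y, X).fit().params[1]

# Residualize y and x1 on x2, then compare
Z = sm.add_constant(x2)
ry = sm.OLS(y, Z).fit().resid
rx1 = sm.OLS(x1, Z).fit().resid
partial_corr = np.corrcoef(ry, rx1)[0, 1]  # partial correlation of y and x1 given x2
slope = np.polyfit(rx1, ry, 1)[0]          # equals b1 from the full regression
print(b1, slope, partial_corr)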
Computer software
For Python, use statsmodels (here with a testing dataset built into the ekman package). This is multiple linear regression, where a column of ones is added as the first column of \(X\) so that the intercept \(b_{0}\) is estimated along with the other coefficients.
import ekman
import statsmodels.api as sm

# Sales example dataset bundled with the ekman package
df = ekman.dataset("sales")
y = df.sales
X = df[["intelligence", "extroversion"]].to_numpy()
X = sm.add_constant(X)  # prepend the column of ones for the intercept
results = sm.OLS(y, X).fit()
print(results.summary())
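Individual quantities can also be pulled from the fitted results instead of reading the full summary table; for example (standard OLSResults attributes):
print(results.params)    # b0, b1, b2
print(results.rsquared)  # coefficient of determination
print(results.pvalues)   # p-values for each coefficient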
How to diagnose the quality of the fit?
Wilks (2006) says that checking the residuals is important, since that is where the earlier assumptions are applied. Whether the residuals behave consistently with those assumptions gives a picture of whether the results are misleading.
- Make a scatter plot of the residuals as a function of the predicted value \(\hat{y}\) and check whether the scatter is uniform, skewed, or contains outliers.
- Make a Q-Q plot of the residuals to check that they are Gaussian distributed.
- Make a scatter plot of the residuals against time to inspect for temporal correlation. The Durbin-Watson statistic is an indicator of this: values near 2 suggest uncorrelated residuals, while lower values point to positive autocorrelation. The critical value depends on the sample size and the number of predictors; for a sample of about 70 and 2 predictors, a value of roughly 1.7 or above is needed to assume temporal independence.
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

resid = results.resid
yhat = results.predict()
# Point 1: residuals vs. predicted values
plt.scatter(yhat, resid)
plt.axhline(0.0, color="red")
# Point 2: Q-Q plot of the residuals against a Gaussian
qq = sm.qqplot(resid, dist=stats.norm, line="45")
# Point 3: residuals in observation order (stand-in for the time index)
plt.scatter(range(len(resid)), resid)
print(durbin_watson(resid))  # Durbin-Watson test statistic
Further reading
- Wilks, D. S. (2006), “Statistical Methods in the Atmospheric Sciences”.
- Gelman et al. (2020), “Regression and Other Stories”, https://avehtari.github.io/ROS-Examples/.
- Bruce Walsh, “Chapter 3: Covariance, Regression and Correlation”, http://nitro.biosci.arizona.edu/courses/EEB581-2006/handouts/bivdistrib.pdf.
Wilks (2006). ↩︎