Linear Regression

Regression is predictive and directional, i.e., it describes how one variable varies with another, whereas correlation is symmetric and measures only the degree of association between the variables.1
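The asymmetry shows up numerically: the correlation coefficient is identical in both directions, while the slope from regressing y on x differs from the slope from regressing x on y (their product equals \(r^{2}\)). A sketch with synthetic numpy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Correlation is symmetric: corr(x, y) == corr(y, x)
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]

# Regression is directional: the two slopes differ
slope_yx = np.polyfit(x, y, 1)[0]  # regress y on x
slope_xy = np.polyfit(y, x, 1)[0]  # regress x on y
```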

A very good tutorial exists at https://computationalthinking.mit.edu/Spring21/linearmodel_datascience/.

Multiple linear regression

https://simon.cs.vt.edu/SoSci/converted/MRegression/ is a website that walks through a multiple linear regression example, replicated below. This sample data is also used in the ekman package for testing.

Lecture notes on the regression model in matrix form: http://pages.stern.nyu.edu/~gsimon/B902301Page/CLASS02_24FEB10/MatrixModel.pdf
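In matrix form the least-squares estimate reduces to the normal equations \(\hat{\beta} = (X^{T}X)^{-1}X^{T}y\). A minimal numpy sketch on synthetic data (in practice `lstsq` is preferred over explicitly forming \(X^{T}X\)):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Design matrix: a column of ones for the intercept plus two predictors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred route, same answer
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```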

Linear regression coefficients and partial correlations?
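One connection between the two: by the Frisch-Waugh-Lovell theorem, the multiple-regression coefficient on a predictor equals the slope from regressing the y-residuals on that predictor's residuals (both after removing the other predictors), and the partial correlation is the correlation of those same residuals. A sketch on synthetic data (variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

def resid(a, b):
    """Residuals of OLS regression of a on b (with intercept)."""
    B = np.column_stack([np.ones_like(b), b])
    coef, *_ = np.linalg.lstsq(B, a, rcond=None)
    return a - B @ coef

# Partial correlation of y and x1, controlling for x2
r_partial = np.corrcoef(resid(y, x2), resid(x1, x2))[0, 1]

# Multiple-regression coefficient on x1 (b[1]); same sign as r_partial
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
```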

Computer software

For Python, use statsmodels (here with a testing dataset built into the ekman package). This is multiple linear regression, where a column of ones is prepended to X so that the intercept \(b_{0}\) is estimated.

import ekman
import statsmodels.api as sm

df = ekman.dataset("sales")
y = df["sales"]
X = df[["intelligence", "extroversion"]].to_numpy()
X = sm.add_constant(X)  # prepend a column of ones for the intercept

results = sm.OLS(y, X).fit()
results.summary()

How to diagnose the quality of the fit?

Wilks (2006) says that checking the residuals is important, since they are where the assumptions of the model apply. If the residuals behave consistently with those assumptions, the fitted results are less likely to be misleading.

  1. Make a scatter plot of the residuals as a function of the predicted values \(\hat{y}\) and check whether the scatter is uniform, skewed, or contains outliers.
  2. Make a Q-Q plot of the residuals to check that they are Gaussian distributed.
  3. Make a scatter plot of the residuals in time to inspect for temporal correlation. The Durbin-Watson statistic indicates this: values near 2 suggest uncorrelated residuals, while values well below 2 suggest positive autocorrelation. Its critical values depend on the sample size and the number of predictor variables; for a sample size of 70 with 2 predictors, a value above roughly 1.7 is needed to assume temporal independence.
import matplotlib.pyplot as plt
import scipy.stats as stats

resid = results.resid
yhat = results.predict()

# Point 1: residuals versus predicted values
plt.scatter(yhat, resid)
plt.axhline(0.0, color="red")

# Point 2: Q-Q plot of the residuals against the normal distribution
qq = sm.qqplot(resid, dist=stats.norm, line="45")

# Point 3: residuals in time order; `time` is the time axis of the observations
plt.scatter(time, resid)  # also check the Durbin-Watson test statistic
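The Durbin-Watson statistic itself can be computed with statsmodels (it is also printed as part of `summary()`). A sketch on synthetic residuals, contrasting white noise with positively autocorrelated AR(1) noise:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)

# White-noise residuals: the statistic should land near 2
white = rng.normal(size=500)
dw_white = durbin_watson(white)

# AR(1) residuals with strong positive autocorrelation: well below 2
ar = np.zeros(500)
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()
dw_ar = durbin_watson(ar)
```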

Further reading


  1. Wilks (2006).