Linear Regression: A Practical Review

Regression is the foundation of most time series modelling work: trend is modelled as a function of time, seasonality is captured with indicator variables, and everything is diagnosed through residuals. Before those ideas make sense in a time series context, it helps to have the basics of regression solid. This post walks through the full workflow using the California Housing dataset: explore the data, fit a model, diagnose it, and refine it.


The Dataset

The California Housing dataset contains information from the 1990 US census, with one row per census block group. The response variable is the median house value for that block. The predictors include median income, average number of rooms, house age, population, and geographic location.

We’ll focus on the relationship between median income and median house value, one of the strongest signals in the data, and build from there.


Step 1: Explore Before You Model

Before fitting anything, look at the distributions of your variables and how they relate to each other. This means histograms and boxplots to understand each variable individually, scatter plots to visualise pairwise relationships, and a correlation matrix to get a quick numerical summary.

For the California Housing data, median income and house value have a correlation of roughly +0.69, a strong positive association. But correlation only captures linear relationships. The scatter plot often reveals things the correlation number misses: curvature, clusters, outliers, or a relationship that changes direction at some threshold. If the scatter shows clear curvature, a straight line will be systematically wrong in parts of the data, and you need to know that before you model.
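The exploration step can be sketched in a few lines. In practice you would load the real data (e.g. via scikit-learn's `fetch_california_housing`); the snippet below uses a small synthetic stand-in with the same column names so it runs self-contained, with noise that grows with income to mimic the fanning spread visible in Figure 1:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the housing data: income in units of ~$10k,
# house value in units of $100k. Synthetic, not the real dataset.
rng = np.random.default_rng(42)
income = rng.gamma(shape=2.0, scale=2.0, size=600)
value = 0.45 * income + rng.normal(0, 0.2 + 0.1 * income, size=600)
df = pd.DataFrame({"MedInc": income, "MedHouseVal": value})

print(df.describe().loc[["mean", "std"]])  # per-variable summaries
print(df.corr())                           # quick numerical summary
# In practice, pair this with df.hist() and df.plot.scatter(x="MedInc",
# y="MedHouseVal") to see what the correlation number misses.
```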

Figure 1. Median income versus median house value (California Housing, n=600). The regression line captures the overall positive direction. Notice the spread around the line grows for higher income values, a sign that something in the model assumptions will need attention.

Step 2: Fit the Model

The linear regression model expresses the response as a linear combination of predictors plus an error term:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon

In matrix notation: y = Xβ + ε

The coefficients β are estimated by ordinary least squares, minimising the sum of squared differences between observed and predicted values. Each coefficient tells you how much y is expected to change per unit increase in that predictor, holding all others fixed.
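The least squares solution has the closed form β̂ = (XᵀX)⁻¹Xᵀy. A minimal numpy sketch, using made-up coefficients to show that the procedure recovers them:

```python
import numpy as np

# Simulated data with known coefficients: intercept 2.0, slopes 0.45 and -1.2
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 0.45 * x1 - 1.2 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones(n), x1, x2])

# lstsq solves the least squares problem directly; numerically safer
# than forming and inverting X'X by hand
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [2.0, 0.45, -1.2]
```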

Three numbers to read from the output:

Coefficients: the direction and magnitude of each predictor’s effect. For median income in this dataset, the coefficient is roughly 0.45. Since house values are recorded in units of $100,000 and income in units of roughly $10,000, each additional unit of income is associated with an increase of about $45,000 in median house value, all else equal.

p-values: a measure of whether the observed relationship could plausibly be due to chance. A small p-value (conventionally below 0.05) suggests the predictor carries real information. In a dataset this size, p-values for strong predictors like income will be vanishingly small.

R²: the proportion of variance in y explained by the model. A value of 0.74 means the model accounts for 74% of the variation in house values. Adjusted R² applies a small penalty for each additional predictor added, to prevent overfitting by adding variables that contribute nothing.
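R² and its adjusted version follow directly from the residuals, so they are easy to compute by hand. A sketch on simulated data (the 0.74 figure above comes from the real dataset; the number here is whatever the simulation produces):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 2
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.8, -0.5]) + rng.normal(scale=0.6, size=n)

# Fit by least squares and collect residuals
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta

ss_res = np.sum(resid**2)               # unexplained variation
ss_tot = np.sum((y - y.mean())**2)      # total variation
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalises extra predictors
print(round(r2, 3), round(adj_r2, 3))
```

Adjusted R² is always at most R², and the gap widens as you add predictors that explain little.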


Step 3: Categorical Predictors – Dummy Variables

Regression handles categorical predictors through indicator (dummy) variables. A binary predictor like “does this property border the river?” (yes/no) becomes a variable that equals 1 for yes and 0 for no. Its coefficient gives the expected difference in y between the two groups, all else equal.

For a predictor with multiple categories, like month of year, you create one dummy variable per category, leaving one out as the reference. If January is the reference, then the coefficient for February tells you how much February differs from January on average.

This is exactly how seasonality is modelled in time series regression:

y_t = \beta_0 + \beta_1 t + \gamma_2 \cdot \mathbb{1}[\text{month}_t = \text{Feb}] + \gamma_3 \cdot \mathbb{1}[\text{month}_t = \text{Mar}] + \cdots + \gamma_{12} \cdot \mathbb{1}[\text{month}_t = \text{Dec}] + \epsilon_t

Each γ coefficient captures the average effect of that month relative to the reference month. The regression doesn’t “know” these are months; it just sees a set of 0/1 predictors, and the seasonal pattern emerges from the estimated coefficients.
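In pandas, this encoding is one call: `get_dummies` with `drop_first=True` leaves the first category out as the reference. A minimal sketch with three months:

```python
import pandas as pd

# An explicit category order ensures January is the one dropped as reference
months = pd.Series(
    ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    dtype=pd.CategoricalDtype(["Jan", "Feb", "Mar"], ordered=True),
)

# drop_first=True removes the January column: it becomes the reference level
dummies = pd.get_dummies(months, prefix="month", drop_first=True)
print(dummies.columns.tolist())  # ['month_Feb', 'month_Mar']
print(dummies.head(3))
```

A row of all zeros means January; the fitted coefficients on `month_Feb` and `month_Mar` are the γ terms in the equation above.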


Step 4: Model Diagnostics

Fitting a model is only half the job. The other half is checking whether the model’s assumptions are actually met. There are four standard diagnostic plots, each targeting a different assumption.

Figure 2. Four diagnostic plots for the regression of house value on median income. Each plot checks a different assumption. The patterns visible here (fanning residuals, tail deviations in the Q-Q plot, and an upward slope in the scale-location plot) all point to heteroscedasticity and mild non-normality.

Plot 1 – Residuals vs Fitted

Plots the residuals (observed minus predicted) against the fitted values. If the model assumptions hold, you should see a flat horizontal band of points scattered randomly around zero, with no pattern. Two specific patterns are worth watching for. A curved shape means the model is missing a nonlinear component: the relationship isn’t actually linear. A funnel or fan shape means the variance of the errors is not constant: the model is more wrong in some regions than others. In the California Housing example, you can see the spread widen as the fitted value grows.

Plot 2 – Q-Q Plot

Tests whether the residuals follow a normal distribution. The x-axis shows the quantiles you’d expect from a normal distribution; the y-axis shows the actual quantiles of your residuals. If the normality assumption holds, all points should fall close to the diagonal line. Deviations at the tails (points curving away from the line at the extremes) indicate that the residual distribution has heavier tails than normal. This matters for inference: confidence intervals and p-values are based on the normality assumption, so if it’s badly violated, they can’t be trusted. For pure prediction tasks it matters less.
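The Q-Q comparison can be computed without plotting. One convenient tool is `scipy.stats.probplot`, which returns the quantile pairs plus a least-squares fit; the fit’s r value measures how tightly the points hug the diagonal. A sketch comparing normal residuals against deliberately heavy-tailed ones:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_resid = rng.normal(size=1000)
heavy_resid = rng.standard_t(df=3, size=1000)  # heavier tails than normal

# probplot returns ((theoretical, ordered), (slope, intercept, r));
# r close to 1 means the points lie on the diagonal
_, (slope, intercept, r_normal) = stats.probplot(normal_resid, dist="norm")
_, (_, _, r_heavy) = stats.probplot(heavy_resid, dist="norm")
print(round(r_normal, 4), round(r_heavy, 4))  # heavy tails give the lower r
```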

Plot 3 – Scale-Location

Also called the spread-location plot, it shows the square root of the absolute residuals against the fitted values. It serves the same purpose as the residuals vs fitted plot for detecting non-constant variance, but the trend is easier to spot because all the values are positive. A flat horizontal line means the variance is constant (homoscedastic). An upward slope means the variance grows with the fitted value, which is what you see in the California data.

Plot 4 – Residual Distribution

A histogram of residuals, which should be roughly symmetric and bell-shaped, centred at zero. Strong skew or long tails in one direction are a warning sign for the normality assumption.

In time series, there is a fifth check: the ACF of residuals. Even if all four plots above look fine, if consecutive residuals are correlated, which is almost always the case in raw time series regression, the independence assumption is violated. The ACF plot of residuals directly tests this. Bars outside the confidence bands at low lags mean the model isn’t capturing the temporal structure. This is precisely why ARIMA models become necessary when regression alone isn’t enough.
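The sample ACF at lag k is just the correlation of the series with itself shifted by k, so it fits in a few lines of numpy. A sketch comparing independent residuals with autocorrelated ones (an AR(1) process with a made-up coefficient of 0.7):

```python
import numpy as np

def acf(x, nlags=10):
    """Sample autocorrelation of a 1-D series at lags 1..nlags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, nlags + 1)])

rng = np.random.default_rng(3)
white = rng.normal(size=500)   # independent residuals

ar = np.zeros(500)             # autocorrelated residuals, AR(1) with phi = 0.7
for t in range(1, 500):
    ar[t] = 0.7 * ar[t - 1] + rng.normal()

# Rough 95% confidence band for white noise: about +/- 2/sqrt(n)
band = 2 / np.sqrt(500)
print(acf(white, 1)[0])  # small: usually inside the band
print(acf(ar, 1)[0])     # near 0.7: the lag-1 bar escapes the band
```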


Step 5: Nonlinear Extensions

A common misconception is that “linear” regression means the relationship must be a straight line. It doesn’t. Linear regression is linear in the parameters, meaning the ฮฒ coefficients enter the equation linearly, but the predictors themselves can be transformed in any way.

Figure 3. Three versions of the incomeโ€“house value model: linear (a), polynomial degree 2 (b), and log-transformed predictor (c). The polynomial version captures the flattening relationship at high income levels that the straight line misses, with modest improvements in Rยฒ.

Polynomial terms:

Adding xยฒ to the model allows the fitted curve to bend. The model becomes:

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon

This is still a linear regression model. β₁ and β₂ are estimated linearly, but the predicted values trace a curve rather than a line. In the California data, adding a quadratic income term improves R² slightly, because the relationship flattens at high income levels rather than continuing to rise at a constant rate.

Log transformation of the predictor:

If the relationship looks like it grows quickly at first and then flattens, which is common for income, population, and other right-skewed variables, replacing x with log(x) often captures this well. The model becomes:

y = \beta_0 + \beta_1 \log(x) + \epsilon

A one-unit change in log(x) corresponds to a proportional change in x, so the model effectively says “a doubling of income is associated with a fixed increase in house value” rather than “each additional dollar of income adds the same amount.” This is often a better description of how income affects prices.
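The “doubling” reading follows because log(2x) − log(x) = log 2 regardless of x, so doubling the predictor always adds β₁·log 2 to the prediction. A sketch with a made-up true coefficient of 1.5:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(1, 20, size=400)   # e.g. income: strictly positive
y = 2.0 + 1.5 * np.log(x) + rng.normal(scale=0.3, size=400)

# Fit y on log(x) by least squares
X = np.column_stack([np.ones_like(x), np.log(x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Doubling x adds the same fixed amount to y at every income level
effect_of_doubling = beta[1] * np.log(2)
print(round(effect_of_doubling, 2))  # close to 1.5 * ln 2, about 1.04
```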

Log transformation of the response:

When the response variable is right-skewed or when the variance of residuals grows with the fitted value, log-transforming y often helps. This is the case in the California data. After taking log(house value), the spread of residuals becomes more uniform and the model assumptions are better satisfied. The trade-off is that you’re now modelling log(y), so the coefficients must be interpreted on the log scale. A coefficient of 0.1 means a one-unit increase in x is associated with roughly a 10% increase in y: a multiplicative effect rather than an additive one.
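The “10%” reading is itself an approximation: a coefficient b on the log scale multiplies y by exp(b), and exp(b) ≈ 1 + b only for small b. The exact figure:

```python
import math

# With log(y) as the response, a coefficient b means a one-unit
# increase in x multiplies y by exp(b)
b = 0.1
exact_pct = math.exp(b) - 1
print(round(exact_pct * 100, 2))  # 10.52: slightly above the quick "10%" reading
```

For coefficients much larger than about 0.2 the gap becomes material, and the exact exp(b) − 1 form should be reported.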

These same transformations carry directly into time series regression. Polynomial time trends model acceleration or deceleration in growth. Log transformations stabilise the variance of a multiplicative series. The machinery is identical.


The Core Principle

The recurring theme across all of this is residual analysis. The residuals are what the model didn’t explain. If they show a pattern, a trend, a seasonal shape, or clustering of large errors, the model is incomplete, and the pattern tells you what’s missing. This principle drives the entire modelling workflow:

Start simple, check the residuals, add what they reveal is missing, check again. In time series, if the residuals show a trend, model it. If they show seasonality, add seasonal indicators. If they show autocorrelation, move to ARIMA.


The next post goes into decomposition mechanics directly: how moving averages work, how classical decomposition separates a series into its components step by step, and what happens when the seasonal pattern isn’t fixed over time.
