Time Series with Machine Learning (Part 6)

Part 6: Measuring the Right Thing, Splitting the Right Way

The Cost Function and the Validation Split

Before training a model, two decisions need to be locked in. First, what does “good performance” actually mean? Which metric will be used to judge the model? Second, how will the available data be split so that the model can be evaluated honestly before it is deployed on the real test period? Both choices have consequences, and in a time series context, both require more care than the standard defaults.

The Custom Cost Function

The standard regression metrics (MAE, MSE, RMSE) measure error in absolute units. If a model is off by 5 units, MAE reports 5, regardless of whether the actual sales figure is 10 or 1,000. That is fine when all series are on the same scale, but this dataset has 500 store-item combinations whose average daily sales range from about 15 to 85. An error of 5 units matters much more for a low-volume item than a high-volume one. A scale-sensitive metric would systematically favor high-volume series.

From MAE to MAPE to SMAPE

MAPE (mean absolute percentage error) solves the scale problem by expressing error as a fraction of the actual value. An error of 5 units on an actual of 50 is 10%; an error of 5 units on an actual of 500 is 1%. MAPE treats them proportionally. However, MAPE has an asymmetry problem: over-forecasting and under-forecasting by the same amount produce different MAPE values. If actual = 10 and predicted = 20, MAPE is 100%. If actual = 20 and predicted = 10, MAPE is 50%. The same absolute gap scores differently depending on direction.
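The asymmetry is easy to demonstrate numerically. A minimal sketch (the function name `mape` is illustrative):

```python
import numpy as np

def mape(actual, forecast):
    # Mean absolute percentage error, as a percentage of the actual value.
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs(forecast - actual) / np.abs(actual)) * 100

print(mape([10], [20]))  # 100.0 — over-forecasting by 10 units
print(mape([20], [10]))  # 50.0  — the same 10-unit gap, under-forecasting
```

The same absolute error of 10 units scores twice as badly in one direction as the other, purely because the denominator changes.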

SMAPE (Symmetric Mean Absolute Percentage Error) fixes this by replacing the actual value in the denominator with the average of the actual and the predicted. This makes the metric symmetric: over-forecasting and under-forecasting by the same amount now produce the same SMAPE value. The formula is:

SMAPE = (100 / n) · Σ  |Fₜ − Aₜ| / ((|Aₜ| + |Fₜ|) / 2)

where Aₜ is the actual value and Fₜ is the forecast. The denominator is the average of their absolute values. SMAPE ranges from 0% (perfect) to 200% (maximally wrong). Lower is always better.
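The formula translates directly into a few lines of NumPy. A minimal sketch (the function name `smape` is illustrative):

```python
import numpy as np

def smape(actual, forecast):
    """SMAPE in percent: 0 is perfect, 200 is maximally wrong."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    numerator = np.abs(forecast - actual)
    denominator = (np.abs(actual) + np.abs(forecast)) / 2
    return np.mean(numerator / denominator) * 100

print(round(smape([47], [55]), 1))  # 15.7
# Symmetric: swapping actual and forecast gives the same score.
print(round(smape([10], [20]), 2), round(smape([20], [10]), 2))  # 66.67 66.67
```

Note that swapping actual and forecast leaves the score unchanged, which is exactly the property MAPE lacks.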

Figure 1. All three cases have the same absolute error of 2 units. MAE is blind to scale. MAPE penalises small-actual cases harshly (Case C = 40%). SMAPE is more balanced (Case C = 33.3%) because the denominator includes the prediction.

Walking Through the Formula

A concrete example makes it immediate. Actual sales = 47 units, predicted = 55 units:

  1. Numerator:    |F − A| = |55 − 47| = 8
  2. Denominator:  (|A| + |F|) / 2 = (47 + 55) / 2 = 51
  3. Ratio:        8 / 51 = 0.157
  4. Percentage:   0.157 × 100 = 15.7%

SMAPE for this single prediction is 15.7%, a meaningful, interpretable number: the model was off by about one-sixth of the combined scale of actual and predicted.

Two Roles for SMAPE

SMAPE serves two purposes here. First, it is the evaluation metric: after training, predictions are compared to actuals using SMAPE to get a single interpretable score. Second, it is used as a custom objective inside LightGBM during training. LightGBM’s default objective is MSE, but passing a custom function tells the model to optimise directly for SMAPE. This matters because a model optimised for MSE may not minimise SMAPE; the two metrics respond differently to large errors. Optimising for what you actually care about typically produces better-calibrated predictions.

One implementation detail: the target variable has been log-transformed, so the model’s predictions come out on the log scale. The SMAPE function needs actual sales units, so before computing any error, both predictions and labels are inverted with expm1, the reverse of log1p.

Time-Based Validation

Standard cross-validation randomly shuffles rows into folds. For most tabular problems, this approach is fine. For time series, it is not: future observations would end up in the training fold, giving the model information it could not have had in real deployment. The result is an optimistically biased performance estimate that does not reflect how the model would actually behave on new dates.

The Split

The solution is a time-ordered split: all training data comes before all validation data, with no overlap. The training set covers 2013-01-01 through 2016-12-31, four full years across all 500 store-item series. The validation set covers 2017-01-01 through 2017-03-31, exactly three months.
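The split reduces to two boolean date masks in pandas. A minimal sketch, using a one-row-per-day stand-in for the real dataframe (which has one row per store-item-day):

```python
import pandas as pd

# Stand-in for the real dataframe built in earlier parts of the series.
df = pd.DataFrame({"date": pd.date_range("2013-01-01", "2017-12-31", freq="D")})

train_mask = df["date"] <= "2016-12-31"
val_mask = (df["date"] >= "2017-01-01") & (df["date"] <= "2017-03-31")
# Apr-Dec 2017 falls in neither mask and is simply dropped.

train, val = df[train_mask], df[val_mask]
print(len(train), len(val))  # 1461 training days, 90 validation days
```

Because the masks are defined by date rather than by row position, the split stays correct regardless of how the rows are ordered in the dataframe.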

This three-month validation window is not arbitrary. The competition asks for three-month-ahead forecasts on 2018 data. By validating on the first three months of 2017, the setup mirrors the real task as closely as possible: the model sees all data up to a cutoff and then predicts the following quarter. The remaining nine months of 2017 (April through December) are not used. They sit between the validation window and the test period and would introduce a mismatch if included.

Figure 2. The dataset split across time. The model trains on 2013–2016, is evaluated on Jan–Mar 2017, and eventually predicts Jan–Mar 2018. Apr–Dec 2017 is discarded to keep the validation scenario consistent with the test task.

Feature Columns

Not all 146 columns in the dataframe are used as model inputs. Four are excluded for distinct reasons, each worth spelling out.

date is a raw timestamp. The calendar information it contains has already been extracted into the date feature columns: month, day_of_week, week_of_year, and so on. Passing the raw timestamp in addition would add no new signal and could cause issues depending on how the model handles datetime types.

id is a row identifier that exists only in the test set. It was introduced by Kaggle to tag each test row for submission. It carries zero predictive information and is NaN for all training rows, so including it would contribute only noise.

sales is the target variable, the value the model is supposed to predict. It cannot be an input to predict itself; that would be a direct leak of the answer.

year is excluded because it would give the model a direct signal about which calendar year it is predicting. The model has never seen 2018 during training. If year is a feature, the model encounters an out-of-range value at prediction time and has no basis for extrapolating. More subtly, including the year encourages the model to learn year-specific patterns (2016 looked like this, 2017 looked like that) rather than reusable seasonal structure that transfers to 2018. The trend component is already captured by the lag and rolling mean features; year as an explicit integer would be redundant at best and misleading at worst.
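The exclusion itself is a one-liner over the column list. A toy sketch with illustrative column names standing in for the real 146-column dataframe:

```python
import pandas as pd

# Toy stand-in for the real dataframe; column names are illustrative.
df = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=3, freq="D"),
    "id": [None, None, None],          # only populated in the test set
    "sales": [12.0, 15.0, 9.0],        # the target
    "year": [2013, 2013, 2013],
    "month": [1, 1, 1],
    "day_of_week": [1, 2, 3],
    "sales_lag_7": [10.0, 11.0, 13.0],
})

EXCLUDE = ["date", "id", "sales", "year"]  # the four columns discussed above
feature_cols = [c for c in df.columns if c not in EXCLUDE]

X = df[feature_cols]
y = df["sales"]
print(feature_cols)  # ['month', 'day_of_week', 'sales_lag_7']
```

Building the feature list by exclusion rather than enumeration means newly engineered columns are picked up automatically.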

Figure 3. Shapes of the four arrays passed to the model. The training set is about 16 times larger than the validation set, consistent with a 4-year vs 3-month time split.

With X_train (730,500 × 142), Y_train (730,500,), X_val (45,000 × 142), and Y_val (45,000,) in hand, the model is ready to train. The SMAPE cost function is defined. The validation set mirrors the real evaluation scenario. The next step is fitting LightGBM, monitoring SMAPE on the validation set across iterations, and reading off the final score.
