Time Series with Machine Learning (Part 8)

Part 8: Feature Importance

The model has been trained. The next natural question is: out of the 142 features that went in, which ones actually drove the predictions? Tree-based models like LightGBM give a direct answer to this question through feature importance scores. This post explains what those scores mean, why there are two different ways to compute them, and what the results tell us about how the model learned to forecast demand.

What Is Feature Importance?

Every time a decision tree splits a node, it chooses a feature and a threshold that best separate the data. That choice can be evaluated in two ways: how often was this feature chosen, and how much did it actually improve the predictions each time it was used? These two questions give rise to two different importance metrics.

Gain — Quality of Each Split

Gain measures how much a feature reduces prediction error each time it is used for a split. Before the split, a node contains a mixed group of observations with some variance in the target. After the split, the two child nodes should be more homogeneous, closer to their own averages. The difference in error before and after is the gain from that split.

Figure 1. A node containing mixed low and high sales values (MSE = 449) is split into two purer groups (weighted MSE = 43). The gain from this split is 406, the amount of error eliminated.
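The arithmetic behind a split's gain can be sketched in a few lines. The sales values below are made up for illustration (they are not the figure's data), but the calculation is the same: the gain is the parent node's MSE minus the size-weighted MSE of its children.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_gain(left, right):
    """Error removed by splitting a parent node into two child nodes."""
    parent = left + right
    weighted_child_mse = (
        len(left) * mse(left) + len(right) * mse(right)
    ) / len(parent)
    return mse(parent) - weighted_child_mse

# Hypothetical daily sales: a mix of low-demand and high-demand days.
low_days = [10, 12, 11]
high_days = [48, 52, 50]

gain = split_gain(low_days, high_days)
print(round(gain, 2))  # → 380.25: almost all of the parent's error is removed
```

A split that separates the low days from the high days leaves each child tightly clustered around its own mean, so nearly all of the parent's variance is eliminated.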

Gain is accumulated across every split that uses a given feature, across every tree in the ensemble. A feature with high gain consistently created clean, meaningful splits. It genuinely helped the model. In the code, gain is normalised to percentages so the values sum to 100 across all features.

gain (%) = 100 × feature_gain / total_gain_all_features
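That normalisation is a one-liner. The raw gain totals below are made-up numbers standing in for the values LightGBM accumulates across all trees:

```python
# Hypothetical raw gain accumulated per feature across every tree.
raw_gain = {
    "sales_roll_mean_546": 5440.0,
    "sales_lag_364": 1320.0,
    "sales_roll_mean_365": 980.0,
    "month_12": 120.0,
}

total = sum(raw_gain.values())
gain_pct = {feat: 100 * g / total for feat, g in raw_gain.items()}
# The percentages now sum to 100 across all features.
```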

Split — Frequency of Use

Split count simply counts how many times a feature was chosen as the splitting variable anywhere across all trees. A feature with a high split count is one the model reached for often, but that does not necessarily mean each individual use had a large impact. A feature could be used thousands of times for small, incremental refinements, or a handful of times for large, decisive separations.

This is the key distinction between the two metrics. Split tells you how frequently a feature was consulted. Gain tells you how much it contributed per consultation. Both are useful, but they can tell very different stories about the same feature.

What the Model Found Important

Figure 2. Top 25 features ranked by gain. Rolling mean features (green), lag features (blue), EWM features (purple), and calendar features (orange) are colour-coded by family.

The results are telling. Three features dominate the top of the ranking: sales_roll_mean_546 (54.4% of total gain), sales_lag_364 (13.2%), and sales_roll_mean_365 (9.8%). Together they account for over three quarters of all gain in the model. Everything else — the remaining 139 features — contributes the remaining quarter.

This makes intuitive sense. The 546-day rolling mean captures the smoothed average of the past 18 months, a stable, noise-free summary of each store-item’s typical demand level. The 364-day lag brings in the actual sales figure from exactly one year ago, which is the single most comparable past reference point for a seasonal series. The 365-day rolling mean adds a slightly different perspective on the same annual window. Together, these three features give the model the level and the annual cycle of demand, which is most of what matters for a quarterly forecast.

Further down the ranking, EWM features and calendar features contribute small but non-zero amounts. Month 12 (December) and day_of_week_0 (Monday) appear, suggesting the model picked up end-of-year and start-of-week patterns. The is_wknd flag also makes the ranking, reflecting the weekend sales rhythm visible in the raw data. Many item and store OHE columns, by contrast, scored zero importance; the model found them unhelpful.

The Split vs Gain Discrepancy

Figure 3. Each point is a feature, plotted by split count (x-axis) and gain (y-axis). sales_lag_364 has the highest split count (1251) but lower gain than sales_roll_mean_546 (936 splits, 54% gain). High frequency of use does not equal high impact per use.

The scatter plot reveals an important contrast. sales_lag_364 has the highest split count of all features — 1,251 times — but its gain is 13.2%, well below sales_roll_mean_546’s 54.4% on only 936 splits. The lag feature is consulted constantly, making many small refinements. The rolling mean is used slightly less often, but when it is used, each split removes a much larger share of the error.
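Dividing gain by split count makes the contrast explicit. Using the figures quoted above, the average impact per consultation differs by roughly a factor of five:

```python
# Gain share (%) and split counts reported for the two headline features.
features = {
    "sales_roll_mean_546": {"gain_pct": 54.4, "splits": 936},
    "sales_lag_364": {"gain_pct": 13.2, "splits": 1251},
}

# Average share of total gain contributed each time the feature is used.
gain_per_split = {
    name: f["gain_pct"] / f["splits"] for name, f in features.items()
}
# The rolling mean removes far more error per consultation,
# even though the lag feature is consulted more often.
```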

This distinction matters when deciding which features to keep. A feature with high split and near-zero gain is being used as a tiebreaker — the model reaches for it when no better option is available, but it contributes almost nothing. Features with high gain but moderate split are the load-bearing ones. When in doubt, gain is the more informative metric.

A Combined Importance Score

Neither metric alone is perfect. Gain can be dominated by a single spectacular split in one tree. Split can inflate the importance of features that are used as weak fallbacks. A more robust view combines both: normalise each metric to a 0–1 scale, then multiply them together. A feature scores high only if it is both used frequently and impactful per use. Features that are merely frequent or merely impactful score lower.

  • split_norm  =  (split − min) / (max − min)
  • gain_norm   =  (gain  − min) / (max − min)
  • combined    =  split_norm × gain_norm
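A minimal sketch of the combined score, using a handful of hypothetical (split, gain) rows — the item_17 column is a made-up OHE feature standing in for the near-zero tail:

```python
def min_max(values):
    """Scale a list of numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical split counts and gain percentages per feature.
names  = ["sales_roll_mean_546", "sales_lag_364", "sales_roll_mean_365", "item_17"]
splits = [936, 1251, 400, 50]
gains  = [54.4, 13.2, 9.8, 0.5]

combined = [s * g for s, g in zip(min_max(splits), min_max(gains))]
ranking = sorted(zip(names, combined), key=lambda p: p[1], reverse=True)
# A feature ranks high only if it is both frequently used and impactful per use.
```

Because the two normalised scores are multiplied, a feature that is strong on only one axis (like the frequently-used but low-gain tail) is pushed down the ranking.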
Figure 4. The same features ranked three ways: by split count, by gain, and by the combined normalised score. The top three features are consistent across all three rankings — confirming their dominance is not an artefact of the metric choice.

The combined ranking confirms what gain already suggested: sales_roll_mean_546, sales_lag_364, and sales_roll_mean_365 are the three most important features by any measure. Below them, the ranking shifts slightly depending on which metric you use, but the broad picture is stable. Long-horizon temporal averages and year-ago lags dominate; calendar features and EWM features play a supporting role; most OHE columns contribute nothing at all.

Features with Zero Importance

A notable portion of the 142 features received a gain score of exactly zero: the model never used them for a split that reduced error. These are mostly individual item and store one-hot columns. The model apparently found that the temporal features already captured enough of the store-item-specific patterns, leaving the identity flags redundant.

Zero-importance features can safely be dropped before the final model refit. They add computation without contributing signal. The code filters them out by keeping only features with gain > 0, reducing the feature matrix from 142 columns to a smaller, cleaner set.

importance_zero = feat_imp[feat_imp['gain'] == 0]['feature'].values  # features never used for an error-reducing split

imp_feats = [col for col in cols if col not in importance_zero]  # keep only features with gain > 0

This pruned feature list will be used for the final model refit on the full dataset — train and validation combined — before generating predictions on the test period. Fewer redundant inputs mean faster training and, in some cases, slightly better generalisation.
