Time Series with Machine Learning (Part 7)

Part 7: Training LightGBM; Gradient Boosting, Hyperparameters, and Early Stopping

The feature set is ready. The validation split is in place. The next step is fitting the model, and understanding what is actually happening during that fit. LightGBM is a gradient boosted tree framework, which means the training process is iterative: it builds one tree at a time, each one correcting what the previous trees got wrong. This post walks through how that works, what each hyperparameter controls, and what the training log is telling you.

How Gradient Boosting Works

A single decision tree is a weak learner on its own; it can capture broad patterns but leaves a lot of residual error. Gradient boosting turns this weakness into a strength by building trees sequentially, where each new tree is trained specifically to correct the mistakes left by the current ensemble.

The process is straightforward. The first tree makes a rough prediction. The residuals, the gaps between predictions and actuals, are computed. The second tree is trained not on the original target but on those residuals. Its predictions are then added to the ensemble, reducing the error. A third tree fits the new residuals, and so on. After enough iterations, the accumulated ensemble of shallow trees collectively approximates a much more complex function than any single tree could.
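The loop described above can be sketched in a few lines of NumPy. This is a toy illustration, not LightGBM's implementation: it assumes squared-error loss, uses one-split "stumps" instead of real trees, and starts from the mean of the target rather than a first tree. The helper names (fit_stump, boost) and the synthetic sine series are made up for the example.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = x <= threshold
        left_value = residuals[left].mean()
        right_value = residuals[~left].mean()
        fitted = np.where(left, left_value, right_value)
        error = ((residuals - fitted) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda z: np.where(z <= threshold, left_value, right_value)

def boost(x, y, n_trees, learning_rate):
    """Run the boosting loop and return the training RMSE after each round."""
    prediction = np.full_like(y, y.mean())  # starting point: predict the mean
    rmse_history = []
    for _ in range(n_trees):
        residuals = y - prediction           # what the ensemble still gets wrong
        stump = fit_stump(x, residuals)      # new learner is trained on the residuals
        prediction = prediction + learning_rate * stump(x)
        rmse_history.append(float(np.sqrt(np.mean((y - prediction) ** 2))))
    return rmse_history

# Toy series (an assumption for illustration): a noisy sine wave.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

history = boost(x, y, n_trees=10, learning_rate=0.5)
print([round(v, 3) for v in history])
```

Each round can only lower (or leave unchanged) the training RMSE here, because the stump's leaf values are the mean residuals in each region and the step is scaled by a learning rate between 0 and 1.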

Figure 1. The boosting process on a toy series. Each panel adds more trees to the ensemble. Red lines show the remaining residuals, which shrink with each additional tree. RMSE drops from 2.64 (1 tree) to 1.41 (10 trees).

Two parameters directly control this process. num_boost_round sets how many trees to build in total. learning_rate (also called shrinkage) scales each tree’s contribution before adding it to the ensemble. Smaller values mean each tree steps more cautiously, which generally produces a more robust model at the cost of needing more iterations to converge. Pairing a learning rate of 0.02 with only 1,000 iterations is a conservative setting; in practice, 10,000 to 15,000 iterations are common at this learning rate.
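Written as an update rule, and assuming squared-error loss so that "mistakes" are plain residuals, the two parameters appear as follows. The symbols are notation chosen for this sketch: F_m is the ensemble after m trees, h_m is the tree fitted in round m, η is learning_rate, and M is num_boost_round.

```latex
\begin{aligned}
r_i^{(m)} &= y_i - F_{m-1}(x_i)
  && \text{residuals left by the current ensemble} \\
F_m(x) &= F_{m-1}(x) + \eta \, h_m(x)
  && \text{add the tree fitted to } r^{(m)}\text{, scaled by } \eta
\end{aligned}
\qquad m = 1, \dots, M
```

A smaller η means each h_m moves the ensemble less, so M has to grow for the sum to reach the same fit.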

The Hyperparameters

LightGBM exposes many parameters. The eight used here were selected through grid search, a systematic sweep over candidate combinations, and then verified by hand. Most of the chosen values sit close to the library defaults, which is often a reasonable starting point for a well-engineered feature set.

Figure 2. The eight LightGBM hyperparameters used, their values, and what each one controls.

Tree Structure Parameters

num_leaves controls how many terminal nodes each tree can have. More leaves mean a more complex tree that can capture finer patterns, but also one that is more prone to memorizing training data. With 10 leaves, each tree stays deliberately simple. max_depth sets a hard ceiling on tree depth, working alongside num_leaves to cap complexity from two angles.

feature_fraction tells LightGBM to consider only a random 80% of features at each boosting iteration. This is the same idea as Random Forest’s random subspace method: by not giving every tree access to every feature, the ensemble becomes more diverse and generalizes better. Each tree is slightly weaker, but the collection is more robust.
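To make the idea concrete, here is a small sketch of drawing a fresh 80% feature subset for each tree. It only illustrates the sampling concept; the helper name sample_features and the feature names are invented, and LightGBM's internal sampling and rounding may differ.

```python
import numpy as np

def sample_features(feature_names, fraction, rng):
    """Pick a random subset of features, roughly `fraction` of the total."""
    n_keep = max(1, int(round(fraction * len(feature_names))))
    chosen = rng.choice(len(feature_names), size=n_keep, replace=False)
    return [feature_names[i] for i in sorted(chosen)]

rng = np.random.default_rng(42)
all_features = [f"f{i}" for i in range(10)]  # hypothetical feature names

# Each "tree" sees a different random subset of 8 of the 10 features.
subsets = [sample_features(all_features, 0.8, rng) for _ in range(3)]
for tree_index, subset in enumerate(subsets, start=1):
    print(tree_index, subset)
```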

Iteration and Stopping Parameters

num_boost_round is the most important single parameter in LightGBM. It determines how many trees are built or, equivalently, how many gradient descent steps are taken. More iterations mean more refinement, but also more risk of overfitting if the model starts fitting noise rather than signal. The value of 1,000 here is deliberately conservative; the training results below show why it needs to be increased.

early_stopping_rounds is the mechanism that makes a high num_boost_round safe. At each iteration, LightGBM evaluates performance on the validation set. If the validation metric has not improved for 200 consecutive iterations, training stops, regardless of what num_boost_round says. The model reverts to the iteration with the best validation score. This prevents the model from continuing to overfit after its peak generalisation point has passed, and it also saves training time.
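The stopping rule itself is simple enough to sketch in plain Python. This is a simplified stand-in for the logic, not LightGBM's code: it assumes a single lower-is-better validation metric, and the function name and score lists are invented for illustration.

```python
def early_stopping_point(val_scores, patience):
    """Return (best_iteration, stopped_at), both 1-based, for a lower-is-better metric."""
    best_score = float("inf")
    best_iteration = 0
    for iteration, score in enumerate(val_scores, start=1):
        if score < best_score:
            best_score = score
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` rounds: stop and keep the best round.
            return best_iteration, iteration
    # Ran out of rounds before the patience window was exhausted.
    return best_iteration, len(val_scores)

# Validation error bottoms out at round 3, then drifts upward.
print(early_stopping_point([5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.4], patience=3))  # (3, 6)

# Still improving at the last round: early stopping never fires.
print(early_stopping_point([5.0, 4.0, 3.0, 2.0], patience=3))  # (4, 4)
```

In the first case training halts at round 6 and falls back to round 3, so the later score of 3.4 is never seen. That is the price of a finite patience window, and the reason the post uses a generous value of 200.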

nthread = -1 tells LightGBM to use every CPU core available. This has no effect on the model quality, only on training speed.
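Collected in one place, the values mentioned so far would look something like this as a parameter dictionary. Only values stated in the text are filled in. max_depth is part of the configuration but its value is not given in the text, so it is left out here rather than guessed, and the dictionary name lgb_params is an assumption.

```python
# Parameter values as stated in this post; anything not stated is omitted.
lgb_params = {
    "num_leaves": 10,              # terminal nodes per tree
    "learning_rate": 0.02,         # shrinkage applied to each tree
    "feature_fraction": 0.8,       # share of features sampled per tree
    "num_boost_round": 1000,       # maximum number of trees
    "early_stopping_rounds": 200,  # patience on the validation metric
    "nthread": -1,                 # use all available CPU cores
    # "max_depth": ...             # used in the post, value not stated in the text
}
```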

LightGBM’s Own Data Format

LightGBM ships with its own internal data structure called lgb.Dataset. Rather than passing a raw NumPy array or pandas DataFrame directly to the training function, you first wrap the data in this format. The library can then preprocess and bin the features internally in a way that makes training significantly faster — particularly on large datasets like this one with 730,000 training rows.

lgbtrain = lgb.Dataset(data=X_train, label=Y_train, feature_name=cols)

lgbval   = lgb.Dataset(data=X_val,   label=Y_val,   reference=lgbtrain, feature_name=cols)

The reference=lgbtrain argument on the validation dataset is important: it tells LightGBM to use the same binning scheme for the validation set as for the training set, ensuring the two datasets are treated consistently. Without it, LightGBM might bin features differently, making the validation evaluation unreliable.

Training is then called with lgb.train rather than the sklearn-style .fit. Both work, but the native lgb.train interface gives full access to callbacks, custom evaluation functions, and the training log format shown below.

Early Stopping in Practice

Once training starts, LightGBM prints a log line every 100 iterations (controlled by lgb.log_evaluation(100)). Each line reports four values: training L2 loss, training SMAPE, validation L2 loss, and validation SMAPE. Reading this log is the primary way to understand what the model is doing.

Figure 3. Training and validation SMAPE across 1,000 iterations. Both curves are still declining at iteration 1,000, so early stopping was never triggered. This signals that num_boost_round should be increased substantially.

The training log for this run ends with the message: “Did not meet early stopping. Best iteration is: [1000].” This is a direct signal that 1,000 iterations was not enough. The validation SMAPE was still improving at the final step and had not plateaued, so early stopping had no reason to intervene. Setting num_boost_round to 10,000 would allow the model to keep improving and let early stopping find the natural optimum.

The generalisation gap, the difference between training and validation SMAPE, is small and roughly stable. Training SMAPE reaches 13.35%; validation reaches 13.83%. A gap of about half a percentage point across 730,000 training rows and 45,000 validation rows suggests the model is not meaningfully overfitting at this iteration count. The main problem is simply that it has not converged yet.

Figure 4. L2 loss and SMAPE plotted at each 100-iteration checkpoint. Both metrics improve monotonically — confirming that the model would benefit from more iterations.

Predicting on the Validation Set

After training, the model’s best iteration is used to generate predictions on the validation set features. The argument num_iteration=model.best_iteration ensures the prediction uses the ensemble at its peak validation performance, not necessarily the final iteration.

y_pred_val = model.predict(X_val, num_iteration=model.best_iteration)

These predictions come out in log-scale, because the target variable was transformed with log1p before training. To evaluate them against real sales figures, both predictions and actuals must be inverted with expm1 — the exact reverse of log1p. Applying expm1 to a log1p-transformed value recovers the original units: expm1(log1p(x)) = x.

smape( np.expm1(y_pred_val),  np.expm1(Y_val) )  →  13.828%
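For readers who want to check the round trip themselves, here is a small sketch. The smape function below uses one common definition (200 times the mean of |prediction − actual| divided by |prediction| + |actual|, skipping pairs where both are zero). That definition is an assumption and may differ in detail from the one used to produce the 13.828% figure. The sample numbers are invented.

```python
import numpy as np

def smape(preds, target):
    """Symmetric MAPE in percent; pairs where both values are 0 are skipped."""
    preds = np.asarray(preds, dtype=float)
    target = np.asarray(target, dtype=float)
    denom = np.abs(preds) + np.abs(target)
    mask = denom != 0
    return 200.0 * np.mean(np.abs(preds[mask] - target[mask]) / denom[mask])

# log1p and expm1 are exact inverses (up to floating-point error).
sales = np.array([0.0, 3.0, 12.0, 250.0])  # made-up sales figures
log_sales = np.log1p(sales)
recovered = np.expm1(log_sales)
print(np.allclose(recovered, sales))

# Score made-up log-scale predictions in original units, not in log space.
log_preds = log_sales + np.array([0.0, 0.1, -0.05, 0.02])
score = smape(np.expm1(log_preds), np.expm1(log_sales))
print(round(float(score), 3))
```

Evaluating in original units matters because an error of 0.1 in log space is a much larger absolute miss for an item selling 250 units than for one selling 3.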

The validation SMAPE is 13.83%. In practical terms, the model’s predictions are off by about 14% on a symmetric basis across 45,000 daily store-item observations. For a demand forecasting problem with 500 distinct series, each with its own scale and seasonal pattern, this is a reasonable starting result — and one that should improve once num_boost_round is increased to allow the model to fully converge.
