LightGBM Training: Hyperparameters and Early Stopping

Part 7: Training LightGBM; Gradient Boosting, Hyperparameters, and Early Stopping

The feature set is ready. The validation split is in place. The next step is fitting the model, and understanding what is actually happening during that fit. LightGBM is a gradient boosting framework that builds one tree at a time, with each new tree correcting the mistakes of the previous ones. This post walks through how that works, what each hyperparameter controls, and what the training log is telling you.

How Gradient Boosting Works

A single decision tree is a weak learner on its own; it can capture broad patterns but leaves a lot of residual error. Gradient boosting turns this weakness into a strength by adding trees one at a time, each one learning from the mistakes of the current ensemble.

The process is straightforward. The first tree makes a rough prediction. The model computes the residuals—the gaps between predictions and actual values. The second tree learns from those residuals rather than the original target. Its predictions are added to the ensemble, reducing the error. A third tree fits the new residuals, and so on. After enough iterations, the accumulated ensemble of shallow trees collectively approximates a much more complex function than any single tree could.
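The loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not LightGBM's implementation: it assumes squared-error loss, uses one-split "stumps" instead of real trees, starts from the mean of the target, and runs on a made-up series. The names fit_stump and boost are hypothetical.

```python
import math

def fit_stump(x, residuals):
    """Find the single split on x that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # drop the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def rmse(y, pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y))

def boost(x, y, n_trees, learning_rate):
    """Add stumps one at a time, each fitted to the current residuals."""
    pred = [sum(y) / len(y)] * len(y)      # rough starting prediction: the mean
    history = []
    for _ in range(n_trees):
        residuals = [a - p for a, p in zip(y, pred)]
        stump = fit_stump(x, residuals)
        # Shrink the new stump's contribution before adding it to the ensemble.
        pred = [p + learning_rate * stump(xi) for p, xi in zip(pred, x)]
        history.append(rmse(y, pred))
    return history

# Made-up toy series: a wave on top of a trend.
x = list(range(30))
y = [3.0 * math.sin(xi / 4.0) + 0.1 * xi for xi in x]

history = boost(x, y, n_trees=10, learning_rate=0.5)
print([round(h, 3) for h in history])      # training RMSE after each added stump
```

Each printed value is the training RMSE after one more stump, so the list should never increase. Lowering learning_rate makes each step smaller, which is why a small value needs more iterations to reach the same error.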

**Figure 1.** The boosting process on a toy series. Each panel adds more trees to the ensemble. Red lines show the remaining residuals, which shrink with each additional tree. RMSE drops from 2.64 (1 tree) to 1.41 (10 trees).

Two parameters directly control this process. num_boost_round sets how many trees to build in total. learning_rate (also called shrinkage) scales each tree’s contribution before adding it to the ensemble. Smaller values mean each tree steps more cautiously, which generally produces a more robust model at the cost of needing more iterations to converge. At a learning rate of 0.02, 1,000 iterations is a conservative budget; in practice, 10,000–15,000 iterations are common at this learning rate.

The Hyperparameters

LightGBM exposes many parameters. Grid search selected these eight candidate values, and we verified them manually. Most of the chosen values sit close to the library defaults, which is often a reasonable starting point for a well-engineered feature set.

**Figure 2.** The eight LightGBM hyperparameters used, their values, and what each one controls.

Tree Structure Parameters

num_leaves controls how many terminal nodes each tree can have. More leaves let the tree capture finer patterns, but also make it more likely to memorize the training data. With 10 leaves, each tree stays deliberately shallow. max_depth complements num_leaves by placing a hard limit on tree depth.

feature_fraction tells LightGBM to consider a random 80% of features at each boosting iteration. Like Random Forest’s random subspace method, this increases diversity across trees and improves generalization. Each tree becomes slightly weaker, but the ensemble becomes more robust.

Iteration and Stopping Parameters

num_boost_round is the most important parameter in LightGBM. It determines how many trees the model builds—or equivalently, how many boosting iterations it performs. More iterations refine the model but also increase the risk of overfitting. We intentionally start with 1,000 iterations; the training log below shows why the model benefits from more.

early_stopping_rounds makes a high num_boost_round safe. LightGBM evaluates the validation metric after every iteration and stops if it doesn’t improve for 200 consecutive rounds. It then keeps the model from the best iteration, preventing unnecessary overfitting while reducing training time.
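The stopping rule itself is simple enough to sketch in plain Python. This is a simplified model of the idea for a lower-is-better metric, not LightGBM's own code; the function name and the example score lists are made up for illustration.

```python
def run_with_early_stopping(val_scores, patience):
    """Return (best_iteration, stopped_early) for a lower-is-better metric.

    val_scores[i] is the validation metric measured after iteration i + 1.
    """
    best_score = float("inf")
    best_iteration = 0
    for iteration, score in enumerate(val_scores, start=1):
        if score < best_score:
            best_score = score
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` rounds: stop and keep the best model.
            return best_iteration, True
    # Ran out of iterations while still allowed to continue.
    return best_iteration, False

# Metric plateaus after iteration 3, so training halts two rounds later.
plateau = run_with_early_stopping([5.0, 4.0, 3.0, 3.1, 3.2, 3.3], patience=2)

# Metric is still improving at the last iteration, so stopping never triggers.
still_improving = run_with_early_stopping([5.0, 4.0, 3.0, 2.0], patience=2)
```

In the second case the best iteration equals the last one, which is the situation to watch for: it usually means the iteration budget, not the stopping rule, ended the run.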

nthread = -1 tells LightGBM to use all available CPU cores. It affects only training speed, not model quality.
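Collected in one place, the values stated in this section might look like the fragment below. The variable names are illustrative, and keeping num_boost_round and early_stopping_rounds outside the dictionary is an assumption about how they are passed to training. max_depth and any remaining parameter are left out because their values are not given in the text.

```python
# Values taken from the text above; names are illustrative.
lgb_params = {
    "num_leaves": 10,         # terminal nodes per tree
    "learning_rate": 0.02,    # shrinkage applied to each tree
    "feature_fraction": 0.8,  # random 80% of features per boosting iteration
    "nthread": -1,            # threading only; no effect on model quality
    # max_depth is also used, but its value is not stated in the text.
}

num_boost_round = 1000        # upper bound on boosting iterations
early_stopping_rounds = 200   # patience on the validation metric
```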

LightGBM’s Own Data Format

LightGBM ships with its own internal data structure called lgb.Dataset. Rather than passing a raw NumPy array or pandas DataFrame directly to the training function, you first wrap the data in this format. The library can then preprocess and bin the features internally in a way that makes training significantly faster — particularly on large datasets like this one with 730,000 training rows.

lgbtrain = lgb.Dataset(data=X_train, label=Y_train, feature_name=cols)

lgbval = lgb.Dataset(data=X_val, label=Y_val, reference=lgbtrain, feature_name=cols)

The reference=lgbtrain argument on the validation dataset is important: it tells LightGBM to use the same binning scheme for the validation set as for the training set, ensuring the two datasets are treated consistently. Without it, LightGBM might bin features differently, making the validation evaluation unreliable.
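To see why shared binning matters, here is a small plain-Python sketch. It is only an illustration of the idea: LightGBM's real binning is more sophisticated, and the equal-frequency rule, function names, and numbers below are assumptions made for the example.

```python
from bisect import bisect_right

def fit_bin_edges(values, n_bins):
    """Pick bin boundaries from the sorted values (simple equal-frequency cut points)."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(step * k)] for k in range(1, n_bins)]

def apply_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]

train_feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
val_feature = [2.5, 7.5, 100.0]

# Consistent: validation values are binned with edges learned on the training data.
edges = fit_bin_edges(train_feature, n_bins=4)
val_bins_shared = apply_bins(val_feature, edges)

# Inconsistent: validation data gets its own edges, so bin indices change meaning.
own_edges = fit_bin_edges(val_feature, n_bins=4)
val_bins_own = apply_bins(val_feature, own_edges)

print(val_bins_shared, val_bins_own)
```

The same validation values land in different bins depending on whose edges are used. Trees split on bins learned from the training data, so the validation set has to be expressed in those same bins for its scores to mean anything.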

Training is then called with lgb.train rather than the sklearn-style .fit. Both work, but the native lgb.train interface gives full access to callbacks, custom evaluation functions, and the training log format shown below.

Early Stopping in Practice

Once training starts, LightGBM prints a log line every 100 iterations (controlled by lgb.log_evaluation(100)). Each line reports four values: training L2 loss, training SMAPE, validation L2 loss, and validation SMAPE. Reading this log is the primary way to understand what the model is doing.

**Figure 3.** Training and validation SMAPE across 1,000 iterations. Both curves are still declining at iteration 1,000, so early stopping was never triggered. This signals that num_boost_round should be increased substantially.

The training log for this run ends with the message: “Did not meet early stopping. Best iteration is: [1000].” This is a direct signal that 1,000 iterations was not enough. The validation SMAPE was still improving at the final step and had not plateaued, so early stopping had no reason to intervene. Setting num_boost_round to 10,000 would allow the model to keep improving and let early stopping find the natural optimum.

The generalisation gap, the difference between training and validation SMAPE, is small and roughly stable. Training SMAPE reaches 13.35%; validation reaches 13.83%. A gap of about half a percentage point across 730,000 training rows and 45,000 validation rows suggests the model is not meaningfully overfitting at this iteration count. The main problem is simply that it has not converged yet.

**Figure 4.** L2 loss and SMAPE plotted at each 100-iteration checkpoint. Both metrics improve monotonically, confirming that the model would benefit from more iterations.

Predicting on the Validation Set

After training, the model’s best iteration is used to generate predictions on the validation set features. The argument num_iteration=model.best_iteration ensures the prediction uses the ensemble at its peak validation performance, not necessarily the final iteration.

y_pred_val = model.predict(X_val, num_iteration=model.best_iteration)

These predictions come out in log-scale, because the target variable was transformed with log1p before training. To evaluate them against real sales figures, both predictions and actuals must be inverted with expm1 — the exact reverse of log1p. Applying expm1 to a log1p-transformed value recovers the original units: expm1(log1p(x)) = x.

smape( np.expm1(y_pred_val), np.expm1(Y_val) ) → 13.828%
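The smape function is not defined in this post, so the sketch below assumes one common definition: the mean of 2|F − A| / (|A| + |F|), expressed in percent, with terms where both values are zero counted as zero. It uses only the standard library, and the sales and prediction numbers are made up to show the log1p/expm1 round trip.

```python
import math

def smape(preds, actuals):
    """SMAPE in percent; terms where both values are zero contribute 0."""
    total = 0.0
    for f, a in zip(preds, actuals):
        denom = abs(f) + abs(a)
        if denom != 0:
            total += 2.0 * abs(f - a) / denom
    return 100.0 * total / len(actuals)

sales = [0.0, 12.0, 30.0, 7.0]                   # hypothetical actual sales
log_target = [math.log1p(s) for s in sales]      # what the model is trained on
recovered = [math.expm1(v) for v in log_target]  # back to original units

log_preds = [0.1, 2.5, 3.5, 2.0]                 # hypothetical log-scale predictions
preds = [math.expm1(v) for v in log_preds]

score = smape(preds, recovered)
print(round(score, 2))
```

Scoring in log space instead would give a different, much smaller-looking number, which is why both sides are inverted before the metric is computed.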

The validation SMAPE is 13.83%. In practical terms, the model’s predictions are off by about 14% on a symmetric basis across 45,000 daily store-item observations. For a demand forecasting problem with 500 distinct series, each with its own scale and seasonal pattern, this is a reasonable starting result — and one that should improve once num_boost_round is increased to allow the model to fully converge.

Time Series with Machine Learning (Part 7)
