Part 9: The Final Model and Submission
Everything up to this point has been preparation. The feature engineering produced 142 inputs. The validation run found the optimal number of boosting iterations and confirmed a validation SMAPE around 13.6%. Feature importance analysis identified which of those 142 features actually contributed. Now all of that feeds into one final step: train the model on every available data point, generate predictions for the test period, and produce a submission file.

Splitting Train and Test
At this stage the dataframe still contains both training rows (with known sales) and test rows (where sales is NaN — the values to be predicted). The split is simple: rows where sales is not NaN become the training set; rows where sales is NaN become the test set.
train = df.loc[~df.sales.isna()] # 2013-01-01 to 2017-12-31
test = df.loc[ df.sales.isna()] # 2018-01-01 to 2018-03-31
This is different from the validation split used earlier. Then, a slice of 2017 data was held out as a proxy test set to tune hyperparameters. Now that tuning is done, all rows with known sales — the full five years — go into training. The model sees more data, which should produce slightly better predictions than the validation run suggested.
Why Refit on the Full Dataset?

The validation model’s purpose was to find the right number of boosting iterations — when to stop training before the model starts overfitting. That number (model.best_iteration) is now known. There is no longer any reason to hold data back for evaluation: every row of training data can now be used to fit the final model.
The final model uses the same hyperparameters as the validation run, with one critical change: early_stopping_rounds is removed, and num_boost_round is set exactly to model.best_iteration. There is no validation set to monitor — the model trains for exactly the number of iterations that the earlier run determined to be optimal, then stops.
lgb_params['num_boost_round'] = model.best_iteration
final_model = lgb.train(lgb_params, lgbtrain_all, num_boost_round=model.best_iteration)
One optional step before training: drop the zero-importance features identified in the previous post. The feature list cols is filtered to imp_feats, which excludes any column whose gain score was exactly zero. This reduces the feature matrix slightly, speeds up training, and removes noise that was never contributing signal.
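The filtering step itself is a one-liner over the gain table. A minimal sketch, where the feature names and gain values are invented for illustration (the real table comes from the feature-importance post):

```python
import pandas as pd

# Toy version of the gain table from the feature-importance analysis.
feat_imp = pd.DataFrame({
    'feature': ['sales_lag_364', 'sales_roll_mean_546', 'day_of_week_3', 'store_7'],
    'gain':    [1520.4,          980.2,                 0.0,             12.7],
})

# Keep only columns that contributed non-zero gain.
imp_feats = feat_imp.loc[feat_imp['gain'] > 0, 'feature'].tolist()
print(imp_feats)  # the zero-gain column drops out
```

The training and test matrices are then indexed with imp_feats instead of the full cols list.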
Generating and Converting Predictions
With the final model trained, predictions for the test period are generated in a single call. The model takes X_test — the feature matrix for the 45,000 test rows — and returns one predicted value per row.
test_preds = final_model.predict(X_test, num_iteration=model.best_iteration)
These predictions come out in log scale, because the target variable was transformed with log1p before training. The raw output looks like this:
array([2.553, 2.742, 2.773, …, 4.375, 4.412, 4.471])
These are log(1 + sales) values, not unit sales. Before they can be submitted, they must be converted back to actual sales figures using expm1 — the exact inverse of log1p. For any value x, expm1(x) = eˣ − 1.
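The round trip is exact: expm1 undoes log1p to floating-point precision, which is why the pair is safe to use as a transform/inverse-transform. A quick check on a few sample values from the prediction array:

```python
import numpy as np

log_preds = np.array([2.553, 2.742, 4.471])   # sample log1p-scale outputs
sales_preds = np.expm1(log_preds)             # back to unit sales: e**x - 1

# Round-trip check: log1p(expm1(x)) recovers x.
assert np.allclose(np.log1p(sales_preds), log_preds)
print(sales_preds)
```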

After conversion, the predicted sales values are in the same units as the original dataset — daily items sold per store-item pair. The distribution looks reasonable: the median prediction sits around 46 units per day, which is close to the 47-unit median in the training data. The shape of the distribution is right-skewed, consistent with what was observed in the raw data from the beginning of this series.
Building the Submission File
The competition expects a CSV file with two columns: id and sales. The id column comes directly from the test set — Kaggle assigned a unique integer to each row when it released the test data. The sales column receives the expm1-converted predictions.
submission_df = test.loc[:, ['id', 'sales']]
submission_df['sales'] = np.expm1(test_preds)
submission_df['id'] = submission_df.id.astype(int)
submission_df.to_csv('submission_demand.csv', index=False)
The id column is cast to integer before saving — the test set stores it as a float because NaN values elsewhere in the dataframe force the column to float dtype, and Kaggle expects integer IDs in the submission. The index=False argument prevents pandas from writing the row index as an extra column, which would break the submission format.
The result is a CSV with 45,000 rows: 90 days of predictions for each of 500 store-item combinations. Each row contains one id and one predicted sales value. That file is what gets uploaded to Kaggle for scoring.
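Before uploading, it is cheap to verify the file has the expected shape and dtypes by reading it back. A self-contained sketch with synthetic predictions — the 45,000-row count and column names come from the post, the values are random stand-ins:

```python
import numpy as np
import pandas as pd

# Toy submission: 45,000 rows, integer ids, positive sales predictions.
n_rows = 45_000
submission_df = pd.DataFrame({
    'id': np.arange(n_rows),
    'sales': np.expm1(np.random.default_rng(1).normal(3.8, 0.5, n_rows)),
})
submission_df.to_csv('submission_demand.csv', index=False)

# Read it back and confirm the format Kaggle expects.
check = pd.read_csv('submission_demand.csv')
print(check.shape, list(check.columns))
assert check['id'].dtype.kind == 'i'      # integer ids survived the round trip
assert check['sales'].ge(0).all()          # expm1 output is non-negative here
```

A malformed submission (extra index column, float ids) fails silently on upload, so this thirty-second check is worth the habit.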
What This Series Covered
This post closes out the modelling section of the series. Starting from a raw CSV with four columns — date, store, item, sales — the pipeline built 142 features across five families: calendar features, lag features, rolling mean features, exponentially weighted mean features, and one-hot encoded identities. A LightGBM model was trained with a custom SMAPE objective, tuned via time-based validation, and evaluated against a held-out quarter of 2017 data.
The feature importance analysis showed that long-horizon rolling averages and year-ago lag features carried the most signal — consistent with the intuition that annual seasonality is the dominant pattern in this data. The final model, trained on all five years of data at the optimal iteration count, produced 45,000 daily sales predictions for the test period.
The validation SMAPE of approximately 13.6% suggests the model captures roughly the right level and seasonal shape of demand for each store-item pair. Whether that is competitive on the Kaggle leaderboard depends on the benchmark — but the pipeline itself is complete, reproducible, and ready to iterate on.
Series complete — Machine Learning for Time Series
