Measuring forecast model accuracy to optimize your business objectives with Amazon Forecast
We’re excited to announce that you can now measure the accuracy of your forecasting model to optimize the trade-offs between under-forecasting and over-forecasting costs, giving you flexibility in experimentation. Costs associated with under-forecasting and over-forecasting differ. Generally, over-forecasting leads to high inventory carrying costs and waste, whereas under-forecasting leads to stock-outs, unmet demand, and missed revenue opportunities. Amazon Forecast allows you to optimize these costs for your business objective by providing an average forecast as well as a distribution of forecasts that captures variability of demand from a minimum to maximum value. With this launch, Forecast now provides accuracy metrics for multiple distribution points when training a model, allowing you to quickly optimize for under-forecasting and over-forecasting without the need to manually calculate metrics.
Retailers rely on probabilistic forecasting to optimize their supply chain to balance the cost of under-forecasting, leading to stock-outs, with the cost of over-forecasting, leading to inventory carrying costs and waste. Depending on the product category, retailers may choose to generate forecasts at different distribution points. For example, grocery retailers choose to over-stock staples such as milk and eggs to meet variable demand. While these staple foods have relatively low carrying costs, stock-outs may not only lead to a lost sale, but also a complete cart abandonment. By maintaining high in-stock rates, retailers can improve customer satisfaction and customer loyalty. Conversely, retailers may choose to under-stock substitutable products with high carrying costs when the cost of markdowns and inventory disposal outweighs the occasional lost sale. The ability to forecast at different distribution points allows retailers to optimize these competing priorities when demand is variable.
Anaplan Inc., a cloud-native platform for orchestrating business performance, has integrated Forecast into its PlanIQ solution to bring predictive forecasting and agile scenario planning to enterprise customers. Evgy Kontorovich, Head of Product Management, says, “Each customer we work with has a unique operational and supply chain model that drives different business priorities. Some customers aim to reduce excess inventory by making more prudent forecasts, while others strive to improve in-stock availability rates to consistently meet customer demand. With forecast quantiles, planners can evaluate the accuracy of these models, evaluate model quality, and fine-tune them based on their business goals. The ability to assess the accuracy of forecast models at multiple custom quantile levels further empowers our customers to make highly informed decisions optimized for their business.”
Although Forecast already provided the ability to forecast across the entire distribution of variability to manage the trade-offs of under-stocking and over-stocking, accuracy metrics were only provided at the default 0.1, 0.5, and 0.9 quantiles (P10, P50, P90), an 80% confidence band centered on the median. To evaluate the accuracy metric at a particular distribution point of interest, you first had to create forecasts at that point, and then manually calculate the accuracy metrics on your own.
With today’s launch, you can assess the strengths of your forecasting models at any distribution point within Forecast without needing to generate forecasts and manually calculate metrics. This capability enables you to experiment faster, and more cost-effectively arrive at a distribution point for your business needs.
To use this new capability, you select forecast types (or distribution points) of interest when creating the predictor. Forecast splits the input data into training and testing datasets, trains and tests the model created, and generates accuracy metrics for these distribution points. You can continue to experiment to further optimize forecast types without needing to create forecasts at each step.
Understanding Forecast accuracy metrics
Forecast provides different model accuracy metrics for you to assess the strength of your forecasting models. We provide the weighted quantile loss (wQL) metric for each selected distribution point, as well as the weighted absolute percentage error (WAPE) and the root mean square error (RMSE), both calculated at the mean forecast. For each metric, a lower value indicates a smaller error and therefore a more accurate model. All these accuracy metrics are non-negative.
Let’s take an example of a retail dataset as shown in the following table to understand these different accuracy metrics. In this dataset, three items are being forecasted for 2 days.
The table lists, for each item and day, the actual demand, the mean and P75 forecasts, and three error columns: the P75 Error (P75 Forecast – Actual), used in wQL[0.75]; the Absolute Error (|Actual – Mean Forecast|), used in WAPE; and the Squared Error ((Actual – Mean Forecast)²), used in RMSE. Item 1 is a popular item with a lot of product demand; Item 2, with low product demand, is in the long tail of demand; Item 3 is also in the long tail and observes a forecast with a large deviation from actual demand. The total demand across all items and days is 313.
The following table summarizes the calculated accuracy metrics for the retail dataset use case.
In the following sections, we explain how each metric is calculated and recommend the best use cases for each.
Weighted quantile loss (wQL)
The wQL metric measures the accuracy of a model at specified distribution points called quantiles. This metric helps capture the bias inherent in each quantile. For a grocery retailer who prefers to over-stock staples such as milk, choosing a higher quantile such as 0.75 (P75) better captures spikes in demand and therefore is more informative than forecasting at the median quantile of 0.5 (P50).
In this example, under-forecasting is penalized more heavily than over-forecasting, which suggests that a higher amount of stock is needed to satisfy customer demand with a 75% probability of success. In other words, the actual demand is less than or equal to the forecasted demand 75% of the time, allowing the grocer to maintain target in-stock rates with less safety stock.
We recommend using the wQL measure at different quantiles when the costs of under-forecasting and over-forecasting differ. If the difference in costs is negligible, you may consider forecasting at the median quantile of 0.5 (P50) or use the WAPE metric, which is evaluated using the mean forecast. The following figure illustrates the probability of meeting demand dependent on quantile.
For the retail dataset example, the P75 forecast indicates that we want to favor over-forecasting and penalize under-forecasting. To calculate the wQL[0.75], sum the positive values in the P75 Error column (over-forecasts) weighted by the smaller factor 1 − 0.75 = 0.25, and add the absolute values of the negative values in the P75 Error column (under-forecasts) weighted by the larger factor 0.75 to penalize under-forecasting; the result is then doubled and normalized by the total demand. The wQL[0.75] is as follows:

wQL[0.75] = 2 × (0.25 × Σ(over-forecasted amounts) + 0.75 × Σ|under-forecasted amounts|) / 313
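The calculation can be sketched in a few lines of Python. The demand and forecast values below are hypothetical, not the values from this post's dataset, and the formula follows the weighted quantile loss definition (numerator doubled and normalized by total absolute demand):

```python
def weighted_quantile_loss(actuals, quantile_forecasts, tau):
    """wQL[tau] = 2 * sum(tau*max(y - q, 0) + (1 - tau)*max(q - y, 0)) / sum(|y|)."""
    numerator = 0.0
    for y, q in zip(actuals, quantile_forecasts):
        under = max(y - q, 0.0)  # under-forecast amount, penalized by tau
        over = max(q - y, 0.0)   # over-forecast amount, penalized by 1 - tau
        numerator += tau * under + (1.0 - tau) * over
    return 2.0 * numerator / sum(abs(y) for y in actuals)

# Hypothetical actuals and P75 forecasts for 3 items over 2 days
actuals = [120, 130, 20, 10, 15, 18]
p75 = [140, 135, 25, 12, 5, 30]
print(round(weighted_quantile_loss(actuals, p75, 0.75), 4))  # → 0.1182
```

Raising tau above 0.5 weights under-forecasts more heavily, which is why a P75 model tends to recommend over-stocking.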
Weighted absolute percentage error (WAPE)
The WAPE metric is the sum of the absolute error normalized by the total demand. The WAPE doesn't penalize under-forecasting or over-forecasting differently, and uses the average (mean) forecast to calculate the differences. We recommend using the WAPE metric when the difference in costs of under-forecasting and over-forecasting is negligible, or if you want to evaluate the model accuracy at the mean forecast. For example, to predict the amount of cash needed in an ATM at a given time, a bank may choose to meet average demand because there is less concern about losing a customer or holding excess cash in the machine. In this example, you may choose to forecast at the mean and use the WAPE to evaluate the model accuracy.
The normalization or weighting helps compare models from different datasets. For example, if the absolute error summed over the entire dataset is 5, it’s hard to interpret the quality of this metric without knowing the scale of total demand. A high total demand of 1000 results in a low WAPE (0.005) and a small total demand of 10 results in a high WAPE (0.5). The weighting in the WAPE and wQL allows these metrics to be compared across datasets with different scales.
The normalization or weighting also helps evaluate datasets that contain a mix of items with large and small demand. The WAPE metric emphasizes accuracy for items with larger demand. You can use the WAPE for datasets where forecasts for a small number of SKUs drive the majority of sales. For example, a retailer may prefer the WAPE metric to put less emphasis on forecasting errors for special edition items and prioritize forecasting errors for the standard items that account for most sales.
In our retail dataset use case, the WAPE is equal to the sum of the absolute error column divided by the sum of the actual demand column (total demand).
Because total demand is mostly driven by Item 1, the WAPE gives more weight to the accuracy of this popular item.
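As a minimal sketch (again with hypothetical demand values, not this post's dataset), the WAPE calculation is:

```python
def wape(actuals, mean_forecasts):
    # Sum of absolute errors normalized by total demand
    return sum(abs(y - f) for y, f in zip(actuals, mean_forecasts)) / sum(actuals)

# The first, high-demand item dominates the denominator,
# so its errors dominate the metric
actuals = [120, 130, 20, 10, 15, 18]
mean_forecasts = [110, 135, 25, 12, 5, 30]
print(round(wape(actuals, mean_forecasts), 4))  # → 0.1406
```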
Many retail customers work with sparse datasets, in which most SKUs are sold infrequently, so the demand is 0 for most of the historical data points. For these datasets, it's important to account for the scale of total demand, which makes wQL and WAPE better metrics than RMSE for evaluating sparse datasets. The RMSE metric doesn't take the scale of total demand into account; because it averages the error over the total number of historical data points and SKUs, a sparse dataset full of zeros can produce a low RMSE value, giving you a false sense of security that you have an accurate model.
Root mean square error (RMSE)
The RMSE metric is the square root of the sum of the squared errors (differences between the mean forecasted and actual values) divided by the product of the number of items and the number of time points. The RMSE also doesn't penalize under-forecasting or over-forecasting differently, and can be used when the trade-offs between under-forecasting and over-forecasting are negligible or if you prefer to forecast at the mean. Because the RMSE is proportional to the square of the errors, it's sensitive to large deviations between the actual demand and forecasted values.
However, you should use RMSE with caution, because a few large deviations in forecasting errors can severely punish an otherwise accurate model. For example, if one item in a large dataset is severely under-forecasted or over-forecasted, the error in that item skews the entire RMSE metric drastically, and may make you reject an otherwise accurate model prematurely. For use cases where a few large deviations are not of importance, consider using wQL or WAPE.
In our retail dataset example, the RMSE is equal to the square root of the sum of the squared error column divided by the total number of points (3 items × 2 days = 6).
The RMSE gives more weight to the large deviation in forecasting error for Item 3, resulting in a higher RMSE value.
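Under the same assumptions (hypothetical values, not this post's dataset), the RMSE calculation and its sensitivity to a single large deviation can be sketched as:

```python
import math

def rmse(actuals, mean_forecasts):
    # Square root of the mean squared error over all item-day points
    n = len(actuals)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actuals, mean_forecasts)) / n)

actuals = [120, 130, 20, 10, 15, 18]
mean_forecasts = [110, 135, 25, 12, 5, 30]
print(round(rmse(actuals, mean_forecasts), 4))  # → 8.1445

# One large deviation on a single item inflates the metric sharply
mean_forecasts_outlier = [110, 135, 25, 12, 5, 90]  # last item badly over-forecast
print(round(rmse(actuals, mean_forecasts_outlier), 4))
```

Note how the WAPE for the same outlier would grow only linearly with the error, while the squaring inside the RMSE makes the single bad item dominate the metric.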
We recommend using the RMSE metric when a few large incorrect predictions on some items can be very costly to the business. For example, a manufacturer predicting machine failure may prefer the RMSE metric. Because operating machinery is critical, any large deviation between actual and forecasted values, even if infrequent, needs to be over-emphasized when assessing model accuracy.
The following table summarizes our discussion on selecting an accuracy metric dependent on your use case.
| Use case | Recommended metric |
| --- | --- |
| Optimizing for under-forecasting or over-forecasting, which may have different cost implications | wQL at the relevant quantiles |
| Prioritizing popular items or items with high demand over low-demand items | WAPE |
| Emphasizing business costs related to large deviations in forecast errors | RMSE |
| Assessing sparse datasets with 0 demand for most items in the historical data points | wQL or WAPE |
Selecting custom distribution points for assessing model accuracy in Forecast
- On the Forecast console, create a dataset group.
- Upload your dataset.
- In the navigation pane, choose Predictors.
- Choose Train predictor.
- Under Predictor Settings, for Forecast types, enter up to five distribution points of your choosing. Mean is also accepted.
If unspecified, default quantiles of 0.1, 0.5, and 0.9 are used to train a predictor and calculate accuracy metrics.
- For Algorithm selection, select Automatic (AutoML).
AutoML optimizes a model for the specified quantiles.
- Optionally, for Number of backtest windows, you can select up to five to assess your model strength.
- After your predictor is trained, choose your predictor on the Predictors page to view details of the accuracy metrics.
On the predictor’s details page, you can view the overall model accuracy metrics and the metrics for each backtest window that you specified. Overall model accuracy metrics are calculated by averaging the individual backtest window metrics.
- Now that your model is trained, choose Forecasts in the navigation pane.
- Choose Create a forecast.
- Select your trained predictor to create a forecast.
Forecasts are generated for the distribution points used when training the predictor. You also have the option to specify different distribution points when generating forecasts.
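The same selection can be made programmatically. The following is a hedged sketch of a CreatePredictor request body; the predictor name, horizon, and dataset group ARN are placeholders, and you would pass the dictionary to the AWS SDK's create_predictor call:

```python
# Parameters follow the CreatePredictor API; the name, horizon,
# and ARN below are placeholders, not real resources.
create_predictor_request = {
    "PredictorName": "retail_demand_predictor",         # placeholder name
    "ForecastHorizon": 14,                              # placeholder horizon
    "ForecastTypes": ["0.25", "0.50", "0.75", "mean"],  # up to five entries; "mean" accepted
    "PerformAutoML": True,                              # AutoML optimizes for these quantiles
    "InputDataConfig": {"DatasetGroupArn": "arn:aws:forecast:..."},  # placeholder ARN
    "FeaturizationConfig": {"ForecastFrequency": "D"},
}

# With boto3, this request would be submitted as:
#   forecast = boto3.client("forecast")
#   response = forecast.create_predictor(**create_predictor_request)
#   metrics = forecast.get_accuracy_metrics(PredictorArn=response["PredictorArn"])
print(len(create_predictor_request["ForecastTypes"]))  # → 4
```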
Tips and best practices
In this section, we share a few tips and best practices when using Forecast:
- Before experimenting with Forecast, define your business problem in terms of the costs of under-forecasting and over-forecasting. Evaluate the trade-offs and decide whether you would rather over-forecast or under-forecast.
- Experiment with multiple distribution points to optimize your forecast model to balance the costs associated with under-forecasting and over-forecasting.
- If you’re comparing different models, use the wQL metric at the same quantile for comparison. The lower the value, the more accurate the forecasting model.
- Forecast allows you to select up to five backtest windows. Forecast uses backtesting to tune predictors and produce accuracy metrics. To perform backtesting, Forecast automatically splits your time-series datasets into two sets: training and testing. The training set is used to train your model, and the testing set to evaluate the model’s predictive accuracy. We recommend choosing more than one backtest window to minimize selection bias that may make one window more or less accurate by chance. Assessing the overall model accuracy from multiple backtest windows provides a better measure of the strength of the model.
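To make the last point concrete, the overall metric that Forecast reports is the average of the per-window values; the wQL numbers below are hypothetical:

```python
# Hypothetical wQL[0.75] values, one per backtest window
backtest_wql = [0.12, 0.10, 0.15]

# Overall model accuracy: average of the individual backtest window metrics
overall_wql = sum(backtest_wql) / len(backtest_wql)
print(round(overall_wql, 4))  # → 0.1233
```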
You can now measure the accuracy of your forecasting model at multiple distribution points of your choosing with Forecast. This gives you more flexibility to align Forecast with your business, because the impact of over-forecasting frequently differs from that of under-forecasting. You can use this capability in all Regions where Forecast is publicly available. For more information about Region availability, see the Region Table. For more information about specifying custom distribution points when creating a predictor and accessing accuracy metrics, see the documentation for the CreatePredictor API and GetAccuracyMetrics API. For more details on evaluating a predictor's accuracy at multiple distribution points, see Evaluating Predictor Accuracy.
About the Authors
Namita Das is a Sr. Product Manager for Amazon Forecast. Her current focus is to democratize machine learning by building no-code/low-code ML services. On the side, she frequently advises startups and is raising a puppy named Imli.
Danielle Robinson is an Applied Scientist on the Amazon Forecast team. Her research is in time series forecasting and in particular how we can apply new neural network-based algorithms within Amazon Forecast. Her thesis research was focused on developing new, robust, and physically accurate numerical models for computational fluid dynamics. Her hobbies include cooking, swimming, and hiking.