The Southern Hemisphere tropical cyclone season is now officially over, concluding our first full year of forecasting at Reask. Over the past 12 months we have issued three forecasts for each of the six active basins using our automated Machine Learning (ML) approach – how good were they?
Given the probabilistic nature of our forecasts, choosing a framework to judge performance is critical. Having spent considerable time and effort modeling complete risk distributions, we believe that simply comparing mean predictions with observed occurrences throws away a great deal of useful information. Instead, and following a recent article from Nate Silver’s FiveThirtyEight blog, we look here at how well calibrated the model was in its first year.
Model Calibration
As discussed in the FiveThirtyEight piece, a probabilistic model is said to be well calibrated if events that are forecast to happen X percent of the time actually happen about X percent of the time. Throughout the year we released our forecasts quoting a “high probability range” corresponding to probabilities of occurrence between 60 and 70% (exact numbers in the table below). For a well calibrated model we should therefore expect observed outcomes to fall within these predicted windows in roughly two thirds of cases.
The table above splits our forecasts into three groups: named storms (Cat 0), hurricanes (Cat 1) and major hurricanes (Cat 3). Because these groups are not modeled independently, they are better viewed as three samples of six forecasts each rather than a single collection of 18 forecasts. The first five columns show that, when only named storms are considered, observed outcomes fell within our forecast ranges for five of the six basins – the South Indian Ocean being the only basin where they did not (yellow shading). From this initial analysis we can populate the calibration plot with its first data point (see below): the average probability estimate quoted by our model for that group is ~63% (0.63 on the x-axis) and the corresponding observed outcome rate is ~83% (0.83 on the y-axis). Applying the same approach to the other two groups in the table provides two more data points at a similar level of model confidence (i.e. x = 0.63–0.67).
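As an illustration, here is a minimal sketch of how one such calibration point can be computed for a single group. The per-basin probabilities below are hypothetical placeholders chosen to average ~63%; only the 5-of-6 hit pattern and the resulting point (0.63, 0.83) come from the analysis above.

```python
import numpy as np

# Quoted probability that the observed count falls inside the forecast range,
# one value per basin (illustrative placeholder numbers only).
quoted_prob = np.array([0.65, 0.62, 0.60, 0.66, 0.63, 0.62])

# Whether the observed count actually fell inside the range for each basin
# (True for five of the six basins, False for the South Indian Ocean).
in_range = np.array([True, True, True, True, True, False])

x = quoted_prob.mean()   # average model confidence for the group (~0.63)
y = in_range.mean()      # observed outcome rate (~0.83)
print(f"calibration point: ({x:.2f}, {y:.2f})")
```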
For these high probability ranges, and averaged across the three groups, observed outcomes fell within our predicted windows four out of six times, suggesting the quoted probabilities of 60-70% were in line with this year’s observed outcomes.
These numbers sample only a limited part of the calibration plot, and the analysis above needs to be repeated at different levels of model confidence to complete the picture. To do so we designed alternative forecast ranges representing modeled occurrence probabilities of ~18% (1 in 6), ~33% (2 in 6) and ~83% (5 in 6). The results are summarized in the calibration plot below, with the full analysis presented in the three tables at the end of this post:
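For readers who want to reproduce the construction of such a plot, here is a minimal sketch. The aggregated points are hypothetical placeholders (only the named-storm point at 0.63/0.83 is taken from the discussion above); they simply illustrate how the per-group points at each confidence level are laid out against the 1:1 diagonal.

```python
import matplotlib.pyplot as plt

# (group, average quoted probability, observed outcome rate over the 6 basins)
# Placeholder values standing in for the tables at the end of the post.
records = [
    ("named storms",     0.18, 0.33), ("named storms",     0.33, 0.50),
    ("named storms",     0.63, 0.83), ("named storms",     0.83, 1.00),
    ("hurricanes",       0.18, 0.17), ("hurricanes",       0.33, 0.33),
    ("hurricanes",       0.65, 0.67), ("hurricanes",       0.83, 0.83),
    ("major hurricanes", 0.18, 0.17), ("major hurricanes", 0.33, 0.17),
    ("major hurricanes", 0.67, 0.50), ("major hurricanes", 0.83, 0.67),
]

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], "k--", label="perfect calibration (1:1)")
for group, col in [("named storms", "tab:blue"),
                   ("hurricanes", "tab:green"),
                   ("major hurricanes", "tab:red")]:
    xs = [p for g, p, _ in records if g == group]
    ys = [r for g, _, r in records if g == group]
    ax.scatter(xs, ys, color=col, label=group)

ax.set_xlabel("forecast probability")
ax.set_ylabel("observed outcome rate")
ax.legend()
plt.show()
```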
Despite the relatively limited amount of data, this analysis provides important insights into our model’s behavior over the past 12 months:
- As a whole, our predictions appeared very well calibrated across a range of forecast windows.
- The low-intensity forecasts (named storms) sit on the conservative side (blue dots above the 1:1 line), indicating that either the ranges could be narrower or the quoted confidence higher for that group (i.e. a sharper peak in our probability distributions).
- With the red dots all falling on the other side of the 1:1 line, the high-intensity forecasts appear to have been overconfident.
Clearly it can be dangerous to read too much into these early results, and as with any verification exercise involving probabilistic estimates, a much larger volume of records is needed to reach firm conclusions. Yet automatically rolling out our methodology to all six active basins has allowed us to maximize the amount of validation data available to judge our model after only 12 months. These initial results are certainly encouraging, showing some ability to capture the whole risk distribution and to scale globally.
Ahead of the 2019 Atlantic season we have been working closely with clients and partners to refine our approach and move beyond basin-wide metrics of activity. In particular, our ML methods have now evolved to capture the risk of high-impact US landfalling systems.