A framework for probabilistic weather forecast post-processing across models and lead times using machine learning

Forecasting the weather is an increasingly data-intensive exercise. Numerical weather prediction (NWP) models are becoming more complex, with higher resolutions, and there are increasing numbers of different models in operation. While the forecasting skill of NWP models continues to improve, the number and complexity of these models poses a new challenge for the operational meteorologist: how should the information from all available models, each with their own unique biases and limitations, be combined in order to provide stakeholders with well-calibrated probabilistic forecasts to use in decision making? In this paper, we use a road surface temperature example to demonstrate a three-stage framework that uses machine learning to bridge the gap between sets of separate forecasts from NWP models and the ‘ideal’ forecast for decision support: probabilities of future weather outcomes. First, we use quantile regression forests to learn the error profile of each numerical model, and use these to apply empirically derived probability distributions to forecasts. Second, we combine these probabilistic forecasts using quantile averaging. Third, we interpolate between the aggregate quantiles in order to generate a full predictive distribution, which we demonstrate has properties suitable for decision support. Our results suggest that this approach provides an effective and operationally viable framework for the cohesive post-processing of weather forecasts across multiple models and lead times to produce a well-calibrated probabilistic output. This article is part of the theme issue ‘Machine learning for weather and climate modelling’.

Forecasting the weather is an increasingly data intensive exercise. Numerical Weather Prediction (NWP) models are becoming more complex, with higher resolutions, and there are increasing numbers of different models in operation around the world. While the forecasting skill of NWP models continues to improve, the number and complexity of these models poses a new challenge for the operational meteorologist: how should the information from all available models, each with their own unique biases and limitations, be combined in order to provide stakeholders with well-calibrated probabilistic forecasts to use in decision making?
In this paper, we use a road surface temperature forecasting example to demonstrate a three-stage framework that uses machine learning to bridge the gap between sets of separate forecasts from NWP models and the 'ideal' forecast for decision support: probabilities of future weather outcomes. First, we use Quantile Regression Forests to learn the error profile of each numerical model, and use these to apply empirically-derived probability distributions to forecasts. Second, we combine these probabilistic forecasts using quantile averaging. Third, we interpolate between the aggregate quantiles in order to generate a full predictive distribution, which we demonstrate has properties suitable for decision support. Our results suggest that this three stage approach provides an effective and operationally viable framework for the cohesive post-processing of weather forecasts across multiple models and lead times in order to produce well-calibrated probabilistic forecasts.

Introduction
The importance of weather forecasting for decision support is likely to increase as we progress into times of changing climate and perhaps more frequent extreme conditions [1]. Any methodological developments that can improve our ability to make the optimal decisions in the face of meteorological uncertainty are likely to have a real impact on all areas that utilise weather forecasts.
Since the inception of meteorology as a mathematical science, driven by the likes of Abbe [2], Bjerknes [3], and Richardson [4], numerical modelling has been the core methodology of weather forecasting. In 2015, Bauer et al. [5] reviewed the progress of numerical forecasting methods in the quiet revolution of numerical weather prediction, and explained how improvements in physical process representation, model initialisation, and ensemble forecasting have resulted in average forecast skill improvements equivalent to one day's worth per decade -implying that in 2020 our five day forecasts have approximately the same skill as the one day forecasts of 1980.
However, the continuation of these gains requires ever more computational resources. For example, in pursuit of higher resolution models, halving grid cell length in three dimensions requires eight times the processing power, but due to model biases and initial condition uncertainty, corresponding improvements in forecasting skill are not guaranteed. At the same time, as society progresses we are placing greater emphasis on efficiency and safety in everything we do. In order for businesses to operate efficiently and in order to keep the public safe from meteorological hazards, there should be great emphasis on improving the functionality of weather forecasts as decision support tools -and that means bridging the gap between deterministic NWP model outputs (including sparse ensembles from these) and fully probabilistic forecasting approaches suitable for supporting decision making through the use of decision theory [6,7]. In essence, statistical approaches are key to optimal, transparent, and consistent decision making.
At the same time, while numerical weather prediction methodology has evolved gradually over the last century (hence 'the quiet revolution'), the last decade has seen significant developments in machine learning and its rise into the scientific limelight, with promising results being demonstrated in a wide range of applications [e.g. [8][9][10]. The catalyst for this new wave of machine learning can perhaps be attributed to the results of Krizhevsky et al. [11] in the Large Scale Visual Recognition Challenge (ILSVRC) of 2012, who demonstrated for the first time that deep neural networks -with their ability to automatically learn predictive features in order to maximise an objective function -could outperform existing state-of-the-art image classifiers based on hand-crafted features, which had been the established approach for previous decades. The parallels between the hand-crafted features in image classification, and the human choices that are made in all kinds of data processing pipelines -including weather forecasting -have inspired exploration into new applications of machine learning. In meteorology, could these tools relieve pressure from current model development and data processing bottlenecks and deliver a step-change in the rate of progress in forecasting skill?
Initial efforts using machine learning in the context of post-processing NWP model output have shown promising results [e.g. [12][13][14] but have also tended to deal with deterministic forecasting rather than providing fully probabilistic approaches. We believe that the greatest value of machine learning in weather forecasting lies in the probabilistic capabilities of these methods: not only do they have the potential to learn to improve forecasting skill empirically, but also to bridge the gap between traditionally deterministic forecasting approaches (i.e. numerical weather prediction) and the probabilistic requirements of robust decision support tools.
To this end, in this paper we demonstrate our framework for probabilistic weather forecast post-processing using machine learning. We have designed this framework to be suitable for use by operational meteorologists, and therefore, unlike other studies that we are currently aware of, our proposed solution incorporates forecast data from all available model solutions (i.e. multiple NWP model types, and all available forecast lead times). The framework aggregates the available forecast information into a single well-calibrated predictive distribution, providing probabilities of weather outcomes for each hour into the future. Our application is road surface temperature forecasting -a univariate output -using archived operational data from the UK Met Office. In this demonstration we use Quantile Regression Forests [QRF, 15] as our machine learning algorithm, but hope to convince readers that our overall approach -flexible quantile regression for each forecast, followed by averaging of quantiles across forecasts, and finally interpolating the full predictive distribution -provides a flexible framework for probabilistic weather forecasting, and crucially one that is compatible with the use of any probabilistic forecasting models (post-processed or otherwise).
Our framework can be seen as an overarching aggregator of forecast information, emulating part of the role of the operational meteorologist, who must otherwise develop a sense for how skillful each individual forecast is through experience, and mentally combine these forecasts in order to make probabilistic statements to inform decision making. These include judgements of uncertainty such as a 'most likely scenario' and a 'reasonable worst case scenario' [16]. Figure 1 gives an example of how complex a task it is to make sense of the available forecast information, even for the single variable of road surface temperature at a single site.  Figure 1. A visualisation of the information provided by numerical weather prediction (NWP) forecasts. Each coloured line represents an ensemble member from a different model type. Observations (solid black line) go as far as time zero (vertical dashed line -the 'current time', which is 00:00 on 5th of Jan in this figure) and beyond that, if a statistical approach is not used, it's down to individual meteorologists to determine the likely weather outcomes based on the information presented by the models.
While methods for weather forecast post-processing using more traditional statistical approaches have existed for some time [e.g. [17][18][19][20], we believe our machine learning based approach to be a useful contribution to the field as interest in meteorological machine learning grows. The development of our framework has been guided by the needs of operational weather forecasting, including handling sets of different weather forecasting models with their own unique ranges of lead times. Increasingly these forecasts may not all be raw NWP forecasts, but are themselves likely to have been individually post-processed using machine learning (e.g. for downscaling), or purely statistical spatio-temporal forecasts. It is therefore a strength of our proposed framework that we can post-process any number of models of any type, and for any lead times. rsta.royalsocietypublishing.org Phil. Trans. R. Soc. A 0000000

Post-processing framework
The key considerations in designing our framework were that we wanted to develop an approach that was flexible, compatible, and fast. Flexible in the sense that we would like to minimise the number of assumptions made that would constrain the form of our probabilistic forecasts, and largely 'let the data do the talking', as tends to be the machine learning ethos. Compatible in the sense that we would like our framework to generalise to scenarios in which NWP model outputs are not the only forecast available -this is likely to become more common as machine learning becomes more commonplace. And fast, because weather forecasting is a near-real-time activity and any post-processing approach has to be able to keep up.
There are many possible approaches for post processing individual weather forecasts, and indeed many possible approaches for producing forecasts in the first place (for example spatiotemporal statistical models [21], or more recently neural network based approaches [22], in addition to the traditional NWP models). By using quantiles as the basis on which we combine multiple forecasts, our approach is compatible with any forecast from which well-calibrated predictive quantiles can be obtained, either from the forecast model directly (if probabilistic), or through uncertainty quantification of deterministic models, as we demonstrate in this paper. The three stages of our framework's methodology are explained in the following subsections.

(a) From deterministic to probabilistic forecasts
For our application to road surface temperature forecasting, the available forecasts come from a set of NWP models, as is commonly the case. Our model set spans from long range, low resolution global models (glu, glm) through medium range, medium resolution European models (eur_eu, eur_uk) to shorter range, high resolution UK specific models (ukv, enuk) including a six-hour nowcast (pvrn). Apart from the 'enuk' model, which itself provides an ensemble of 12 members on each run, the other models provide single deterministic forecasts. While all of these models provide spatial forecasts, in this study we post-process the forecasts for specific sites in order to focus on the probabilistic aspects. Figure 1 shows a snapshot of the set of model forecasts for a single site.
While the final output of our framework is a full predictive distribution summarising the information contained in the entire set of NWP model output, the first step is to convert each deterministic forecast into an individually well-calibrated probabilistic forecast. We do this by using machine learning to model the error profile of each deterministic forecast conditional on forecasting covariates. The error is defined as: where x t,m is a NWP model forecast for model type m (e.g. 'enuk') and lead time t while y is the corresponding observation. For our surface temperature data, lead times range from 0 hours to 168 hours. Predictions of future data points are then obtained bŷ Modelling the errors rather than y was empirically found to produce better predictions using significantly less training data. An explanation for this is that x t,m is used as a complex trend removal function (e.g. for seasonality and other non-stationary effects), thus allowing us to treat t,m as a time-invariant (stationary) variable (see Figure 2). Modelling the errors has the added benefit of having much more data available for training for each t and m combination, compared to modelling the unique y at each time step. The recent work of Taillardat and Mestre [23] shows that we are not alone in successfully using an error modelling approach. The particular model is initialised at 00:00 and 12:00 hours, so we see increased errors on a 12 hour cycle starting from initialisation. This is because temperature errors tend to be larger in the early hours of the afternoon (when effects of inaccurately modelled cloud coverage on solar irradiance are most pronounced) compared to the early evening and morning. In order to learn the error distribution of each NWP model type, we use Quantile Regression Forests [QRF,15] as implemented in the 'ranger' package in R [24]. While many other data modelling options are possible, QRF has a number of desirable properties. First, it has the flexibility to fit complex functions with minimal assumptions. For data rich problems such as ours, not specifying a parametric distribution allows us to capture the true complexity of the error distribution. Second, it is very fast in both training and prediction, and suitable for operational settings avoiding user input such as convergence checks (e.g. MCMC or gradient descent based methods). Third, it is relatively easy to understand the algorithm and has only a few hyper-parameters to tune, which makes getting reasonably good results in new problems quite straightforward.
For a detailed explanation of the QRF algorithm see Athey et al. [25] or Taillardat et al. [14] for a more weather oriented description. For regression problems like ours, the QRF algorithm (a variant of the popular random forest algorithm) consists of an ensemble of regression trees. A regression tree recursively partitions the space defined by the covariates into progressively smaller non-overlapping regions. A prediction is then some property/statistic of the observations contained within the relevant region. Conventionally for each tree the prediction is the sample mean of the observations in the partition corresponding to new input data. Suppose for instance that a regression tree is grown on the data in Figure 2 and that our aim is to predict the mean forecast error at 100 hours. Suppose also that the tree had decided to group all observations in t ∈ [98, 106] into the same partition. Then the prediction for t = 100 would simply be the mean of all observations between 98 and 106 hours. For a QRF however, the same tree would instead return the values of all the observations between 98 and 106 hours as an empirical distribution from which quantiles are later derived. The predictive performance of random forests is sensitive to the partitioning of the covariate space. The splitting rule, which governs the placement of partitioning splits as each tree grows, is therefore an important parameter. Here we use the variance splitting rule, which minimises the intra-partition variance within the two child partitions at each split. A key aspect of the random forest and QRF algorithm is that each tree in the ensemble is grown on its own unique bootstrapped random sample of the training data. This produces a forest of uncorrelated trees, which when aggregated (called bootstrap aggregation or 'bagging') results in an overall prediction that is less prone to over-fitting than an individual decision tree, while retaining the ability to learn complex functions. To produce quantile predictions, the QRF returns sample quantiles from all observations contained within the relevant partition of each individual tree in the forest. In doing so it behaves as a conditional (on the covariates) estimate of the CDF.
For modelling NWP surface temperature errors, the QRF hyper-parameters and covariate choice were manually adjusted to achieve good out-of-bag quantile coverage (a QRF proxy for out-of-sample performance) across all lead times. This was achieved using visual checks such as Figure 3, which indicates that on average, prediction intervals are close to the ideal coverage across lead times, i.e. 90% of the time observations will fall within the 90% prediction interval. However for operational setups it may be preferable to use a more formal optimisation procedure, such as Bayesian optimisation. We found that using just t and m as covariates gave the best calibration results. Inclusion of other covariates did improve metrics during training (both in-bag and out-of-bag), but this was not found to translate to better forecasting metrics on the future data. In other words the additional complexity resulted in overfitting in the context of unseen forecast data. The chosen hyper-parameters were: mtry = 1, min.node.size = 1, sample.fraction = 128/nrow(training data), and num.trees = 250. The use of a relatively small sample size (128 observations for each tree, out of a total of around 50,000 observations in a 14 day run-in period) and a minimum node size of one (trees grown to full depth) was found to produce the best out-of-bag coverage at a minimal run time. Our mtry setting meant that one of the two covariates (t and m) was made available at random to each tree at each split. If another objective had been prioritised (e.g. to minimise mean squared error, rather than optimise coverage) the optimal hyper-parameters would be different.  Figure 3. Coverage of the 50%, 80%, 90%, and 95% QRF prediction intervals on out-of-bag data from one training scenario (though the picture is indicative of other scenarios). The coverage is the proportion of observations that fall within each prediction interval, and should match the interval (i.e. with 95% of observations falling within the 95% prediction interval) in a well-calibrated setup.
Once the QRF has been trained, each NWP forecast can be converted to a probabilistic forecast by adding to it the predicted error distribution (2.2). Unlike the deterministic NWP forecast, the prediction is now a probability distribution, constructed through a conditional  Figure 4. A deterministic NWP forecast for m = glm that has been converted to a probabilistic forecast using equation (2.2). The 80% and 95% prediction intervals are shown as overlain grey ribbons, while the solid grey line is the median (which differs little from the NWP forecast here).

(b) Combining probabilistic forecasts
The next step is to combine these predictive distributions from each NWP model output into a single distribution that is suitable for use in decision support. The challenge is to combine the forecasts in a probabilistically coherent manner, with the goal of producing a single well-calibrated and skillful predictive distribution. A popular approach for combining probabilistic models is Bayesian Model Averaging (BMA), and its use in the statistical post-processing of weather forecasts has precedent [e.g. 17,27,28]. However, in order to satisfy the requirements of our framework, we propose an alternative approach using quantile averaging. An illustrative comparison of unweighted BMA and quantile averaging is shown in (Figure 5). For the purposes of our framework, we found BMA to be unsuitable for the following three reasons: 1) Achieving good calibration of the combined distribution produced by BMA requires optimisation of the intra-model variance, i.e. the spread of each individual model's error profile. In our case, where each model's error profile has been learned independently by QRF, and is already well-calibrated, combining these through BMA produces an over-dispersed predictive distribution due to the inclusion of the inter-model variance in addition to the already calibrated intra-model variances. 2) In turn, this makes BMA rather incompatible with input models that are individually well-calibrated (e.g. statistical nowcasts), and therefore incompatible with a general framework like ours.
3) The use of BMA across all models and lead times is complicated by the fact that there are not an equal number of forecasts available for each lead time. This means that the inter-model variance has no consistency across lead times, and in fact trends opposite to forecast uncertainty owing to the fact that there are more NWP forecasts available for shorter lead times (Figure 1). Our framework overcomes these unstable inter-model variance issues by using quantile averaging via the 'Vincentization' method [29,30] to combine forecasts that are already well-calibrated for coverage (owing to their QRF error profiles, in our case). In effect, this integrates out the inter-model variance, preserving the calibration of the input models. The choice and implementation of distribution combination scheme can have important implications for decision-support forecasting, and while quantile averaging satisfies our general requirements for this framework, we do not discount that alternative approaches may be preferable depending on the application.
Our quantile averaged forecast benefits from stability owing to the law of large numbersany quantile of the forecast distribution represents an average of the estimates of that quantile across the available individual forecasts. This approach is therefore more akin to model stacking procedures, as used in ensemble machine learning to improve prediction accuracy by reducing prediction variance [31]. Indeed, this same logic is behind the bootstrap aggregation ('bagging') procedure of the random forest algorithm: by averaging the predictions of multiple individual predictors -each providing a different perspective on the same problem -the variance of the aggregate prediction is reduced, resulting in improved prediction accuracy at the expense of some increased bias [32]. Crucially for our framework, unlike a BMA approach which retains the inter-model variance, the calibration of our quantile averaged output is invariant to the number of forecasts available at each timestep. This is key for temporally coherent forecast calibration across all lead times.
Our error modelling approach does require one extra-step of processing in order to handle model types which themselves have multiple interchangeable ensemble members. The 'enuk' model ( Figure 1) is our example of this, having twelve non-unique members. In such cases, the apparent error profile for the model type as a collective gets overinflated by the inter-member variance. Our solution to this is to label each ensemble member by its rank (at each time step). This creates 12 unique covariates in our case and results in well-calibrated error profiles (though with significant offset bias in the extreme ranking members, as would be expected).

(c) Simulation from the full predictive distribution
While quantile averaging provides an effective way of combining multiple probabilistic forecast distributions, it leaves us with only a set of discrete quantiles rather than the full predictive distribution. This distribution is desirable because it allows us to (a) answer important questions such as 'what is the probability that the temperature will be below 0 • C?' and (b) evaluate the skill of the probabilistic forecast using proper scoring rules. To obtain the full predictive distribution, we interpolate between the quantiles of our combined forecast using the method of QuiÃśonero-Candela et al. [33], which has previously been applied to precipitation forecasting [34]. The method interpolates between the quantiles, while assuming exponential decay for the tails of the distribution. This minor simplifying assumption was judged to be reasonable for continuous variables such as temperature and ensures that the full probability density function sums to one. Applying this method allows us to simulate temperature outcomes for each hour of our combined post-processed forecast (see Figure 6 for one example) which is the final step of our framework -taking us from a set of disparate NWP forecasts to a single full predictive distribution of weather outcomes.

Results and discussion
To evaluate our framework, we applied it to 200 randomly time-sliced and site-specific forecasting scenarios extracted from our UK Met Office road surface temperature dataset, which we have aggregated to hourly time steps. Each scenario has a training window of 14 days, provided approximately 50 000 data points of t,m to train the QRF, which is then used to post-process the forecasts leading out from time zero. While there are only 336 hours in a 14 day training window, the number of NWP models and their regular re-initialisation schedule, means that approximately 150 forecasts are made for any hour by the time it is observed. While we only use the current forecasts from each model type to generate our predictions, the training benefits from every historical forecast within the window. Figure 7 shows an example prediction of up to 168 hours into the future for a particular scenario. Although the prediction at each hour ahead is a full probability distribution, here we present prediction intervals as well as a simulation of 1000 temperature values from it. The samples were used to derive the probability of the temperature being below 0 • C, as the proportion of values less than zero. Different stakeholders will require their own unique predictive quantities, and by providing a full predictive distribution, our framework should cater for a wide variety of requirements.
Various metrics could be used to evaluate the skill of our probabilistic forecasts over multiple scenario runs. From the perspective of decision support, the ideal metric to evaluate would be the change in loss resulting from using our forecasts to make real world decisions, such as about when to grit roads. However, in the interest of a more general analysis we use a range of standard metrics. These are: prediction interval coverage (Figure 8 left), the root-mean-squared-error (RMSE) of the median (Figure 8  is still desired), as well as the continuous ranked probability score (CRPS) and logarithmic score of our probabilistic forecast (both in Figure 9). Figure 8 indicates that coverage is good overall, with 94.7% of observations falling within the 95% prediction interval, although there is some over-dispersion of our forecast at the shortest ranges and under-dispersion at the longest ranges. This is an indication that, despite producing near perfect results on out-of-bag training data (Figure 3), the QRF performance diminishes slightly when applied to new data. The range dependent over-and under-dispersion may be due to the partitioning process on which the forest is grown -by necessity the partitions that represent the extremes of forecast range must extend some distance towards the middle of the range, and in doing so end up capturing an empirical error distribution that is slightly biased towards the average empirical error distribution, rather than perfectly representing the distribution at the extremes of covariates. It may be the case that other data modelling approaches could do better in this respect. Although deterministic performance was not our focus, the QRF median prediction does outperform the mean of the available NWP models across the entire forecast range in terms of RMSE (a metric which should favour the mean, Figure 8 right). While only a conceptual benchmark, this can be taken as some indication that we have not 'thrown away' deterministic performance in pursuit of probabilistic calibration. Figure 8 also indicates that our method results in a monotonically increasing error with forecast range, unlike the mean of the original NWP forecasts. Similarly, we see a monotonic increase in both the CRPS and the logarithmic score with increasing forecast range ( Figure 9). While we present these metrics mostly to convey the consistency of our framework across forecast ranges, there is some opportunity to make cautious comparison to the same metrics achieved by the BMA approach of Raftery et al. on (non-road) surface temperature forecasting [17] -cautious because our datasets are not the same, and so we do not suggest that the differences should be attributed to the post-processing methods involved. Nevertheless, we mention the results of Raftery et al to provide some frame of reference for our own. In the Raftery et al. study, the best CRPS achieved for a 48 hour forecast was 1.61, which required 60 days of training data. Our framework achieves a CRPS of 0.92 for a 48 hour forecast using 14 days of training data. Similarly, while Raftery et al. achieve a RMSE of 2.92 from their deterministic output, our median forecast achieves a RMSE of 1.32.
The publication of open datasets by Haupt et al. [35] (to which our road surface temperature dataset has been contributed) should enable direct comparison between post-processing methods in the future.
In terms of speed, training the QRF for each forecast scenario takes between just three and four seconds on an i7-8550U laptop, and so the implementation of this framework can be expected to add very little overhead to a typical operational NWP forecasting setup.

Discussion and conclusions
The conversion of disparate forecasts into a cohesive probabilistic output is important. A key function of weather forecasts is to support decision making, but current numerical methods do not provide the well-calibrated probabilistic output required to do this rigorously. By applying our framework we compensate for this shortcoming, effectively supplementing forecasts with information from their historic performance in order to combine all available deterministic inputs, for all lead times, into a single well-calibrated probabilistic forecast. Whilst our approach is by no means the first to provide probabilistic post-processing of weather forecasts, we believe the flexibility and speed provided by our use of machine learning, along with our framework's relative simplicity and ability to simultaneously deal with all available models and lead times, makes it a strong option for consideration in operational forecasting settings.
In this study we have only applied our framework to site specific forecasting, but there are no fundamental reasons why the same principles cannot be applied to spatial forecasting by providing the QRF with additional spatial covariates against which to learn its error profiles. The error modelling approach that we use seems a very effective way of minimising the amount  of training data required compared to predicting absolute values. Taillardat et al. [14], who also make use of QRF in their post-processing, initially used four years of training data for their absolute value forecasting system in 2016, but have since adopted an error modelling approach [23].
There are still several aspects of our framework that are open to further investigation. One large aspect that we explored in preliminary experiments but have not included in our methodology here, is the opportunity to use weighted quantile averaging for combining forecasts. In our  setup, where all of the inputs are recent NWP forecasts (and therefore similarly skillful), we saw negligible difference in using a weighted averaging approach, but in situations where more diverse forecast types are in use, it may prove beneficial to assign weightings according to forecast skill. A dynamic weighting approach also enables individual models to be updated without jeopardising the overall post-processed output, as the contribution of the new or updated model will be minimal until it's error profile is well understood. The QRF algorithm provides a convenient means by which skill can be estimated ahead of time, in the form of out-of-bag metrics. For example, we showed earlier the out-of-bag coverage of our trained QRF (Figure 3). Metrics such as the CRPS, logarithmic score, and KullbackâĂŞLeibler divergence would provide good comparisons of forecast skill on which to base quantile averaging weight, although their calculation would add some additional processing time. Yao et al. [36] provide more detail about using such metrics for weighted model stacking, and in fact these weights can be optimised as an additional supervised learning problem [31]. The overall strategy for combining forecasts is also open to further research. Because it retains the inter-model variance, BMA may be considered to provide a better representation of extreme outcomes at the expense of well-calibrated coverage (at least in setups where each input forecast is already well-calibrated, which is likely to become the norm). We also think that the output of BMA would be difficult to make use of in practice when applied across all lead times as in our framework, because of the discrepancy in the number of models available at each time step, and therefore the spurious inconsistency of the inter-model variance across the forecast range. Still, applications where capturing extremes is a priority may wish to investigate further. For general purposes, we are satisfied with our time-consistent and calibration-preserving quantile averaging approach.
It is our belief that, as time goes on, and the number of different forecasting models in use -along with their complexity and resolution -continues to increase, there will be increasing need for algorithmic interfaces such as ours to summarise the otherwise overwhelming sea of forecast information into decision ready output. This would consist of optimally well-calibrated probabilities of future weather outcomes given all available information. Probabilistic machine learning is a technology that can enable this, and we hope that the work we have demonstrated here will go some way in aiding progression towards this goal.
Data Accessibility. Our dataset has been made available with permission from the Met Office and Highways England, for which we are grateful. It is available for download along with several other open weather forecast post-processing datasets collated by Haupt et al. [35] at ***URL-TBC***