Can deep learning beat numerical weather prediction?

The recent hype about artificial intelligence has sparked renewed interest in applying the deep learning (DL) methods that have proven so successful in image recognition, speech recognition, robotics, strategic games and other application areas to the field of meteorology. There is some evidence that better weather forecasts can be produced by introducing big data mining and neural networks into the weather prediction workflow. Here, we discuss the question of whether it is possible to completely replace the current numerical weather models and data assimilation systems with DL approaches. This discussion entails a review of state-of-the-art machine learning concepts and their applicability to weather data with its pertinent statistical properties. We think that it is not inconceivable that numerical weather models may one day become obsolete, but a number of fundamental breakthroughs are needed before this goal comes into reach. This article is part of the theme issue ‘Machine learning for weather and climate modelling’.


Introduction
The history of numerical weather prediction (NWP) and that of machine learning (ML) or artificial intelligence (for the purposes of this paper, the two terms can be used interchangeably) differ substantially. This article examines whether the NWP workflow could eventually be replaced by data-driven forecasting, and reviews the relevant DL theory and practices. It is structured as follows: §2 gives a brief overview of the major developments and state of the art of NWP, including aspects of data assimilation (DA) and model output processing. It is followed by §3, which surveys the literature on fundamental ML and DL developments and their application to weather and climate research. Section 4 discusses several fundamental aspects of meteorological data and other requirements of weather forecasts and points to corresponding solutions in DL research where these exist. Section 5 reflects on two aspects which are relevant for both weather forecasting and DL, but where we find different best practices in the two domains: data preparation and model evaluation. In §6, we reflect on the issues of physical constraints and system-wide forecast consistency in a DL framework. Section 7 discusses the state of the art with respect to estimating forecast uncertainties. Finally, §8 presents conclusions. We hope that this article will lead to a better understanding between 'machine learners' and 'weather researchers' and thus contribute to a more effective development of DL solutions in the field of weather and climate.

State-of-the-art numerical weather prediction
Modern weather prediction relies extensively on massive numerical simulation systems which are routinely operated by national weather agencies all over the world. The associated process chain to generate these numerical weather forecasts can be divided into several steps which interact closely with each other (figure 1, left column). In order to retrieve the initial state of the Earth system (atmosphere, soil and ocean), a great variety of meteorological observations are collected all over the world. In addition to classical weather stations and radiosondes, aircraft measurements and remote sensing products (such as radar and satellite observations) have become an integral part of the global observation network (e.g. [17]). Although millions of different direct and indirect measurements are obtained every day, these observations are still not sufficient to describe the complete state of the atmosphere and other Earth system components with which the atmosphere exchanges energy or mass.
At this point, DA comes into play. The central task of DA is to fill the gap between the incomplete, heterogeneous, and scattered observations and the initial value fields which are required by the NWP models. To achieve this, the observation data must be projected onto the discrete model grid, interpolated in time and adjusted to be consistent in terms of state variables (e.g. temperature, pressure, wind etc.). DA also has to take into account measurement errors such as biases between different space instruments or malfunctions of individual ground-based sensors. The obtained initial state of the Earth system after the DA step therefore constitutes an optimized estimate of the real conditions (e.g. [18]).
Given the initial conditions, the NWP model can perform a simulation of atmospheric processes. By numerically solving the coupled system of partial differential equations describing the atmosphere in terms of momentum, mass and enthalpy (the Navier-Stokes equations), the future atmospheric state is obtained in each model grid cell. Processes occurring at scales smaller than the model grid size are captured by empirical parameterizations. The direct model output constitutes the first forecast product of the NWP workflow. In contemporary global NWP models, the grid boxes cover an area of several square kilometres.
In order to arrive at finer scale end-user forecast products, a post-processing step is added to the NWP workflow. The outcome of such post-processing can cover a variety of forecast products, starting with the conversion of the vertical axis from sigma-coordinates to pressure levels or geometric height (above mean sea level), or bias corrections. Statistical methods are applied to remove systematic biases of the NWP output and to incorporate local scale adjustments (statistical downscaling). Furthermore, limited-area models, which allow for finer grid spacings (Δx ∼ O(1-5 km) compared to Δx ∼ O(10 km) in global models), provide added value in forecasting meteorological features on finer scales. The output of ensemble simulations can be used to estimate forecast uncertainties, which are of major interest especially for high impact weather situations, or for the renewable energy sector [19,20] (see also §7).
Over the past decades, the ability of NWP models to predict the future atmospheric state has continuously improved. Contemporary global NWP models are not only able to predict the synoptic-scale weather pattern for several days, but they have also reached remarkable accuracy in forecasting end-user relevant meteorological quantities such as the 2 m temperature and regional-scale precipitation events. For instance, the deterministic forecasts of the Integrated Forecast System provided by the European Centre for Medium-Range Weather Forecasts maintain an anomaly correlation coefficient of the 500 hPa geopotential height of 80% for about 7 days, while the root-mean-square error for 2 m temperature predictions of 72 h forecasts is close to 2 K [21]. Larger scale high-impact events such as hurricane tracks can be predicted with an accuracy of 150 km up to 4 days in advance [22].
The increasing success of operational NWP models is due to improvements of all the steps involved in the NWP workflow and new capabilities of the global Earth observation system. In the following, we briefly highlight a couple of important developments which led to significant enhancements of forecast quality. A detailed review of recent advances in the NWP process chain is beyond the scope of this article.
While in situ observations (weather stations and radiosondes) have a long history in observing the Earth's atmosphere, fundamental improvements in the spatiotemporal coverage of observations have been achieved with the help of satellite data over recent decades (e.g. [23]).
Nowadays, several geostationary and polar-orbiting satellites deliver a great variety of data products (such as temperature and humidity profiles, soil moisture and atmospheric motion vectors). Satellite data are particularly valuable as they provide information on the atmosphere above the ocean and uninhabited areas where conventional measurements are hard to come by. In addition, measurements from commercial aircraft (e.g. [24]) and radar observations (e.g. [25,26]) have contributed to better constraining the initial state of NWP models.
The ability of DA systems to make use of the manifold, diverse observations has seen continuous improvement due to algorithmic developments. Current DA systems are primarily based on three- or four-dimensional variational approaches (3D-Var and 4D-Var, respectively) and on ensemble methods (commonly Kalman filters). In the 3D-Var approach, a single deterministic state is estimated by minimizing a cost function which generally consists of terms representing the background, observation and model errors. 4D-Var additionally captures observation changes in time (see [18] for more details).
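In its standard textbook form (see, e.g., [18]), the variational cost function balances the distance to a background (prior) state against the distance to the observations:

$$ J(\mathbf{x}) = \tfrac{1}{2}(\mathbf{x}-\mathbf{x}_b)^{\mathrm{T}}\mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_b) + \tfrac{1}{2}\big(\mathbf{y}-H(\mathbf{x})\big)^{\mathrm{T}}\mathbf{R}^{-1}\big(\mathbf{y}-H(\mathbf{x})\big), $$

where x_b denotes the background state, y the observations, H the observation operator, and B and R the background and observation error covariance matrices; weak-constraint formulations add a further term for model error.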
In order to obtain a cost function which can be optimized with reasonable efficiency, the model and observation operators have to be linearized. This can lead to forecast errors, in particular if the NWP model contains discontinuous parameterizations [27]. Another simplification of the variational DA approach is the a priori definition of the uncertainties of the state vector X, which leads to a static background error covariance matrix. By contrast, an ensemble approach allows for dynamic estimation of the probability density function of X. The Ensemble Kalman Filter approach makes use of such an estimation, which results in a non-static, i.e. flow-dependent, background error covariance matrix. A disadvantage of classical ensemble methods is that they are only conditioned on past measurements [28]. Therefore, leading meteorological centres have started to establish combinations of the variational approach with ensemble approaches, such as the 4D-EnVar DA method. The quality of ensemble DA depends on the number of ensemble members, which is typically restricted to a relatively small number for computational reasons. Therefore, so-called hybrid DA systems have been developed which include climatological error information in order to lessen the sensitivity to undersampling [29].
NWP model improvements can in part be related to resolution enhancements. The continuous refinement of the grid spacing has also required re-formulating the dynamical cores of NWP models where the discretized Navier-Stokes equations are solved. Simulating the atmosphere on the kilometre scale comes along with the demand for highly parallelizable algorithms in the dynamical core [30]. Since (classical) global spectral transform models are less suited for such a requirement, finite-difference or finite-volume discretizations on Platonic solids projected on the sphere (e.g. icosahedral [31] or cubed-sphere grids [32]) have been developed over the past decade. Simultaneously, remarkable progress has been achieved in designing discretization approaches which enable the grid-scale dynamics to follow the conservation laws of energy, enstrophy and mass [33,34] while also minimizing the need for numerical filters to suppress artificial numerical modes [35]. An extensive overview of contemporary dynamical core architectures can be found in [32].
In addition to the improvement of dynamical cores, further gains in accuracy have been achieved by fine-tuning physical parameterizations which are mandatory to represent atmospheric processes that cannot be captured by the grid-scale thermodynamics. Among others, these parameterizations encompass the representation of (deep) convection, turbulent mixing, smaller-scale atmosphere-land/ocean coupling, the representation of cloud microphysics and radiative transfer. Advances in capturing the diurnal cycle of convection (e.g. [36,37]), the turbulent transports in the planetary boundary layer (e.g. [38]) and in simulating the bulk properties of hydrometeors (e.g. [39]), i.e. clouds and precipitation, are only a small sample of recent progress in tuning physical parameterization schemes.

Deep learning in weather research
The increased computational power, the availability of large datasets, and the rapid development of new NN architectures all contribute to the ongoing success of DL. Some of these new NNs can solve certain ML tasks much more efficiently than the classical fully connected, feed-forward networks. One especially successful concept, which has been widely applied, is convolutional neural networks (CNN) [40], where a stack of small-sized filters with few trainable parameters is applied to images or other gridded data to extract coarser scale features. CNNs have been used in weather and climate applications where the NN was trained to recognize spatial features, for example in the analysis of satellite imagery [41] or weather model output [42].
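As a minimal sketch of this idea, the following Python/Keras fragment maps a stack of gridded input fields onto a single output field; all layer sizes, grid dimensions and variable names are illustrative assumptions rather than a recipe from the cited studies.

```python
# Minimal CNN sketch for gridded meteorological fields (illustrative only).
# Assumes inputs of shape (lat, lon, channels), e.g. several stacked
# atmospheric fields on a regular grid; all sizes are hypothetical.
import tensorflow as tf

def build_cnn(grid_shape=(64, 64, 5)):
    """Map a stack of input fields to a single 2D output field."""
    inputs = tf.keras.Input(shape=grid_shape)
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D()(x)           # extract coarser scale features
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.UpSampling2D()(x)           # back to input resolution
    outputs = tf.keras.layers.Conv2D(1, 1)(x)       # e.g. a temperature field
    return tf.keras.Model(inputs, outputs)

model = build_cnn()
model.compile(optimizer="adam", loss="mse")
```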
The family of recurrent neural networks (RNN) was designed specifically for the learning of time-dependent features (e.g. in text and speech recognition). More advanced RNN architectures employ long short-term memory (LSTM) cells [43,44] and gated recurrent units (GRU) [45]. LSTM and GRU cells can be embedded in more complex neural network architectures. For example, the combination of a CNN with an LSTM yields the so-called ConvLSTM network [46].
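A ConvLSTM can be sketched in a few lines; the sequence length, grid size and filter counts below are illustrative placeholders, not the configuration of [46].

```python
# Sketch of a ConvLSTM for sequences of gridded fields (cf. [46]).
# Input: (time steps, lat, lon, channels); output: the next field.
import tensorflow as tf

seq_model = tf.keras.Sequential([
    tf.keras.Input(shape=(12, 64, 64, 1)),          # 12 past fields
    tf.keras.layers.ConvLSTM2D(16, 3, padding="same",
                               return_sequences=True),
    tf.keras.layers.ConvLSTM2D(16, 3, padding="same"),
    tf.keras.layers.Conv2D(1, 1),                   # predicted next field
])
seq_model.compile(optimizer="adam", loss="mse")
```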
Two more recent DL concepts are variational auto-encoders (VAE) [47] and generative adversarial networks (GAN) [48]. Both of these are so-called generative models, i.e. they learn the data distributions from training samples and use generators to produce novel samples which match the characteristics of the training data. They are widely used in different applications such as image-to-image translation [49], super-resolution image generation [50], in-painting [51], image enhancement [52], image synthesis [53], style transfer and texture synthesis [54], and video generation and prediction [12,13]. VAEs use an encoder to project the high-dimensional data into a latent space of lower dimensionality with an approximate posterior distribution. This latent space is then sampled by a decoder to reconstruct the original feature space in all dimensions. For further information on VAE, we refer to [47,55]. In GANs, the competition between two NNs is used to improve image generation during training: one network is optimized to generate realistic images, while a second one is trained concurrently to discriminate between generated and real images. Typically, both VAE- and GAN-based architectures are coupled to multiple convolutional layers which capture the semantic features of the input data and represent them with fewer dimensions. Examples are PixelVAE [56], DCGAN [57], sinGAN [58] and SAVP [13]. It is a general tendency in DL research that new NN architectures are composed of many building blocks which are themselves substantially large DL networks. The problem complexity which can be addressed with modern DL networks is already quite substantial: the largest NNs have several million degrees of freedom, which is comparable to operational NWP models.
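To make the adversarial training idea concrete, the following highly condensed sketch (in the spirit of [48], with purely hypothetical shapes and hyper-parameters) alternates between updating a discriminator to separate real from generated samples and updating a generator to fool it:

```python
# Condensed GAN training step; all components and sizes are hypothetical.
import tensorflow as tf

generator = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),                   # latent vector
    tf.keras.layers.Dense(64 * 64, activation="tanh"),
    tf.keras.layers.Reshape((64, 64, 1)),           # generated "image"
])
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),                       # real-vs-fake logit
])
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_images, batch_size=32):
    z = tf.random.normal([batch_size, 100])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(z, training=True)
        d_real = discriminator(real_images, training=True)
        d_fake = discriminator(fake, training=True)
        # discriminator: real -> 1, fake -> 0; generator: fool the discriminator
        d_loss = bce(tf.ones_like(d_real), d_real) + \
                 bce(tf.zeros_like(d_fake), d_fake)
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    g_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, generator.trainable_variables),
        generator.trainable_variables))
    d_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, discriminator.trainable_variables),
        discriminator.trainable_variables))
```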
ML as 'an approach to data analysis that involves building and adapting models, which allow programs to learn through experience' has been employed by meteorologists for a long time, for example in curve fitting, linear regression, or DA (see §2). However, in this article, we focus on ML in a narrower sense, i.e. involving NNs and in particular modern DL.
The first studies employing NNs for meteorological and air quality applications appeared during the 1990s [59-61]. These studies used multi-layer perceptron architectures with typically three layers to analyse and forecast time series at individual station locations. Later, other simple semantic network techniques were used for post-processing and prediction optimization of NWP output [62,63], and as surrogate models for different parameterization schemes in climate models [64,65].
It took a few years before the weather and climate research community started to pick up modern DL concepts and began to explore their use in NWP and other environmental applications. Table 1 lists various state-of-the-art DL architectures and their first applications in weather and climate research. A couple of examples are briefly described below. A review of ML in remote sensing can be found in [41]. Zhou et al. [83] and Denby et al. [84] used a CNN for classification of weather satellite images, while Xu et al. [85] used a combination of GAN and LSTM for prediction of cloud images. Based on the concept of video prediction, various types of networks were used for short-term prediction of sky images and radar images [46,79,81]. There have also been some attempts to produce data-driven weather forecasts, for example by Dueben & Bauer [86], who used a multi-layer perceptron, or Grover et al. [16], who constructed a three-stage model consisting of boosted decision trees, a dynamic Gaussian Process model, and a deep belief network consisting of restricted Boltzmann machines [77,78]. The study of Wandel et al. [87] could be regarded as a first step towards replacing the dynamical core of a numerical weather model, as they demonstrate unsupervised learning of the full incompressible Navier-Stokes equations on an Eulerian, grid-based representation.

Challenges of end-to-end deep learning weather prediction
The studies that were cited in the previous section already demonstrate that DL concepts can be successfully applied to problems related to weather forecasting. However, the few existing attempts to replace the entire NWP workflow with a DL system have been limited to short-term forecasting (up to 24 h or less) or used a rather limited subset of the available meteorological data.
In this section, we discuss a number of challenges which need to be overcome before a complete end-to-end DL weather forecasting system can deliver results of comparable quality as current NWP.
Weather forecasting is essentially a prediction of spatiotemporal features based on a diverse array of observations from ground-based, airborne and satellite platforms. If we treat the core part of the NWP workflow (figure 1), i.e. DA, model forecast and output post-processing, as one entity, then a weather forecast can be described as a function which maps observation data to a final forecast product (figure 2). The forecast product can be a map of a specific weather variable (e.g. temperature), a time series of one or more variables at a specific location or aggregated over a region, some aggregate statistics of a specific variable over a given time range, a (categorical) warning index, etc. With current NWP, we are used to employing one forecasting system (which may well have several components) to derive the whole set of forecast products which are requested by end users, or needed for the system evaluation and further improvement. By contrast, DL applications excel if they can focus on a specific task, i.e. a reasonably small set of target variables. Therefore, an end-to-end DL weather forecast as depicted in the right column of figure 1 would likely consist of several deep NNs which would be trained individually on specific subsets of forecast products. Advantages of such a DL weather forecasting system could be the intrinsic absence of model bias (because the system would be trained to reproduce the target values) and the possible savings of computational resources. Once NNs are trained, they can very efficiently calculate forecasts with new data. Forward propagation in NNs consists only of fast add and multiply operations. Therefore, even NNs with O(10^8) parameters (i.e. of similar complexity to contemporary NWP models) can be expected to use far less computing resources than current numerical models. The determining factors of the required computational resources in an end-to-end DL weather forecasting system are the necessary training cycles and the data processing. The former will depend on the learning approach (e.g. lifelong learning [88] requires regular re-training of some NN components) and on the success of transfer learning [89] concepts (i.e. whether it is possible to re-use NNs trained in one region of the globe for weather forecasts in another region). Data processing is a challenge which also limits further scaling of NWP models and other applications on current and next generation supercomputing systems (see [90]). At present, it is impossible to predict how much computing time, if any, could be saved if all weather forecasting were based on DL. A fair comparison should always consider the entire weather prediction workflow and the whole range of forecast products that shall be generated.
In the following, we discuss a number of challenges for end-to-end DL weather predictions, which are mostly a consequence of the specific properties of meteorological data and the complexity of the atmosphere and its interactions with other Earth system compartments. These challenges are graphically summarized in figure 3. As will be seen, many of these challenges also appear in other DL contexts, and the DL community has begun to develop solutions for these problems. Nevertheless, there are no systems in place which can cope with all of these challenges together.
The success of DL methods hinges on a good understanding of relevant data properties. Meteorological variables can be described with different cumulative distribution functions or corresponding probability density functions. Some variables (e.g. temperature) are nearly normally distributed, while others (e.g. precipitation and cloud droplet size) might be better approximated by gamma or beta distributions [91]. The fraction of cloud cover is often reported in eighths and therefore needs to be treated as a discrete variable. Ignoring these different properties of meteorological data in a statistical analysis or forecasting procedure can cause erroneous results. This is particularly relevant as some DL methods (e.g. the Bayesian approach by [92]) make implicit assumptions about the frequency distribution of variables.

Meteorological features can show dynamic behaviour on a wide range of spatiotemporal scales, and the quality of the weather forecast is influenced by phenomena on many different scales [93]. Consider sea ice as an example, which may change little during the time of a typical forecast lead time, but whose mid- to longer-term variations can have a profound influence on the local and non-local atmospheric state [94]. Multiple spatial scales have been investigated in the context of video prediction (e.g. [12,58]).
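As a minimal illustration of the distributional point above, the following sketch (with synthetic stand-in data) fits a normal distribution to temperature but a gamma distribution to wet-hour precipitation:

```python
# Sketch: different meteorological variables call for different
# distributional assumptions (arrays here are synthetic placeholders).
import numpy as np
from scipy import stats

temperature = np.random.normal(285.0, 8.0, size=10_000)   # ~ Gaussian (K)
rain = np.random.gamma(0.5, 4.0, size=10_000)             # skewed (mm/h)

# A normal fit is reasonable for temperature ...
mu, sigma = stats.norm.fit(temperature)
# ... but wet-hour precipitation is far better described by a gamma fit;
# floc=0 pins the lower bound of the distribution at zero rainfall.
shape, loc, scale = stats.gamma.fit(rain[rain > 0], floc=0)
```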
Related to the interaction of scales, the spectra of energy and momentum are important aspects in meteorology, but such spectra do not play a similar role in most mainstream DL applications. However, spectral transformations have already been used in DL applications (e.g. [9] in the context of speech recognition). First attempts at using such concepts in ML weather applications [95-98] were restricted to simple NNs and datasets of limited complexity.
Many meteorological features vary periodically, although there can be large variability between cycles. This periodicity is induced by orbital parameters and the Earth's rotation, together with various solar cycles. As shown in Ziyin et al. [99], NNs generally have difficulties extrapolating periodic features correctly. However, as the authors show, replacing common monotonic activation functions with functions which include a periodic term can solve such problems and produce, for example, better temperature forecasts.
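A sketch of this idea, using the periodic 'snake' activation x + sin²(ax)/a proposed in [99] (the frequency parameter a and the layer width are illustrative choices):

```python
# Sketch of the periodic 'snake' activation of [99]: x + sin^2(a*x)/a.
# Unlike ReLU or tanh, it can extrapolate periodic signals beyond the
# training range; the frequency parameter a is illustrative.
import tensorflow as tf

def snake(x, a=1.0):
    return x + tf.square(tf.sin(a * x)) / a

layer = tf.keras.layers.Dense(32, activation=snake)
```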
Meteorological variables are correlated in space and time, and these correlations change with time [16]. For example, temperatures at different altitudes may exhibit very similar patterns (possibly with a time lag) in a well-mixed boundary layer (i.e. summer, daytime), while different vertical levels can be almost completely decoupled during an inversion (often during winter or at night). This can also be seen in larger scale features such as tropical cyclones, Rossby waves, fronts and (organized) convection. While we are not aware of a publication which addresses this issue in the context of DL with weather data, there are other DL studies which demonstrate that it is possible to cope with such correlations (e.g. [100]).
A related property of meteorological features is auto-correlation. While auto-correlation in a way simplifies the forecasting task (at least on short time scales), it imposes the risk that a network merely learns to reproduce persistence and thus appears more skilful than it actually is.

Meteorological features may appear and vanish on time scales much shorter than the forecast range. Prominent examples are the triggering of convective cells or the transition between convective and relief precipitation in the presence of orography. An NWP model has some skill in predicting such features, because it can diagnose their potential occurrence from relations between other variables. In principle, such complex relations may be decipherable by NNs as well. However, we believe that this will require additional measures to make the NN aware of such relations. Such measures could be feature engineering (i.e. the calculation of derived properties from combinations of input variables, as sketched below) or the implementation of physical constraints (see §6).
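A toy example of such feature engineering, computing two physically meaningful derived quantities from raw inputs (the variable names and the particular features are our own illustrative choices):

```python
# Sketch of feature engineering: derived quantities that make physical
# relations explicit to the network (input arrays are hypothetical).
import numpy as np

def engineered_features(T, p, T_dew):
    """T: temperature (K), p: pressure (hPa), T_dew: dew point (K)."""
    theta = T * (1000.0 / p) ** 0.286     # potential temperature
    dew_point_depression = T - T_dew      # proxy for saturation/cloudiness
    return np.stack([theta, dew_point_depression], axis=-1)
```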
Another challenge for a DL weather forecast application is the scarcity of extreme events, which are, however, very important to get right, as extreme weather phenomena have the largest impact on civil safety and the economy. For example, to accurately predict heavy precipitation events (>25 mm h−1) over Germany, the DL model must be trained with fewer than 10 episodes at any given location during a full decade [101]. A few studies have touched upon the subject of ML and extreme climate events: Vandal et al. [102] find that complex DL models have more problems in capturing extremes than classical statistical downscaling models, whereas O'Gorman et al. [103] state that their model captures extremes quite well without the need for special training on these cases. While the subject of classifying imbalanced data has received considerable attention (cf. [104]), there appears to be little research on dealing with imbalanced sample sizes in regression problems [105]. In contrast to standard DL algorithms, humans have acquired the ability to learn from isolated extreme events because they pay special attention to extraordinary occurrences in their environment and quickly generalize to other situations [106]. Some studies have explored the possibility of having deep neural networks learn extreme events in a similar way. For example, Li et al. [107] implemented one-shot learning of object categories by using prior knowledge learned from other training samples. Lake et al. [108] developed ML methods within Bayesian program learning (BPL) to mimic human capabilities of learning visual recognition from a few samples.
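One simple mitigation, shown below as a sketch with an entirely illustrative threshold and weight, is to up-weight rare heavy-precipitation samples in a regression loss; this is not a substitute for the more principled approaches cited above:

```python
# Sketch: up-weighting rare heavy-precipitation samples in a regression
# loss (threshold and weight are illustrative, not tuned values).
import tensorflow as tf

def weighted_mse(y_true, y_pred, threshold=25.0, extreme_weight=50.0):
    """MSE with extra weight on samples exceeding `threshold` mm/h."""
    w = tf.where(y_true > threshold, extreme_weight, 1.0)
    return tf.reduce_mean(w * tf.square(y_true - y_pred))

# Toy usage: the second sample (a heavy-rain case) dominates the loss.
y_true = tf.constant([0.2, 30.0, 1.5])
y_pred = tf.constant([0.1, 12.0, 1.4])
loss = weighted_mse(y_true, y_pred)
```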
Finally, other critical aspects related to meteorological observation data are the frequent appearance of missing values and the possibility of input data errors and biases. Current DA procedures often include a substantial code base for filtering or interpolating missing values, blacklisting observations, and monitoring biases and their evolution over time. Similar issues occur in various application areas of DL. For example, Smieja et al. [109] have dealt with the issue of missing data values, and Žliobaitė [110] and Lu et al. [111] investigated the problem of drifting biases (known as concept drift in the ML community). A recent example of a DL application working on meteorological satellite data with missing values is Barth et al. [112].

Data preparation and model evaluation
In this section, we discuss two aspects of ML in weather and climate, which we have found to be important in our practical experience and where best practices differ between the meteorological and ML communities. These are data preparation and model evaluation. This discussion may shed some light upon the reasons why it has been difficult for the DL community to tackle weather data problems and why, conversely, the meteorological community has been cautious to adopt DL in their research and for routine weather analyses and forecasts.
All ML techniques are data-driven. Therefore, proper selection and preparation of data are essential to obtain good and generalizable results. Data selection should aim to capture the full variability of the predictor variables, avoid too much redundant information, and allow the network to capture relations among variables from which a prediction can be made. For the sake of brevity, we will not discuss data selection in more detail, but instead focus on data preparation aspects in the following.

Modern supervised ML studies generally divide the available data into three different datasets to train, develop and evaluate an ML model [113]. The training set is the largest and is used to update the model parameters by back propagation or other learning algorithms. The second set, which is often referred to as the validation or development set, is used exclusively for hyper-parameter tuning. The hyper-parameters, i.e. number of layers, type of layer, activation function, loss function, learning rate etc., are set manually by the model developer. A key target of this hyper-parameter tuning is the optimization of the network's generalization capabilities to ensure that the network will function well on previously unseen data. Both parameters and hyper-parameters are essential for building a suitable DL model. The third dataset is the test set, a collection of previously unseen data which is used to evaluate the network after the tuning to assess the true generalization capability of the network. The three datasets should be independent of each other, but at the same time they should reflect the same statistical distribution. Therefore, one has to be careful how to split the data before starting to train a new network, especially if, as in meteorological time series, the data are auto-correlated. Figure 4 shows four different data split strategies for a hypothetical time series of meteorological data. In order to enable an NN to forecast the next k time steps, one will generally feed an input vector of l past time steps to the network as input. In many DL applications, it is standard to draw random samples from a huge database of mutually independent data records (e.g. images) and arbitrarily assign these samples to the train, validation and test sets, respectively. This would correspond to figure 4a, where each drawn 'slice' has a length of one sample. However, as noted in §4, meteorological data constitute a continuous time series with auto-correlation on different time scales. Therefore, randomly drawn samples would overlap and thus no longer be independent. Consequently, results obtained with such a test set over-estimate the true generalization capability of the NN, because the test set contains information already used for training. While researching for this article, we found several studies on ML and DL for environmental data analysis where this principle was violated and which therefore drew overly optimistic conclusions concerning the capabilities of (often simple) NNs.
Another point of concern, which also has implications for the data preparation, is the multiscale aspect of data in the time domain. While a typical forecast application considers time scales of a few hours to several days, there are longer-term quasi-periodic patterns, such as the El Niño Southern Oscillation (ENSO), and also continuous trends such as global warming. When training NNs with long-term data series (so that a sufficient number of samples becomes available), it is not trivial to find a good data split which on the one hand fulfils the requirement of independence, while on the other hand allows the network to be trained on as many parts of the underlying data distribution as possible. For example, the model developer should ensure that all seasons are sampled appropriately and, when using multi-year data, that the training data contain different phases of ENSO, to name just one of many modes of variability.
To solve these issues, we propose a random block sampling strategy (figure 4b-d), where the train, validation and test sets all contain several coherent blocks of length L, and L is much larger than the auto-correlation time. Multiple runs of this random block sampling should be carried out to assess the robustness of the results.
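A minimal sketch of such a block split (block length, set fractions and the function name are illustrative choices, not a prescription):

```python
# Sketch of a random block split for an auto-correlated series:
# contiguous blocks of length `block_len` (>> the auto-correlation time)
# are assigned wholesale to train/validation/test. Sizes are illustrative.
import numpy as np

def random_block_split(n_samples, block_len, fractions=(0.7, 0.15, 0.15),
                       seed=0):
    rng = np.random.default_rng(seed)
    starts = np.arange(0, n_samples - block_len + 1, block_len)
    rng.shuffle(starts)
    n_train = int(len(starts) * fractions[0])
    n_val = int(len(starts) * fractions[1])

    def to_idx(s):
        # expand block start positions into contiguous index ranges
        return np.concatenate([np.arange(i, i + block_len) for i in s])

    return (to_idx(starts[:n_train]),
            to_idx(starts[n_train:n_train + n_val]),
            to_idx(starts[n_train + n_val:]))

train_idx, val_idx, test_idx = random_block_split(100_000, block_len=1_000)
```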
The second aspect which we would like to discuss in this section concerns model evaluation. While it is relatively straightforward to evaluate the success of an image classification (e.g. [8,69]) or a video prediction task (cf. [12,79]), the evaluation metrics used in these studies (e.g. MSE or peak signal-to-noise ratio, PSNR) are usually not appropriate for weather and climate applications. Quantifying the success of a weather prediction model is highly non-trivial and an area of active research. Meteorological centres have developed a plethora of scores and skill scores over the last decades which elucidate different aspects of weather forecast quality (cf. [21,91,114]). In addition to verification methods based on point-by-point comparisons (e.g. [115]), various methods have been introduced to account for the intrinsic spatial and temporal correlation in meteorological datasets (e.g. [116-118]). Other verification metrics also account for the stochastic nature of meteorological quantities by estimating probabilities of binary events, such as rain/no rain [119]. Evaluating spatiotemporal patterns, for example precipitation forecasts, with the help of radar data is particularly challenging due to the double penalty problem (cf. [116,120]). Indeed, verification of precipitation forecasts is still a hot topic in the meteorological community [121]. The evaluation of extreme events suffers from the 'forecaster's dilemma', which discredits skilful forecasts when they are evaluated only under the condition that an extreme event occurred. This conditioning on outcomes and observations violates the theoretical assumptions of forecast verification methods [21,122].
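As a flavour of such categorical verification, the sketch below computes two standard scores for a binary event such as rain/no rain from a contingency table; it deliberately ignores the spatial and probabilistic subtleties discussed above:

```python
# Sketch: two standard categorical verification scores for a binary
# event such as rain / no rain; inputs are boolean arrays.
import numpy as np

def categorical_scores(forecast, observed):
    hits = np.sum(forecast & observed)
    misses = np.sum(~forecast & observed)
    false_alarms = np.sum(forecast & ~observed)
    correct_negatives = np.sum(~forecast & ~observed)
    total = hits + misses + false_alarms + correct_negatives
    csi = hits / (hits + misses + false_alarms)       # critical success index
    # equitable threat score corrects the CSI for random hits
    hits_random = (hits + misses) * (hits + false_alarms) / total
    ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
    return csi, ets

csi, ets = categorical_scores(np.array([True, True, False]),
                              np.array([True, False, False]))
```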

Physical constraints and consistency
As has been demonstrated in other application areas of ML (e.g. [123,124]), NNs can be prone to learning spurious relationships in data. A purely data-driven model for weather forecasting might fail to adhere to the underlying physical principles and thus generate false forecasts, as it lacks an understanding that every atmospheric process obeys physical laws, described in terms of conservation of momentum, enthalpy and mass. The incorporation of physical laws in NNs is becoming a vibrant area of research (e.g. [125-128]) and is now often denoted as scientific ML.
One of the first studies to demonstrate that physical constraints can efficiently reduce systematic biases in lake temperature predictions, while at the same time enhancing generalization capability, was Karpatne et al. [129]. They included numerical model results as constraints for sparse observations and added a loss term to punish non-physical behaviour of the DL model. De Bézenac et al. [127] incorporated a physically motivated advection-diffusion flow into their network to assimilate the inputs. Effectively, this leads to the encoding of physical laws in the latent space of their model. Other approaches to introduce physical constraints into the latent space of DL models are the adversarial autoencoder by Makhzani et al. [130] and the non-parametric Bayesian method by Goyal et al. [131].
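In its simplest form, such a physics-guided loss augments the data term with penalties for non-physical outputs, in the spirit of [129]; the particular constraints and the weight below are illustrative assumptions only:

```python
# Sketch of a physics-guided loss: a data term plus penalties for
# non-physical outputs. The constraints (non-negative precipitation,
# approximate water budget closure) and the weight lam are illustrative.
import tensorflow as tf

def physics_guided_loss(y_true, y_pred, water_in, lam=0.1):
    data_loss = tf.reduce_mean(tf.square(y_true - y_pred))
    negativity = tf.reduce_mean(tf.nn.relu(-y_pred))        # precip >= 0
    imbalance = tf.reduce_mean(tf.square(
        tf.reduce_sum(y_pred, axis=-1) - water_in))         # budget closure
    return data_loss + lam * (negativity + imbalance)
```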
It may be useful to reflect on the potential and necessity of physically constraining DL models from an abstract point of view. In spite of their complexity and dimensionality, DL models still adhere to the fundamental principles of (data-driven) statistical modelling. This implies that there must be some rules in place to constrain the future, because otherwise extrapolation will be unbounded. Classical statistical modelling tries to strike a balance between a sufficiently explicit formulation of the time-dependent system evolution and the remaining degrees of freedom to accommodate the intrinsic variability of the data. For example, to fit hourly temperature observations, a classical statistical model will usually include (at least) two periodic terms to capture the diurnal and seasonal cycles. In addition, there may be terms to describe correlations of temperature with other variables. On the other hand, the statistician will avoid over-fitting the data by adding too many terms to the statistical model.
Even though in DL one expects the NN to learn many of the inherent data relations by itself, the system nevertheless must get some guidance to be able to identify meaningful patterns. Knowingly or not, the researcher always imposes constraints on the NN, for example through data selection and choice of NN architecture. NNs learn faster when patterns are clearly visible in the data. However, with meteorological data the most obvious patterns are usually the least interesting ones, therefore it makes sense to let the NN know in advance that such patterns will occur. A similar argument applies to physical constraints: if the NN is forced, for example to conserve mass, it will not need to waste parameters and training cycles on learning this rule and can instead concentrate on analysing relevant and less obvious relations. Many meteorological studies show that it is often necessary to pre-select or filter data and build a good statistical model before meaningful relations become apparent. It is, therefore, likely that an end-to-end DL weather forecast system will only be successful if it contains at least some a priori knowledge in the form of engineered data features and physical constraints. Just how much of this is needed remains to be seen.
Newcomers to the field of NWP often think that such numerical simulation models are inherently self-consistent, because they are based on a well-defined set of differential equations. However, the discretization of partial differential equations describing the flow dynamics in NWP models is not always fully mass or energy conserving, and parameterization schemes, which are needed to incorporate the effects of unresolved sub-grid scale processes on the grid-scale variables, may lead to spurious competition between grid-scale and sub-grid-scale processes, for example in cloud schemes [132,133]. Furthermore, there can be a grey zone between different parameterizations describing related aspects, such as the classical distinction of shallow and deep convection. Also, physically related parameterization schemes may be derived from different empirical data. Furthermore, internal consistency of classical NWP no longer applies if one considers the entire NWP workflow, i.e. if statistical models are used to post-process the model output, remove biases and apply other, non-physical corrections to the model forecasts. This discussion does not intend to devalue classical NWP, but it should inspire some reflection on the exact meaning and the value of consistency in the weather forecasting and DL communities. An end-to-end DL weather forecast system will generate consistency among forecast products only to the extent that this is already embedded in the data, unless the system will be governed by physical constraints as discussed above. To what extent consistency is needed to obtain a 'good' forecast will be a worthwhile question to study as it may deepen the understanding of the problem at hand and the potential which DL can bring to weather forecasting.

Uncertainty estimation
The final discussion point of this article concerns the estimation of forecast uncertainty. Owing to increased computer power, it has become possible in recent years to produce ensemble forecasts operationally. Ensemble approaches have also been introduced in DA (see §2). In a nutshell, ensemble forecasts aim to estimate the probability density function of the forecast variables. Ensembles are most often generated by varying the initial conditions of the model simulation, but there are also attempts to sample the parameter space of empirical model parameterizations.
In the field of DL research, ensemble methods are used less often because they are computationally expensive. Statistical concepts such as Gaussian processes (GP) and probabilistic graphical models (PGM) excel at probabilistic inference and uncertainty estimation. However, these methods do not scale well for high-dimensional and high-volume data [134-136]. Therefore, Bayesian deep learning (BDL) has been developed and applied across several scientific and engineering domains, for example in medical diagnosis [137] or autonomous driving [138]. In essence, these methods estimate a probability density function of the DL model parameters. As side effects, BDL increases the robustness against over-fitting and allows training of the NN from relatively small sample sizes [139]. Modern BDL methods include variational inference, Markov Chain Monte Carlo (MCMC) sampling [136], and Monte Carlo dropout [140].
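Monte Carlo dropout is particularly easy to sketch: dropout is kept active at prediction time, and the spread over repeated forward passes serves as an uncertainty estimate (the toy model below is purely illustrative):

```python
# Sketch of Monte Carlo dropout [140]: keep dropout active at inference
# and treat the spread of repeated forward passes as uncertainty.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])

x = np.random.rand(5, 10).astype("float32")
# training=True keeps dropout stochastic during prediction
samples = np.stack([model(x, training=True).numpy() for _ in range(100)])
mean, std = samples.mean(axis=0), samples.std(axis=0)  # forecast + spread
```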
Some recent studies have explored the BDL concept for weather forecasting applications. A model built on GRUs and a 3D CNN, along with variational Bayesian inference for estimating posterior parameter distributions, was presented by Liu et al. [141] for probabilistic wind speed forecasting of up to 3 h. A study by Vandal et al. [92] demonstrates the use of BDL to capture the uncertainty from observation data and unknown model parameters in the context of statistical downscaling of precipitation forecasts. These are relevant contributions, but a lot remains to be done before the uncertainty of DL weather forecasts can be assessed at a level similar to current NWP ensemble systems.

Conclusion
In this article, we discussed the potential of modern DL approaches to develop purely data-driven end-to-end weather forecast applications. While there have been some stunning success stories from DL applications in other fields, and initial attempts have been made to apply DL to meteorological data, this research is still in its infancy. As we argue in §4, there are specific properties of weather data which require the development of new approaches beyond the classical concepts from computer vision, speech recognition and other typical ML tasks. Even though DL solutions for many of these issues are being developed, there is as yet no DL method which can deal with all of these issues concurrently, as would be required in a complete weather forecast system.
We expect that the field of ML in weather and climate science will grow rapidly in the coming years as more and more sophisticated ML architectures become available and can easily be deployed on modern computer systems. What is largely missing in the field of meteorological DL are benchmark datasets with a specification of appropriate baseline scores, and software frameworks which make it easy for the DL community to adopt a meteorological problem and try out different approaches. One notable exception is WeatherBench [142]. Such benchmark datasets and frameworks are well established in the ML community (e.g. MNIST [143] or ImageNet [144]) and they have contributed substantially to the rapid pace of DL developments in application areas such as image recognition, video prediction, speech recognition, gaming and robotics. While a lot of meteorological data is freely available from weather centres and research institutions, proper use of these data requires some knowledge about Earth system science and the data formats and tools which are used by the environmental research community. It might help if tools for reading and working with these datasets were integrated in major ML frameworks.
When reflecting on the ultimate goal of replacing computationally expensive NWP models with DL algorithms, it is important to reconsider the objectives of weather forecasting and carefully define the requirements which must be met by any potential alternative method. Certain criteria which we now consider essential for a 'good' weather forecast may in fact be conceptions resulting from our experience with numerical models, and they may not be applicable to forecasting systems based on DL. One particular aspect in this regard is self-consistency of forecast results, which is often taken for granted by numerical modellers, even though it is not strictly fulfilled in current NWP forecast systems. In this article, we consciously propose thinking about a replacement of the entire core NWP workflow including the DA, numerical modelling and output processing, because the task of weather forecasting can then be described as a huge big data problem of mapping a plethora of Earth system observations onto a well-defined set of specific end-user weather forecast products. Seen in this way, the problem of weather forecasting is more amenable to DL methods than a replacement of the actual NWP model itself with its grid structure, operator concepts, etc., which are tied to the very concept of classical numerical modelling. We expect that the success of DL weather forecast applications will hinge on the consideration of physical constraints in the NN design. Taken to the extreme, portions or variants of current numerical models could eventually end up as regulators in the latent space of deep neural weather forecasting networks. So, to answer the question posed in the title of this article, we can only say that there might be potential for end-to-end DL weather forecast applications to produce equal or better quality forecasts for specific end-user demands, especially if these systems can exploit small-scale patterns in the observational data which are not resolved in the traditional NWP model chain. Whether DL will evolve enough to replace most or all of the current NWP systems cannot be answered at this point.
Data accessibility. This article has no additional data.

Authors' contributions. M.G.S. designed and conceived the study. All authors jointly drafted, read and approved the manuscript.
Competing interests. We declare we have no competing interests.