Nonlinear stochastic modeling with Langevin regression

Many physical systems characterized by nonlinear multiscale interactions can be effectively modeled by treating unresolved degrees of freedom as random fluctuations. However, even when the microscopic governing equations and qualitative macroscopic behavior are known, it is often difficult to derive a stochastic model that is consistent with observations. This is especially true for systems such as turbulence where the perturbations do not behave like Gaussian white noise, introducing non-Markovian behavior to the dynamics. We address these challenges with a framework for identifying interpretable stochastic nonlinear dynamics from experimental data, using both forward and adjoint Fokker-Planck equations to enforce statistical consistency. If the form of the Langevin equation is unknown, a simple sparsifying procedure can provide an appropriate functional form. We demonstrate that this method can effectively learn stochastic models in two artificial examples: recovering a nonlinear Langevin equation forced by colored noise and approximating the second-order dynamics of a particle in a double-well potential with the corresponding first-order bifurcation normal form. Finally, we apply the proposed method to experimental measurements of a turbulent bluff body wake and show that the statistical behavior of the center of pressure can be described by the dynamics of the corresponding laminar flow driven by nonlinear state-dependent noise.


Introduction
It is widely accepted in physics that nominally deterministic systems with many degrees of freedom can often be modelled more effectively from a statistical point of view. In many complex multiscale systems, a variety of processes lead to emergent large-scale structures whose dynamics are described by a relatively small set of macroscopic variables [1,2]. The influence of the unresolved degrees of freedom can be approximated with random forcing in the spirit of statistical mechanics [3][4][5]. This stochastic treatment of unresolved variables has become commonplace in fields including climate science [6], ecology [7], epidemiology [8], protein folding [9], neuroscience [10] and turbulence [11].
The stochastic evolution of a state x is often represented with Langevin dynamicṡ The deterministic 'drift' dynamics f (x) describe the evolution of the slow macroscopic variables, while the fluctuations are paramterized by the diffusion term σ (x)w(t), where w(t) is typically assumed to be a Gaussian white noise process. If this model cannot be derived from first principles, a model can sometimes be inferred from observations of the natural dynamics of the system, as illustrated in figure 1. A significant challenge in constructing approximate stochastic models is that many of these systems are far enough from thermal equilibrium that even when the microscopic governing equations are known, fundamental principles such as detailed balance and the fluctuationdissipation theorem cannot be readily applied. For example, widely separated time scales for forcing and dissipation prevents viscous turbulence from approaching a state of equipartition [11,12]. This scale separation is captured by the Reynolds number, which can be interpreted as a ratio of the time scales characterizing the energetic large-scale dynamics and the small-scale viscous motions [13].
As with equilibrium statistical physics and quantum mechanics, there are several ways to represent stochastic dynamics, as exemplified by the differing treatments of Brownian motion by Einstein [14] and Langevin [15]. Einstein's theory is constructed around a diffusion equation governing the evolution of the distribution of particles, while Langevin's describes an individual trajectory of a particle subject to friction and a random fluctuating force. This duality persists in the modern theory; the same stochastic process can be represented with a generalized Langevintype differential equation governing trajectories or a Fokker-Planck equation for the evolution of the probability distribution [3].
Linear dynamics can be analysed from either perspective, but in the nonlinear case the most convenient representation often depends on the application. Nonlinear Langevintype stochastic differential equations are difficult to treat analytically, but fit more naturally with low-dimensional modelling and control objectives. On the other hand, Fokker-Planck equations replace nonlinear trajectory dynamics with a linear partial differential equation for the probability distribution. The ensemble perspective also facilitates comparison with long time series measurements of ergodic systems.
In this work, we seek to exploit these equivalent representations to identify Langevin dynamics by using both the forward and adjoint Fokker-Planck equations to ensure consistency with observations. We propose a framework for identifying nonlinear stochastic models from noisy experimental data, building on previous work in sparse regression [16,17] and adjoint-based parameter estimation [18]. After presenting background material on stochastic dynamics in §2, we describe the method in §3. Finally, we demonstrate its application to three example systems in §4: a cubic model driven by coloured noise, a particle in a double-well potential, and a symmetry-breaking instability in a turbulent wake.

(a) Related work
As we study increasingly complex systems that deviate from the restrictive near-equilibrium conditions of classical statistical mechanics, fully empirical system identification becomes more appealing. The primary goal of these methods is not to simply fit the statistics, but to identify a minimum-complexity mechanistic model that describes the important interactions in the system [19]. The resulting models stand in for those derived via traditional analysis and can be used to gain physical insight into the system. Secondary goals in engineering fields can include the design of control systems [20] or parametric surrogate models for design and optimization [21]. Predicting response to exogenous inputs or parametric variation is clearly more difficult, but a dynamic model of the autonomous behaviour is a necessary prerequisite to these aims.
Stochastic systems can be broadly categorized according to the type of dynamics, noise and stochasticity. Dynamics may be linear or nonlinear, the noise process may be white (uncorrelated in time) or coloured (time-correlated), and the strength of the fluctuations may be constant (additive diffusion) or state-dependent (multiplicative diffusion). Similarly, stochastic model identification methods can be similarly categorized by the type of models they are able to construct.
For example, realization algorithms are a mainstay of engineering disciplines, although these methods are restricted to linear input/output systems with additive white noise [22][23][24]. Perhaps the most general and successful nonlinear approach is the NARMAX framework, which can construct nonlinear models driven by state-dependent coloured noise [19]. However, NARMAX models typically cannot be transformed to continuous time, which is often the most natural setting for physical problems, making them difficult to interpret. More recent work has explored a variety of strategies for modelling nonlinear stochastic systems, including operator theoretic methods [25], optimal transport [26], deep learning [27][28][29] and identifying distribution evolution equations [30], although none of these pursues a representation in terms of nonlinear state-space dynamics. On the other hand, recent work has demonstrated that a stable linear system driven by coloured noise can accurately reproduce the second-order turbulent statistics [31].
Recent advances have made significant inroads towards continuous-time model discovery in deterministic nonlinear systems [32,33]. For example, sparse identification of nonlinear dynamics (SINDy) approximates time derivatives with a sparse linear combination of candidate functions [16]. However, even without the difficulties of estimating time derivatives from noisy data, a major challenge for extending deterministic methods to stochastic modelling lies in disambiguating the macroscopic dynamics from the unresolved degrees of freedom. In cases where the dynamics can be closely approximated by one-dimensional dynamics forced by additive white noise, parameters may be identified by regression to analytic solutions of the probability density function (PDF) [34,35]. For more general systems, recent work has approached the parameter estimation problem with inference methods based on ensemble Kalman filtering [36] and information theory [37,38]. Alternatively, Boninsegna et al. introduced a major contribution to stochastic modelling by demonstrating that SINDy could be extended to stochastic systems without Monte Carlo approximation via the conditional moments used in the Kramers-Moyal expansion [17]. This stochastic SINDy method was capable of recovering the correct model structure and parameters from large libraries of candidate functions.
Approximating stochastic dynamics from data with the Kramers-Moyal average has a long history of successful modelling in a wide range of fields [4]. However, as with many theoretical results, it is predicated on the assumption that the dynamics are driven by Gaussian white noise. As recognized by Einstein, even the molecular forcing involved in Brownian motion has some finite decorrelation time since it is a continuous physical system [14]. Moreover, omitting degrees of freedom from an otherwise Markovian 1 system generally leads to explicit memory effects in the dynamics [39]. We therefore expect that all systems will have some characteristic 'Einstein-Markov' time scale over which the time evolution of macroscopic variables may depart significantly from the standard assumptions of stochastic modelling [4]. For Brownian motion, this time scale is of the order of picoseconds, while for complex, multiscale, far-from-equilibrium systems it may even be longer than experimental sampling rates.
These considerations make the sampling rate used to construct stochastic models from experimental time series an important choice. Theoretical difficulties introduced by timecorrelated forcing and non-Markovian effects can be avoided to some extent by deliberately subsampling. This has been established in finance [40,41] and shown for artificial dynamics with two widely separated time scales [42]. Qualitatively, coarse sampling allows the unresolved degrees of freedom to decorrelate, while ideally still resolving the coherent macroscopic scales (figure 2). If the fluctuations appear uncorrelated in time, standard theoretical tools, such as the Kramers-Moyal average, may once again be applied. However, coarse sampling leads to distorted estimates of the conditional moments used in the Kramers-Moyal expansion [43], although these finite-time effects can be accounted for using the adjoint Fokker-Planck equation [44].
The relevance of this result for parameter estimation in stochastic models was realized by Honisch & Friedrich, who proposed an optimization framework designed to correct for finite sampling-rate effects [18]. This technique has recently been refined and applied to parameter estimation for amplitude equations describing several different physical systems by  .6)). However, if the macroscopic dynamics have a much slower characteristic timescale ω α, we may be able to choose a sampling rate τ −1 which can simultaneously resolve the dominant dynamics and allow the unresolved scales to decorrelate. For example, the power spectrum of the radial centre of pressure of the turbulent wake is shown at bottom along with the subsampling rate used in §4c.
et al. [45][46][47]. As with the majority of work in nonlinear time series analysis [32,48,49], existing studies have focused on modelling scalar observables, although the theory readily generalizes to complex-or vector-valued systems.

(b) Contributions
Here we show that these finite-time corrections generalize the stochastic SINDy method [17] to the broad class of systems for which the forcing cannot be treated as white noise. Specifically, we explore systems for which the fast scales have nontrivial dynamics, the exclusion of which formally breaks the Markovian properties of the full physical system. This includes coloured noise, latent variables and 'microscopic' degrees of freedom with significantly non-zero time correlations. This non-Markovian character is a fundamental problem for stochastic systems with unresolved degrees of freedom that have internal dynamics. Although in principle it could be more effective to identify a nonlinear, continuous-time, non-Markovian model of a stochastic system, it would be significantly more difficult when the underlying evolution operator is unknown, since many of the fundamental tools of stochastic analysis (e.g. the Fokker-Planck equation) rely on the assumption of Markovian behaviour. This problem can be mitigated against by coarsely sampling in time, allowing the forcing to decorrelate. This introduces distortions to the finite-time statistics, which can be corrected with parameter estimation based on the adjoint Fokker-Planck equation. We extend this adjoint-based optimization problem with the forward steady-state solution to enforce consistency between the model and the empirical probability distribution.
The proposed modelling framework, which we refer to as Langevin regression, is designed to identify nonlinear Langevin-type equations directly from noisy experimental data. This method combines the advantages of three previously distinct approaches: adjoint-based parameter estimation with the Kramers-Moyal average [18,44], learning unknown model structure with sparse regression [16,17], and steady-state PDF fitting [34]. Central aspects of our approach have been introduced in these previous works, but here we develop a single, unified framework. The main contributions of this work are detailed in §3 and may be summarized as follows: (i) Unresolved degrees of freedom often have non-trivial time correlations. We show that with deliberate subsampling and finite-time corrections, macroscopic variables can be modelled by low-order nonlinear dynamics driven by uncorrelated white noise. (ii) Finite-time effects due to coarse sampling rates can be corrected with parameter estimation based on the adjoint Fokker-Planck equation. We extend this optimization with the forward solution, enforcing consistency with the steady-state probability distribution. (iii) If the form of the stochastic model is unknown, its structure can be automatically identified with an iterative SINDy-type model selection procedure. We show that this stepwise sparse regression can be straightforwardly combined with the Fokker-Planck optimization.
We explore the proposed nonlinear stochastic model identification method on several example systems. First, we illustrate the importance of judicious subsampling and finite-time corrections for correlated forcing by recovering a nonlinear Langevin equation driven by coloured noise. We then show that the Langevin regression can construct a reduced-order model approximating the second-order dynamics of a particle in a double-well potential with a first-order bifurcation normal form. Both of these illustrative examples avoid the latent variable problem for the unresolved degrees of freedom by learning stochastic closure models. An implementation of Langevin regression along with code to reproduce the results from the simulated system is available on GitHub. 2 Finally, we apply Langevin regression to experimental measurements of a turbulent bluff-body wake; the sparse model selection procedure identifies a model similar to that proposed by Rigas et al. [34], but with an additional nonlinear noise term that improves the correspondence with both the empirical probability distribution and power spectral density. Langevin regression draws from both the long legacy of stochastic modelling and recent advances in data-driven methods to form a flexible and general framework for approximating complex nonlinear dynamics with statistically consistent stochastic models.

Background on stochastic dynamics
This section briefly reviews relevant theoretical concepts in stochastic modelling, including the Fokker-Planck equation, the Kramers-Moyal conditional average, and adjoint corrections for finite-time sampling effects. For more comprehensive background on the topics of nonequilibrium statistical mechanics and the physical applications of stochastic differential equations we refer the reader to excellent reference texts such as [3,5,39].
In this work, we seek to model the macroscopic variables x ∈ R d with Langevin dynamics of the form given by equation (1.1). The influence of unresolved degrees of freedom on the standard deterministic dynamicsẋ = f (x) is modelled with the diffusion term σ (x)w(t), where w(t) is a white noise process. The diffusion is called additive if σ (x) is a constant, or multiplicative if it is state-dependent. The majority of stochastic models assume additive noise, since is easier to treat analytically and is often a reasonable approximation for systems without strong coupling across scales. For instance, additive process noise is the standard assumption for linear state-space models of electronic circuits subject to thermal fluctuations. However, for systems such as fluid dynamics where the quadratic nonlinearity leads to bidirectional coupling between the coherent and turbulent degrees of freedom, state-dependent noise may improve the model [50].
Perhaps the most restrictive assumption is that placed on the noise process w(t). In order to treat Langevin dynamics analytically, the forcing is typically taken to be Gaussian-distributed and delta-correlated in time, i.e. w(t)w(t ) = δ(t − t ). If this is not the case, the following discussion becomes more complicated [3,39]. For a macroscopic or integral quantity, the assumption of Gaussian statistics can be argued by appealing to the central limit theorem, but the timecorrelation requirement is generally more difficult to justify.

(a) Fokker-Planck equation
The Langevin equation (1.1) describes individual trajectories, but due to the variability inherent in stochastic dynamics it is often more natural to approach stochastic dynamics from the ensemble perspective. For the Langevin dynamics given by equation (1.1), conservation of probability requires that the PDF p(x, t) evolves in time according to the Fokker-Planck equation 3 : where the diffusion tensor a(x) is given by a ij (x) = σ i (x)σ j (x)/2 and the subscripts indicate Einstein summation notation. As an alternative approach to Langevin equations, the multiscale interactions in complex spatiotemporal systems can be modelled explicitly with Fokker-Planck equations [51].
Often we are interested in the case where the system is statistically stationary, so that Lp = 0 and p = p(x). For example, if x is a scalar and the diffusion σ is a constant, the steady-state solution can be determined analytically: where the constant C is determined by the normalization condition p(x) dx = 1. However, solving the Fokker-Planck equation for general nonlinear dynamics is challenging and typically must be approached approximately or numerically [3].

(b) Kramers-Moyal average
The Fokker-Planck equation may also be derived by expressing the time evolution of a general PDF as a Taylor series of conditional finite-time moments m This leads to the Kramers-Moyal expansion. For scalar x, Viewing the conditional mean in equation (2.3) as a finite-difference formula, the first moment m (1) τ (x) gives the average displacement over a time interval τ if the system is in state x. Likewise, the second moment m (2) τ (x) gives a conditional variance of the short-time evolution. However, according to Pawula's theorem, if the system is driven by Gaussian white noise all moments n ≥ 3 vanish [3], leading to the Fokker-Planck equation (2.1). The leading moments are related to the drift and diffusion functions of the corresponding Langevin equation in the limit of vanishing time interval, i.e. and We refer to these relationships as the  Figure 3. Schematic of Kramers-Moyal coefficient estimation for the first moment (drift). The drift estimate is determined by the conditional mean of the state evolution over the interval, while the diffusion is given by the conditional variance. The conditional moments can be approximated by dividing a long time series into histogram bins and taking the mean and variance within each bin. For example, the Kramers-Moyal drift estimate gives an approximate discretized vector field for the deterministic component of the dynamics (right).
In principle, these averages could offer a way to approximate unknown drift and diffusion functions from data by binning the time series into histograms and computing (2.5a,b) with the sampling rate τ = 1/f s , as illustrated in figure 3. This is a classic approach to constructing approximate Langevin equations from data [4,52]. Boninsegna et al. also demonstrated that the Kramers-Moyal average can be combined with SINDy sparse regression to discover analytic drift and diffusion equations from data [17].

(c) Finite-time effects
The Kramers-Moyal average has chiefly been a theoretical tool; its application to time-series analysis depends strongly on the assumption of Gaussian white noise and fast enough sampling rates to approximate τ → 0. In practice, these two requirements tend to be in tension; if the forcing originates with unresolved scales, then increasing the sampling rate leads to stronger correlations in the 'noise'. The assumption of uncorrelated forcing is never strictly satisfied for continuous physical systems. Einstein recognized this in his work on Brownian motion [14], although in that case the separation of scales is pronounced enough that experimental sampling rates run little risk of capturing correlation effects in the molecular forcing. In complex systems of modern interest, the scale separation is typically much less obvious.
As a simple illustrative model, consider a system driven by a coloured noise process η(t): where w(t) is a true Gaussian white noise process. In this case, the forcing η(t) is characterized by a decorrelation time α −1 . When f (x) is a stable Navier-Stokes operator linearized about a turbulent mean profile, coloured noise forcing has been shown to accurately reproduce turbulent statistics [31]. A generalized Langevin equation driven by time-correlated forcing can also be derived from the Euler equations using the direct-interaction approximation [11]. This perspective is also popular in climate modelling, where models of this form can be derived from the governing equations using perturbation arguments [50]. Despite the formally non-Markovian nature of the resolved variables, it is advantageous to approximate the forcing as white-in-time, since most foundational theoretical tools are based on this assumption. One strategy to avoid dealing with the latent variable η(t) is deliberate subsampling, as illustrated in figure 2. If the macroscopic dynamics have a characteristic timescale ω, we may be able to choose a sampling rate f s = τ −1 such that ω f s α and the forcing appears decorrelated. That is, η(t + kτ )η(t) t ≈ δ k0 . In this case the forcing appears to be (band-limited) white noise.   Figure 4. PDF evolution from the Fokker-Planck operator L. The distribution used to evaluate the conditional finite-time moments (2.3) can be interpreted as the evolution of a delta function initial condition over the sampling interval τ , where the state is known to be x at time t.
The minimum τ for which this is true is often called the Einstein-Markov scale [4]. This time scale is not necessarily related to the autocorrelation time of the macroscopic dynamics x, but may be identified from data by checking the Markov property at different sampling rates (see appendix A and [4]). This strategy of subsampling at the approximate Einstein-Markov scale was proposed to model spatial fluctuations in turbulence [53] and the cosmic microwave background [54].
In order to sample the resolved dynamics coarsely enough that the forcing is decorrelated, finite-time effects in the Kramers-Moyal estimates of drift and diffusion must be accounted for [43]. However, these effects can be determined exactly in terms of the adjoint Fokker-Planck operator [44]. For notational clarity, we give the result for scalar x, although the result generalizes naturally to higher dimensions.
The conditional moments defined in equation (2.3) can be written equivalently as where p(x , t + τ |x, t) indicates the conditional joint probability that the system is in state x at time t + τ given that it was in state x at time t. If the drift and diffusion are not time-dependent, then the conditional probability p(x , t + τ |x, t) can be interpreted as the propagation of uncertainty over an interval τ if the state x is known at time t, as shown in figure 4. According to equation (2.1), this evolution is given by the Fokker-Planck equation acting on a Dirac delta function: Using the definition of the adjoint operator and evaluating the integral with the delta function, where the adjoint Fokker-Planck operator is given in tensor summation notation by Thus, Lade [44] showed that the effect of coarse sampling rates can be understood as the adjoint Fokker-Planck operator evolving the moments of the distribution in time. Honisch and Friedrich demonstrated the use of this relationship to optimize free parameters in a Langevin model [18]; the proposed method in §3 builds on this result.
The close correspondence between the Fokker-Planck and Liouville operators also suggests an interpretation of this result in terms of Koopman theory [55,56]. The adjoint Fokker-Planck operator is a linear generator of the time evolution of observable functions, even when the underlying dynamics are nonlinear [25,39]. In this case, the observables are the conditional moments (x − x) n . These corrections are difficult to compute analytically for nonlinear dynamics, so equation (2.9) is typically approximated on a discretized domain. The matrix exponential e τ L † is relatively inexpensive in one dimension, but more generally it may be helpful to interpret the correction as the solution of a PDE. That is, if w (n) (x, t) is the solution to the adjoint Fokker-Planck equation then by linearity of L † the finite-time conditional moments m τ (x) are given by

Proposed Langevin regression method
The decomposition of a multiscale system into dominant deterministic dynamics and stochastic forcing is conceptually simple. However, the gulf between detailed first-principles descriptions and a simple Langevin model is large enough that developing accurate stochastic models is difficult. This is especially true when the dynamics are nonlinear and the unresolved scales have internal dynamics and cannot be treated as true Gaussian white noise.
Owing to the intrinsic volatility of individual trajectories, model identification methods typically rely on ergodicity and exploit the connection to ensemble properties via the Fokker-Planck equation. For example, a scalar model with additive noise can be determined by fitting to the analytic steady-state probability distribution given by equation (2.2). However, this fitting procedure cannot independently estimate drift and diffusion; an additional quantity, such as mean-square displacement, must also be used [34,35]. This presents a challenge for modelling multiplicative noise, which some studies have suggested is important for capturing interactions between the resolved and unresolved degrees of freedom [50,52]. Furthermore, the use of a time average to estimate steady-state statistics destroys temporal information, so that oscillatory dynamics cannot be resolved.
Seeking to address a number of these challenges, several recent studies have proposed methods whereby stochastic models may be derived from the Kramers-Moyal average [17,18,45]. Here we synthesize and extend these methods into a single optimization framework for identifying sparse, interpretable Langevin-type models from experimental data by constraining the model to both the stationary PDF and the Kramers-Moyal average. In particular, we generalize the stochastic SINDy method proposed by Boninsegna et al. [17] with finite-time corrections that enable modelling systems whose forcing is not approximately given by Gaussian white noise.
(a) Known model structure: Fokker-Planck optimization Estimating the Kramers-Moyal coefficients (2.5a) and (2.5b) by binning a long time series is an attractive option for estimating drift and diffusion functions directly from data. However, as discussed in §2c, the sampling rate must be chosen carefully for unresolved dynamics to decorrelate. This subsampling can introduce significant finite-time distortion to the empirical Kramers-Moyal coefficients. This distortion is given in terms of the adjoint Fokker-Planck equation by equation (2.9), although constructing the Fokker-Planck operator itself requires the drift and diffusion.
Based on these considerations, Honisch & Friedrich suggested an iterative procedure to estimate free parameters of drift and diffusion functions [18], which may be summarized as follows. If the Langevin model is given in terms of a set of parameters ξ , so thaṫ   More concretely, the optimal ξ solves the following problem on a discrete domain of N points x i : (3.2) Here the weights w (n) i reflect pointwise uncertainty in the empirical estimate of the moments. Owing to the diffusive nature of the Fokker-Planck equation, it is not clear that this problem is necessarily well posed. That is, when the system is sampled coarsely there may be a range of parameters that are consistent with the observed conditional moments within experimental uncertainty.
We propose 'regularizing' the optimization problem (3.2) as proposed by Honisch & Friedrich [18] with the Kullback-Leibler (KL) divergence D KL between the empirical PDFp(x) and the solution of the steady-state Fokker-Planck equation p(x, ξ ), given by equation (2.1). The modified cost function is where η is the relative weight of the two contributions to the cost function. The KL divergence is a statistical measure of the difference between two probability distributions p and q, defined as This regularization ensures that the resulting Langevin model is consistent with both the finitetime Kramers-Moyal coefficients and the asymptotic steady-state probability distribution. The optimization problem in Langevin regression is shown schematically in figure 5. Because the typical dimension of ξ is relatively small compared to the dimension x, we find that gradientfree optimization methods such as a Nelder-Mead simplex search are more efficient than those designed for large parameter spaces and relatively inexpensive cost function evaluations, such as automatic differentiation. The resulting optimization problem requires solution of both the forward and adjoint Fokker-Planck equations at each evaluation of the cost function. For a scalar variable x, the steady-state solution to the forward equation can be computed directly with equation (2.2). In higher dimensions, there is no analytic steady-state solution to the forward equation; we describe several solvers in appendix B. We solve the adjoint equation with a second-order finite difference method. In one and two dimensions, the resulting matrix exponential is relatively inexpensive, while for higher dimensions a time-stepping approach that exploits the sparse operator structure will be more efficient.

(b) Unknown model structure: stochastic SINDy
If the form of the model can be assumed up to a set of unknown parameters, the Fokker-Planck optimization problem detailed in the previous section is sufficient to estimate the free parameters. However, in many cases we might have some partial prior assumptions about the model structure (e.g. the model consists of polynomials with a particular symmetry, or that one variable is forced by another), but the exact form is unknown. The SINDy method has recently shown promise for obtaining nonlinear reduced-order models of laminar flows [16,[57][58][59] from data. However, SINDy typically relies on estimated time derivatives, which is a significant barrier to modelling experimental data or multiscale systems. In related work, SINDy has recently been leveraged for turbulence closure modelling [60].
Boninsegna et al. [17] recently proposed a stochastic SINDy algorithm based on the Kramer-Moyal average without an adjoint correction for finite-time effects. Empirical estimation of conditional moments gives point estimates of drift and diffusion at each histogram bin; if the moments are estimated reliably, then the system identification problem reduces to fitting a curve through these points. In its simplest form, a parsimonious model can be chosen using the SINDy framework, where the model parameters are a coefficient vector for a "library" matrix whose columns consist of candidate functions [16]. For instance, the library Θ(x) might consist of polynomials in x; then we look for polynomial representations of the drift f (x) and diffusion σ (x), so that where ξ f and ξ σ are sparse vectors that have as many zero entries as possible while still capturing the observed dynamics. Standard sparse regression algorithms can be used to select a set of functions balancing parsimony and accuracy.
In the deterministic SINDy algorithm, this regression problem is constructed by concatenating column vectors of an estimated time derivativeẋ and the evaluations of the candidate functions. In the present case, however, the regression is performed over the discretized spatial domain rather than a long time series. The conditional finite-time coefficients are estimated by computing equation (2.3) over observations that fall into each spatial histogram bin, as visualized in figure 3. The regression problem is constructed over these bins rather than the direct time series. In practice, this typically reduces the length of the column vectors from O(10 6 ) to O(10 2 ).
One consequence of this formulation is that standard sparse regression algorithms, such as thresholded least squares or forward regression orthogonal least squares, do not work well. A simple alternative is the reverse-greedy stepwise sparse regression (SSR) [17]. With this method, terms are sequentially removed from the model according to some criteria. In the original SSR algorithm, the coefficients were identified with a simple least squares and terms with the smallest absolute value were removed. However, in general the smallest coefficient does not necessarily imply the least important contribution. For this reason, we sequentially remove terms corresponding to the smallest increase in cost function. The cost function itself can then serve as a model-selection criterion; for a Pareto-optimal model the cost function should jump significantly from a near-minimum value once important terms begin to be discarded.
When combined with the forward/adjoint Fokker-Planck optimization described in the previous section, this model selection procedure represents a flexible and general framework for identifying stochastic approximations to multiscale nonlinear dynamics, which we refer to as Langevin regression. In the following section, the Fokker-Planck parameter estimation is demonstrated on two example systems for which the form of the model is clear. The ability of Langevin regression to simultaneously identify the structure and parameters of a model from data is demonstrated in §4c for experimental measurements of a turbulent wake.

Results
Here we apply Langevin regression to three example problems of increasing complexity. First, we show that the coarse sampling and scale separation ideas discussed in §2 enable the identification of stochastic systems with time-correlated forcing. For this example, we assume knowledge about the structure of the model in order to highlight the effects of coloured noise and parameter estimation in the case where the correct structure is known.
Second, we demonstrate the construction of a statistically consistent reduced-order model by approximating the second-order dynamics of a particle in a double-well potential with the corresponding first-order bifurcation normal form. A stochastic normal form model can be derived analytically for this system, although its accuracy quickly degrades away from the bifurcation point. We fix the structure of the model and show that Langevin regression can maintain statistical accuracy even far from the bifurcation point.
Finally, we derive an accurate and efficient stochastic model for the turbulent flow in the wake of an axisymmetric bluff body from experimental measurements. This example presents several challenges, including partial and noisy measurements of a multiscale system, and has relevance to numerous industrial applications [20]. Although the structure of the drift dynamics may be inferred from laminar stability analysis, we instead apply sparse model selection to discover this structure entirely from data. Our procedure identifies a simple and interpretable nonlinear model with a multiplicative noise term that improves the correspondence with the empirical power spectrum and probability distribution compared with previous stochastic modelling results.
The synthetic examples in §4a,b are simulated using the SRIW1 stochastic Runge-Kutta method [61], available in the DifferentialEquations.jl package [62]. Langevin regression is performed on Kramers-Moyal coefficients and empirical PDFs computed from a time series of 10 7 points sampled at t = 10 −2 . The turbulent wake model in §4c is based on the aerodynamic centre of pressure, a global integral quantity estimated from 64 evenly spaced pressure taps [63]. The model is estimated from a time series of 8.9 × 10 6 experimental measurements of the centre of pressure sampled at 225 Hz. Monte Carlo evaluation of this model is performed in Python using a standard Euler-Maruyama numerical integration scheme at the same sampling rate. The coarse subsampling rate for Kramers-Moyal averaging is chosen for each system according to the criteria discussed in appendix A. The relative weight η of the KL divergence in the optimization function is a multiple of 10, which is chosen to be roughly equal to the minimum cost function with η = 0. In other words, the Kramers-Moyal coefficients and the PDF are given roughly equal weight in the optimization.

(a) Pitchfork bifurcation normal form
The normal form for a supercritical pitchfork bifurcation provides a canonical example of bistability, where an eigenvalue with zero imaginary part crosses the real axis as a parameter is varied. For example, this normal form describes the amplitude equation governing the symmetrybreaking mode of the wake behind a circular disc, which can be derived by a weakly nonlinear stability analysis [64]. This result inspired the use of a stochastically forced pitchfork normal form to model the turbulent evolution of the centroid of the base pressure distribution on the back of an axisymmetric bluff body [63] and the bistability of a three-dimensional Ahmed body wake [35]. However, unresolved degrees of freedom in a turbulent flow do not typically resemble deltacorrelated white noise. Here we investigate the impact of correlated noise on stochastic system identification by considering the supercritical pitchfork normal form forced by coloured noise: Here η is an Ornstein-Uhlenbeck process with characteristic relaxation time α −1 , which acts as an effective low-pass filter on the white noise process w(t). We choose μ = λ = β = 1, α = 10 2 and σ = 0.5α so that the typical amplitude of η is around 0.5. The non-zero relaxation rate for the Ornstein-Uhlenbeck process introduces temporal correlations that invalidate many of the standard analytic approaches to stochastic modelling if η is not directly observed. However, if the relaxation rate is much larger than any natural dynamics of the slow variable (i.e. α λ) we can appeal to the dual scale separation idea of figure 2 and approximate the fast scales as uncorrelated noise.
As demonstrated in figure 6, the correlated forcing destroys the Kramers-Moyal average as the sampling interval τ → 0. This can be mitigated by sampling coarsely enough that the noise decorrelates and appears to whiten. For instance, if we choose τ = 0.5 = 50α = 0.5λ (slower than the noise decorrelation but faster than the drift dynamics), the Kramers-Moyal average is of the correct order of magnitude. However, finite-time sampling rates now significantly deform the observed drift and diffusion, even introducing apparent state dependence in the diffusion ( figure 6, bottom middle). This observation by Ragwitz & Kantz called into question many earlier attempts to use the Kramers-Moyal average for modelling without accounting for finite-time effects [43].  Table 1. True parameters for the pitchfork normal form forced by coloured noise along with those estimated from data. Without the adjoint Fokker-Planck corrections for finite sampling rates, the true coefficients are underestimated by a factor of 2. On the other hand, Langevin regression with full adjoint-based optimization identifies a statistically consistent model driven by white noise forcing, where the drift coefficients are a close match to the true system. Langevin regression accounts for these distortions with the adjoint Fokker-Planck operator, recovering a one-dimensional model nearly identical to equation (4.1a), but forced by additive white noise. The identified coefficients (given in table 1) differ from the true values by around 5%, but the model closely matches both the observed finite-time conditional moments and the empirical probability distribution (figure 6, bottom row). This suggests that the proposed subsample-and-correct approach is capable of identifying statistically consistent Langevin models, even in the presence of correlated noise. This result depends fundamentally on the dual scale separation principle of figure 2; the success of this approach may be limited when these timescales cannot be clearly separated with the coarse sampling rate.

(b) Double-well potential
In many cases, Langevin-type stochastic models are intended to be reduced-order approximations of the large-scale dynamics of a complex system, rather than faithful representations of firstprinciples physics. The 'microscopic' degrees of freedom in these systems generally have finite correlation times, as with the coloured noise in the previous example. Eliminating these variables from the model leads to explicit memory effects in the Langevin equations [39], unless the scale separation principle can be employed to identify a memory-free reduced-order stochastic model from data. Low-order polynomial dynamics, such as normal forms, can arise naturally in this context as a way to describe the macroscopic behaviour, even when the underlying physical description appears completely different [65].
For example, the one-dimensional motion of a particle of unit mass in a general potential U(x) subject to thermal fluctuations is given bÿ where γ is the damping ratio, k B is Boltzmann's constant, T is the temperature, and w(t) is a white noise process [3]. We consider the double-well potential An example trajectory of this system is shown in figure 7, displaying both small oscillations within each well and random large jumps between wells.
The equation of motion generated by this potential can be non-dimensionalized to the form d dt appendix C) gives a first-order model based on the pitchfork bifurcation normal form: As figure 7 shows for = 20, even far from the bifurcation the bistability still dominates the dynamics, although the invariant manifold approximation used to reduce the order of the normal form model no longer holds. A first-order equation of the form of equation (4.5) can capture this bistability at the cost of ignoring the small oscillations within each potential well. In other words, the goal is to coarse-grain the dynamics while preserving the statistical properties of the system.
Constructing a first-order Langevin model for the position x is made difficult by the fact that the time series is smoothed by integration of the thermal fluctuations forcingẋ in equation (4.4). The neglected degree of freedom introduces non-Markovian behaviour and confounds the Kramers-Moyal average, which tends towards zero with fast sampling rates (figure 8). However, by choosing a coarse enough sampling rate so the subsampled dynamics appear Markovian and correcting for the finite-time effects, Langevin regression is able to identify a first-order model that captures both the probability distribution p(x) and the distribution of residence times in each metastable well.
As shown in figure 9, the Langevin regression model has similar fidelity to the analytic normal form close to the bifurcation, but the data-driven model maintains statistical accuracy well beyond the region where the normal form is valid. Also shown are results for the same model structure, but with parameters estimated by regression to the empirical PDF (appendix Ba). At each value of the bifurcation parameter , the second-order dynamics are driven by noise σ = ( √ + )/2 so that the mean dwell time maintains a similar order of magnitude throughout the range of comparison.

(c) Turbulent axisymmetric wake
Turbulence is a notoriously challenging problem that exemplifies many of the difficulties of stochastic modelling. A turbulent fluid flow is deterministic in principle, but the large and continuous range of spatio-temporal scales often necessitates statistical analysis. However, unlike Brownian motion, turbulence is far enough from equilibrium that it does not obey the principle of detailed balance; the machinery of statistical mechanics cannot easily be applied to turbulence [11]. Nevertheless, many turbulent flows are dominated by large-scale coherent structures whose regular evolution is suggestive of low-dimensional dynamics, despite the unpredictability introduced by strong coupling to the smaller scales in the flow. In particular, high Reynolds number flows are characterized by a wide separation between the slow macroscopic dynamics and the faster turbulent fluctuations [13]. In this context, Langevin regression is a natural extension of the data-driven modelling methods that have proven successful at identifying low-dimensional dynamics in laminar flows [57][58][59]66]. Just as these methods are capable of generalizing the near-bifurcation results of weakly nonlinear analyses [64,67] and POD-Galerkin models [68], here we aim to further extend this philosophy to turbulent flows by modelling all but the most important degrees of freedom as stochastic forcing. We demonstrate stochastic model identification on the experimentally measured pressure distribution on an axisymmetric bluff body, visualized in figure 10 and described in detail in [63]. The Reynolds number based on the body diameter is ∼ 2 × 10 5 . This flow is a stereotypical configuration that exhibits several features important to drag reduction applications.
The low Reynolds number laminar wake behind an axisymmetric bluff body is symmetric, but undergoes a supercritical pitchfork bifurcation at ∼ 10 2 so that the centre of pressure is offset at a non-zero radial amplitude [64]. Since the unstable symmetric wake has lower drag than the asymmetric configuration, stabilizing the symmetric state is a major goal of flow control studies [35,69]. Simple low-dimensional models that accurately represent process noise, energy transfers and frequency dynamics could significantly improve closed-loop control schemes. The symmetry-breaking instability continues to dominate the wake dynamics in the turbulent regime, although the location of the centre of pressure tends to wander randomly [34,63,70].
As a macroscopic proxy for the amplitude of the symmetry-breaking, we model the evolution of the centre of pressure, or the centroid of the pressure distribution on the back of the body. The base pressure distribution is measured using 64 evenly spaced taps, as shown in figure 10. Since the centre of pressure is a global integral quantity, we expect that fluctuations will be roughly Gaussian, based on the central limit theorem, although time correlations in the forcing still necessitates the use of subsampling and finite-time corrections.
A simple dynamical model that captures the symmetry-breaking behaviour is the normal form of the pitchfork bifurcation forced by Gaussian white noise [34]. The radial component of the Langevin equation for a symmetric two-dimensional pitchfork bifurcation forced by additive white noise isṙ where the 1/r term appears as a consequence of Itô's lemma for a change of variables in stochastic systems. In one dimension, the steady-state Fokker-Planck equation can be solved analytically for this model and the free parameters can be identified based on a fit to the empirical probability distribution and mean-square displacement, as described in appendix Ba. As shown in figure 11, this model agrees reasonably well with the observed statistics for the centre of pressure, suggesting that the stochastic modelling approach is a promising description for the leading global degrees of freedom. However, it is difficult to extend this modelling methodology to more complex systems. Even in one dimension, the drift and diffusion do not  Figure 11. Statistical evaluation of the axisymmetric wake models. The Langevin regression model (blue) better matches both the power spectral density (PSD) (a) and tails of the PDF (b) compared with the pitchfork normal form (4.6) with coefficients estimated by PDF fitting, as in [34]. The models are similar, but Langevin regression identifies a quadratic state-dependent noise ( figure 12). The power spectrum is premultiplied by Strouhal number St = fU/D, a dimensionless frequency. The large peak at St ≈ 0.2 corresponds to vortex shedding, which is essentially indistinguishable from the symmetry-breaking instability in the base pressure distribution [35]. (Online version in colour.) appear independently in the analytic steady-state PDF; additive noise can be estimated from the mean-square displacement, but this poses a challenge for multiplicative noise. In higher dimensions, the Fokker-Planck equation does not have an analytic solution for general drift and diffusion, although model parameters might be estimated by optimizing the solution of a numerical solver. This approach cannot resolve oscillatory behaviour such as vortex shedding, since temporal information is lost in the steady-state distribution.
Since we do not have a known form of the model in this case, besides the intuition for the pitchfork normal form, we apply the stochastic SINDy procedure described in §3b. Based on symmetry considerations, we include only odd polynomials in the library of drift functions. The 1/r term in the drift, due to the representation in polar coordinates, is also accounted for in the optimization routine. The model selection criteria shown in figure 12 show a clear Pareto-optimal model of the formṙ = λr − μr 3 + σ 2 2r + (σ 0 + σ 1 r 2 )w(t).
(4.7) Figure 12 also shows that the finite-time Kramers-Moyal coefficients predicted by this model closely match those estimated by the finite-time conditional average. The identified model is similar to that proposed by Rigas et al. [34], with the addition of quadratic multiplicative noise. This modification better matches the tails of the probability distribution, as shown in figure 11. Monte Carlo simulation of the Langevin models also shows that the multiplicative noise leads to a more accurate power spectral density than the model based on fitting the PDF. Quadratic multiplicative noise was previously proposed as an important modification for a spatial Langevin model of turbulence [43]. Multiplicative terms may be a result of neglecting degrees of freedom with bilinear coupling to the macroscopic variables [50].
This model for the symmetry-breaking instability does not resolve the peak in the power spectrum near St ≈ 0.2, which is related to vortex shedding in the wake [63,71]. Analysis of the corresponding laminar flow indicates that a more complete model of the wake might include three complex amplitudes to capture this periodic component and its interactions with the symmetrybreaking mode [64]. However, the vortex shedding is only weakly observable from base pressure sensors, since it mainly takes place downstream from the body [35]; for the same reason it is less important aerodynamically than the symmetry-breaking.
Langevin regression is therefore able to clearly identify a simple low-dimensional model from experimental measurements of turbulence. The sparse stochastic model is consistent with both known flow physics and empirical statistics, suggesting that approximating the evolution of global variables with nonlinear Langevin dynamics may be a promising direction in the low-dimensional modelling of turbulent flows.  The reverse-greedy sparse stepwise regression identifies a hierarchy of candidate models with varying tradeoffs between accuracy and complexity. The optimal model has the fewest terms before the cost function begins to climb, indicating the remaining terms are essential. In this case, the optimal model is a pitchfork bifurcation normal form forced by quadratic multiplicative noise (a). The model to the right of the optimal model includes only additive noise and corresponds to the model proposed in [34], while additional terms leads to a higher-order Stuart-Landau equation. The selected model closely matches the empirical finite-time Kramers-Moyal coefficients (b), the state PDF, and the power spectral density ( figure 11). (Online version in colour.)

Discussion
There is a long history in the physical sciences of approximating complex multiscale systems with reduced-order models driven by stochastic forcing. However, as this approach has spread in popularity, it is not always clear that the assumptions underpinning the rigorous treatment of non-equilibrium statistical mechanics continue to hold. For example, the subscale degrees of freedom in systems like turbulent fluid flows often do not decorrelate fast enough to appear as delta-correlated white noise. Nevertheless, experience suggests that simple stochastic models are often good approximations to systems that violate some of these assumptions. Data-driven modelling has gained significant attention in this context for its ability to construct consistent empirical models without the restrictions of classical analytic approaches. In this work, we have integrated and generalized three previously disparate approaches to data-driven stochastic modelling, combining sparse model selection based on the Kramers-Moyal average [17] with finite sampling-rate corrections [18,44] and steady-state PDF fitting [34]. The proposed modelling framework is designed to identify nonlinear Langevin-type dynamics from noisy experimental data, optimizing parameter estimates with both the forward and adjoint Fokker-Planck equations. Critically, the finite-time corrections allow the method to model systems driven by correlated noise, or fast deterministic degrees of freedom, while sparse stepwise regression can be used to discover the structure of the model when it is unknown a priori. We have demonstrated the flexibility and generality of Langevin regression on three examples. First, the method reconstructs a noisy Stuart-Landau oscillator from partial observations, illustrating the capability to model higher-dimensional systems. Second, we investigated reducing the second-order dynamics of a particle in a one-dimensional double-well potential to the corresponding first-order normal form. The data-driven model closely matches the statistical behaviour of the system far from the bifurcation, well beyond the region where the analytic normal form is valid. Finally, we apply Langevin regression to experimental measurements of a turbulent bluff body wake, identifying a model for the evolution of the base centre of pressure. In this case the SSR model selection procedure identifies a model consistent with previous work [34], but modified with a quadratic, state-dependent noise term that better approximates the power spectrum and the long tails of the PDF.
These results indicate that the proposed method is capable of accurately modelling a broad range of systems from limited experimental observations. However, we recognize two limitations of the method as presented here. First, we can currently only construct first-order models. In principle, higher-order dynamics can be recast as a system of the first-order equations, provided generalized coordinates and velocities can be measured, but this is often not the case in practice. It may, therefore, be easiest to model systems with stereotypical macroscopic dynamics reminiscent of a normal form or amplitude equation. Second, although the theory generalizes naturally to higher dimensions (and we have demonstrated a two-dimensional system), the practical limitation lies in constructing n-dimensional histograms for Kramers-Moyal coefficients and in solving the Fokker-Planck equations on the resulting grid. This challenge might be addressed either with more efficient Fokker-Planck solvers or with recently-proposed methods that avoid the need for histograms at the price of enforcing consistency with the PDF [36][37][38].
Despite these limitations, the proposed method can still be readily applied to a broad range of systems; nonlinear stochastic systems with one or two degrees of freedom have been used to model the global behaviour of systems ranging from neuroscience [72] to molecular dynamics [73] and aerodynamics [34]. The success of simple heuristics combined with stochastic models for feedback control (e.g. [35]) also suggests that these data-driven models could be integrated in a control scheme, although this will require modelling the effects of actuation on the system in addition to the homogeneous behaviour.
Another goal of modelling is to uncover the latent low-dimensional structure of macroscopic dynamics. In fluid dynamics, for instance, this is often conceptualized as a small set of global modes whose amplitudes evolve according to low-dimensional nonlinear dynamics [74]. Moving beyond integral quantities such as the centre of pressure, Langevin regression could be used to identify a data-driven, stochastic counterpart to Galerkin-type reduced-order models and resolve important nonlinear interactions between the large-scale structures of the flow.
In a broader context, the ability to identify reduced-order stochastic models from noisy experimental measurements of multiscale nonlinear dynamics will unlock a powerful set of tools for a much wider range of systems. Data-driven methods allow for the treatment of not only systems which break the strict assumptions of classical stochastic modelling, but also data from ecology, epidemiology, and neuroscience for which first-principles governing equations are unavailable. Tools such as the Kramers-Moyal average already have a history of success in a variety of fields. It is our hope that the proposed method will build on this legacy and extend these successes to an even more extensive class of complex systems.
Data accessibility. The data and codes used in this study are openly available at 'https://github.com/ dynamicslab/langevin-regression'.  10 -2 10 -1 1 Figure 13. Diagnostic statistics for finite-time sampling rate. The coarse sampling rate allows the fast fluctuations to decorrelate while still resolving the macroscopic dynamics. For a good choice of sampling rate, the autocorrelation function will take on an intermediate value and the KL divergence between the three-time PDF and its Markov approximation will be near a minimum. As seen for the turbulent wake data, an appropriate value may need to balance these two requirements.