Bayesian autoencoders for data-driven discovery of coordinates, governing equations and fundamental constants

Recent progress in autoencoder-based sparse identification of nonlinear dynamics (SINDy) under $\ell_1$ constraints allows joint discoveries of governing equations and latent coordinate systems from spatio-temporal data, including simulated video frames. However, it is challenging for $\ell_1$-based sparse inference to perform correct identification for real data due to the noisy measurements and often limited sample sizes. To address the data-driven discovery of physics in the low-data and high-noise regimes, we propose Bayesian SINDy autoencoders, which incorporate a hierarchical Bayesian sparsifying prior: Spike-and-slab Gaussian Lasso. Bayesian SINDy autoencoder enables the joint discovery of governing equations and coordinate systems with a theoretically guaranteed uncertainty estimate. To resolve the challenging computational tractability of the Bayesian hierarchical setting, we adapt an adaptive empirical Bayesian method with Stochatic gradient Langevin dynamics (SGLD) which gives a computationally tractable way of Bayesian posterior sampling within our framework. Bayesian SINDy autoencoder achieves better physics discovery with lower data and fewer training epochs, along with valid uncertainty quantification suggested by the experimental studies. The Bayesian SINDy autoencoder can be applied to real video data, with accurate physics discovery which correctly identifies the governing equation and provides a close estimate for standard physics constants like gravity $g$, for example, in videos of a pendulum.


Introduction
Calculus-based models fundamentally relate the rates of change of quantities of interest in time and space through differential and partial differential equations.From population models to turbulence, physics and engineering principles are rooted in such governing equations.In the modern era of big data, there is a growing demand to transform rich spatio-temporal data into descriptive physical models in an automated, data-driven fashion.Video data, for example, contain optical snapshots of an observed dynamical system described by some specific governing equations.To understand the underlying physics, it is important not only to identify the equations, but to discover dependent state variables (sparse representations) directly from the video frames.To discover such an underlying sparse representation, it is essential to find a correct coordinate transformation (latent space or manifold) to compress the data into a low-dimensional space.Principal Component Analysis (PCA) is often applied to obtain a low-dimensional subspace with linearity constraints.For nonlinear transformations, Autoencoders with neural networks are frequently applied as a nonlinear extension of PCA [5,39].To identify the governing equation given the latent sparse representations, one can apply data-driven methods via sparse regression [51,74,75].The sparse regression-based model discovery enables a computationally efficient way of identifying governing equation with convergence guarantees [104].In this case of physics discovery from videos, Champion et al. [15] propose SINDy autoencoders which can jointly discover on coordinates and equations for synthetic video data.More recently, Chen et al. [19] introduce the Neural State Variables learning from video data, which enables automated discovery of the latent state variables from the underlying dynamical system.[19].
The automated discovery of coordinates and equations in real video is significantly more challenging compared to these prior works [16,19].First, the temporal derivatives are not directly available from the video data.To resolve the missing temporal derivatives, one has to find an approximation for the temporal derivatives from discrete video frames, which is frequently numerically unstable and will inject a high level of noise.Indeed, lighting effects alone can greatly compromise derivate estimates.Second, the sample size of the real video dataset may not be sufficiently large in comparison with synthetic video [15].These pragmatic constraints hinder the real video data from having a statistically sufficient sample size for learning.Thus automated discovery from real video data is much more challenging due to being in the low-data and high-noise limit.
In this low-data and high-noise regime, Bayesian sparse regression methods generally have significant advantages from both a theoretical and practical perspective [31,64,79].The spike-andslab prior [44] (e.g.Bernoulli-Gaussian, Bernoulli-Laplace [3]) with hierarchical Bayesian settings has a proven success for both sparse variable selection and uncertainty quantification [71].In data-driven model discovery, Bayesian SINDy [40] with the spike-and-slab prior has also shown an advantage for correct model identification in the case study of Lynx-hare population [38] under the very low-data limit.However, a significant limitation of Bayesian methods comes from its high computational cost that hinders both speed and scalability.The MCMC-based hierarchical Bayesian model sampling requires a considerably long run because the underlying stochastic binary search grows exponentially with the number of parameters.Additionally, it is very costly to compute the full gradient of the entire video dataset for the MCMC sampling.Therefore, even if the Bayesian methods are much more powerful in the low-data regime, it is highly nontrivial to design a feasible Bayesian solution for the discovery of governing equations and coordinate systems given the computationally intractability.
In this paper, we propose Bayesian SINDy autoencoders that extends SINDy Autoencoder [15] into a Bayesian learning framework.Bayesian SINDy autoencoders can perform a joint discovery for governing equations and coordinate systems in a computationally tractable manner.Specifically, we apply the Spike-and-slab Gaussian-Laplace (SSGL) prior to the intermediate SINDy module to accelerate the sparse identification process of governing equation discovery.To resolve the computational burden arising from the hierarchical Bayesian model, we consolidate the Bayesian sampling procedure via Stochastic Gradient Langevin Dynamics (SGLD) with an adaptive empirical Bayesian variable selection method using Expectation-maximization.Instead of computing the full batch gradient, SGLD evaluates mini-batch gradients with injected random Gaussian noise, which is theoretically valid to generate Langevin-based proposal distribution [17,95].The mini-batch gradient learning naturally fits into the training of deep neural networks and relaxes the scalability issue at the same time.On the other hand, the adaptive Bayesian Expectation-maximization variable selection (EMVS) performs variable selection by optimizing the latent inclusion probability of each variable, which avoids the previously required lengthy binary stochastic search.Adapting these two ideas into the SINDy module, we can simultaneously perform an accelerated governing equation identification under the low-data, high-noise limit with a trustworthy uncertainty quantification through the power of Bayesian estimation.
In our numerical experiments, Bayesian SINDy autoencoders achieve accelerated discovery in governing equations and coordinate systems under the SSGL prior.In addition to the model discovery in synthetic datasets, from a real video data on a single moving pendulum, we obtain correct governing equation discovery with a close estimate of the gravity constant ĝ = −9.876with only 390 data snapshots.With precise understanding of the underlying physics, Bayesian SINDy autoencoder enables explainable, trustworthy, and robust video predictions.In summary, the contribution of this paper is threefold:

Background
The current work is built upon two primary mathematical innovations: (i) sparse regression used in the SINDy algorithm, including learning latent representations, and (ii) Bayesian learning with sparse priors.A quick review of each is given in order to better inform the reader how they are combined into our Bayesian SINDy autoencoder framework.

Sparse identification of nonlinear dynamics
We review the sparse identification of nonlinear dynamics (SINDy) [10] algorithm, which utilizes sparse regression to identify the latent dynamical system from snapshot data.SINDy takes snapshot data x(t) ∈ R n and aims to discover the underlying dynamical system The snapshots are collected by measurements at time t ∈ [t 1 , t m ], and the function f characterizes the dynamics.Assuming the temporal derivatives of the snapshot are available from data, SINDy forms data matrices in the following way: A common choice of candidate functions are polynomials in x targeting common canonical models of dynamical systems [34].The Fourier library is also very common with sin(•) and cos(•) terms.In summary, we build a model between X and Ẋ that where the unknown matrix The sparse inference on Ξ enables sparse identification of the dynamical system f .For high-dimensional systems, the goal is to jointly identify a low-dimensional state z = ϕ(x) with dynamics ż = g(z).The standard SINDy approach uses a sequentially thresholded least squares algorithm to perform sparse inference [10], which is a proxy for 0 optimization [107] with convergence guarantees [104].Small et al [81] and Yao and Bollt [102] previously formulated the dynamical system identification without sparsity constraints.These methods provide a computationally efficient counterpart to other model discovery frameworks [78,91].
In an alternative approach, Hirsh et al. [40] utilize Bayesian sparse regression techniques in model discovery, which have improved the robustness of SINDy under high-noise and low-data settings.Bayesian sparse inference models the sparsity with various probabilistic priors p(Ξ) like the Spike-and-slab [44], regularized horseshoe [13,14], Laplace priors [67,88] and so on.After defining the Bayesian likelihood function with sparsifying priors, Bayesian SINDy generates posterior distribution via the No-U-Turn MCMC sampler [42].Since MCMC for sparse inference can be extremely computationally demanding, [71,72] propose coordinate descent sparse inference methods via Spike-and-slab prior, targeting the mode detection.As an approximation to Bayesian inference, Fasel et al. [29] combine ensembling techniques via bootstrapping to perform uncertainty estimation and accelerated variable selection for system identification.
A related and important extension to the SINDy framework is the SINDy autoencoder [15], which embeds SINDy into the training process of deep autoencoders.The SINDy autoencoder achieves remarkable performances on high-dimensional synthetic data to jointly discover coordinate systems and governing equations.Due to over-parametrization and non-convexity, deep neural networks do not generally have explainable guarantees for inference and predictions.Dynamical system learning and forecasting is typically an extrapolatory problem by nature.Therefore, the interpretability of models is very important to understand.In this case, SINDy autoencoder is satisfactory due to the transparency of the learning process.Specifically, the encoder and decoder focus on the specific task of learning only a coordinate transformation, with the SINDy layer targeting the inference of latent dynamical systems.The SINDy autoencoder can not only identify the governing equation from high-dimensional data but can also perform trustworthy predictions based on future dynamics.The limitations of the SINDy autoencoder are exactly what our Bayesian framework addresses.

Bayesian and sparse deep learning
Bayesian deep learning achieves outstanding success in various machine learning tasks like computer vision [35], physics-informed modeling [101], and complex dynamical system control [94].From previous works, the main advantages of Bayesian deep learning come from two parts.First, the Bayesian framework can unify uncertainty quantification in deep learning.This includes the uncertainty from the neural network parameters, task-specific parameters, and exchanging information [1,90], which is applicable to computer vision [49], spatiotemporal forecasting [99], weather forecasting [89], and so on [108].Second, the Bayesian framework allows theoretically grounded ensemble neural networks via model averaging [96].Bayesian neural networks apply weights distribution to neurons [8].It is important to avoid computational requirements from Bayesian when applying it to deep learning.To perform Bayesian sampling in deep learning, Welling and Teh establish Stochastic Gradient Langevin Dynamics as an MCMC sampler in mini-batch settings [95].Neal introduces MCMC via Hamiltonian dynamics [66], and Chen et al. extend into deep learning via Stochastic gradient Hamiltonian Monte Carlo (SGHMC) [20,59].Other techniques include Nosé-Hoover thermostat [27], replica-exchange SGHMC [23], cyclic SGLD [105], contour SGLD [24], preconditioned SGLD [92], and adaptively weighted SGLD [25].Variational methods could also help to perform approximate inference for Bayesian deep learning [8,37].
Sparse deep learning is of emerging interest due to its lowered computational cost and improved interpretability.The discussion on sparse neural networks dates back decades [28,65].Glorot et al. use Rectifier Neurons with sparsity constraints to obtain sparse representations [32].Liu et al. extend to Convolutional neural networks under sparse settings [54].Then, several follow-up works aimed to train sparse neural networks efficiently [58,83].By adapting Bayesian sparse inference methods, the sparse inference process could be accelerated with valid uncertainty quantification via various prior options [26,85,93].Variational methods are also applied for efficient Bayesian inference [4].There have also been theoretical discussion of Bayesian sparse deep learning [68,84,86].Overall, sparse deep learning has the potential to make computationally tractable the construction of a Bayesian model.

Bayesian SINDy Autoencoders
This section presents the SINDy autoencoder with a Bayesian learning process that incorporates sparsifying priors and posterior sampling.We first introduce the Sparse identification of nonlinear dynamics (SINDy) autoencoder in Sec.3.1.Then, we propose a Bayesian learning framework that includes SINDy autoencoder in the likelihood function and specifies various setups for different sparsifying priors.Finally, we discuss Stochastic Gradient Langevin Dynamics (SGLD) to generate posterior samples from the Bayesian learning model.

Likelihood setting for Bayesian SINDy Autoencoders
The SINDy autoencoder enables a joint discovery of sparse dynamical models and coordinates.Figure 1 (a) provides an overview of an autoencoder.The input data x(t) ∈ R d is mapped by an encoder function f θ 1 (•) to a latent space z(t) ∈ R dz , d z < d.This latent space z(t) contains sufficient information to recover x(t) via a decoder function g θ 2 (•).
SINDy autoencoder combines SINDy with autoencoders by constraining the latent space governed by a sparse dynamical system.The encoder function f θ 1 (•) performs coordinate transformation to map the high-dimensional inputs into an appropriate latent subspace.The latent space z(t) = f θ 1 (x(t)) has an associated sparse dynamical model governed by where ] is a library of candidate basis functions, and a set of coeffi- A statistical understanding of the model formulates the SINDy autoencoder as a parametric model M θ where θ = {θ 1 , θ 2 , Ξ} contains all parameters.The likelihood of this model p(D|θ) is defined as reconstruction loss SINDy loss in ẋ SINDy loss in ż SINDy prior The log-likelihood of this statistical model is similar to the setting in [15].In order to promote sparsity on Ξ, we consider two sets of priors in what follows.
Laplace prior.The Laplace prior can be understood as a Bayesian LASSO [67,88].We define the Laplace prior such that where L(•, •) denotes the Laplace distribution defined as f We can see the equivalence of the Laplace prior and the LASSO in the negative log-likelihood, where Spike-and-slab Gaussian-Laplace (SSGL) prior.We define the SSGL prior as where γ j is a binary variable, Ξ ∈ R p , σ, v 0 , v 1 ∈ R, L(•, •) denotes a Laplace distribution and N (•, •) denotes a Normal distribution.We assign a Bernoulli prior to γ ∼ Ber(δ), δ ∈ [0, 1].The prior for θ 1 , θ 2 is specified with a Gaussian, which is equivalent to an 2 regularization implementation-wise.For simplicity, we set δ, σ, v 0 , v 1 as tunable hyperparameters.
Prior selection.In general, there is no optimal Bayesian prior for all statistical models since every prior has its own advantages and drawbacks.Therefore, the prior selection typically requires an assessment of both theory and experimental outcomes.If one selects the prior of Ξ to be a Laplace distribution, the setting will be identical to the basic SINDy autoencoder model, which is equivalent to adding a 1 regularization term (c.f.Eqn.(7) and Figure 1 in [15]).Even if the Laplace prior has a benefit in computation, its performance suffers for small sample sizes and large observation noises.Different from the Laplace prior, if one selects the SSGL prior for Ξ, it typically requires a slightly increased cost in computation.However, the SSGL prior typically has dominating performances for cases with very large noise and limited sample size, which is preferred in our case to learn the actual video data.
Bayesian formulation.Using the Bayes formula, we can construct the posterior distribution from the likelihood function and prior that We aim to depict the posterior distribution p(Ξ|D) via the joint distribution p(θ|D) under the sparsifying SSGL prior.The approximation of p(Ξ|D) is accessible from posterior samples of p(θ|D) when dropping θ 1 , θ 2 .In the setting of deep neural networks, we can sample the posterior distribution via Stochastic Gradient methods using mini-batches.The mini-batch setting not only naturally fits into the training of deep neural networks but also could accelerate the Bayesian posterior sampling process.

Stochastic Gradient Langevin Dynamics
To perform posterior sampling in mini-batch settings, Stochastic Gradient Langevin Dynamics (SGLD) is a popular method that combines stochastic optimization and Langevin Dynamics [95].Denote the learning rate at epoch t by (t) which decreases to zero, and the dataset D = {d i } N i=1 .The mini-batch setting estimates the gradient ∇ θ L(θ) from a subset (batch) B = {d i } n j=1 .The injected noise from mini-batches facilitates the generation of posterior samples while reducing the high computational cost in computation for full-batch gradients.We follow the classical SGLD setting where Here, p(θ t ) denotes the prior specified in Eqn. ( 4), (5) and p(X i |θ t ) denotes the data likelihood.Prior works study the asymptotic convergence of SGLD to the target distribution, which validates SGLD for posterior sampling in theory [41,87,106].An advantageous property of SGLD in posterior sampling is that with decaying step size (t) , SGLD automatically transfers from a stochastic optimization algorithm to a posterior sampling procedure (c.f.Sec.4.1 [95]).The Metropolis-Hasting correction can be ignored since the rejection rate for sampling goes to zero asymptotically, resulting from (t) → 0 when t → ∞.The discretization error similarly decreases as goes to zero.
Cyclical SGLD.The cyclical SGLD method [105] consists of exploration and sampling stages via a cyclical step-size schedule for (t) .In the training of deep autoencoders, the optimization process is highly non-convex with a very complex loss landscape.The cyclical step-size schedule helps to explore the parameter space when is large as well as sample local mode when is small.It is also possible to understand cyclical SGLD from a parallel SGLD perspective [23].We apply this idea in our experiments to perform better inference.

Empirical Bayes Variable Selection in SINDy Autoencoder
The empirical bayesian method infers prior hyperparameters from data.In this case, we aim to optimize γ (ignoring the uncertainty) and sample θ|D.The posterior distribution of Ξ|D can be derived from posterior samples of θ|D.
Using the mini-batch setting, the posterior distribution follows π(θ, γ|B) ∝ p(B|θ) The term p(B|θ) can be evaluated from the loss of SINDy autnencoder for the current mini-batch B; the term p(θ 1 ), p(θ 2 ) can be computed from the 2 loss; the term p(Ξ|γ) can be computed from Eqn. (5); and the term p(γ) can be known from the Bernoulli prior setting.As a binary variable, γ is difficult to be optimized for due to non-continuity and non-convexity.An important trick to perform the optimization on γ is to alternatively optimize the adaptive posterior mean E γ|θ (k) ,D [π(θ, γ|D)] which treats γ as a latent variable.Following previous works with similar settings [23,71], we could indirectly evaluate π(θ, γ|D) by a strict lower bound The inequality holds by Fubini's theorem and Jensen's inequality (c.f.Eqn. ( 7) in [23]).
The variable Q(θ|θ (k) ) can be decomposed into where and C ∈ R is a constant.Notice here γ is treated as a latent variable, and we only consider the expectation given the conditional distribution of γ given θ (k) , D. The ρ i could be considered as an inclusion probability estimate of Ξ i .In this way, ρ softens γ from a binary variable into a continuous one.
The term κ performs an elastic net-like approach which adaptively optimizes the 1 and 2 coefficients.Suppose Ξ i is identified with a high probability to be a sparse variable (e.g., ρ i < 0.05), the term κ i0 will be large, strengthening the sparsity constraints.Otherwise, if Ξ i is identified as a non-sparse variable (e.g.ρ i > 0.95), the term κ i1 will be small, and the 2 constraint will be dominated instead.

Stochastic Approximation from Expectation Maximization of ρ and κ
We could derive an asymptotically correct posterior distribution on π(θ, κ, ρ) following the steps in [23]: 2. Perform stochastic approximation to latent variables ρ, κ that Input frame

Prediction of Dynamical System
Video prediction Figure 2: Add more things to this figure including: encoder (with its shape), decoder (with its shape), and Bayesian SINDy.
The EM estimation of the inclusion probability ρ i is where i |γ i = 0)P (γ i = 0|δ).In our case, the term π(Ξ (k) i given the Gaussian prior distribution, and the term π(Ξ i from the Laplace distribution.The latter term P (γ i = 1|δ) = δ given Bernoulli prior in the previous setting.Following a similar process, for κ, we have

Prediction with Bayesian SINDy Autoencoder
Bayesian SINDy autoencoder has a trustworthy application in our application of video prediction whereby we learn the dynamics and coordinates of the latent spae.Based on an inference process established in previous sections, the video prediction can precisely understand the underlying dynamical system, which allows accurate, robust, and interpretable future forecasting.Additionally, the Bayesian framework enables precise uncertainty quantification for video prediction from posterior samples.We visualize the prediction process in Fig. 2. In this case of a pendulum video, the uncertainty of video prediction grows with larger t, and gradually fails to predict by showing a ring-like prediction.
From the procedure in Sec.3.3, we can generate samples from p(θ|D), which is the posterior distribution of neural network parameters θ given observed data D. Suppose the posterior samples are Ξ 1 , Ξ 2 , ..., Ξ m ∼ p(θ|D).The full posterior predictive distribution is defined as Therefore, we can approximately generate samples from p(x(t)|x 0 , D) using Monte Carlo estimation.For simplicity, we only consider the Maximum likelihood Estimation θ MLE 1 , θ MLE 2 .We have the following process: 1. From the input image x 0 , compute z(0) = f θ MLE 1 (x 0 ).

Experiments
In the following subsections, we conduct four case studies on Bayesian SINDy autoencoder for governing equations and coordinate system discovery for video data.In Sec.4.1, we study similar cases to those in [15] with synthetic high-dimensional data generated from the Lorenz system, reaction-diffusion, and a single pendulum.We explore the Laplace prior for chaotic Lorenz system and the SSGL prior for reaction-diffusion and single pendulum.In Sec.4.2, we study real video data that consists of 390 video frames of a moving rod.This setting is particularly challenging due to the high dimensionality in video data, noisy observation, missing temporal derivatives, and prior setting for correct learning.In both experiments on synthetic and real video data, we observe the Bayesian SINDy autoencoder can accurately perform nonlinear identification of dynamical system under the correct setting of preprocessing, prior, and training parameters.All the experiments are implemented and run on a single NVIDIA GeForce RTX 2080 Ti.

Chaotic Lorenz System via the Laplace prior
We consider the chaotic Lorenz system in the following where z = [z 1 , z 2 , z 3 ] ∈ R 3 and σ, ρ, β are constants.The Lorenz system is very representative of the chaotic and nonlinear system, which is an ideal example of applying model discovery techniques.In the numerical simulation, we first set σ = 10, ρ = 28, β = −2.7.We only generate partial Lorenz via time range t = [0, 5] with ∆t = 0.02 for 1024 different Lorenz systems from random initial conditions.The initial condition follows a uniform distribution centered at [0, 0, 25] with width [36,48,41] respectively.
From the underlying dynamical system, we create a high-dimensional dataset via six fixed spatial models given by Legendre polynomials that u 1 , u 2 , ..., u 6 ∈ R 128 .We transfer from the low-dimensional dynamical system into a high-dimensional dataset via the following rule:  We set the autoencoder with latent dimension d = 3 corresponding to the latent system in the 3D coordinate system with z A , z B , z C .We include polynomials with the highest order 3 composing a library Via the autoencoder, we wish to identify the correct active terms as well as the value of the coefficients.The coefficient of Ξ is uniformly initialized from constant 1.The loss coefficients are λ 1 = 0.0, λ 2 = 1 × 10 −4 .For the encoder and decoder, we use the sigmoid activation function with widths [64,32].For optimization, we select Adam optimizer with learning rate 1 × 10 −3 and the batch size to be 1024.For the Laplace prior setting, we set λ 3 = 1 × 10 −5 .
By training with 5, 000 epochs following the setting from [15], in terms of the error metrics, the best test error of the decoder reconstruction achieves 2 × 10 −5 of the fraction of the variance from the input.The fraction of unexplained variances are 2 × 10 −4 for the reconstruction of ż, and 1.3 × 10 −3 for the reconstruction of ẋ.Notice here decoder reconstruction is better compared to SINDy autoencoder without uncertainty quantification, but the reconstruction of ż is slightly worse compared to point estimation for SINDy autoencoder.
The coefficient estimate from the Bayesian SINDy autoencoder is shown in Fig. 3. From the figure, we can observe the uncertainty quantification in the parameter space.The outcome of this model identifies 7 active terms marked with deeper blues.It is known that the identification of Lorenz dynamics suffers from the symmetry in the coordinate system as described in [15].The discovered governing equation could be equivalently transformed via the affine group transformation on the coefficients and a permutation group transformation on the latent variables.Therefore, the identification is still correct in Fig. 3 Stepping further from the uncertainty quantification of coefficients, we can see the uncertainty quantification in the prediction space in Fig. 4 (a) (b).In general, the uncertainty coverage correctly captures the true trajectory when the model fails to predict correctly and mildly covers the trajectory when the model prediction is confident.

Reaction Diffusion via Spike-and-slab prior
Reaction-diffusion is governed by a partial differential equation (PDE) that has complex interactions between spatial and temporal dynamics.We define a lambda-omega reaction-diffusion system by which For the synthetic data, we first generates the latent dimensions u(x, y, t) and v(x, y, t) by 21 from t = [0, 500] with time step ∆t = 0.05.Then, we generate the snapshots of the dynamical system spatially in xy-domain from its latent dimensions into a video with shape (10000, 100, 100).We obtain a training dataset with size 9, 000.We also generate a testing dataset with 1, 000 samples.
We follow the same setting in [15].The SSGL prior only requires 1, 500 epochs for training while LASSO setting frequently needs more than 3, 000 epochs for training.The best test error is 2.1×10 −3 for decoder loss, 2.1 × 10 −5 for the reconstruction of ż, and 1.7 × 10 −4 for the reconstruction of ẋ.The fraction of unexplained variance of decoder reconstruction is 9.1 × 10 −5 .For SINDy predictions, the fraction of unexplained variance of ẋ is 0.013.The fraction of unexplained variance of ż is 0.001.These results all improve from LASSO based SINDy Autoencoder [15].The posterior samples from Bayesian SINDy are shown in Fig. 5.The discovered governing equation from the Bayesian SINDy autoencoder is  After these preprocessing steps, we obtain a training dataset with 390 samples and shape (27,24).

Bayesian masked autoencoder setup
We set up the autoencoder with latent dimension d = 1, with a second-order library of functions similar to synthetic video data.We remove the constant term for simplicity of the inferential process.The loss coefficients are λ 1 = 5e −7 , λ 2 = 5e −8 , and λ 3 = 1 × 10 −4 .For the encoder and decoder, we use the sigmoid activation function with widths [64,32,16].For optimization, we select Adam optimizer with a learning rate 1e −3 , and batch size to be 10.Real video data is much harder compared to synthetic video data since it contains rich information, including object colors, background, and processing noises.At the same time, only 390 snapshot samples are available to train the Bayesian autoencoder.To resolve this problem, we involve masked autoencoder [36], which applies 50% random masking to the input data.The effect of masked autoencoder could be understood as data augmentation, which allows the encoder to focus more on the key information [100].

Results from the Bayesian discovery via SSGL prior
We discuss both Laplace and Spike-and-slab priors for learning real moving rod data.In this case with very low and noisy data, the Laplace prior identifies an incorrect term z over than the true target on sin(z).We provide detailed discussion of the Laplace prior in the Supplemental material (5).The SSGL prior could perform correct dynamical identification yet with a slightly large coefficient estimation.For the setting of SSGL prior, we set δ = 0.09, v 0 = 1.0, v 1 = 5.0, ω (k) = 0.01 × (0.999) k .The success discovery rate is 100% from 10 training instances, which improves from 70% success discovery rate in the LASSO setting.Bayesian SINDy Autoencoder requires 1, 500 epochs for training.The best test error is 0.767 for decoder loss, 0.305 for the reconstruction of ẍ, and 0.0263 for the reconstruction of z.The fraction of the unexplained variance of decoder reconstruction is 0.003.For SINDy predictions, the fraction of unexplained variance of ẍ and z reconstruction are 0.081 and 0.961.
The posterior samples from Bayesian SINDy are shown in Fig. 8 The latent dimension z with SINDy correctly identifies sin(z) as the active term and generates an uncertainty estimate.Utilizing the uncertainty estimation in the parameter space, we could perform uncertainty quantification on the prediction space as in Fig. 8 (b).The training outcome of Bayesian SINDy autoencoder are shown in Fig. 8, 9. First, from Fig. 8 (c), we note that the reconstruction of the Bayesian SINDy autoencoder is very promising.The top line of figures is input data, and the second line of figures is generated by the autoencoder.Even if the scale is slightly different, it is convincing that the overall reconstructions on x, ẋ and ẍ are very good.This can be suggested by Fig. 9 that the log loss converges with more training epochs.Throughout these experiments, we uniformly set training epochs to be 1, 500 and refinement epochs to be 1, 500.This decision is suggested by the observation in a longer run (7, 500 training epochs and 1, 500 refinement epochs) that the log loss remains stable after 1000 epochs.For the reconstruction of z, having longer training epochs will harm the testing error due to overfitting.We note here the discovery by SSGL prior has dominant performance in testing loss comparing to SINDy autoencoder as shown in 5.This suggests the superiority of the SSGL prior in the real video data setting.
As suggested in Sec.3.4, the Bayesian SINDy autoencoder enables a trustworthy solution in deep learning for video prediction.In Fig. 8 (b), we predict the following frames of video using the decoder and the simulation of latent dynamical systems.We observe that in the very beginning of future time frames, the prediction is very confident and accurate.Yet as we move to a longer time interval, the prediction becomes very uncertain, and we can observe this interesting phenomenon in Fig. 8 (b).It is essential to understand the latent dimension of the autoencoder in order to know the validity of the coordinate system discovery.Therefore, we create the following analysis on the latent dimension.We first manually label all angles θ from all video frames by humans, as shown Time (t) θ(t) in the grey dashed curve in Fig. 10.The blue curve represents the rescaled latent dimension z, and we observe these two curves match mostly perfect to each other.We apply rescaling in this process due to the coordinate system for the angles θ are equivalent after arbitrary transformation in its scale.For example, the angles represented by radians and degrees are equivalent to each other under different scales.
On the discovery of the standard gravity constant g The SSGL prior accurately discovers the model with sin(z) term and reveals the standard gravity constant g.In our experimental setting, the length of the rod in our experiment is 123 cm with the initial condition at 75 • .We have an estimate of g given these conditions that ĝ = −9.876.The estimation ĝ is slightly overconfident and biased which concentrates at (−9.889, −9.865).The bias is more likely to be removed with an enlarged video datasets.It is also possible to use the Laplace prior discovery to estimate the standard gravity constant because the discovered dynamical system with the term z empirically has the form that z = − g 2L z (c.f.[15] S2.3).This would result in an estimation of the standard gravity g to be g = −8.733.However, the estimation g is more biased, and lacks rigor in the context of the physics discovery.
Given this low-data and high-noise setting, the estimate from SSGL prior ĝ is very remarkable given the true value that g ≈ −9.807.Even if the discovery of the coordinate system could be up to an arbitrary scaling, the frequency of the dynamical system could uniquely identify the coefficient of sin(z).This validates the discovery under random scaling group transformation in the latent dimension.

Discussion
In this paper, we design and implement a Bayesian SINDy autoencoder for automated coordinate and governing equation discovery from high-dimensional data.Using the Bayesian learning framework and sparsity-promoting priors, the proposed model identifies sparse dynamics in the latent dimension.Through experimental studies on synthetic Lorenz, a reaction-diffusion system, a pendulum, and real video of moving pendulum.We successfully perform model discovery under a learned coordinate system.In the small-data and high-noise regime for real video data, we identify the correct physical law with close estimation of the standard gravity constant g.Besides the model and physics discovery, in terms of video prediction, the Bayesian SINDy autoencoder provides uncertainty-aware future predictions with an exact understanding of its underlying physics, which enables a trustworthy alternative for deep learning-based video prediction.
From our perspective, the Bayesian SINDy autoencoder has great potential to enable the concept of 'GoPro physics' [97], given the initial accomplishment in real video data discovery.However, the current framework does have several limitations, which hinder a more general application in various problems.The largest limitation comes from the current framework requiring a high manual workload on hyperparameter tuning.The Bayesian SINDy autoencoder has over ten hyperparameters, including the encoder-decoder size, loss function weights, and the sparsitypromoting prior.The current inferential process relies heavily on a reasonable hyper-parameter setting.There are several potential strategies to improve the hyper-parameter tuning part to improve the robustness of the training outcome.
1.The training process will be unstable and converge to a suboptimal solution if λ 1 , λ 2 , λ 3 is misspecified.To solve this problem, one can directly solve the multi-objective optimization for encoder-decoder architecture using methods like Multiple Gradient Descent Algorithm [80].Such adaptations can not only work in this paper on Bayesian SINDy autoencoder, but will also be able to generalize into a wider context for physics-informed machine learning.
2. The hyperparameter setting for the sparsity level δ, spike-and-slab distribution variances ν 0 , ν 1 , and environmental noise σ can be tackled using a full Bayesian inference setup as described in [26,71].However, this will exceedingly enlarge the computational requirement and the difficulty in implementation.
3. Under minor misspecification of λ 1 , λ 2 , λ 3 , one may consider to utilize an ensemble of trained Bayesian SINDy autoencoder initialized from various starting points [96,99].This process could ultimately construct a posterior distribution of the SINDy coefficient Ξ, and we can perform subset selection via stability selection or inclusion probability thresholding [29,62].Another consideration would be applying variational autoencoders [50] to avoid the hardthresholding procedure.Applying variational autoencoders could also help to perform better subset selection with Bayesian uncertainty estimation to the SINDy coefficient [50].
4. The global minima exists, representing the real governing equation with the best coordinate system, but the nonconvex nature of neural network training makes it difficult to be found.The nonconvexity in neural network loss space is an open problem with very limited theoretical guarantees.To illustrate the existence of the global minima, from the case study for the moving rod (pendulum) video, the discovery with the sin(z) term leads to the minimal aggregated loss (1.098) throughout all experimental trails.All other suboptimal discoveries have a larger aggregated loss (for example via the Laplace prior, the discovery of z has aggregated loss 1.559).It is also observable from the data reconstruction (like in Fig. 8) where the result from the Spike-and-slab prior is visually the best.However, there is no guarantee that the neural network training could always converge to the global minima.Again, applying the Bayesian Model Averaging initializing from various random initial conditions (as mentioned in the previous point) could be helpful, and this will be an interesting future direction.

5.
The training of encoder-decoder structure requires fixing an approximately correct neural network depth and width.The current study on autoencoders still lacks a complete understanding of the structure.In general, an oversized encoder-decoder will result in an underdetermined system which is likely to result in an overfitted encoder-decoder.This will hinder the autoencoder from extracting the correct latent dimension as expected.Convolutional filters could be helpful in reducing the possibility of learning a wrong coordinate system by focusing more on the 2D features.
In conclusion, the Bayesian SINDy autoencoder is an improved option compared to the SINDy autoencoder especially in the small-data and high-noise limit.The Bayesian learning framework not only accelerates the training but also has dominating performance in various cases.Additionally, the Bayesian SINDy autoencoder enables uncertainty estimation, which is essential for data-driven decision-making purposes.Experimentally, we demonstrate the effectiveness of the Bayesian SINDy autoencoder using both synthetic and real video data.With only 14 seconds of real moving pendulum video, the Bayesian SINDy autoencoder successfully discovers the coordinate system and governing equation and estimates the fundamental constant g.In future works, we hope to extend the current methodology into more complex scenarios with real video to explore further the power of Bayesian SINDy autoencoder in broad applications.

Figure 1 :
Figure 1: General structure of the Bayesian SINDy autoencoder architecture.(a) An autoencoder architecture is used to discover intrinsic coordinates z from high-dimensional input data x.The encoder f (x) transforms the high-dimensional input into the a low-dimensional space z, and the decoder g(z) reconstructs x from the low-dimensional subspace.(b) A Bayesian SINDy model infers the underlying dynamics of the latent dimension with uncertainty estimation.The active terms are identified by Ξ.The color represents the inclusion probability [0.0, 1.0] from grey (non-active) to blue (active).The loss function encourages the network to minimize both the reconstruction error and the SINDy loss in z and x.The Bayesian prior works as a regularization term on Ξ for sparse inference.

Figure 3 :
Figure 3: Bayesian estimation and uncertainty quantification visualization of SINDy coefficient for Lorenz system under Laplace prior.The color bar represents the inclusion probability estimate given the coefficient magnitude.
from the following transformations.(a) We inversely transform the permutation group via assigning z C to z 1 , z B to z 2 , and z A to z 3 .(b) We inversely perform affine transformation by z 1 = 1.0, z 2 = −0.94z 2 , and z 3 = 0.55z 3 − 2.81.We demonstrate the effectiveness of the affine transformation process in Fig. 4 (c) that the discovered model (c.1) can be transformed

1 Figure 4 :
Figure 4: (a) Trajectory prediction for in-distribution data with uncertainty quantification generated by posterior samples from Bayesian inference on SINDy coefficient.(b) Trajectory prediction for out-of-distribution data with uncertainty quantification generated by posterior samples from Bayesian inference on SINDy coefficient.(c.1) Discovered Lorenz systems and (c.2) transformation to the standard space.(c.3) Lorenz system generated from the ground truth.

Figure 5 :
Figure 5: (a) Bayesian estimation and uncertainty quantification visualization of SINDy coefficient for reaction-diffusion under Spike-and-slab prior.(b) Visualization of in-distribution (top) and out-of-distribution (bottom) predicted dynamics.(c) The generated attractor from the mean of Bayesian inference.

Figure 6 :
Figure 6: (a) Bayesian estimation and uncertainty quantification visualization of SINDy coefficient for pendulum data under Spike-and-slab prior.(b) Visualization of predicted dynamics with uncertainty quantification from 6 different initializations.

Figure 7 :( 4 )
Figure 7: Preprocessing pipeline to high-dimensional video data to remove auxiliary information and reduce the size of video frames.
(a), and the prediction of trajectory with uncertainty quantification of latent dynamics is shown in the bottom figure of Fig. 8 (b).The discovered governing equation from the SSGL prior is z = −16.06sin(z).

Figure 8 :
Figure 8: (a) Bayesian estimation and uncertainty quantification visualization of SINDy coefficient for real moving rod data under Spike-and-slab prior.(b) Visualization of prediction in both latent dynamics and pixel space with uncertainty quantification.(c) Reconstruction of Bayesian SINDy autoencoder for video frame, and temporal derivatives.

Figure 9 :
Figure 9: Left: by running the experiment 5 times with 2000 training epochs and 1500 refinement epochs, we plot the mean and standard deviation of testing loss under the log scale.Right: by running the experiment once with 7500 training epochs and 1500 refinement epochs, we plot the testing loss under the log scale.

Figure 10 :
Figure 10: Latent dimension visualization after rescaling (blue curve) versus manually labelled moving rod angle by human (grey curve).

Figure 11 :
Figure 11: (a) Bayesian estimation and uncertainty quantification visualization of SINDy coefficient for real moving rod data under Laplace prior.(b) Latent dimension visualization after rescaling (blue curve) versus manually labelled moving rod angle by human (grey curve).(c) Visualization of predicted dynamics in latent space with uncertainty quantification.