Pushing coarse-grained models beyond the continuum limit using equation learning
Abstract
Mathematical modelling of biological population dynamics often involves proposing high-fidelity discrete agent-based models that capture stochasticity and individual-level processes. These models are often considered in conjunction with an approximate coarse-grained differential equation that captures population-level features only. These coarse-grained models are only accurate in certain asymptotic parameter regimes, such as when the time scale of individual motility far exceeds the time scale of birth/death processes. Even when these coarse-grained models are inaccurate, the discrete model still abides by conservation laws at the microscopic level, which implies that some macroscopic conservation law can describe the macroscopic dynamics. In this work, we introduce an equation learning framework to find accurate coarse-grained models when standard continuum limit approaches are inaccurate. We demonstrate our approach using a discrete mechanical model of epithelial tissues through a series of four case studies involving problems with and without free boundaries, and with and without proliferation, illustrating how we can learn macroscopic equations describing mechanical relaxation, cell proliferation, and the dynamics of the free boundary of the tissue. While our presentation focuses on this biological application, our approach is more broadly applicable across a range of scenarios where discrete models are approximated by coarse-grained continuum descriptions.
1. Introduction
Mathematical models of population dynamics are often constructed by considering both discrete and continuous descriptions, allowing for both microscopic and macroscopic details to be considered [1]. This approach has been applied to several kinds of discrete models, including cellular Potts models [2–5], exclusion processes [6–9], mechanical models of epithelial tissues [10–17], hydrodynamics [18,19] and a variety of other types of individual-based models [1,20–27]. Continuum models are useful for describing collective behaviour, especially because the computational requirement of discrete models increases with the size of the population, and this can become computationally prohibitive for large populations, which is particularly problematic for parameter inference [28]. By contrast, the computational requirement to solve a continuous model is independent of the population size, and generally requires less computational overhead than working with a discrete approach only [15]. Continuum models are typically obtained by coarse-graining the discrete model, using Taylor series expansions to obtain continuous partial differential equation (PDE) models that govern the population densities on a continuum or macroscopic scale [10,11,29,30].
One challenge with using coarse-grained continuum limit models is that while the solution of these models can match averaged data from the corresponding discrete model for certain choices of parameters [10,17,31], the solution of the continuous model can be a very poor approximation for other parameter choices [13,15,32,33]. More generally, coarse-grained models are typically only valid in certain asymptotic parameter regimes [31,33,34]. For example, suppose we have a discrete space, discrete time, agent-based model that incorporates random motion and random proliferation: random motion involves agents stepping a given distance with some probability per time step of fixed duration, while the stochastic proliferation process involves agents dividing with some probability per time step. The continuum limit description of this kind of discrete process can be written as the reaction–diffusion equation (1.1) [33].
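A limit of this general form can be made concrete as follows; the symbols here are assumed for illustration, writing $\Delta$ for the step distance, $\tau$ for the time step duration, and $P_m$, $P_p$ for the motility and proliferation probabilities per time step. Under the usual mean-field assumptions, the one-dimensional limit reads

```latex
\frac{\partial u(x,t)}{\partial t}
  = D\,\frac{\partial^2 u(x,t)}{\partial x^2}
  + \lambda\, u(x,t)\left(1 - \frac{u(x,t)}{K}\right),
\qquad
D = \lim_{\Delta,\tau \to 0^{+}} \frac{P_m \Delta^2}{2\tau},
\qquad
\lambda = \lim_{\tau \to 0^{+}} \frac{P_p}{\tau},
```

where $K$ is the carrying capacity density. Both limits are finite only if $P_p = \mathcal{O}(\tau)$ while $P_m = \mathcal{O}(1)$, that is, proliferation must be asymptotically slow relative to motility, which is precisely the kind of parameter restriction discussed above.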
Regardless of whether choices of parameters in a discrete model obey the asymptotic restrictions imposed by coarse-graining, the discrete model still obeys a conservation principle, which implies that there is some alternative macroscopic conservation description that will describe population-level features of interest [35,36]. Equation learning is a means of determining appropriate continuum models outside of the usual continuum limit asymptotic regimes, and it has been used in several applications for model discovery. In the context of PDEs, a typical approach is to write , where is the population density, is some nonlinear function parametrized by , is a collection of differential operators and are parameters to be estimated [37]. This formulation was first introduced by Rudy et al. [37], who extended previous work on learning ordinary differential equations (ODEs) by Brunton et al. [38]. Equation learning methods developed for the purpose of learning biological models have also been a key interest [39,40]. Lagergren et al. [41] introduce a biologically informed neural network framework that uses equation learning guided by biological constraints, imposing a specific conservation PDE rather than a general nonlinear function . Lagergren et al. [41] use this framework to discover a model describing data from simple in vitro experiments on the invasion of populations of motile and proliferative prostate cancer cells. VandenHeuvel et al. [42] extend the work of Lagergren et al. [41], incorporating uncertainty quantification into the equation learning procedure through a bootstrapping approach. Nardini et al. [32] use discrete data from agent-based models to learn associated continuum ODE models, combining a user-provided library of functions with sparse regression methods to give simple ODE models describing population densities.
Regression methods have also been used as an alternative to equation learning for this purpose [43].
These previous approaches to equation learning consider various methods to estimate the parameters , such as sparse regression or nonlinear optimization [32,37–39,41,42], representing as a library of functions [37–39], neural networks [41], or in the form of a conservation law with individual components to be learned [41,42]. In this work, we introduce a stepwise equation learning framework, inspired by stepwise regression [44], for estimating from averaged discrete data with a given representing a proposed form for the continuum model description. We incorporate or remove terms one at a time until we obtain a parsimonious continuum model whose solution matches the data well and no further improvements can be made to this match. Our approach is advantageous for several reasons. Firstly, it is computationally efficient and parallelizable, allowing for rapid exploration of results with different discrete parameters and different forms of for a given dataset. Secondly, the approach is modular, with different mechanistic features easily incorporated; this enables extensive computational experimentation, exploring the impact of including or excluding putative terms in the continuum model without any great increase in computational overhead. Lastly, the results of our procedure are easy to examine, making it straightforward to diagnose and correct poorly fitting models and to explain which components of the continuum model are the most influential. We emphasize that a key difference between our approach and other work, such as the methods developed by Brunton et al. [38] and Rudy et al. [37], is that we constrain our problem so that we can only learn conservation laws rather than allowing a general form through a library of functions, and that we iteratively eliminate variables from rather than use sparse regression. These features are what support the modularity and interpretability of our approach.
To illustrate our procedure, we consider a discrete, individual-based, one-dimensional toy model inspired by epithelial tissues [10,17]. Epithelial tissues are biological tissues composed of cells organized in a monolayer; they are present in many parts of the body, interacting with other cells [45] and lining surfaces such as the skin and the intestine [46]. They are important in a variety of contexts, such as wound healing [47,48] and cancer [49,50]. Many models have been developed for studying their dynamics, considering both discrete and continuum modelling [10–16], with most models given in the form of a nonlinear reaction–diffusion equation with a moving boundary, using a nonlinear diffusivity term to incorporate mechanical relaxation and a source term to model cell proliferation [12,13,16]. These continuum limit models, too, are only accurate in certain parameter regimes, becoming inaccurate if the rate of mechanical relaxation is slow relative to the rate of proliferation [13,15,33]. To apply our stepwise equation learning procedure, we let the nonlinear function be given in the form of a conservation law together with equations describing the free boundary. We demonstrate this approach using a series of four biologically motivated case studies, considering problems with and without a free boundary, and with and without proliferation, with each case study building on those before it. The first two case studies demonstrate how our approach can learn known continuum limits, while the latter two show how we can learn improved continuum limit models in parameter regimes where these known continuum limits are no longer accurate. We implement our approach in the Julia language [51], and all code is available on GitHub at https://github.com/DanielVandH/StepwiseEQL.jl.
2. Mathematical model
Following Murray et al. [10] and Baker et al. [16], we suppose that we have a set of nodes describing cell boundaries at a time . The interval represents the th cell for , where we fix and . The number of nodes, , may increase over time due to cell proliferation. We model the mechanical interaction between cells by treating them as springs, as indicated in figure 1, so that each node experiences forces from nodes , respectively, except at the boundaries where there is only one neighbouring force. We further assume that each of these springs has the same mechanical properties, and that the viscous force from the surrounding medium is given by with drag coefficient . Lastly, assuming we are in a viscous medium so that the motion is overdamped, we can model the dynamics of each individual node , fixing , by [16]
The dynamics governed by (2.1)–(2.2) describe a system in which cells mechanically relax. Following previous work [10,14,16], we introduce a stochastic mechanism that allows the cells to undergo proliferation, assuming only one cell can divide at a time over a given interval for some small duration . We let the probability that the th cell proliferates be given by , where for some length-dependent proliferation law . As represented in figure 1c, when the th cell proliferates, the cell divides into two equally sized daughter cells, and the boundary between the new daughter cells is placed at the midpoint of the original cell. Throughout this work, we use a logistic proliferation law with , where is the intrinsic proliferation rate and is the carrying capacity density; we consider other proliferation laws in electronic supplementary material, §S5. The implementation of the solution to these equations (2.1)–(2.2) and the proliferation mechanism is given in the Julia package
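The mechanics and proliferation mechanism described above can be sketched compactly. The following Python illustration is a minimal sketch, not the paper's Julia implementation; the names `k`, `eta`, `s` (spring constant, drag coefficient, resting spring length) and `beta`, `K` (intrinsic proliferation rate, carrying capacity density) are assumed, as is the linear spring law.

```python
import numpy as np

def simulate(x, k=10.0, eta=1.0, s=0.2, dt=1e-3, steps=5000,
             beta=0.0, K=15.0, free_right=False, seed=0):
    """Forward-Euler sketch of the spring-based cell boundary model.

    x: sorted node positions (cell boundaries). The left boundary is
    fixed; the right boundary is fixed unless free_right is True.
    A linear spring force F(l) = k*(l - s) is assumed for illustration.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).copy()
    for _ in range(steps):
        ell = np.diff(x)                     # cell lengths
        F = k * (ell - s)                    # spring force in each cell
        dx = np.zeros_like(x)
        dx[1:-1] = (F[1:] - F[:-1]) / eta    # net force on interior nodes
        if free_right:
            dx[-1] = -F[-1] / eta            # free boundary node
        x += dt * dx
        if beta > 0:                         # logistic proliferation law
            q = 1.0 / np.diff(x)             # per-cell density
            G = np.clip(beta * (1.0 - q / K), 0.0, None)
            for i in np.flatnonzero(rng.random(q.size) < G * dt)[::-1]:
                # daughter boundary placed at the midpoint of cell i
                x = np.insert(x, i + 1, 0.5 * (x[i] + x[i + 1]))
    return x
```

With `beta=0` and fixed boundaries this reproduces pure mechanical relaxation: starting from unevenly spaced nodes, the cell lengths equilibrate towards a common value.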
In practice, macroscopic models of populations of cells are described in terms of cell densities rather than keeping track of the position of individual cell boundaries. The density of the th cell is . For an interior node , we obtain a density by taking the inverse of the average length of the cells to the left and right of , giving
3. Continuum-discrete comparison
We now consider four biologically motivated case studies to illustrate the performance of the continuum limit description (2.6). These case studies are represented schematically in figure 1d–g. Case Studies 1 and 3, shown in figure 1d,f, are fixed boundary problems, where cells relax mechanically towards a steady state in which each cell has equal length. Case Studies 2 and 4 are free boundary problems, where the right-most cell boundary moves in the positive -direction while all cells relax towards a steady state in which the length of each cell is given by the resting spring length . Case Studies 1 and 2 have so that there is no cell proliferation and the number of cells remains fixed during the simulations, whereas Case Studies 3 and 4 have so that the number of cells increases during the discrete simulations. To explore these problems, we first consider cases where the continuum limit model is accurate, using the data shown in figure 2, with space–time diagrams in the left column and a set of averaged density profiles for each problem in the right column. Case Studies 1 and 3 initially place nodes in and nodes in , or equivalently 28 cells in and cells in , spacing the nodes uniformly within each subinterval. Case Studies 2 and 4 initially place equally spaced nodes in .
The problems shown in figure 2 use parameter values such that the solution of the continuum limit (2.6) is a good match to the averaged discrete density profiles. In particular, all problems use , , and, for Case Studies 3 and 4, , and . The accuracy of the continuum limit is clearly evident in the density profiles in figure 2, where, in each case, the solution of the continuum limit model is visually indistinguishable from averaged data from the discrete model. With proliferation, however, the continuum limit can be accurate when is not too much larger than , and we use Case Studies 3 and 4 to explore this.
Figure 3 shows further continuum-discrete comparisons for Case Studies 3 and 4 where we have slowed the mechanical relaxation by taking . This choice of means that and are no longer on the same scale and thus the continuum limit is no longer well defined, as explained in the discussion of (1.1), meaning the continuum limit solutions are no longer accurate. In both cases, the solution of the continuum limit model lags behind the averaged data from the discrete model. In electronic supplementary material, §S1, we show the confidence regions for each curve in figure 3, where we find that the solutions have much greater variance compared with the corresponding curves in figure 2 where .
We are interested in developing an equation learning method for learning an improved continuum model for problems like those in figure 3, allowing us to extend beyond the parameter regime where the continuum limit (2.6) is accurate. We demonstrate this in Case Studies 1–4 in §4 where we develop such a method.
4. Learning accurate continuum limit models
In this section, we introduce our method for equation learning and demonstrate the method using the four case studies from figures 1–3. Since the equation learning procedure is modular, adding these components into an existing problem is straightforward. All Julia code to reproduce these results is available at https://github.com/DanielVandH/StepwiseEQL.jl. A summary of all the parameters used for each case study is given in table 1.
Table 1. Summary of the parameters used for each case study (Case Studies 1, 2, 3a, 3b, 4a and 4b).
(a) Case Study 1: fixed boundaries
Case Study 1 involves mechanical relaxation only so that there is no cell proliferation and the boundaries are fixed, implying and in (2.6), respectively, and the only function to learn is .
Our equation learning approach starts by assuming that is a linear combination of basis coefficients and basis functions , meaning can be represented as
The aim is to estimate in (4.1). We use ideas similar to the basis function approach from VandenHeuvel et al. [42], using (4.1) to construct a matrix problem for . In particular, let us take the PDE (2.6), with and , and expand the spatial derivative term so that we can isolate the terms to be estimated.
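As an illustration of this construction, expanding the conservation form as $\partial q/\partial t = D'(q)(\partial q/\partial x)^2 + D(q)\,\partial^2 q/\partial x^2$ with $D(q) = \sum_i \theta_i \phi_i(q)$ gives one least-squares row per space–time point. The following Python sketch uses assumed names; the paper's implementation is in Julia.

```python
import numpy as np

def build_system(q, x, t, basis, dbasis):
    """Assemble b ~= A @ theta for D(q) = sum_i theta_i * phi_i(q).

    Uses d/dx(D(q) q_x) = D'(q) q_x^2 + D(q) q_xx, so column i of A is
    phi_i'(q) q_x^2 + phi_i(q) q_xx and b holds the time derivative q_t.
    q has shape (nt, nx); basis/dbasis are lists of phi_i and phi_i'.
    """
    qt = np.gradient(q, t, axis=0)           # finite-difference q_t
    qx = np.gradient(q, x, axis=1)           # q_x
    qxx = np.gradient(qx, x, axis=1)         # q_xx
    cols = [dphi(q) * qx**2 + phi(q) * qxx
            for phi, dphi in zip(basis, dbasis)]
    A = np.column_stack([c.ravel() for c in cols])
    return A, qt.ravel()
```

Solving `np.linalg.lstsq(A, b, rcond=None)` then recovers the full (unpruned) least-squares estimate of the coefficients.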
(i) Let the vector denote the solution to subject to the constraint that each inactive coefficient is zero, meaning for for a given set of active coefficients . We compute by solving the reduced problem in which the inactive columns of are not included. The vector at step is denoted . With this definition, we compute the sets in (4.5): the first is the set of all coefficient vectors obtained by making each active coefficient at step inactive one at a time; the second is similar, except we make each inactive coefficient at step active one at a time. We then define , so that is the set of all coefficient vectors obtained by activating coefficients one at a time, deactivating coefficients one at a time, or retaining the current vector .
(ii) Choose one of the vectors in by defining the loss function (4.6), where is the solution of the PDE (2.6) with and using the coefficients in (4.1), is the linear interpolant of the PDE data at evaluated at , and is the number of non-zero terms in . This loss function balances goodness of fit with model complexity. If, for some , within , which we check by evaluating at equally spaced points in , we set . With this loss function, we compute the next coefficient vector from (4.7). If , so that all the coefficients are inactive, we instead take the vector that attains the second-smallest loss, so that a model with no terms cannot be selected.
(iii) If , then there are no more local improvements to be made and the procedure stops. Otherwise, we recompute and from and continue iterating.
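The stepwise procedure above can be sketched as a greedy toggle loop. In the paper the loss requires a PDE solve; the sketch below accepts any scoring callable, and all names are illustrative.

```python
import numpy as np

def stepwise(A, b, loss_fn, active=None):
    """Greedy stepwise selection over the columns of A (a sketch).

    loss_fn(theta, k) scores a coefficient vector with k active terms.
    At each step every coefficient is toggled (activated or deactivated)
    one at a time, and the best-scoring neighbour is accepted until no
    local improvement remains.
    """
    p = A.shape[1]
    active = set(range(p)) if active is None else set(active)

    def fit(idx):
        theta = np.zeros(p)
        cols = sorted(idx)
        theta[cols], *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
        return theta

    theta = fit(active)
    best = loss_fn(theta, len(active))
    while True:
        scored = []
        for j in range(p):
            S = active ^ {j}                 # toggle coefficient j
            if not S:
                continue                     # never select the empty model
            th = fit(S)
            scored.append((loss_fn(th, len(S)), S, th))
        loss, S, th = min(scored, key=lambda z: z[0])
        if loss >= best:                     # no local improvement: stop
            return theta
        best, active, theta = loss, S, th
```

With a loss combining mean-squared residual and a complexity penalty, redundant columns are deactivated one at a time, mirroring steps (i)–(iii).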
Let us now apply the procedure to our data from figure 2, where we know that the continuum limit with is accurate. We use the basis functions for so that
| step | | | | loss |
|---|---|---|---|---|
| 1 | −5.97 | 70.73 | −27.06 | |
| 2 | −1.46 | 47.11 | 0.00 | −4.33 |
| 3 | 0.00 | 43.52 | 0.00 | −5.18 |
| step | | | | loss |
|---|---|---|---|---|
| 1 | −1.45 | 42.48 | 13.76 | |
| 2 | 0.00 | 37.79 | 19.69 | |
| 3 | 0.00 | 49.83 | 0.00 | |
To improve our learned model, we introduce matrix pruning, inspired by the data thresholding approach in VandenHeuvel et al. [42], to improve the estimates for . Visual inspection of the space–time diagram in figure 2a shows that the most significant density changes occur at early time and near locations where the initial condition changes, and a significant portion of the space–time diagram involves regions where is almost constant. These regions of minimal change are problematic because points with higher residuals are overshadowed, distorting the least-squares problem and consequently degrading the estimates for significantly, and so it is useful to include only the most informative points in the construction of . To resolve this issue, we choose to only include points if the associated densities fall between the and quantiles of the complete set of densities, which we refer to as density quantiles; more details on this pruning procedure are given in electronic supplementary material, §S4. This choice of density quantiles is made using trial and error, starting at and , respectively, and shrinking the quantile range until suitable results are obtained. When we apply this pruning and reconstruct , we obtain the improved results in table 3 and associated densities in figure 5b. Compared with table 2, we see that the coefficient estimates all lead to improved losses, and our final model now has , which is much closer to the continuum limit, as we see in figure 5b where the solution curves are now visually indistinguishable everywhere. Moreover, we show in figure 4b how is updated at each iteration, where we see that the learned nonlinear diffusivity functions are barely different from the expected continuum limit result. These results demonstrate the importance of only including the most important points in .
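The density-quantile pruning amounts to a simple row mask on the least-squares system; a minimal Python sketch, with names assumed, is:

```python
import numpy as np

def density_quantile_mask(q, lo=0.1, hi=0.9):
    """Row mask keeping only points whose density lies between the
    lo and hi quantiles of the complete set of densities."""
    a, b = np.quantile(q, [lo, hi])
    return (q >= a) & (q <= b)
```

Given a flattened density vector `q` aligned with the rows of the least-squares system, the pruned system is `A[mask], b[mask]`, discarding rows from near-constant regions that would otherwise dominate the fit.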
(b) Case Study 2: free boundaries
Case Study 2 extends Case Study 1 by allowing the right-most cell boundary to move so that . We do not consider proliferation, giving in (2.6).
The equation learning procedure for this case study is similar to Case Study 1, namely we expand as in (4.1) and constrain . In addition to learning , we need to learn and the evolution equation describing the free boundary. In (2.6), this evolution equation is given by a conservation statement, with . Here, we treat this moving boundary condition more generally by introducing a function so that
In addition to the new matrix system in (4.15), we augment the loss function (4.6) to incorporate information about the location of the moving boundary. Letting denote the leading edge from the solution of the PDE (2.6) with parameters , the loss function is
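A sketch of such an augmented loss is given below; the exact weighting and regularization used in (4.6) and (4.16) are not reproduced here, and the log-scaled form and penalty value are illustrative assumptions.

```python
import numpy as np

def combined_loss(q_pde, q_data, L_pde, L_data, n_terms, penalty=0.2):
    """Loss balancing density fit, leading-edge fit and model size.

    q_pde/q_data: PDE and interpolated discrete densities at matched
    points; L_pde/L_data: PDE and averaged discrete leading edges;
    n_terms: number of non-zero coefficients. Weights are illustrative.
    """
    density_err = np.log(np.mean((np.asarray(q_pde) - np.asarray(q_data)) ** 2))
    boundary_err = np.log(np.mean((np.asarray(L_pde) - np.asarray(L_data)) ** 2))
    return density_err + boundary_err + penalty * n_terms
```

The complexity penalty means a candidate term is only retained if it improves the combined density and free-boundary fit by more than the penalty it incurs.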
Let us now apply our stepwise equation learning procedure with (4.15) and (4.16). We consider the data from figure 2, where we know in advance that the continuum limit with , and is accurate. The expansions we use for , and are given by
step | loss | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ||
2 | 0.00 | 0.00 | 25.06 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
We proceed by restricting our data collection to , now saving the solution at equally spaced times between and . Keeping the same quantiles for the matrix pruning, the new results are shown in table 5 and figure 6. We see that the densities and leading edges are accurate for small time, but the learned mechanisms do not extrapolate as well for ; for example, the solution in figure 6b does not match the discrete data. To address this issue, we can further limit the information included in our matrices, looking to include only boundary points where is neither too large nor too small. We implement this by excluding all points from the construction of in (4.13) such that is outside of the or quantiles of the vector , called the velocity quantiles.
step | loss | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | −3.37
2 | 0.00 | 0.00 | 0.00 | 0.00 | −0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | −2.37 | |
3 | 0.00 | 0.00 | 0.00 | −0.03 | 0.00 | 0.00 | 0.00 | 8.74 | 0.00 | 0.00 | −3.68 | |
4 | 0.00 | 47.38 | 0.00 | −0.03 | 0.00 | 0.00 | 0.00 | 8.74 | 0.00 | 0.00 | −4.02 | |
5 | 0.00 | 47.38 | 0.00 | 8.41 | −1.69 | 0.00 | 0.00 | 0.00 | 8.74 | 0.00 | 0.00 | −8.14 |
Implementing thresholding on leads to the results presented in figure 7. We see that the learned densities and leading edges are both visually indistinguishable from the discrete data. Since and are only ever evaluated at , and for , we see that and only match the continuum limit at , which means that our learned continuum limit model conserves mass and is consistent with the traditional coarse-grained continuum limit, as expected. We discuss in electronic supplementary material, §S5, how we can enforce to guarantee conservation of mass from the outset; however, our approach in figure 7 is more general in the sense that our learned continuum limit is obtained without making any a priori assumptions about the form of .
(c) Case Study 3: fixed boundaries with proliferation
Case Study 3 is identical to Case Study 1 except that we incorporate cell proliferation, implying in (2.6). This case is more complicated than mechanical relaxation alone, as we must also consider how to combine the repeated realizations to capture the averaged density data. For this work, we average over each realization at each time using linear interpolants, as described in electronic supplementary material, §S4. This averaging procedure gives points between and at each time , , with corresponding density value . The quantities and play the same role as and in the previous case studies.
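The averaging step can be sketched as follows, interpolating each realization's density onto a common spatial grid and then averaging pointwise; this is a minimal illustration of the supplementary material's procedure, with names assumed.

```python
import numpy as np

def average_realizations(realizations, xgrid):
    """Average densities across realizations at a single time point.

    realizations: list of (nodes, densities) pairs, one per simulation;
    each is linearly interpolated onto xgrid before averaging.
    """
    profiles = [np.interp(xgrid, xs, qs) for xs, qs in realizations]
    return np.mean(profiles, axis=0)
```

Because each realization has its own node positions (and, with proliferation, its own number of cells), interpolation onto a shared grid is what makes the pointwise average well defined.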
To apply equation learning, we note there is no moving boundary, giving in (2.7). We proceed by expanding and as follows:
(i) Accurate continuum limit
Let us now apply these ideas to our data from figure 2, where we know that the continuum limit with and is accurate. The expansions we use for and are given by
Table 6 shows that we find and , which are both very close to the continuum limit. Figure 8 visualizes these results, showing that the PDE solutions with the learned and match the discrete densities, and the mechanisms that we do learn are visually indistinguishable from the continuum limit functions (2.7), as shown in figure 8b,c.
step | () | () | () | loss | |||||
---|---|---|---|---|---|---|---|---|---|
1 | −11.66 | 147.43 | −191.51 | 0.13 | −0.00 | −0.00 | 5.83 | −11.30 | |
2 | −2.24 | 60.86 | 0.00 | 0.13 | −0.00 | −5.72 | 2.62 | −3.49 | −0.71 |
3 | −2.25 | 60.90 | 0.00 | 0.14 | −0.01 | 0.00 | −1.25 | 5.95 | −1.92 |
4 | 0.00 | 52.95 | 0.00 | 0.14 | −0.01 | 0.00 | −1.36 | 6.49 | −3.35 |
5 | 0.00 | 53.02 | 0.00 | 0.15 | −0.01 | 0.00 | 0.00 | 0.32 | −4.98 |
6 | 0.00 | 52.97 | 0.00 | 0.15 | −0.01 | 0.00 | 0.00 | 0.00 | −5.70 |
(ii) Inaccurate continuum limit
We now extend the problem so that the continuum limit is no longer accurate, taking to be consistent with figure 3a. Using the same basis expansions in (4.22), we save the solution at equally spaced times between and , averaging over 1000 realizations with . We find that we need to use the 25% and 75% density quantiles rather than the 10% and 90% density quantiles, as in the previous example, to obtain results in this case. With this configuration, the results we find are shown in table 7 and figure 9.
step | () | () | loss | ||||||
---|---|---|---|---|---|---|---|---|---|
1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | −0.33 |
2 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 |
3 | 0.00 | 0.00 | 0.00 | 0.11 | −0.01 | 0.00 | 0.00 | 0.00 | 0.20 |
4 | 0.00 | 0.11 | 0.00 | 0.11 | −0.01 | 0.00 | 0.00 | 0.00 | −0.04 |
5 | 0.00 | 0.12 | 0.00 | 0.13 | −0.01 | 1.59 | 0.00 | 0.00 | −0.46 |
6 | 0.00 | 0.12 | 0.00 | 0.16 | −0.02 | 7.49 | −1.69 | 0.00 | −1.13 |
Results in table 7 show , which is reasonably close to the continuum limit with . The reaction vector, for which the continuum limit is so that is a quadratic, is now given by , meaning the learned is a quartic. Figure 9 compares the averaged discrete densities with the solution of the learned continuum limit model. Figure 9c compares the learned source term with the continuum limit. While both terms are visually indistinguishable at small densities, we see that the two source terms differ at high densities, with the learned carrying capacity density, where , reduced relative to the continuum limit. This is consistent with previous results [15].
(d) Case Study 4: free boundaries with proliferation
Case Study 4 is identical to Case Study 2 except that we now introduce proliferation into the discrete model so that in (2.6). First, as in Case Study 3 and as described in electronic supplementary material, §S4, we average our data across each realization from our discrete model. This averaging provides us with points between and at each time , , where is the average leading edge at , with corresponding density values , where and is the number of knots to use for averaging. We expand the functions , , and as
With this problem, it is difficult to learn all mechanisms simultaneously, especially as mechanical relaxation and proliferation occur on different time scales: mechanical relaxation dominates the early part of the simulation, whereas both proliferation and mechanical relaxation play a role at later times. This means and cannot be estimated over the entire time range as was done in Case Study 3. To address this, we adopt a sequential learning procedure, learning these four mechanisms over four distinct time intervals , , and :
(i) Fix and learn over , solving .
(ii) Fix and and learn over , solving .
(iii) Fix , , and and learn over , solving .
(iv) Fix , , and and learn over , solving .
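The four-stage procedure above can be expressed as a small driver loop in which each stage fits one mechanism with all previously learned mechanisms held fixed; the mechanism names and the fitting callables below are illustrative, not the paper's API.

```python
def learn_sequentially(stages, initial):
    """Run the staged procedure: each entry of `stages` is a pair
    (name, fit_fn), where fit_fn receives the mechanisms learned so far
    and returns the coefficients for mechanism `name` on its own time
    window. Later stages therefore see earlier results frozen."""
    learned = dict(initial)
    for name, fit_fn in stages:
        learned[name] = fit_fn(dict(learned))  # fit one mechanism at a time
    return learned
```

For Case Study 4 the stages would correspond to the four mechanisms, each fitted on its designated time interval, which is what makes the modular procedure simple to apply.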
(i) Accurate continuum limit
We apply this procedure to data from figure 2, where the continuum limit is accurate with , , and . The expansions we use are
To now learn the mechanisms, we apply the sequential procedure described for learning them one at a time. For each problem, we apply pruning so that points outside of the and density quantiles or the and velocity quantiles are not included. We find that , , and . The results with all these learned mechanisms are shown in figure 10. We see from the comparisons in figure 10a,b that the PDE results from the learned mechanisms are nearly indistinguishable from the discrete densities. Similar to Case Study 2, only matches the continuum limit at . Note also that the solutions in figure 10a go up to , despite the stepwise procedure considering only times up to .
(ii) Inaccurate continuum limit
We now consider data from figure 3b where the continuum limit is inaccurate. Here, and the continuum limit vectors are , , and . Using the same procedures and expansions as in figure 10, we average the data over realizations. The time intervals we use are , , and , using time points for and time points for , and . We use knots for averaging the solution over , and knots for averaging the solution over , and .
To apply the equation learning procedure, we prune all matrices so that points outside of the and temporal quantiles are eliminated, where the temporal quantiles are the quantiles of from the averaged discrete data, and similarly for points outside of the and velocity quantiles. We find , , and . Interestingly, here we learn is quadratic with coefficients that differ from the continuum limit. The results with all these learned mechanisms are shown in figure 11. We see from the comparisons in figure 11 that the PDE results from the learned mechanisms are visually indistinguishable from the discrete densities. Moreover, as in figure 10, the learned and match the continuum results at , which confirms that the learned continuum limit conserves mass, as expected. Note also that the solutions in figure 11a go up to , despite the stepwise procedure considering only times up to , demonstrating the extrapolation power of our method.
5. Conclusion and discussion
In this work, we presented a stepwise equation learning framework for learning continuum descriptions of discrete models of population biology phenomena. Our approach provides accurate continuum approximations when standard coarse-grained approximations are inaccurate. The framework is simple to implement, efficient, easily parallelizable and modular, allowing additional components to be added to a model with minimal changes to an existing procedure. In contrast to other approaches, like neural networks [41] or linear regression approaches [43], results from our procedure are interpretable in terms of the underlying discrete process. The coefficients incorporated or removed at each stage of our procedure give a sense of the influence of each model term, improving the interpretability of the results and highlighting an advantage of the stepwise approach over traditional sparse regression methods [32,37,38]. The learned continuum descriptions from our procedure enable the discovery of new mechanisms and equations describing the data from the discrete model. For example, the discovered form of can be interpreted relative to the discrete model, describing the interaction forces between neighbouring cells. In addition, we found in Case Study 4 that, when so that the continuum limit is inaccurate, the positive root of the quadratic form of the source term is greater than the mean field carrying capacity density , as seen in figure 11. This suggests that, when the rate of mechanical relaxation is small relative to the proliferation rate, the carrying capacity density in the continuum description can differ from that in the discrete model.
We demonstrated our approach using a series of four biologically motivated case studies that incrementally build on each other, studying a discrete individual-based mechanical free boundary model of epithelial cells [10,11,15,16]. In the first two case studies, we demonstrated that we can easily rediscover the continuum limit models derived by Baker et al. [16], including the equations describing the evolution of the free boundary. The last two case studies demonstrated that, when the coarse-grained models are inaccurate, our approach can learn an accurate continuum approximation. The last case study was the most complicated, with four mechanisms needing to be learned, but the modularity of our approach made it simple to learn these mechanisms sequentially, applying the procedure to each mechanism in turn. Our procedure was able to recover terms that conserve mass, despite not enforcing conservation of mass explicitly. The procedure as we have described it does have some limitations, such as assuming that the mechanisms are linear combinations of basis functions, which could be handled more generally by instead using nonlinear least squares [42]. The procedure may also be sensitive to the quality of the data points included in the matrices, and thus to the parameters used for the procedure. In appendix A, we discuss a parameter sensitivity study that investigates this in greater detail; there, we find that the most important parameters to choose are the pruning parameters. These parameters can be easily tuned thanks to the efficiency of our method, modifying each parameter in sequence and using trial and error to determine suitable values.
There are many avenues for future work based on our approach. Firstly, two-dimensional extensions of our discrete model could be considered [55,56], which would follow the same approach except that the continuum problems would have to be solved using a more detailed numerical approximation [57–59]. Another avenue for exploration would be to apply the discrete model on a curved interface, which is more realistic than considering an epithelial sheet on a flat substrate [60,61]. Working with heterogeneous populations of cells, where parameters in the discrete model can vary between individuals, is another interesting option for future exploration [14]. Uncertainty quantification could also be considered using bootstrapping [42] or Bayesian inference [62]; incorporating uncertainty quantification would also allow noisy datasets to be modelled, unlike the idealized, noise-free data used in this work. We emphasize that, regardless of the direction taken, we believe that our flexible stepwise learning framework can form the basis of these potential future studies.
Data accessibility
Julia implementations of all computations are available on GitHub at https://github.com/DanielVandH/StepwiseEQL.jl.
Supplementary material is available online [63].
Declaration of AI use
We have not used AI-assisted technologies in creating this article.
Authors' contributions
D.J.V.: conceptualization, data curation, formal analysis, investigation, methodology, resources, software, writing—original draft; P.R.B.: conceptualization, formal analysis, funding acquisition, project administration, supervision, writing—review and editing; M.J.S.: conceptualization, formal analysis, funding acquisition, methodology, project administration, resources, supervision, writing—review and editing.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration
We declare we have no competing interests.
Funding
M.J.S. and P.R.B. are supported by the Australian Research Council DP230100025.
Appendix A. Parameter sensitivity study
In this appendix, we provide a brief parameter sensitivity study, exploring the impact of parameters such as the pruning parameters and the number of time points on the results of our stepwise learning framework. We use Case Study 3 for this purpose, taking a parameter regime in which the continuum limit is inaccurate. The parameters we consider are the duration between time points; the number of identically prepared realizations; the final time; the number of knots used for averaging; and the pruning parameter for the density quantiles. We vary each parameter one at a time, holding the remaining parameters fixed at their default values while a given parameter is being varied.
To assess the results for each set of parameters, we use the loss of the learned model. To further examine the results, we divide them into two categories: those that learn a diffusion term and those that do not. The results of the study are shown in figure 12. We see that there is little dependence of the results on the duration between time points, or equivalently on the number of time points. Figure 12b shows that the number of realizations needs to be sufficiently large in order for any diffusion terms to be selected, although the loss does not change significantly once such terms are identified. The final time is also important, with only a limited range of final times giving reasonable results. The number of knots is not especially important according to figure 12d, so long as there are not too many or too few. The most influential parameter is the pruning parameter for the density quantiles, which must be chosen carefully to obtain an adequate learned model; for other case studies that involve additional pruning parameters, such as those on the derivatives or on the leading edge, we also find that these parameters are the most influential.
Overall, figure 12 shows that the pruning parameter and the final time are the most important parameters for this problem. This is consistent with what we have found for the other case studies, where the choice of pruning parameters is crucial and the time horizon needs to be chosen carefully so that the relevant mechanisms can be identified. Choosing these parameters can be quite difficult: trial and error is needed to identify appropriate terms, as is understanding why a certain model fails to give good results. For example, in Case Study 2, we determined that we had to shrink the time interval used for learning and use velocity quantiles by identifying which mechanisms were failing to be learned and seeing where the model failed to extrapolate. The values that we used for these parameters, though, had to be chosen by trial and error. Our procedure is efficient enough for this trial-and-error process to be performed quickly, but future work could examine these issues in more detail to simplify the selection of these parameters.
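The density-quantile pruning discussed above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: points whose density falls outside a central quantile band are discarded before the regression, with the pruning parameter controlling the width of the band.

```python
import numpy as np

def prune_by_quantile(q, tau):
    """Return a boolean mask keeping only the points whose density q lies
    within the central quantile band [tau, 1 - tau]; points with extreme
    densities are discarded before the least-squares fit (illustrative
    sketch of density-quantile pruning)."""
    lo, hi = np.quantile(q, [tau, 1.0 - tau])
    return (q >= lo) & (q <= hi)
```

For example, with `tau = 0.1`, roughly the lowest and highest 10% of density values are discarded, so only the central 80% of points contribute to the regression.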