Parameter identifiability and model selection for partial differential equation models of cell invasion

When employing mechanistic models to study biological phenomena, practical parameter identifiability is important for making accurate predictions across wide ranges of unseen scenarios, as well as for understanding the underlying mechanisms. In this work, we use a profile-likelihood approach to investigate parameter identifiability for four extensions of the Fisher–Kolmogorov–Petrovsky–Piskunov (Fisher–KPP) model, given experimental data from a cell invasion assay. We show that more complicated models tend to be less identifiable, with parameter estimates being more sensitive to subtle differences in experimental procedures, and that they require more data to be practically identifiable. As a result, we suggest that parameter identifiability should be considered alongside goodness-of-fit and model complexity as criteria for model selection.


Introduction
Partial differential equation (PDE) models have been widely employed across many areas of biology as a means to understand mechanisms driving observed behaviours, as well as for making predictions of future behaviours. The complexity of a typical biological system means that it is often not clear which mechanisms should be incorporated into a PDE model, to what level of detail they should be described, and what functional form they should take. As a result, there can be multiple models proposed to describe the same system. For example, a variety of growth models have been proposed and adapted to describe the growth of tumours, coral reefs, and microbial populations [11,25,36], and their advantages and disadvantages have been extensively studied. To be able both to make accurate predictions for unseen scenarios and to elucidate underlying mechanisms, it is crucial to be able to conduct model selection.
Criteria for model selection have mostly focused on balancing model complexity with the ability of a model to accurately reproduce experimental observations. Typical examples are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) [1,6]. In this work, we will show that attention should also be paid to parameter identifiability in the process of model selection, an issue that information criteria ignore. Parameter identifiability refers to the extent to which the parameters of a model can be accurately estimated from available data. Non-identifiable models can give rise to misleading conclusions about the nature of the underlying mechanisms, as well as inaccurate predictions. As such, it is important to ensure that model selection considers identifiability.
In this paper, we will explore the connections between model selection and parameter identifiability through the use of in vitro experimental models of collective cell invasion, and four related candidate PDE models to describe them. We will use profile likelihoods for identifiability analysis, and investigate the impact of model complexity, data resolution and experimental design on parameter identifiability.

Parameter identifiability
The issue of parameter identifiability revolves around the question of whether it is possible to use available data to accurately estimate the model parameters that describe the mechanisms driving the biological system, accounting for observational errors. The literature distinguishes two primary notions of identifiability. The first is structural identifiability, which refers to the ability to uniquely determine the values of the model parameters given an infinite amount of data [46]. In the context of PDE models, this is equivalent to ensuring that distinct sets of parameter values do not yield identical solutions. While several methods exist for determining structural identifiability for ordinary differential equation (ODE) models [7,26], methods for PDE models are restricted to specific classes of models, such as age-structured PDEs [30] or systems of linear reaction-advection-diffusion equations [5].
This paper, however, focuses on the more useful notion of practical identifiability. A model is considered practically identifiable if the parameters can be confidently identified using available data [46]. Practical identifiability is a stronger condition than structural identifiability, and its focus on the available data makes it more relevant for real-world applications. The main distinction between the two notions is that structural identifiability can be formulated as a property of the PDE model itself, while practical identifiability is a property of the combination of the PDE model, the error model, and the data [26]. In this paper, we will define a parameter to be practically identifiable if the 95% confidence region obtained using profile likelihoods for this parameter is finite, and the profile likelihood function is unimodal. Moreover, for parameters that satisfy this definition, we call one parameter more (or less) identifiable than another if its confidence interval is narrower (or wider).
Parameter identifiability directly affects the reliability of model predictions and the ability of a mechanistic model to precisely pin down biological mechanisms. Non-identifiability may lead the modeller to place confidence in a set of parameter values that, while able to produce model solutions close to experimental data under the conditions in which the experiments were conducted, leads to predictions that diverge from the behaviour of the biological system under a different set of conditions.
As practical parameter identifiability depends on both the model and the data, our investigation will focus on both of these aspects. From a data-oriented perspective, we will investigate how the quality and quantity of data impact parameter identifiability. Specifically, we will explore the extent to which experimental design, data analysis and processing impact parameter identifiability. The results from these investigations will provide guidance for the design of experiments, as well as for data collection and processing procedures, for the purpose of improving identifiability. We will also examine the relationship between data resolution and parameter identifiability.

Model selection
When we build a model for a given biological system, it is often difficult to determine the appropriate level of complexity: which mechanisms to include, and to what level of detail, and which mechanisms to simplify or neglect. When the purpose of the model is to investigate a particular mechanism, it is sensible to choose a model that revolves around that mechanism [4], which provides a lower bound on complexity. Guidelines from [11] argue that, while phenomenological models are acceptable for making predictions, a model built for the purpose of understanding the biological system should focus on mechanisms that can be concretely derived from the biology. This provides an upper bound on complexity, but in between there remains room for choice.
Increasing the level of complexity of a model, by either including additional mechanisms or using more parameters to fine-tune the description of the existing mechanisms, will potentially result in a model that is capable of fitting the given data more accurately. The downside of a more complicated model, besides over-fitting and making analysis more difficult, is that the model parameters often become less identifiable, since a change in one parameter can be more easily compensated for by changes in other parameters. This trade-off between complexity and identifiability leads us to consider the problem of model selection, where we aim to choose the most appropriate model from a collection of models of varying complexity.
Traditional tools used for addressing the dilemma between goodness-of-fit and model complexity include the AIC [19, Ch. 3.4] and BIC [19, Ch. 9]. These information criteria assign a score to models that rewards goodness-of-fit while penalising model complexity. In this paper, we instead focus on using parameter identifiability as a tool for model selection, exploring the relation between complexity and identifiability in further detail. We show that the model selected using information criteria, such as AIC or BIC, might contain parameters that are harder to identify than parameters in competing models that score less well, despite the widespread usage of these information criteria for model selection in mathematical biology, such as in cell invasion [44], HIV infection [3], and ecology [1]. We propose that identifiability analysis should be performed prior to model selection, and that models that are identifiable given the available data should be favoured.
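In symbols, AIC = 2k − 2 ln L̂ and BIC = k ln(n) − 2 ln L̂, where k is the number of free parameters, n the number of data points, and L̂ the maximised likelihood; lower scores are preferred. A minimal sketch of the two criteria (the function names are ours, not from any particular library):

```python
import numpy as np

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L; lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_data: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln L; penalises each
    extra parameter more heavily than AIC once n_data exceeds e^2 ~ 7.4."""
    return n_params * np.log(n_data) - 2 * log_likelihood
```

Given two fitted models, one would compute both scores from each model's maximised log-likelihood and prefer the lower score; the point of this paper is that such scores say nothing about how well constrained the winning model's parameters are.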

Identifiability in the context of cell invasion
Collective cell invasion plays a central role in development, wound healing, and cancer, and there has been considerable interest in understanding and quantifying the mechanisms that drive, promote or hinder this phenomenon. In this work, we will use in vitro collective cell invasion as a prototypical biological system to explore the ways in which identifiability issues can arise, and how the profile likelihood method can be used to analyse them. The data we will use come from a suite of barrier assay experiments, and we will use the Fisher equation and its generalisations as candidate models.
There is a long line of research on model fitting and parameter identification for cell invasion that leads up to this work. In an early work, Sherratt and Murray (1990) [34] used the Fisher model, and its generalisation, the Porous Fisher model, to describe wound healing from a qualitative perspective. These models were fitted to data from cell proliferation assays by Sengers et al. (2007) [33]. Later, Treloar and Simpson (2013) [40] and Simpson et al. (2013) [38] showed that the Fisher model can be fitted to cell invasion data using only information on the location of the edge of the cell population. In contrast, Jin et al. (2016) [17] fitted the Fisher and Porous Fisher models to cell invasion data consisting of the cell density profile. That paper employed the maximum likelihood method for parameter estimation, and goodness-of-fit as the main criterion for model selection, but it did not address parameter identifiability directly. In a later work, Vittadello et al. (2018) [43] used information on cell cycle dynamics, in addition to cell density data, to estimate model parameters. The same data were used by Simpson et al. (2020) [37] to explore parameter identifiability by quantifying the uncertainty in parameter estimates using both profile likelihoods and Bayesian inference.
The method we use for parameter inference, profile likelihoods, has been used in a variety of studies in identifiability analysis. It is built upon the maximum likelihood estimator (MLE), and has often been compared to Markov chain Monte Carlo (MCMC)-based methods for Bayesian inference; MCMC methods provide more information on parameter values, but are more expensive to compute. Raue et al. (2009) [28] discussed the application of profile likelihoods to detect non-identifiable parameters, illustrated with an example in the context of ODE models. This work continued in Raue et al. (2013) [27], which suggests a method that combines MCMC methods and profile likelihoods to inform data collection and iteratively refine parameter estimates. A comparison between MCMC and profile likelihood methods was carried out in [42] for ODE models of varying complexity. The application of profile likelihoods in model selection was discussed in Simpson et al. (2022) [36], which used three different ODE models to describe coral growth, and discussed the importance of parameter identifiability in model selection. Model selection for cell invasion models has been discussed by Warne et al. (2019) [44], who used Bayesian methods for model fitting, and information criteria for model selection. The authors emphasised the importance of model complexity as a factor in model selection, and identified BIC as the best overall criterion. However, they acknowledged that information criteria cannot account for all aspects of model comparison.

Outline
The contribution of this paper is the investigation of parameter identifiability for multiple PDE models using profile likelihoods, and a discussion of the implications of practical identifiability for model selection. We also investigate the effects of data resolution and experimental design on parameter identifiability. Lastly, we demonstrate a link between parameter identifiability and model robustness in terms of consistency of parameter estimates, which motivates a mixed effects view of the system. Under this view, each experimental replicate can be seen as taking a sample of the parameter values from a certain distribution, which tends to be more dispersed for less identifiable models. The rest of the paper proceeds as follows. We describe the experimental procedures for the barrier assay, as well as data processing procedures, in Section 2. We introduce the suite of mathematical models, as well as the profile likelihood method for parameter inference and identifiability analysis, in Section 3. The results are presented in Section 4, and we discuss their significance with respect to model selection and experimental design in Section 5.

Experimental methods and data
Barrier assays, also known as tissue expansion assays, are a way to observe cell invasion in vitro. Briefly, a barrier assay involves preparing a plate with a barrier that encloses a central region, into and out of which cells cannot move. Cells are seeded within the central region, and then given sufficient time to proliferate and form a confluent monolayer within that region. The barrier is then removed, allowing the cells to propagate outward and invade the rest of the plate, which is initially devoid of cells. For this work, eight experiments were carried out in total, with the first four (Experiments 1-4) using a circular barrier, and the latter four (Experiments 5-8) using a triangular barrier. The experiments were otherwise identical. Selected images from the experiments are shown in Fig. 1, along with corresponding schematic representations of the experiment, and snapshots of model solutions at the corresponding time points. The detailed experimental protocols are provided in Sections 2.1 and 2.2.

Cell culture
Madin-Darby Canine Kidney (MDCK-II) cells were cultured in low-glucose Dulbecco's modified Eagle medium (DMEM) with 10% FBS and 1% Penicillin-Streptomycin. Cells were split every 1 to 2 days, depending on the condition of the cells. MDCK cells were treated with TrypLE for 8 minutes to detach them from the TC (tissue-culture treated) plastic dish. Detached cells were diluted 1:2 with the culture medium and centrifuged at 300 g at room temperature for 3 minutes. Cell pellets were then re-suspended in culture medium and seeded in a new dish.

Barrier assay
The tissue-culture treated (TC) plastic dish was coated with 50 µg/mL Collagen IV in PBS at 37 °C for 30 minutes. A 250 µm thick PDMS (polydimethylsiloxane) membrane, cut with a Silhouette Cameo vinyl cutter, was deposited on the TC plastic dish to confine the cells to the initial region. An MDCK cell suspension at 2.25 × 10⁶ cells/mL was seeded in each stencil, with a seeding volume of 0.44 µL per mm² of tissue area. The TC plastic dish was incubated at 37 °C for 30 minutes to induce cell attachment to the collagen surface. Subsequently, the dish was flooded with the culture medium, and cells were incubated for 19 hours to form a confluent monolayer. Stencils were removed from the dish with tweezers at the start of the barrier assay. The expansion of the tissue monolayer was imaged from this point onward every 20 minutes for 24 hours with a 4× phase objective.

Image analysis
From the 4× phase images, nuclei locations were reconstructed using a tool based on convolutional neural networks [20]. The local cell density was calculated by counting the number of nuclei centroids in each density measuring box. The box size is 58.4 µm × 58.4 µm, with 50% overlap between neighbouring boxes.
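The box-counting step above can be sketched as follows. The function name and the `field_size` argument are ours, while the 58.4 µm box size and 50% overlap follow the values stated above:

```python
import numpy as np

def density_map(centroids, field_size, box=58.4, overlap=0.5):
    """Estimate local cell density from nuclei centroid positions.

    centroids : (N, 2) array of (x, y) positions in µm.
    field_size: (width, height) of the imaged field in µm.
    Boxes of side `box` are tiled with the given fractional overlap
    (50% overlap => stride of box/2); the density in each box is the
    centroid count divided by the box area (cells per µm²)."""
    stride = box * (1.0 - overlap)
    nx = int((field_size[0] - box) // stride) + 1
    ny = int((field_size[1] - box) // stride) + 1
    density = np.zeros((ny, nx))
    for j in range(ny):
        for i in range(nx):
            x0, y0 = i * stride, j * stride
            in_box = ((centroids[:, 0] >= x0) & (centroids[:, 0] < x0 + box) &
                      (centroids[:, 1] >= y0) & (centroids[:, 1] < y0 + box))
            density[j, i] = in_box.sum() / box**2
    return density
```

Because neighbouring boxes overlap by half their width, the resulting density field is smoother than a disjoint tiling would give, at the cost of correlated neighbouring measurements.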
Mathematical methods

PDE models of cell invasion
To describe the cell invasion process observed in the barrier assay experiments, we will use the Fisher-KPP equation, and its extensions, to model the evolution of cell density over time. The original Fisher model was proposed in [9], and studied in a more general context in [39]. More recently in cell biology, this model and its generalisations have been used to model a diverse range of phenomena, including wound healing [13,21] and tumour growth [10,16,35].
We summarise the Fisher model and its generalisations as

∂C/∂t = ∇·(D(C)∇C) + f(C),  with  D(C) = D_0 (C/K)^η  and  f(C) = r C^α (1 − (C/K)^γ)^β.  (1)

In the context of our application, C(x, y, t) ≥ 0 represents cell density, D(C) ≥ 0 is the diffusion coefficient, and the function f(C) represents net cell proliferation, with K > 0 representing the carrying capacity. The parameters D_0 > 0 and r > 0 scale the rates of diffusion and proliferation, respectively, and α, β, γ, η are non-negative "shape parameters" that allow fine-tuning of D(C) and f(C). We impose zero-flux boundary conditions at x = 0, L and y = 0, L. We consider four different models. The original Fisher model takes D(C) = D_0 and f(C) = rC(1 − C/K), which corresponds to fixing α = β = γ = 1 and η = 0. The Porous Fisher model generalises the diffusion term by allowing η ≥ 0 to be free. This model has been used to describe cell invasion and wound healing [8,17,22,33,34]. The intuitive motivation is that, as cell density increases, a crowding effect pushes the cells to move into less dense areas more quickly, leading to density-dependent diffusion. The theoretical properties of the model have been studied in many works, including [24,47]. The Richards model, proposed in [31], and the Generalised Fisher model generalise the proliferation term by allowing γ ≥ 0 and α, β ≥ 0, respectively, to assume values other than unity. The Richards and Generalised Fisher models, among other growth laws, were analysed in the context of ODE models in [41]. We summarise the four models in Table 1, and the numerical methods used for simulating these models are given in Supplementary Material A.1.

Model name           Free parameters     Fixed parameters
Standard Fisher      D_0, r, K           α = β = γ = 1, η = 0
Porous Fisher        D_0, r, K, η        α = β = γ = 1
Richards             D_0, r, K, γ        α = β = 1, η = 0
Generalised Fisher   D_0, r, K, α, β     γ = 1, η = 0

Table 1: Summary of models considered, as special cases of Eq. (1).
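As an illustration of the dynamics such models produce, the following is a minimal explicit finite-difference sketch of the Standard Fisher model in one spatial dimension, with illustrative parameter values and a confined initial block standing in for the barrier assay initial condition (the simulations reported in this paper use the methods of Supplementary Material A.1):

```python
import numpy as np

def simulate_fisher_1d(D0=1000.0, r=0.05, K=1.0, L=2000.0,
                       nx=200, t_end=24.0, dt=0.01):
    """Explicit finite-difference solution of the 1D Fisher-KPP model
        dC/dt = D0 * d2C/dx2 + r*C*(1 - C/K),
    with zero-flux boundaries, starting from a confined block of cells
    at carrying capacity (a crude 1D analogue of the barrier assay).
    Parameter values are illustrative (D0 ~ µm²/h, r ~ 1/h)."""
    dx = L / (nx - 1)
    assert D0 * dt / dx**2 < 0.5, "explicit-scheme (CFL) stability condition"
    x = np.linspace(0.0, L, nx)
    C = np.where(x < L / 4, K, 0.0)          # cells confined behind the "barrier"
    for _ in range(int(t_end / dt)):
        lap = np.empty_like(C)
        lap[1:-1] = (C[:-2] - 2 * C[1:-1] + C[2:]) / dx**2
        lap[0] = 2 * (C[1] - C[0]) / dx**2   # zero-flux boundary at x = 0
        lap[-1] = 2 * (C[-2] - C[-1]) / dx**2  # zero-flux boundary at x = L
        C = C + dt * (D0 * lap + r * C * (1 - C / K))
    return x, C
```

Running this produces the familiar travelling-wave invasion front: density stays near K behind the front and decays to zero ahead of it, with front speed approaching 2·sqrt(D0·r) at long times.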

Parameter identifiability with profile likelihoods
For a given model, let P be the number of free parameters, θ denote the vector of free parameters in the model, and θ_{-i} denote the parameter vector with θ_i, the i-th parameter, removed. For the Standard Fisher model, for example, P = 3, θ = (D_0, r, K), and θ_{-2} = θ_{-r} = (D_0, K). Let C_data(x, y, t) be the cell density measurements, and let C_model(x, y, t; θ) be the solution to Eq. (1) with parameters θ. For simplicity, we assume that the observed cell density is generated by the model with a given parameter set θ, and perturbed by independent and identically distributed (i.i.d.) observation noise at each data point that follows a Gaussian distribution with zero mean and unknown variance σ². That is,

C_data(x, y, t) = C_model(x, y, t; θ) + ε(x, y, t),   ε(x, y, t) ~ N(0, σ²) i.i.d.

Let L(C_data | θ, σ) be the likelihood function, and let (θ*, σ*) be the maximum likelihood estimator (MLE). We define the normalised profile log-likelihood function l_i(θ'_i) for parameter θ_i to be

l_i(θ'_i) = max_{θ_{-i}, σ} log L(C_data | θ, σ)|_{θ_i = θ'_i} − log L(C_data | θ*, σ*),

which will be referred to simply as the profile likelihood function for brevity. We can similarly define a bi-variate profile likelihood, l_{i,j}, as

l_{i,j}(θ'_i, θ'_j) = max_{θ_{-i,-j}, σ} log L(C_data | θ, σ)|_{θ_i = θ'_i, θ_j = θ'_j} − log L(C_data | θ*, σ*).

Following [23] and [36], we use the profile likelihood function to define an approximate 95% confidence interval for a single parameter as

{θ_i : l_i(θ_i) ≥ −χ²(0.95; 1)/2} = {θ_i : l_i(θ_i) ≥ −1.92},

and a joint confidence region for two parameters as

{(θ_i, θ_j) : l_{i,j}(θ_i, θ_j) ≥ −χ²(0.95; 2)/2} = {(θ_i, θ_j) : l_{i,j}(θ_i, θ_j) ≥ −3.00},

where χ²(p; m) denotes the inverse of the cumulative distribution function (cdf) of a χ² distribution with m degrees of freedom, evaluated at p. The shape of the univariate profile likelihood curve indicates whether the parameter is practically identifiable. A profile likelihood curve that is unimodal, smooth, and decays quickly away from its peak (which occurs at the MLE) suggests that the parameter is identifiable. Non-identifiability can be reflected in a profile likelihood in many different ways, such as multi-modality, slow decay away from the MLE, or a flat top.
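To make the profiling recipe concrete, the sketch below profiles one parameter of a toy model y = a + bx with i.i.d. Gaussian noise, standing in for C_model (all names and values are illustrative). The nuisance parameters (b and σ) are maximised out in closed form, and the 95% cutoff −χ²(0.95; 1)/2 ≈ −1.92 is applied to the normalised profile:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a toy linear model; for a PDE model the inner
# maximisation would instead be a numerical optimisation.
a_true, b_true, sigma_true = 2.0, 0.5, 0.3
x = np.linspace(0, 10, 50)
y = a_true + b_true * x + rng.normal(0, sigma_true, x.size)
n = x.size

def profile_loglik_a(a):
    """Normalised profile log-likelihood l(a): maximise the Gaussian
    log-likelihood over the nuisance parameters b and sigma (both
    available in closed form here) with a held fixed."""
    b_hat = np.sum((y - a) * x) / np.sum(x * x)   # least-squares b given a
    ss = np.sum((y - a - b_hat * x) ** 2)
    return -0.5 * n * np.log(ss)                  # up to an additive constant

a_grid = np.linspace(1.0, 3.0, 401)
ll = np.array([profile_loglik_a(a) for a in a_grid])
ll -= ll.max()                                    # normalise so the peak is 0

# Approximate 95% CI: grid points with l(a) >= -chi2(0.95; 1)/2 = -1.92
inside = a_grid[ll >= -1.9207]
print(f"a_hat = {a_grid[ll.argmax()]:.3f}, "
      f"95% CI = [{inside.min():.3f}, {inside.max():.3f}]")
```

A unimodal, roughly parabolic `ll` with a narrow `inside` interval is the signature of a practically identifiable parameter; a flat or multi-modal profile would signal trouble.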

Results
We now present the results from the identifiability analysis of the four models using profile likelihoods. A preliminary exploration with synthetic datasets, presented in Supplementary Material Section B, shows that profile likelihoods can recover the ground truth parameter values in the absence of model mis-specification. In Section 4.1, we show that all models considered are practically identifiable. Then, in Section 4.2, we show that slight changes to data processing procedures have a major impact on the parameter estimates for more complicated models, but not for simple models. In Section 4.3, we compare the profile likelihoods across multiple experimental datasets to observe a relation between model complexity and consistency of parameter estimates. Finally, in Section 4.4, we investigate the relation between data resolution and practical identifiability.

All four models are practically identifiable
First, we show that all four models are practically identifiable. We present the profile likelihoods for Experiment 1 in Fig. 2, which is arbitrarily chosen as an illustrative representative for the purpose of this section; the results for the other experiments are qualitatively similar. We present the profile likelihoods for the other experiments, and the MLEs, the 95% confidence intervals, and the AIC and BIC for each model and each experiment, in Supplementary Material Section C. Visual inspection indicates that the MLEs for all four models produce solutions that fit the data very well. As such, it is difficult to select between models by such inspection alone. Observe that in Fig. 2, all profile likelihood curves are unimodal and roughly parabolic in shape, and the confidence intervals are narrow. These observations, and similar results from the other experiments (shown in Supplementary Material Section C), suggest that all models are practically identifiable given available data from any one of the eight experiments. Comparing the Standard Fisher model with the more complicated models, we notice that, as model complexity increases, the profile likelihood curves broaden, and therefore the confidence intervals widen, making the parameters less identifiable. This suggests that increases in model complexity tend to lead to decreases in parameter identifiability. Intuitively, this is because more complicated models have more freedom to compensate for a mis-identification in one parameter by adjusting the other parameters.

Complicated models are more sensitive to data processing
We now turn to the question of whether the way we process the data impacts parameter identifiability. In the case where the barrier is circular, the cell population remains approximately radially symmetric throughout the experiment, as expected. In this case, we can average the data radially to obtain a reduced dataset C̃_data(ρ, t), where ρ is the radial coordinate. It is tempting to work with this reduced dataset instead of the full dataset, since it makes model simulation, and therefore parameter inference and identifiability analysis, computationally much cheaper.
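Radial averaging of a two-dimensional density field can be implemented as a binned mean over annuli; the sketch below is illustrative (the function name, the unit-spaced grid, and the default bin width are our choices):

```python
import numpy as np

def radial_average(C, centre, dr=29.2):
    """Average a 2D density field C into radial bins of width dr about
    the colony centre, returning bin-centre radii and the mean density
    in each annulus. C is indexed as C[iy, ix] on a unit-spaced grid
    for simplicity; dr here matches the density-box stride in µm."""
    ny, nx = C.shape
    yy, xx = np.mgrid[0:ny, 0:nx]
    r = np.hypot(xx - centre[0], yy - centre[1])   # distance to centre
    nbins = int(r.max() / dr) + 1
    idx = (r / dr).astype(int)                     # annulus index per pixel
    sums = np.bincount(idx.ravel(), weights=C.ravel(), minlength=nbins)
    counts = np.bincount(idx.ravel(), minlength=nbins)
    radii = (np.arange(nbins) + 0.5) * dr
    return radii, sums / np.maximum(counts, 1)
```

Note that annuli at large ρ pool many more pixels than those near the centre, so the averaging suppresses noise unevenly across the profile, which is one route by which this reduction can alter the fit.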
In Fig. 3, we show the profile likelihoods computed using the radially averaged dataset from Experiment 1. Comparing Fig. 3 and Fig. 2, we notice that, in general, the profile likelihood curves are broader for the radially averaged dataset than for the full dataset. This is expected, since the full dataset has many more data points. A more surprising observation is that, while the MLEs for the parameters in the Standard Fisher model remain consistent between the two datasets, the MLEs for the other three models are much less consistent. These observations show that the more complicated models are more sensitive to the way the data are processed or represented.
In this instance, a likely explanation is that computing the radial average smooths the cell density near the edge of the population. The more complicated models are more sensitive to such subtle differences in the data, which stems from their increased flexibility to fit the data. Another way to look at this is that averaging reduces the observational noise present in the data, thereby preventing the models from over-fitting. The more complex models may have a higher propensity to over-fit, which explains why these models display a greater difference in the profile likelihoods between the two datasets.
This sensitivity of parameter estimation to data representation means care must be taken during data processing. If we wish to use a complicated model, then we should use the full density profile, because it suffers less information loss. On the other hand, if we are content with a simple model like the Standard Fisher model, then it would be preferable to use a dataset of reduced dimension, because it yields the same result as the full dataset while making the computations cheaper.

Consistency of parameter estimates across experimental replicates indicates practical identifiability
By examining and comparing the profile likelihood results across all experimental replicates, we found that the parameter estimates are less consistent for the more complicated models than for the simpler Standard Fisher model, in the sense that the variances of the eight MLEs of analogous parameters are higher for the more complicated models. In Fig. 4, we present the confidence intervals for all parameters in the four models found across the eight experimental replicates, except for the Generalised Fisher model, where we only present the results from the first four experiments. For all parameters, the widths of the confidence intervals for individual experiments are much narrower than the range of estimates across all experiments. For example, the widths of the confidence intervals for D_0 in the Standard Fisher model are on the order of 10 µm²/h, whereas the MLEs for D_0 for Experiments 3 and 7 differ by around 700 µm²/h. The confidence intervals for the same parameter estimated from different experiments rarely overlap; these variations in parameter estimates reflect the high variability commonly observed in biological processes. This suggests that it would be appropriate to model the system using a mixed effects framework, where the parameter values corresponding to each experiment are sampled from a relatively broad distribution.
For the Standard Fisher model, the MLEs for the parameters show only a mild degree of variability between the experiments. The difference in initial conditions does not lead to significant differences in the estimates of D_0 and r, while the estimate for K tends to be slightly lower for the experiments with triangular initial conditions than for those with circular initial conditions.
While the variability of parameter estimates in the Standard Fisher model across experiments is modest, it is much larger in the other models: for example, the standard deviations of the MLEs for D_0 across the experiments are 2937.8 for the Porous Fisher, 812.9 for the Richards, and 320.4 for the Generalised Fisher models, compared to 238.3 for the Standard Fisher model, although this comparison might not be entirely fair for the Porous Fisher model, since D_0 has different units in that model than in the others, and hence a different interpretation. Similar results hold for r, while the variability for K is approximately the same across the four models.
For the Porous Fisher model, the experiments with circular initial conditions (Experiments 1-4) estimate a very small η, well below 0.2, suggesting that there is little evidence for nonlinear diffusion in the dynamics, and the estimates for the other parameters (D_0, r, K) are relatively similar to their estimates in the Standard Fisher model. However, the experiments with triangular initial conditions (Experiments 5-8) estimate a much larger η, along with a higher D_0, suggesting that nonlinear diffusion effects are important. Since the speed of propagation of the cell colony (which can be identified as the speed of the travelling wave in the model solutions) does not vary much between the experiments, and an increase in η corresponds to a decrease in wave speed, the estimated values for D_0 and r are much higher for these experiments to compensate for the higher η. This compensation is reflected by the slanted ridge seen in the bi-variate likelihood function with respect to D_0 and η, shown in Fig. 5.
We see a similar story for the Richards model. Observe that in the third row of Fig. 4, there seems to be a clear dependence of the estimated parameter values on the experimental initial conditions. Experiments 1-4 (circular initial conditions) in general give lower values for D_0, r, and γ, and higher values for K, compared to Experiments 5-8 (triangular initial conditions). Especially striking is the case of γ, where Experiments 1-4 give estimates in the range 0.6-1.7, whereas Experiments 5-8 give estimates either around eight to nine, or at the imposed upper bound (we use γ = 9 as an upper bound when searching for the MLE, to ensure numerical stability). The Richards model is also the only model whose profile likelihoods do not have a roughly parabolic shape. For Experiments 5 and 6, the profile likelihoods for γ are bi-modal (but the peaks are close together, and the valley between them lies above the -1.92 threshold, so the confidence region remains a connected interval). This bi-modality can also be seen in the profile likelihoods for the other parameters, but is much less prominent. For Experiments 7 and 8, the profile likelihood functions for γ appear to be monotonically increasing up to the imposed upper bound at γ = 9, suggesting that the MLE is either very large or tends to infinity. A high γ (say, above seven) means that the proliferation function f(C) in Eq. (1) is nearly linear for 0 ≤ C < K except for C near K, where it abruptly drops to zero at C = K. An even higher γ makes little difference to the shape of f(C), and therefore has little effect on the model solution. For this reason, the Richards model is known to be locally structurally non-identifiable in the high-γ regime [37]. This explains the unbounded confidence intervals for γ for Experiments 7 and 8.
It is more difficult to explain why the experiments with triangular initial conditions suggest a very large γ, and hence nearly linear proliferation, whereas the experiments with circular initial conditions suggest a γ relatively close to one, reflecting dynamics closer to logistic growth. There are prior studies making similar observations. In [17], the authors showed that parameter estimates for similar experiments and models depend strongly on the initial cell density, when the shape of the initial cell population is kept constant. In contrast, Jin et al. [18] found that the shape of the initial cell density profile does not seem to have a major effect on the estimates for the parameters in the Standard Fisher model, which agrees with our observations. The variability of the parameter estimates in the Porous Fisher and Richards models, and their dependence on initial conditions, suggests that these two models are mis-specified, since we would not expect such variability if either model were capable of reflecting the true dynamics of the biological process. This mis-specification is likely due to the fact that neither model takes the coupling between tissue mechanics and geometry into account. In an empirical study, Ravasio et al. [29] demonstrated that edge geometry has a significant influence on the rate of wound closure. This effect arises from the force exerted on the cells by supra-cellular actomyosin cables around the wound edge, which, in turn, is regulated by the underlying geometry.
For the Generalised Fisher model, the computations for the profile likelihoods did not finish within a reasonable time limit for Experiments 5-8, so we only present the parameter estimates for Experiments 1-4 in the last row in Fig. 4. For Experiments 5-8, the optimisation procedures struggled much more to find the correct global optima compared to Experiments 1-4, suggesting that the likelihood landscape is much rougher. We do not pursue this technical challenge further since it is beyond the scope of the paper.

Model complexity is limited by data resolution
In order to investigate the relationship between parameter identifiability and data resolution, we repeat the profile likelihood calculation on the reduced dataset from Experiment 1, but progressively down-sample the data to reduce temporal resolution by utilising only an equally-spaced subset of the 77 time slices in our data. We observe three different ways in which the profile likelihood curves change qualitatively as data resolution decreases. We present the profile likelihood for one parameter to illustrate each case, and leave the rest of the parameters to Supplementary Material Section D. In the first case, the peak of the profile likelihood curve (i.e. the MLE) remains mostly in the same location, while the curve itself broadens, but nonetheless remains unimodal. This is illustrated by D 0 in the Standard Fisher model, shown in Fig. 6(a). In this case, despite increases in the uncertainty in parameter estimates, the parameter remains practically identifiable even at the lowest data resolution we considered. This is the case for all parameters in the Standard Fisher model, for K in all models, and for D 0 and r in all models except the Generalised Fisher model.
In the second case, as illustrated by the shape parameter γ of the Richards model in Fig. 6(b), the location of the MLE changes dramatically, and the shape of the profile likelihood curve changes in such a way that the confidence interval becomes infinite, making γ non-identifiable when data resolution is low. Interestingly, γ is the only parameter that exhibits this behaviour.
The final case is shown in Fig. 6(c), where the profile likelihood curve for β in the Generalised Fisher model becomes bimodal when data resolution is sufficiently low. D 0 , r, and α in the Generalised Fisher model also exhibit this behaviour.
Theoretically, we expect the width of the confidence interval for a parameter θ, denoted ∆θ, to be proportional to 1/√N, where N is the number of data points. We compare the theoretical and calculated ∆θ's in Fig. 7. Observe that for D 0 in the Standard Fisher model, the calculated ∆D 0 remains close to the theoretical prediction as N decreases, while for the other two parameters, ∆γ and ∆β deviate significantly from the theoretical prediction as N decreases. These observations show that parameter identifiability is limited by data resolution. There is a model-dependent minimum amount of data required for the model to be practically identifiable, and this minimum increases as model complexity increases. Therefore, model selection should be tied to the amount of data available: as a criterion for choosing a model, we should only use models for which we have sufficient data to make them identifiable.
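The 1/√N scaling can be illustrated with a toy profile likelihood, here for the mean of Gaussian observations with known noise level (a hypothetical example for illustration, not one of the paper's models):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # known noise level

def ci_width(N):
    """Width of the 95% profile-likelihood CI for the mean of N
    Gaussian observations. The normalised log-likelihood is
    l(mu) = -N*(mu - xbar)**2 / (2*sigma**2), so the CI is the set
    where l(mu) >= -1.92, with half-width sigma*sqrt(2*1.92/N)."""
    data = rng.normal(5.0, sigma, size=N)
    xbar = data.mean()
    mus = np.linspace(xbar - 2, xbar + 2, 20001)
    loglik = -N * (mus - xbar) ** 2 / (2 * sigma**2)
    inside = mus[loglik >= -1.92]
    return inside[-1] - inside[0]

w_100, w_400 = ci_width(100), ci_width(400)
# Quadrupling N should halve the width, i.e. w_100 / w_400 is close to 2.
```

For the PDE models here, the deviations from this scaling for γ and β signal that the quadratic (parabolic) approximation of the profile likelihood breaks down at low data resolution.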

Discussion and future work
In this paper we have carried out an in-depth study of practical parameter identifiability for four related PDE models of cell invasion using the profile likelihood method. We have shown that, given sufficient data, the univariate profile likelihoods of all model parameters are unimodal, with the corresponding confidence intervals relatively narrow (with the exception of the Richards model for datasets with triangular initial conditions), suggesting that the parameters are practically identifiable. Moreover, results obtained using synthetic data (Supplementary Material Section B) demonstrate that the profile likelihood method is capable of recovering the true parameter values in the absence of model mis-specification.
We explored the effects of several aspects of data quantity and quality on parameter identifiability. First, by comparing the profile likelihoods obtained using the full density profile to the ones obtained using the radially-averaged density profile, we found that the parameter estimates obtained for the Standard Fisher model are approximately the same for the two cases, but differ considerably for the more complicated models. This result suggests that simpler models are more robust against subtle changes in data representation. For such models, it is therefore safe to perform inference using the radially averaged dataset to reduce computational costs, whereas the full dataset should be used for parameter inference for the more complicated models, which are more sensitive to changes introduced by data processing.
We have shown that, for any parameter, the confidence interval obtained for a single experiment is narrow relative to the spread of the estimated values for the same parameter across experiments. This result suggests that a nonlinear mixed effects framework may be appropriate. In this view, each experimental replicate corresponds to a parameter set sampled from a specified distribution. However, significantly more experimental replicates will be required to determine a suitable form for these random effects.
For the more complicated models, we observed a dependency of the parameter estimates on the experimental initial conditions. This is at odds with our assumption that the only mechanisms driving tissue expansion are cell motility and proliferation, and that the only factor modulating these mechanisms is cell density. A likely explanation for this dependency is that mechanical properties of the cell tissue may play a significant role in cell invasion. As such, an interesting avenue for future research entails including mechanical effects at the leading edge in the models.
Another model assumption worth questioning is the independence of observation noise at every point. Reviewing the data, we observe localised bursts of proliferation activity, seemingly at random. This suggests that stochastic models might be more appropriate. However, inference for stochastic models introduces computational challenges, making them unwieldy for many situations. A potentially useful approach is to retain the deterministic model, but assume that the observation noise at points close to each other is correlated, with the magnitude of the noise possibly time-dependent. This approach entails prescribing a correlation kernel with additional parameters to be fitted, which increases complexity, and therefore increases computational cost and likely decreases identifiability, but with the benefit of being more realistic. This approach is discussed in [12] in the context of generalised linear models. Another approach that could be taken is to avoid making assumptions on the noise by replacing the likelihood function (which must be formulated with a noise model in mind) by a generalised profile likelihood function, which relies less on the noise model, but instead must be calibrated by bootstrapping. The details of this approach are discussed in [45].
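A minimal sketch of the correlated-noise idea, assuming a squared-exponential correlation kernel (one common choice; the text above does not prescribe a specific kernel):

```python
import numpy as np

def sq_exp_cov(x, sigma=1.0, ell=0.5):
    """Squared-exponential kernel: observations at nearby points x
    receive correlated noise; ell controls the correlation length."""
    d = x[:, None] - x[None, :]
    return sigma**2 * np.exp(-0.5 * (d / ell) ** 2)

def correlated_loglik(residuals, x, sigma=1.0, ell=0.5, jitter=1e-8):
    """Gaussian log-likelihood of model-data residuals under
    correlated noise. Taking ell -> 0 recovers independent noise."""
    n = len(residuals)
    S = sq_exp_cov(x, sigma, ell) + jitter * np.eye(n)
    sign, logdet = np.linalg.slogdet(S)
    quad = residuals @ np.linalg.solve(S, residuals)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

x = np.linspace(0, 1, 50)
rng = np.random.default_rng(1)
res = rng.normal(0, 1, size=50)
ll_indep = correlated_loglik(res, x, ell=1e-6)  # effectively independent
ll_corr = correlated_loglik(res, x, ell=0.3)    # correlated noise
```

The kernel parameters (here sigma and ell) would be fitted alongside the model parameters, which is the source of the extra computational cost and the potential loss of identifiability noted above.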
Our findings reveal the pivotal impact of data resolution on parameter identifiability. As the amount of available data decreases, the confidence intervals for all parameters widen. However, certain parameters remain practically identifiable even at low data resolution, while others become non-identifiable if data resolution is sufficiently low. This observation indicates that each model has its own minimum data requirement for practical identifiability, which tends to be higher for more complicated models.
Finally, we return to the problem of model selection. For most datasets, AIC and BIC select the same model as each other, which is usually the Richards or the Generalised Fisher model (see Supplementary Material Section C for details). However, as we have seen, these models require more data to be identifiable, and they are also more sensitive to subtle changes in the data introduced by data processing (e.g. full density profile versus radial averaging). Therefore, we argue that model selection methods should also take the amount of available data into account. For the problem considered in this work, more complicated models can be considered if there are sufficient data to render them practically identifiable, whereas the Standard Fisher model should be favoured if the data resolution is low, or if we have only radially averaged data.
In the future, we will extend our investigation to optimise experimental design with the aim of generating the most useful data for identifying model parameters. It would also be worthwhile to extend the scope of the investigation to stochastic models, which can be more suitable for systems with significant process noise. Finally, for cell invasion, it would be interesting to see whether the addition of cell cycle data, such as those obtained via the FUCCI reporter, can enhance parameter identifiability, or whether the additional complexity introduced by including the cell cycle in the model hinders parameter identifiability instead.
Competing interests: We declare no competing interests.

A Numerical methods
This section provides details of the numerical methods for model simulation and calculation of the profile likelihoods.

A.1 Numerical solutions of the PDE models
We use a finite difference method to simulate the general form of the model, given in Eq. (1). For simulations in two spatial dimensions, the size of the domain, corresponding to the size of the image, is L x = L y = 4380 µm. This domain is discretized into n x = 150 by n y = 150 squares, each with side length ∆x = ∆y = 29.2 µm. We used ∆t = 1/30 h.
Let C i,j,k denote C(x i , y j , t k ), where x i = (i − 1)∆x, y j = (j − 1)∆y, t k = (k − 1)∆t are the mesh points. The spatial discretization follows [44], and the discretization in the y direction is completely analogous to that in the x direction. Zero-flux boundary conditions are imposed at x = 0, L x and y = 0, L y . We use an implicit-explicit (IMEX) scheme [2,32] for time-stepping, where the nonlinear diffusion coefficient, D(C), and the proliferation term, f (C), are treated with the explicit Euler method, and the diffusion term overall is treated with the implicit Crank-Nicolson method, which has second-order convergence. The advantage of this scheme is that the explicit treatment of the nonlinear components of the equation allows us to avoid solving a nonlinear root-finding problem at every time step, which would be necessitated by a fully implicit scheme. The implicit treatment of the diffusion term improves the stability of the scheme, and [2] showed that this class of schemes has reasonably low relative errors when the diffusion term is not vanishingly small, which is the case in this work. We have verified that the scheme is convergent by successively halving ∆x or ∆t, recomputing the model solutions with the default parameter values in Eq. (SM.2), and checking that the norm of the difference between successive model solutions decreases almost linearly on a log-log plot with respect to ∆x or ∆t.
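One form of the IMEX Crank-Nicolson scheme consistent with this description is the following sketch (a reconstruction assuming half-point averaging of the nonlinear diffusivity, in the spirit of [44]; only the x-direction operator is written, the y direction being analogous):

```latex
% Sketch of the IMEX Crank--Nicolson time step: diffusivity and
% proliferation frozen at time level k, diffusion operator applied
% to the average of time levels k and k+1.
\[
\frac{C_{i,j,k+1} - C_{i,j,k}}{\Delta t}
  = \frac{1}{2}\Bigl[(\mathcal{L}_k C_{k+1})_{i,j} + (\mathcal{L}_k C_k)_{i,j}\Bigr]
    + f(C_{i,j,k}),
\qquad
\mathcal{L}_k = \mathcal{L}^x_k + \mathcal{L}^y_k,
\]
\[
(\mathcal{L}^x_k C)_{i,j}
  = \frac{1}{\Delta x^2}\Bigl[
      D_{i+\frac{1}{2},j,k}\,(C_{i+1,j} - C_{i,j})
    - D_{i-\frac{1}{2},j,k}\,(C_{i,j} - C_{i-1,j})
    \Bigr],
\qquad
D_{i\pm\frac{1}{2},j,k}
  = D\!\left(\tfrac{1}{2}\bigl(C_{i,j,k} + C_{i\pm 1,j,k}\bigr)\right).
\]
```

Since D(C) and f(C) are evaluated at time level k, each step requires only one linear solve.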
To justify that the discretisation we have chosen is sufficiently fine, let C 1 model denote the model solution computed with ∆x = 29.2 µm and ∆t = 1/30 h, and let C 2 model and C 3 model denote the model solutions computed with ∆x or ∆t halved, respectively. The difference between C 1 model and C 2 model , averaged over all grid points, is 0.448, while that between C 1 model and C 3 model is 4.224, both much smaller than the averaged magnitude of the model solutions, which is on the order of 10 3 . We therefore conclude that the numerical scheme is suitably accurate.
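This convergence check can be sketched in one spatial dimension for the Standard Fisher model, with constant diffusivity (a Python illustration with hypothetical discretisation parameters; the paper's computations were performed in MATLAB on the full 2D problem):

```python
import numpy as np

def solve_fisher_1d(nx=101, L=1.0, T=0.5, dt=1e-3, D=0.01, r=1.0, K=1.0):
    """IMEX Crank-Nicolson solver for the 1D Standard Fisher model
    C_t = D*C_xx + r*C*(1 - C/K), with zero-flux boundaries.
    Diffusion is treated implicitly (Crank-Nicolson); the
    proliferation term is treated explicitly (forward Euler)."""
    dx = L / (nx - 1)
    x = np.linspace(0, L, nx)
    C = np.where(x < 0.2, K, 0.0)  # step initial condition

    # Second-difference operator with zero-flux (reflecting) boundaries.
    Lap = (np.diag(-2.0 * np.ones(nx)) + np.diag(np.ones(nx - 1), 1)
           + np.diag(np.ones(nx - 1), -1))
    Lap[0, 1] = 2.0
    Lap[-1, -2] = 2.0
    Lap *= D / dx**2

    A = np.eye(nx) - 0.5 * dt * Lap  # implicit half of Crank-Nicolson
    B = np.eye(nx) + 0.5 * dt * Lap  # explicit half

    for _ in range(int(round(T / dt))):
        f = r * C * (1.0 - C / K)    # explicit proliferation term
        C = np.linalg.solve(A, B @ C + dt * f)
    return x, C

x, C = solve_fisher_1d()
# Convergence check in the spirit of the text: halve dt and compare.
x2, C2 = solve_fisher_1d(dt=5e-4)
err = np.max(np.abs(C - C2))
```

Halving ∆t changes the solution only slightly, and repeating the halving shows the difference shrinking roughly linearly on a log-log plot, as described above.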

A.2 Optimisation procedure for MLE and profile likelihoods
To solve the optimisation problems for finding the MLEs and evaluating the profile likelihood functions, we use three algorithms, all implemented in MATLAB: the built-in fmincon and globalsearch, and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [14], with the implementation obtained at [15].
The optimisation procedure is initialized with the default parameter values in Eq. (SM.2). We impose the following bounds on the parameters to guide the optimisation procedures: 500 cell/mm 2 < K < 5000 cell/mm 2 , 0 < α, β, η < 3, 0 < γ < 9. (SM.3) We use globalsearch to find the MLEs, and fmincon to evaluate points on the profile likelihood functions. In cases where fmincon struggles to find the true maximum, we use CMA-ES instead.
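The profile-likelihood optimisation loop can be sketched as follows, using a hypothetical exponential-decay model and SciPy's Nelder-Mead in place of the MATLAB routines named above (all parameter values here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
t = np.linspace(0, 5, 40)
true = (2.0, 0.5)  # hypothetical (amplitude a, rate b)
y = true[0] * np.exp(-true[1] * t) + rng.normal(0, 0.2, t.size)

def nll(params):
    """Negative log-likelihood (up to a constant) under i.i.d.
    Gaussian noise with known standard deviation 0.2."""
    a, b = params
    return 0.5 * np.sum((y - a * np.exp(-b * t)) ** 2) / 0.2**2

mle = minimize(nll, x0=[1.0, 1.0], method="Nelder-Mead").x

def profile(b_grid):
    """Profile likelihood for b: for each fixed b, re-optimise the
    nuisance parameter a, then normalise by the value at the MLE."""
    out = []
    for b in b_grid:
        res = minimize(lambda a: nll([a[0], b]), x0=[mle[0]],
                       method="Nelder-Mead")
        out.append(-(res.fun - nll(mle)))
    return np.array(out)

b_grid = np.linspace(0.3, 0.7, 41)
prof = profile(b_grid)
# The 95% CI is the set of b with normalised profile value >= -1.92.
ci = b_grid[prof >= -1.92]
```

Each profile-likelihood point is itself a constrained optimisation over the remaining parameters, which is why a robust global optimiser matters when the landscape is rough.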

B Profile likelihoods for synthetic datasets
In this section, we present the profile likelihoods for each model for two sets of synthetic data. The main purpose of this exercise is to verify that the profile likelihoods behave as expected under ideal conditions. The synthetic data are generated by simulating the model, Eq. (1) of the main text, in one spatial dimension, using the parameter values in Eq. (SM.2), and perturbing by adding Gaussian noise to the model solution. The "low noise" dataset uses σ = 20, while the "high noise" dataset uses σ = 400. In comparison, the σ * estimated from real data ranges between 380 and 460, depending on the dataset and the model.
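The perturbation step can be sketched as follows (Python; C_model below is a hypothetical stand-in for the noiseless model solution, whose magnitude is on the order of 10^3):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_synthetic(model_solution, sigma):
    """Perturb a noiseless model solution with i.i.d. additive
    Gaussian noise of standard deviation sigma, as used for the
    synthetic datasets."""
    return model_solution + rng.normal(0.0, sigma, model_solution.shape)

# Hypothetical stand-in for the 1D model solution over 77 time slices.
C_model = np.full((77, 150), 1500.0)
low_noise = make_synthetic(C_model, sigma=20.0)
high_noise = make_synthetic(C_model, sigma=400.0)
```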
The profile likelihoods for the high noise dataset are presented in Fig. 8, which shows that all profile likelihood curves are unimodal with a finite confidence interval, and the MLEs are close to the true parameter values. For the low noise dataset, the profile likelihood curves are very narrow, and centered almost exactly at the true parameter values. These results verify that the profile likelihoods can recover the true parameter values, at least in a highly idealized case, as theory suggests.
The profile likelihood curves for the parameters of the Richards and Generalised Fisher models tend to be broader compared to those of the Standard Fisher model, which reflects the greater flexibility of the more complicated models to compensate for a change in one parameter value by shifting the other parameter values.

C Inference results for all datasets
In this section, we present the MLE and 95% confidence intervals calculated for all experimental datasets in table format. Recall that we have cell density data from eight experiments, which we refer to as the full datasets. Experiments 1-4 have circular initial conditions, while Experiments 5-8 have triangular initial conditions. For Experiments 1-4 we also consider the radially-averaged datasets. All results are given to four significant figures. We also present the profile likelihoods for Experiments 2-8 (those for Experiment 1 are presented in Fig. 2 and Fig. 3 of the main text).

D Profile likelihoods for down-sampled data
In Fig. 19 we present the profile likelihoods for the down-sampled datasets.

Figure 1 :
Figure 1: Illustration of various stages of the barrier assay. In the first row are simplified representations of the cell culture, where the grey areas represent the regions covered by cells, and the red lines represent the barriers. The second row contains images taken from the experiments. The field of view of the camera is limited to a square region of size 4380 µm by 4380 µm in the centre of the plate, while the plate itself extends further beyond. Since the cells tend to stay well within the field of view, except toward the very end of the experiment, edge effects are not important; the plate is sufficiently large that the cells never reach its edge in any experiment. The third row presents snapshots from simulations of the Standard Fisher model (see Table 1) with the free parameters set to the maximum likelihood estimator (MLE) computed from the corresponding experiments. The colours represent cell density, with black indicating a complete absence of cells, and the cell density increasing as the colour progresses from red to light yellow. (a) Initial condition of Experiment 1, where the initial confinement region is circular. The barrier is removed at t = 0 h. (b) The state of the experiment at t = 20 h. The cells have spread out to cover most of the visible domain. (c,d) The initial condition at t = 0 h and a later stage, t = 20 h, of Experiment 5. The experiment is similar to Experiment 1, except with a triangular barrier.

Figure 2 :
Figure 2: Profile likelihoods using data from Experiment 1. The vertical dashed lines mark the MLE for each parameter, and the black horizontal line at −1.92 marks the threshold for the 95% confidence interval.

Figure 3 :
Figure 3: As per Fig. 2, except using the radially averaged dataset C data (ρ, t). The axes are the same as in Fig. 2. Notice that the profile likelihood curves are broader compared to Fig. 2, and the locations of the MLE for the Porous Fisher, Richards, and Generalised Fisher models are shifted, but remain virtually unchanged for the Standard Fisher model.

Figure 4 :Figure 5 :
Figure 4: Comparison of the 95% confidence intervals for the parameters in the four models obtained using profile likelihoods on the eight experimental datasets. The confidence intervals are represented as vertical bars. The experiments with circular initial conditions (Experiments 1-4) are labelled in yellow, while those with triangular initial conditions (Experiments 5-8) are labelled in blue. Note that some vertical axes do not start from zero, and that the horizontal dashed lines in the panel for γ in the Richards model indicate breaks in the y-axis.

Figure 6 :
Figure 6: Comparison of profile likelihood curves calculated using progressively down-sampled data from Experiment 1, for (a) D 0 in the Standard Fisher model, (b) γ in the Richards model, and (c) β in the Generalised Fisher model. n t denotes the number of time slices in the data we are using. The blue curve corresponds to n t = 77, i.e. the profile likelihood calculated with all available data, with a temporal resolution of ∆t = 20 min. The pink curve corresponds to n t = 20 and ∆t = 80 min, and the red curve corresponds to n t = 3 and ∆t = 12 h. The black horizontal line at −1.92 is the threshold for the confidence interval. Note that in (b), the red curve remains above the threshold as γ → ∞; we chose to truncate the plot at γ = 7 for ease of visualization.

Figure 7 :
Figure 7: The widths of the confidence intervals of three selected parameters plotted against the proportion of data used, for (a) D 0 in the Standard Fisher model, (b) γ in the Richards model, and (c) β in the Generalised Fisher model. Here, using a proportion of 1/m of the data means that only one in every m images was used, so n t = ⌈77/m⌉. The blue circles represent the widths of the confidence intervals calculated by finding the intersection between the profile likelihood curves and the threshold −1.92, while the red line is proportional to 1/√n t and normalized so that it goes through the point representing the case where all data were used.

Figure 8 :
Figure 8: Profile likelihoods for the four models as described in Eq. (1) and Table 1 of the main text, for a synthetic dataset generated with Eq. (1) and parameter values in Eq. (SM.2), perturbed as in Eq. (2) of the main text with σ = 400. The dotted vertical lines mark the location of the true parameter values, while the dashed vertical lines mark the MLE for each parameter. The black horizontal line at −1.92 marks the threshold for the 95% confidence interval. The axis scale for the parameters shared between the models (D 0 , r, K) is kept consistent.

Figure 19 :
Figure 19: Profile likelihoods for the down-sampled datasets. A subset of these were presented in Fig. 6 of the main text.

Table 12 :
Experiment 7, full dataset. Note that for γ in the Richards model, the profile likelihood appears to be monotonically increasing up to the upper bound of γ = 9, which we have imposed for numerical stability; therefore the true MLE is likely to be very large or infinite.

Table 13 :
Experiment 8, full dataset. Similar observations for γ as in Experiment 7.