Journal of The Royal Society Interface
Open AccessResearch article

Mechanistic description of spatial processes using integrative modelling of noise-corrupted imaging data

Published:https://doi.org/10.1098/rsif.2018.0600

    Abstract

    Spatial patterns are ubiquitous on the subcellular, cellular and tissue level, and can be studied using imaging techniques such as light and fluorescence microscopy. Imaging data provide quantitative information about biological systems; however, mechanisms causing spatial patterning often remain elusive. In recent years, spatio-temporal mathematical modelling has helped to overcome this problem. Yet, outliers and structured noise limit modelling of whole imaging data, and models often consider spatial summary statistics. Here, we introduce an integrated data-driven modelling approach that can cope with measurement artefacts and whole imaging data. Our approach combines mechanistic models of the biological processes with robust statistical models of the measurement process. The parameters of the integrated model are calibrated using a maximum-likelihood approach. We used this integrated modelling approach to study in vivo gradients of the chemokine (C-C motif) ligand 21 (CCL21). CCL21 gradients guide dendritic cells and are important in the adaptive immune response. Using artificial data, we verified that the integrated modelling approach provides reliable parameter estimates in the presence of measurement noise and that bias and variance of these estimates are reduced compared to conventional approaches. The application to experimental data allowed the parametrization and subsequent refinement of the model using additional mechanisms. Among other results, model-based hypothesis testing predicted lymphatic vessel-dependent concentration of heparan sulfate, the binding partner of CCL21. The selected model provided an accurate description of the experimental data and was partially validated using published data. Our findings demonstrate that integrated statistical modelling of whole imaging data is computationally feasible and can provide novel biological insights.

    1. Introduction

    In the past decades, our understanding of biological processes has been revolutionized by imaging technologies. Nowadays, super-resolved fluorescence microscopy [1], light sheet fluorescence microscopy [2], cryo-electron microscopy [3] and other technologies provide information about cell and tissue structures over a broad range of scales. Multiplexed information about intracellular processes is, for instance, provided by matrix-assisted laser desorption/ionization imaging mass spectrometry [4] and mass cytometry [5]. These imaging data are analysed using tailored image processing pipelines to quantify properties of interest (see [6] and references therein). This provides detailed information about the imaged system, e.g. biological tissues. Yet, mechanisms often remain elusive; for instance, it is usually not evident from imaging data how the observed spatial patterns are established and controlled. However, such insights are necessary to improve the understanding of complex biological systems [7,8].

    Model-based approaches have been introduced to unravel the mechanisms underlying the spatio-temporal organization of tissues [9,10]. Partial differential equation (PDE) models and agent-based models which capture static and dynamic properties of tissue-scale images have been developed [1013]. These models can describe the underlying biological mechanism and allow for the evaluation of competing biological hypotheses.

    Modelling and hypothesis testing, however, mostly employ qualitative information [10] or summary statistics [1215]. Qualitative information is used due to limited image quality caused, among other factors, by limitations of labelling methods. Summary statistics are considered as they are easy to assess using available processing pipelines. Although qualitative abstractions and summary statistics provide only a fraction of the information encoded in the images, they are widely used. A key reason is the use of sequential analysis approaches (figure 1a) which exploit established image processing pipelines.

    Figure 1.

    Figure 1. Illustration of data-driven modelling in image-based systems biology. (a) Sequential analysis relying on image processing and extracted features. (b) Integrated modelling approach combining image processing, information retrieval and modelling in a single step. (Online version in colour.)

    In this paper, we propose an integrated modelling approach for imaging data (figure 1b). The proposed framework combines image processing with the mechanistic description of the biochemical process using PDE models, instead of performing these steps sequentially. To account for outliers and structured measurement noise, e.g. signals generated by biological processes not considered in the model, we employ concepts from robust regression [16,17]. The integrated modelling approach facilitates the simultaneous assessment of the quality of the imaging data, the filtering of outliers and artefacts, and the mechanistic modelling of the biological process. As this integrated framework circumvents preprocessing and the extraction of summary statistics, it avoids a potential information loss and provides a tailored, unbiased filtering. By avoiding the tuning of parameters in the preprocessing, the approach furthermore simplifies the workflow and promises an improved reproducibility of analysis results.

    We implemented the integrated modelling approach and assessed it by studying artificial and experimental data for the formation of gradients of the chemokine (C-C motif) ligand 21 (CCL21), a process relevant in the immune response. Using this process, we demonstrate the loss of information associated with the use of summary statistics as well as the influence of structured noise on estimation results. Subsequently, we demonstrate how the integrated modelling framework facilitates the direct use of noise-corrupted whole imaging data. We exploit the integrated framework to generate novel hypotheses regarding the underlying biochemistry, which are partially validated using data from the literature.

    2. Methods

    In this paper, we present an integrated modelling approach for tissue-scale imaging data. In the following, we outline the considered modelling approaches, data types, and inference methods.

    2.1. Mechanistic model of spatio-temporal biological processes

    We consider spatio-temporal biological processes described by reaction–diffusion equations—a class of PDE models. Reaction–diffusion equations are widely used in systems and computational biology, for instance, to capture the dynamics of intra- and extracellular substances [18].

    The state variable u(x,t)Rn of the PDE model is the abundance of n chemical substances (e.g. their concentrations) at time t and spatial location xΩ. The state is defined on the modelled spatial domain Ω and changes due to diffusion and biochemical reactions. The Laplace operator is denoted by , the matrix of diffusion coefficients by D(θ) and the reaction term by f(u(x, t), x, θ). The unknown parameters in the matrix of diffusion coefficients and the reaction term are denoted by θ. This yields the PDE model

    tu(x,t)D(θ)u(x,t)=f(u(x,t),x,θ),2.1
    with initial condition u(x, 0) = u0(x, θ) and boundary conditions defined on the boundary ∂Ω of the spatial domain Ω, e.g. Dirichlet or Neumann boundary conditions. The initial condition and boundary conditions can also depend on the unknown parameters θ. Unknown parameters are, for instance, binding affinities and degradation rates.

    2.2. Statistical modelling of imaging data

    We consider standard image acquisition technologies which provide intensity averages over pixels (or voxels). The spatial domain of the jth pixel is denoted by Ωj, j = 1, …, np. This yields the observation model

    yij(t)=Ωjhi(u(x,t),x,θ)dx,i=1,,ny,j=1,,np,2.2
    in which yij(t) denotes the average intensity of the ith observable, i = 1, …, ny, in pixel j at time point t. The function hi describes the dependence of the ith observable, e.g. the intensity of a fluorescence probe, on the state variables. As multiple labellings can be combined, e.g. fluorescence probes with different frequency spectra, the number of observables per pixel, ny, can be larger than one. For biological systems which equilibrate fast, only the stationary distribution might be observed (t → ∞). A typical observation in imaging is the measurement of the relative abundance of a biochemical species, yielding hi(u(x, t), x, θ) = s(ul(x, t) + b) with scaling factor s, background b and concentration ul(x, t) of the lth biochemical species. Saturation effects, unequal elimination, cross-reactivity of antibodies and many other effects can be modelled using the function h.

    The intensity values of individual pixels, yij(tk), are corrupted by experimental noise, providing the measured pixel intensities yijkm. In many applications, the measurement noise is assumed to be independent and identically distributed, e.g. multiplicative log-normally distributed measurement noise,

    yijkm=yij(tk)ϵijkwithϵijklogN(0,σijk2),2.3
    with time points tk, k = 1, …, nk. However, the assumption of independent and identically distributed measurement noise is often violated as additional structure is present [19,20]. Labelling artefacts or other biological processes which alter the measured intensities result in spatially structured noise. Adjacent pixels exhibit often similar noise levels and regions of high noise might also possess particular shapes. While this is known, a noise model capturing these effects is currently not available. In the following sections, we propose methods to address such structured noise.

    The collection of all imaging data is in the remainder denoted by D. Furthermore, the unknown observation parameters, i.e. scaling and background, and noise levels are included in the parameter vector θ.

    2.3. Reconstruction of biological processes from imaging data

    To achieve a mechanistic understanding of spatio-temporal biological processes, we want (i) to infer the parameters of model (2.1) and (ii) to perform model selection to compare competing hypotheses. To address these problems, we consider three alternative statistical approaches:

    Direct approach: The presence of outliers is disregarded and the model is fitted to the data using standard noise models (2.3).

    Filtering approach: The measurement data are preprocessed to detect and remove outliers. The model is fitted to the remaining data using standard noise models (2.3).

    Integrated modelling approach: A statistical model for the outlier distribution is formulated. From outlier and noise distribution a likelihood function is derived and used to simultaneously fit the model and quantify the noise level.

    In the following, these approaches are described in further detail.

    2.3.1. Direct approach

    The likelihood of observing the imaging data D given the parameter vector θ is

    p(D|θ)=i=1nyk=1nkj=1nppn(yijkm|yij(tk)),2.4
    in which pn(yijkm|yij(tk)) denotes the noise model for an individual pixel and yij(tk) denotes the parameter-dependent solution of the model (2.1) and (2.2) [21]. For multiplicative log-normally distributed measurement noise (2.3), we obtain
    pn(yijkm|yij(tk))=12πσijkyijkmexp{12(logyijkmlogyij(tk)σijk)2}.2.5
    The likelihood function (2.4) is formulated using the measured intensity values of individual pixels, yijkm, as data points. Alternatively, summary statistics of the pixel intensities can be considered. In the application of gradient formation discussed later, the average intensity as a function of the distance from the nearest vessel is used [22].

    2.3.2. Filtering approach

    To reduce the impact of outliers and structured noise on the estimation results, image data are preprocessed. We consider filtering methods which provide an index set of filtered pixels, Fik{1,,np}, for the individual observables yi and time points tk. These index sets are masks for regions to be excluded from the objective function. Accordingly, the likelihood is only evaluated for the unfiltered pixels, meaning that for the index j in (2.4) only the set j{1,,np}Fik is considered. Appropriate filtering should render parameter estimation more robust against outliers and structured noise.

    Filtering can be performed using a variety of algorithms, most of which possess several tuning parameters which have to be chosen manually or in a semi-automated fashion. The choice of algorithm and tuning parameters depends on the type of structured noise. To remove bright spots from the image, maximally stable extremal region (MSER) filtering [23] can be employed. MSER filtering is based on a water shedding mechanism and has been used successfully in a series of studies (e.g. work by Buggenthin et al. [24]).

    2.3.3. Integrated modelling approach

    We propose to circumvent the selection of filtering algorithms and the manual tuning of filtering parameters by integrating filtering and parameter estimation. Our integrated modelling approach requires a sufficiently flexible statistical model, ideally accounting for standard measurement noise, structured noise and outliers as well as spatial correlation structure (see §2.2). In this study, we follow ideas from robust regression, i.e. ε-contamination models [25], to address these needs. We assume that the intensity measurement for each pixel is with probability wo an outlier/artefact generated by structured noise and with probability 1 − wo no outlier. The outliers are assumed to be distributed according to the density function po while the remaining points are distributed according to the standard noise model (2.3). This yields the likelihood function

    p(D|θ)=k=1nki=1nyj=1np((1wo)pn(yijkm|yij(tk))+wopo(yijkm|yij(tk))).2.6
    Different outlier distributions can be used given the biological application and the imaging technique. Here, we consider the outliers to be log-normally distributed with location parameter logpo(yijkm|yij(tk)+μo and scale parameter σo,
    po(yijkm|yij(tk))=12πσoyijkmexp{12(logyijkm(logyij(tk)+μo)σo)2}.2.7
    The parameters wo, μo and σo ensure the flexibility of the statistical model. This can reduce the bias introduced by measurement artefacts compared to using only the standard noise models (wo = 0). The inclusion of these additional parameters in the parameter vector θ allows for the simultaneous calibration of the models for the biological and the measurement processes.

    Conceptually, integrated statistical modelling weights the impact of a data point on the model fit, while the standard filtering approach employs a hard cut-off. The weighting depends on the model–data agreement in different regions of the image, providing a context-dependent filter.

    2.4. Parameter estimation and model selection

    The analysis of measurement data D using the different statistical approaches requires the estimation of the parameters θΘRnθ. For this, we use maximum-likelihood (ML) estimation. The ML estimate of the parameter vector, θ^, is the solution of the PDE-constrained optimization problem

    maximizeθΘlogp(D|θ)subject to(2.1)and(2.2)}2.8
    with log-likelihood function logp(D|θ). The log-likelihood function varies between approaches while the models of the biological process (2.1) and the measured intensities (2.2) remain the same.

    Optimization problem (2.8) is usually nonlinear and can possess multiple local optima. To determine the global optimum, we employ a multi-start local optimization method. The starting points are sampled from Θ using latin hypercube sampling. For local optimization, an interior point algorithm is used, which is supplied with gradients computed using forward sensitivity equations. This multi-start approach is computationally efficient and reliable for a broad range of applications [26,27]. Instead of multi-start local optimization, also evolutionary and genetic algorithms [28], particle swarm optimizers [29] or hybrid optimizers [30] could be employed. For a comprehensive survey and evaluations, we refer to the work of Moles et al. [31] and Raue et al. [26].

    The parameter estimates are usually subject to uncertainty due to limited and noise-corrupted data. We determine the uncertainty of the estimated parameters using structural and practical identifiability analysis. For practical identifiability, profile likelihoods are computed [32,33], which provide parameter confidence intervals to particular confidence levels. For profile likelihood calculation, we use the methods recently described for parameter estimation problems with PDE constraints [34].

    Biological processes are still poorly understood and there are usually competing hypotheses giving rise to different model structures. To assess the plausibility of hypotheses, we use the Bayesian information criterion (BIC) [35]. The BIC accounts for model–data mismatch and the complexity of the model, measured by the negative log-likelihood and number of parameters nθ, respectively. It is defined as

    BIC=2logp(D|θ)+nθlog(nD),2.9
    with number of data points nD=nknynp. Models with lower BIC values are preferable and a difference of greater than or equal to 10 is considered as substantial [36]. Model comparison using BIC and other statistical approaches assumes that all models consider the same dataset. As the filtering approach excludes data points, a comparison between approaches using model selection is not possible. We use model selection merely to compare model alternatives fitted using the same statistical approach.

    2.5. Implementation

    All methods are implemented in matlab and available as electronic supplementary material, Code S1. The simulation of the PDE model is implemented using the Partial Differential Equation Toolbox of matlab. The multi-start local optimization exploits the matlab routine fmincon.m. Parameter estimation and uncertainty analysis are performed using the Parameter EStimation TOolbox (PESTO) available on GitHub (https://github.com/ICB-DCM/PESTO) [37].

    3. Results

    In the following, we will illustrate the reliability achieved using whole imaging data and spatial summary statistics, and compare direct, filtering and integrated modelling approaches for statistical inference from imaging data. For this purpose, we studied artificial imaging data, for which the ground truth is known, as well as experimental imaging data, from which new biological insights are gained.

    3.1. Biological process

    We studied the distribution of CCL21 in dermal interstitium (figure 2a). CCL21 gradients facilitate the delivery of antigens to the lymph nodes by guiding mature dendritic cells (figure 2b) [38]. Inside the lymph nodes, mature dendritic cells present the antigens to T-cells, initiating the adaptive immune response.

    Figure 2.

    Figure 2. Model for CCL21 gradient formation and experimental data. (a) Schematic of model for CCL21 gradient formation, including diffusive and immobilized CCL21. (b) Illustration of mature dendritic cells guided by a gradient of immobilized CCL21. (c) Experimental data for immobilized CCL21 and lymphatic vessels. CCL21 immunostaining is colour-coded and the outlines of the lymphatic vessels (light grey lines), which were determined using an additional staining, are indicated. These data were collected and provided by Weber et al. [22] and we refer to the original publication for details on materials and methods.

    The formation of the CCL21 gradients and their biological functions are relatively well understood and experimentally verified [22]. It is known that soluble chemokine CCL21 is secreted at the lymphatic vessels, and it is assumed that from there it diffuses into the dermal interstitium. Furthermore, it has been established that CCL21 binds to heparan sulfate proteoglycan, resulting in immobilized CCL21 which guides the migratory dendritic cells. However, the quantitative properties of the individual processes and the detailed mechanisms remain to be analysed. In addition, the available imaging data (figure 2c) are corrupted by structured noise (see discussion below), rendering the analysis challenging and the process well suited for the evaluation of the proposed approaches.

    3.2. Mathematical model and experimental data

    We modelled the dynamics of the concentrations of soluble CCL21 u1(x, t), of heparan sulfate u2(x, t) and of heparan sulfate–CCL21 dimers u3(x, t) by a system of PDEs [21] with the two-dimensional (2D) spatial coordinate xΩR2. The PDE model accounted for

    the secretion of soluble CCL21 with rate α from L lymphatic vessels,

    the diffusion of soluble CCL21 with diffusion coefficient D,

    the degradation of soluble CCL21 with rate constant γ,

    the binding of soluble CCL21 to heparan sulfate with rate constant k1, and

    the unbinding of CCL21 from heparan sulfate with rate constant k−1,

    and we assumed no flux conditions at the boundaries. The spatial location of the lth lymphatic vessel is marked by the indicator function ql(x), which is zero outside and one inside the lymphatic vessel. This yields the spatial domains covered by lymphatic vessels, Ωl = {xΩ | ql(x) = 1}, l = 1, …, L, which we refer to as lymphatic vessel masks. Mathematically, we obtained the evolution equation

    u1(x,t)tDu1(x,t)=αlql(x)k1u1(x,t)u2(x,t)+k1u3(x,t)γu1(x,t)u2(x,t)t=k1u1(x,t)u2(x,t)+k1u3(x,t)u3(x,t)t=k1u1(x,t)u2(x,t)k1u3(x,t)}3.1
    on xΩ with initial conditions
    xΩ:u1(x,0)=0,u2(x,0)=s0(x)and u3(x,0)=0,3.2
    and boundary conditions
    xΩ:u1ν=0.3.3
    Ω denotes the boundary of Ω and ν denotes its normal vector. As heparan sulfate u2(x, t) and heparan sulfate–CCL21 dimers u3(x, t) are not subject to spatial transport, there are no respective boundary conditions. The heparan sulfate concentration was assumed to be homogenous, s0(x) = S0, unless mentioned otherwise. Furthermore, s0(x) denotes the overall concentration of heparan sulfate, including heparan sulfate proteoglycan.

    Weber et al. [22] succeeded in measuring the in vivo gradients of immobilized CCL21 u3(x, t) in mouse ear sheets by immunostraining. Therefore, mouse ear sheets were incubated with CCL21 antibody and imaged using confocal microscopy. This yielded the 2D images depicted in figure 2c. As the experiments were performed in unperturbed tissue, the images provide the equilibrium distributions. Accordingly, the experimental readout is

    yj=Ωjh(u(x),x,θ)dx,3.4
    with pixel index j, observation function h(u(x), x, θ) = s(u3(x) + b), and u(x) solving (3.1) with ui/t = 0, i = 1, …, 3. The background b models the spatially homogeneous unspecific binding of the CCL21 antibody. The measured pixel intensities are semi-quantitative, requiring the introduction of the scaling constant s. Scaling constant and background were estimated along with the kinetic parameters. In addition to immobilized CCL21, Weber et al. [22] assessed the lymphatic vessel masks Ωl, l = 1, …, L, in the same mouse ear sheets by straining the lymphatic vessel endothelial hyaluronan receptor 1, providing the basis for the simulation of realistic tissue structures. We assumed that the scaling s and the lymphatic vessel masks Ωl, l = 1, …, L, differ between images whereas background b and mechanistic parameters remain identical. For details on the experimental set-up, we refer to the Material and Methods section of [22].

    As the absolute concentration of CCL21 was unknown and merely the equilibrium distribution of immobilized CCL21 was measured, the parameters (α, D, γ, k1, k−1, S0, s, b)T were structurally non-identifiable (see [39] for definition). To circumvent this, we reformulated the model in terms of the parameters (D/γ, (αk1)/(γk−1), s S0, b)T. For details on the reparametrization, we refer to electronic supplementary material, text S2.

    The visual inspection of the imaging data revealed a high level of immobilized CCL21 associated with the lymphatic vessel, which was in agreement with the model. However, there were also high intensity spots outside the lymphatic vessels (figure 2c), which were not explained by the aforementioned processes. As in fixed tissues the immunostaining performed by Weber et al. [22] labels intracellular and extracellular CCL21 [40], these spots are most probably caused by previously reported CCL21 expressing cells [41]. As intracellular CCL21 does not contribute to the extracellular distribution of CCL21 described by model (3.1), the spots should be considered as structured noise and disregarded in the parameter estimation. This rendered the modelling problem appropriate for the evaluation of the integrated modelling approach.

    3.3. Integrative modelling approach outperforms conventional methods on artificial experimental data

    In this section, we assess the properties of different image-based modelling using artificial imaging data. This allows us to evaluate the accuracy with which the true parameter vector is recovered using (i) whole imaging data versus a summary statistic and (ii) direct approach versus filtering approach versus integrated modelling approach.

    3.3.1. Generation of artificial data

    We derived artificial imaging data closely resembling the experimentally observed images to ensure a realistic test scenario. The spatial structure of the images was conserved by using the measured lymphatic vessel masks and selecting model parameters which roughly reproduced the experimentally observed CCL21 distributions. The employed parameters are provided in electronic supplementary material, text S2 and table S1. The structured measurement noise was captured by extracting relevant features of the high-intensity spots. Firstly, the high-intensity spots were detected using MSER filtering [23] using an implementation by Nistér & Stewenius [42]. Secondly, the identified spots were analysed to obtain the distributions of spot shape parameters and sizes (figure 3a). Given these distributions, the artificial data were obtained by simulating the model for the selected parameters and adding a varying number of spots with properties sampled from the measured distribution. For simplicity, the spots were assumed to be ellipsoidal. A representative artificial image is depicted in figure 3c. While the artificial data do not capture the full complexity of experimental data, they facilitate the evaluation of the approaches.

    Figure 3.

    Figure 3. Pipeline for the generation of artificial imaging data. (a) Raw and filtered images. Outlines of spots identified using MSER filtering are indicated. (b) Distribution of size and shape parameters of spots. Histogram and points in scatter plot indicate the information extracted using filtering. The lines indicate the densities used for the generation of the artificial data. (c) Artificial data obtained by simulation of the model (3.1) and subsequent addition of spots and measurement noise.

    In addition to the artificial imaging data, we generated artificial summary statistics. Here, we considered the distance-dependent average intensity of immobilized CCL21. To calculate this summary statistic for the artificial data, the minimal distance to the next lymphatic vessel is computed for each pixel. Subsequently, the intensity values of all pixels with the same distance are averaged. The particular summary statistic was chosen because (1) it was used in the paper by Weber et al. [22] to analyse the considered dataset and (2) it is similar to spatial summary statistics used in other image-based modelling projects [12,14,43,44].

    3.3.2. Detailed spatial information improves estimation accuracy

    Given the artificial datasets, we first asked how much information the raw imaging data contain in comparison to the summary statistic computed from them. To study this, we employed the 2D model (3.1), as well as a one-dimensional (1D) model approximating the distance-dependent average CCL21. The 1D model was included as the use of summary statistics and simplified process description often goes hand in hand [1214]. Overall, we considered three set-ups:

    (i)

    Fitting of the distance-dependent average CCL21 intensity using the 1D model.

    (ii)

    Fitting of the distance-dependent average CCL21 using the 2D model accounting for the measured vessel topology.

    (iii)

    Fitting of the CCL21 imaging data using the 2D model accounting for the measured vessel topology.

    The 1D model employed in set-up (i) is a simplified version of model (3.1) with x ∈ [0, L] denoting the distance from the lymphatic vessel. The secretion at the lymphatic vessel (at x = 0) is modelled via the boundary condition u1/x|x=0 = α. The diffusion and reaction dynamics stay the same. For details on the 1D model, we refer to the electronic supplementary material, text S2. Set-ups (ii) and (iii) employed model (3.1) with the measured lymphatic vessel mask. To study the relevance of detailed spatial information, we considered artificial data without outliers and structured noise but with independent and identically distributed measurement noise, i.e. multiplicative log-normally distributed measurement noise. The signal-to-noise ratio, which is the mean signal intensity divided by the standard deviation of the noise, was approximately 6.

    As the considered artificial data contain neither outliers nor structured noise, we employed the direct approach for statistical modelling. Parameter optimization and uncertainty analysis for set-ups (i)–(iii) were performed using multi-start local optimization and profile likelihood methods, respectively. All parameters were constrained to a regime spanning at least four orders of magnitude (see electronic supplementary material, text S2, table S1).

    The analysis of a representative artificial dataset revealed that for set-ups (i) and (ii) a good agreement with the summary statistic was achieved (figure 4a,b), while for set-up (iii) a good agreement with the imaging data was obtained (figure 4c). Indeed, although the artificial data were generated using a 2D model with a (non-trivial) experimentally observed lymphatic vessel geometry, the 1D model provides an accurate fit of the summary statistics for all but small distances from the vessel. The estimated parameters in set-up (i) were however far from the true parameters. This was among other reasons due to practical non-identifiabilities of the parameters (αk1)/(γk−1) and S0 (figure 4d). While the individual parameters are non-identifiable, their product is practically identifiable. The same phenomenon was observed for set-up (ii) (figure 4d,e), implying that modelling the underlying spatial structure did not improve the information extraction substantially. By contrast, for set-up (iii), all parameter estimates were close to the true parameter and practically identifiable (figure 4d). Thus, not the summary statistic but the whole imaging data should be used as they allow for more accurate parameter estimation. While the use of more informative summary statistics might resolve the problems, it is not clear whether such summary statistics exist and how they can be constructed a priori.

    Figure 4.

    Figure 4. Parameter estimation results for the 1D and 2D models using extracted features and whole imaging data. (a) Fitting result for the 1D model using the distance-dependent average intensity of immobilized CCL21 (1D data). (b) Fitting result for the 2D model using the distance-dependent average intensity of immobilized CCL21 (1D data). (c) Fitting result for the 2D model using the measured intensity distribution of immobilized CCL21 (2D data). (d,e) Profile likelihood derived confidence intervals for parameter estimates for set-ups (i)–(iii). The horizontal line marks the true parameter and the vertical bars represent the confidence intervals corresponding to different confidence levels (75%, 90% and 99%), with non-identifiable parameters indicated by ‘n.i.’. (e) Profile likelihood derived confidence intervals for the product of the parameters which are non-identifiable for set-ups (i) and (ii).

    3.3.3. Integrated modelling approach yields more accurate and robust results than conventional methods

    As whole imaging data are strongly influenced by outliers and structured noise, we compared the accuracy of parameter estimates obtained using the direct approach, the filtering approach and the integrated statistical approach. We considered artificial imaging data with 0 to 620 bright spots and evaluated 30 datasets to obtain robust statistics.

    Our analysis revealed that for artificial datasets with a large number of spots, fits obtained using the direct approach overestimated the concentration of immobilized CCL21 outside the spots while filtering and the integrated approach provided consistent results (figure 5a). Apparently, the direct approach could not explain the structured noise and the bimodal distribution of the residual r=(logymlogy(θ^))/σ (figure 5b). The filtering of points and the integrated modelling resulted in a more consistent statistical description.

    Figure 5.

    Figure 5. Parameter estimation results for images with structured noise. (a) Spatial structure of residuals and (b) observed (histogram) and modelled (line) residual distribution for representative dataset with 320 spots. The results for the simulation of the 2D model with parameter estimates obtained using (top) the direct approach, (middle) the filtering approach, and (bottom) the integrated modelling approach are depicted. Pixels disregarded by the filtering approach are coded green (histogram). The estimated outlier distribution in the integrated modelling approach is depicted in blue (line). (c) Mean parameter estimation error (black line) 95% percentile interval (area) of the direct approach, the filtering approach and the integrated modelling approach for different spot numbers computed using 30 artificial datasets.

    The analysis of the estimation indicated that for low numbers of spots, the filtering approach and the integrated modelling approach yielded almost the same results as without spots while the direct approach already possessed a bias and a large variance (figure 5c). For medium and high numbers of spots, the integrated modelling yielded the smallest estimation error. The improvement of the integrated modelling approach over the alternative approaches was statistically significant (p-value < 0.01; Welch’s paired-sample one-sided t-test) for (αk1)/(γk−1), S0 and σ2 for numbers of spots greater than or equal to 160. This was the case although (a) the filter approach employed the same MSER filter settings used to obtain the spot statistics—this parameter setting appeared to be ideal—and (b) the integrated modelling approach did not account for the spatial structure of measurement noise. Indeed, the integrated modelling approach yielded almost unbiased results. Thus, integrated noise modelling provided robust parameter estimates from imaging data corrupted with the considered type of structured noise.

    In conclusion, our analysis of artificial data suggests that mechanistic modelling of spatial processes should be based on detailed imaging data rather than some spatial summary statistic with unknown information content. Additionally, filtering but even more so the proposed integrated modelling approach can provide robust estimates in the presence of structured noise and outliers.

    3.4. Integrative modelling approach predicts lymphatic vessel-dependent heparan sulfate concentration

    Given the positive results for artificial data, we used the integrated modelling approach on whole imaging data to analyse experimental data for CCL21 gradient formation. Among other things, we asked whether the current assumption of uniform heparan sulfate concentration is appropriate or alternative mechanisms need to be considered.

    3.4.1. Model-based image analysis reveals limitation of a literature-based model

    We employed model (3.1) with uniform heparan sulfate concentration, s0(x) = S0, to describe the imaging data collected by Weber et al. [22]. This model was based on available information in the literature (e.g. [22]) and suggested by experts in the field. As the no-flux boundary conditions (3.3) are presumably not precisely met in the biological system, we disregarded pixels which are within 40 µm of the boundary for the calculation of the objective function (2.6). This depth was chosen based on preliminary estimates for the diffusion length from the summary statistic (electronic supplementary material, text S2, figure S1) and retrospectively validated given the fitting results.

    The fitting results for the model with uniform heparan sulfate concentration for a representative image with multiple lymphatic vessels are depicted in figure 6. For this image, the comparison of experimental data and the fitting results (figure 6b,c) revealed that—as expected—bright spots outside lymphatic vessels are not captured. However, there were also larger regions in the image with substantial disagreement. In particular, in lymphatic vessels 1 and 2, the concentration of immobilized CCL21 was overestimated. Accordingly, the residuals were not uncorrelated but show a clear spatial structure (figure 6d), resulting in a pronounced tail in the residual distribution (figure 6e). This indicated that the model with uniform heparan sulfate concentration might be too simple.

    Figure 6.

    Figure 6. Analysis of model with uniform heparan sulfate concentration. (a) Spatial location of lymphatic vessels. The individual vessels are colour-coded. (b) Experimental data for immobilized CCL21. The difference in the concentration of immobilized CCL21 between lymphatic vessels is indicated along with the presence of spots. (c) Simulation results for immobilized CCL21. The maximum-likelihood estimate for the model with uniform heparan sulfate concentration is depicted. Differences between lymphatic vessels are not captured. Spots are filtered using the integrated statistical model. (d) Residuals of experimental data and simulation results. The low measured intensities in lymphatic vessels 1 and 2 are not captured by the model. (e) Observed residual distribution and distributions of unstructured and structured noise indicated by the integrative modelling approach.

    3.4.2. Mathematical modelling supports hypothesis of vessel-dependent heparan sulfate concentration

    As the detailed analysis of the whole imaging data revealed limitations of the literature-based model, we evaluated possible model refinements. In addition to the hypothesis underlying the model presented in the previous section:

    (1)

    uniform heparan sulfate concentration, s0(x) = S0,

    we considered two alternative hypotheses:

    (2)

    Different heparan sulfate concentrations in lymphatic vessels and the tissue, s0(x)=ST+(SLST)lql(x).

    (3)

    Different heparan sulfate concentrations in individual lymphatic vessels and the tissue, s0(x)=ST+l(SL,lST)ql(x).

    These hypotheses yield models 1–3 which are illustrated in figure 7a. We employed the integrated modelling approach to train the models 1–3 on all 9 images recorded by Weber et al. [22], namely image 1 to image 4 and image 12 to image 16. The optimization converged robustly (figure 7b) and the fitting results for different images are provided in electronic supplementary material, text S2, table S2–2.

    Figure 7.

    Figure 7. Comparison of three different hypotheses for the CCL21 distributions. (a) Schematics of models 1–3. (b) Multi-start optimization results for the model alternatives, indicating a unique global optimum for each model. (c) Results of model comparison using BIC for each image and the overall fit. The differences of BIC values for different models are colour-coded, with each model being assigned one colour (model 1, purple; model 2, cyan; and model 3, red). (d) Experimental data for the spatial distribution of immobilized CCL21 and simulation results for model 3. (e) Experimental data for the distance-dependent intensity of immobilized CCL21 and simulation results for models 1–3.

    For the individual images as well as for the overall dataset, model 3 was substantially better than models 1 and 2 (figure 7c). Model 3 provided a good agreement with the imaging data (figure 7d). Furthermore, the prediction of differences in the heparan sulfate concentrations between individual lymphatic vessels is consistent with experimental data indicating different levels of extracellular CCL21 in collecting and initial lymphatics [40]. Thus, our model-based analysis provided a mechanistic hypothesis which we were able to partially validate using published results.

    To conclude, in this section, we verified the applicability of the integrated modelling approach to experimental imaging data including structured noise. We employed the statistical approach for model-based data analysis and hypothesis testing, thereby providing new insights into the CCL21 gradient formation and dendritic cell guidance. Notably, all models achieved an equally good fit for the summary statistic (figure 7e). This implies that the information content of the considered summary statistic is too limited for model selection and confirms that models should be rather based on the whole imaging data.

    4. Discussion

    Imaging data are widely used to assess biological processes. In many studies, the richness of imaging data is, however, disregarded and they are merely used to derive and evaluate simple summary statistics. We illustrated that this can result in a considerable loss of information. While summary statistics are often sufficient to draw qualitative conclusions, our analyses suggest that quantitative mechanistic models should be trained using whole imaging data—wherever possible—to exploit their richness. Accordingly, quantitative mechanistic models of spatio-temporal processes should be used instead of simplified models, e.g. 1D models, describing summary statistics. This avoids a loss of information and can improve identifiability.

    The model-based analysis of imaging data facilitates the unravelling of novel mechanisms and the comparison of competing hypotheses [9,10,13]. However, this is often demanding and error-prone if structured noise and outliers are present. To address this problem, we introduced an integrated approach for the statistical and mechanistic modelling of imaging data. The integrated modelling approach employs a flexible statistical model with additional parameters. This enables it to cope with intensity distributions arising in the presence of structured noise and outliers. Conceptually, the integrated modelling approach can be interpreted as a direct approach with a more suited model of the measurement noise. For the considered problems, the integrated modelling approach yields similar or better results than conventional sequential methods. Even without knowledge of the precise structure of the noise, the method was able to reduce estimation bias and variance compared to direct and filtering approaches, providing more reliable parameter estimates. The finding that the integrated modelling approach outperforms the filtering approach, which uses information about spot properties, is very promising and hints towards its true potential.

    To evaluate the properties of the integrated modelling approach, we studied CCL21 gradient formation. We established the first quantitative mathematical model of CCL21 gradients measured in tissue. Using experimental data, we quantified the estimation error of different models and performed model selection. Among other results, we found indications that the heparan sulfate concentration is vessel dependent. This finding relies on the mechanistic description of the imaging data we proposed. Simple statistical models are not sufficient as the heparan sulfate concentration is not observed directly. The vessel-dependence can influence the gradient formation and cell guidance and might be relevant in some disease conditions [45]. Furthermore, it demonstrates that integrated modelling approaches might reveal novel information from available data and can help to unravel causal factors. While CCL21 gradient formation is a specific example, the principle of sugar-mediated immobilization in gradient formation is observed for many (extracellular) signalling molecules, including growth factors, cytokines and selected hormones. This renders the proposed model and analysis approach interesting for a large number of research projects.

    In this study, we proposed a simple statistical model for outliers and structured noise, and used it for the inference of PDE models. In principle, the statistical model could be used in combination with other types of mechanistic spatio-temporal models, including agent-based models and hybrid discrete–continuum models [13]. A further improvement of the integrated modelling approach could be achieved by considering more tailored statistical models. The correlation of noise in neighbouring pixels could be considered and even sophisticated segmentation methods, e.g. graph-based segmentation approaches [46], might be incorporated in a likelihood framework. Extension in this direction and towards image regression could improve robustness and applicability further. In addition, the use of time series data—which is supported by the approach—will facilitate the extraction of dynamic features and improve structural and practical identifiability.

    As an alternative to the proposed frequentists method, Bayesian methods could be used to incorporate prior knowledge on the model parameters. In recent years, approximate Bayesian computation (ABC) methods [47,48] became popular and were also used to model spatial processes [13,49]. However, ABC methods employ summary statistics and we are not aware of a study using whole imaging data, which proved necessary in our analysis. Furthermore, ABC methods require accurate noise models [50], such as a simple statistical model for outliers and structured noise, and are often computationally demanding.

    In conclusion, mechanistic understanding and rigorous hypothesis testing in biology require the formulation of mathematical and computational models. For cellular processes, this led to the development of modelling and estimation toolboxes, e.g. Data2Dynamics [51], which support the simultaneous inference of kinetic parameters and measurement noise. We illustrated that such a simultaneous inference is also feasible for the case of spatial models, which are usually more challenging. We illustrated parameter optimization, uncertainty analysis and model selection for PDE models of noise-corrupted imaging data. We expect that the proposed concept and algorithms are well suited for a broad range of applications, including scenarios with time-resolved measurements and time-dependent domains [10]. This is also facilitated by the availability of the matlab code, simplifying reuse and extensions of the methods. Accordingly, this study will contribute to the mechanistic description of spatial processes.

    Data accessibility

    Data, code and results files are provided as electronic supplementary material at: doi:10.5281/zenodo.1202475.

    Authors' contributions

    F.J.T. and J.H. conception and design of the study; M.S. supply of the imaging data; S.H. and J.H. model development and statistical inference; S.H., M.S. and J.H. interpretation of the results; S.H. and J.H. drafting of the manuscript. All authors revised the manuscript critically for important intellectual content, and provided final approval for publication.

    Competing interests

    The authors have no competing interests.

    Funding

    The project was funded by the German Ministry of Education and Research (BMBF) in the SYS-Stomach project (grant no. 01ZX1310B).

    Acknowledgements

    The authors thank Anna Fiedler and Dantong Wang for fruitful discussions and proof-reading of the manuscript.

    Footnotes

    Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.4324007.

    Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

    References