Information sensitivity functions to assess parameter information gain and identifiability of dynamical systems

A new class of functions, called 'information sensitivity functions' (ISFs), which quantify the information gain about the parameters through the measurements/observables of a dynamical system, is presented. These functions can be computed through classical sensitivity functions alone and are based on Bayesian and information-theoretic approaches. While marginal information gain is quantified by the decrease in differential entropy, correlations between arbitrary sets of parameters are assessed through mutual information. For individual parameters, these information gains are also presented as marginal posterior variances and, to assess the effect of correlations, as conditional variances when other parameters are given. The easy-to-interpret ISFs can be used to (a) identify time intervals or regions in dynamical system behaviour where information about the parameters is concentrated; (b) assess the effect of measurement noise on the information gain for the parameters; (c) assess whether sufficient information is available in an experimental protocol (input, measurements and their frequency) to identify the parameters; (d) assess correlation in the posterior distribution of the parameters to identify the sets of parameters that are likely to be indistinguishable; and (e) assess identifiability problems for particular sets of parameters.


Introduction
Sensitivity analysis [1] has been widely used to determine how the parameters of a dynamical system influence its outputs. When one or more outputs are measured (observed), it quantifies the variation of the observations with respect to the parameters to determine which parameters are most and least influential towards the measurements. Therefore, when solving the inverse problem of estimating the parameters from the measurements, sensitivity analysis is widely used to fix the least influential parameters (as their effect on the measurements is insignificant and removing them reduces the dimensionality of the inverse problem) while focussing on estimation of the most influential parameters. Sensitivity analysis is also used to assess the question of parameter identifiability, i.e. how easy or difficult it is to identify the parameters from the measurements. This is primarily based on the idea that if the observables are highly sensitive to perturbations in certain parameters then these parameters are likely to be identifiable, and if the observables are insensitive then the parameters are likely to be unidentifiable. However, the magnitude of the sensitivities is hard to interpret, except in the trivial case when the sensitivities are identically zero. Lastly, parameter identifiability based on sensitivity analysis also assesses correlation/dependence between the parameters (through principal component analysis [2], the correlation method [3], the orthogonal method [4] and the eigenvalue method [5]) to identify which pairs of parameters, owing to high correlation, are likely to be indistinguishable from each other (also see [6] and the references therein). Another method to assess correlations is based on the Fisher information matrix [6-8], which can be derived from asymptotic analysis of nonlinear least-squares estimators [9,10].
Ashyraliyev & Blom [11] suggested that a singular value decomposition of the Fisher information matrix can be used to identify linear combinations of parameters that can be well identified given the observables and measurement noise. Another class of methods to assess identifiability, proposed by Raue et al. [12-14], is based on exploiting the curvature of the likelihood function or the flatness of the profile likelihood, i.e. the minimization of the likelihood with respect to all parameters but one. Li & Vu [15,16] proposed that pairwise and higher-order correlations between the parameters may be identified by assessing linear dependencies between the columns of the sensitivity matrix [16] or the matrix of first-order partial derivatives of the state equations [15]. Thomaseth & Cobelli extended the classical sensitivity functions to 'generalized sensitivity functions' (GSFs), which assess information gain about the parameters from the measurements. This method has been widely used to assess identifiability of dynamical systems [10,17-20], where regions of high information gain show a sharp increase in the GSFs while oscillations imply correlation with other parameters. There are two drawbacks of GSFs: first, they are designed to start at 0 and end at 1, which leads to the so-called 'force-to-one' phenomenon, where even in the absence of information about the parameters the GSFs are forced to end at a value of 1; and second, oscillations in GSFs can be hard to interpret in terms of identifying which sets of parameters are correlated. Based on a purely information-theoretic approach, Pant & Lombardi [21] proposed to compute information gain through a decrease in Shannon entropy, which alleviated the shortcomings of GSFs. However, since their method relies on a Monte Carlo type computation, the computational effort associated with the information gains can be quite large.
In this article, a novel method which combines the approach of Pant & Lombardi [21] with classical sensitivity functions to compute information gain about the parameters is presented. The new functions are collectively called 'information sensitivity functions' (ISFs); they assess parameter information gain through sensitivity functions alone, thereby eliminating the need for Monte Carlo runs. These functions (i) are based on Bayesian/information-theoretic methods and do not rely on asymptotic analysis; (ii) are monotonically non-decreasing and therefore do not oscillate; (iii) can assess regions of high information content for individual parameters; (iv) can assess parameter correlations between arbitrary sets of parameters; (v) can reveal potential problems in the identifiability of system parameters; (vi) can assess the effect of the experimental protocol on the inverse problem, for example, which outputs are measured, the associated measurement noise and the measurement frequency; and (vii) are easily interpretable.
In what follows, the theoretical developments are presented first in §§2-8, followed by their application to three different dynamical systems in §9. The three examples are chosen from different areas of the mathematical biosciences: (i) a Windkessel model, which is a widely used boundary condition in computational fluid dynamics simulations of haemodynamics; (ii) the Hodgkin-Huxley model of a biological neuron, which has formed the basis for a variety of ionic models describing excitable tissues; and (iii) a kinetics model for the influenza A virus.

The dynamical system and sensitivity equations
Consider the following dynamical system governed by a set of parametrized ordinary differential equations (ODEs):

\[ \frac{dx}{dt} = f(x, \theta, t), \qquad x(t_0) = x_0, \tag{2.1} \]

where $t$ represents time, $x \in \mathbb{R}^d$ is the state vector, $\theta \in \mathbb{R}^p$ is the parameter vector, the function $f : \mathbb{R}^{d+p+1} \to \mathbb{R}^d$ represents the dynamics and $x_0$ represents the initial condition at time $t_0$. The initial conditions may depend on the parameters, and therefore

\[ x(t_0) = x_0(\theta). \tag{2.2} \]

The above representation subsumes the case where the initial condition may itself be seen as a parameter. The RHS of the dynamical system, equation (2.1), can be linearized at a reference point $(x_r, \theta_r, t_r)$ to obtain

\[ f(x, \theta, t) \approx f(x_r, \theta_r, t_r) + \frac{\partial f}{\partial x}\bigg|_r (x - x_r) + \frac{\partial f}{\partial \theta}\bigg|_r (\theta - \theta_r), \tag{2.3} \]

where $(\cdot)|_r$ represents $(\cdot)$ evaluated at the reference point. Henceforth, in order to be concise, the explicit dependence of $f(x, \theta, t)$ on its arguments is omitted and $f$, without any arguments, is used to denote $f(x, \theta, t)$. Following this notation, equation (2.3) is concisely written as

\[ f \approx f|_r + \nabla_x f|_r\, (x - x_r) + \nabla_\theta f|_r\, (\theta - \theta_r). \tag{2.4} \]

The above linearization will be used in the next section to study the evolution of the state covariance matrix with time. Let $S \in \mathbb{R}^{d \times p}$ denote the matrix of sensitivity functions for the system in equation (2.1):

\[ S = \frac{\partial x}{\partial \theta}. \tag{2.5} \]

It is well known that $S$ satisfies the following ODE system, which can be obtained by applying the chain rule of differentiation to equation (2.1):

\[ \frac{dS}{dt} = \nabla_x f\, S + \nabla_\theta f. \tag{2.6} \]

The goal is to relate the evolution of the sensitivity matrix to the evolution of the covariance of the joint vector of the state and the parameters. Let the subscript $n$ denote all quantities at time $t_n$; for example, $x_n$ denotes the state vector at time $t_n$, $S_n$ the corresponding sensitivity matrix, and so on. To relate the sensitivity matrix $S_{n+1}$ at time $t_{n+1}$ with $S_n$, a first-order discretization of equation (2.6) is considered:

\[ \frac{S_{n+1} - S_n}{\Delta t} = \nabla_x f|_n\, S_n + \nabla_\theta f|_n, \tag{2.7} \]

and, therefore, the matrix product $S_{n+1} S_{n+1}^T$ can be written as

\[ S_{n+1} S_{n+1}^T = \big( (I_d + \nabla_x f|_n \Delta t)\, S_n + \nabla_\theta f|_n\, \Delta t \big)\, \big( (I_d + \nabla_x f|_n \Delta t)\, S_n + \nabla_\theta f|_n\, \Delta t \big)^T. \tag{2.8} \]

Next, it is hypothesized that under certain conditions $S_{n+1} S_{n+1}^T$ can be seen as the covariance matrix of the state vector at time $t_{n+1}$. These developments are presented in the next two sections.
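As a minimal numerical sketch of equations (2.6)-(2.7), consider the scalar system $dx/dt = -\theta x$ (an illustrative choice, not one of the models studied in this article), for which the sensitivity $S = \partial x/\partial \theta$ admits the closed form $S(t) = -t\,x_0\,e^{-\theta t}$ against which the forward Euler recursion can be checked:

```python
import numpy as np

# Toy system dx/dt = -theta*x (d = p = 1): the sensitivity S = dx/dtheta
# satisfies dS/dt = (df/dx) S + df/dtheta = -theta*S - x, cf. eq. (2.6).
# Analytically, x(t) = x0*exp(-theta*t) and S(t) = -t*x0*exp(-theta*t).
def propagate(theta=0.5, x0=2.0, dt=1e-4, T=1.0):
    x, S = x0, 0.0                       # S(t0) = 0: x0 independent of theta here
    for _ in range(int(T / dt)):
        x_new = x + (-theta * x) * dt    # forward Euler for the state
        S = S + (-theta * S - x) * dt    # eq. (2.7) specialized to d = p = 1
        x = x_new
    return x, S

x_num, S_num = propagate()
```

With this step size the forward Euler iterates agree with the closed-form state and sensitivity to roughly first-order accuracy in $\Delta t$.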

Forward propagation of uncertainty
As the objective is to study the relationship between the parameters and the state vector, a joint vector of all the state vectors until the current time $t_n$ and the parameter vector is considered. Assume that at time $t_n$ this joint vector $[x_n^T, x_{n-1}^T, \ldots, x_0^T, \theta^T]^T$ is distributed according to a multivariate Normal distribution as follows:

\[ [x_n^T, x_{n-1}^T, \ldots, x_0^T, \theta^T]^T \sim \mathcal{N}(\mu_n, \Sigma_n), \qquad \Sigma_n = \begin{bmatrix} \Sigma_{n,n} & \cdots & \Sigma_{n,0} & \Lambda_{n,\theta} \\ \vdots & \ddots & \vdots & \vdots \\ \Sigma_{0,n} & \cdots & \Sigma_{0,0} & \Lambda_{0,\theta} \\ \Lambda_{n,\theta}^T & \cdots & \Lambda_{0,\theta}^T & \Sigma_{\theta,\theta} \end{bmatrix}, \tag{3.1} \]

where $\Sigma_{i,j}$ denotes the covariance between $x_i$ and $x_j$, and $\Lambda_{i,\theta}$ the covariance between $x_i$ and $\theta$. To obtain the joint distribution of $[x_{n+1}^T, x_n^T, \ldots, x_0^T, \theta^T]^T$ (all the state vectors until time $t_{n+1}$ and the parameter vector), the linearized dynamical system, equation (2.4), is used. Considering the reference point $(x_r, \theta_r, t_r)$ in equation (2.4) to be $(\mu_{x_n}, \mu_\theta, t_n)$, i.e. considering the linearization around the mean values of the parameter vector and the state at time $t_n$, one obtains

\[ \frac{dx}{dt} = f|_n + \nabla_x f|_n\, (x_n - \mu_{x_n}) + \nabla_\theta f|_n\, (\theta - \mu_\theta) + \text{higher-order terms}. \tag{3.2} \]

Ignoring the higher-order terms and employing a forward Euler discretization, one obtains

\[ x_{n+1} \approx x_n + f|_n \Delta t + \nabla_x f|_n\, (x_n - \mu_{x_n})\, \Delta t + \nabla_\theta f|_n\, (\theta - \mu_\theta)\, \Delta t. \tag{3.3} \]

Remark 3.1. $x_{n+1}$ is completely determined by $x_n$ and $\theta$, i.e. given $x_n$ and $\theta$ nothing more can be learned about $x_{n+1}$. Hence, the forward propagation forms a Markov chain.

Remark 3.2. $f|_n$, $\nabla_x f|_n$ and $\nabla_\theta f|_n$ are evaluated at $(\mu_{x_n}, \mu_\theta, t_n)$.

Equation (3.3) is an affine transformation of $x_n$ and $\theta$,

\[ x_{n+1} \approx (I_d + \nabla_x f|_n \Delta t)\, x_n + \nabla_\theta f|_n \Delta t\, \theta + C_n, \tag{3.4} \]

where $I_q$ represents an identity matrix of size $q$, $O_{q,r}$ represents a zero matrix of size $q \times r$, and

\[ C_n = f|_n \Delta t - \nabla_x f|_n\, \mu_{x_n} \Delta t - \nabla_\theta f|_n\, \mu_\theta \Delta t \tag{3.5} \]

is a term that does not depend on $x_n$ and $\theta$. The distribution of $[x_{n+1}^T, x_n^T, \ldots, x_0^T, \theta^T]^T$ can be written from equation (3.4) as

\[ [x_{n+1}^T, x_n^T, \ldots, x_0^T, \theta^T]^T \sim \mathcal{N}(\mu_{n+1}, \Sigma_{n+1}) \tag{3.6} \]

and the covariance $\Sigma_{n+1}$ can be expanded as

\[ \Sigma_{n+1} = \begin{bmatrix} \Sigma_{n+1,n+1} & \Sigma_{n+1,n} & \cdots & \Sigma_{n+1,0} & \Lambda_{n+1,\theta} \\ \Sigma_{n,n+1} & \Sigma_{n,n} & \cdots & \Sigma_{n,0} & \Lambda_{n,\theta} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \Sigma_{0,n+1} & \Sigma_{0,n} & \cdots & \Sigma_{0,0} & \Lambda_{0,\theta} \\ \Lambda_{n+1,\theta}^T & \Lambda_{n,\theta}^T & \cdots & \Lambda_{0,\theta}^T & \Sigma_{\theta,\theta} \end{bmatrix}, \tag{3.7} \]

where

\[ \Sigma_{n+1,n+1} = \big( (I_d + \nabla_x f|_n \Delta t)\, \Sigma_{n,n} + \nabla_\theta f|_n\, \Lambda_{n,\theta}^T\, \Delta t \big)\, (I_d + \nabla_x f|_n \Delta t)^T + \big( \nabla_\theta f|_n\, \Sigma_{\theta,\theta}\, \Delta t + (I_d + \nabla_x f|_n \Delta t)\, \Lambda_{n,\theta} \big)\, \nabla_\theta f|_n^T\, \Delta t, \tag{3.8} \]

\[ \Lambda_{n+1,\theta} = (I_d + \nabla_x f|_n \Delta t)\, \Lambda_{n,\theta} + \nabla_\theta f|_n\, \Sigma_{\theta,\theta}\, \Delta t \tag{3.9} \]

and

\[ \Sigma_{n+1,j} = (I_d + \nabla_x f|_n \Delta t)\, \Sigma_{n,j} + \nabla_\theta f|_n\, \Lambda_{j,\theta}^T\, \Delta t \qquad \text{for } 0 \le j \le n. \tag{3.10} \]

If the above evolution of the covariance matrix can be related to the evolution of the sensitivity matrix, as presented in §2 and equation (2.8), then the dependencies between the state vector and the parameters can be studied. This concept is developed in the next section.

Relationship between sensitivity and forward propagation of uncertainty
In this section, the relationship between the evolution of the sensitivity matrix and the evolution of the covariance matrix of the joint distribution between all the state vectors until time $t_n$ and the parameters is developed. Equation (3.8) can be expanded as follows:

\[ \Sigma_{n+1,n+1} = \Sigma_{n,n} + \nabla_x f|_n\, \Sigma_{n,n}\, \Delta t + \nabla_\theta f|_n\, \Lambda_{n,\theta}^T\, \Delta t + \Sigma_{n,n}\, \nabla_x f|_n^T\, \Delta t + \nabla_x f|_n\, \Sigma_{n,n}\, \nabla_x f|_n^T\, \Delta t^2 + \nabla_\theta f|_n\, \Lambda_{n,\theta}^T\, \nabla_x f|_n^T\, \Delta t^2 + \nabla_\theta f|_n\, \Sigma_{\theta,\theta}\, \nabla_\theta f|_n^T\, \Delta t^2 + \Lambda_{n,\theta}\, \nabla_\theta f|_n^T\, \Delta t + \nabla_x f|_n\, \Lambda_{n,\theta}\, \nabla_\theta f|_n^T\, \Delta t^2. \tag{4.1} \]

Comparing this expansion term by term with the expansion of $S_{n+1} S_{n+1}^T$ in equation (2.8), the two recursions coincide if

\[ \Sigma_{n,n} = S_n S_n^T, \tag{4.2} \]

\[ \Lambda_{n,\theta} = S_n \tag{4.3} \]

and

\[ \Sigma_{\theta,\theta} = I_p. \tag{4.4} \]

Under these conditions, equation (3.9) reduces to $\Lambda_{n+1,\theta} = (I_d + \nabla_x f|_n \Delta t)\, S_n + \nabla_\theta f|_n\, \Delta t$, which, as evident from equation (2.7), is the standard forward propagation of the sensitivity matrix. Hence

\[ \Sigma_{n+1,n+1} = S_{n+1} S_{n+1}^T \tag{4.5} \]

and

\[ \Lambda_{n+1,\theta} = S_{n+1}. \tag{4.6} \]

Finally, the term $\Sigma_{n+1,n}$ from equation (3.10) can be written as

\[ \Sigma_{n+1,n} = S_{n+1} S_n^T. \tag{4.7} \]

From equations (4.5), (4.6) and (4.7), it can be concluded that if the initial prior uncertainty in $[x_0^T, \theta^T]^T$ is assumed to be Gaussian with covariance

\[ \Sigma_0 = \begin{bmatrix} S_0 S_0^T & S_0 \\ S_0^T & I_p \end{bmatrix}, \tag{4.8} \]

then the joint vector of $\theta$, the parameters, and $[x_n^T, x_{n-1}^T, \ldots, x_0^T]^T$, the state vectors corresponding to time instants $[t_0, t_1, \ldots, t_n]$, can be approximated, by considering only the first-order terms after linearization, to be a Gaussian distribution with the following covariance:

\[ \Sigma_n = \begin{bmatrix} S_n S_n^T & S_n S_{n-1}^T & \cdots & S_n S_0^T & S_n \\ S_{n-1} S_n^T & S_{n-1} S_{n-1}^T & \cdots & S_{n-1} S_0^T & S_{n-1} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ S_0 S_n^T & S_0 S_{n-1}^T & \cdots & S_0 S_0^T & S_0 \\ S_n^T & S_{n-1}^T & \cdots & S_0^T & I_p \end{bmatrix}. \tag{4.9} \]

The mean vector of the state propagates according to equation (3.3), essentially according to the forward Euler method. While this propagated mean does not directly influence the posterior uncertainty of the parameters, which depends only on the covariance matrix, it is important to note that the sensitivity terms in the covariance matrix of equation (4.9) are evaluated at the propagated means. The required conditions presented in equations (4.2), (4.3) and (4.4) can also be derived without temporal discretization of the sensitivity and linearized forward model. This is presented in appendix A, which presents a differential equation describing the evolution of the joint covariance matrix, leading to the conditions derived above without temporal discretization. Even though the method presented in appendix A may be considered more general, the author first conceived the idea using the arguments shown above, and hence these ideas are presented in the main text.
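The term-by-term agreement between the covariance recursion of equations (3.8)-(3.9) and the sensitivity recursion of equation (2.7), under the conditions (4.2)-(4.4), can be checked numerically on a linear test system $f(x, \theta) = A x + B \theta$; the matrices $A$ and $B$ below are arbitrary illustrative values:

```python
import numpy as np

# For f(x, theta) = A x + B theta, grad_x f = A and grad_theta f = B are
# constant. Starting from Sigma_{0,0} = S_0 S_0^T, Lambda_{0,theta} = S_0 and
# Sigma_theta,theta = I (conditions (4.2)-(4.4)), the covariance recursion
# (3.8)-(3.9) must track the sensitivity recursion (2.7) exactly.
rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(2, 2))
B = rng.normal(size=(2, 2))
dt, steps = 1e-3, 500
I2 = np.eye(2)

S = np.zeros((2, 2))     # S_0 = 0: initial condition independent of the parameters
Sig = np.zeros((2, 2))   # Sigma_{0,0} = S_0 S_0^T
Lam = np.zeros((2, 2))   # Lambda_{0,theta} = S_0
for _ in range(steps):
    F = I2 + A * dt      # I_d + grad_x f * dt
    # eq. (3.8) with Sigma_theta,theta = I (uses the old Sig and Lam):
    Sig = F @ Sig @ F.T + (F @ Lam) @ B.T * dt + B @ Lam.T @ F.T * dt + B @ B.T * dt**2
    Lam = F @ Lam + B * dt          # eq. (3.9)
    S = F @ S + B * dt              # eq. (2.7)
```

After any number of steps the propagated blocks satisfy `Sig == S @ S.T` and `Lam == S` up to floating-point rounding, which is the identity claimed in equations (4.5)-(4.6).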

Measurements (observations)
Having established how the covariance of the state and the parameters evolves in relation to the sensitivity matrix, the next task is to extend this framework to include the measurements. Eventually, one wants to obtain an expression for the joint distribution of the measurements and the parameters, so that conditioning this joint distribution on the measurements (implying that measurements are known) will yield information about how much can be learned about the parameters.
Consider a linear observation operator where $y_n \in \mathbb{R}^m$ is measured at time $t_n$ according to

\[ y_n = H_n x_n + \epsilon_n, \tag{5.1} \]

where $H_n \in \mathbb{R}^{m \times d}$ is the observation operator at time $t_n$ and $\epsilon_n$ is the measurement noise. Let $\epsilon_n$ be independently (across all measurement times) distributed as

\[ \epsilon_n \sim \mathcal{N}(O_m, \Upsilon_n), \tag{5.2} \]

where $O_m$ is a zero vector and $\Upsilon_n$ is the covariance structure of the noise. From equations (4.9) and (5.1), it is easy to see that $[y_n^T, y_{n-1}^T, \ldots, y_0^T, \theta^T]^T$ follows a Gaussian distribution whose mean is obtained by applying $H_i$ to the propagated state means, and whose covariance has the blocks

\[ \mathrm{cov}(y_i, y_j) = H_i\, S_i S_j^T\, H_j^T + \delta_{ij}\, \Upsilon_i \qquad \text{and} \qquad \mathrm{cov}(y_i, \theta) = H_i S_i, \tag{5.3} \]

where $\delta_{ij}$ denotes the Kronecker delta.

Conditional distribution of the parameters
The quantity of interest is the conditional distribution of the parameters, i.e. how the beliefs about the parameters are changed from the prior beliefs to the posterior beliefs (the conditional distribution) by the measurements. More than the mean of the conditional distribution, the covariance is of interest. This is due to two reasons: (i) owing to the Gaussian approximations, the covariance entirely reflects the amount of uncertainty in the parameters; and (ii) while the mean of the conditional distribution depends on the measurements, the covariance does not. The latter is significant because, a priori, the measurement values are not known. Consequently, the average (over all possible measurements) uncertainty in the parameters too is independent of the measurements, and hence can be studied a priori. From equation (5.3), since the joint distribution of the parameter vector and the observables is Gaussian, the conditional distribution of the parameter vector given the measurements is also Gaussian and can be written as

\[ p(\theta \mid z_n^o) = \mathcal{N}(m_n, C_n), \tag{6.1} \]

with, from the standard Gaussian conditioning formulae,

\[ m_n = \mu_\theta + \Sigma_{\theta z}\, \Omega_n^{-1}\, (z_n^o - \mu_{z_n}) \tag{6.2} \]

and

\[ C_n = \Sigma_{\theta,\theta} - \Sigma_{\theta z}\, \Omega_n^{-1}\, \Sigma_{\theta z}^T, \tag{6.3} \]

where $z_n^o = [{y_n^o}^T, \ldots, {y_0^o}^T]^T$ collects the measurement values $y_i^o$ (the realizations of the random variables $y_i$ observed at $t_i$), and $\Omega_n$ and $\Sigma_{\theta z}$ are, respectively, the covariance of $z_n$ and the cross-covariance between $\theta$ and $z_n$, both given by equation (5.3). Note that the conditional covariance $C_n$ is independent of these measurement values $y_i^o$. Furthermore, since the uncertainty in a Gaussian random variable, quantified by the differential entropy, depends only on the covariance matrix, the uncertainty in the posterior distribution does not depend on the measurements.
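The construction of the joint covariance from the sensitivities, equation (5.3), followed by Gaussian conditioning, equation (6.3), can be sketched as follows; the sensitivity matrices, observation operator and noise covariance are arbitrary illustrative values:

```python
import numpy as np

# Posterior covariance of the parameters via Gaussian conditioning, eq. (6.3).
# d = 2 states, p = 2 parameters, m = 1 measurement, 5 measurement times;
# the prior covariance of theta is I_p, per eq. (4.4)/(4.8).
rng = np.random.default_rng(1)
p, d, n_obs = 2, 2, 5
S_list = [rng.normal(size=(d, p)) for _ in range(n_obs)]   # stand-in sensitivities S_i
H = np.array([[1.0, 0.0]])                 # observe the first state only
Y = np.array([[0.5]])                      # measurement-noise covariance Upsilon

G = np.vstack([H @ S for S in S_list])     # cross-covariance cov(z_n, theta), eq. (5.3)
Omega = G @ G.T + np.kron(np.eye(n_obs), Y)  # cov(z_n): H_i S_i S_j^T H_j^T + delta_ij Y
C_n = np.eye(p) - G.T @ np.linalg.solve(Omega, G)  # conditional covariance, eq. (6.3)
```

As noted above, `C_n` involves only the sensitivities, the observation operator and the noise covariance; the measured values themselves never enter.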

Conditional covariance when n → ∞ and when n is finite
For the asymptotic case when $n \to \infty$, it can be shown that (for proof see appendix B)

\[ \lim_{n \to \infty} C_n = M^{-1}, \tag{7.1} \]

where $M$ is the Fisher information matrix defined as

\[ M = \sum_{i=0}^{n} S_i^T H_i^T \Upsilon_i^{-1} H_i S_i. \tag{7.2} \]

Furthermore, for finite $n$, the conditional covariance can be written as (for proof see appendix C)

\[ C_n = \left( I_p + \sum_{i=0}^{n} S_i^T H_i^T \Upsilon_i^{-1} H_i S_i \right)^{-1}. \tag{7.3} \]

Remark 7.1. Equations (7.1) and (7.3) relate to the classical and Bayesian Cramér-Rao bounds [22,23], respectively, in estimation theory.
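The relationship between the Bayesian expression (7.3) and the classical bound $M^{-1}$ can be illustrated numerically: as positive semi-definite information terms accumulate, $(I_p + M)^{-1}$ approaches $M^{-1}$, because the prior contribution $I_p$ becomes negligible relative to $M$. The rank-one terms below are arbitrary stand-ins for the increments $S_i^T H_i^T \Upsilon_i^{-1} H_i S_i$:

```python
import numpy as np

# Each rank-one term g g^T stands in for one increment S_i^T H_i^T Y_i^-1 H_i S_i
# of the Fisher matrix M in eq. (7.2); the draws are arbitrary illustrative values.
rng = np.random.default_rng(2)
terms = [g[:, None] @ g[None, :] for g in rng.normal(size=(200, 2))]

def rel_gap(M):
    """Relative difference between (I + M)^-1 (eq. (7.3)) and M^-1 (eq. (7.1))."""
    bayes = np.linalg.inv(np.eye(2) + M)
    classical = np.linalg.inv(M)
    return np.linalg.norm(bayes - classical) / np.linalg.norm(classical)

gaps = [rel_gap(sum(terms[:n])) for n in (5, 50, 200)]  # gap shrinks as data accrue
```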

Information gain
In this section, the gain in information about the parameters by the measurements is considered. For details of such an information-theoretic approach the reader is referred to [21]. The gain in information about the parameter vector $\theta$ by the measurements of $z_n = [y_n^T, y_{n-1}^T, \ldots, y_0^T]^T$ is given by the mutual information between $z_n$ and $\theta$, which is equal to the difference between the differential entropies of the prior distribution $p(\theta)$ and the conditional distribution $p(\theta \mid z_n)$. From equations (4.8), (6.1) and (6.3), this gain in information can be written as

\[ \mathcal{I}_n = \frac{1}{2} \ln\!\left( \frac{\det(I_p)}{\det(C_n)} \right) = -\frac{1}{2} \ln \det(C_n), \tag{8.1} \]

where $\det(\cdot)$ denotes the determinant. The above can be expanded through equation (7.3) as

\[ \mathcal{I}_n = \frac{1}{2} \ln \det\!\left( I_p + \sum_{i=0}^{n} S_i^T H_i^T \Upsilon_i^{-1} H_i S_i \right). \tag{8.2} \]

Note that the above represents the information gain for the joint vector of all the parameters. Commonly, one is interested in individual parameters, for which the information gain is now presented. Let $\theta^{\{S\}}$ denote the vector of a subset of parameters indexed by the elements of set $S$ and $\theta^{\{S^c\}}$ denote the vector of the remaining parameters, the complement of set $S$. Hence, $\theta^{\{i\}}$ denotes the $i$th parameter, $\theta^{\{i,j\}}$ denotes the vector formed by taking the $i$th and $j$th parameters, and so on. The inverse of the conditional covariance matrix $C_n$ can be decomposed into the components of $\theta^{\{S\}}$ and $\theta^{\{S^c\}}$ as

\[ C_n^{-1} = \begin{bmatrix} A_n^{\{S,S\}} & A_n^{\{S,S^c\}} \\ A_n^{\{S^c,S\}} & A_n^{\{S^c,S^c\}} \end{bmatrix} \tag{8.3} \]

with, from equation (7.3),

\[ A_n^{\{U,V\}} = \delta_{UV}\, I + \sum_{i=0}^{n} (S_i^{\{U\}})^T H_i^T \Upsilon_i^{-1} H_i S_i^{\{V\}}, \qquad U, V \in \{S, S^c\}, \tag{8.4} \]

where $s$ is the cardinality of $S$, and $S_i^{\{S\}} \in \mathbb{R}^{d \times s}$ and $S_i^{\{S^c\}} \in \mathbb{R}^{d \times (p-s)}$ are the sub-matrices of $S_i$ formed by the columns indexed by $S$ and $S^c$, respectively. Given the above decomposition, the marginal covariance of $\theta^{\{S\}}$ given the measurements can be written through the Schur complement of $A_n^{\{S^c,S^c\}}$ in $C_n^{-1}$ as follows:

\[ C_n^{\{S\}} = \left( A_n^{\{S,S\}} - A_n^{\{S,S^c\}}\, (A_n^{\{S^c,S^c\}})^{-1}\, A_n^{\{S^c,S\}} \right)^{-1} \tag{8.5} \]

and the information gain $\mathcal{I}_n^{\{S\}}$ as

\[ \mathcal{I}_n^{\{S\}} = -\frac{1}{2} \ln \det\big( C_n^{\{S\}} \big). \tag{8.6} \]

Another quantity of interest is the correlation between two subsets of parameters $\theta^{\{S\}}$ and $\theta^{\{W\}}$. In an information-theoretic context, this can be assessed by how much more information would be gained about the parameters $\theta^{\{S\}}$, in addition to $\mathcal{I}_n^{\{S\}}$, if $\theta^{\{W\}}$ were also known, i.e. the mutual information between $\theta^{\{S\}}$ and $\theta^{\{W\}}$ given the measurements. Similarly to the procedure employed in equation (8.4), by splitting $C_n^{-1}$ into three components for $\theta^{\{S\}}$, $\theta^{\{W\}}$ and $\theta^{\{R\}}$, with $R = (S \cup W)^c$, one can write the conditional covariance $C_n^{\{S \mid W\}}$ of the parameters $\theta^{\{S\}}$ given the measurements and, additionally, the parameters $\theta^{\{W\}}$ as follows:

\[ C_n^{\{S \mid W\}} = \left( A_n^{\{S,S\}} - A_n^{\{S,R\}}\, (A_n^{\{R,R\}})^{-1}\, A_n^{\{R,S\}} \right)^{-1}, \tag{8.7} \]

where $w$ is the cardinality of $W$ and the blocks are defined through

\[ C_n^{-1} = \begin{bmatrix} A_n^{\{S,S\}} & A_n^{\{S,W\}} & A_n^{\{S,R\}} \\ A_n^{\{W,S\}} & A_n^{\{W,W\}} & A_n^{\{W,R\}} \\ A_n^{\{R,S\}} & A_n^{\{R,W\}} & A_n^{\{R,R\}} \end{bmatrix} \tag{8.8} \]

and

\[ A_n^{\{U,V\}} = \delta_{UV}\, I + \sum_{i=0}^{n} (S_i^{\{U\}})^T H_i^T \Upsilon_i^{-1} H_i S_i^{\{V\}}, \qquad U, V \in \{S, W, R\}. \tag{8.9} \]

The information gain $\mathcal{I}_n^{\{S \mid W\}}$ about the parameters $\theta^{\{S\}}$ given both the measurements and the parameters $\theta^{\{W\}}$ is

\[ \mathcal{I}_n^{\{S \mid W\}} = -\frac{1}{2} \ln \det\big( C_n^{\{S \mid W\}} \big), \tag{8.10} \]

and the conditional mutual information between $\theta^{\{S\}}$ and $\theta^{\{W\}}$ given the measurements is

\[ \mathcal{I}_n^{\{S;W\}} = \mathcal{I}_n^{\{S \mid W\}} - \mathcal{I}_n^{\{S\}} = \frac{1}{2} \ln\!\left( \frac{\det C_n^{\{S\}}}{\det C_n^{\{S \mid W\}}} \right). \tag{8.11} \]

The functions developed above for information gains (and the associated variances) are collectively referred to as 'ISFs'. From this point onwards, the term marginal posterior variance (or just marginal variance) for a parameter subset $\theta^{\{S\}}$ refers to the variance conditioned on only the measurements, equation (8.5), and the corresponding information gain, equation (8.6), is referred to as the marginal information gain. Similarly, the term conditional variance is used to refer to the variance when the measurements and, additionally, a parameter subset $\theta^{\{W\}}$ are given, equation (8.7), and the corresponding information gain is referred to as the conditional information gain, equation (8.10). Lastly, the information shared between two subsets of parameters given the measurements, equation (8.11), is referred to as the conditional mutual information (CMI) or just the mutual information. Finally, recall that the vector $z_n = [y_n^T, y_{n-1}^T, \ldots, y_0^T]^T$ denotes the collection of all measurement vectors up to time $t_n$.
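The bookkeeping of marginal gain, conditional gain and CMI can be sketched directly in terms of the covariance blocks of an arbitrary (illustrative) posterior covariance $C_n$, which is equivalent to the precision-based expressions (8.5) and (8.7); in particular, the additivity $\mathcal{I}_n^{\{S \mid W\}} = \mathcal{I}_n^{\{S\}} + \mathcal{I}_n^{\{S;W\}}$ can be verified numerically:

```python
import numpy as np

# Marginal/conditional information gains and CMI for a 3-parameter posterior.
# C_n below is an arbitrary valid posterior covariance of the form of eq. (7.3);
# the prior covariance is I_3, so all prior (marginal and conditional) variances are 1.
rng = np.random.default_rng(3)
A = rng.normal(size=(3, 3))
C_n = np.linalg.inv(np.eye(3) + A @ A.T)

S, W = [0], [1]                          # subsets: parameter 0 and parameter 1
C_S = C_n[np.ix_(S, S)]                  # marginal covariance of theta_S
# covariance of theta_S given theta_W (Schur complement over the W block):
C_SgW = C_S - C_n[np.ix_(S, W)] @ np.linalg.solve(C_n[np.ix_(W, W)], C_n[np.ix_(W, S)])

I_S = -0.5 * np.linalg.slogdet(C_S)[1]      # marginal information gain, cf. eq. (8.6)
I_SgW = -0.5 * np.linalg.slogdet(C_SgW)[1]  # conditional information gain, cf. eq. (8.10)
MI = 0.5 * (np.linalg.slogdet(C_S)[1] - np.linalg.slogdet(C_SgW)[1])  # CMI, eq. (8.11)
```

The CMI is non-negative, and the conditional gain decomposes exactly as marginal gain plus CMI, which is the additivity exploited in the Results section.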

Results and discussion
In this section, the theory developed above is applied to study three dynamical systems.

Three-element Windkessel model
Windkessel models are widely used to describe arterial haemodynamics [24]. Increasingly, they are also being used as boundary conditions in three-dimensional computational fluid dynamics simulations to assess patient-specific behaviour [20,25]. To perform patient-specific analysis, it is imperative that the parameters of the Windkessel model are estimated from measurements taken in each patient individually. A three-element Windkessel model is shown in figure 1a and consists of three parameters: $R_p$ (proximal resistance), which represents the hydraulic resistance of large vessels; $C$ (capacitance), which represents the compliance of large vessels; and $R_d$ (distal resistance), which represents the resistance of small vessels in the microcirculation. Note that these models use the electric analogy to fluid flow, where pressure $P$ is seen as voltage and flow-rate $q$ is seen as electric current. Typically, inlet flow-rate $q_i$ is measured (via magnetic resonance imaging or Doppler ultrasound) and inlet pressure $P_i$ is measured by pressure catheters. The goal then is to estimate the parameters ($R_p$, $C$ and $R_d$) by assuming $q_i$ is deterministically known and minimizing the difference between the $P_i$ reproduced by the model and the $P_i$ that was measured. The model dynamics is described by the following differential algebraic equations, which may also be rewritten as a single ODE:

\[ C\, \frac{dP_c}{dt} = q_i - q_o, \qquad q_o = \frac{P_c - P_{ven}}{R_d}, \qquad P_i - P_c = R_p\, q_i, \tag{9.1} \]

where $P_i$ and $P_c$ are the inlet and mid-Windkessel pressures, respectively (figure 1a); $P_{ext}$ and $P_{ven}$ are the reference external and venous pressures, respectively, which are both set to zero; and $q_i$ and $q_o$ are the inlet and outlet flow-rates, respectively. The measurement model is written as follows:

\[ y_n = P_{i_n} + \epsilon_n, \tag{9.2} \]

where $\epsilon_n$ is the noise (normally distributed with zero mean and variance $\sigma^2_{noise}$) in measuring $P_{i_n}$ to give the measurement $y_n$ at time $t_n$. The measurement vector, therefore, has only one component, $\mathbf{y}_n = [y_n]$. The nominal values of $R_p$, $C$ and $R_d$ are 0.838 mmHg s cm$^{-3}$, 0.0424 cm$^3$ mmHg$^{-1}$ and 9.109 mmHg s cm$^{-3}$, respectively. Note that these units are chosen so that the results are comprehensible in typical clinical units: millilitres for volume and millimetres of mercury for pressure. Figure 1b shows the inlet flow-rate $q_i$ (taken from [20,26], where it was measured in the carotid artery of a healthy 27-year-old subject) and the resulting pressure curves obtained by the solution of equation (9.1) with $P_{i_0} = 85$ mmHg and nominal parameter values. To put a zero-mean and unit-variance prior on the parameters, see equation (4.8), the following parameter transformation is considered:

\[ \xi = \xi_0 + \varpi_\xi\, \theta_\xi, \tag{9.3} \]

where $\xi$ represents the real parameter, $\xi_0$ and $\varpi_\xi$ are transformation parameters, and $\theta_\xi$ represents the transformed parameter, on which a prior of zero mean and unit variance is considered. Therefore, the prior considered on the real parameter $\xi$ has mean $\xi_0$ and variance $\varpi_\xi^2$. The posterior variances for the transformed parameter $\theta_\xi$ and the real parameter $\xi$ are represented by $\sigma^2_\theta$ and $\sigma^2$, respectively. A total of 150 time-points, evenly distributed between $t = 0$ s and $t = T_c$ (where $T_c = 0.75$ s is the time period of the cardiac cycle), are used for the computation of ISFs and conditional variances. Figure 2 shows the marginal posterior variances (conditional only on the measurements, (a-c)) and the corresponding information gains (d-f) for individual parameters at four different levels of measurement noise. The conditional variances when all measurements are taken into account, i.e. at $t = T_c$, are also summarized in table 1. An immediate utility of figure 2 is in identifying the intervals of time in which information about each parameter is concentrated; for $R_d$, information is available throughout the cardiac cycle. These observations have also been presented in [20] through the computation of GSFs [27] and in [21] through a Monte Carlo type computation of information gain. However, as opposed to GSFs, which can be non-monotonic and therefore hard to interpret, the ISFs are always monotonic.
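A minimal simulation sketch of the model (9.1) is given below; the forward Euler scheme, time step and the half-sine systolic inflow waveform are illustrative assumptions standing in for the measured carotid flow of figure 1b:

```python
import numpy as np

# Three-element Windkessel, eq. (9.1), with P_ext = P_ven = 0, solved by
# forward Euler over one cardiac cycle. The half-sine inflow is an assumed
# stand-in for the measured carotid flow-rate.
Rp, C, Rd = 0.838, 0.0424, 9.109          # mmHg s cm^-3, cm^3 mmHg^-1, mmHg s cm^-3
Tc, dt = 0.75, 1e-4                       # cardiac period and time step (s)
t = np.arange(0.0, Tc, dt)
q = np.where(t < 0.3, 10.0 * np.sin(np.pi * t / 0.3), 0.0)  # inflow (cm^3 s^-1), assumed

Pc = np.empty_like(t)
Pc[0] = 85.0 - Rp * q[0]                  # so that P_i(0) = 85 mmHg
for k in range(len(t) - 1):
    Pc[k + 1] = Pc[k] + dt * (q[k] - Pc[k] / Rd) / C   # C dPc/dt = q_i - Pc/Rd
Pi = Pc + Rp * q                          # inlet pressure: P_i = P_c + Rp q_i
```

In diastole ($q_i = 0$) the inlet pressure decays exponentially with time constant $\tau = R_d C \approx 0.39$ s, the behaviour that couples $R_d$ and $C$ in the posterior, as discussed below.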
Furthermore, since the GSFs are normalized by design, they are forced to start at 0 and end at 1, thereby making the assessment of measurement noise difficult. On the other hand, the effect of measurement noise is inherently built into the ISFs. Figure 2 quantifies how increasing measurement noise results in a decreasing amount of information gained about the parameters. While this behaviour is intuitively expected, its quantification with respect to each individual parameter is made possible with the proposed method. For example, while at $\sigma^2_{noise} = 100.0$ mmHg$^2$ the conditional variance of the parameter $\theta_{R_p}$ after considering all the measurements is 0.158 square units, at $\sigma^2_{noise} = 4900.0$ mmHg$^2$ this conditional variance is 0.887 square units. Comparing this to the prior variance of 1.0 square units, one may conclude that at a measurement noise of 4900.0 mmHg$^2$ (standard deviation of 70.0 mmHg) the parameter $R_p$ is extremely difficult to identify relative to when the measurement noise is 100.0 mmHg$^2$ (standard deviation of 10.0 mmHg). A similar argument can be made for the parameter $\theta_C$, even though its identifiability is better than that of $\theta_{R_p}$ ($\theta_C$ has a posterior variance of 0.672 square units at a measurement noise of $\sigma^2_{noise} = 4900.0$ mmHg$^2$). However, the parameter $\theta_{R_d}$ appears to be well identifiable even at $\sigma^2_{noise} = 4900.0$ mmHg$^2$, with a final posterior variance of 0.07 square units. This behaviour can be explained by the fact that the measurement noise is assumed to be independent and identically distributed with zero mean at all measurement times. Therefore, the mean pressure is measured much more precisely than the individual pressure measurements, irrespective of the noise levels, as when the mean/expectation of equation (9.2) is taken, the expectation of the noise component is zero:

\[ \mathbb{E}[y_n] = \mathbb{E}[P_{i_n}] + \mathbb{E}[\epsilon_n] = \mathbb{E}[P_{i_n}], \tag{9.4} \]

where $\mathbb{E}$ denotes the expectation operator. From equation (9.1) and figure 1a, the inlet mean pressure is equal to the inlet mean flow-rate times the sum of both resistances, i.e.
$\bar{P}_i = \bar{q}_i\, (R_p + R_d)$. Approximating $\mathbb{E}[y_n]$ by the sample mean $(1/n) \sum_{i=0}^{n} y_i$, one obtains

\[ \frac{1}{n} \sum_{i=0}^{n} y_i \approx \bar{q}_i\, (R_p + R_d). \tag{9.5} \]

As $q_i$ is assumed deterministic, from the above equation it can be seen that $R_p + R_d$ is indirectly measured with high precision. As $R_d$ is approximately an order of magnitude larger than $R_p$, it is natural that $R_d$ dominates the sum $(R_p + R_d)$ and hence, irrespective of the noise levels, a large amount of information is obtained about $R_d$ (figure 2c,f). The orders of magnitude of the resistances are dictated by the physics of the circulation, where the resistance of small vessels and the microcirculation is significantly higher than that of large vessels [20,26], and this is reflected in the chosen priors for the problem. Equation (9.5) and the arguments presented above imply that a significant amount of correlation must have been built up between the parameters $R_p$ and $R_d$ in the posterior distribution, as the sum $(R_p + R_d)$ is measured with high precision. This correlation implies that if one of the parameters $R_p$ or $R_d$ were known, a substantial amount of additional information could be gained about the other. The CMI presented in equation (8.11) precisely measures this additional information. CMIs for all three pairs of parameters are shown in figure 3. It is clear that at the end of the cardiac cycle the largest CMI is for the parameter pair $\theta_{R_p}$ and $\theta_{R_d}$. It is sensible to compare the magnitude of the CMIs with the marginal information gains (figure 2). For example, for the case of $\sigma^2_{noise} = 100.0$ mmHg$^2$, the marginal gain in information about the parameter $R_p$ is approximately 0.9 nats and the mutual information between $R_p$ and $R_d$ is 0.35 nats; therefore, one may conclude that approximately 40% extra information about the parameter $R_p$ is locked up in the correlation with $R_d$. For the pair $R_d$ and $C$, it appears that correlation is built up in $t \in [0.0, 0.4]$ s, the diastole, and destroyed in the remaining part, the systole, of the cardiac cycle.
This can be explained by the fact that the exponential decay $e^{-t/\tau}$, with time constant $\tau = R_d C$, is the dominant behaviour governing the diastolic phase [21], leading to a build-up of correlation; as independent information about $C$ and $R_d$ is acquired in systole (figure 2), this correlation is destroyed. It should be noted that these aspects can be naturally inferred from figures 2 and 3 even without knowing the physics (or solution) of the problem.
The effect of correlations can be further assessed by looking at the conditional variances (a-c) and conditional information gains (d-f) shown in figure 4, i.e. the quantities obtained when one of the other parameters is additionally assumed to be known. For the parameter $R_d$, as a large amount of individual information is obtained marginally, the conditional variances are not too different from the marginal variances. Note that the variances show behaviour opposite to the information gains, as a decrease in variance implies a gain in information. Therefore, even though the two measures appear to be similar, information gain is a better measure, as it can be readily applied to cases where the behaviour of a set of parameters is to be studied. For example, if one were interested in the joint information gain for a set of two parameters given a third, the information gain measure would be a scalar but the joint covariance would be a matrix. Furthermore, the relation between conditional information gain, marginal information gain and mutual information is additive, see equation (8.11), whereas the relation between conditional variance and marginal variance is, in general, not additive. As a demonstration, it can be observed that the conditional information gain curves in figure 4 can be obtained by the addition of the corresponding curves from figures 2 and 3.

The Hodgkin-Huxley model of a neuron
The Hodgkin-Huxley model [28] describes ionic exchanges and their relationship to the membrane voltage in a biological neuron. This model has also been used as the basis for several other ionic models describing a variety of excitable tissues, such as cardiac cells [29]. The model is described by the following system of ODEs:

\[ C_m\, \frac{dV_m}{dt} = I_{ext} - I_{Na} - I_K - I_L, \qquad I_{Na} = g_{Na}\, m^3 h\, (V_m - V_{Na}), \qquad I_K = g_K\, n^4\, (V_m - V_K), \qquad I_L = g_L\, (V_m - V_L), \]

\[ \frac{dm}{dt} = \alpha_m(V_m)(1 - m) - \beta_m(V_m)\, m, \qquad \frac{dh}{dt} = \alpha_h(V_m)(1 - h) - \beta_h(V_m)\, h, \qquad \frac{dn}{dt} = \alpha_n(V_m)(1 - n) - \beta_n(V_m)\, n, \tag{9.6} \]

where $V_m$ is the membrane voltage; $C_m$ is the membrane capacitance; $I_{ext}$ is the external current applied; $I_{Na}$, $I_K$ and $I_L$ are the sodium, potassium and leakage currents, respectively; $V_{Na}$, $V_K$ and $V_L$ are the equilibrium potentials for sodium, potassium and leakage ions, respectively; $g_{Na}$, $g_K$ and $g_L$ are the maximum conductances for the channels of sodium, potassium and leakage ions, respectively; $m$, $h$ and $n$ are the dimensionless gate variables, $m, h, n \in [0, 1]$, that characterize the activation and inactivation of the sodium and potassium channels; and the $\alpha$ and $\beta$ rate functions are those of the original model [28]. $C_m$ is set to 1 $\mu$F cm$^{-2}$, and the equilibrium potentials are defined in millivolts (mV) relative to the membrane resting potential, $E_R$, as follows [30,31]:

\[ V_{Na} = E_R + 115, \qquad V_K = E_R - 12, \qquad V_L = E_R + 10.613. \tag{9.7} \]

The inverse problem is that of estimating the three parameters $g_{Na}$, $g_K$ and $g_L$ by measuring the membrane voltage $V_m$ when a constant external current $I_{ext} = 20$ $\mu$A cm$^{-2}$ is applied to the neuron. It is well known that when a relatively high constant external current is applied, the neuron exhibits a tonic spiking pattern in the membrane voltage $V_m$ [32-34]; this is the behaviour obtained with a nominal parameter value of $g_{Na} = 120.0$ mS cm$^{-2}$ and the corresponding nominal values for $g_K$ and $g_L$. The measurement model is

\[ y_n = V_{m_n} + \epsilon_n, \tag{9.8} \]

where $V_{m_n}$ is the membrane voltage at time $t_n$ and $\epsilon_n$ is the zero-mean measurement noise with variance $\sigma^2_{noise}$. As only $V_m$ is measured, the observation vector is $\mathbf{y}_n = [y_n]$. As opposed to the Windkessel case, where the effect of noise is evaluated, in this case the effect of the observation frequency is evaluated at four different measurement frequencies. Analogously to equation (9.3), the transformation

\[ \xi = \xi_0 + \varpi_\xi\, \theta_\xi \tag{9.9} \]

is considered, where $\xi_0$ is the nominal parameter value; a zero-mean and unit-variance normal prior is imposed on the transformed parameter $\theta_\xi$, resulting in the prior distribution imposed on the real parameter $\xi$ being a normal distribution with mean $\xi_0$ and variance $\varpi_\xi^2$. The parameters $\varpi_\xi$ are set to 10.0, 6.0 and 0.1 mS cm$^{-2}$ for $g_{Na}$, $g_K$ and $g_L$, respectively. Figure 6 shows the posterior marginal variances (a-c) and the marginal information gains (d-f) for the three parameters at all four observation frequencies. In all these plots, an arbitrarily scaled $V_m(t)$ curve is shown in light grey for ease of interpretation relative to the $V_m(t)$ variations. As expected, increasing the measurement frequency results in larger amounts of information (and consequently a larger reduction in the posterior variances). However, it is observed that the parameters $\theta_{g_{Na}}$ and $\theta_{g_L}$ benefit most from an increase in measurement frequency, as opposed to the parameter $\theta_{g_K}$, which benefits only marginally. This implies that at low observation frequencies the identifiability of $\theta_{g_K}$ is good, while very little information is available for the parameters $\theta_{g_{Na}}$ and $\theta_{g_L}$. The behaviour of the parameter $\theta_{g_K}$ (figure 6b,e) shows that the information about this parameter is concentrated mostly in the sharp rising phase of the action potential $V_m$. A similar behaviour, although less salient, is observed for the parameters $\theta_{g_{Na}}$ and $\theta_{g_L}$ (figure 6a,d,c,f). While the Hodgkin-Huxley model is quite complex, with gating variables of different time constants and dependence of the ionic currents on powers (up to the fourth power) of the gating variables, it is widely understood that the rising phases of the action potential $V_m$ are related to the sodium and potassium currents. This may explain why information about the parameters $\theta_{g_{Na}}$ and $\theta_{g_K}$ is mostly concentrated in this region. Furthermore, if we accept that the sodium and potassium currents, in combination, are responsible for the rising action potential, then we should also expect a substantial amount of correlation between the parameters $\theta_{g_{Na}}$ and $\theta_{g_K}$, as it should be hard to distinguish between these two parameters.
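A minimal simulation sketch of the tonic-spiking regime is given below, using the classic rate functions and textbook nominal conductances of the original Hodgkin-Huxley formulation (with $E_R$ taken as 0, so voltages are relative to rest); the forward Euler scheme and time step are illustrative choices:

```python
import numpy as np

# Hodgkin-Huxley model with the classic 1952 rate functions; voltages in mV
# relative to rest (E_R = 0). A constant I_ext = 20 uA cm^-2 drives tonic spiking.
gNa, gK, gL = 120.0, 36.0, 0.3        # mS cm^-2 (textbook nominal values)
VNa, VK, VL = 115.0, -12.0, 10.613    # mV
Cm, Iext, dt, T = 1.0, 20.0, 0.01, 50.0   # uF cm^-2, uA cm^-2, ms, ms

def a_m(V): return 0.1 * (25.0 - V) / (np.exp((25.0 - V) / 10.0) - 1.0)
def b_m(V): return 4.0 * np.exp(-V / 18.0)
def a_h(V): return 0.07 * np.exp(-V / 20.0)
def b_h(V): return 1.0 / (np.exp((30.0 - V) / 10.0) + 1.0)
def a_n(V): return 0.01 * (10.0 - V) / (np.exp((10.0 - V) / 10.0) - 1.0)
def b_n(V): return 0.125 * np.exp(-V / 80.0)

V, m, h, n = 0.0, 0.05, 0.6, 0.32     # approximate resting-state initial values
trace = []
for _ in range(int(T / dt)):
    INa = gNa * m**3 * h * (V - VNa)  # sodium current
    IK = gK * n**4 * (V - VK)         # potassium current
    IL = gL * (V - VL)                # leakage current
    V += dt * (Iext - INa - IK - IL) / Cm
    m += dt * (a_m(V) * (1 - m) - b_m(V) * m)
    h += dt * (a_h(V) * (1 - h) - b_h(V) * h)
    n += dt * (a_n(V) * (1 - n) - b_n(V) * n)
    trace.append(V)
V_trace = np.array(trace)
```

The resulting `V_trace` exhibits repeated action potentials; the sharp rising phases of these spikes are the regions in which, per figure 6, the information about the conductance parameters is concentrated.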
This is precisely what is observed in the CMI analysis (figure 7), where a large amount of mutual information develops between these two parameters. For the case of N_obs = 100, the marginal information gain in the parameter θ_{g_Na} (figure 6) is approximately 0.3 nats, while figure 7 shows that approximately 0.7 nats of mutual information exists between θ_{g_Na} and θ_{g_K}. This implies that the amount of information that can be gained about θ_{g_Na} by knowing θ_{g_K}, in addition to the measurements, is larger than the amount of information gained from the measurements alone. Indeed, as the observation frequency is increased, more information becomes available about all the parameters individually. Figure 7 also shows that a significant amount of correlation is built between the parameters θ_{g_K} and θ_{g_L} during the sharp rising part of V_m. For example, for N_obs = 200, the CMI between θ_{g_K} and θ_{g_L} is approximately 0.14 nats (figure 7), approximately the same magnitude as the marginal information gain of 0.15 nats (figure 6) for the parameter θ_{g_L}. At the same time, since the marginal information gain for θ_{g_K} is approximately 1.25 nats (figure 6), the effect of this correlation, amounting to an information gain of 0.14 nats (figure 7), is not too significant for estimating θ_{g_K}. Finally, the effect of the CMI, i.e. the correlation, can also be seen in terms of the conditional variances and conditional information gains, as shown in figure 8 for N_obs = 200. As discussed above, the correlations between the pairs (θ_{g_Na}, θ_{g_K}) and (θ_{g_K}, θ_{g_L}) show that the conditional variances are significantly lower (and the conditional information gains larger) for one parameter when the other parameter is additionally known. It should be noted that the correlations and information gains presented are specific to the protocol, i.e. a constant external current resulting in tonic spiking of the neuron and only V_m being measured.
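For Gaussian priors and (approximately) Gaussian posteriors, the quantities compared above are simple functions of the posterior covariance: the marginal information gain is ½ ln(σ²_prior/σ²_post), the mutual information between two parameters is −½ ln(1 − ρ²), and the conditional variance is the marginal variance scaled by (1 − ρ²). A minimal sketch with an illustrative covariance (the numbers below are not those of figures 6-8):

```python
# How the marginal information gain, mutual information and conditional
# variance discussed above follow from a Gaussian posterior covariance.
# The covariance matrix below is illustrative only, not from figures 6-8.
import numpy as np

prior_var = 1.0                      # unit-variance prior on transformed params
post = np.array([[0.55, 0.45],       # posterior covariance of (theta_1, theta_2)
                 [0.45, 0.60]])

s1, s2 = post[0, 0], post[1, 1]
rho2 = post[0, 1]**2 / (s1 * s2)     # squared posterior correlation

gain_1 = 0.5 * np.log(prior_var / s1)         # marginal information gain, nats
mi_12 = -0.5 * np.log(1.0 - rho2)             # mutual information, nats
cond_var_1 = s1 * (1.0 - rho2)                # var(theta_1 | theta_2, data)
cond_gain_1 = 0.5 * np.log(prior_var / cond_var_1)

print(f"marginal gain for theta_1 : {gain_1:.3f} nats")
print(f"mutual information        : {mi_12:.3f} nats")
print(f"conditional gain theta_1  : {cond_gain_1:.3f} nats")
```

Note that the conditional gain equals the marginal gain plus the mutual information, which is exactly why a 0.7 nat CMI can outweigh a 0.3 nat marginal gain in the discussion above.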
The information gains will behave differently if the protocol is changed, for example to intermittent step currents or continuously varying external currents. Therefore, one application of the methods proposed in this article is in the optimal design of experiments, where one may design the protocol such that maximal information gain occurs for the individual parameters while the CMIs (the correlations in the posterior distribution) are minimized.

Influenza A virus kinetics
The final example presented is for the kinetics of the influenza A virus. The following model was proposed by Baccam et al. [35] to describe viral infection:

dT/dt = −βTV,
dI/dt = βTV − δI,      (9.11)
dV/dt = pI − cV,

where V is the infectious virus titre (measured in TCID₅₀ ml⁻¹ of nasal wash), T is the number of uninfected target cells, I is the number of productively infected cells and {β, δ, p, c} are the model parameters. The parameter p represents the average rate at which the productively infected cells, I, increase the viral titre, and the parameter δ represents the rate at which the infected cells die. The parameter β characterizes the rate at which the susceptible cells become infected, and c represents the clearing rate of the virus. As opposed to the previous examples, where the initial conditions were assumed to be known, in this example the initial conditions for the virus titre V₀ and the number of uninfected target cells T₀ are considered unknown and hence also form parameters of the dynamical system. Time is measured in days (d), and the initial condition for the number of infected cells is assumed to be known, I₀ = 0 [35]. As in the previous examples, the parametrization ξ = ξ₀ + ς_ξ θ_ξ is used to impose zero-mean, unit-variance priors on the transformed parameters, where θ_ξ represents the transformed version of the real parameter ξ and ξ₀ its nominal value; the prior imposed on the real parameter then has mean ξ₀ and variance ς²_ξ. The scaling parameters ς_ξ are set to 9 × 10⁻⁶, 1.3, 0.004, 1.0, 0.03 and 2.0 × 10⁸ for β, δ, p, c, V₀ and T₀, respectively, in their respective units. The solution to equation (9.11) for the nominal parameter values is shown in figure 9.
It is observed that both the virus titre V and the number of infected cells I increase sharply until they peak at the 2-3 day mark, after which both quantities decrease. The number of uninfected target cells T remains approximately constant until the 2 day mark, after which a sharp decrease (approx. 4 orders of magnitude) is observed over the next 2 days, leading to a plateau.
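The dynamics just described can be reproduced with a few lines. The target-cell-limited model is as in equation (9.11); the parameter values below are illustrative order-of-magnitude choices in the units used above, not the article's nominal values.

```python
# Sketch of the target-cell-limited influenza model of Baccam et al.:
#   dT/dt = -beta*T*V,  dI/dt = beta*T*V - delta*I,  dV/dt = p*I - c*V.
# Parameter values are illustrative order-of-magnitude choices (per day,
# in the units used in the text), not the article's nominal values.
import numpy as np
from scipy.integrate import solve_ivp

beta, delta, p, c = 3.2e-5, 3.2, 4.6e-2, 5.2   # illustrative rates
T0, I0, V0 = 4.0e8, 0.0, 0.75                  # I0 = 0 as in the text

def rhs(t, y):
    T, I, V = y
    return [-beta * T * V,              # target cells become infected
            beta * T * V - delta * I,   # infected cells produced, then die
            p * I - c * V]              # virions produced and cleared

sol = solve_ivp(rhs, (0.0, 10.0), [T0, I0, V0], method="LSODA",
                t_eval=np.linspace(0.0, 10.0, 401), rtol=1e-8, atol=1e-8)
T, I, V = sol.y
t_peak = sol.t[np.argmax(V)]
print(f"virus titre peaks at day {t_peak:.2f}; "
      f"target cells drop {T0 / max(T[-1], 1e-30):.1e}-fold")
```

With these illustrative values the solution shows the same qualitative behaviour as figure 9: a sharp early rise and peak in V and I, followed by a decline, and a crash of T by several orders of magnitude.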
To study the sensitivity and information gain, two cases are considered: first, when only V is measured; and second, when both V and I are measured. In the first case, the observation model reads y_n = V_n + e_n, where V_n is the virus titre concentration at time t_n and e_n is the zero-mean measurement noise with variance σ²_noise = 2.5 × 10⁷ (TCID₅₀ ml⁻¹)², i.e. a standard deviation of 5 × 10³ TCID₅₀ ml⁻¹. A total of 200 measurements are evenly distributed between 0 and 10 days for the computation of the marginal variances and information gains. Figure 10 shows the marginal variances for all the parameters in solid lines and the conditional variances for four pairs of parameters in dashed lines. Given the dynamics of the problem, as shown in figure 9, it is not surprising that most of the information gain about all the parameters occurs in t ∈ [0, 4] days. The parameters θ_β, θ_δ and θ_c appear to be well identifiable, given the large decreases in their marginal variances. However, the initial conditions θ_{V₀} and θ_{T₀} show smaller decreases in variance, indicating problems in their identifiability. Finally, the parameter θ_p appears to be unidentifiable, given that its marginal variance decreases from 1.0 (standard deviation 1.0) to only 0.7 square units (standard deviation 0.84 units). Figure 11 shows the mutual information between all pairs of parameters, where the parameter pairs that show a high mutual information are plotted in dashed lines. For the parameters in these pairs of high mutual information, (θ_δ, θ_c) and (θ_p, θ_{T₀}), the conditional variances are plotted in figure 10. The parameter pair (θ_p, θ_{T₀}) is particularly interesting: the parameter θ_p, although unidentifiable individually, becomes very well identifiable if the initial condition T₀ is known, owing to the large mutual information it shares with θ_{T₀}. This observation was proved through classical methods by Miao et al.
[6], where it was shown that, by taking higher-order derivatives of equation (9.11) and eliminating the unmeasured variables, T and I, one obtains a differential equation in V alone that does not contain the parameter p. As that equation does not contain p, in the absence of any other measured quantity, i.e. T and I, and the corresponding initial conditions, the parameter p is not identifiable. Miao et al. [6] also reported that when T₀ is known the parameter p becomes identifiable, which is consistent with the behaviour observed here. In the Bayesian approach adopted in this manuscript, a non-zero amount of knowledge (non-infinite variance) is inherently assumed in the prior for θ_{T₀}, which results in a small amount of information gain for θ_p (and hence a small reduction in its marginal variance from 1.0 to 0.7 square units). This small information gain is a consequence of the knowledge assumed in the prior; however, it is not significant enough to hide the identifiability problem for θ_p. One can impose a prior of higher ignorance by increasing the prior variance of the real parameter T₀, i.e. by increasing the scaling factor ς_{T₀}. The results for four different values of ς_{T₀} on the marginal variance of the parameter θ_p are shown in figure 12b. It is clear that a higher value of ς_{T₀}, which implies higher ignorance in the prior for T₀, results in a smaller amount of information gained about the parameter θ_p. This example shows how the information-theoretic approach can provide conclusions about parameter identifiability similar to those of classical analytical methods (see, for example, those presented in [6]), which may not be easily applicable to all dynamical systems. Lastly, the classical sensitivity of the parameter p with respect to the measurable V is shown in figure 12a; its large magnitude does not indicate any problem of parameter identifiability. Finally, Miao et al.
[6] reported that all the parameters of the influenza dynamical system are well identifiable if both V and I, or both V and T, are measured. For the case when both V and I are measured, the marginal variances are shown in figure 13, which confirms that no identifiability problems persist in this case. Note that the error structure in the measurement of I is assumed to be identical to that in the measurement of V, equation (9.13).
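The variance computations underlying figures 10-13 can be sketched for a linear-Gaussian surrogate: each observation adds information JᵢᵀJᵢ/σ²_noise to the prior precision of the transformed parameters, where Jᵢ is the row of sensitivities of the observable at time tᵢ. The toy observable below (a two-parameter exponential decay) is hypothetical and only illustrates the mechanics, not the influenza model itself.

```python
# Minimal sketch of how posterior variances of transformed parameters follow
# from classical sensitivities under a linear-Gaussian approximation:
#   C_n = (I_p + sum_i J_i^T J_i / sigma_noise^2)^(-1),
# where J_i holds d y(t_i) / d theta. The toy observable y(t) = a*exp(-b*t)
# is hypothetical, not the influenza model of the text.
import numpy as np

a0, b0 = 2.0, 0.5                  # nominal values of the toy parameters
sa, sb = 0.5, 0.2                  # prior scales (theta has unit variance)
sigma_noise = 0.05

t_obs = np.linspace(0.0, 10.0, 50)
# Sensitivities to theta_a, theta_b via the chain rule:
#   dy/dtheta_a = sa * exp(-b t),  dy/dtheta_b = -sb * a * t * exp(-b t).
J = np.column_stack([sa * np.exp(-b0 * t_obs),
                     -sb * a0 * t_obs * np.exp(-b0 * t_obs)])

C = np.eye(2)                      # prior covariance of (theta_a, theta_b)
marginal_var_a = []
for row in J:
    info = np.outer(row, row) / sigma_noise**2   # information from one sample
    C = np.linalg.inv(np.linalg.inv(C) + info)   # sequential Bayesian update
    marginal_var_a.append(C[0, 0])

rho2 = C[0, 1]**2 / (C[0, 0] * C[1, 1])
cond_var_a = C[0, 0] * (1.0 - rho2)              # var(theta_a | theta_b, data)
print(f"marginal var(theta_a): {C[0, 0]:.4f}, conditional: {cond_var_a:.4f}")
```

The recorded sequence `marginal_var_a` is non-increasing, which is the quantity plotted against time in the marginal-variance figures; increasing `sigma_noise` slows the decrease, mirroring the noise study in the Windkessel example.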

Conclusion
A new class of functions, called the 'ISFs', has been proposed to study parametric information gain in a dynamical system. Based on a Bayesian and information-theoretic approach, these functions are easy to compute through classical sensitivity analysis alone. Compared with the previously proposed generalized sensitivity functions (GSFs) [27] for measuring such information gain, the ISFs do not suffer from the forced-to-one behaviour and are easy to interpret, as correlations are measured through separate measures of mutual information rather than through oscillations in the GSFs. Furthermore, as opposed to the GSFs, which are normalized, the ISFs can be used to compare information gain between different parameters and hence to rank the parameters by ease of identifiability. They can be used to identify regions of high information content and to indicate identifiability problems for parameters that show little to no information gain, or high mutual information (correlation) with other parameters. The application of the ISFs is demonstrated on three models. For the Windkessel model, the effect of measurement noise is illustrated, and it is shown that the insights provided by the ISFs are consistent with those of a significantly more expensive Monte Carlo-type approach [21]. For the Hodgkin-Huxley model, the effect of measurement frequency is illustrated. Finally, for the influenza A virus, it is shown how, even when classical sensitivity analysis fails to reveal identifiability issues, the ISFs correctly reveal identifiability problems that have been analytically proven through classical methods. Appendix A.
Differential analysis for equivalence of sensitivity and covariance evolution

From equation (3.2), the linearized dynamical system is

\dot{x} = f|_n + \nabla_x f|_n (x - \mu_{x_n}) + \nabla_\theta f|_n (\theta - \mu_\theta) + \nabla_t f|_n (t - t_n). \quad (A 1)

Separating the random variables \theta and x gives

\dot{x} = \nabla_x f|_n\, x + \nabla_\theta f|_n\, \theta + \underbrace{(f|_n - \nabla_x f|_n \mu_{x_n} - \nabla_\theta f|_n \mu_\theta + \nabla_t f|_n (t - t_n))}_{\tau_n}. \quad (A 2)

Combined with the trivial dynamics for the parameters, \dot{\theta} = 0, the dynamics of the combined vector \xi = [x^T, \theta^T]^T can be written as

\frac{d}{dt} \begin{bmatrix} x \\ \theta \end{bmatrix} = \begin{bmatrix} \nabla_x f|_n & \nabla_\theta f|_n \\ O_{p,d} & O_{p,p} \end{bmatrix} \begin{bmatrix} x \\ \theta \end{bmatrix} + \begin{bmatrix} \tau_n \\ O_{p,1} \end{bmatrix}, \quad (A 3)

where O_{a,b} represents a zero matrix of size a \times b. The above is concisely written as

\dot{\xi} = \mathcal{F}_n \xi + \tilde{\tau}_n. \quad (A 4)

For the above stochastic differential equation, it is well known, see for example [36], that the covariance matrix of \xi, denoted by J, evolves according to the following differential equation:

\dot{J} = \mathcal{F}_n J + J \mathcal{F}_n^T. \quad (A 5)

Therefore, if the covariance matrix of \xi_n is

J = \begin{bmatrix} \Sigma_{n,n} & \Lambda_{n,\theta} \\ \Lambda_{\theta,n} & \Sigma_{\theta,\theta} \end{bmatrix}, \quad (A 6)

then J evolves according to equation (A 5) as

\dot{J} = \begin{bmatrix} \nabla_x f|_n & \nabla_\theta f|_n \\ O_{p,d} & O_{p,p} \end{bmatrix} \begin{bmatrix} \Sigma_{n,n} & \Lambda_{n,\theta} \\ \Lambda_{\theta,n} & \Sigma_{\theta,\theta} \end{bmatrix} + \begin{bmatrix} \Sigma_{n,n} & \Lambda_{n,\theta} \\ \Lambda_{\theta,n} & \Sigma_{\theta,\theta} \end{bmatrix} \begin{bmatrix} \nabla_x^T f|_n & O_{d,p} \\ \nabla_\theta^T f|_n & O_{p,p} \end{bmatrix}
= \begin{bmatrix} \nabla_x f|_n \Sigma_{n,n} + \nabla_\theta f|_n \Lambda_{n,\theta}^T + \Sigma_{n,n} \nabla_x^T f|_n + \Lambda_{n,\theta} \nabla_\theta^T f|_n & \nabla_x f|_n \Lambda_{n,\theta} + \nabla_\theta f|_n \Sigma_{\theta,\theta} \\ \Lambda_{n,\theta}^T \nabla_x^T f|_n + \Sigma_{\theta,\theta} \nabla_\theta^T f|_n & O_{p,p} \end{bmatrix}. \quad (A 7)

The next task is to relate the above evolution of the covariance matrix to the evolution of the sensitivity matrix.
From equation (2.6), the sensitivity matrix S evolves as

\dot{S} = \nabla_x f|_n\, S + \nabla_\theta f|_n. \quad (A 8)

Therefore, taking the transpose of equation (A 8) yields

\dot{S}^T = S^T \nabla_x^T f|_n + \nabla_\theta^T f|_n. \quad (A 9)

The derivative of the matrix product SS^T can be written as

\frac{d(SS^T)}{dt} = \dot{S} S^T + S \dot{S}^T. \quad (A 10)

Substituting the derivatives from (A 8) and (A 9) into the above equation gives

\frac{d(SS^T)}{dt} = SS^T \nabla_x^T f|_n + \nabla_x f|_n SS^T + \nabla_\theta f|_n S^T + S \nabla_\theta^T f|_n. \quad (A 11)

It is easy to see that if \Sigma_{\theta,\theta} = I_p, \Lambda_{n,\theta} = S and \Sigma_{n,n} = SS^T, then the state covariance, \Sigma_{n,n}, and the cross-covariance, \Lambda_{n,\theta}, from equation (A 7) evolve as

\dot{\Sigma}_{n,n} = SS^T \nabla_x^T f|_n + \nabla_x f|_n SS^T + \nabla_\theta f|_n S^T + S \nabla_\theta^T f|_n = \frac{d(SS^T)}{dt} \quad (A 12)

and

\dot{\Lambda}_{n,\theta} = \nabla_x f|_n\, S + \nabla_\theta f|_n = \dot{S}. \quad (A 13)

Appendix B. Asymptotic analysis of the conditional covariance

In this section, the behaviour of the conditional covariance matrix C_n as n \to \infty is considered. From equation (5.4), A_n can be written in terms of a diagonal matrix G_n (equations (B 1) and (B 2)). By applying the Kailath variant of the Sherman-Morrison-Woodbury identity, the inverse of A_n can be expanded (equation (B 3)); substituting this into equation (6.3) yields an expression for C_n involving a symmetric matrix D_n (equations (B 4)-(B 6)). The matrix D_n, being symmetric, can be factorized by singular value decomposition (SVD) as

D_n = U_n F_n U_n^T, \quad \text{with} \quad U_n U_n^T = I_p, \quad (B 7), (B 8)

where F_n is a diagonal matrix with diagonal entries equal to the eigenvalues, \lambda_i, of D_n.
C_n = I_p - U_n F_n U_n^T + U_n F_n U_n^T (I_p + U_n F_n U_n^T)^{-1} U_n F_n U_n^T \quad (B 10)
= I_p - U_n F_n U_n^T + U_n F_n U_n^T (U_n U_n^T + U_n F_n U_n^T)^{-1} U_n F_n U_n^T \quad (B 11)
= I_p - U_n F_n U_n^T + U_n F_n U_n^T (U_n (I_p + F_n) U_n^T)^{-1} U_n F_n U_n^T \quad (B 12)
= U_n U_n^T - U_n F_n U_n^T + U_n F_n (I_p + F_n)^{-1} F_n U_n^T \quad (B 13)
= U_n \underbrace{[I_p - F_n + F_n (I_p + F_n)^{-1} F_n]}_{P_n} U_n^T. \quad (B 14)

In the above, P_n is a diagonal matrix with entries

[P_n]_{ii} = 1 - \lambda_i + \frac{\lambda_i^2}{1 + \lambda_i} = \frac{1}{1 + \lambda_i}. \quad (B 15)

If the minimum eigenvalue of D_n is much larger than 1, i.e. \lambda_i \gg 1 for all i, then 1/(1 + \lambda_i) \approx 1/\lambda_i, and consequently equation (B 10) yields

C_n \approx U_n F_n^{-1} U_n^T = D_n^{-1}. \quad (B 19)

Finally, from the above and equations (B 5), (B 2) and (5.5), the conditional covariance matrix can be written as in equation (B 20). It can hence be concluded that, if the minimum eigenvalue of D_n increases monotonically as n increases, the limit in equation (B 21) follows.

Let the eigenvalues of D_n be denoted in decreasing order as \lambda_1(D_n) \ge \lambda_2(D_n) \ge \dots \ge \lambda_p(D_n). The behaviour of the minimum eigenvalue, \lambda_p(D_n), is of concern. Note that D_n can be written as a sum of symmetric terms Q_i, so that

D_{n+1} = D_n + Q_{n+1}. \quad (B 23)

D_n and Q_n are both symmetric matrices. Let the eigenvalues of Q_{n+1} be denoted in decreasing order as \lambda_1(Q_{n+1}) \ge \dots \ge \lambda_p(Q_{n+1}). Taking the trace of equation (B 23) gives

\mathrm{Tr}(D_{n+1}) = \mathrm{Tr}(D_n) + \mathrm{Tr}(Q_{n+1}),

where Tr denotes the trace. Expressed in terms of the eigenvalues of the respective matrices, the above reads

\sum_{i=1}^p \lambda_i(D_{n+1}) = \sum_{i=1}^p \lambda_i(D_n) + \sum_{i=1}^p \lambda_i(Q_{n+1}).

Consequently, if Q_{n+1} is full rank then \lambda_p(Q_{n+1}) > 0 and

\lambda_p(D_{n+1}) \ge \lambda_p(D_n) + \lambda_p(Q_{n+1}) > \lambda_p(D_n),

which implies that the minimum eigenvalue of D_n is monotonically increasing.
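The chain (B 10)-(B 15) can be verified numerically: since the diagonal entries of P_n reduce to 1/(1 + λ_i), the expression collapses to the eigendecomposition of (I_p + D_n)⁻¹, which approaches D_n⁻¹ when the smallest eigenvalue of D_n is much larger than 1. A quick check on a random symmetric positive-definite D:

```python
# Numerical check of (B 10)-(B 15): with D = U F U^T symmetric PSD,
#   C = I - UFU^T + UFU^T (I + UFU^T)^(-1) UFU^T
# equals U diag(1/(1+lambda_i)) U^T, i.e. (I + D)^(-1),
# and C ~ D^(-1) when the smallest eigenvalue of D is >> 1.
import numpy as np

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
D = A @ A.T + 10.0 * np.eye(p)     # symmetric, eigenvalues well above 1

I = np.eye(p)
C = I - D + D @ np.linalg.inv(I + D) @ D          # right-hand side of (B 10)

lam, U = np.linalg.eigh(D)                        # D = U diag(lam) U^T
P = np.diag(1.0 - lam + lam**2 / (1.0 + lam))     # entries reduce to 1/(1+lam)

print(np.allclose(C, U @ P @ U.T))                # matches (B 14)
print(np.allclose(C, np.linalg.inv(I + D)))       # closed form of P entries
print(np.abs(C - np.linalg.inv(D)).max())         # small, since lam_min >> 1
```

The last printed quantity is of order 1/(λ_min(1 + λ_min)), illustrating the approximation (B 19).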
The above results can be put in the perspective of classical nonlinear regression analysis [9,10] by assuming that the observation operator H_i is equal to the identity for all i. Then, under the further assumption that Q_i = (\nabla_\theta^T x|_i) Y_i^{-1} (\nabla_\theta x|_i) is full rank for all i, the conditional covariance matrix of the parameters behaves, consistent with equation (B 19), as C_n \approx M^{-1} for large n, where M is the Fisher information matrix defined as

M = \sum_{i} (\nabla_\theta^T x|_i)\, Y_i^{-1}\, (\nabla_\theta x|_i) = \sum_{i} Q_i.