Using constraints and their value for optimization of large ODE systems

We provide analytical tools to facilitate a rigorous assessment of the quality and value of the fit of a complex model to data. We use this to provide approaches to model fitting, parameter estimation, the design of optimization functions and experimental optimization. This is in the context where multiple constraints are used to select or optimize a large model defined by differential equations. We illustrate the approach using models of circadian clocks and the NF-κB signalling system.


Introduction
Systems Biology is producing a rapidly growing number of complex mathematical models of dynamic biological systems such as the cell cycle, circadian clocks and numerous signalling systems. These models are usually highly nonlinear and have many state variables and parameters. It is therefore very challenging to understand how the behaviour of these models depends upon model structure and parameters and to distinguish those features of the model that are fundamental from those that are accidental or irrelevant. Moreover, the nonlinearity and large size of these models make validation and calibration against biological data very difficult.
We focus in this paper on large models given by differential equations, the most ubiquitous method for such systems. When estimating the parameters of such systems, it is usual to either introduce a likelihood expressing the probability of the data given a particular deterministic solution or provide a function measuring the fit of such a solution to the data. A common choice for a likelihood is to assume that intrinsic noise can be neglected and that the main source of stochasticity is observational error, which is often assumed to be normally distributed. Optimization functions are often based on the Euclidean distance or are a sum of squares each measuring the deviation of a summary statistic from the data derived value. For the most complex models, it is often the case that the data only partially constrain the model and therefore, for these models, such fitting is done by hand or using optimization functions, where the modellers have to identify qualitative features of interest.
In each case, there is a great need for analytical tools to facilitate fitting and to provide a rigorous assessment of the quality and value of the fit and our aim here is to provide some mathematical tools to tackle both of these challenges. To demonstrate the usefulness of these tools, we apply them to some significant exemplar models. We are particularly interested in large models with many parameters and state variables. For example, one of the models we consider has 28 state variables and 104 parameters and as we combine this with models for some mutants, the effective number of state variables is several times this.
Suppose that we are considering such a system and that we have data and models for the wild-type and a number of mutants in a set of conditions. For example, the data might be for a wild-type circadian clock or signalling system and several gene knockouts in a number of environmental conditions. We assume that for each such combination of genetic background and environmental conditions (which we henceforth call GE-combinations), we have a model and that from the data we have a set of constraints that the models should satisfy. Note that in this paper, when we talk of the constraints, we mean quantitative properties usually derived from experimental data. They might, for example, quantify the levels of certain mRNAs or proteins for the given combination, or in the case of oscillating systems they might determine the period of the oscillation or the relative phases of the mRNAs or proteins. In general, they will be of the form $C_i(g) = C_i^0$ for some real-valued function $C_i$ of the solution $g$ of interest for one of the given GE-combinations. The value of $C_i^0$ will come from the experimental data. As usual, it is just the parameters $k$ that we are changing, so $g$, and hence $C_i$, is a function of them and the conditions to be satisfied are of the form $C_i(k) = C_i^0$. A constraint determines the value of some quantity. Often these constraints are collected into a single real-valued function to be locally optimized which effectively acts as a likelihood function. Approximate Bayesian computation (ABC) functions in this way. ABC methods seek to infer parameters by comparing simulated data to the observed data, in terms of an optimization function that combines a set of summaries $C = (C_1, \ldots, C_m)$ essentially equivalent to the constraints mentioned above [1][2][3].
A related, less sophisticated approach that has been successfully employed is to search subspaces of the space of parameters using such an optimization function to find approximate local optima [4,5]. For both sorts of methods, the function to be optimized is usually of the form

$$\sum_{i=1}^{m} a_i \left( C_i(k) - C_i^0 \right)^2 \tag{1.1}$$

and a key question that we discuss below is how to choose the function, for example, in the case of (1.1), what constraints $C_i$ should be used and how should their weights $a_i$ be chosen. However, in the main, our approach will be to consider the optimization problem in terms of a set of $m$ individual constraints rather than to try and incorporate them into a single function to be optimized. We consider this in the context of a combined model for a set of GE-combinations as formulated below. The theory that we present allows us to investigate a number of interesting aspects. Firstly, we can consider a combined model that satisfies a set of constraints $C_1, \ldots, C_m$ and give a quantitative measure of the extent to which the constraints actually constrain the model. Given that a set of constraints have been applied, we quantify the extent to which an additional constraint $C_{m+1}$ further constrains the model and explain how its constraint value can be measured. Secondly, we show that it will usually be the case that the constraint value declines very rapidly as $m$ increases. In fact, we also demonstrate that it is reasonable to expect that many of the constraints will have small norm and hence be ineffective. Thirdly, we prove a theorem (called the $m \to m+1$ Transition theorem) that allows us to use this constraint value to determine the effects of adding a new constraint to the form of the optimization problem in terms of geometry, analysis and stochastic optimization. Fourthly, we consider the construction of optimization functions and show how constraint value can be used to help design them for use in statistical estimation algorithms.
Finally, we discuss how to use the results for experimental design. We want to facilitate the choice of effective constraints that will better characterize the system when deciding what experiments to do. We give some examples of how this can be done using our framework.
It is important to stress at this point that we are not providing an algorithm for estimating parameter values. However, as we provide an approach to analytically determine which sets of constraints are most informative, it can be used to help determine which are the most useful to use in some estimation algorithms.
In general, the constraints C i of interest are nonlinear. Unfortunately, a general global nonlinear theory is not possible because our current understanding of dynamical systems, though extensive, is not adequate for this. However, we can develop a relatively powerful and useful theory based on local analysis about a particular set of parameter values. This uses the extensive and powerful perturbation theory for differential equations.

Mathematical preliminaries
We assume that we have a set of models described by a system of differential equations of the form $dx_{i,\kappa}/dt = f_{i,\kappa}(t, x_\kappa, k)$. Each model is for a given GE-combination $\kappa$.
Here, $t$ is time and the vector $x_\kappa = (x_{1,\kappa}, \ldots, x_{n,\kappa})$ represents the state variables (typically for our applications, mRNA and protein levels). For this GE-combination $\kappa$, we can write this system as $dx_\kappa/dt = f_\kappa(t, x_\kappa, k)$. There is a common vector of parameters $k = (k_1, \ldots, k_s)$ for all such models. We then integrate all these models into a single one given by

$$\frac{dx_\kappa}{dt} = f_\kappa(t, x_\kappa, k), \qquad \kappa \in \mathcal{K}, \tag{2.1}$$

where $\mathcal{K}$ is the set of GE-combinations being considered. We also assume that for each of the systems (2.1), $dx_\kappa/dt = f_\kappa(t, x_\kappa, k)$, there is a solution $x_\kappa = g_\kappa(t, k)$ or a class of solutions defined for a specific time range $0 \le t \le T_\kappa$ that are of particular interest. For example, for circadian oscillations, the primary object of interest is an attracting periodic orbit of equation (2.1) and $T_\kappa$ will be the period of this orbit. On the other hand, for models of signalling systems, one is often interested in a solution that is not periodic but is defined by a given initial condition $x_0$. Such a signalling system is usually also subject to a given perturbation caused by an incoming signal and this will typically be modelled by a sudden change in a system parameter or by the time dependence of the right-hand side of equation (2.1).
In regulatory and signalling systems, the values of two parameters may differ by an order of magnitude or more. Therefore, it is usually not appropriate to consider absolute changes in the parameters $k_j$, but instead to consider relative changes. A good way to do this is to introduce new parameters $k'_j = \log k_j$ because absolute changes in $k'_j$ correspond to relative changes in $k_j$. Then for small changes $\delta k_j$ to the parameters, the corresponding change to $k'_j$ is $\delta k'_j = \delta k_j / k_j$, which is scaled and non-dimensional. We adopt the convention that our non-zero parameters $k_j$ are henceforth these logged parameters $k'_j$. In fact, the theory applies equally well to the unscaled parameters but in our examples we always use logged parameters for the reason given above.
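As a small numerical check of the relation between changes in logged parameters and relative changes in the original ones, the following sketch applies the same 1% relative change to two rate constants of very different size. The parameter values are hypothetical and chosen only for illustration; they are not taken from any of the models discussed here.

```python
import math

# Hypothetical rate constants differing by orders of magnitude (illustrative only).
k = [0.002, 3.5]
# The same 1% relative change applied to each parameter.
dk = [0.01 * kj for kj in k]

# Resulting change in the logged parameter k'_j = log k_j.
dk_log = [math.log(kj + dkj) - math.log(kj) for kj, dkj in zip(k, dk)]

# To first order the change in log k_j equals dk_j / k_j, whatever the size of k_j.
for kj, dkj, dlj in zip(k, dk, dk_log):
    print(f"k={kj:g}  dk/k={dkj / kj:.4f}  d(log k)={dlj:.4f}")
```

Both parameters show essentially the same change in the logged variable, which is why perturbations are compared on the log scale.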
We are interested in the size of the variation of a constraint as parameters are varied. However, in the biological problems we are interested in, the optimal value $C_i^0$ of a constraint $C_i$ is determined by data and this is always only estimated. Therefore, we need to take account of the standard error $\varepsilon_i^0$ (i.e. the estimated standard deviation of the mean) of the experimentally observed values of $C_i^0$. The constraint $C_i$ is not effective if the variation is small compared with $\varepsilon_i^0$. For our presentation, it is convenient to always normalize the constraints by replacing $C_i$ by $C_i / \varepsilon_i^0$. Therefore, in what follows, by a constraint, we always mean one that has been normalized by the standard error of the corresponding data.

Constraints and their value
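We first consider the behaviour of the constraints about a parameter value $k = k^*$. According to Taylor's theorem, the local variation of $C_i(k)$ about $k = k^*$ is given by

$$C_i(k^* + \delta k) = C_i(k^*) + c_i \cdot \delta k + O(\|\delta k\|^2),$$

where $c_i = (c_{i,1}, \ldots, c_{i,s})$ is the derivative of $C_i$ at $k^*$ and $\cdot$ denotes the usual dot product between vectors. Thus, to first order, the change in the constraint is $\delta C_i = c_i \cdot \delta k$. We therefore call the vectors $c_i$ in $\mathbb{R}^s$ the linearized constraints.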
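A linearized constraint can be approximated numerically when no analytical derivative is available. The sketch below uses central differences in the logged parameters and divides by a standard error so that the result is normalized. The toy constraint (the steady-state level of $dx/dt = a - bx$), the parameter values and the standard error are all assumptions made for illustration.

```python
import math

def linearized_constraint(C, k_log, std_err, h=1e-6):
    """Central-difference gradient of C at k_log, scaled by the standard error."""
    c = []
    for j in range(len(k_log)):
        up = list(k_log); up[j] += h
        dn = list(k_log); dn[j] -= h
        c.append((C(up) - C(dn)) / (2 * h * std_err))
    return c

# Toy constraint (an assumption for illustration): steady-state level a/b of
# dx/dt = a - b*x, written in terms of the logged parameters (log a, log b).
def steady_state(k_log):
    return math.exp(k_log[0] - k_log[1])

k_star = [math.log(2.0), math.log(0.5)]   # a = 2, b = 0.5, so a/b = 4
c = linearized_constraint(steady_state, k_star, std_err=0.4)
norm = math.sqrt(sum(cj ** 2 for cj in c))
print(c, norm)
```

For a full ODE model the function `C` would wrap a numerical simulation, and sensitivity equations would usually be preferable to finite differences for accuracy.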

If we have a set of constraints $C_i$, $i = 1, \ldots, m$, with associated linearized constraints $c_i$ at $k = k^*$, there are two ways in which they can be ineffective. Firstly, such a constraint $C_i$ might be insensitive to variation in the parameters at $k^*$, which means that $c_i$ will have small norm $\|c_i\|$, where $\|c_i\|^2 = \sum_{j=1}^{s} c_{i,j}^2$. This is because, up to the second-order terms $O(\|\delta k\|^2)$,

$$|C_i(k^* + \delta k) - C_i(k^*)| = |c_i \cdot \delta k| \le \|c_i\| \, \|\delta k\|.$$

One might think that constraints would be chosen to ensure that they did not have a small norm. However, in figure 1, we show the norms of the constraints that were chosen for an important model of the circadian clock. We see that many have very small norms. This is not some mistake on the part of the authors but, surprisingly, is inevitable for systems like this as we explain below. Even more importantly, the theory we present explains why we should expect that large sets of constraints are often strongly non-independent.
The second way is that a linearized constraint in this set might not be very independent of the other linearized constraints in it because it is very close to being a linear combination of them. This is a problem because then this constraint will be largely determined by the others. We now give a precise description of this.
Suppose that we have such a set of linearized constraints $c_1, \ldots, c_m$. For any other linearized constraint $c = c_{m+1}$ define $r(c \mid c_1, \ldots, c_m)$ to be the unique vector orthogonal to $c_1, \ldots, c_m$ such that

$$c = r(c \mid c_1, \ldots, c_m) + \sum_{i=1}^{m} a_i c_i$$

for some $a_1, \ldots, a_m$. Then for the constraint $C$ to be effective when we have already applied the other constraints $C_i$, we need that $\|r(c \mid c_1, \ldots, c_m)\|$ is not too small. This is because in changing $\delta k = k - k^*$ to optimize the value of $c$, the part $c' = \sum_{i=1}^{m} a_i c_i$ is not allowed to change as it is determined by the other constraints $C_i$. Therefore, only $r(c \mid c_1, \ldots, c_m)$ can change and, if this has a small norm, the constraint $C$ only has small variation around $k^*$.
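Let us explain this in a little more detail. Suppose that the $C_i$ satisfy $C_i(k^*) = C_i^0$ for $i = 1, \ldots, m$. Suppose also that we now want to add the new constraint $C_{m+1}$ for which $C_{m+1}(k^*) \neq C_{m+1}^0$ and tune $k^*$ to $k'^*$ so that $C_i(k'^*) = C_i^0$ for $i = 1, \ldots, m+1$. To first order, the change $\delta k = k'^* - k^*$ must satisfy $c_i \cdot \delta k = 0$ for $i \le m$, so that $C_{m+1}^0 - C_{m+1}(k^*) = c_{m+1} \cdot \delta k = r(c_{m+1} \mid c_1, \ldots, c_m) \cdot \delta k$ and hence $\|\delta k\| \ge |C_{m+1}^0 - C_{m+1}(k^*)| / \|r(c_{m+1} \mid c_1, \ldots, c_m)\|$. Thus, if $r(c_{m+1} \mid c_1, \ldots, c_m)$ is small, the parameter change that will be needed will be much larger than the change in $C_{m+1}$ that is required.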
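We therefore make the following definition. The constraint value of $C$ relative to $C_1, \ldots, C_m$ is $\mathrm{val}(c \mid c_1, \ldots, c_m) = \|r(c \mid c_1, \ldots, c_m)\|$, where all constraints have been normalized by their standard errors. The use of the standard error for normalization was discussed above. It is also important to note that it sets a natural scale, which is necessary because if the normalization is omitted, then it is possible by scaling to trivially increase the constraint value because $\mathrm{val}(\lambda c \mid c_1, \ldots, c_m) = \lambda \, \mathrm{val}(c \mid c_1, \ldots, c_m)$ for all $\lambda > 0$. Importantly, its use makes the constraint value non-dimensional.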
Moreover, we emphasize that the practical use of constraints such as $C_i(k) = C_i^0$ rarely requires exact values for $C_i^0$. All the applications we consider only require reliable estimates of the order of magnitude of $C_i^0$. Thus only approximate determination of the standard errors $\varepsilon_i^0$ is usually required. For some of the data in the examples we discuss, standard errors were not available but we could estimate the standard deviation and therefore we approximated the standard error by this.
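The residual vector $r(c \mid c_1, \ldots, c_m)$ is an orthogonal projection, so it can be computed with a least-squares solve. The following sketch assumes that the constraint value is measured by the norm of this residual, as the discussion above suggests, and uses three hypothetical normalized linearized constraints in a three-parameter space; the function names are illustrative.

```python
import numpy as np

def residual(c, applied):
    """Component of c orthogonal to the span of the rows of `applied`."""
    applied = np.atleast_2d(applied)
    # Least-squares coefficients a_i such that sum_i a_i c_i is closest to c.
    a, *_ = np.linalg.lstsq(applied.T, c, rcond=None)
    return c - applied.T @ a

def constraint_value(c, applied):
    """Norm of the residual (assumed here to be the constraint value)."""
    return np.linalg.norm(residual(c, applied))

# Hypothetical normalized linearized constraints (illustrative values).
c1 = np.array([1.0, 0.0, 0.0])
c2 = np.array([0.0, 2.0, 0.0])
c_new = np.array([3.0, 4.0, 0.1])   # nearly a combination of c1 and c2

print(np.linalg.norm(c_new))                 # large norm on its own
print(constraint_value(c_new, [c1, c2]))     # small value once c1, c2 are applied
```

The example shows the second kind of ineffectiveness: `c_new` has a large norm, yet almost all of it lies in the span of the constraints already applied.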
We will say that a set of constraints is non-degenerate if they are linearly independent. Many of the results that we discuss rely on an analysis of the matrix $M = M(c_1, \ldots, c_m)$ whose $i$th row is the vector $c_i$. In particular, the constraints are non-degenerate if the rank of $M$ is maximal, i.e. $m$.
For an ordering $c_1, \ldots, c_m$ of the constraints, write $v_i = \mathrm{val}(c_i \mid c_1, \ldots, c_{i-1})$ for the constraint value of the $i$th constraint given those before it. Then an important result (electronic supplementary material, theorem S2) is that one can reorder the constraints so that $v_1$ is maximal among all such orderings and so that $v_1 \ge v_2 \ge \cdots \ge v_m$. This ordering is effectively unique subject to the possibility that there might be multiple constraints with the same $v_i$. We say that such a set of constraints is ordered. Electronic supplementary material, theorem S2, provides a fast way to calculate the ordering of the constraint set using the LQ-decomposition of a matrix.
rsif.royalsocietypublishing.org J. R. Soc. Interface 12: 20141303
A set of constraints can only be non-degenerate if $m \le s$. However, the result about ordering the constraint values works equally well for the case $m > s$. This can be thought of as an optimal choice of a subset of $s$ independent constraints. The remaining constraints will be linearly dependent upon this subset.
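One simple way to obtain an ordering with non-increasing constraint values is a greedy, pivoted Gram-Schmidt procedure: at each step, pick the constraint with the largest residual norm after removing the directions already chosen. The sketch below is an illustration under that assumption; whether it coincides in every detail with the LQ-based method of theorem S2 is not established here, and the function name and example matrix are hypothetical.

```python
import numpy as np

def order_constraints(M, tol=1e-12):
    """Greedy ordering: repeatedly pick the constraint with the largest
    residual norm after projecting out those already chosen."""
    R = np.array(M, dtype=float)          # working copy of the rows c_i
    remaining = list(range(R.shape[0]))
    order, values = [], []
    while remaining:
        norms = [np.linalg.norm(R[i]) for i in remaining]
        best = remaining[int(np.argmax(norms))]
        v = max(norms)
        order.append(best)
        values.append(v)
        remaining.remove(best)
        if v > tol:
            q = R[best] / v               # unit vector along the new direction
            for i in remaining:           # remove that direction from the rest
                R[i] -= (R[i] @ q) * q
    return order, values

# Hypothetical matrix of normalized linearized constraints (one per row).
M = [[1.0, 0.0, 0.0],
     [3.0, 4.0, 0.1],
     [0.0, 2.0, 0.0]]
order, values = order_constraints(M)
print(order, values)
```

The same routine also handles $m > s$: once $s$ independent directions have been chosen, the remaining residuals are numerically zero, which matches the remark that the remaining constraints depend linearly on the chosen subset.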

Important exemplar systems: clocks and signals
In this section, we demonstrate the practical use of our mathematical tools using three key examples. Our examples showcase the wide applicability of our methodology both in terms of systems and the type of constraint.

Pokhilko 2012 model of the plant circadian clock
An important recent model of the plant circadian clock from [6] consists of $n = 28$ variables representing the levels of the following: mRNA and protein of the genes LHY, CCA1, TOC1, PRR9, PRR7, NI, LUX and ELF4; ZTL protein, LHY modified protein; mRNA of ELF3 and GI, cytoplasmic proteins of ELF3, GI, COP1; nuclear proteins of ELF3; GI and COP1 in day and night forms; and the cytoplasmic protein complexes ELF3-GI, GI-ZTL (ZG) and nuclear protein complexes ELF3-GI, ELF3-ELF4 and EC. The model has a complex structure incorporating multiple positive and negative feedback loops with the interaction between components described by $s = 104$ parameters. It has been constrained by an impressively large collection of experimental data from the plant Arabidopsis thaliana with various genetic backgrounds and tested under different environmental conditions. Most of the parameters are fitted to the biological data or their values are taken from earlier models that were likewise fitted to data. Six of the parameters represent Hill coefficients whose values were not fitted: instead they were fixed either for the sake of simplicity or taken to correspond to the experimental evidence of protein dimerization in some of the gene interactions. We assume that these six do not form part of the set of parameters that can be perturbed.
The model parameter fitting procedure aimed (i) to minimize the deviation of model simulated mRNAs from the normalized experimental data for nine key genes in WT plants under cycles of 12 h of light followed by 12 h of dark (denoted 12 L : 12 D) and (ii) to fit the clock oscillation period in plants in different GE-backgrounds where the plant is either in constant darkness or constant light (denoted DD or LL). The various GE-combinations and constraints are shown in table 1.
We now briefly outline how the constraints in part (i) are translated to our framework of constraints. We refer to the WT 12 L : 12 D GE-combination as $\kappa_1$ (table 1). We let $g_{\kappa_1}(t, k)$ be the model solution representing the mRNA and protein levels of the system described by equation (2.1) with $\kappa = \kappa_1$. The biological data for a particular mRNA represented by the $j$th variable in the model gives time-series measurements $m(t_1), \ldots, m(t_l)$, and each time point $t_i$ gives a constraint of the form $g_{j,\kappa_1}(t_i, k) = m(t_i)$ (equation (3.1)), whose linearization is $\partial g_{j,\kappa_1}(t_i, k)/\partial k$. The mRNA profiles of nine genes measured at different time points of the light : dark cycle result in 82 linearized constraints of the type above. A detailed breakdown of the constraints coming from each gene measurement is left to the electronic supplementary material.
The reader might wonder why we do not use a constraint on the vector $v = (m(t_1), \ldots, m(t_l))$ rather than regarding these as individual constraints, as we do. The key point is that these measurements $m(t_i)$ will be highly correlated. We could assign a constraint value to the vector $v$ but this would lose the information that some time points have much greater constraint value than others. This is confirmed in electronic supplementary material, table S2, because the constraint values of the individual constraints on the mRNA levels can vary by an order of magnitude or more. In this example and others, one can drop a majority of the time points with hardly any loss in accuracy.
The second type of constraints comes from values of free-run periods of the plants in different GE-backgrounds (cf. table 1). The constraint $C(k)$ is the period $\tau(k)$ of the model solution $g_{\kappa_2}(t, k)$ for a GE-background $\kappa_2$ (WT plant under LL). Thus, the linearization is $c(k) = \partial\tau/\partial k$. This can be expressed in terms of the solution $g_{\kappa_2}(t, k)$ using any variable (e.g. the $j$th variable), as follows from [7]: if $x_{0,\kappa_2}$ is a point on the limit cycle, the corresponding solution determines $\partial\tau/\partial k$ through an explicit expression (equation (3.2)). The Pokhilko model matched the period data for the clock in eight different GE-combinations [6]. A breakdown of period data fitted and reproduced without fitting is listed in table 1. Out of the eight models (each associated with one GE-combination), only four had long-term stable oscillations and thus only their period profiles can be translated to our framework to make up four linearized constraints. These are the period constraints of GE-combinations $\kappa_2$, $\kappa_4$, $\kappa_5$ and $\kappa_7$. Several of the GE-combinations in table 1 describe a plant model with a mutant genetic background. To convert a WT model to a mutant model, the convention is to set the translation rate of the knocked-out gene, $k_j$, to be either zero or sufficiently close to zero. While for a WT model, all parameters can be perturbed, in mutant models, we do not allow the translation rate of the knocked-out gene to be perturbed, as this rate essentially describes the mutant model. This means that for a constraint $c_i$ of a mutant model, we must set the corresponding ($j$th) entry to zero, i.e. $c_{i,j}(k) = 0$.
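The two steps described above, differentiating a period with respect to the logged parameters and zeroing the entry for a parameter that is fixed in a mutant, can be sketched as follows. The closed-form period is a toy stand-in (a linear oscillator, not a limit cycle), finite differences replace the analytical expression of [7], and the parameter index treated as the knocked-out translation rate is hypothetical.

```python
import math

def period(k_log):
    # Toy stand-in (assumption): the linear oscillator x' = k1*y, y' = -k2*x
    # has period 2*pi/sqrt(k1*k2); a real clock model would need simulation.
    return 2 * math.pi * math.exp(-0.5 * (k_log[0] + k_log[1]))

def period_gradient(k_log, h=1e-6):
    """Central-difference estimate of d(period)/d(log k_j)."""
    grad = []
    for j in range(len(k_log)):
        up = list(k_log); up[j] += h
        dn = list(k_log); dn[j] -= h
        grad.append((period(up) - period(dn)) / (2 * h))
    return grad

def mask_knockout(c, fixed):
    """Zero the entries for parameters that may not be perturbed in a mutant."""
    return [0.0 if j in fixed else cj for j, cj in enumerate(c)]

k_star = [math.log(0.3), math.log(0.2)]
c_full = period_gradient(k_star)
# In practice the gradient would be computed on the mutant model itself; the
# toy gradient is reused here only to show the masking step (index 1 is a
# hypothetical knocked-out translation rate).
c_mut = mask_knockout(c_full, fixed={1})
print(c_full, c_mut)
```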
When calculating these constraints, we noted that, before normalization, many had very small norms as is shown in figure 1. As a small norm means that the constraint only varies by a small amount when parameters are varied, such constraints are ineffective and therefore should not be used. We return to this in §4 where we explain why it is reasonable to expect that many constraints will be forced to have a small norm.
We scale the constraints by standard errors in the case of time-series measurements and by standard deviations or standard errors (depending on which is available in the literature) in the case of the period constraints. As the time series are normalized so that peak value is 1, the standard errors are normalized by the same factor as the time series. More details are given in the electronic supplementary material.
In total, the Pokhilko model has 86 linearized constraints and the constraint values of the ranked linearized constraints are exponentially decreasing (figure 2). The ranking reveals that there is little value in using more than the top 32 constraints (inset in figure 2) as the remaining 54 constraints have constraint values of less than 1% of the top ranked constraint value.
The top four constraints are the constraints from GE-combination $\kappa_1$ relating to the levels of GI mRNA, and the period constraints of the model for GE-combinations $\kappa_2$ and $\kappa_5$ (described in table 1). The full list of the top 20 constraints is given in electronic supplementary material, table S2. All four period constraints of the model feature in the list of top 20 constraints.
Ranking also reveals that there is a large jump in the constraint values, with the six ranked lowest having near-zero constraint values. This set of six constraints comprises a combination of constraints on LUX and ELF4 mRNA levels of a model of GE-combination $\kappa_1$ (WT plant entrained to 12 L : 12 D). Closer inspection of the constraints reveals that the linearized constraints on LUX mRNA levels are nearly identical to the constraints on ELF4 mRNA levels. This is not surprising, as the ODEs describing these two mRNAs are almost identical. They share the same transcription term and kinetic constants of (linear) degradation rates and therefore these constraints are effectively identical.
We also check how much the ranking of the constraints changes when we perturb the model parameters. Each new parameter set is obtained by perturbing every parameter $k_j$ from its original value (from [6]) by adding a perturbation which is normally distributed with mean zero and standard deviation $0.05\,k_j$ (further details are described in the electronic supplementary material). Under these parameter perturbations, the models appear to maintain a near-identical constraint ranking of the top 10 constraints to the ranking of the original model. Figure 3 shows the top 40 constraint values for 10 models simulated under different parameter sets $P_i$ chosen in this way with the top 10 constraints of the original model ($P_1$) identified by crosses shaded in blue (with progressively darker shading indicating lower rank of the associated constraint). These same constraints were identified in the other 10 models ($P_i$, $i = 2, \ldots, 11$) and they appear to feature mainly among the top 10 constraints and to mainly preserve the rank order. It is also worth noting that aside from the very similar rank order, each of the 10 models preserved the exponential decay in the constraint values of the constraints. This result indicates that the rankings and the rate of decay of constraint values are robust to parameter perturbations.
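The perturbation scheme described above is straightforward to reproduce. The sketch below draws perturbed parameter sets with standard deviation $0.05\,k_j$ and includes a small helper for comparing the top of two rankings; the parameter values, the helper `rank_overlap` and the choice of overlap as a summary are illustrative assumptions rather than the procedure used for figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(k, rel_sd=0.05, rng=rng):
    """Add N(0, (rel_sd * k_j)^2) noise to every parameter k_j."""
    k = np.asarray(k, dtype=float)
    return k + rng.normal(0.0, rel_sd * np.abs(k))

def rank_overlap(order_a, order_b, top=10):
    """Fraction of the top-ranked constraints of A that are also in the top of B."""
    return len(set(order_a[:top]) & set(order_b[:top])) / top

# Hypothetical original parameter values (not those of [6]).
k = np.array([0.1, 2.0, 30.0])
samples = np.array([perturb(k) for _ in range(10)])
print(samples / k)   # relative sizes scatter around 1 with spread of about 5%
```

For each perturbed parameter set one would recompute the linearized constraints, reorder them, and compare the resulting ranking with that of the original model, for example with `rank_overlap`.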

Locke 2006 model of Arabidopsis thaliana circadian clock
The Locke 2006 model [8] is an earlier plant clock model that describes interaction of a subset of genes from the Pokhilko model and has $n = 16$ state variables. The model has $s = 77$ parameters, most of which correspond to various kinetic rates and all of which can be perturbed, as even the Hill coefficients are fitted (cf. Pokhilko model). It is interesting to consider this alongside the Pokhilko model because it is fitted to qualitative features of the data, for example the shape of the mRNA expressed through broadness of the troughs and sharpness of the peaks. Other features fitted include amplitude of oscillations, timing of peak and trough mRNA levels, and period of oscillations. A typical constraint on oscillation period was given in the previous subsection (equation (3.2)), while a constraint on amplitude can easily be obtained from constraints on solution levels (equation (3.1)). To define broadness of peaks and troughs for a variable of interest, we followed the description in [8]. Locke et al. describe the difference in the value of a particular variable in $g_\kappa(t, k)$ 2 h before and after the peak value time. For a sharp peak, the expectation is for the variable levels to fall quickly on either side of the peak. Consider the $j$th variable of $g_\kappa(t, k)$ and define the ratio of how fast the level of $g_{j,\kappa}(t, k)$ falls 2 h after the time $\phi_j$ of its peak by $g_{j,\kappa}(\phi_j + 2, k) / g_{j,\kappa}(\phi_j, k)$. Together with the constraint on the timing of the peak, this ensures that the ratio of levels of the $j$th variable at the two time points is fixed. As we can calculate the partial derivative with respect to parameters of any variable of solution $g_{j,\kappa}(t, k)$ at any time point, we can derive similar constraints to fix the ratio of levels at any time. Note that the constraint of the peak timing (equation (3.3)) can also be obtained from the partial derivatives $\partial g_{j,\kappa}/\partial k$, [7] q.v.
This mathematical description and the full description of the other constraints listed above are given in the electronic supplementary material.
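A peak-sharpness feature of the kind described above, the ratio of the level 2 h after the peak to the level at the peak, can be evaluated from a sampled time series as in the following sketch. The function name, the linear interpolation and the sample profile are assumptions made for illustration; the exact definition used for the Locke model is the one given in the electronic supplementary material.

```python
def fall_ratio(times, levels, lag=2.0):
    """Ratio of the level `lag` hours after the peak to the level at the peak,
    using linear interpolation between samples."""
    i_peak = max(range(len(levels)), key=levels.__getitem__)
    t_target = times[i_peak] + lag
    for i in range(i_peak, len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 <= t_target <= t1:
            w = (t_target - t0) / (t1 - t0)
            level = (1 - w) * levels[i] + w * levels[i + 1]
            return level / levels[i_peak]
    raise ValueError("time series ends before peak time + lag")

# Hypothetical hourly mRNA profile peaking at t = 3 h (illustrative values).
times = [0, 1, 2, 3, 4, 5, 6]
levels = [0.1, 0.4, 0.8, 1.0, 0.7, 0.3, 0.1]
print(fall_ratio(times, levels))
```

A small ratio indicates a sharp peak. Applying the same function to a simulated solution at perturbed parameters gives a finite-difference estimate of the corresponding linearized constraint.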
The full list of GE-combinations and the constraints is given in the electronic supplementary material. Construction of mutant models and their constraints follows closely the description we outlined above for the Pokhilko model. In the mutant versions of the Locke model, whole sub-networks of the clock can become non-functional, i.e. multiple model variables converge to the zero equilibrium. This means that the relevant model structure can be reduced and the effect of fewer parameters needs to be considered in the constraints (i.e. more entries of the linearized constraints can be set to zero). The full list of these types of reductions for Locke mutant models is outlined in the electronic supplementary material.
The data presented in [8] does not have any error bars, so it is not possible to extract any error measurements pertaining to the shape of the oscillations and their peak times. We describe how we determined the s.e. in the electronic supplementary material.
The Locke model has 24 linearized constraints and their ranking according to constraint value also shows an exponential decrease (figure 4).
Only the top 17 constraints show any significant constraint value, with the value of the 17th constraint at 1.96% of the highest value. The top five constraints are associated with the period of the WT clock in LL, LHY/CCA1 amplitude in WT 12 L : 12 D, the period of the lhy/cca1 mutant clock in DD, the level of LHY/CCA1 fall after peak in the toc1 mutant clock in 12 L : 12 D and the period of the toc1 mutant plant in DD. The full list of top constraints is presented in electronic supplementary material, table S4. It is worth noting that the period constraints for WT LL and both mutants, lhy/cca1 and toc1, feature at the top of the rankings. It is not possible to compare the ranking of the constraints from the Locke and Pokhilko models because, even though they do model the same biological system, their constraints are very different. However, it is worth noting that the two constraints that feature in both models (periods of the WT and the toc1 mutant clocks in LL conditions) are featured at the top of both rankings.

NF-κB signalling system
We consider the model of the NF-κB system from [9]. The solution of interest is a transient solution describing the oscillations in the level of cytoplasmic and nuclear NF-κB concentration resulting from an incoming signal of tumour-necrosis factor-α (TNFα). The system is compared to experiments where cells were subjected to constant and pulsatile TNFα signals for total periods of approximately 600 min. Salient characteristics of the ratio of nuclear to cytoplasmic NF-κB (henceforth denoted by N : C NF-κB) were identified and the model constructed to match the observed characteristics. N : C NF-κB does not feature as a variable of the NF-κB model, hence we introduce an additional ODE for the dynamics of N : C NF-κB and consequently the number $n$ of state variables is 16. Other state variables include those describing cytoplasmic and nuclear NF-κB, IκBα, their complexes, and also the A20 gene and the kinase IKK and its activated and inactivated states. The IKK system is activated downstream of the TNFα receptors and this, in turn, causes phosphorylation and subsequent degradation of IκB, freeing NF-κB to enter the nucleus. This activates IκBα transcription and the subsequent production of IκBα protein that binds the nuclear NF-κB and pulls it back into the cytoplasm, restarting the cycle.
The model has $s = 28$ parameters, most of which are rate constants. Parameter values were fitted to match the observed N : C NF-κB oscillatory responses such as peak timing, persistence in oscillations and specific decay in oscillation amplitude. The various GE-combinations and the full list of associated constraints are shown in the electronic supplementary material.
The NF-κB model of Ashall et al. [9] was fitted using specific cost functions (described in electronic supplementary material, table S5 of [9]). A score of 1 is set to approximately match 1 s.d. from the mean of the respective feature to be matched. From this information, we could extract the standard errors which we used to scale the model constraints. Further details are given in the electronic supplementary material.
The key observed characteristics of the model translate to 25 linearized constraints, details of which are given in the electronic supplementary material. Some of the target characteristics for model parameter fitting that are outlined in [9] could be eliminated (more information about that elimination is given in electronic supplementary material). The 25 linearized constraints are ranked in order of decreasing constraint values in figure 5 and they show an exponential decrease in constraint values. Only the top seven constraints (listed in table 2) have any significant constraint value (i.e. their own values are higher than 1% of the top value).
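The 1% cut-off used in the three examples above amounts to counting how many ranked constraint values exceed a fixed fraction of the largest one. A minimal sketch, using hypothetical exponentially decaying values rather than the published ones:

```python
def count_significant(values, fraction=0.01):
    """Number of ranked constraint values exceeding `fraction` of the largest."""
    top = max(values)
    return sum(1 for v in values if v > fraction * top)

# Hypothetical exponentially decaying constraint values (not the published ones).
values = [100.0 * 0.8 ** i for i in range(40)]
print(count_significant(values))
```

With a geometric decay the count grows only logarithmically as the threshold fraction is lowered, which is one way to see why adding many more constraints contributes little.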

Ordered constraints tend to have rapidly decreasing constraint values and many unnormalized constraints have a small norm
The examples of §3 manifest the two properties mentioned at the beginning of the paper, namely that the constraint values decrease rapidly and that many unnormalized constraints have a small norm. We now explain why this is the case. It has been observed [7] that a large class of models of regulatory and signalling systems of the sort that we are considering have the following property: there is (i) a rapidly decreasing sequence of $s$ positive numbers, $\sigma_1 \ge \cdots \ge \sigma_s$, (ii) $s$ $n$-dimensional time series defined for $0 \le t \le T$, $U_i(t) = (U_{i,1}(t), \ldots, U_{i,n}(t))$, $i = 1, \ldots, s$, which are of unit length and orthogonal to each other in the $L^2$ sense, and (iii) an orthogonal $s \times s$ matrix $W$, such that for any change in parameters $k \to k + \delta k$, the corresponding change $\delta g$ in the solution of interest is

$$\delta g(t) = \sum_{i=1}^{s} \lambda_i \sigma_i U_i(t) + O(\|\delta k\|^2), \tag{4.1}$$

where $\lambda_i = \sum_j W_{ij} \, \delta k_j$. This result comes from the singular value decomposition $U D V^{T}$ of the linearization of the map from parameters $k$ to the solution of interest. The columns of $U$ are the time series $U_i(t)$, $V = W^{T}$ is an $s \times s$ orthogonal matrix and $D$ is a diagonal matrix with entries $\sigma_i$. In [7], this observation is formulated for a model with a single GE-combination; however, the same decomposition will apply for a model where multiple GE-combinations are integrated, though here the decay of the singular values $\sigma_i$ may be slower. The relevant result in [7] is expressed as in (4.1). However, such a result is implicitly contained in the earlier papers [10,11] because the $\sigma_i^2$ are eigenvalues of the Fisher information matrix (FIM) discussed there. They showed that they decrease quickly in some systems biology models. This observation was developed further in [12-14]. The decay was also found early on in the context of circadian clocks in [15,16].
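The decomposition behind (4.1) can be illustrated on a discretized sensitivity matrix. In the sketch below the matrix `S` stands in for the linearization of the map from parameters to the sampled solution; it is synthetic, built with prescribed rapidly decaying singular values, and orthonormality is in the discrete Euclidean sense rather than the $L^2$ sense. All sizes and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, s = 200, 6   # stacked time samples of all variables, and parameters

# Hypothetical sensitivity matrix S = dg/dk with rapidly decaying singular
# values, built only to illustrate the decomposition.
sigma_true = 10.0 ** -np.arange(s)
Q1, _ = np.linalg.qr(rng.normal(size=(n_samples, s)))
Q2, _ = np.linalg.qr(rng.normal(size=(s, s)))
S = Q1 @ np.diag(sigma_true) @ Q2.T

U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
W = Vt                                  # W = V^T, so lambda = W @ dk

dk = 1e-3 * rng.normal(size=s)          # small change in the (logged) parameters
lam = W @ dk
dg = U @ (sigma * lam)                  # sum_i lambda_i sigma_i U_i

print(np.allclose(dg, S @ dk))
print(sigma / sigma[0])                 # relative decay of the singular values
```

Because the singular values decay quickly, only the first few terms of the sum contribute appreciably to the change in the solution, whatever the direction of the parameter change.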
It is shown in the electronic supplementary material that for such a model under very general conditions, an ordered set of constraints $c_1, \ldots, c_m$ will have […]

Table 2. Top 7 ranked constraints of the NF-κB model of [9].

As the reader will see from the figures, there is no natural gap in the constraint values $v_i$. However, there is a natural cut-off given by $v_i \approx 1$ because if $v_i \ll 1$, the unconstrained variation defined by the constraint value is small compared with the uncertainty in the constraint.

We mentioned above that many constraints that have been used to analyse the systems above have linearizations with a very small norm. When we observed this, we realized that one can argue that this is a consequence of equation (4.1). As the constraints $C_i$ are functions of the solution of interest $g$, i.e. $C_i(k) = D_i(g(\cdot, k))$, it follows from (4.1) that the linearized constraints satisfy $c_i = \tilde{D}_i \cdot W$, where $\tilde{D}_i = (\sigma_1\, d_i \cdot U_1, \ldots, \sigma_s\, d_i \cdot U_s)$ and $d_i$ is the derivative of $D_i$ with respect to $g$ evaluated at $g(\cdot, k^*)$ (see section 3 of the electronic supplementary material). Therefore, as $W$ is orthonormal,

$$\|c_i\|^2 = \|\tilde{D}_i\|^2 = \sum_{j=1}^{s} \sigma_j^2\, (d_i \cdot U_j)^2.$$

As the $U_i$ are orthogonal, it seems reasonable to assume that $d_i \cdot U_j$ is uncorrelated with $d_i \cdot U_\ell$ if $j \neq \ell$. If we assume that the norms of the $d_i$ are $O(1)$, then we can model the $D_i$ as random $s$-dimensional vectors with $O(1)$ norm. As is explained in the electronic supplementary material, it follows from this that, with high probability not less than $1 - O(e^{-\epsilon s/4})$, […]

In the electronic supplementary material, this is illustrated with an example showing the expected distribution of norms $\|c_i\|$ under these assumptions. As constraints are not very useful if their norm is small compared with their uncertainty, we already get some very useful information by just checking these norms. Indeed, we see that about 75% of the constraints on the Pokhilko 2012 model have norms less than 10% of the norm of the constraint with the greatest norm. For the Locke and NF-κB models, about 50% of the constraints are this small.
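The small-norm argument can be explored with a quick simulation. The sketch below is illustrative only and rests on stated assumptions: the singular values decay geometrically (the rate `decay` is hypothetical), the coefficients $d_i \cdot U_j$ are modelled as the entries of a random unit vector, and $W$ is a random orthogonal matrix. It checks that $\|c_i\|$ equals the norm of the vector $(\sigma_j\, d_i \cdot U_j)_j$ and reports what fraction of constraints has a norm below 10% of the largest.

```python
import numpy as np

def simulate_constraint_norms(s=40, m=200, decay=0.25, seed=0):
    """Norms of m synthetic linearized constraints on s parameters."""
    rng = np.random.default_rng(seed)
    sigma = 10.0 ** (-decay * np.arange(s))      # assumed rapidly decaying singular values
    # Row i stands in for (d_i . U_1, ..., d_i . U_s), modelled as a random unit vector
    A = rng.standard_normal((m, s))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    # Random orthogonal matrix playing the role of W
    W, _ = np.linalg.qr(rng.standard_normal((s, s)))
    D_tilde = A * sigma                          # row i is (sigma_j * d_i . U_j)_j
    c = D_tilde @ W                              # linearized constraints c_i
    return np.linalg.norm(c, axis=1), np.linalg.norm(D_tilde, axis=1)

norms, norms_direct = simulate_constraint_norms()
fraction_small = float(np.mean(norms < 0.1 * norms.max()))
print(f"fraction of constraints with norm below 10% of the largest: {fraction_small:.2f}")
```

The fraction printed depends on the assumed decay rate and random model, so it should not be read as a prediction for any of the models discussed here.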

Geometric shape of the approximate solution set
We provide a geometric interpretation of constraint value when $m \leq s$ by considering the geometric shape of the approximate solution set. When $m > s$, this set might be empty. Consider the mapping $C : \mathbb{R}^s \to \mathbb{R}^m$ given by $C(k) = (C_1(k), \ldots, C_m(k))$. We assume that the corresponding linear constraints $c_1, \ldots, c_m$ are ordered and that the matrix $M = M(c_1, \ldots, c_m)$ has maximal rank. Let $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_m > 0$ be its positive singular values and let $V_i$ and $U_i$ be its right and left singular vectors. In this case, as $M$ is of maximal rank, the set $S$ of parameter values which satisfy the constraints will, near the parameter vector of interest $k^*$, be an $(s - m)$-dimensional sub-manifold of the parameter space.
In the electronic supplementary material, theorem S4, we prove that the set of parameter values that approximately satisfy the constraints tends, in a precise sense, as $\epsilon \to 0$ to the set $E^m_\epsilon$ given by

$$\sum_{i=1}^{m} \sigma_i^2 \lambda_i^2 \leq \epsilon^2, \qquad (5.1)$$

where $\lambda$ and the parameters $k$ are related by the equation

$$\lambda = W \cdot (k - k^*),$$

where $W$ is an orthogonal matrix. This orthogonality is important because it ensures that objects in the $\lambda$ coordinate system are measured on the same scale as in the original coordinate system. Therefore, $E^m_\epsilon$ is the interior of an $m$-dimensional ellipsoid with principal semi-axes of length $\epsilon/\sigma_i$ in both coordinate systems.
Therefore, we can interpret the effectiveness of the constraints as follows. The constraints only constrain the parameter values insofar as they constrain the $\lambda_i$, and the extent of this is that (i) $\lambda_1, \ldots, \lambda_m$ must satisfy equation (5.1) (i.e. $(\lambda_1, \ldots, \lambda_m)$, regarded as a point in $\mathbb{R}^m$, must be inside the ellipsoid $E^m_\epsilon$ given by equation (5.1)) and (ii) $\lambda_{m+1}, \ldots, \lambda_s$ are unconstrained.
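This geometric picture can be made concrete with a small numerical sketch. It assumes that $W$ can be taken as the transpose of the matrix of right singular vectors of $M$, so that $\lambda = W\,\delta k$ with $\delta k = k - k^*$. The function names and the toy matrix `M_demo` are hypothetical, and $M$ is assumed to have full row rank.

```python
import numpy as np

def ellipsoid_geometry(M, eps):
    """Semi-axes eps/sigma_i of the ellipsoid and an orthogonal W with lambda = W @ dk.

    Assumes M (m x s, m <= s) has full row rank, so all m singular values are positive.
    """
    M = np.atleast_2d(np.asarray(M, dtype=float))
    _, sigma, Vt = np.linalg.svd(M, full_matrices=True)
    return eps / sigma, Vt

def inside_ellipsoid(M, dk, eps):
    """True if the parameter change dk satisfies sum_i sigma_i^2 lambda_i^2 <= eps^2."""
    M = np.atleast_2d(np.asarray(M, dtype=float))
    _, sigma, Vt = np.linalg.svd(M, full_matrices=True)
    lam = Vt @ np.asarray(dk, dtype=float)
    m = sigma.size
    return bool(np.sum((sigma * lam[:m]) ** 2) <= eps ** 2)

# Toy example: two linearized constraints on three parameters
M_demo = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.5, 0.0]])
semi_axes, W_demo = ellipsoid_geometry(M_demo, eps=0.1)
print("semi-axes:", semi_axes)
# A large move along the third parameter stays inside: lambda_3 is unconstrained
print(inside_ellipsoid(M_demo, [0.0, 0.0, 7.0], eps=0.1))
```

In the toy example the strongly constraining first row gives the short semi-axis and the weaker second row the long one, while the third coordinate is not constrained at all.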
For an ordered set of linear constraints, the notion of constraint value fits nicely with this interpretation because our $m \to m+1$ Transition theorem tells us that adding a constraint $C_{m+1}$ with linearization $c_{m+1}$ and constraint value […]

Construction of optimization functions
We consider functions of the form

[…] (6.1)

and suppose that $k^*$ is a maximum of this function. If $m \geq s$, and the matrix $M = M(c_1, \ldots, c_m)$ has full rank $s$, then the structure of $\varphi$ about its minimum is given by its Hessian. The Hessian is the matrix $F$ of partial derivatives $(\partial^2 \varphi / \partial k_i \partial k_j)$ evaluated at $k^*$. Without any loss of generality, we can incorporate the coefficients $a_i$ into the constraints and thereby assume that $a_i = 1$. The second derivatives of $\varphi$ are given by

[…] (6.2)

when $k = k^*$. If $c_1, \ldots, c_m$ are the linearized constraints associated with $C_1, \ldots, C_m$ at $k^*$ and $M = M(c_1, \ldots, c_m)$, then the right-hand term of (6.2) is the $ij$th entry of the matrix $F = M^t M$, where $M^t$ is the transpose of $M$. Thus, we see that if $m < s$, then $F$ has zero eigenvalues and the Hessian of $\varphi$ is degenerate. We therefore now consider the case $m \geq s$ but mention the alternative case in a note below. Indeed, it is worth noting that in some applications (e.g. in [4,5]), the functions $\varphi$ used are of the form in (6.1) with $m < s$.

Alternatively, one can use the function $\varphi$ as an artificial likelihood and regard $P(C_1, \ldots, C_m) = \exp(\varphi)/Z$ as the (normal) distribution of the vectors $C = (C_1, \ldots, C_m)$ ($Z$ is the normalizing factor). In this case, the matrix $F$ is the FIM for the system, i.e. $F$ is the $P$-expectation of the Hessian of $-V = -\log P$, which is given by […] The inverse of the FIM provides an approximation of the covariance (i.e. the multidimensional spread around the mode) of the posterior probability distribution $P(k \mid C)$ and provides a lower bound (generally known as the Cramér-Rao bound) for the error covariance of any unbiased estimator of the true parameters.

rsif.royalsocietypublishing.org J. R. Soc. Interface 12: 20141303
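To see why the Hessian reduces to $M^t M$, consider the following worked sketch. It rests on an assumed concrete form that is not taken from the text: $\varphi(k) = \tfrac12 \sum_l C_l(k)^2$, with each $C_l$ measured as a deviation from its target so that $C_l(k^*) = 0$, and with $c_l = \nabla C_l(k^*)$ forming row $l$ of $M$.

```latex
\begin{aligned}
\varphi(k) &= \tfrac12 \sum_{l=1}^{m} C_l(k)^2
  \qquad \text{(assumed form, with } C_l(k^*) = 0\text{)}, \\[4pt]
\frac{\partial^2 \varphi}{\partial k_i\,\partial k_j}
  &= \sum_{l=1}^{m} C_l\,\frac{\partial^2 C_l}{\partial k_i\,\partial k_j}
   + \sum_{l=1}^{m} \frac{\partial C_l}{\partial k_i}\,\frac{\partial C_l}{\partial k_j}, \\[4pt]
\left.\frac{\partial^2 \varphi}{\partial k_i\,\partial k_j}\right|_{k = k^*}
  &= \sum_{l=1}^{m} (c_l)_i\,(c_l)_j \;=\; (M^{t} M)_{ij}.
\end{aligned}
```

The first sum vanishes at $k^*$ because every $C_l(k^*) = 0$, leaving only the product of first derivatives. Under the opposite sign convention, where $\varphi$ is maximized at $k^*$, the same calculation gives $-M^t M$, so $F = M^t M$ is then the negative of the Hessian.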
In either case, we are interested in the singular values $\sigma^F_i$ […] This tells us that $k$ is well constrained in the directions $V^F_i$ with $\sigma^F_i$ large and badly constrained in the directions $V^F_i$ where $\sigma^F_i$ is small. This approximation result follows from well-known results about so-called Morse functions and arguments similar to those used in electronic supplementary material, theorem S4.
In particular, we are interested in how the singular values change when we remove or add a new constraint to $\varphi$. To determine whether to add a new constraint in the case of $m \geq s$, one should first reorder the constraints using the algorithm defined by electronic supplementary material, theorem S2, but using the constraints $c_1, \ldots, c_{m+1}$ instead of $c_1, \ldots, c_m$. This is because, when reordered, the new constraint may move much higher up the list and have a greater constraint value. One can then use the reordered list of constraints $c$ […]

Although above we are restricted to the case $m \geq s$, the discussion does apply to the case $m < s$ if one restricts the allowed parameter changes to those that do not change the linear combinations $\lambda_{m+1}, \ldots, \lambda_s$ of the parameters.
In this case, we use the $m \to m+1$ Transition theorem (electronic supplementary material, theorem S5) to address how the singular values change when we remove or add a new constraint to $\varphi$. The Transition theorem tells us that adding a new constraint $C_{m+1}$ with linearization $c_{m+1}$ and constraint value $v_{m+1} = \mathrm{val}(c_{m+1} \mid c_1, \ldots, c_m)$ has the following effect. Consider the singular values $\sigma^F_1 \geq \cdots \geq \sigma^F_m$ of $F$ and the singular values $\sigma^{F'}_1 \geq \cdots \geq \sigma^{F'}_{m+1}$ of the matrix $F'$ obtained in the same way from $c_1, \ldots, c_m, c_{m+1}$. These have an interlacing property in that

$$\sigma^{F'}_1 \geq \sigma^F_1 \geq \sigma^{F'}_2 \geq \cdots \geq \sigma^F_m \geq \sigma^{F'}_{m+1}$$

and moreover, in the electronic supplementary material, we show that $\sigma^{F'}_{m+1} \leq v^2_{m+1}$ while for $1 \leq i \leq m$ […]

A simple way to characterize the effectiveness of $\varphi$ is via the condition number of the matrix $F$, which can be taken to be $\kappa_F = \sigma^F_1/\sigma^F_m$, as this determines the ratio of the lengths of the major and minor axes of the ellipsoid given by (6.3). Using the above inequalities, we see that […] Therefore, to improve the condition number, one must find new constraints whose constraint value exceeds the smallest singular value of $M$. This quantifies the usefulness of adding a new constraint: it is only useful when its reordered constraint value is relatively high. Using a low-value constraint involves extra computational cost with no significant improvement in terms of estimation utility. Moreover, because of the results of §4, finding constraints with good reordered constraint value will require careful design.
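The interlacing property and the bound on the new smallest singular value can be checked numerically. The sketch below is an illustration under assumptions rather than the theorem itself: it takes $F = M^t M$, takes the constraint value to be the norm of the component of the new constraint orthogonal to the existing ones, and uses a random matrix `M` and candidate `c_new` that are purely hypothetical.

```python
import numpy as np

def constraint_value(c_new, M):
    """Norm of the part of c_new orthogonal to the row space of M (assumed definition)."""
    coeffs, *_ = np.linalg.lstsq(M.T, c_new, rcond=None)
    return float(np.linalg.norm(c_new - M.T @ coeffs))

def nonzero_spectrum_of_F(M):
    """Non-zero singular values of F = M^t M, i.e. squared singular values of M, descending."""
    return np.linalg.svd(M, compute_uv=False) ** 2

rng = np.random.default_rng(1)
m, s = 4, 10                                  # fewer constraints than parameters
M = rng.standard_normal((m, s))               # existing linearized constraints (rows)
c_new = 0.05 * rng.standard_normal(s)         # a deliberately weak candidate constraint

v_new = constraint_value(c_new, M)
old = nonzero_spectrum_of_F(M)
new = nonzero_spectrum_of_F(np.vstack([M, c_new]))

# Interlacing: new[0] >= old[0] >= new[1] >= ... >= old[m-1] >= new[m]
interlaced = bool(np.all(new[:-1] >= old - 1e-12) and np.all(old >= new[1:] - 1e-12))
print("interlacing holds:", interlaced)
print("smallest new singular value:", new[-1], " bound v^2:", v_new ** 2)
```

Because the candidate is weak, the newly introduced singular value is tiny, which is the situation in which adding the constraint contributes little.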

Experimental optimization
Experimental design in systems biology has been discussed extensively from a number of points of view, including classical approaches using Fisher information [17], sensitivity analysis [18] and methods to maximize the expected mutual information between prior and posterior parameter distributions ([19] and references therein). In this section, we illustrate how the constraint value can be used for experimental design. The idea is that once a working model has been formulated and the current constraints $C_1, \ldots, C_m$ analysed, one can test new GE-combinations for new constraints $C$ with a high value $\mathrm{val}(c \mid c_1, \ldots, c_m)$. To do this, we formulate the model in equation (2.2) for all the relevant GE-combinations including the new one.
We now give some illustrative examples. In each case, a gene mutant is simulated by setting the corresponding translation rate to zero and not allowing this rate to change when perturbing the parameters.

Pokhilko 2012 model and the prr9 and ni mutants
No constraints for these mutants were used in formulating the Pokhilko 2012 model. One can therefore ask whether an experiment on the mutants would add value. When such an experiment is being considered, its value for our purposes can be predicted using our techniques, and this prediction can help assess the priority of the experiment. In fact, the periods of these mutants are already known [20], but the discussion still illustrates how our approach would be used had the data not been available. Moreover, given that we have the data, we can also ask whether it adds value and whether one should put in the effort to reparametrize the model to match it. Therefore, using the Pokhilko 2012 model, we simulated the prr9 mutant and ni mutant models in constant light. Salomé & McClung [20] have measured the periods of both the prr9 and prr5 mutants (prr5 is a proxy for our ni component) in three different clock markers. They estimate the period of the prr9 mutant to range from 25.3 h ± 0.1 s.e. to 26.2 h ± 0.4 s.e. for the different markers. The model prr9 mutant period is 23.90 h, slightly shorter than the estimated periods. The period constraint for the prr9 mutant is calculated as explained above and is scaled by the larger s.e. (i.e. 0.4). When compared with the other model constraints, the scaled prr9 period constraint has a very high constraint value. In order of decreasing value, it is the fourth highest constraint, with a constraint value approximately 26% of the top constraint value.
Salomé & McClung [20] also estimate the period of the prr5 mutant to range from 23.1 h ± 0.1 s.e. to 23.9 h ± 0.2 s.e. for the different markers. In the model, the NI component is meant to be the proxy for PRR5. The model ni mutant period is 24.3281 h and so is within the estimated ranges. The period constraint for the ni mutant is calculated as explained above and is scaled by the larger s.e. (i.e. 0.2). When compared with the other model constraints, the scaled ni period constraint has the highest constraint value and ranks as the most influential constraint. We thus conclude that, from this point of view, both knockout experiments have significant value.

A20 knockdown for NF-κB
A similar approach can be used with the NF-κB model. As an example, we test whether the predictive constraint on the period of the A20-knockdown mutant model under constant TNFα adds value compared with the other constraints. The A20 mutant model is simulated by halving the transcription of A20. While the WT model under constant TNFα has a period of 93.95 min (calculated as the average distance between successive peaks from the third to the last (in this case, sixth) peak), the A20 mutant has a shorter period of 85.07 min (calculated in the same way from the third to the last (in this case, seventh) peak). The period constraint is calculated and the entry relevant to A20 translation in the constraint is set to 0 (as this rate is not allowed to change in the mutant). The A20 predictive constraint ranks 12th in order of decreasing constraint value. Its value is far lower than the top constraint value, approximately 0.0031% of it. Our model prediction is that this constraint does not add much value to the model, given that the other constraints have been applied.
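The period estimate used above, the average distance between successive peaks from the third peak to the last, can be sketched as follows. This is a minimal illustration that assumes a uniformly sampled, noise-free time series; the function name `period_from_peaks` and the damped test signal are hypothetical and are not output from the NF-κB model.

```python
import numpy as np

def period_from_peaks(t, x, first_peak=3):
    """Mean spacing between successive local maxima, from the `first_peak`-th peak to the last."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    # Interior samples strictly above the left neighbour and not below the right one
    is_peak = (x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:])
    peak_times = t[1:-1][is_peak]
    used = peak_times[first_peak - 1:]        # discard the early, transient peaks
    if used.size < 2:
        raise ValueError("need at least two peaks from the chosen starting peak")
    return float(np.mean(np.diff(used)))

# Hypothetical damped oscillation with a 90 min period, sampled every 0.5 min
t_demo = np.arange(0.0, 700.0, 0.5)
x_demo = np.exp(-t_demo / 400.0) * np.cos(2.0 * np.pi * t_demo / 90.0)
period_demo = period_from_peaks(t_demo, x_demo)
print(f"estimated period: {period_demo:.2f} min")
```

Starting from the third peak is a simple way to skip the initial transient; for noisy traces, the peak detection would need smoothing or a prominence threshold, which this sketch omits.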

Discussion
There is a huge literature on fitting ODE systems to data, too extensive to list here. A key reference is [21], which initiated one of the main lines of enquiry in this area, and [22] is a recent example of this with a good reference list. Methods using stochastic simulation such as MCMC, a Bayesian approach and/or hierarchical models have also been increasingly used, and [23,24] are examples of this. These methods generally employ a single likelihood or likelihood-like objective function, as opposed to our approach, which considers the optimization problem in terms of a set of many individual constraints. Moreover, they have so far only been applied to relatively small systems. Similar questions are also being actively pursued for fully fledged stochastic models and this is currently a very active area of research [25-31]. A link between these two approaches and a possible way to move to bigger systems is given by approximate Bayesian computation (ABC) methods, and the ideas in this paper may aid the move to larger systems by helping to construct good likelihoods and enabling better understanding of the shape of likelihood and optimization functions.
The examples that we discuss show that our approach gives a substantial amount of valuable information on the value of constraints, information that is very difficult, if not impossible, to obtain by intuition. They show that for these large state-of-the-art models, only a fraction of the constraints have non-negligible constraint values and they identify which of the constraints are valuable. This knowledge is extremely useful when fitting models and allows for a more rational approach. The examples given also demonstrate that this approach can be successfully applied to both quantitative and qualitative constraints.
We have demonstrated the non-intuitive fact that one should expect many constraints to have a small constraint value and consequently to be ineffective. We characterized what can be learned from this approach in terms of understanding the geometry of the optimization problem, the design of optimization functions and artificial likelihoods, and experimental optimization. One can also use this theory to give useful information on how to optimize a non-optimal system using both deterministic and stochastic approaches. For example, when using deterministic gradient-following methods, it is well known [32] that a common problem is that algorithms of the successive line minimization type are ineffective when the level surfaces of the constraints or optimization function have an ellipsoidal structure with an extreme aspect ratio of the sort we find. Our results suggest methods for choosing the move direction. Moreover, the most effective methods for moving to an optimum use Newton's method, and this relies on inverting the derivative of the constraint map. As this is our matrix $M(c_1, \ldots, c_m)$, its smallest singular value tells us how well controlled the Newton algorithm will be. Finally, statistical optimization methods can use an artificial likelihood of the type we have analysed.
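The role of the smallest singular value in a Newton-type update can be illustrated with a short sketch. It assumes the update is the minimum-norm solution of $M\,\delta k = -r$, where $r$ is the vector of current constraint residuals, computed with a pseudo-inverse; the matrices `M_good` and `M_bad` are synthetic examples, not taken from any of the models discussed.

```python
import numpy as np

def newton_step(M, residual):
    """Minimum-norm step dk solving M dk = -residual in the least-squares sense."""
    return -np.linalg.pinv(M) @ residual

rng = np.random.default_rng(2)
M_good = rng.standard_normal((3, 6))          # well-conditioned constraint derivative
U, sig, Vt = np.linalg.svd(M_good, full_matrices=False)
sig_bad = sig.copy()
sig_bad[-1] = 1e-6                            # force one tiny singular value
M_bad = (U * sig_bad) @ Vt                    # same singular vectors, poorly conditioned

residual = rng.standard_normal(3)
step_good = newton_step(M_good, residual)
step_bad = newton_step(M_bad, residual)

# The step length is bounded by ||residual|| / sigma_min, so a tiny sigma_min
# permits very large and poorly controlled parameter moves.
print("step norm, well conditioned:  ", np.linalg.norm(step_good))
print("step norm, poorly conditioned:", np.linalg.norm(step_bad))
```

In practice such steps are usually damped or regularized, which is a separate design choice not covered by this sketch.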
The fact that in a typical model only a few constraints will have a significant value leads to an interesting new concept of a tight model. Suppose that we have an ordered set of constraints $C_1, \ldots, C_m$ and that $C_{r+1}, \ldots, C_m$ have very small constraint values. Furthermore, suppose that we have tuned the parameters so that $C_1, \ldots, C_r$ are satisfied. If we then demand that any further parameter changes must not change $C_1, \ldots, C_r$, it will be extremely difficult to tune $C_{r+1}, \ldots, C_m$ because of their very small constraint values. Therefore, if $C_{r+1}, \ldots, C_m$ are quantitatively correct, this can be interpreted as suggesting that the structure of the model is correct. If the correctness of these small-value constraints has not been artificially determined, then it is reasonable to define such a model as tight in the sense that a large number of constraints take the correct value even though only a proportion of them can be tuned by adjusting parameters.
If systems biologists are to reliably use complex models to provide robust understanding, it is crucial that there are analytical tools to enable a rigorous assessment of the quality and selection of these models and their fit to current biological knowledge and data. Our aim in this paper is to contribute to that.