Philosophical Transactions of the Royal Society B: Biological Sciences
You have accessResearch article

Grouping substitution types into different relaxed molecular clocks

Hui-Jie Lee

Hui-Jie Lee

Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA

[email protected]

Google Scholar

Find this author on PubMed

,
Hirohisa Kishino

Hirohisa Kishino

Laboratory of Biometrics and Bioinformatics, University of Tokyo, Tokyo, Japan

Google Scholar

Find this author on PubMed

,
Nicolas Rodrigue

Nicolas Rodrigue

Department of Biology, Institute of Biochemistry, and School of Mathematics and Statistics, Carleton University, Ottawa, Ontario, Canada

Google Scholar

Find this author on PubMed

and
Jeffrey L. Thorne

Jeffrey L. Thorne

Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA

Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA

Google Scholar

Find this author on PubMed

    Abstract

    Different types of nucleotide substitutions experience different patterns of rate change over time. We propose clustering context-dependent (or context-independent) nucleotide substitution types according to how their rates change and then using the grouping for divergence time estimation. With our models, relative rates among types that are in the same group are fixed, whereas absolute rates of the types within a group change over time according to a shared relaxed molecular clock. We illustrate our procedure by analysing a 0.15 Mb intergenic region to infer divergence times relating eight primates. The different groupings of substitution types that we explore have little effect on the posterior means of divergence times, but the widths of the credibility intervals decrease as the number of groups increases.

    This article is part of the themed issue ‘Dating species divergences using rocks and clocks’.

    1. Introduction

    Divergence time estimation methods that accommodate rate variation over time are diverse [1–4] and are now referred to as relaxed clock procedures. More attention has been devoted to rate variation over time than to the fact that not all kinds of nucleotide substitutions experience the same pattern of rate variation. Most notably, the rate in mammals at which the CG dinucleotide changes to TG (conventionally referred to as CpG transitions) is high and, when compared with other substitution types, relatively constant [5–7]. The approximately clock-like behaviour of CpG transitions is hypothesized to stem from a spontaneous mutation mechanism that can operate during the majority of the mammalian lifecycle [6]. Presumably because of this mechanism, CpG transitions accumulate in proportion to the amount of elapsed chronological time. In contrast, other types of point mutations might be more strongly associated with meiosis and therefore be likely to occur in a relatively small period during each generation. Even if the rate per generation of these meiotic-dependent mutations is constant, the mutation rate per year will change as generation length changes.

    To reflect variation among substitution types in how rates change, one possibility is each kind of nucleotide substitution having its own independent relaxed clock. A shortcoming of this is that some kinds of mutation will share mechanisms or environmental influences and therefore tend to have highly correlated patterns of rate change. In an earlier work [8], we analysed primate sequence data and allowed each of nine substitution types to have its own independent relaxed clock. We concluded that some of the credibility intervals for divergence times were overly narrow, possibly because of failure to account for correlated changes in evolutionary rate among substitution types.

    Here, we modify our divergence time procedure in the hope that a more realistic treatment of relaxed clocks will emerge. The modification assumes that each type of substitution belongs to a group. A group may consist of only a single substitution type or it may encompass many. Substitution types within a group all change rate in the same way. Rates of substitution per year can vary over time, and rates of substitution types within the same group can differ, but relative rates of substitution for types belonging to the same group are assumed to be invariant. This means that the number of independent relaxed clocks is equal to the number of groups rather than the number of substitution types. Our task is to assign predefined substitution types to groups and then use a selected mapping of types to groups for divergence time estimation.

    2. Methods

    Consider K substitution types where each type k (k = 1, 2 , … , K) specifies the starting nucleotide, the context of the starting nucleotide and also the nucleotide that is introduced owing to substitution. For example, if CpG context is considered for a strand-symmetric model, then there are K = 9 substitution types (table 1). While strand symmetry is not necessary for our methods, we make this simplifying assumption about the substitution process here, because our prime motivation is to capture the effects of CpG context, and a CpG dinucleotide in one DNA strand is also a CpG dinucleotide in the complementary strand.

    Table 1.The nine context-dependent CpG substitution types. Complementary substitutions are considered as the same type.

    type k substitutions
    1 non-CpG G → C and C → G
    2 non-CpG G → T and C → A
    3 non-CpG T → A and A → T
    4 non-CpG T → G and A → C
    5 non-CpG G → A and C → T
    6 non-CpG A → G and T → C
    7 CpG G → C and C → G
    8 CpG G → T and C → A
    9 CpG G → A and C → T

    Let g(k) be the group number of substitution type k with Inline Formula where NG is the number of groups and NG ≤ K. The number of types in group g will be Sg. This means

    Display Formula
    We cluster types so that each group has its own independent relaxed clock. Within a group, the relative rates of substitution types are fixed over time. Therefore, if the rate for the group changes by a factor of 2, then all substitution types within the group change by that factor.

    With sequence data alone, the product of nucleotide substitution rate and time can be estimated, but rates and times cannot be disentangled. We refer to the product of time and rate of a type of nucleotide substitution on a branch as a ‘substitution length’ [8]. Rather than employing notation for the time duration of each branch, the confounding of rate and time makes it convenient to introduce the estimation of substitution lengths by (temporarily) treating the time duration of each branch to be 1, so that knowledge of the substitution length yields the substitution rate.

    We write the substitution length of type k on branch j as Inline Formula where ck is the relative rate of substitution type k and Inline Formula is the substitution rate per site on branch j for group g(k). We arbitrarily set the relative rate of one substitution type per group to 1, and thus, there are Inline Formula free parameters for relative rates. Suppose there are B branches on a rooted phylogeny. With the most general case where the K substitution rates evolve independently, there are K × B parameters for substitution lengths. If substitution types are clustered into NG < K groups, then there are Inline Formula parameters for substitution lengths.

    (a) Augmented data likelihood

    The complete history of which substitutions occurred at which times on a phylogeny is typically not directly observed, and instead, only the endpoints of evolution (i.e. the sequences at the tips of a phylogeny) are typically available. However, inferences about the substitution process are often more straightforward when the complete history is directly observed. We therefore introduce our approach for the situation where the history is directly observed. Later, we note that the observed sequence information at the tips (endpoints) of a phylogeny can be augmented to yield a complete history, and we explain our procedure for treating the uncertainty associated with this augmentation.

    We start by considering a single branch j on the rooted phylogeny. Assume the full substitution history on branch j is known, and we are interested in the substitution length of type k on this branch. Motivated by evidence that frequencies of CpG sites have decreased over time in some evolutionary lineages [9], we do not assume stationarity of the substitution process. To represent all the Inline Formula parameters for branch j and the values of all relative rates ck, μj will be used.

    Each of the K = 9 substitution types in table 1 specifies a beginning nucleotide and context. Let b(k) be the beginning nucleotide and context of type k. For instance, b(1) is a non-CpG site with starting nucleotide G or C. The three possible b(k) if CpG context is considered are: non-CpG G or C; non-CpG A or T and CpG C or G. Let Inline Formula be the proportion of time site i has context b(k) on branch j. We will refer to the Inline Formula as dwell proportions. Let nkij be the number of type k changes at site i on branch j. With substitution counts denoted Inline Formula and the summed dwell proportions denoted Inline Formula the log-likelihood for an observed history on j is

    Display Formula
    2.1
    where nj, Ï•j and sj0 are respectively vectors of the substitution counts nkj, the dwell proportions Inline Formula and the initial states sij0.

    Suppose we are interested in the substitution length of a type k* that is in group g(k*). Let G* be the set of substitution types that are in group g(k*). By setting the first derivative of the relevant log-likelihood equal to zero, the maximum-likelihood estimates of Inline Formula are

    Display Formula
    2.2
    Because ck is constant among branches, we need to gather relevant information for ck from all branches to solve for its maximum-likelihood estimate. Because the root is assumed to provide no information about substitution lengths, the log-likelihood for the completely observed history is proportional to
    Display Formula
    2.3
    NG of the cks will be constrained to 1 (i.e. one per group). For the other cks in each group, the maximum-likelihood estimate is
    Display Formula
    2.4

    (b) Parameter estimation

    (i) Unweighted procedure

    For augmented data, equations (2.2) and (2.4) show that maximum-likelihood estimates Inline Formula can be obtained when the ck are known and maximum-likelihood estimates Inline Formula can be obtained when the Inline Formula are known. This suggests parameter estimation via successive substitution. Starting with initial guesses of the ck, equation (2.2) can be applied to estimate the Inline Formula. Then, the Inline Formula estimates substituted into equation (2.4) generate new values of the ck. This cycle of successive substitution can be continued until the values of both kinds of parameters converge.

    The sufficient statistics in equations (2.2) and (2.4) are the dwell times ϕb(k)j and the substitution counts nkj. Typically, neither is directly observed. Lee et al. [8] estimated these sufficient statistics by sampling endpoint-conditioned histories using context-independent substitution models and assuming these histories well approximated a sample of endpoint-conditioned histories from a context-dependent model. Simulation suggested that this approach was reasonable for the primate data analysed by Lee et al. [8]. If we assume that endpoint-conditioned context-independent histories are a good approximation of context-dependent histories, then the sufficient statistics ϕb(k)j and nkj can be approximated by their sample means among histories. These sample histories can then be used in successive substitution in equations (2.2) and (2.4) to yield parameter estimates. Alternatively, the sufficient statistics from an individual sequence history together with successive substitution can yield substitution length estimates based only on that history. Then, the sample mean among histories can be reported. This latter procedure also facilitates approximation of the variances and covariances of substitution length estimates [8] and is what we refer to here as the ‘unweighted procedure’.

    (ii) Importance sampling procedure

    To account for the possibility that endpoint-conditioned context-independent histories do not well approximate context-dependent ones, an importance sampling procedure could be employed along with an expectation-maximization (EM) algorithm. First, we outline the EM algorithm.

    Let X be the DNA sequences observed at the tips of a tree with known rooted topology and Z be the (unobserved) substitution history that specifies exactly which sequence changes occurred at which times on the phylogeny. The sequence at the root will also be specified by Z. Because X represents a subset of the information provided by Z, the complete data are (X, Z) = Z. Because the complete-data log-likelihood in equation (2.3) is in the form of an exponential family, an EM algorithm for parameter estimation could have the form

    • (1) E-step: estimate the sufficient statistics T = T(X, Z) by Inline Formula where Inline Formula represents values of the parameters (μ) at the rth iteration and T(r) represents estimated sufficient statistics at the rth iteration.

    • (2) M-step: solve for new Inline Formula by plugging T(r) into equations (2.2) and (2.4) via successive substitution.

    We do not have analytical formula to compute expectations of the sufficient statistics in the E-step. Instead, we can approximate the expectations by averaging over a large number of sampled histories. To reduce the computational cost in this Monte Carlo EM (MCEM) algorithm, we can recycle the same sampled histories for each MCEM iteration via importance sampling (see [10]) by drawing a sample of approximately independent histories from a context-independent substitution model. At each iteration of the EM algorithm, instead of obtaining a new sample of histories using the most recent estimates of parameter values (substitution lengths), we apply importance weights to histories in the original context-independent sample.

    Let Z(c) be the cth substitution history (of C total) sampled from Pr(Z|X, μ*), where μ* are the parameters from the context-independent model. Let μ be all parameters in the context-dependent model of interest. The EM algorithm iterates between approximation of the expected complete-data likelihood

    Display Formula
    2.5
    and maximization of the Inline Formula over μ. With importance sampling, the E-step becomes
    Display Formula
    2.6
    with Inline Formula To calculate wc, the times of all branches are set to 1 for the probability densities in both the numerator and denominator. This means the rates on branches are adjusted in order to specify the substitution lengths in the numerator and the branch lengths in the denominator.

    With this MCEM approach, the variance–covariance matrix of estimates can be approximated by approximating the inverse observed information. Although we omit the details here, this can be done by adapting the procedure of Wei & Tanner [11] to incorporate the wc values.

    An overview of the unweighted and importance sampling procedures for parameter estimation is given in figure 1.

    Figure 1.

    Figure 1. An outline of the key steps in the unweighted and importance sampling procedures for parameter estimation.

    (c) Approximating Akaike information criterion in the presence of incomplete data

    For a given grouping of substitution types, MLEs can be obtained with the MCEM algorithm that we have described. To select among the alternative groupings, we can try to find the one with the minimum Akaike information criterion (AIC)

    Display Formula
    2.7
    where d is the number of parameters in the grouping [12]. We can approximate the observed data likelihood,
    Display Formula
    2.8
    Suppose we estimate Inline Formula and Inline Formula under two different models, model 1 and model 2, using the MCEM algorithm. Then we can evaluate these models by approximating the AIC difference
    Display Formula
    2.9
    where d1 and d2 are respectively the numbers of parameters in models 1 and 2.

    3. Primate data and analysis details

    Lee et al. [8] employ multidivtime software (see [13]) and the rate evolution model of Kishino et al. [3] to estimate divergence times. Lee et al. [8] give each substitution its own independent relaxed clock, whereas multidivtime was originally developed so that each gene in a multigene dataset could have its own independent relaxed clock. Our extension also employs multidivtime, but has each group rather than each gene or each substitution type with its own independent relaxed clock.

    For the sake of comparison with Lee et al. [8], we analyse the same data as did that study. Specifically, we use an approximately 0.15 Mb subset of data from Kim et al. [6] that corresponds to an intergenic part of human chromosome 16 with CpG islands masked by Repeatmasker [14]. The data represent nine primates related by a well-established topology that was treated as known (figure 2). One of the nine primates (bushbaby) is the outgroup and was only used in our analyses to root the ingroup.

    Figure 2.

    Figure 2. Phylogeny for primate species of this analysis. The outgroup species (bushbaby) roots the eight ingroup taxa. Nodes are numbered from 0 to 14. The depicted divergence times are obtained from TimeTree [15,16], and are shown by the time line with time units of 1 million years. (Online version in colour.)

    To sample a sequence history (i.e. a substitution history on the tree for each site in the sequence), the context-independent GTR substitution model [17,18] and the unrooted version of the tree topology in figure 2 was used. Maximum-likelihood estimates of the GTR rates and branch lengths were obtained by PAMLX software [19]. Fixing the GTR rates and branch lengths at these estimates, a slightly modified version of the PhyloBayes-MPI software [20] was employed to generate 400 independently sampled endpoint-conditioned sequence histories. The endpoint-conditioned procedure implemented in PhyloBayes-MPI is the stochastic mapping procedure of Nielsen [21] with the exception that uniformization for a sequence site is employed if many consecutive attempts of Nielsen's procedure fail to match the observed endpoints for the site (see [22]).

    The PhyloBayes histories are sampled so as to include the history on the path connecting the ingroup root to the outgroup. Because multidivtime does not use them, we do not attempt to estimate substitution lengths on this path or to carefully model it. Even when accounting for context-dependence on the rest of the tree, we assume that this path to the outgroup evolves via the GTR model when applying the importance sampling procedure or the AIC approximation. This assumption simplifies importance sampling because it means that terms pertaining to the path to the outgroup are identical in the numerator and denominator of the ratio that defines wc (see equation (2.6)) and therefore these terms cancel.

    Here, all 400 endpoint-conditioned sequence histories were sampled with the GTR model, homogeneity of rates among sites, and fixed GTR and branch length parameters. Adding gamma-distributed rate heterogeneity and other context-independent improvements to the substitution model has the potential to make paths sampled by a context-independent model better approximate the endpoint-conditioned distribution from a context-dependent model. To examine whether incorporating enhancements to the context-independent model would impact divergence time estimates, we also applied the ‘unweighted procedure’ to obtain substitution lengths from the 312 context-independent sequence paths that were sampled by Lee et al. [8]. These 312 paths were generated from the primate dataset with the CAT–GTR model and a four-category discretized gamma treatment of rate heterogeneity (see [8]).

    For the divergence time analyses, priors are selected to correspond to those of Lee et al. [8] when possible. We tightly constrain the age of the ingroup root (node 14 in figure 2) to have a gamma prior distribution with mean 44.2 million and standard deviation 0.1 million years. This mean age is the value suggested by the TimeTree database [15,16]. A more elaborate analysis would carefully incorporate all pertinent fossil evidence, but we instead constrain node 14, so that the results here can be more easily compared with those of Lee et al. [8]. Again following Lee et al. [8], each group of substitution types has a lognormal relaxed clock with an autocorrelation parameter v that has a gamma prior with mean and standard deviation both 0.023 (the inverse of the prior mean of the root time). Lee et al. [8] used a gamma distribution with mean and standard deviation both 0.0012 substitutions per million years as the prior distribution for the substitution rate at the ingroup root. We use the same prior for the Inline Formula values at the ingroup root. Each multidivtime analysis was run 2 × 107 proposal cycles with 50% of iterations treated as burn-in, and each set of conditions was run twice in order to check convergence.

    We decided to explore five of the possible groupings of the K = 9 substitution types that are listed in table 1. These are (i) each of the types in its own group (the ‘nine-groups’ grouping); (ii) all types in the same group (the ‘one-group’ grouping); (iii) the ‘four-groups’ grouping that separates transitions from transversion and also separates CpG and non-CpG sites, so that the four groups place the substitution types from table 1 into subsets {1,2,3,4}, {5,6}, {7,8} and {9}; (iv) the ‘three-groups’ grouping that differs from ‘four-groups’, because all transversions are in the same subset {1,2,3,4,7,8}; and (v) the ‘two-groups’ grouping that has CpG transitions (i.e. type 9) in one group and the eight other substitution types in the only other group. For each group in each of the five possible groupings that were explored, the substitution type corresponding to the lowest number in table 1 had its ck value constrained to 1.

    4. Results and discussion

    Each of the 400 sequence histories can be separately used to infer substitution lengths. When doing this, we find little variation among estimates and also observe that the history that achieves the maximum of the 400 wc values does not tend to yield estimates that are outliers (figure 3). While histories at individual sequence sites may suggest substantially different substitution lengths among the 400 sampled site histories, the primate sequence alignments are long, so that the mean effect per site in a sequence history is likely to vary little among sequence histories.

    Figure 3.

    Figure 3. Distribution among 400 histories of substitution length estimates for the nine-groups grouping. Results from substitution types 1 and 9 are depicted and are representative of patterns observed for other substitution types. Branches are labelled according to the index of the node in figure 2 that ends the branch. The circle overlaying each boxplot indicates the substitution length of the history with the maximum wc value. (a) Type 1 changes. (b) Type 9 changes. (Online version in colour.)

    Whereas substitution length estimates reflect the average behaviour of sites in a sequence history, the wc values reflect the cumulative behaviour of sites. The variance of an average of random variables decreases as sample size increases, but the variance of a product can increase as sample size increases. We think this cumulative effect is why we observe that one of the 400 wc values tends to greatly exceed all others. For example, after the EM algorithm for the nine-groups grouping converged, the ratio of the second biggest of the 400 wc values to the biggest was about 0.003.

    As expected, reducing the alignment length to 104 or 103 columns resulted in less disparity between the maximum wc value and the others. However, even with 104 or 103 columns, the maximum tended to be much bigger than all others. With the reduced alignment lengths, different regions achieved maximum wc values with different of the 400 histories. These observations indicate that our approach of weighting sequence histories does not approximate well the distribution of histories that would be obtained if all were directly sampled with a context-dependent model via Markov chain Monte Carlo or some other algorithm.

    While these wc results demonstrate that differences between context-dependent and context-independent history samples are easy to detect, they do not demonstrate that substantive problems will arise when using a sample of context-independent histories to infer divergence times with a context-dependent model. Although the context-independent model for sampling histories was different in the earlier study, the simulations of Lee et al. [8] suggest the difference in distributions of histories between a context-independent and a context-dependent model is not a major issue with the primate dataset. Still, the effect of using context-independent histories for context-dependent divergence time inference will be dataset-dependent. A better and more general procedure is needed.

    Because of the extreme wc values, the importance sampling procedure for estimating divergence times was not pursued for the primate data. In addition, failure to obtain a more satisfactory distribution of wc values means that the outlined AIC procedure for approximating AIC differences is unreliable. This finding is reminiscent of the instability of the harmonic mean estimator for approximating Bayes factors (see [23]). The outlined procedure may be effective when histories are sampled from a probability density that is closer to the models being evaluated. For example, the procedure might succeed when histories are sampled according to one grouping of substitution types and that grouping is being compared with another.

    We inferred divergence times for the five different groupings (one-group, two-groups, three-groups, four-groups and nine-groups) with the unweighted procedure. Table 2 summarizes divergence time posteriors that are based on the sample of 400 sequence histories according to the GTR model with rate homogeneity among sites. In table 2, credibility intervals become narrower as the number of relaxed clocks (i.e. groups) increases. This coincides with the pattern discussed by Zhu et al. [24] about credibility intervals shrinking with multilocus studies as the number of relaxed clocks increases. Our results suggest that attention should be paid to how correlated are patterns of rate change over time among substitution types.

    Table 2.Divergence times with different groupings of substitution types. Ninety-five per cent credibility (and prior) intervals are directly below the posterior (and prior) means. Units are millions of years. The nodes are labelled as in figure 2, and the divergence times in parentheses under the node labels are obtained from TimeTree [15,16]. For each grouping, the first two lines summarize the posterior distributions from the 400 histories sampled according to the GTR model, and the two italicized lines that follow summarize the posterior distributions from the 312 CAT–GTR histories sampled by Lee et al. [8]. Entries for the ingroup root (node 14) are not listed here because prior distributions tightly concentrated this node time to a narrow range.

    node 8 9 10 11 12 13
    time (18.8) (8.8) (11.5) (15.1) (18.8) (29.6)
    prior 22.3 11.0 22.2 11.1 22.1 33.1
    (1.1, 43.1) (0.3, 31.4) (4.2, 40.0) (0.4, 31.5) (4.0, 40.0) (12.5, 43.9)
    one-group 18.0 6.3 10.0 18.3 20.6 30.5
    (16.4, 19.9) (5.2, 8.0) (8.4, 12.4) (16.1, 21.0) (18.3, 23.5) (28.5, 32.6)
    22.8 6.7 10.5 17.4 19.1 30.6
    (18.6, 27.0) (5.7, 7.9) (9.1, 12.1) (15.7, 19.8) (17.4, 21.6) (29.0, 32.4)
    two-groups 18.0 6.6 10.4 18.4 20.8 30.7
    (16.6, 19.5) (5.6, 7.9) (9.0, 12.2) (16.6, 20.3) (18.8, 22.8) (29.1, 32.3)
    21.9 6.7 10.5 17.4 19.2 30.5
    (18.9, 25.0) (5.9, 7.8) (9.3, 11.9) (15.9, 19.3) (17.7, 21.2) (29.1, 32.1)
    three-groups 18.0 6.4 10.2 18.4 20.7 30.7
    (16.9, 19.2) (5.6, 7.4) (9.0, 11.6) (16.9, 20.0) (19.2, 22.4) (29.3, 32.0)
    21.1 6.6 10.4 17.6 19.5 30.5
    (18.6, 24.0) (5.8, 7.6) (9.1, 11.8) (16.0, 19.3) (17.8, 21.2) (29.1, 32.0)
    four-groups 17.8 6.3 10.0 18.2 20.5 30.4
    (16.7, 19.0) (5.5, 7.3) (8.8, 11.3) (16.7, 19.7) (18.9, 22.1) (29.1, 31.8)
    20.6 6.5 10.1 17.5 19.3 30.3
    (18.1, 23.3) (5.6, 7.4) (8.9, 11.4) (15.9, 19.1) (17.7, 21.0) (28.9, 31.8)
    nine-groups 17.8 6.1 9.6 17.8 20.0 30.1
    (16.8, 18.8) (5.5, 6.8) (8.8, 10.6) (16.7, 19.0) (18.8, 21.3) (29.0, 31.1)
    19.7 6.2 9.7 17.5 19.2 30.1
    (18.5, 21.5) (5.6, 6.9) (8.8, 10.6) (16.3, 18.7) (18.0, 20.5) (29.0, 31.2)

    The ck values and inferred substitution lengths are included as the electronic supplementary material. If we compare the ratios of estimated ck values for two substitution types that are in the same group, we find that these ratios are not very sensitive to how the other substitution types are grouped. For example, substitution types 1 and 2 are in the same group for our one-group, two-groups, three-groups and four-groups models. In each of these models, the ratio of the estimated ck for type 2 relative to type 1 is close to 1.4. However, we find that inferred substitution lengths on a branch are less robust to how other substitution types are grouped. As expected, substitution lengths for CpG transitions are substantially larger than substitution lengths for other substitution types.

    Results from the 312 histories of Lee et al. [8] that relied on a comparatively realistic context-independent model again show that widths of credibility intervals shrink as the number of independent relaxed clocks increase. While the posterior means of divergence times from the two sets of samples differ somewhat (table 2), it is difficult to draw a general conclusion from the inferred times about the effects of improving the context-independent substitution model used for sampling. Presumably owing to the rate heterogeneity employed when sampling the 312 histories, we do observe that substitution lengths inferred from the set of 312 histories are larger than those from the set of 400 histories.

    While the focus here has been on grouping substitution types, so that types in each group share an independent relaxed clock, an alternative would have relaxed clocks vary among substitution types, but not necessarily be independent of one another. This sort of model is reminiscent of the work by Lartillot & Delsuc [25], who have been exploring rate change via multivariate Brownian processes on trees (see also [26]). Even more advanced models could have the correlation structure among rates of substitution types change over time.

    Divergence time estimation is one reason why understanding and handling rate variation over time is warranted. To accurately determine divergence times when fossil records are sparse, a satisfactory treatment of molecular evolution is particularly desirable. Beyond divergence time inference, the process by which genomes change is of fundamental biological interest and so is characterizing how this process itself evolves. A better understanding of how different kinds of substitutions change rate over time will facilitate this characterization.

    Data accessibility

    The software developed for this research is freely available at https://github.com/HuiJieLee. The aligned sequence data used in this study are available at Dryad, http://doi.org/10.5061/DRYAD.2TS15.

    Authors' contributions

    H.-J.L. and J.L.T. designed the study and wrote the manuscript with substantial contributions from H.K. and N.R. H.-J.L. and N.R. wrote the computer programs. H.-J. L. performed the analysis with substantial contributions from J.L.T. All authors gave final approval for publication.

    Competing interests

    The authors declare no competing interests.

    Funding

    H.-J.L. and J.L.T. were supported by N.I.H. grants GM090201 and GM070806. J.L.T. was also supported by N.I.H. grant GM118508. H.K. was supported by the Japanese Society for the Promotion of Science Grant Number 25280006. N.R. was supported by the Natural Science and Engineering Research Council of Canada.

    Acknowledgements

    We thank Dr Mario dos Reis and an anonymous reviewer for their help.

    Footnotes

    One contribution of 15 to a discussion meeting issue ‘Dating species divergences using rocks and clocks’.

    Published by the Royal Society. All rights reserved.