Automated sequence-level analysis of kinetics and thermodynamics for domain-level DNA strand-displacement systems

As an engineering material, DNA is well suited for the construction of biochemical circuits and systems, because it is simple enough that its interactions can be rationally designed using Watson–Crick base pairing rules, yet the design space is remarkably rich. When designing DNA systems, this simplicity permits using functional sections of each strand, called domains, without considering particular nucleotide sequences. However, the actual sequences used may have interactions not predicted at the domain-level abstraction, and new rigorous analysis techniques are needed to determine the extent to which the chosen sequences conform to the system’s domain-level description. We have developed a computational method for verifying sequence-level systems by identifying discrepancies between the domain-level and sequence-level behaviour. This method takes a DNA system, as specified using the domain-level tool Peppercorn, and analyses data from the stochastic sequence-level simulator Multistrand and sequence-level thermodynamic analysis tool NUPACK to estimate important aspects of the system, such as reaction rate constants and secondary structure formation. These techniques, implemented as the Python package KinDA, will allow researchers to predict the kinetic and thermodynamic behaviour of domain-level systems after sequence assignment, as well as to detect violations of the intended behaviour.

We use a uniform prior distribution on p i , so P(p i ) = 1 for all p i . Thus, we have, where E[p i |N, N i ] is the expectation of p i given the observed values of N and N i and P(p i |N, N i ) is the probability mass function of p i given the same. We can compute P(p i |N, N i ) exactly using Bayes's law and combinatorial techniques. Bayes's law states that P(p i |N, N i ) = P(N, N i |p i ) P(p i ) P(N, N i ) where P(p i ) = 1 and P(N, N i ) = ∫ 1 0 P(N, N i |p i )dp i . We know that where B(α, β) = ∫ 1 0 t α−1 (1 − t) β−1 dt is the beta function. Thus, we have .

A.2 Conformation probability error
As a measure of the confidence in the estimate for p i , we compute the standard deviation of p i given the observed data N and N i . We first compute the a posteriori variance of p i to be: where we again use the fact that B(α, β) = (α−1)!(β−1)! (α+β−1)! when α and β are positive integers. The standard deviation is the square root of the variance, which is given bŷ

A.3 k 1 estimate for bimolecular reactions
For the first-step model reaction we estimate k i 1 using Bayesian inference on the observed simulated sequence-level reaction trajectories. Specifically, assume we observe N Multistrand trajectories, each of which begins with two conformations of A and B sampled from the Boltzmann distribution of secondary structures. Each trajectory is characterized by an indicator variable S n i , which is 1 if and only if the final product multiset P n equals P i , and its collision rate constant k n coll , which is computed as the net rate of forming any initial base pair between A and B in the first step of the simulation. For simplicity in this and the following sections, in the context of estimating k i 1 and k i 2 , we refer to trajectories in which S n i = 1 as successful trajectories, and all others as unsuccessful.
The rate constant k i 1 represents the net rate with which A and B will collide in a CME trajectory leading to P i , so that is the probability of simulating a successful FSM trajectory and k coll,i is the average value of k n coll over successful FSM trajectories. The estimatork i 1 is defined aŝ where P = (P 1 , P 2 , . . . , P N ) and k coll are the vectors of product multisets P n and collision rates k n coll , respectively, for each of the N observed trajectories. Note that in contrast to E [S n i ], which is an expectation taken over a particular distribution of trajectories, p i and k coll,i are underlying parameters of the system and do not vary between trajectories. E [p i k coll,i ] refers to an expectation taken over the space of possible (p i , k coll,i ), which is [0, 1] × [0, ∞). This expectation is only well-defined with priors for each random variable, described below.
To make the algebra that follows more tractable, we make the simplifying assumption that p i and k coll,i are independent random variables, so that It remains to compute the conditional expectations E[p i |P, k coll ] and E[k coll,i |P, k coll ]. Observe that p i is a random variable of the same form as a conformation probability (see Section A.1). Thus, making a similar assumption of a uniform prior on p i over [0, 1], we can follow identical steps to estimate the expectation as: where N i is the number of trajectories for which S n i = 1. To compute the expectation on k coll,i , we first assume that the individual k n coll values for each FSM trajectory are sampled from an exponential distribution with mean k coll,i . This is justified by the fact that the exponential distribution maximizes informational entropy, so that choosing this distribution assumes the least amount of prior knowledge about the k n coll . In addition, this assumption makes the following math tractable.
By definition, we have, where the second equality is due to Bayes's law and P(k coll,i ) is the prior probability distribution on k coll,i . To compute the value of P(P, k coll |k coll,i ), first observe that all trajectories are independent, so that P(P, k coll |k coll,i ) = ∏ 1≤n≤N P(k n coll |k coll,i ).
Because k coll,i gives no information about unsuccessful trajectories, any terms due to these trajectories will be canceled out by the normalization term, and we may drop these terms. For successful trajectories, our assumption regarding the distribution of k n coll (above) gives us: Substituting into the equation for the expectation of k coll,i yields To allow this expression to be evaluated analytically for all nonnegative N i , we use the prior distribution Although this is an improper prior over [0, ∞), the form of the integrand allows both the numerator and denominator of E[k coll,i |P, k coll ] to converge. First, considering the numerator, Similarly, the denominator can be simplified to

A.4 k 1 error estimate for bimolecular reactions
As with conformation probabilities, the spread in k 1 given the observed trajectories can be computed as the standard deviation of the posterior distribution of k 1 . That is, using again the fact that ∫ ∞ 0 x n e −x dx = n! for integral nonnegative n. The width of the posterior distribution is measured by the standard deviation, or square root of the variance. So we estimate the error of the estimate for k 1 to be: .
Note that in the edge case with N i = 0 and N > 0 we estimate an upper bound on k 1 as: and report error bounds as infinite.

A.5 k 2 error estimate for bimolecular reactions
Using Equation (6) of the main text, it is apparent that the unimolecular reaction rate k i 2 is computed as the weighted average of the reaction times, where weights are derived as the k n coll corresponding to each simulated reaction time τ n 2 . We first derive the variance of the mean reaction time τ 2,i for the i th first-step model reaction, from which the variance of k i 2 may be computed. As previously noted, we estimate τ 2,i as the weighted average: for each simulated trajectory.
We assume that all reaction times are drawn from a distribution with mean µ τ n 2 and variance σ 2 τ n 2 . The variance of the individual τ n 2 differs from the variance of their weighted sum, σ 2 τ2,i . Assuming fixed weights w n i , this is given by Thus, we can estimate the variance of the weighted sum from the variance from which the individual reaction times are drawn. Note that since w n i is actually a random variable, our estimates are neglecting this source of variability.
To estimate σ 2 τ n 2 , the variance of the distribution for individual reaction times, we first propose the estimator:σ This estimator is biased, and the bias can be quantified by computing the expectation ofσ 2 τ n 2 , again treating w n i as fixed: E where the summation is over only successful trajectories (i.e. for which S n i = 1). All summations in the remainder of this derivation are similarly performed over only successful trajectories.
Expanding the definition ofτ 2,i yields

Substituting into the expression for E
and, simplifying, E Thus, an unbiased estimator for σ 2 So we can estimate the standard deviation of τ 2,i witĥ and, using the fact that k i 2 = 1 τ2,i , we make a linear approximation for the relationship between k i 2 and τ 2,i , and propagate that to the variance:σ We can further simplify this expression, by letting This form of the expression shows the inverse square-root relationship with respect to the number of samples, which is characteristic of standard errors of the mean over multiple independent samples. Unfortunately, a few of our reported plots and calculations used a previous, incorrect, version of this estimate forσ k i 2 , but the deviations were less than 5% in cases that we tested. Finally, we note that this calculation does not account for the additional uncertainty due to variation in k n coll . Future work may aim to develop a more rigorous estimate that accounts for this source of variation.

B Case Study: Entropy-driven Catalyst [?]
The full system is given in Figure 6A (main text). In addition to the productive reactions shown there, unproductive reactions were included between every pair of resting macrostates for a total of 31 nonspurious reactions (3 productive, 28 unproductive).
Sequences are taken from [?],  A single spurious reaction was observed between Substrate and Fuel under these parameters. We expect this spurious reaction to occur via 0-toehold strand displacement. This spurious reaction becomes more prominent at higher temperatures.  Table S3. System half-completion times for varying initial catalyst C concentrations. Half-completion times estimated by simulating the mass-action ODEs for the rates in Table S2 and finding the time at which [OB]=5 nM. Initial concentrations of S and F are 10 nM and 13 nM for all simulations. Half-completion times were also estimated using the published reaction rates in Zhang et al. [?]. KinDA overestimates the reaction rates so all half times are much lower than those calculated with the published rates, by a factor of about 4-6.  Table S4. Probabilities of p-approximations for each resting complex, with p = 0.7. The Samples column shows number of sampled conformations that p-approximate the resting complex (or are p-spurious) over the total number of sampled conformations.

C Case Study: Multiple Desired Pathways
The full system is given in Figure 8 (main text). The sequences were randomly generated from a fourletter alphabet with an equal probability of each base, with the exception of toeholds s w and s s , which were chosen to be weak and strong toeholds, respectively.
For both the unmodified (with s = s w ) and modified (with s = s s ) systems, the reactions A + B → C + D and A + B → E + F were analyzed. Multistrand simulations were run with the same parameters as the entropy-driven catalyst (Appendix B).

Reactants
Products   (2017) [?] case study. All simulations were run in ordered-complex mode, except for the unexpected reaction, S2 + R → S2-R, which has only one product and was run in count-by-domain mode. The Trajectories column shows the number of trajectories corresponding to each reaction over the total number of trajectories simulated for the reactants. Simulation parameters were the same as for the entropy-driven catalyst (Appendix B).  Table S15. Reaction rate estimates for the AND gate (mode: ordered-complex). The Trajectories column shows the number of trajectories corresponding to each reaction over the total number of trajectories simulated for the reactants.

Reactants
Products  Table S17. Reaction rate estimates for the AND gate (mode: count-by-domain). The Trajectories column shows the number of trajectories corresponding to each reaction over the total number of trajectories simulated for the reactants.