On reaction network implementations of neural networks

This paper is concerned with the utilization of deterministically modelled chemical reaction networks for the implementation of (feed-forward) neural networks. We develop a general mathematical framework and prove that the ordinary differential equations (ODEs) associated with certain reaction network implementations of neural networks have desirable properties including (i) existence of unique positive fixed points that are smooth in the parameters of the model (necessary for gradient descent) and (ii) fast convergence to the fixed point regardless of initial condition (necessary for efficient implementation). We do so by first making a connection between neural networks and fixed points for systems of ODEs, and then by constructing reaction networks with the correct associated set of ODEs. We demonstrate the theory by constructing a reaction network that implements a neural network with a smoothed ReLU activation function, though we also demonstrate how to generalize the construction to allow for other activation functions (each with the desirable properties listed previously). As there are multiple types of ‘networks’ used in this paper, we also give a careful introduction to both reaction networks and neural networks, in order to disambiguate the overlapping vocabulary in the two settings and to clearly highlight the role of each network’s properties.


Introduction
There is a growing interest in synthetic chemical reaction networks that carry out some pre-determined task [1][2][3][4][5][6][7][8][9][10][11][12][13]. The field that develops and analyses these networks often goes by the name 'computation with chemical reaction networks'. The tasks being carried out can range from the pedestrian, such as determining the minimum or sum of two numbers, to the complex. The goal of this style of work is not to devise methods that can match or exceed silicon-based computers in terms of speed; instead, it is to develop methods of computation for environments in which silicon-based computers cannot currently go, for instance the cellular environment. A particular type of (complex) computation now found ubiquitously in our daily technology is machine learning via neural networks, and so it is no surprise that there has been recent work on the development of chemical reaction network implementations of neural networks with a fixed set of parameters [8,[14][15][16][17][18][19][20]. More generally, work focused on understanding the connection between biochemical models and the physical mechanisms of information processing stretches back at least through the 1960s [21][22][23][24][25][26][27][28][29].
The papers we are aware of in the literature pertaining to chemical reaction network implementations of neural networks focus on particular constructions. Hence, there is currently little mathematical theory developed that can be utilized in a general manner. (An exception is [8], which develops the necessary theory for chemical Boltzmann machines to be implemented via stochastic models of chemical reaction networks.) Moreover, it is often simulation that is put forth as evidence to demonstrate the validity of a construction as opposed to rigorous proof. Thus, these works are not mathematical in nature. (This should not be taken as a criticism, as these papers were not meant to focus on the mathematics.) The major goal of this work, therefore, is to begin the development of a mathematical framework for the construction of deterministically modelled reaction networks that implement neural networks and machine learning. In particular, the mathematical framework will allow us to prove that the dynamical system associated with the constructed chemical reaction network will (i) implement a given neural network and (ii) have certain desirable properties, briefly outlined below.
Some further details are called for before proceeding. In order to devise deterministically modelled chemical reaction networks that implement neural networks, the following broad strategy may be employed: 1. Fix a neural network with some choice of activation function, $\varphi$, and parameters (biases and weights), $P$. Denote the output values of the neural network via $\Psi(d)$, where $d$ is an input (data). 2. Determine a chemical reaction network $\{\mathcal S, \mathcal C, \mathcal R\}$ for which the associated mass-action ODE system
$$\dot x(t) = f(x(t)), \quad x(0) = d, \tag{1.1}$$
satisfies $\Psi(d) = F(x)$, where $F$ is some functional of the solution, $x$, to (1.1) (note that the solution $x$ depends on $d$, the initial value). In particular, it is natural to take the output to be the limiting steady-state values of some ordered subset of the species, $\Psi(d) = \lim_{t\to\infty} (x_i(t))_{i \in I}$, where $I$ is some index set.
The above is the basic strategy of [14], in which they design a reaction network to learn the XOR function, and of [19]. We note that a different modelling framework is used in [20], in which limiting values are found when certain counts go to zero (and remain there).
The basic strategy outlined above, i.e. using the limiting values of an initial value problem (1.1) to represent the output of a neural network, is quite natural, but it leaves open a number of questions that need to be addressed for a given construction: 1. When will the constructed reaction network admit limiting steady states? 2. Assuming limiting steady-state values exist, under what conditions will they be unique for a given choice of model parameters and for a given initial condition? 3. Assuming there are unique limiting steady states, when will they be smooth in the parameters (which is important for gradient descent and other optimization procedures)? 4. How long will it take the model to converge? In particular, could the time required to determine the output of the system depend strongly on the initial conditions?
We note that these are highly non-trivial questions in the present context as mass-action models of chemical systems are polynomial dynamical systems, and are known to exhibit myriad behaviours including chaotic behaviour [30].
In this article, we develop a mathematical framework that is capable of resolving the questions posed above. Moreover, we use our framework to develop a chemical reaction network implementation of an arbitrarily sized neural network with a smoothed ReLU activation function (see equation (3.2) and figure 4). Using our mathematical framework, we prove that this construction leads to a system that is exponentially reliable (i.e. the output of the system is unique and is smooth with respect to the parameters of the model, and the process converges exponentially fast) and converges from infinity in finite time (so the convergence time is uniformly bounded over all initial conditions). See definitions 4.7 and 5.3 for the precise meaning of these terms.
The applications possible from neural network implementations of chemical reaction networks seem nearly limitless. However, it is the view of these authors that this potential can only be achieved once a solid mathematical foundation is created upon which to build the necessary theory and, eventually, physical implementations, perhaps via DNA strand displacement [9,31,32]. We therefore view this work as a starting point, with follow-up work focused on implementations of neural networks that can perform gradient descent autonomously, allowing us to relax the assumption of a fixed set of parameters, in both supervised and unsupervised settings. Finally, while the focus of the current paper is on implementations of neural networks via deterministically modelled reaction networks, stochastic variants are possible as well. In particular, stochastically modelled reaction networks will be the more natural choice whenever the goal is the approximation of distributions as opposed to functions [8]. Study of such implementations is therefore another exciting avenue of future research.
We end this section with a brief collection of some notation that will be used throughout this paper. We denote the empty set by $\emptyset$. We denote an arbitrary index set by $I$. We use the notation $\bigsqcup_{i \in I} A_i$ to mean the union $\bigcup_{i \in I} A_i$ together with the condition that the sets $A_i$ are pairwise disjoint. By a partition of a set $S$, we mean a collection of non-empty subsets of $S$ that are pairwise disjoint and whose union is $S$. For two vectors $u, v$, we will denote the Hadamard product, which is simply term-wise multiplication, via $u \odot v$. That is, we have $(u \odot v)_i = u_i v_i$. For a function $f : \mathbb{R}^c \to \mathbb{R}$ and a vector $u = (u_1, \dots, u_c)$ we denote by $\nabla_u f$ the vector whose $i$th component is $\partial f/\partial u_i$. For a vector-valued function $f$, we denote by $f'(x)$ the vector whose $i$th component is $f_i'(x_i)$. The remainder of the paper is organized as follows. Sections 2 and 3 give primers, including notation used in this paper, on reaction networks and neural networks, respectively. As there are two distinct notions of networks in this paper, it is important to carefully separate the two. In §4, we present our main theoretical results pertaining to ODE implementations of neural networks. In §5, we demonstrate how to utilize our theoretical results to construct a reaction network that implements a given neural network with a fixed set of parameters and a smoothed ReLU activation function. In §6, we provide a detailed example, including a demonstration of how to utilize our theory to implement neural networks with different activation functions.

Reaction networks
Reaction networks are graphical representations of interactions between different 'species'. In this context, the word species may refer to different organisms (for example, if you are modelling the interactions among foxes and hares) or to different chemical compounds (for example, if you are modelling the dynamics of a biochemical process within a cell). In this paper, we are primarily interested in the latter context and will also refer to reaction networks as 'chemical reaction networks', as is common.
Definition 2.1. A reaction network, or chemical reaction network, consists of a non-empty and finite set of species $\mathcal S$ and a directed graph with vertex set $\mathcal C$ and directed edge set $\mathcal R$ satisfying the following conditions: each vertex is a linear combination of the species over the non-negative integers; every species appears with a positive coefficient in at least one vertex; no two vertices are the same linear combination of the species; each vertex is connected by a directed edge to at least one other vertex; there are no directed edges from a vertex to itself.
Vertices of the reaction network are called complexes, and directed edges are called reactions.
We will often denote a reaction network via $G = (\mathcal S, \mathcal C, \mathcal R)$. When considering general/theoretical systems, we will typically denote the species as $\mathcal S = \{X_1, \dots, X_n\}$, in which case our vertices/complexes are of the form $Y = \sum_{i=1}^n b_i X_i$ with each $b_i \in \mathbb{Z}_{\ge 0}$. We will use the common slight abuse of notation by also associating a complex $Y \in \mathcal C$ with the vector in $\mathbb{Z}^n_{\ge 0}$ whose $i$th component is $b_i$. Using this convention, we define the reaction vector for a reaction $Y \to \hat Y \in \mathcal R$ to be $\hat Y - Y \in \mathbb{Z}^n$. When considering specific examples, we will use more suggestive notation for our species. We present two examples to solidify the notation. It is a common practice, which we use here, to specify a reaction network by writing all the reactions, since the sets $\mathcal S$, $\mathcal C$ and $\mathcal R$ are contained in this description.
Example 2.2. Consider the following reaction network with two species, $\mathcal S = \{X_1, X_2\}$:
$$X_1 + X_2 \to 2X_2, \qquad X_2 \to X_1.$$
Here the set of complexes/vertices is $\{X_1 + X_2,\ 2X_2,\ X_2,\ X_1\}$. For example, it could be that $X_1$ is an active form of a protein and $X_2$ is the inactive form, and two actions can take place: (i) an inactive protein can catalyse the inactivation of an active protein and (ii) an inactive protein can spontaneously become active. For another example, we could use the network to model disease spread, with $X_1$ representing healthy/susceptible individuals and $X_2$ representing those that are infected. Whatever the modelling scenario is, the network is the same and consists of two species, four complexes (vertices) and two reactions. The associated reaction vectors are $(-1, 1)^T$ and $(1, -1)^T$.

Example 2.3. Consider the following reaction network with three species, $\mathcal S = \{X_1, X_2, X_3\}$:
$$0 \to X_1 + X_2, \qquad X_1 \rightleftharpoons X_3, \qquad X_1 + X_3 \to X_3.$$
In this example, molecules of $X_1$ and $X_2$ enter the system from outside of it via $0 \to X_1 + X_2$, $X_1$ can spontaneously convert to $X_3$ and vice versa via the two reactions $X_1 \rightleftharpoons X_3$, and $X_3$ catalyses the removal of $X_1$ molecules via the reaction $X_1 + X_3 \to X_3$.
The reaction network tells us the constituent species of a model, the counts of each of the species required for each of the reactions to take place and the counts of the products of each reaction. Moreover, the reaction vectors give the net changes in the counts of the species due to the occurrence of the different reactions. However, the reaction network does not determine the rates at which the different reactions take place.
A common modelling choice is to assume that the vector of concentrations of the species at time $t \ge 0$, denoted by $x(t) \in \mathbb{R}^n_{\ge 0}$, satisfies
$$\dot x(t) = \sum_{Y \to \hat Y \in \mathcal R} \Lambda_{Y \to \hat Y}(x(t))\,(\hat Y - Y), \tag{2.1}$$
where the enumeration is over all of the reactions. The collection of rate functions $\{\Lambda_{Y \to \hat Y}\}$ is called the kinetics of the model, and the most common form of kinetics, and the one we use throughout, is termed mass-action kinetics, in which
$$\Lambda_{Y \to \hat Y}(x) = k_{Y \to \hat Y} \prod_{i=1}^n x_i^{Y_i}$$
for some choice of rate constant $k_{Y \to \hat Y} > 0$, and where $Y_i$ is the $i$th component of $Y$ viewed as a vector in $\mathbb{Z}^n_{\ge 0}$. When $\Lambda$ is mass-action kinetics, we say that $(G, \Lambda)$ is a mass-action system. When mass-action kinetics is used, it is common to place the reaction rate constant next to the associated arrow in the graph, $Y \xrightarrow{k_{Y \to \hat Y}} \hat Y$.
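To make the mapping from network to ODE concrete, the following sketch assembles the mass-action right-hand side (2.1) for the two-reaction network of example 2.2 (assuming, as described there, the reactions $X_1 + X_2 \to 2X_2$ and $X_2 \to X_1$; the rate constants are arbitrary illustrative values):

```python
import numpy as np

# Reactions of example 2.2: X1 + X2 -> 2X2 and X2 -> X1.
# Each reaction is a (source complex, product complex) pair of count vectors.
reactions = [
    (np.array([1, 1]), np.array([0, 2])),  # X1 + X2 -> 2X2
    (np.array([0, 1]), np.array([1, 0])),  # X2 -> X1
]
rates = [2.0, 1.0]  # rate constants k1, k2 (illustrative values)

def mass_action_rhs(x):
    """Right-hand side of (2.1): sum over reactions of the rate function
    k * prod_i x_i^{Y_i} times the reaction vector Yhat - Y."""
    dx = np.zeros_like(x)
    for k, (y, yhat) in zip(rates, reactions):
        dx += k * np.prod(x ** y) * (yhat - y)
    return dx

# Simple Euler integration.  Both reaction vectors sum to zero, so the
# total mass x1 + x2 is conserved along the trajectory.
x = np.array([3.0, 1.0])
dt = 1e-3
for _ in range(20000):  # integrate to t = 20
    x = x + dt * mass_action_rhs(x)

# At steady state, k1*x1*x2 = k2*x2, i.e. x1 = k2/k1 = 0.5 here.
print(x, x.sum())
```

The conserved quantity $x_1 + x_2$ and the steady-state value $x_1 = k_2/k_1$ provide quick checks that the right-hand side was assembled correctly.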

Neural networks
We give a basic introduction to the type of neural networks we consider in this paper: feed-forward networks. For more on neural networks, see [33][34][35][36][37]. Loosely, a neural network is a graph that gives a visual depiction of a certain type of mathematical function. The class of functions they can represent, which will be detailed below, has many parameters, and is 'universal' in that such functions can be used to approximate any continuously differentiable function arbitrarily well [38,39]. The power of neural networks comes from the fact that they can be 'trained' from data, which simply means that the parameters of the function can be calibrated algorithmically so as to produce a final function capable of carrying out some pre-determined task (such as image recognition).
royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 18: 20210031

Below, we will first introduce the basic structure of a neural network. Next, we will explain how each such graph, when combined with a choice of parameters and an 'activation function', is simply a representation for a particular function. We will call such a network, in which all parameters, together with the activation function, are fixed, a 'hardwired' neural network. Finally, we will discuss how neural networks can be trained by finding parameters for the network that minimize (at least locally) a desired cost function. This minimization is often performed by a version of gradient descent and is termed backpropagation in the field.

Structure of a neural network
Formally, a feed-forward neural network $G = (V, D)$ is a directed graph on a set of nodes $V$ and a set of directed edges $D \subseteq V \times V$, such that there is a partition of $V$ into layers $L_\ell$, $V = \bigsqcup_{\ell=0}^{m} L_\ell$, with the property that $(\tilde X', \tilde X) \in D$ if and only if $\tilde X' \in L_\ell$ and $\tilde X \in L_{\ell+1}$ for some $\ell \in \{0, \dots, m-1\}$. We will refer to the set $L_\ell$ as the $\ell$th layer of $G$, so $G$ has $m+1$ layers, and each $L_\ell$, with $0 \le \ell \le m$, contains $c_\ell > 0$ nodes. The nodes in $L_0$ are referred to as input nodes, while those in $L_m$ are the output nodes. All nodes in $\bigcup_{\ell=1}^{m-1} L_\ell$ are referred to as hidden nodes or intermediate nodes. We use input layer, output layer and hidden layer to refer to each layer that contains the corresponding nodes. Note that we can partition $D$ as follows: $D = \bigsqcup_{\ell=1}^{m} D_\ell$, where $D_\ell$ is the set of edges from $L_{\ell-1}$ to $L_\ell$. For the sake of brevity, for the remainder of the paper we will refer to feed-forward neural networks simply as neural networks. Indices can often become burdensome when working with neural networks. Thus, we minimized their use in the preceding explanation, and will continue to do so when possible. That said, it will be useful to have an enumeration, and so we will denote the $j$th node in layer $\ell$ by $\tilde X^\ell_j$. See figure 1.
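The layer structure can be encoded directly; this minimal sketch (with illustrative layer sizes) checks the defining property that every edge runs from one layer to the next:

```python
# Nodes are (layer, index) pairs; layer sizes c_l are illustrative.
layer_sizes = [2, 3, 1]  # c_0 = 2 input nodes, c_1 = 3 hidden, c_2 = 1 output

nodes = [(l, j) for l, c in enumerate(layer_sizes) for j in range(c)]

# Fully connected consecutive layers, as in figure 1.
edges = [((l, j), (l + 1, i))
         for l in range(len(layer_sizes) - 1)
         for j in range(layer_sizes[l])
         for i in range(layer_sizes[l + 1])]

def is_feed_forward(edges):
    """Check the defining property: every edge (X', X) runs from some
    layer l to layer l + 1."""
    return all(head[0] == tail[0] + 1 for tail, head in edges)

print(len(nodes), len(edges), is_feed_forward(edges))
```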

A neural network as a mathematical function
We label each non-input node and each directed edge with a real number. A label for a non-input node is termed a bias, whereas a label for an edge is termed a weight. Moreover, we associate an activation function with each non-input node, which will be described fully below. We will call a neural network with such a labelling and a choice of activation function a hardwired neural network. For each $\ell \in \{1, \dots, m\}$, we will denote by $b^\ell \in \mathbb{R}^{c_\ell}$ the vector whose $i$th component gives the bias for node $\tilde X^\ell_i$, and will denote by $W^\ell \in \mathbb{R}^{c_\ell \times c_{\ell-1}}$ the matrix whose $(i, j)$th entry represents the weight of the edge between nodes $\tilde X^{\ell-1}_j$ and $\tilde X^\ell_i$. Note that the ordering of the indices of $W^\ell$ seems backwards at first glance. However, this ordering will make certain expressions slightly cleaner later, and is standard in the field. We will use the notation $B$ for the assignment of node labels (biases) and $W$ for the assignment of edge labels (weights). That is, for each $\ell \in \{1, \dots, m\}$, $B$ assigns the vector $b^\ell$ and $W$ assigns the matrix $W^\ell$, and $P = (B, W)$ is an assignment of labels to $G = (V, D)$. So long as we have also chosen an activation function $\varphi$, which will be described directly below, we may denote the resulting hardwired neural network via $(G, P, \varphi)$.
Let $\varphi : \mathbb{R} \to \mathbb{R}_{\ge 0}$ be a continuous, monotonic function, which is then extended to a function $\varphi : \mathbb{R}^c \to \mathbb{R}^c_{\ge 0}$ by componentwise application. We present a few examples of some so-called activation functions $\varphi : \mathbb{R} \to \mathbb{R}_{\ge 0}$.
1. $\varphi_1(y) = 1/(1 + e^{-y})$. This sigmoid function is a bijection onto the interval (0, 1), and is used quite commonly. See figure 2. 2. $\varphi_2(y) = \max(0, y)$. This function is termed the ReLU function (rectified linear units). See figure 3. 3. Let $h \ge 0$ and define
$$\varphi_3(y) = \frac{y + \sqrt{y^2 + 4h}}{2}. \tag{3.2}$$
This function is a smoothed version of the ReLU function, while remaining strictly monotonic, and will play a key role in the present work. See figure 4.
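The three activation functions above can be written down directly; note that the precise form used below for the smoothed ReLU, $(y + \sqrt{y^2 + 4h})/2$, is an assumption consistent with its stated properties (smooth and strictly monotonic for $h > 0$, and equal to $\max(0, y)$ at $h = 0$):

```python
import math

def sigmoid(y):
    """phi_1: a bijection onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def relu(y):
    """phi_2: the rectified linear unit."""
    return max(0.0, y)

def smoothed_relu(y, h):
    """Assumed smoothed ReLU: smooth and strictly increasing for h > 0,
    and exactly max(0, y) when h = 0."""
    return (y + math.sqrt(y * y + 4.0 * h)) / 2.0

print(sigmoid(0.0))              # 0.5
print(relu(-2.0), relu(3.0))     # 0.0 3.0
print(smoothed_relu(3.0, 0.0))   # 3.0, matching the ReLU at h = 0
print(smoothed_relu(-2.0, 1e-8)) # small but strictly positive
```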
A pair of consecutive layers $L_{\ell-1}$ and $L_\ell$, along with all edges between the two layers, encode a function $\psi_\ell : \mathbb{R}^{c_{\ell-1}} \to \mathbb{R}^{c_\ell}_{\ge 0}$ given by $\psi_\ell(u) = \varphi(W^\ell u + b^\ell)$.

Figure 1. The graphical structure of a neural network. The red, blue and green nodes are input nodes, hidden nodes and output nodes, respectively. An arrow from one node to another is a representation of the direction of influence, i.e. an edge in D. The value of the 'tail' node is input for computation of the value of the 'head' node.
Taking compositions, a hardwired neural network is then simply a visual representation for the function $\Psi_{(G,P,\varphi)} = \psi_m \circ \psi_{m-1} \circ \cdots \circ \psi_1 : \mathbb{R}^{c_0} \to \mathbb{R}^{c_m}_{\ge 0}$. Thus, the function associated with a neural network is simply a sequence of compositions that alternates between linear functions (via matrix multiplication and vector addition) and nonlinear functions (via application of the activation function).
It is useful to provide a bit more notation before moving on. Suppose that $d \in \mathbb{R}^{c_0}$ is the input to the function $\Psi_{(G,P,\varphi)}$ (or, equivalently, the function $\psi_1$). We then define $a^0 = d$ and, for $\ell \in \{1, \dots, m\}$,
$$z^\ell(d) = W^\ell a^{\ell-1}(d) + b^\ell \tag{3.4}$$
and
$$a^\ell(d) = \varphi(z^\ell(d)). \tag{3.5}$$
The vector $a^\ell(d)$ is said to give the activations of the nodes in the $\ell$th layer. With these definitions, we have that for any $\ell \in \{1, \dots, m\}$, $a^\ell(d) = \psi_\ell \circ \cdots \circ \psi_1(d)$. Moreover, note that $\Psi_{(G,P,\varphi)} = a^m$, which is a useful compact notation for $\Psi_{(G,P,\varphi)}$.
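The alternation of linear step (3.4) and componentwise activation (3.5) is a few lines of code; the parameters below are randomly chosen for illustration:

```python
import numpy as np

def forward(d, weights, biases, phi):
    """Compute the activations a^l for each layer by alternating the
    linear step z^l = W^l a^{l-1} + b^l with a^l = phi(z^l)."""
    a = np.asarray(d, dtype=float)
    activations = [a]
    for W, b in zip(weights, biases):
        a = phi(W @ a + b)
        activations.append(a)
    return activations

# Illustrative hardwired parameters for a 2-3-1 network.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
phi = lambda z: np.maximum(0.0, z)  # ReLU, applied componentwise

acts = forward([1.0, -1.0], weights, biases, phi)
print([a.shape for a in acts])  # one activation vector per layer
```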

Learning from data
Suppose now that we are given $N$ pieces of data of the form $(d, \tau(d)) \in \mathbb{R}^{c_0} \times \mathbb{R}^{c_m}$. To take a common example, $d \in \mathbb{R}^{784}$ could be the values of the 28 × 28 = 784 pixels in a greyscale image of a hand-drawn number, and $\tau(d) \in \mathbb{R}^{10}_{\ge 0}$ could be the vector $e_i$ (the vector with a 1 in the $i$th digit and zeros elsewhere) if the image is that of a hand-drawn $i - 1$. Here $d$ is considered the input data and $\tau(d)$ is considered the 'truth'. We could then construct a neural network with $c_0 = 784$ and $c_m = 10$ simply by choosing (i) the number of hidden layers, and how many nodes per layer, (ii) biases and weights, $P = (B, W)$, for the nodes and directed edges, and (iii) an activation function $\varphi$. In such a manner, our hardwired function $\Psi_{(G,P,\varphi)}$ is determined.
At this point, we could ask how closely our function matches the 'truth' by looking at some cost function. Therefore, assume that we have a cost function of the form
$$\mathrm{Cost}(P) = \frac{1}{N} \sum_{d} C(d; P) = \frac{1}{N} \sum_{d} C(d), \tag{3.6}$$
where the sum is over all the data and $C$ is a function giving a measure of how closely $\Psi_{(G,P,\varphi)}(d) = a^m(d)$ approximates $\tau(d)$.
The second equality above points out that for notational convenience we will typically suppress the dependence on the parameters $P = (B, W)$ in $C$. Some of the most commonly used cost functions are given below: 1. The quadratic cost function, in which case
$$C(d) = \frac{1}{2}\,|a^m(d) - \tau(d)|^2. \tag{3.7}$$
2. The one-norm cost function, in which case $C(d) = \sum_{i} |a^m_i(d) - \tau_i(d)|$. 3. The cross-entropy cost function, in which case $C(d) = -\sum_i \big[\tau_i(d) \ln(a^m_i(d)) + (1 - \tau_i(d)) \ln(1 - a^m_i(d))\big]$. In this paper, we will take $C$ to be given by the quadratic cost function (3.7). This choice of cost function does not play a significant role in the present work.
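In code, the three cost functions for a single datum read as follows (standard forms; the quadratic cost carries the factor 1/2 so that its gradient with respect to $a^m$ is simply $a^m - \tau$):

```python
import numpy as np

def quadratic_cost(a, t):
    """Quadratic cost (3.7): half the squared Euclidean distance."""
    return 0.5 * np.sum((a - t) ** 2)

def one_norm_cost(a, t):
    """One-norm cost: sum of absolute component-wise errors."""
    return np.sum(np.abs(a - t))

def cross_entropy_cost(a, t):
    """Cross-entropy cost; requires the components of a to lie in (0, 1)."""
    return -np.sum(t * np.log(a) + (1 - t) * np.log(1 - a))

a = np.array([0.8, 0.2])  # network output
t = np.array([1.0, 0.0])  # 'truth'
print(quadratic_cost(a, t))    # 0.5 * (0.04 + 0.04) = 0.04
print(one_norm_cost(a, t))     # 0.2 + 0.2 = 0.4
print(cross_entropy_cost(a, t))
```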
Of course, we did not specify how we chose our parameters $P = (B, W)$ for the model. Supposing we choose them randomly somehow, there is no reason our function $\Psi_{(G,P,\varphi)}$ should be a good approximation for $\tau$ for the given data. Therefore, we would like to find those parameters $P$ that minimize the cost function, and to do so it is natural to use gradient descent. Thus, we need to be able to efficiently compute $\nabla_{b^\ell} \mathrm{Cost}$ and $\frac{\partial \mathrm{Cost}}{\partial W^\ell_{ij}}$ for each appropriate value of $\ell$, $i$ and $j$. Because of the sum in (3.6), it is sufficient to compute the gradient of $C(d)$, and these can be computed as follows [36]: setting $\delta^m = \nabla_{a^m} C(d) \odot \varphi'(z^m(d))$ and, for $\ell \in \{1, \dots, m-1\}$, $\delta^\ell = \big((W^{\ell+1})^T \delta^{\ell+1}\big) \odot \varphi'(z^\ell(d))$, we have
$$\nabla_{b^\ell} C(d) = \delta^\ell \quad \text{and} \quad \frac{\partial C(d)}{\partial W^\ell_{ij}} = a^{\ell-1}_j(d)\,\delta^\ell_i,$$
where $\nabla_{a^m} C(d)$ is the gradient of $C(d)$ with respect to $a^m$. For example, if $C$ is given by the quadratic cost function (3.7), we have $\nabla_{a^m} C(d) = a^m(d) - \tau(d)$.
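The backpropagation recursion can be checked numerically against a finite-difference quotient; this is a sketch with the standard quadratic-cost gradients and illustrative random parameters:

```python
import numpy as np

def backprop_single(d, t, Ws, bs, phi, dphi):
    """Gradients of the quadratic cost C(d) = |a^m - tau|^2 / 2 with respect
    to the biases and weights, via the standard backpropagation recursion:
    delta^m = (a^m - tau) * phi'(z^m) and
    delta^l = ((W^{l+1})^T delta^{l+1}) * phi'(z^l)."""
    a = np.asarray(d, dtype=float)
    zs, acts = [], [a]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        zs.append(z)
        a = phi(z)
        acts.append(a)
    delta = (acts[-1] - t) * dphi(zs[-1])
    grads_b, grads_W = [delta], [np.outer(delta, acts[-2])]
    for l in range(len(Ws) - 2, -1, -1):  # propagate the error backwards
        delta = (Ws[l + 1].T @ delta) * dphi(zs[l])
        grads_b.insert(0, delta)
        grads_W.insert(0, np.outer(delta, acts[l]))
    return grads_b, grads_W

# Illustrative 2-3-1 network with sigmoid activation.
phi = lambda z: 1.0 / (1.0 + np.exp(-z))
dphi = lambda z: phi(z) * (1.0 - phi(z))
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(1)]
d, t = np.array([0.5, -0.3]), np.array([1.0])

def cost(weights):
    a = d
    for W, b in zip(weights, bs):
        a = phi(W @ a + b)
    return 0.5 * np.sum((a - t) ** 2)

gb, gW = backprop_single(d, t, Ws, bs, phi, dphi)

# Finite-difference check of one weight derivative.
eps = 1e-6
Ws_pert = [W.copy() for W in Ws]
Ws_pert[0][1, 0] += eps
fd = (cost(Ws_pert) - cost(Ws)) / eps
print(gW[0][1, 0], fd)  # the two values should agree closely
```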

Neural networks and ODEs
Fix a hardwired neural network $G = (V, D)$ with parameters $P = (B, W)$, whose $\ell$th layer contains $c_\ell$ nodes, in which each node has activation function $\varphi$. Let $W^\ell$, $a^\ell$ and $b^\ell$ be as in the previous section. Now consider a system of ODEs defined recursively via
$$\dot x^0_i(t) = 0, \quad i \in \{1, \dots, c_0\}, \tag{4.1}$$
and, for $\ell \in \{1, \dots, m\}$ and $i \in \{1, \dots, c_\ell\}$,
$$\dot x^\ell_i(t) = f^\ell_i(x^{\ell-1}(t), x^\ell_i(t)). \tag{4.2}$$
We use $x^0(0) = d \in \mathbb{R}^{c_0}_{\ge 0}$ to denote our initial condition, as it represents the input 'data' to the system. Note that $x^{\ell-1}$ is acting as an external 'forcing function' on $x^\ell$. In particular, the system above has a natural feed-forward structure. For $r \in \{0, 1, \dots, m\}$, we denote by $\mathcal F_r$ the subsystem of (4.1) and (4.2) consisting of only those terms $x^\ell_i$ for which $\ell \le r$. Note that for any $1 \le r \le m$, $\mathcal F_r$ contains $\mathcal F_{r-1}$ and that $\mathcal F_m$ is all of (4.1) and (4.2).
Definition 4.1. Suppose that, for each input $d \in \mathbb{R}^{c_0}_{\ge 0}$ and each choice of initial condition for the layers 1 through $m$, the solution to the system (4.1) and (4.2) satisfies $\lim_{t\to\infty} x^\ell(t) = a^\ell(d)$ for each $\ell \in \{1, \dots, m\}$. Then we say that the system (4.1) and (4.2) implements the neural network $(G, P, \varphi)$. ▵

Note that in order for a system to implement a neural network according to the above definition, it is not enough for the system to simply convert inputs, $d$, to the correct outputs, $a^m(d) = \Psi_{(G,P,\varphi)}(d)$. Instead, we require that the system calculates the activations for each node in the network, i.e. $a^\ell(d)$ for all $\ell \le m$, and does so for any choice of initial condition in layers 1 through $m$.

Example 4.2. For $\ell \in \{1, \dots, m\}$ and $i \in \{1, \dots, c_\ell\}$, take
$$f^\ell_i(y, x) = h + r^\ell_i(y)\,x - x^2, \tag{4.3}$$
where
$$r^\ell_i(y) = \sum_{j=1}^{c_{\ell-1}} W^\ell_{ij} y_j + b^\ell_i. \tag{4.4}$$
We claim that the system (4.1) and (4.2) with this choice of $f^\ell_i$ implements a neural network with the smoothed ReLU function (3.2). This statement will be proved rigorously below once we have some additional mathematical machinery.
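As a numerical sanity check of the claim in example 4.2, one can integrate the feed-forward ODE system and compare its limiting values with a direct forward pass. The sketch below assumes the quadratic form $f(y, x) = h + r(y)x - x^2$ with $r = W^\ell x^{\ell-1} + b^\ell$, whose unique positive fixed point is $(r + \sqrt{r^2 + 4h})/2$, i.e. the assumed smoothed ReLU applied to $r$:

```python
import numpy as np

H = 0.01  # smoothing parameter h > 0 (illustrative value)

def smoothed_relu(z):
    # positive root of h + z*x - x^2 = 0 (the assumed smoothed ReLU)
    return (z + np.sqrt(z * z + 4.0 * H)) / 2.0

# Illustrative hardwired 2-2-1 network.
rng = np.random.default_rng(2)
Ws = [rng.standard_normal((2, 2)), rng.standard_normal((1, 2))]
bs = [rng.standard_normal(2), rng.standard_normal(1)]
d = np.array([0.7, 0.2])

# Integrate the feed-forward system: layer 0 is held constant at d, and
# each node obeys x' = h + r(x_prev)*x - x^2 with r(x_prev) = W x_prev + b.
xs = [d.copy()] + [np.ones_like(b) for b in bs]  # arbitrary positive start
dt = 1e-3
for _ in range(100000):  # integrate to t = 100
    for l in range(1, len(xs)):
        r = Ws[l - 1] @ xs[l - 1] + bs[l - 1]
        xs[l] = xs[l] + dt * (H + r * xs[l] - xs[l] ** 2)

# Compare with the neural-network activations computed directly.
a = d
for W, b in zip(Ws, bs):
    a = smoothed_relu(W @ a + b)
print(xs[-1], a)  # the two should agree closely
```

Note that the limiting values do not depend on the (positive) starting values chosen for layers 1 through $m$, in line with definition 4.1.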
For a particular choice of $\ell$ and $i$, we can think of the one-dimensional system (4.2) as simultaneously implementing both the linear updating step (3.4) and evaluation with the activation function (3.5) for node $i$ in layer $\ell$. This observation motivates the following. Definition 4.3. If the system (4.1) and (4.2) implements the neural network $(G, P, \varphi)$, then (4.2) is termed the activation system for node $i$ in layer $\ell$. ▵ The following definition is added for completeness.
Definition 4.4. We will say that $y : \mathbb{R}_{\ge 0} \to \mathbb{R}^n$ converges exponentially to $\hat y \in \mathbb{R}^n$, and will write $y(t) \xrightarrow{\exp} \hat y$, if there are $c, \eta > 0$ for which $|y(t) - \hat y| \le c\,e^{-\eta t}$ for all $t \ge 0$. ▵

The following definition characterizes some nice properties that activation systems (4.2) can have.

Definition 4.5. Consider a one-dimensional system of the form
$$\dot x(t) = f(y(t), x(t)), \tag{4.5}$$
where $y$ is a forcing function. 1. System (4.5) is said to have $q$-polynomial decay if there is a $c > 0$ such that for each $K > 0$ there is an $M > 0$ for which $f(y, x) \le -c x^q$ whenever $|y| \le K$ and $x \ge M$. 2. System (4.5) is said to be exponentially feed-forward if for any forcing function $y$ with $y(t) \xrightarrow{\exp} \hat y$, we have $x(t) \xrightarrow{\exp} \hat x$, where $\hat x$ is the limiting value of the solution to $\dot x(t) = f(\hat y, x(t))$. ▵

Thus, the system (4.5) has $q$-polynomial decay if it decays faster than the solution to $\dot u = -c u^q$ when (i) the forcing function takes values that are not too large (quantified by $K$) and (ii) the current value of the process is large (quantified by $M$). Note that for $u(0) > 0$, the solution to $\dot u = -c u^q$ converges from infinity in finite time if $q > 1$. For completeness, we have proven this in proposition A.1 in appendix A.
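The finite-time 'convergence from infinity' of $\dot u = -c u^q$ with $q > 1$ can be seen from the explicit solution, which at any fixed $t > 0$ lies below an envelope that is independent of $u(0)$; a quick sketch:

```python
# Explicit solution of u' = -c u^q for q > 1 and u(0) > 0:
#   u(t) = (u(0)^(1-q) + c (q-1) t)^(-1/(q-1))
# which is bounded above by the u(0)-independent envelope
#   (c (q-1) t)^(-1/(q-1)).
def u(t, u0, c=1.0, q=2.0):
    return (u0 ** (1 - q) + c * (q - 1) * t) ** (-1 / (q - 1))

def envelope(t, c=1.0, q=2.0):
    return (c * (q - 1) * t) ** (-1 / (q - 1))

# At t = 1, every solution sits below the envelope, however large u(0) is.
for u0 in (10.0, 1e6, 1e12):
    print(u(1.0, u0), "<=", envelope(1.0))
```

The envelope bound is exactly the uniformity over initial conditions that 'convergence from infinity in finite time' captures.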
The usefulness of a system of the form (4.5) being exponentially feed-forward comes from the fact that we would like to be able to understand the long-term behaviour of $\dot x = f(y(t), x(t))$ via an understanding of the long-term behaviour of $\dot x = f(\hat y, x(t))$. We note with the following simple example that one is not always able to do so. Consider a choice of $f$ for which $f(y, x) = 1$ whenever $y > 1$, and for which zero is the globally attracting fixed point of $\dot x = f(1, x)$. The system with $y(t) = 1 + e^{-t}$ satisfies $y(t) \xrightarrow{\exp} 1$. However, for this particular choice of $y(t)$, we have $y(t) > 1$ for all $t \ge 0$. Thus, $x(t) = x(0) + t$, which does not converge to the fixed point of $\dot x = f(1, x)$, which is zero regardless of $x(0)$.
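One concrete (hypothetical) choice of $f$ with the stated properties makes the failure visible numerically: take $f(y, x) = 1$ for $y > 1$ and $f(y, x) = -x$ for $y \le 1$, so that $\dot x = f(1, x)$ has the globally attracting fixed point zero, while the forcing $y(t) = 1 + e^{-t}$ keeps the system on the branch $f = 1$ forever:

```python
import math

def f(y, x):
    """Hypothetical forcing-dependent right-hand side: f(y, x) = 1 when
    y > 1, and f(y, x) = -x when y <= 1."""
    return 1.0 if y > 1.0 else -x

# With y(t) = 1 + exp(-t) we have y(t) > 1 for all t, so x grows linearly
# even though y converges exponentially to 1.
x, dt = 0.0, 1e-3
for step in range(10000):  # integrate to t = 10
    y = 1.0 + math.exp(-step * dt)
    x += dt * f(y, x)
print(x)  # approximately 10 at t = 10, far from the frozen fixed point 0
```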
Given the discussion above, it will be useful to consider dynamical systems of the form
$$\dot x(t) = f(y, x(t)),$$
where $y$ should be thought of as a (time-independent) collection of parameters, but now $x$ is allowed to be higher-dimensional.
Definition 4.7. Suppose that $\dot x(t) = f(y, x(t))$, with $x(t) \in \mathbb{R}^n_{\ge 0}$ and $y \in \mathbb{R}^p_{> 0}$, is a parametrized dynamical system such that for any choice of $x(0) \in \mathbb{R}^n_{\ge 0}$ and $y \in \mathbb{R}^p_{> 0}$ the system has a unique solution. We will say that the system 1. is reliable if for each $y \in \mathbb{R}^p_{> 0}$ there is a unique fixed point $\hat x(y) \in \mathbb{R}^n_{\ge 0}$, the function $y \mapsto \hat x(y)$ is continuously differentiable, and $x(t) \to \hat x(y)$, as $t \to \infty$, for each choice of $x(0) \in \mathbb{R}^n_{\ge 0}$; 2. converges from infinity in finite time if for each $y$ there is a time $T > 0$ and a compact set $K \subset \mathbb{R}^n_{\ge 0}$, neither depending on $x(0)$, for which $x(t) \in K$ for all $t \ge T$ and all $x(0) \in \mathbb{R}^n_{\ge 0}$; 3. is exponentially reliable if it is reliable and, moreover, $x(t) \xrightarrow{\exp} \hat x(y)$ for each choice of $x(0) \in \mathbb{R}^n_{\ge 0}$. ▵

Note that the definition of reliable does not rule out the existence of fixed points outside of $\mathbb{R}^n_{\ge 0}$. The main question we have is the following: when can we conclude that the fully parametrized system (4.1) and (4.2) has our desirable properties (reliability, convergence from infinity in finite time and exponential reliability)? The following theorem shows that these properties follow from easily checked conditions on the functions $f^\ell_i$. In the theorem below, the vector of parameters $y$ should be thought of as a steady-state value for $x^{\ell-1}(t)$.
Theorem 4.8. Suppose that, for each $\ell \in \{1, \dots, m\}$ and $i \in \{1, \dots, c_\ell\}$, the system $\dot x(t) = f^\ell_i(y, x(t))$ is reliable for each fixed choice of $y \in \mathbb{R}^{c_{\ell-1}}_{\ge 0}$, and the system $\dot x(t) = f^\ell_i(y(t), x(t))$ has $q$-polynomial decay for some $q > 1$ and is exponentially feed-forward. Then the system (4.1) and (4.2) converges from infinity in finite time and is exponentially reliable.
Proof. The proof proceeds by induction on $r$ for the systems $\mathcal F_r$, where we remind the reader that the systems $\mathcal F_r$ are defined below (4.1) and (4.2). Consider the case $r = 1$, where we have
$$\dot x^1_i(t) = f^1_i(x^0(t), x^1_i(t)), \quad i \in \{1, \dots, c_1\}.$$
Here, reliability of $x^1_i$ follows by our assumption. The convergence of $x^1_i$ from infinity in finite time follows by the assumption of $q$-polynomial decay (compare with $\dot u = -c u^q$). Finally, the exponential reliability of $x^1_i$ follows from the exponential feed-forward assumption (here $x^0 \xrightarrow{\exp} x^0$ trivially). Hence, the system $\mathcal F_1$ satisfies all the desired properties. Now suppose the result holds for $\mathcal F_r$ with $r < m$. Then there is a compact set $K \subset \mathbb{R}^{c_r}_{\ge 0}$ and a time $T > 0$ so that $x^r(t) \in K$ for all $t \ge T$, and moreover $x^r \xrightarrow{\exp} \hat x^r$. Hence, by the assumption of $q$-polynomial decay, $x^{r+1}(t)$ converges from infinity in finite time, and we may conclude that the system $\mathcal F_{r+1}$ does as well. Finally, by the exponential feed-forward assumption on layer $r+1$, together with the assumption that $\dot x_i = f^{r+1}_i(y, x_i(t))$ is reliable, we may conclude that $\mathcal F_{r+1}$ is exponentially reliable, and the proof is complete. ▪ We return to the activation system presented in example 4.2.
Proposition 4.9. Consider the hardwired neural network $(G, P, \varphi)$ and the system (4.1) and (4.2) with $f^\ell_i$ as given in example 4.2 and with $h > 0$. This system implements, in the sense of definition 4.1, the hardwired feed-forward neural network $(G, P, \varphi)$ where $\varphi$ is given as the smoothed ReLU function (3.2). Moreover, the system converges from infinity in finite time and is exponentially reliable.
Proof. The fixed points of $\dot x = f^\ell_i(z, x(t))$ are
$$x^\pm = \frac{r^\ell_i(z) \pm \sqrt{r^\ell_i(z)^2 + 4h}}{2}, \quad \text{with } x^- < 0 < x^+.$$
Note that the strict inequalities follow from $h > 0$. The positive equilibrium is continuously differentiable in the argument $z$. Moreover, for any $x(0) \in \mathbb{R}_{\ge 0}$, asymptotic stability follows from standard methods. Hence, each $\dot x = f^\ell_i(z, x(t))$ is reliable. For each $\ell$ and $i$, the system $\dot x(t) = f^\ell_i(y(t), x(t))$ has 2-polynomial decay. Hence, to apply theorem 4.8 and complete the proof we simply need to show that $\dot x(t) = f^\ell_i(y(t), x(t))$ is exponentially feed-forward.
Thus, consider $\dot x(t) = f^\ell_i(y(t), x(t))$ and suppose that $y(t) \xrightarrow{\exp} \hat y$. Denote by $x^+$ the positive fixed point of $\dot x = f^\ell_i(\hat y, x(t))$. By assumption $y(t) \xrightarrow{\exp} \hat y$, and so by linearity we have that $r^\ell_i(y(t)) \xrightarrow{\exp} r^\ell_i(\hat y)$. Moreover, standard methods can be used to show that $x(t)$ is uniformly bounded in time. Combining the above and adding and subtracting appropriately allows us to conclude that there is an $M > 0$ for which
$$\frac{d}{dt}\,(x(t) - x^+)^2 \le -M (x(t) - x^+)^2 + a(t),$$
where $0 \le a(t) \le c\,e^{-\eta t}$ for some $c, \eta > 0$. Hence, by Gronwall's inequality (see appendix A),
$$(x(t) - x^+)^2 \le \Big((x(0) - x^+)^2 + \frac{c}{M - \eta}\Big) e^{-\eta t},$$
where we can select $\eta \ne M$ by taking $\eta$ slightly smaller if need be. Taking square roots shows that $x(t) \xrightarrow{\exp} x^+$, as desired. ▪

Reaction network implementation of a hardwired neural network with a smoothed ReLU activation function
This section is split into two parts. In §5.1, we give some preliminary definitions and concepts. In §5.2, we give the explicit construction.

Preliminaries
Consider a reaction network $G = (\mathcal S, \mathcal C, \mathcal R)$. It is convenient to separate the species set $\mathcal S$ into a disjoint union of dynamic and enzymatic species.
Definition 5.1. Let $G = (\mathcal S, \mathcal C, \mathcal R)$ be a reaction network. A species $Z \in \mathcal S$ is said to be an enzymatic species if, for each reaction $Y \to \hat Y \in \mathcal R$, the coefficient of $Z$ in the source complex $Y$ equals its coefficient in the product complex $\hat Y$. A species is said to be a dynamic species if it is not an enzymatic species. ▵

Thus, an enzymatic species is one whose concentration is fixed for all time to its initial value, regardless of the initial value of the system. Enzymatic species are referred to as such because they facilitate reactions to occur, just like biological enzymes; higher availability of enzymes results in a proportional speedup of reactions. We will use the notation $\mathcal S_{dyn}$ and $\mathcal S_{enz}$ for the set of dynamic species and enzymatic species, respectively, and since any species can only be one or the other, $\mathcal S = \mathcal S_{dyn} \sqcup \mathcal S_{enz}$.

Example 5.2. Consider the reaction network
$$X + Y + E \xrightarrow{k_1} 2Y + E, \qquad Y + F \xrightarrow{k_2} X + F,$$
for which $\mathcal S_{dyn} = \{X, Y\}$ and $\mathcal S_{enz} = \{E, F\}$.
The concentrations of enzymatic species are time-invariant by definition, and so they satisfy the trivial ODE $de/dt = 0$, where $e$ refers to the concentration of some enzyme $E$. This ODE obviously has the solution $e(t) = e(0)$ for all $t \ge 0$, independent of the dynamics of the other variables, and so it is without any loss of information that we can withhold the ODEs for the enzymes from our description. We simply regard the initial values of the enzymes as parameters in the dynamical system. Thus, we would say that the parametrized mass-action dynamical system associated to the network in example 5.2 is
$$\dot x = -k_1 e x y + k_2 f y \quad \text{and} \quad \dot y = k_1 e x y - k_2 f y, \tag{5.1}$$
where we regard $e$ and $f$ as positive parameters similar to $k_1$ and $k_2$.
An alternative approach is to remove all enzymatic species and to 'absorb' their time-invariant concentrations into the rate constants of the reactions. For instance, the network in example 5.2 is dynamically equivalent to the following network:
$$X + Y \xrightarrow{k_1 e} 2Y, \qquad Y \xrightarrow{k_2 f} X,$$
in the sense that both give rise to an identical system of differential equations (5.1). Even though the former construction, in which enzymatic species are included in the model description, may seem superfluous, it offers flexibility that will be found to be useful later when we construct reaction networks modularly, and then take unions of them. In these situations, species that were once enzymatic for one of the subnetworks can be dynamic for the resulting larger network. This perspective will also be useful in later work when we change our outlook from a reaction network implementation of a hardwired neural network to a neural network capable of learning. For a preview, suppose that we add the reaction $Z + F \to Z$ to the network in example 5.2, so the resulting network is
$$X + Y + E \xrightarrow{k_1} 2Y + E, \qquad Y + F \xrightarrow{k_2} X + F, \qquad Z + F \xrightarrow{k_3} Z. \tag{5.2}$$
Then $F$ has lost its status as an enzyme and has been moved to the set of dynamic species. The species partition for the new network is $\mathcal S_{dyn} = \{X, Y, F\}$ and $\mathcal S_{enz} = \{E, Z\}$. Addition of the reaction $Z + F \to Z$ allows us to modulate the concentration of $F$ and therefore also the rate at which the reaction $Y + F \to X + F$ occurs. The example is illustrative of some general properties, which we now state. A subnetwork of a reaction network $G = (\mathcal S, \mathcal C, \mathcal R)$ is a reaction network $G' = (\mathcal S', \mathcal C', \mathcal R')$ such that $\mathcal R' \subseteq \mathcal R$. It necessarily follows that $\mathcal S' \subseteq \mathcal S$, since every species in $\mathcal S'$ must participate in some reaction in $\mathcal R'$ and therefore also in $\mathcal R$, and by similar reasoning $\mathcal C' \subseteq \mathcal C$. While $\mathcal S'_{dyn} \subseteq \mathcal S_{dyn}$, the containment for enzymes runs backwards, i.e. $\mathcal S_{enz} \cap \mathcal S' \subseteq \mathcal S'_{enz}$. For example, the reaction network in example 5.2 is a subnetwork of the reaction network (5.2). The above-mentioned containments are easily checked to hold for this particular example.
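Dynamical equivalence of the two descriptions is easy to confirm numerically from (5.1): simulating the ODEs with the enzyme concentrations $e$ and $f$ treated as parameters, and the reduced ODEs with $\kappa_1 = k_1 e$ and $\kappa_2 = k_2 f$ absorbed into the rate constants, produces identical trajectories (the constants below are illustrative):

```python
# ODEs (5.1): x' = -k1*e*x*y + k2*f*y,  y' = k1*e*x*y - k2*f*y,
# with enzymes held at e and f, versus the reduced network with
# absorbed rate constants kap1 = k1*e and kap2 = k2*f.
k1, k2, e, f = 1.0, 2.0, 0.5, 3.0

def rhs_with_enzymes(x, y):
    return (-k1 * e * x * y + k2 * f * y, k1 * e * x * y - k2 * f * y)

def rhs_reduced(x, y, kap1=k1 * e, kap2=k2 * f):
    return (-kap1 * x * y + kap2 * y, kap1 * x * y - kap2 * y)

state1 = state2 = (2.0, 1.0)
dt = 1e-3
for _ in range(5000):  # integrate to t = 5 with Euler steps
    dx1, dy1 = rhs_with_enzymes(*state1)
    dx2, dy2 = rhs_reduced(*state2)
    state1 = (state1[0] + dt * dx1, state1[1] + dt * dy1)
    state2 = (state2[0] + dt * dx2, state2[1] + dt * dy2)
print(state1, state2)  # the two trajectories coincide
```

Both right-hand sides also conserve $x + y$, since each reaction converts one dynamic species into the other.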
With the assumption of mass-action kinetics, and for any particular choice of reaction rate constants, a reaction network can be translated into a system of ODEs via (2.1). Given this mapping, it is natural to say that a dynamical property of the parametrized ODE system is a property of the underlying reaction network itself. We proceed by fixing some notation, which will allow us to translate definition 4.7 to the reaction network setting. Let $\mathcal{G} = (\mathcal{S}, \mathcal{C}, \mathcal{R})$ be a reaction network with $\mathcal{S} = \mathcal{S}_{\mathrm{dyn}} \sqcup \mathcal{S}_{\mathrm{enz}}$, and fix some (arbitrary) ordering of the dynamic species set $\mathcal{S}_{\mathrm{dyn}}$. Let $n := |\mathcal{S}_{\mathrm{dyn}}|$ and let $x(t) \in \mathbb{R}^n_{\ge 0}$ denote the vector of concentrations of the dynamic species with respect to this ordering. We also arbitrarily order the set of parameters, which includes the reaction rate constants and the initial concentrations of the enzymes. With this ordering, the parameters can be identified with a vector $\theta \in \mathbb{R}^p_{>0}$, where $p := |\mathcal{R}| + |\mathcal{S}_{\mathrm{enz}}|$.
royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 18: 20210031

Definition 5.3. Suppose that $\dot{x}(t) = f(\theta, x(t))$, with $\theta \in \mathbb{R}^p_{>0}$, is the parametrized dynamical system obtained by applying mass-action kinetics to $\mathcal{G} = (\mathcal{S}, \mathcal{C}, \mathcal{R})$ with $\mathcal{S} = \mathcal{S}_{\mathrm{dyn}} \sqcup \mathcal{S}_{\mathrm{enz}}$, $n := |\mathcal{S}_{\mathrm{dyn}}|$ and $p := |\mathcal{R}| + |\mathcal{S}_{\mathrm{enz}}|$. Suppose that for any choice of $\theta$, the system $\dot{x}(t) = f(\theta, x(t))$ has a unique solution. We say that $\mathcal{G}$ is reliable, converges from infinity in finite time, or is exponentially reliable if the parametrized system has those respective properties according to definition 4.7.

Construction
We now give the construction of a reaction network that implements a neural network with the smoothed ReLU activation function. We will specifically design the network so that the ODE system has $f_i^\ell$ as given in example 4.2. The construction proceeds in the following manner. First, we build an explicit reaction network implementation of a single edge of the neural network, as depicted in figure 5, and describe the resulting parametrized ODE system. Second, we build on the previous step by giving a reaction network implementation of a single fixed node $\tilde{X}$ of the neural network along with all of its inputs, and again describe the resulting parametrized ODE system. Finally, we describe the reaction network implementation of the entire neural network, which results in the parametrized ODE system in (4.1) and (4.2). For the sake of readability, we limit the amount of enumeration used in our construction.
Step 1. The first step of the process, producing the reactions necessary for the implementation of a single edge, is carried out in table 1. The species sets for this particular reaction network are $\mathcal{S}_{\mathrm{dyn}} = \{X\}$ and $\mathcal{S}_{\mathrm{enz}} = \{H, W^+, W^-, B^+, B^-, X^0\}$. The associated mass-action ODE system is one-dimensional, in the variable $x$, and, if we assume all reactions occur with a rate constant of 1, is
$$\frac{d}{dt} x = h + \big((w^+ - w^-)x^0 + b^+ - b^-\big)x - x^2.$$
We will assume from here on that $q = 2$, and note that it is easy to make the necessary changes in the description for a general value different from 2. Note that when $q \ne 2$ the resulting activation function will be different from the smoothed ReLU.
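As a numerical sanity check on this step, the following Python sketch assumes the single-edge ODE takes the form $\dot{x} = h + \rho x - x^2$ with $\rho = (w^+ - w^-)x^0 + b^+ - b^-$ (consistent with the $q = 3$ analogue (6.1), which replaces $-x^2$ with $-2x^3$); under that assumption the unique positive fixed point is $(\rho + \sqrt{\rho^2 + 4h})/2$, which plays the role of the smoothed ReLU. All parameter values below are made up for illustration.

```python
import math

def rho(w_plus, w_minus, b_plus, b_minus, x0):
    # Net weighted input to the node from one edge (illustrative parameters).
    return (w_plus - w_minus) * x0 + (b_plus - b_minus)

def fixed_point(h, r):
    # Positive root of h + r*x - x^2 = 0: the smoothed-ReLU-like activation.
    return (r + math.sqrt(r * r + 4.0 * h)) / 2.0

def integrate(h, r, x_init, T=20.0, dt=1e-3):
    # Euler integration of xdot = h + r*x - x^2 from an arbitrary start.
    x = x_init
    for _ in range(int(T / dt)):
        x += dt * (h + r * x - x * x)
    return x

h = 1.0
r = rho(w_plus=1.5, w_minus=0.5, b_plus=0.2, b_minus=0.1, x0=2.0)  # r = 2.1
# The ODE converges to the positive fixed point regardless of the start.
assert abs(integrate(h, r, x_init=5.0) - fixed_point(h, r)) < 1e-6
# As h -> 0 the fixed point approaches max(r, 0), i.e. the standard ReLU.
assert abs(fixed_point(1e-12, 2.1) - 2.1) < 1e-5
```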
Step 2. For the second step, we implement via a reaction network the neural network depicted in figure 6, which now simply consists of $\tilde{X}$ along with all of its inputs. For this particular node, we assume there are $c > 0$ inputs. The construction proceeds by simply taking the union, over the $c$ edges $(\tilde{X}^0_i, \tilde{X})$, of the reaction networks described in Step 1. After this union, we once again have that $X$ is the only dynamic species, and the mass-action ODE for its concentration is
$$\frac{d}{dt} x = h + \rho_x\, x - x^2, \qquad (5.4)$$
where $\rho_x$ collects the weight and bias terms from the $c$ edges (and is analogous to $\rho_i^\ell$ from (4.4)). Note that the above corresponds with the equations in example 4.2.
Step 3. The third step is to construct the final network by taking the union of the construction described in the second step over all non-input nodes $\tilde{X}$. In terms of dynamical systems, this constitutes taking a union of the systems of ODEs given by (5.4), with the appropriate indices applied to the variables and parameters, for each layer $\ell \in \{1, \ldots, m\}$. The final system of equations appears in (4.1) and (4.2) and in example 4.2. Note that many species that were enzymatic in a particular subnetwork, for example the species associated with the terms $x^0_i$ in the second step, are dynamic species in the final model.
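Under the same assumed node ODE $\dot{x} = h + \rho x - x^2$, the union construction can be sketched on a toy network (two inputs, two hidden nodes, one output node; all weights and biases below are invented for illustration). The equilibria of the coupled ODE system should match a direct feed-forward evaluation with the activation $\phi(z) = (z + \sqrt{z^2 + 4h})/2$:

```python
import math

def phi(z, h=1.0):
    # Smoothed-ReLU-like activation: positive root of h + z*x - x^2 = 0.
    return (z + math.sqrt(z * z + 4.0 * h)) / 2.0

# Toy hardwired network (all weights/biases illustrative).
x_in = [1.0, 0.5]
W1 = [[1.0, -0.5], [0.3, 0.8]]   # hidden-layer weights
b1 = [0.1, -0.2]
W2 = [[0.7, -1.2]]               # output-layer weights
b2 = [0.4]
h = 1.0

# Direct feed-forward evaluation.
hid = [phi(sum(W1[i][j] * x_in[j] for j in range(2)) + b1[i], h) for i in range(2)]
out = [phi(sum(W2[i][j] * hid[j] for j in range(2)) + b2[i], h) for i in range(1)]

# Coupled ODE system from the union construction: every non-input node is
# dynamic; the input nodes hold their (constant) values.
x = [0.5, 0.5, 0.5]              # [hidden1, hidden2, output], arbitrary start
dt, T = 1e-3, 40.0
for _ in range(int(T / dt)):
    r1 = [sum(W1[i][j] * x_in[j] for j in range(2)) + b1[i] for i in range(2)]
    r2 = [sum(W2[i][j] * x[j] for j in range(2)) + b2[i] for i in range(1)]
    dx = [h + r1[0] * x[0] - x[0] ** 2,
          h + r1[1] * x[1] - x[1] ** 2,
          h + r2[0] * x[2] - x[2] ** 2]
    x = [xi + dt * di for xi, di in zip(x, dx)]

# Equilibria of the coupled system match the feed-forward values.
assert all(abs(x[i] - hid[i]) < 1e-6 for i in range(2))
assert abs(x[2] - out[0]) < 1e-6
```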

An example
In this section, we provide an example to visually demonstrate several aspects of our theory and our constructions. The focus of this paper is not on training a network; that will be the focus of our next work. Instead, in this paper we focus on the different qualitative properties of possible constructions, as detailed in definitions 4.7 and 5.3, and so this example will primarily share that focus. We will showcase how the limiting values of the ODE associated with a reaction network that implements the smoothed ReLU activation function, as detailed in §5, match precisely those of the more standard implementation of the neural network via direct use of the activation function (3.2). Moreover, we will demonstrate the fast convergence of the ODE, a property we have proven to hold in proposition 4.9. Next, we will demonstrate the flexibility of the developed theory by chemically implementing a different activation function: one that grows like $\sqrt{y}$ as $y \to \infty$ (as opposed to the linear growth of the ReLU) and converges to 0 as $y \to -\infty$. This new implementation will still satisfy the conditions of theorem 4.8, and hence still enjoy the properties of definitions 4.7 and 5.3. Finally, we will explain how any activation function with growth of the form $y^{1/k}$, as $y \to \infty$, for any integer $k \ge 1$, can likewise be implemented chemically.
As it is a standard example in the field, we use the MNIST dataset of handwritten digits [40]. See figure 7 for four representative images from this dataset. These images have 784 = 28 × 28 pixels, and the task of the neural network is to take a greyscale image of such a hand drawn digit, and correctly identify the digit. For example, we want the output to correctly identify the images in figure 7 as 8, 2, 6 and 7, respectively.
The input to the neural network can be regarded as a single vector of size 784. Further, it is natural to choose the number of output nodes to be 10, with each node representing a different digit from the set {0, 1, …, 9}. To complete the specification of the structure of the neural network, we will, somewhat arbitrarily, choose to have a single hidden layer with 40 nodes. Therefore, our neural network has 784 input nodes, 40 hidden nodes and 10 output nodes, for a total of 50 non-input nodes.

Figure 5. A single edge in a neural network, with one input and one output node.
As already mentioned, we will use the construction detailed in §5.2, yielding the smoothed ReLU (3.2) as our activation function, and we will choose h = 1 as our smoothing parameter. We now specify how we implemented our neural network in Matlab. For the sake of reproducibility, we first set the seed of our random number generator with the command 'rng(1234)'. We used this seed for every computation reported in this section. We then initialized our weights and biases randomly, using scaled Gaussians, via the following commands:

W1 = (1/sqrt(c0))*randn(c1,c0);
W2 = (1/sqrt(c1))*randn(c2,c1);
beta1 = randn(c1,1);
beta2 = randn(c2,1);

In order to ensure that we use exactly the same random variables as the reaction network implementation does, we also defined an initial condition via the command

x00 = 10*rand([50,1]);

as that call is necessary in our reaction network implementation, in which each of the hidden and output nodes is a dynamic variable and therefore requires an initial condition. While present in the code, this term is not used in the standard neural network implementation.
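For readers without Matlab, the initialization recipe above can be mirrored in pure Python (standard library only). Note that Python's random stream does not match Matlab's, so this reproduces the recipe, not the exact numbers used in the paper.

```python
import math
import random

random.seed(1234)                      # analogue of rng(1234); streams differ
c0, c1, c2 = 784, 40, 10               # input, hidden and output layer sizes

# Scaled-Gaussian weights and standard-Gaussian biases, as in the Matlab code.
W1 = [[random.gauss(0.0, 1.0) / math.sqrt(c0) for _ in range(c0)] for _ in range(c1)]
W2 = [[random.gauss(0.0, 1.0) / math.sqrt(c1) for _ in range(c1)] for _ in range(c2)]
beta1 = [random.gauss(0.0, 1.0) for _ in range(c1)]
beta2 = [random.gauss(0.0, 1.0) for _ in range(c2)]

# Initial condition for the 50 dynamic (hidden + output) species.
x00 = [10.0 * random.random() for _ in range(c1 + c2)]

assert (len(W1), len(W1[0])) == (40, 784) and (len(W2), len(W2[0])) == (10, 40)
assert len(x00) == 50 and all(0.0 <= v < 10.0 for v in x00)
# Total number of trainable parameters (weights plus biases).
assert c1 * c0 + c2 * c1 + c1 + c2 == 31810
```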
We used a quadratic cost function (3.7) in which the 'truth', denoted $\tau(d)$, was a vector with a 10 in the place of the true digit (i.e. if the digit represented by $d$ is 0, then $\tau(d)$ has a 10 in the first component; if the digit represented by $d$ is 1, then $\tau(d)$ has a 10 in the second component; etc.), and ones in all other components. In order to implement gradient descent, we used a learning rate of $\eta = 0.1$, so that after each iteration of the neural network we update our parameters via
$$w_{ij}^\ell \mapsto w_{ij}^\ell - \eta \frac{\partial C}{\partial w_{ij}^\ell} \quad \text{and} \quad b_i^\ell \mapsto b_i^\ell - \eta \frac{\partial C}{\partial b_i^\ell},$$
for the appropriate $\ell$, $i$ and $j$. In order to estimate the derivatives above, we utilized stochastic gradient descent, using a batch of 300 randomly selected elements from the first 60 000 entries in the MNIST dataset. The specific call we used in our Matlab code was

Vals = randperm(60000,BatchSize);

where BatchSize had been set to 300. See figure 8 for (i) the estimate of the cost function and (ii) the number correctly predicted, out of the randomly chosen batch of 300, by the neural network over 1000 iterations of the learning process. Note that near the end of the 1000 iterations, the neural network is correctly identifying just over 95% of the digits. For the sake of comparison, in figure 9 we give similar plots for the standard ReLU activation function (i.e. taking h = 0); there, the neural network correctly identifies around 88% of the digits. The superiority of the smoothed version of the ReLU activation function was apparent for nearly all the seeds of the random number generator that we tried (data not shown). The precise reason for the superiority of the smoothed version of the ReLU activation function in the present setting is unclear to us, though perhaps the lack of a zero derivative for y < 0 is playing a role.

Table 1. Components of an elementary reaction network: the chemical implementation of a single directed edge $(\tilde{X}^0, \tilde{X})$, along with the nodes $\tilde{X}^0$ and $\tilde{X}$, of the neural network. A neural network is naturally viewed as a disjoint union of its edges, which allows a chemical implementation to be assembled as an appropriate union of elementary reaction networks. (Columns: which aspect of the neural network is implemented chemically, e.g. the input $\tilde{X}^0$ and the weight of the edge $(\tilde{X}^0, \tilde{X})$; and which term results in the ODE for the species X.)

Figure 6. Step two: the node $\tilde{X}$ along with all of its c inputs.

We now demonstrate the learning of the reaction network in a different manner: by visualizing the output trajectories of a subset of the nodes on a particular image from the database, after different numbers of iterations of the learning process. We arbitrarily chose the 30th image in the database, which is the 7 presented as the right-most image in figure 7. Since the image is that of a 7, we hope and expect that the equilibrium value associated with the 8th output node of our system will eventually converge towards 10, whereas the values of the other output nodes will converge towards 1. See figure 10 for trajectories of output nodes 1, 2, 6 and 8 (associated with the digits 0, 1, 5 and 7), and hidden nodes 1 and 32. As expected, the equilibrium value associated with the 8th output node does indeed separate from the others and move towards 10 as the number of iterations increases, whereas the other output nodes remain near the value 1. Also of interest is that the equilibrium value associated with output node 2, which is associated with the digit 1, converges towards 1 more slowly than do the other output nodes. We assume this is because the digit, which is a seven, has characteristics similar to the digit 1. For the purposes of this particular calculation, our initial condition for all 50 nodes was chosen to be equal to one.
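The target vector $\tau(d)$, the quadratic cost, the gradient-descent update and the minibatch selection described above can be sketched as follows (a Python sketch; the helper names are ours, and `random.sample` stands in for Matlab's `randperm`):

```python
import random

def tau(d):
    """Target vector for digit d: 10 in the true digit's slot, ones elsewhere."""
    t = [1.0] * 10
    t[d] = 10.0
    return t

def quadratic_cost(a, t):
    # Quadratic cost (1/2)*||a - t||^2 for a single sample.
    return 0.5 * sum((ai - ti) ** 2 for ai, ti in zip(a, t))

eta = 0.1  # learning rate used in the paper's experiments

def sgd_step(params, grads):
    # One gradient-descent update: parameter <- parameter - eta * derivative.
    return [p - eta * g for p, g in zip(params, grads)]

# Minibatch selection, analogous to Vals = randperm(60000, BatchSize).
random.seed(0)
batch = random.sample(range(60000), 300)

assert tau(7)[7] == 10.0 and sum(tau(7)) == 19.0
assert quadratic_cost(tau(7), tau(7)) == 0.0
assert len(set(batch)) == 300 and all(0 <= v < 60000 for v in batch)
assert abs(sgd_step([1.0], [2.0])[0] - 0.8) < 1e-12
```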
As mentioned above, the fact that a neural network using a ReLU activation function (smoothed or not) can be trained to identify the hand-drawn digits from the MNIST dataset is not the point of this paper, and is very well known. Instead, we now focus on the behaviour of the ODE associated with the reaction network implementation. Using the same set-up as detailed above (including the randomized initial conditions), but with both the BatchSize and the number of iterations set to 1, we may output the values of the activations $a^\ell$ for the neural network with the smoothed ReLU activation function with h = 1. There are a total of 50 terms (one for each node) in these vectors, which is too many to visualize. We therefore arbitrarily selected the first and third nodes from the output layer and the first and 32nd nodes from the hidden layer. The resulting values are
$$a_1^2 = 1.54691661955827, \quad a_3^2 = 0.885658219979311, \quad a_1^1 = 1.06208572398989, \quad a_{32}^1 = 0.72187173499449.$$
Next, we solved the system of 50 ODEs associated with our reaction network construction, with the randomized initial conditions detailed above, while using exactly the same random variables as in the standard neural network implementation. Representative plots for the chosen four nodes are given in figure 11. We simulated until time 5 and found
$$x_1^2(5) = 1.54703516476441, \quad x_3^2(5) = 0.885627963885228, \quad x_1^1(5) = 1.06216260026126, \quad x_{32}^1(5) = 0.721901309123957.$$
As our theory guaranteed, these values match those of the standard neural network very well. Of course, it is impossible to 'demonstrate' convergence from infinity. Instead, we simply modified the initial conditions to be substantially larger; the resulting values at time 5, once again, matched the values from the previous computations.

Figure 9. Performance of the ReLU activation function (i.e. h = 0). (a) Estimate of the cost function over each iteration of the neural network (from 300 randomly selected elements from the MNIST dataset). (b) Total number of images from the 300 whose digits were correctly identified. In each panel, the x-axis represents the iteration number of the learning process.

Figure 10. Plots of trajectories from the ODEs associated with our reaction network construction after different numbers of iterations. Note how the equilibrium of the 8th output node, which is associated with the correct digit of 7, seems to be converging towards 10 as the number of iterations increases, whereas the equilibria associated with the other output nodes converge towards 1.

Figure 11. Representative plots with 'regularly sized' initial conditions. We chose to visualize nodes 1 and 3 from the output layer and nodes 1 and 32 from the hidden layer.

Modifying the activation function
We slightly modify the reaction network construction of §5 by using q = 3 instead of q = 2 in the final column of table 1. That is, for each of the hidden and output nodes we simply change the reaction network so that it includes the reaction 3X → X instead of 2X → X. This change modifies the ODE for a particular node (hidden or output) to
$$\dot{x} = h + \rho\, x - 2x^3, \qquad (6.1)$$
with h > 0 and $\rho$ as before. Note that the above is the analogue of $f_i^\ell$ in (4.3); we are suppressing subscripts and superscripts for the sake of clarity (as we have done at times throughout the paper).
By Descartes' rule of signs, for each particular choice of h and $\rho$ the system (6.1) has precisely one positive fixed point, and, as can be shown by standard methods, this fixed point is stable. Fixing h > 0, the activation function for the resulting system is found by solving for the unique positive fixed point as a function of $\rho$. See figure 13 for a plot of this activation function when h = 1. We see that the function is monotonic, grows like $\sqrt{y/2}$ as $y \to \infty$, and converges to zero as $y \to -\infty$. Finally, it can be shown, by arguments similar to those in the proof of proposition 4.9, that the resulting chemical system implements, in the sense of definition 4.1, the hardwired feed-forward neural network $(G, P, w)$, where $w$ is given in figure 13, and that the system converges from infinity in finite time (due to its 3-polynomial decay) and is exponentially reliable.
In order to implement this chemical system, we solved the associated ODEs, as detailed above. In the case q = 2, we had an analytic formula for the activation function $w$, given by (3.2), which we could easily differentiate to find $w'(z)$ and plug into the relevant terms in (3.8) for the purposes of gradient descent. In the present case we are not so fortunate; however, the derivative can still be calculated in a straightforward manner. For a fixed value of z, denote the unique positive fixed point of (6.1), with $z = \rho$, by $w(z)$. Then $w(z)$ is defined implicitly via
$$0 = h + z\, w(z) - 2 w(z)^3.$$
Differentiating with respect to z and solving yields
$$w'(z) = -\frac{w(z)}{z - 3 \cdot 2 \cdot w(z)^2}.$$
As $w(z)$ is the output from the ODE solver, we also obtain the derivative in a straightforward manner.
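The implicitly defined activation and its derivative are easy to compute numerically. The following Python sketch (bisection in place of the paper's ODE solver; the function names are ours) finds the positive root of $0 = h + zw - 2w^3$ and checks the derivative formula against a finite difference:

```python
import math

def w_q3(z, h=1.0, iters=200):
    """Unique positive root of 0 = h + z*w - 2*w**3, found by bisection."""
    def f(w):
        return h + z * w - 2.0 * w ** 3
    lo, hi = 0.0, math.sqrt(2.0 * max(z, 0.0)) + 1.0   # f(lo) > 0 >= f(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def w_q3_prime(z, h=1.0):
    # Implicit derivative: w'(z) = -w(z) / (z - 6*w(z)**2).
    w = w_q3(z, h)
    return -w / (z - 6.0 * w * w)

# Growth like sqrt(z/2) as z -> infinity ...
assert abs(w_q3(1e6) / math.sqrt(1e6 / 2.0) - 1.0) < 1e-3
# ... and decay towards 0 as z -> -infinity.
assert w_q3(-1e6) < 1e-5
# The implicit-derivative formula agrees with a central finite difference;
# note z - 6*w**2 < 0 at the stable fixed point, so w'(z) > 0.
eps = 1e-6
fd = (w_q3(2.0 + eps) - w_q3(2.0 - eps)) / (2.0 * eps)
assert abs(fd - w_q3_prime(2.0)) < 1e-6 and w_q3_prime(2.0) > 0
```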
With all the details in place, we can run the system and implement the neural network via our new chemical reaction network. In figure 14, we provide plots of the estimated cost and the number of images correctly identified, out of a batch of 300, using this new chemical system and activation function, with all other variables (i.e. number of layers, number of hidden nodes, seed of the random number generator, etc.) kept the same as above. We note that this activation function performs similarly to the ReLU activation function (figure 9).

Figure 13. (Caption fragment.) A comparison curve is added, showing the corresponding activation function if h were taken to be zero while q = 3.
Finally, we note that we could also select q to be any integer greater than 3, and a similar analysis can be carried out. In particular, when q is an integer greater than or equal to 2, we obtain an activation function that grows like $y^{1/(q-1)}$. Moreover, the derivative can be calculated as above, and is found to satisfy
$$w'(z) = -\frac{w(z)}{z - q(q-1)\, w(z)^{q-1}}.$$
These systems, with their different activation functions, could be useful in different settings.
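The general-q case can be sketched in the same way, assuming the node ODE $\dot{x} = h + \rho x - (q-1)x^q$ arising from the reaction $qX \to X$ with rate constant 1 (consistent with (6.1) for q = 3). For q = 2 the positive root has the closed form $(z + \sqrt{z^2 + 4h})/2$, which provides a check:

```python
import math

def w_general(z, q, h=1.0, iters=200):
    """Positive root of 0 = h + z*w - (q-1)*w**q (the qX -> X construction)."""
    def f(w):
        return h + z * w - (q - 1) * w ** q
    lo, hi = 0.0, 1.0
    while f(hi) > 0.0:          # grow the bracket until f(hi) <= 0
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def w_general_prime(z, q, h=1.0):
    # Implicit differentiation of 0 = h + z*w - (q-1)*w**q gives
    # w'(z) = -w / (z - q*(q-1)*w**(q-1)).
    w = w_general(z, q, h)
    return -w / (z - q * (q - 1) * w ** (q - 1))

# q = 2 recovers the smoothed ReLU, whose root is known in closed form.
z, h = 3.0, 1.0
assert abs(w_general(z, 2, h) - (z + math.sqrt(z * z + 4 * h)) / 2.0) < 1e-9
# For large z the activation grows like (z/(q-1))**(1/(q-1)).
for q in (2, 3, 4):
    w = w_general(1e9, q)
    assert abs(w / (1e9 / (q - 1)) ** (1.0 / (q - 1)) - 1.0) < 1e-3
# The derivative formula agrees with a central finite difference.
eps = 1e-6
fd = (w_general(2.0 + eps, 4) - w_general(2.0 - eps, 4)) / (2.0 * eps)
assert abs(fd - w_general_prime(2.0, 4)) < 1e-6
```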
Data accessibility. All code used to produce the images in this paper is available upon request.

Figure 14. Performance of the new activation function (when q = 3). (a) Estimate of the cost function over each iteration of the neural network (from 300 randomly selected elements from the MNIST dataset). (b) Total number of images from the 300 whose digits were correctly identified. In each panel, the x-axis represents the iteration number of the learning process.