Basic principles of the genetic code extension

Compounds containing non-canonical amino acids (ncAAs) or other artificially designed molecules have many applications in medicine, industry and biotechnology. They can be produced by modifying or extending the standard genetic code (SGC). Organisms with a customized genetic code can then produce peptides or proteins containing ncAAs continuously and stably. Among several methods of engineering the code, using non-canonical base pairs is especially promising, because it generates many new codons, which can be used to encode new amino acids. Since even one pair of new bases extends the SGC up to 216 codons over a six-letter nucleotide alphabet, the extension can be achieved in many ways. Here, we propose a stepwise procedure for extending the SGC with one pair of non-canonical bases that minimizes the consequences of point mutations. We describe relationships between codons in the framework of graph theory. All 216 codons are represented as nodes of a graph, whose edges are induced by all possible single nucleotide mutations between codons. Therefore, every set of canonical and newly added codons induces a specific subgraph. We characterize the properties of the induced subgraphs generated by selected sets of codons. Thanks to that, we are able to describe a procedure for incrementally adding meaningful codons up to the full coding system consisting of three pairs of bases. The gradual extension of the SGC makes the whole system robust to changes in genetic information caused by mutations. It is also compatible with the views assuming that codons and amino acids were added successively to the primordial SGC, which evolved to minimize harmful consequences of mutations or mistranslations of encoded proteins.

The basic diversity of proteins fulfilling a wide range of functions within organisms is based on 20 naturally occurring amino acids. The proteins are also modified post-translationally, which extends their properties. However, it is tempting to increase this variety with artificially designed amino acids or other molecules. They can be introduced directly into proteins or modified in a given proteinaceous molecule, but a more universal and stable solution is to modify the standard genetic code (SGC) so that the newly created proteins including non-canonical amino acids (ncAAs) are constantly produced by a given organism. Several approaches were invented to achieve this goal [Chin, 2014].

The first approach uses stop translation codons (e.g. the rarely used UAG) to encode non-canonical amino acids [Noren et al., 1989, Chin, 2017, Italia et al., 2017, Schultz, 2018]. This method requires a modified aminoacyl-tRNA synthetase which charges a tRNA with an ncAA. This suppressor tRNA must recognize the stop codon, and then the ncAA is incorporated into a protein during its synthesis. However, this method enables utilization of up to two stop codons because one of the three codons must be left as a termination signal of translation [Ozer et al., 2017].

Another method applies quadruplet codons, which consist of an infrequently used triplet codon with an additional base [Hohsaka et al., 1996, Anderson et al., 2004, Neumann et al., 2010]. Such a quadruplet is decoded by a modified tRNA containing a complementary quadruplet anticodon. Then, the ncAA associated with this tRNA is added to a newly synthesized protein owing to the frameshifted open reading frame. However, the typical triplet can be decoded competitively by a typical tRNA, which decreases the efficiency of this procedure.
It is also possible to assign various sense codons to different ncAAs by withdrawing the cognate amino acid and aminoacyl-tRNA synthetase, and adding pre-charged ncAA-tRNAs bearing the corresponding anticodons [Forster et al., 2003, Goto et al., 2011, Josephson et al., 2005]. This method, however, sacrifices a natural amino acid. A new method overcomes this problem and frees sense codons for non-canonical amino acids without elimination of natural ones [Tajima et al., 2018]. This is achieved by utilization of appropriate synonymous codons, depletion of their corresponding tRNAs and addition of tRNAs pre-charged with ncAAs. This method enables expanding the repertoire to 23 potential ncAAs via division of multiple codon boxes [Iwane et al., 2016] but can influence the efficiency and speed of translation as well as protein folding due to altered codon usage [Plotkin and Kudla, 2011].

A weakness of these methods is that they rely on the set of four canonical bases, which can generate only a limited set of codons, up to 64. Therefore, a promising approach is to use unnatural base pairs, which can generate a much larger number of genuinely new codons. This approach does not interfere with the natural system because it does not involve the canonical codons, while the new ones are free of any natural role. Such experiments with at least three pairs of the fifth and the sixth nucleotide were already carried out and appeared promising [Ishikawa et al., 2000, Ohtsuki et al., 2001, Yang et al., 2007, Kimoto et al., 2009, Malyshev et al., 2009, Dien et al., 2018, Hamashima et al., 2018]. Protein synthesis using this approach occurred successfully in semi-synthetic bacteria [Zhang et al., 2017].

The inclusion of one pair of unnatural nucleotides can extend the standard genetic code even up to 216 codons, which is more than three times larger than the set of 64 canonical codons.
The new unassigned codons, 152 in number, raise an exciting possibility of adding many unnatural amino acids or similar compounds and creating a new extended genetic code (EGC). Therefore, it is reasonable to pose a question about the rules according to which we can extend the code. There are many possibilities to do this. Here we propose a way assuming that the genetic code should be a system resistant to point mutations, which can change the encoded information. In other words, we present a formal description of the genetic code expansion that minimizes the cost of changing codons due to mutations. The presented procedure of incremental expansion of the genetic code ensures robustness of the extended code against losing genetic information. This assumption seems attractive in the context of the hypothesis postulating that the genetic code evolved to minimize harmful consequences of mutations or mistranslations of coded proteins [Woese, 1965, Sonneborn, 1965, Epstein, 1966, Goldberg and Wittes, 1966, Haig and Hurst, 1991, Freeland and Hurst, 1998a, Freeland and Hurst, 1998b, Gilis et al., 2001, Freeland et al., 2003, Goodarzi et al., 2005].

The extension of the standard genetic code

We start our investigation by applying an approach similar to that presented by [Błażej et al., 2018a], in which the standard genetic code is described as a graph $G(V_C, E_C)$, where $V_C$ is the set of vertices (nodes), whereas $E_C$ is the set of edges. $V_C$ corresponds to the set of 64 canonical codons using four natural nucleotides {A, T, G, C}, while the edges are induced by all possible single nucleotide substitutions between the codons. Therefore, the graph $G(V_C, E_C)$ is a representation of all possible single point mutations occurring between canonical codons.

In this work, we introduce a more general graph $G(V, E)$, in which the set of vertices corresponds to 216 codons over a six-letter alphabet, while the set of edges is defined in a similar way to $E_C$.

Definition 1. Let $G(V, E)$ be a graph in which $V$ is the set of vertices representing all possible 216 codons, whereas $E$ is the set of edges connecting these vertices. All connections between the nodes fulfil the property that two nodes, i.e. codons $u, v \in V$, are connected by the edge $e(u, v) \in E$ ($u \sim v$) if and only if the codon $u$ differs from the codon $v$ in exactly one position.

In order to simplify our notation, we further use $G$ instead of $G(V, E)$. It is clear that the set of edges $E$ of the graph $G$ represents all possible single nucleotide substitutions which occur between codons created from the natural nucleotides {A, T, G, C} as well as one pair of unnatural nucleotides {X, Y}. Assuming that all changes are equally probable, we obtain that $G$ is an undirected, unweighted and regular graph with vertex degree equal to 15. Moreover, the set of 64 canonical codons $V_C$ used in the SGC induces a subgraph of $G$ according to the following definition.

Definition 2. If $G(V, E)$ is a graph, and $S \subset V$ is a subset of vertices of $G$, then the induced subgraph $G[S]$ is the graph whose set of vertices is $S$ and whose set of edges consists of the edges in $E$ which have both endpoints in $S$.

Following this definition, let us denote by $V_k$ a subset of vertices (codons) involved in a given extended genetic code (EGC) with exactly $k \ge 1$ non-canonical codons. This subset must fulfil the property $V_C \subset V_k \subseteq V$, i.e. $V_k$ must be an extension of the set of canonical codons. As a result, we can define a graph $G[V_k]$, which is the subgraph of $G$ induced by $V_k$. Therefore, the main goal of this work is to test the properties of the graph $G[V_k]$, which can be interpreted as a structural representation of the extended genetic code. Thus, we develop a methodology to describe features of the graph $G$.
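Definitions 1 and 2 can be checked with a few lines of code. The following is a minimal sketch, assuming codons are represented as three-letter strings over the alphabet A, C, G, T, X, Y; the helper names `build_graph` and `induced_subgraph` are illustrative and not taken from the original work.

```python
from itertools import product

CANONICAL = "ACGT"            # natural bases, written as in the text
ALPHABET = CANONICAL + "XY"   # plus one unnatural pair

def build_graph(alphabet):
    """Map each codon to the set of codons that differ from it in exactly one position."""
    graph = {}
    for codon in ("".join(c) for c in product(alphabet, repeat=3)):
        graph[codon] = {
            codon[:i] + base + codon[i + 1:]
            for i in range(3)
            for base in alphabet
            if base != codon[i]
        }
    return graph

def induced_subgraph(graph, subset):
    """Keep only the vertices in `subset` and the edges with both endpoints in it."""
    subset = set(subset)
    return {u: graph[u] & subset for u in subset}

G = build_graph(ALPHABET)
V_C = {c for c in G if set(c) <= set(CANONICAL)}
G_VC = induced_subgraph(G, V_C)

print(len(G), {len(nb) for nb in G.values()})        # 216 {15}
print(len(G_VC), {len(nb) for nb in G_VC.values()})  # 64 {9}
```

The first line of output confirms that $G$ is 15-regular on 216 vertices. The second shows that each canonical codon keeps 9 of its 15 neighbours inside the induced subgraph on the 64 canonical codons.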

The properties of the graph G

Interesting features of $G$ appear when the set of vertices $V$ is partitioned into eight disjoint and non-empty sets. This partition induces specific connections between these vertices by edges. It also includes $V_C$, i.e. the set of natural codons.

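Proposition 1. Let $G(V, E)$ be a graph, where $V$ represents the set of all possible 216 codons and $E$ is the set of edges generated by single nucleotide substitutions. Then, the set of vertices $V$ can be split unambiguously into eight disjoint subsets. These are $V_C$, $X_1$, $X_2$, $X_3$, $X_{12}$, $X_{13}$, $X_{23}$ and $X_{123}$, where $V_C$ is the set of canonical codons and each $X_n$ contains the codons with a non-canonical base (X or Y) at exactly the codon positions listed in the index $n$ and canonical bases elsewhere.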

The graphical representation of relationships between these sets is presented in Figure 1.
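The partition can be reproduced by enumeration. This is a sketch under the assumption that the index of $X_n$ lists the codon positions occupied by a non-canonical base, which is how the caption of Figure 1 and Propositions 3 to 5 read; the function name `block_label` is illustrative.

```python
from collections import Counter
from itertools import product

CANONICAL, NONCANONICAL = "ACGT", "XY"

def block_label(codon):
    """Name the block of the partition from the positions of non-canonical bases."""
    positions = "".join(str(i + 1) for i, base in enumerate(codon) if base in NONCANONICAL)
    return "X" + positions if positions else "V_C"

codons = ["".join(c) for c in product(CANONICAL + NONCANONICAL, repeat=3)]
sizes = Counter(block_label(c) for c in codons)

for label in ["V_C", "X1", "X2", "X3", "X12", "X13", "X23", "X123"]:
    print(label, sizes[label])
```

Under this reading the block sizes are 64, three times 32, three times 16, and 8, which add up to 216 and match the stage sizes 160, 208 and 216 used later in the text.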

Based on such a partition, we can investigate properties of the EGC. In order to do this, let us introduce the following notation. We denote another two subsets of $V$: $XV_C = V_C \cup X_1 \cup X_2 \cup X_3$ and $XX = X_{12} \cup X_{13} \cup X_{23} \cup X_{123}$.
They are disjoint and also constitute a partition of $V$. We call the set $XV_C$ the "close neighbourhood" of $V_C$ because it contains all codons that differ from some codon of $V_C$ in at most one position. In contrast, $XX$ is not directly connected with $V_C$. Moreover, we also introduce the set $XXV_C$, defined as $XXV_C = XV_C \cup X_{12} \cup X_{13} \cup X_{23}$. It is clear that $XXV_C$ and $X_{123}$ are disjoint and also constitute a partition of $V$.

In the next proposition we give several properties of edge connections between the selected sets of nodes.

Proposition 2. Let us consider the codon sets introduced in Proposition 1 and two subsets of nodes $XV_C$, $XX$. Then we have the following properties:

(a) Each codon $c \in X_i$, $i = 1, 2, 3$ has exactly four edges crossing from $X_i$ to $V_C$;

(b) Each codon $c \in X_i$, $i = 1, 2, 3$ has exactly four edges crossing from $X_i$ to $XX$;

(c) There does not exist any connection between $XX$ and $V_C$;

(d) Each codon $c \in X_{ij}$, $i \ne j$, $i, j = 1, 2, 3$ has exactly eight edges crossing to $XV_C$;

(e) Each codon $c \in X_{ij}$, $i \ne j$, $i, j = 1, 2, 3$ has exactly two edges crossing from $X_{ij}$ to $X_{123}$;

(f) There does not exist any connection between $X_{ij}$ and $V_C$;

(g) There does not exist any connection between $X_{123}$ and $XV_C$.
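The seven properties of Proposition 2 can be verified by direct counting. The sketch below assumes that $X_i$, $X_{ij}$ and $X_{123}$ collect codons with exactly one, two and three non-canonical bases, respectively; because an edge changes a single position, the number of non-canonical bases of a neighbour is enough to tell which union of blocks it falls into. All names are illustrative.

```python
from itertools import product

CANONICAL, NONCANONICAL = "ACGT", "XY"
ALPHABET = CANONICAL + NONCANONICAL

def n_unnatural(codon):
    """Number of positions occupied by X or Y."""
    return sum(base in NONCANONICAL for base in codon)

def neighbours(codon):
    """All codons reachable by a single nucleotide substitution."""
    return [codon[:i] + b + codon[i + 1:] for i in range(3) for b in ALPHABET if b != codon[i]]

def edges_into(codon, counts):
    """Count neighbours whose number of non-canonical bases lies in `counts`."""
    return sum(n_unnatural(n) in counts for n in neighbours(codon))

codons = ["".join(c) for c in product(ALPHABET, repeat=3)]
singles = [c for c in codons if n_unnatural(c) == 1]   # X_1, X_2, X_3
doubles = [c for c in codons if n_unnatural(c) == 2]   # X_12, X_13, X_23
triples = [c for c in codons if n_unnatural(c) == 3]   # X_123

checks = {
    "a": all(edges_into(c, {0}) == 4 for c in singles),
    "b": all(edges_into(c, {2, 3}) == 4 for c in singles),
    "c": all(edges_into(c, {0}) == 0 for c in doubles + triples),
    "d": all(edges_into(c, {0, 1}) == 8 for c in doubles),
    "e": all(edges_into(c, {3}) == 2 for c in doubles),
    "f": all(edges_into(c, {0}) == 0 for c in doubles),
    "g": all(edges_into(c, {0, 1}) == 0 for c in triples),
}
print(checks)
```

Each entry of `checks` corresponds to one item of the proposition and evaluates to `True` under the stated assumption.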

It is also interesting to describe some properties of subgraphs generated by codon sets $X_1$, $X_2$, $X_3$ and $X_{12}$, $X_{13}$, $X_{23}$, respectively. They are formulated in the following two lemmas:

Lemma 1. The induced subgraphs $G[X_1]$, $G[X_2]$ and $G[X_3]$ are isomorphic to each other.

Proof. According to the definition of graph isomorphism, there must exist a bijection between the vertex sets of the two graphs that preserves adjacency. In this case, such a bijection can be easily defined as a swap between the respective codon positions where nucleotides X and Y occur.

We observe the same property in the case of codon sets $X_{12}$, $X_{13}$ and $X_{23}$. Thus, we can formulate a similar lemma:

Lemma 2. The induced subgraphs $G[X_{12}]$, $G[X_{13}]$ and $G[X_{23}]$ are isomorphic to each other.

Proof. The proof is analogous to the proof of Lemma 1.

What is more, in the construction of the optimal EGC, we also use the fact that the graphs $G[X_n]$, $n \in \{1, 2, 3, 12, 13, 23, 123\}$, can be represented as Cartesian products of other graphs. This important feature is presented in the following three propositions.

Proposition 3. The graphs $G[X_n]$, $n \in \{1, 2, 3\}$ can be represented as a Cartesian product of graphs:

$$G[X_n] \cong K_2 \times K_4 \times K_4,$$

where $K_2$ and $K_4$ are complete graphs of sizes two and four with the sets of vertices {X, Y} and {A, T, G, C}, respectively. In this case, two vertices $(x, y, z)$, $(x', y', z')$ are connected by the edge $e((x, y, z), (x', y', z'))$ if ($x = x'$ and $y = y'$ and $z \sim z'$) or ($x = x'$ and $y \sim y'$ and $z = z'$) or ($x \sim x'$ and $y = y'$ and $z = z'$).

Proposition 4. The graphs $G[X_n]$, $n \in \{12, 13, 23\}$ can be represented as a Cartesian product of graphs:

$$G[X_n] \cong K_2 \times K_2 \times K_4,$$

where $K_2$ and $K_4$ are complete graphs of sizes two and four with the sets of vertices {X, Y} and {A, T, G, C}, respectively. In this case, two vertices $(x, y, z)$, $(x', y', z')$ are connected by the edge $e((x, y, z), (x', y', z'))$ if ($x = x'$ and $y = y'$ and $z \sim z'$) or ($x = x'$ and $y \sim y'$ and $z = z'$) or ($x \sim x'$ and $y = y'$ and $z = z'$).

Proposition 5. The graph $G[X_{123}]$ can be represented as a Cartesian product of graphs:

$$G[X_{123}] \cong K_2 \times K_2 \times K_2,$$

where $K_2$ is a complete graph of size two with the set of vertices {X, Y}. In this case, two vertices $(x, y, z)$, $(x', y', z')$ are connected by the edge $e((x, y, z), (x', y', z'))$ if ($x = x'$ and $y = y'$ and $z \sim z'$) or ($x = x'$ and $y \sim y'$ and $z = z'$) or ($x \sim x'$ and $y = y'$ and $z = z'$).
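The product structure can be made concrete by comparing the adjacency rule quoted in Propositions 3 to 5 with the single-substitution rule of Definition 1. The sketch below assumes the factors are ordered with the non-canonical positions first, which is only a relabelling of codon positions; since every factor is a complete graph, $x \sim x'$ reduces to $x \ne x'$. Function names are illustrative.

```python
from itertools import product

K2, K4 = "XY", "ACGT"   # vertex sets of the complete factors

def product_adjacent(u, v):
    """Cartesian-product adjacency: exactly one coordinate moves along an edge of its factor."""
    (x, y, z), (x2, y2, z2) = u, v
    return ((x == x2 and y == y2 and z != z2) or
            (x == x2 and y != y2 and z == z2) or
            (x != x2 and y == y2 and z == z2))

def single_substitution(u, v):
    """Adjacency of Definition 1: the codons differ in exactly one position."""
    return sum(a != b for a, b in zip(u, v)) == 1

def check(factors):
    """Return (number of vertices, set of degrees, whether both adjacency rules agree)."""
    nodes = list(product(*factors))
    agree = all(product_adjacent(u, v) == single_substitution(u, v)
                for u in nodes for v in nodes)
    degrees = {sum(product_adjacent(u, v) for v in nodes) for u in nodes}
    return len(nodes), degrees, agree

result_single = check([K2, K4, K4])   # shape of G[X_1], G[X_2], G[X_3]
result_double = check([K2, K2, K4])   # shape of G[X_12], G[X_13], G[X_23]
result_triple = check([K2, K2, K2])   # shape of G[X_123]
print(result_single, result_double, result_triple)
```

The three products have 32, 16 and 8 vertices and are regular of degree 7, 5 and 3, respectively, in agreement with the edge counts of Proposition 2.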

The optimality of codon group

Similarly to [Błażej et al., 2018a], we introduce two measures describing properties of codon groups. They are the set conductance and the k-size conductance, which characterize the quality of a given codon set in terms of non-synonymous mutations, i.e. those leading to a replacement of one amino acid by another.

Definition 3. For a given graph $G$ let $S$ be a subset of $V$. The conductance of $S$ is defined as:

$$\phi(S) = \frac{E(S, \overline{S})}{\operatorname{vol}(S)},$$

where $E(S, \overline{S})$ is the number of edges of $G$ crossing from $S$ to its complement $\overline{S}$ and $\operatorname{vol}(S)$ is the sum of all degrees of the vertices belonging to $S$.

The measure $\phi(S)$ can be interpreted as the fraction of non-synonymous substitutions between $S$ and $\overline{S}$, if $S$ is a group of codons encoding the same amino acid and $\overline{S}$ includes codons bearing other genetic information. A codon group that is optimal in terms of its robustness to point mutations should be characterized by a low value of the set conductance. In other words, the number of nucleotide substitutions that change the coded amino acid should be small in comparison to the total number of all possible nucleotide mutations involving the codons belonging to the given set. In this context, it is also interesting to calculate the k-size-conductance $\phi_k(G)$, which is described as the minimal set conductance over all subsets of $V$ with the fixed size $k$.
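As an illustration of the set conductance, the sketch below evaluates $\phi$ in the full graph $G$ for the set of canonical codons and for a four-codon box sharing its first two bases. It assumes the ratio form of the definition (crossing edges divided by the sum of degrees); the box `GC` plus any canonical third base is just an example choice.

```python
from fractions import Fraction
from itertools import product

ALPHABET = "ACGTXY"

def neighbours(codon):
    """All codons reachable by a single nucleotide substitution."""
    return [codon[:i] + b + codon[i + 1:] for i in range(3) for b in ALPHABET if b != codon[i]]

def conductance(S):
    """Edges leaving S divided by the sum of degrees of the vertices in S."""
    S = set(S)
    crossing = sum(1 for c in S for n in neighbours(c) if n not in S)
    volume = sum(len(neighbours(c)) for c in S)
    return Fraction(crossing, volume)

V_C = ["".join(c) for c in product("ACGT", repeat=3)]
box = ["GC" + b for b in "ACGT"]

print(conductance(V_C))  # 2/5
print(conductance(box))  # 4/5
```

The value 2/5 for the canonical codons arises because each of the 64 codons has 6 of its 15 neighbours outside the set, i.e. 384 crossing edges against a volume of 960.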

Definition 4. The k-size-conductance of the graph $G$, for $k \ge 1$, is defined as:

$$\phi_k(G) = \min_{S \subseteq V,\ |S| = k} \phi(S).$$

In consequence, $k \cdot \phi_k(G)$ gives us a lower bound on the number of edges going outside a set of nodes of size $k$, and this characteristic is useful in describing the optimal codon structures.
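For small graphs the k-size-conductance can be computed by exhaustive search. The sketch below does this for a graph shaped like $G[X_{12}]$ (16 vertices, two non-canonical positions followed by one canonical position); it assumes conductance is measured inside the induced subgraph, and it is a brute-force illustration rather than the method used in the text.

```python
from fractions import Fraction
from itertools import combinations, product

def k_size_conductance(graph, k):
    """Minimum conductance over all vertex subsets of size k (exhaustive search)."""
    best = None
    for subset in combinations(graph, k):
        S = set(subset)
        crossing = sum(len(graph[u] - S) for u in S)
        volume = sum(len(graph[u]) for u in S)
        phi = Fraction(crossing, volume)
        if best is None or phi < best:
            best = phi
    return best

# Graph shaped like G[X_12]: positions 1 and 2 hold X or Y, position 3 a canonical base.
nodes = ["".join(c) for c in product("XY", "XY", "ACGT")]
graph = {u: {v for v in nodes if sum(a != b for a, b in zip(u, v)) == 1} for u in nodes}

table = {k: k_size_conductance(graph, k) for k in range(1, len(nodes) + 1)}
for k, phi in table.items():
    print(k, phi)
```

For instance, a single vertex has conductance 1, while four codons that differ only in the canonical position form a complete subgraph and reach 2/5, which no other set of four vertices can improve because they already contain all six possible internal edges.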

In this section, we present a step-by-step procedure which allows us to extend the SGC from 64 up to 216 meaningful codons. Codons are added to the code gradually in three stages. The first step extends the SGC to 160 codons, the second step to 208 codons and the third to all possible 216 codons. The EGC created at each stage must be optimal in terms of minimization of point mutations. We propose some optimization criteria in order to find the best possible solution.
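The three stages can be summarized numerically. The sketch below assumes that the stage endpoints are the sets $XV_C$, $XXV_C$ and $V$, i.e. codons with at most one, at most two and at most three non-canonical bases; for each stage it reports the number of codons and the number of edges leading to still unassigned codons.

```python
from itertools import product

CANONICAL, NONCANONICAL = "ACGT", "XY"
ALPHABET = CANONICAL + NONCANONICAL

def n_unnatural(codon):
    """Number of positions occupied by X or Y."""
    return sum(base in NONCANONICAL for base in codon)

def neighbours(codon):
    """All codons reachable by a single nucleotide substitution."""
    return [codon[:i] + b + codon[i + 1:] for i in range(3) for b in ALPHABET if b != codon[i]]

def boundary(S):
    """Number of edges with one endpoint in S and the other outside."""
    S = set(S)
    return sum(1 for c in S for n in neighbours(c) if n not in S)

codons = ["".join(c) for c in product(ALPHABET, repeat=3)]
stages = {
    "SGC": [c for c in codons if n_unnatural(c) == 0],
    "stage 1": [c for c in codons if n_unnatural(c) <= 1],
    "stage 2": [c for c in codons if n_unnatural(c) <= 2],
    "stage 3": codons,
}
summary = {name: (len(S), boundary(S)) for name, S in stages.items()}
print(summary)
# {'SGC': (64, 384), 'stage 1': (160, 384), 'stage 2': (208, 96), 'stage 3': (216, 0)}
```

The number of mutations that lead into the unassigned zone does not grow after the first stage and then drops to 96 and finally to 0.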

Using the notation from the previous sections, let us define $\overline{V_k} = V \setminus V_k$, which is the set of unassigned codons. Moreover, let us denote by $A_k$ the set of $k$ new codons involved in a given genetic code extension, $A_k = V_k \setminus V_C$, where $1 \le |A_k| \le 96$. Thanks to that, we can define two measures describing the properties of $G[V_k]$. They are $E(V_C, A_k)$ and $E(V_k, \overline{V_k})$, where $E(V_C, A_k)$ is the total number of edges of the graph $G$ crossing from the set of canonical codons $V_C$ to $A_k$, whereas $E(V_k, \overline{V_k})$ is the total number of edges crossing from the set of codons which constitute the extended genetic code $V_k$ to unassigned codons.

Interestingly, by applying (4) and (5), it is possible to characterize the properties of a given subgraph $G[V_k]$ and at the same time the EGC induced by the codons belonging to $V_k$. Below we give the conditions which constitute the EGC optimality. Thanks to that, we can find the best genetic code extended by $k$ new codons: the optimal extension minimizes $E(V_k, \overline{V_k})$ (condition (6)), where $A_k$ possesses the feature that $E(V_C, A_k)$ is maximal (condition (7)).

These two restrictions have a sensible interpretation. By minimizing according to condition (6), we reduce the possibility that a point mutation generates a codon belonging to the "non-coding zone" $\overline{V_k}$, i.e. the set of unassigned codons. On the other hand, by maximizing according to (7), we require that the number of connections $E(V_C, A_k)$ between the canonical and the newly assigned codons is as large as possible. These two assumptions maximize the number of connections between standard and newly incorporated codons and simultaneously decrease the probability of losing genetic information from the whole system due to point mutations. Therefore, we would like to focus on the sets $V_k$ for which $A_k = V_k \setminus V_C$ fulfils the property (7). Then, let us denote by $\mathcal{V}_k$ the class of all sets $V_k$ with exactly $k$ non-canonical codons such that $A_k = V_k \setminus V_C$ fulfils the property (7). It is clear that all optimal EGCs, in terms of (7) and (6), belong to $\mathcal{V}_k$.
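The two quantities used in the optimality conditions are easy to evaluate for any candidate set of new codons. The sketch below is illustrative only: the function `measures` and the example codons are assumptions, not part of the original method.

```python
from itertools import product

CANONICAL, ALPHABET = "ACGT", "ACGTXY"
V_C = {"".join(c) for c in product(CANONICAL, repeat=3)}

def neighbours(codon):
    """All codons reachable by a single nucleotide substitution."""
    return [codon[:i] + b + codon[i + 1:] for i in range(3) for b in ALPHABET if b != codon[i]]

def measures(A_k):
    """Return (edges between V_C and A_k, edges from V_k = V_C | A_k to unassigned codons)."""
    A_k = set(A_k)
    V_k = V_C | A_k
    to_canonical = sum(1 for c in A_k for n in neighbours(c) if n in V_C)
    to_unassigned = sum(1 for c in V_k for n in neighbours(c) if n not in V_k)
    return to_canonical, to_unassigned

print(measures({"XAA"}))  # (4, 391)
print(measures({"XYA"}))  # (0, 399)
```

A codon with one non-canonical base is linked to the canonical set by four edges and adds fewer edges towards unassigned codons than a codon with two non-canonical bases, which has no link to the canonical set at all. This is the behaviour that the two conditions reward.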

These features appear to be very useful for characterizing possible extensions of the SGC. In the next theorem, we describe the optimal extension of the SGC up to 160 meaningful codons. Interestingly, this extension can be described in terms of the k-size conductance $\phi_k(G[X_i])$, $i = 1, 2, 3$, calculated for the induced subgraphs $G[X_i]$, $i = 1, 2, 3$. We begin our investigation with a lemma, which gives us some characterization of the optimal sets $V_k$.

Lemma 3. Let $V_k \in \mathcal{V}_k$ be a set of codons, where $1 \le k \le 96$ and $A_k = V_k \setminus V_C$, then $A_k \subseteq X_1 \cup X_2 \cup X_3$.

Proof. The proof of this lemma follows directly from Proposition 2(a,c) and the definition of $\mathcal{V}_k$.

Thanks to that, we can formulate a theorem, which gives us a lower bound on the number of edges crossing from $V_k$ to its complement.

Theorem 1. Let $V_k \in \mathcal{V}_k$ be a set of codons, where $1 \le k \le 96$. Then the following inequality holds:

Proof. We begin the proof with an observation: Interestingly, following Definition 3, we can calculate the set conductance of $V_C$. In this case, we have $\phi(V_C) = 0.4$. Hereby, we get immediately: In addition, using Proposition 2(b) we get the following equality: Therefore, we can rewrite the equality (8) in the following way: In our next step, we observe: according to Proposition 2(b). As a consequence, we can reformulate the equation (10) as follows: Furthermore, taking into account that the set $\overline{V_k} \cap X_i = X_i \setminus (A_k \cap X_i)$ and using Definitions 3 and 4, we have Finally, we obtain

Therefore, to extend the SGC using $1 \le k \le 96$ codons in the optimal way according to Definition 5, we have to choose codons only from the sets $X_1$, $X_2$ and $X_3$. Interestingly, the lower bound on the value of $E(V_k, \overline{V_k})$ presented in this theorem depends on the k-size conductance of the new codon groups. What is more, the EGC that is optimal in terms of Definition 5 and includes 160 codons is described by the set $XV_C$ because in this case we get:

1.2 The properties of the optimal genetic code including up to 160 meaningful codons

We would like to pose a question about the properties of the optimal codon set for which the lower bound: is attained under the additional restriction $1 \le k_1 + k_2 + k_3 \le 96$. Moreover, it is also interesting to find the best possible genetic code extension for every $1 \le k \le 96$.
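The counting behind the lower bound of Theorem 1 can be sketched from Proposition 2 alone. The derivation below is a hedged reading, not the statement of the theorem itself: it assumes $A_k \subseteq X_1 \cup X_2 \cup X_3$, writes $k_i = |A_k \cap X_i|$ with $k_1 + k_2 + k_3 = k$, treats a term with $k_i = 0$ as zero, and measures conductance inside the 7-regular graphs $G[X_i]$, so the factor 7 is an assumption about normalization.

```latex
\begin{align*}
E(V_k,\overline{V_k})
 &= E(V_C,\overline{V_C}) - E(V_C,A_k) + E(A_k,XX)
    + \sum_{i=1}^{3} E\bigl(A_k\cap X_i,\; X_i\setminus A_k\bigr) \\
 &= 0.4\cdot 960 - 4k + 4k
    + \sum_{i=1}^{3} E\bigl(A_k\cap X_i,\; X_i\setminus A_k\bigr) \\
 &\ge 384 + 7\sum_{i=1}^{3} k_i\,\phi_{k_i}\bigl(G[X_i]\bigr).
\end{align*}
```

The first line splits the edges leaving $V_k$ by their origin, the second uses $\phi(V_C) = 0.4$ with $\operatorname{vol}(V_C) = 64 \cdot 15 = 960$ and Proposition 2(a,b), and the last applies Definitions 3 and 4 inside each $G[X_i]$. For $A_k = X_1 \cup X_2 \cup X_3$ every $\phi_{32}(G[X_i])$ vanishes and the count equals 384.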

We begin our consideration by presenting some features of induced graphs which are optimal in terms of k-size conductance. Therefore, for every $V_k$ we can find a lower bound, i.e. an extended genetic code which is composed of subsets of lexicographically ordered codons belonging to $X_1$, $X_2$ and $X_3$.

In Table 1, we present the list of all $G[X_1]$ nodes taken in the selected lexicographic order. What is more, we also evaluate all possible k-size conductance values for the respective sets. Using these results, we can propose a method for finding the best possible genetic code extension in the class $\mathcal{V}_k$, $1 \le k \le 96$. Let us start with the following observation: if $k_1$, $k_2$ and $k_3$ defined in Theorem 1 fulfil the condition $k_1 + k_2 + k_3 \le 32$, then we get the following inequality: This formula results from the fact that the calculated values of $\phi_k(G[X_1])$ decrease, in general, with the size of the codon group $k$ (Table 1). Therefore, to create the optimal genetic code extension $V^*_k$, it is enough to choose new codons from one set $X_i$ until the total number of codons $k$ exceeds 32. Then, this procedure should be continued and additional codons from the next $X_i$-type set should be selected until the total number of codons reaches 64.

In order to extend the genetic code beyond 160 meaningful codons, we have to make some observations. From (12), we get immediately that $XV_C$ is the best genetic code extension involving 96 additional codons. In addition, applying Proposition 2(c), we get that $XV_C$ includes all non-canonical codons that are directly connected with $V_C$. As a result, the condition (7) is non-restrictive when we try to extend $V_C$ in consecutive steps using Definition 5 for $k > 96$. Therefore, we propose to reformulate the problem of the optimal $V_C$ extension into the question of the optimal extension of the $XV_C$ set.
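The block-by-block procedure for the first stage can be sketched as follows. The lexicographic order used here (plain string order with A < C < G < T < X < Y, blocks taken in the order $X_1$, $X_2$, $X_3$) is an assumption made for illustration and may differ from the order selected in Table 1; the names `block`, `ORDER` and `unassigned_edges` are hypothetical.

```python
from itertools import product

CANONICAL, NONCANONICAL = "ACGT", "XY"
ALPHABET = CANONICAL + NONCANONICAL
codons = ["".join(c) for c in product(ALPHABET, repeat=3)]
V_C = {c for c in codons if all(b in CANONICAL for b in c)}

def block(i):
    """Codons whose only non-canonical base sits at position i (0-based), in string order."""
    return sorted(c for c in codons
                  if [b in NONCANONICAL for b in c] == [j == i for j in range(3)])

ORDER = block(0) + block(1) + block(2)   # assumed order: X_1, then X_2, then X_3

def neighbours(codon):
    """All codons reachable by a single nucleotide substitution."""
    return [codon[:i] + b + codon[i + 1:] for i in range(3) for b in ALPHABET if b != codon[i]]

def unassigned_edges(k):
    """Edges from V_C plus the first k codons of ORDER to the remaining codons."""
    V_k = V_C | set(ORDER[:k])
    return sum(1 for c in V_k for n in neighbours(c) if n not in V_k)

for k in (1, 16, 32, 64, 96):
    print(k, unassigned_edges(k))
```

Whenever a whole block has been added (k = 32, 64 or 96), the number of edges to unassigned codons returns to 384, the value for the SGC alone. Partially filled blocks cost extra edges, for example 391 for k = 1 and 400 for k = 16 under the assumed order.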

Let us denote by $V_k$ a set of codons such that $XV_C \subseteq V_k$ with exactly $k$, $1 \le k \le 48$, new codons in comparison to $XV_C$. Therefore, the optimal genetic code extension can be characterized in the following way.

Definition 6. The set $V^*_k$, $XV_C \subset V^*_k$, with exactly $k$ additional codons is optimal if where $A_k$ possesses the feature:

Similarly to the method presented in the previous subsections, we introduce a definition which is useful in describing the optimality of the extended genetic code.

Definition 7. Let us define by $\mathcal{V}_k$ a class of sets $V_k$ whose $1 \le k \le 48$ additional codons $A_k = V_k \setminus XV_C$ fulfil the property (14). Then, is a class of all possible extensions of the $XV_C$ set with exactly $k$ new codons.

Thanks to that, we are able to give the optimal $XV_C$ extension with a given size $k$. In order to increase the SGC beyond 160 codons in total, it is enough to extend the set $XV_C$ by incorporating new codons from the set $XX$ in such a way that the number of connections between the extended code and the unassigned codons is minimized according to the condition (13), whereas the number of possible connections between the "basic" coding system $XV_C$ and the newly added codons is maximized at the same time according to the condition (14).

Interestingly, we can find the optimal $XV_C$ extension for $1 \le k \le 48$ in a similar way to that presented in subsection 1.1. We begin by introducing the following lemma.

Lemma 4. Let $V_k \in \mathcal{V}_k$ be a set of codons where $1 \le k \le 48$ and $A_k = V_k \setminus XV_C$. If $A_k$ fulfils the condition (14), then $A_k \subseteq X_{12} \cup X_{13} \cup X_{23}$.

Proof. The proof of this lemma follows directly from Proposition 2(d,g).

Then, we can formulate the following theorem.

Theorem 2. Let $V_k \in \mathcal{V}_k$ be a set of codons, where $1 \le k \le 48$ and $A_k = V_k \setminus XV_C$ fulfils the condition (14). Then the following inequality holds: where $k_{ij} = |A_k \cap X_{ij}|$, $\sum_{ij} k_{ij} = k$ and $G[X_{ij}]$ is the induced subgraph of $G$.

Proof. Similarly to the proof of Theorem 1, we start with the equation: Using the equation (12) and Proposition 2(d,g), we get immediately two equalities: Therefore, we can rewrite the equation (15) in the following way: In the next step, we make a simple observation: where $\sum_{ij} E(A_k \cap X_{ij}, X_{123}) = 2k$ according to Proposition 2(e). Then following Definitions 3 and 4, we get: In consequence, we can reformulate the inequality (15) as follows:

As a result, we found the lower bound of the value of $E(V_k, \overline{V_k})$, where the number $k$ of added codons satisfies $1 \le k \le 48$. Similarly to Theorem 1, the optimality of the extended genetic code depends strongly on the properties of the newly created codon groups. Clearly, the best codon groups attain the k-size conductance values $\phi_{k_{ij}}(G[X_{ij}])$ for their size $k_{ij}$. What is more, the optimal EGC, in terms of Definition 6, with 208 codons in total is described by the set $XXV_C = XV_C \cup X_{12} \cup X_{23} \cup X_{13}$ because in this case we get:

(14), is determined by the codon blocks that are optimal in terms of the k-size conductance. Following the results presented in section 1.2, we have to consider some properties of the induced subgraphs $G[X_{ij}]$, $ij = 12, 13, 23$, because they allow us to describe the optimal codon groups. Using Lemma 2, we obtain that the graphs $G[X_{ij}]$ are isomorphic to each other. Thanks to that, it is sufficient to consider the properties of the graph $G[X_{12}]$ (Fig. 3). Similarly to the previous results, $G[X_{12}]$ can be represented as a Cartesian product of graphs (Proposition 4). Therefore, using again Theorem 2.3 from [Bezrukov, 1999], we obtain that the set of the first $k$ codons (nodes) of $G[X_{12}]$ taken in the lexicographic order possesses the optimal k-size conductance $\phi_k(G[X_{12}])$. In Table 2, we present the list of all $G[X_{12}]$ nodes in the lexicographic order. What is more, we also evaluated all possible values of $\phi_k(G[X_{12}])$ for the sets composed of the first $k$ nodes.

Similarly to the previous results, the best genetic code extensions, namely $V^*_k$, $1 \le k \le 48$, have the nested structure of the optimal codon blocks. It can be obtained by adding the subsequent codons according to their lexicographic order.

The new codons are selected from the subsequent sets of type $X_{ij}$ until the total number of included codons in a given set reaches 16.

The methodology presented in the previous section allows us to extend $XV_C$ up to the $XXV_C$ set of codons involving 208 out of 216 possible codons. In order to extend the genetic code beyond 208 meaningful codons, some further reasoning is needed. From (16) we get that $XXV_C$ is the best $XV_C$ extension involving 48 additional codons. What is more, applying Proposition 2(g) we get that $XXV_C$ includes all non-standard codons which are connected to $XV_C$. As a result, the property (14) is not restrictive when we try to extend $XV_C$ in consecutive steps using Definition 6 for $k > 48$. Therefore, similarly to the method presented in the previous section, we reformulate the problem of the optimal $XV_C$ extension into the question of the optimal extension of the $XXV_C$ set.

where $A_k$ possesses one additional feature:

We also introduce a definition which is useful in describing the optimality of the genetic code extension.

Definition 9. Let us denote by $\mathcal{V}_k$ a class of sets $V_k$, $XXV_C \subset V_k$, with $k \ge 1$ additional codons such that $A_k = V_k \setminus XXV_C$ fulfils the property (18). Then is a class of all possible extensions of the $XXV_C$ set with exactly $k$, $1 \le k \le 8$, additional codons.

Using Definition 8 of optimality, we get the following characterization of the set $V^*_k$.

Theorem 3. For every $1 \le k \le 8$ the following equation holds: where $V^*_k \in \mathcal{V}_k$ and $A_k \subseteq X_{123}$ is optimal in terms of $\phi_k(G[X_{123}])$.

Proof. The proof of this theorem is an immediate consequence of Proposition 2(g) and Definition 4.
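Theorem 3 reduces the last stage to finding subsets of $X_{123}$ with minimal conductance, and the graph is small enough to check this exhaustively. The sketch below compares the first $k$ codons in one lexicographic order (string order with X < Y, an assumption) against the brute-force minimum over all subsets of size $k$, measuring conductance inside $G[X_{123}]$.

```python
from fractions import Fraction
from itertools import combinations, product

nodes = sorted("".join(c) for c in product("XY", repeat=3))   # XXX, XXY, XYX, ...
adj = {u: {v for v in nodes if sum(a != b for a, b in zip(u, v)) == 1} for u in nodes}

def conductance(S):
    """Edges leaving S divided by the sum of degrees, both taken inside G[X_123]."""
    S = set(S)
    crossing = sum(len(adj[u] - S) for u in S)
    return Fraction(crossing, sum(len(adj[u]) for u in S))

def brute_force(k):
    """Minimum conductance over all subsets of size k."""
    return min(conductance(S) for S in combinations(nodes, k))

lex = {k: conductance(nodes[:k]) for k in range(1, 9)}
best = {k: brute_force(k) for k in range(1, 9)}

print(lex == best)  # True
print(lex)
```

For this eight-vertex graph every initial segment of the assumed order attains the minimum, for example 1 for a single codon, 1/3 for four codons and 0 for all eight.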

Furthermore, the induced subgraph $G[X_{123}]$ (Fig. 4) can also be represented as a Cartesian product of graphs (Proposition 5). Using again Theorem 2.3 from [Bezrukov, 1999], we obtain that the collection of the first $k$ codons of $G[X_{123}]$ taken in the lexicographic order possesses the optimal k-size conductance $\phi_k(G[X_{123}])$. In Table 3, we present the list of all $G[X_{123}]$ nodes taken in the selected lexicographic order. We also evaluated all possible values of the k-size conductance for the sets composed of the first $k$ nodes.

genetic code evolved to minimize harmful consequences of mutations or mistranslations of coded proteins [Woese, 1965, Sonneborn, 1965, Epstein, 1966, Goldberg and Wittes, 1966]. The SGC turned out to be quite well optimized in this respect when compared with a sample of randomly generated codes [Haig and Hurst, 1991, Freeland and Hurst, 1998a, Freeland and Hurst, 1998b, Freeland et al., 2000, Gilis et al., 2001], but the application of optimization algorithms revealed that the SGC is not perfectly optimized in this respect and more robust codes can be found [Błażej et al., 2018a, Błażej et al., 2016, Massey, 2008, Novozhilov et al., 2007, Santos et al., 2011, Santos and Monteagudo, 2017, Wnetrzak et al., 2018, Błażej et al., 2018b, Błażej et al., 2019b]. The minimization of mutation errors is important from the biological point of view, because it protects an organism against losing genetic information. Thus, reducing the mutational load seems to be favoured by biological systems and can occur directly at the level of the mutational pressure [Dudkiewicz et al., 2005, Mackiewicz et al., 2008, Błażej et al., 2013, Błażej et al., 2015, Błażej et al., 2017]. Nevertheless, on the global scale, the SGC shows a general tendency to error minimization [Błażej et al., 2018b, Wnetrzak et al., 2018], which is even more pronounced in its alternative versions [Błażej et al., 2019a], which evolved later.
Therefore, the extension of the SGC according to this rule seems to be a natural consequence of its evolution.

Our approach assumes a stepwise extension of the code, similarly to the gradual addition of new amino acids to the evolving primordial SGC as they were produced by increasingly complex biosynthetic pathways evolving in parallel [Di Giulio, 1997, Di Giulio and Medugno, 1999, Di Giulio, 2004, Di Giulio, 2008, Di Giulio, 2016, Di Giulio, 2017, Guimaraes, 2011, Wong, 1975, Wong et al., 2016, Wong et al., 2007]. The addition of amino acids was also driven by the selection for the increasing diversity of amino acids [Higgs and Pudritz, 2009, Koonin and Novozhilov, 2017, Sengupta and Higgs, 2015, Weberndorfer et al., 2003] as well as decreasing disruption of already coded proteins and their composition [Higgs, 2009]. Similar assumptions are included in

Figure 1. The graph is induced by the partition of the set of vertices $V$ from the graph $G$ into eight subsets $V_C$, $X_1$, $X_2$, $X_3$, $X_{12}$, $X_{13}$, $X_{23}$, $X_{123}$, where the edges between the groups are induced by edges connecting codons which belong to different sets. The edges correspond to single point mutations between the codons. $V_C$ corresponds to the SGC, N refers to a standard base and X to a non-standard base.

Figure 2. The graphical representation of the graph $G[X_1]$, which is an induced subgraph of the graph $G(V, E)$. Each node is a codon belonging to the set $X_1$, $X_1 \subset V$, whereas its edges are taken from the set $E$.

Figure 3. The graphical representation of the graph $G[X_{12}]$, which is the induced subgraph of the graph $G(V, E)$. Each node is a codon belonging to the set $X_{12}$, $X_{12} \subset V$, whereas the edges are incorporated from the set $E$.

Figure 4. The graphical representation of the graph $G[X_{123}]$, which is the induced subgraph of the graph $G(V, E)$. Each node is a codon belonging to the set $X_{123}$, $X_{123} \subset V$, whereas the edges are incorporated from the set $E$.