A Beta-splitting model for evolutionary trees

In this article, we construct a generalization of the Blum–François Beta-splitting model for evolutionary trees, which was itself inspired by Aldous' Beta-splitting model on cladograms. The novelty of our approach allows for asymmetric shares of diversification rates (or diversification ‘potential’) between two sister species in an evolutionarily interpretable manner, as well as the addition of extinction to the model in a natural way. We describe the incremental evolutionary construction of a tree with n leaves by splitting or freezing extant lineages through the generating, organizing and deleting processes. We then give the probability of any (binary rooted) tree under this model with no extinction, at several resolutions: ranked planar trees giving asymmetric roles to the first and second offspring species of a given species and keeping track of the order of the speciation events occurring during the creation of the tree, unranked planar trees, ranked non-planar trees and finally (unranked non-planar) trees. We also describe a continuous-time equivalent of the generating, organizing and deleting processes where tree topology and branch lengths are jointly modelled and provide code in SageMath/Python for these algorithms.

The function SplittingPermutation translates a sequence of n real numbers (our splitting points, later) into a permutation of [n] = {1, . . . , n} by returning a list of n integers such that the i-th element of the list is the index of the i-th smallest number in the initial sequence. In the function MakePartitionAndTree, we construct m samples of a tree with n splits (or n + 1 leaves), using the B(a + 1, b + 1) distribution of the coordinates B i of the generating sequence. We rst obtain the sequence of points in [0,1] where the splits occur according to the generating sequence and store them in SplitPoints. Then we use the standard binary_search_insert method to obtain a binary search tree that organizes the points in SplitPoints into an unranked planar binary tree. Recall that the split points are inserted from the root of the tree such that the new point that is smaller/larger than the point at the root node descends into the left/right subtree of the root and recursively takes left/right subtree depending on whether it is lesser or greater than the next point it encounters at an internal node that is already stored in the tree. Each of these trees is recorded at several resolutions: the ranked planar trees (recorded in the list SplittingPermutationSamples, unranked planar trees (recorded in PlanarShapeSamples) and unranked non-planar trees (recorded in PhyloShapeSamples).
For the nest resolution of ranked planar trees, we use the bijection between ranked planar trees with n + 1 leaves and permutations of [n] (note that the cardinality of each set is n!). This bijection is detailed in Flajolet and Sedgewick (2009), Ex. 17, p. 132, and follows this idea. Say the permutation of [n] of which we want to draw the tree is [i 1 , i 2 , . . . , i n ]. We rst construct the planar skeleton of the ranked internal nodes incrementally by starting with a single node. To place the second node, check whether i 2 < i 1 , in which case the left child of the root becomes node 2, or whether i 2 > i 1 and the right child node of the root becomes node 2. To place the third node, nd whether it goes to the left or to the right of the root by checking whether i 3 < i 1 or i 3 > i 1 . Once this is decided, if the second and third nodes are on the same side of the root, then compare i 3 to i 2 to decide whether node 3 should be the left or right child of node 2. Proceeding in the same way for the other terms of the permutation, we construct a planar ranked skeleton with n nodes. There remains to attach the n + 1 leaves to the terminal nodes of the skeleton to obtain a ranked planar tree with n splits. For ease of representation, the output corresponding to the ranked planar trees is thus the list SplittingPermutationSamples recording the permutations corresponding to the m trees. We also provide some additional functions computing the probabilities of a given tree at a particular resolution under the Beta-splitting model. '''probability of various resolutions of tree T under beta-splitting model a,b>-1, where (a+1,b+1) are the parameters of the beta distribution''' spS=splitsSequence(T) numSplits=len(spS) # non-cherry splits ncspS=filter(lambda x: x!=(0,0), spS) numCherries=numSplits-len(ncspS) probRPT = prod(map(lambda x:beta(x[0]+a+1,x[1]+b+1)/beta(a+1,b+1), ncspS)) catCoeff = prod(map(lambda x:binomial(x[0]+x[1],x[1]),ncspS)) # prob of (non-ranked) planar tree probPT = catCoeff * probRPT probRT = 2^(numSplits-numCherries)*probRPT numIsoSplits=numIso(T) probT=2^(numSplits-numIsoSplits)*probPT return (numSplits,numIsoSplits,numCherries,catCoeff,probRPT,probPT,probRT,probT)

A reversal property
Although Aldous' leaf deletion property does not seem to hold in general for the random tree obtained through the generating and organizing process, at the resolution of the unranked planar trees it is possible to dene a transition kernel ← − P from the set of trees with n + 1 leaves to the set of trees with n leaves in such a way that the tree obtained after (i) creating a tree with n + 1 leaves thanks to the generating and organizing process, and (ii) choosing a (cherry) node to withdraw in order to come back to a tree with n leaves, has the same distribution as the tree obtained from running the generating and organizing process for only n − 1 steps. That is, writing T n for the random unranked planar tree with n leaves, we have for every tree t n with n leaves: (1) Indeed, let us set ← − P (t n+1 → t n ) = P(T n = t n | T n+1 = t n+1 ).
(2) Note that this probability is 0 if t n and t n+1 are not compatible, that is if we cannot obtain t n+1 from t n by splitting one of the leaves of t n . Then, we trivially have t n+1 which shows that (1) is satised.
Let us now give an explicit formula for the r.h.s. of (2) in the case where t n and t n+1 are compatible. It is easier to come back to the resolution of ranked planar trees to compute the conditional probability appearing in the r.h.s. Indeed, as explained in the section on unranked planar trees, the probability of a given ranked planar tree τ n does not depend on the ranking. As a consequence, conditionally on T n = t n , all ranked planar trees whose unranking yields t n have the same probability 1/#t n to be that created by the generating and organizing process, where #t n denotes the number of ranked trees corresponding to the unranked tree t n . Recall from the section on unranked planar trees that Writing T n for the random ranked planar tree with n leaves and τ n ≺ t n to denote the fact that forgetting the ranking in the ranked planar tree τ n yields t n , we have P(T n = t n | T n+1 = t n+1 ) = τ n+1 ≺t n+1 P(T n = t n | T n+1 = τ n+1 )P(T n+1 = τ n+1 | T n+1 = t n+1 ) = 1 #t n+1 τ n+1 ≺t n+1 P(T n = t n | T n+1 = τ n+1 ) = 1 #t n+1 τ n+1 ≺t n+1 τn≺tn P(T n = τ n | T n+1 = τ n+1 ). (3) Since we now work with ranked planar trees, for any tree τ n+1 with n + 1 leaves the probability in the r.h.s. of (3) is zero unless τ n is the tree τ −1 n+1 obtained by withdrawing the n-th split in τ n+1 (in which case the probability is 1). Hence, the r.h.s. in (3) can be written But now recall that t n and t n+1 are assumed to be compatible. Hence, for every tree τ n satisfying τ n ≺ t n there is one and only one way to add a last step to obtain a tree τ n+1 ≺ t n+1 (namely, add the missing split in the tree and label it by n). As a consequence, we obtain that P(T n = t n | T n+1 = t n+1 ) = #t n #t n+1 .
Note that the same denition (2) would work at the resolution of the unranked non-planar trees, but nding an explicit expression for the quantity in the r.h.s. is dicult due to the many symmetries of non-planar trees.