Philosophical Transactions of the Royal Society B: Biological Sciences
Open AccessResearch articles

A geometric relationship of F2, F3 and F4-statistics with principal component analysis

Benjamin M. Peter

Benjamin M. Peter

Max-Planck-Institute for Evolutionary Anthropology, Leipzig 04103, Germany

[email protected]

Contribution: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

Google Scholar

Find this author on PubMed

    Abstract

    Principal component analysis (PCA) and F-statistics sensu Patterson are two of the most widely used population genetic tools to study human genetic variation. Here, I derive explicit connections between the two approaches and show that these two methods are closely related. F-statistics have a simple geometrical interpretation in the context of PCA, and orthogonal projections are a key concept to establish this link. I show that for any pair of populations, any population that is admixed as determined by an F3-statistic will lie inside a circle on a PCA plot. Furthermore, the F4-statistic is closely related to an angle measurement, and will be zero if the differences between pairs of populations intersect at a right angle in PCA space. I illustrate my results on two examples, one of Western Eurasian, and one of global human diversity. In both examples, I find that the first few PCs are sufficient to approximate most F-statistics, and that PCA plots are effective at predicting F-statistics. Thus, while F-statistics are commonly understood in terms of discrete populations, the geometric perspective illustrates that they can be viewed in a framework of populations that vary in a more continuous manner.

    This article is part of the theme issue ‘Celebrating 50 years since Lewontin's apportionment of human diversity’.

    1. Introduction

    As in most species, the genetic diversity of human populations has been influenced by our history and environment over the last several hundred thousand years [e.g 15]. In turn, an important goal of population genetics is to use observed patterns of variation to investigate and reconstruct the demographic and evolutionary history of our species [6,7].

    The complicated genetic structure observed in present-day human populations [8,9] is caused by the interplay of demographic and evolutionary processes with both discrete and continuous components [1014]. In particular, populations are expected to differentiate if they are isolated from each other [15,16]. In humans, this may be caused because continental-scale geographic distances limit migration, causing a pattern known as isolation-by-distance [17,18]. However, these patterns are usually not uniform, but shaped by geography, particularly barriers to migration such as mountain ranges, oceans or deserts [1,19]. In addition, major historical population movements such as the out-of-Africa [20], Austronesian [21] or Bantu expansions [22] lead to more gradual patterns of genetic diversity over space [23]. Local migration between neighbouring populations will reduce differentiation, and long-distance migrations [24], and secondary contact between diverged populations such as Neanderthals and modern humans [25] may lead to locally increased diversity [26].

    Particularly for large and heterogeneous datasets, disentangling all these processes is challenging, and we cannot expect to devise a single model catching both broad strokes and minute details of human history. A commonly used analysis paradigm is thus to combine tools based on different sets of assumptions, each emphasizing particular aspects of the data.

    A typical analysis starts with data-driven, exploratory methods that summarize data making minimal assumptions [e.g. 6]. Examples are population trees [16,27,28], principal component analysis (PCA [1,29,30]) structure-like models [31,32] or multidimensional scaling (MDS [33]). These methods are limited in their ability to estimate biologically meaningful parameters, but provide useful summaries and visualizations. Typically, these analyses are then complemented with methods based on explicit demographic models, which are used to estimate parameters or test hypotheses [3436].

    When the number of populations exceeds a few dozen, even codifying reasonable population models can be prohibitively difficult. One approach is to pick a small set of ‘representative’ samples, and restrict modelling to this subset [e.g. 37,38]. However, this has the drawback that a large proportion of the available data remains unused. An increasingly popular alternative approach, particularly in the analysis of human ancient DNA, is therefore to focus on the relationship between two, three or four populations, commonly using F-statistics sensu Patterson [3941]. Formal definition will be given in the Theory section; but an informal motivation starts with the null model that populations are related as a tree, in which each F-statistic measures the length of a particular set of branches (figure 1; [41,42]).

    Figure 1.

    Figure 1. Representation of F-statistics on trees and two-dimensional PCA plots. The schematics show four populations and their representation using an (arbitrarily rooted) tree (top row) or a two-dimensional PCA plots (bottom row). (a) F2 represents the squared Euclidean distance between two tree leafs, and in PC space. (b) F3(X1; X3, X4) corresponds to the external branch from X1 to the internal node joining the populations, and is proportional to the orthogonal projection of X1X3 onto X1X4. (c) F4(X1, X4; X2, X3) corresponds to the internal branch in the tree, or to the orthogonal projection of X2X3 onto X1X4. (d) F4(X1, X2; X3, X4). The two paths from X1 to X2 and X3 and X4 are non-overlapping in the tree, which corresponds to orthogonal vectors in PCA space. (Online version in colour.)

    In most applications, F-statistics are estimated from data, and then used as tests of treeness. In particular, under the assumption of a tree, F3 is restricted to be non-negative, and many F4-statistics will be zero [40,42], and data that violates these constraints is incompatible with a tree-like relationship between populations. The canonical alternative model is an admixture graph (or phylogenetic network) [40,43], which is a tree which allows for additional edges reflecting gene flow (figure 2a). However, admixture graphs are not the only plausible alternative model, and expected F-statistics can be calculated for a wide range of population genetic demographic models [41].

    Figure 2.

    Figure 2. Admixture representation on two-dimensional-PCA plot. The schematics show four populations and their representation using an admixture graph (a) or a two-dimensional-PCA plot. (a) Admixture graph, with population Xy originating as an admixture of X2 (with proportion 1 − α) and X3 (proportion α). Subsequent drift (highlighted branch) will change allele frequency to sampled admixture population Xx. (b) PCA representation of the scenario in (a). Xy originates on the segment connecting X2 and X3, and subsequent drift may move it in a random direction. (c) Negative region (light grey) for F3(Xx; X2, X3) and for F^3(2)(Xx;X2,X3) based on two dimensions (dark grey). (d) F4(X1, Xx; X3, X4) will no longer be zero (compare to figure 1d). (Online version in colour.)

    (a) F-statistics and principal component analysis

    The practical issue addressed in this study is how F-statistics can be compared with PCA, one of the most widely used data-driven modelling techniques. One way PCA can be motivated is as generating a low-dimensional representation of the data, with each dimension (called a principal component, PC) retaining a maximum of the variance present in the data. To understand population structure, the use of PCA has been pioneered by Cavalli-Sforza et al. [44], who used allele-frequency data at a population level to visualize genetic diversity [1]. Currently, PCA is most commonly performed on individual-level genotype data [e.g. 30,45], making use of the hundreds of thousands of loci available in most genome-scale datasets. The PCA decomposition has been studied for a number of explicit population genetic models including trees [16], spatially continuous structure [46], the coalescent [47] and discrete population models [48]. Here, in order to link PCA to F-statistics, I interpret both of them geometrically in allele frequency space, i.e. as functions of a high-dimensional Euclidean space. For F-statistics, this interpretation was recently developed by Oteo-Garcia & Oteo [49], and for PCA it follows naturally from the interpretation of approximating a high-dimensional space with a low-dimensional one.

    In the next section, I will formally derive the connection between F-statistics and PCA, and show how F-statistics can be interpreted geometrically, with a particular emphasis on two-dimensional-PCA plots. In the Results section, I will then discuss how some of the most common applications of F-statistics manifest themselves on a PCA, and illustrate them on two example datasets, before ending with a discussion.

    2. Theory

    In this section, I will introduce the mathematics and notations for F-statistics and PCA. A comprehensive treatise on PCA is given by e.g. Jolliffe [50], a useful technical primer is Pachter [51], and a helpful guide to interpretation is Cavalli-Sforza et al. [1]. Readers unfamiliar with F-statistics may find [40,41] or [49] helpful.

    (a) Formal definition of F-statistics

    Let us assume we have a set of populations for which we have single-nucleotide polymorphism (SNP) allele frequency data from S biallelic loci, and for simplicity, I will assume that there is no missing data. Let xil denote the frequency of an arbitrary allele at the lth SNP in the ith population; and let Xi = (xi1, xi2, …xiS) be a vector collecting all allele frequencies for population i. As Xi will be the only data summary considered here for population i, I make no distinction between the population and the allele frequency vector used to represent it.

    The three F-statistics are defined as

    F2(X1,X2)=1Sl=1S(x1lx2l)22.1a
    F3(X1;X2,X3)=1Sl=1S(x1lx2l)(x1lx3l)2.1b
    andF4(X1,X2;X3,X4)=1Sl=1S(x1lx2l)(x3lx4l).2.1c
    The normalization by the number of SNPs S is assumed to be the same for all calculations and is thus omitted subsequently. Both F3 and F4 can be written as sums of F2-statistics
    2F3(X1;X2,X3)=F2(X1,X2)+F2(X1,X3)F2(X2,X3)2.2a
    and
    2F4(X1,X2;X3,X4)=F2(X1,X3)+F2(X2,X4)F2(X1,X4)F2(X2,X3).2.2b

    Commonly, a distinction is made between statistics estimated from data (denoted with lowercase-f), and theoretical quantities (defined in equation (2.1)). I do not make this distinction, but will explicitly mention when I analyse statistics calculated from data.

    F-statistics have been primarily motivated in the context of trees and admixture graphs [40]. In a tree, the squared Euclidean distance F2(X1, X2) measures the length of the path between populations X1 and X2 (figure 1a); F3 represents the length of an external branch (figure 1b) and F4 the length of an internal branch, respectively (figure 1c). The length of each branch can be thought of in units of genetic drift, and is non-negative [40]. Crucially, this means that F4 will be zero for pairs of populations from non-overlapping clades, which means that the tree lacks the branch corresponding to this statistic (as in figure 1d).

    Thinking of F-statistics as branch lengths is useful for a number of applications, including building multi-population models [40,52], estimating admixture proportions [38,53] and finding the population most closely related to an unknown sample (‘Outgroup’-F3-statistic).

    Most commonly however, F3 and F4 are used as tests of treeness [40]: negative F3 values correspond to a branch with negative genetic drift, which is not allowed under the null assumption of a tree-like population relationship. Similarly if four populations are related as a tree, then at least one of the F4-statistics between the populations will be zero [40,54].

    The most widely considered alternative model is an admixture graph [40]; an example is given in figure 2a. Here, the (typically unobserved) population Xy is generated by a mixture of individuals from the ancestors of X2 and X3. Over time, genetic drift will change Xy to Xx, which is the admixed population we observe. In this case, all F4-statistics involving Xy and both admixture sources will be non-zero, and, in some cases, F3(Xy; X2, X3) will be negative (exact conditions can be found in [41]).

    (b) Geometric interpretation of F-statistics

    An implicit assumption in the development of F-statistics in the context of admixture graphs has been that population lineages are mostly discrete, and that gene flow is rare. Recently, Oteo-García & Oteo [49] showed that this is not, in fact, necessary. Specifically, they interpret the populations Xi as points or vectors in the S-dimensional allele frequency space RS. In this case, the F-statistics can be thought of as inner (or dot) products, and they showed that all properties and tests related to treeness can be derived in this larger space. In this framework, the F-statistics can be written as

    F2(X1,X2)=1Sl=1S(x1lx2l)2=1SX1X2,X1X2=1SX1X222.3a
    F3(X1;X2,X3)=1Sl=1S(x1lx2l)(x1lx3l)=1SX1X2,X1X32.3b
    andF4(X1,X2;X3,X4)=1Sl=1S(x1lx2l)(x3lx4l)=1SX1X2,X3X4,2.3c
    where denotes the Euclidean norm and 〈 · , · 〉 denotes the dot product. Some elementary properties of the dot product between vectors a, b, c that I will use later are
    a,b=iaibi2.4a
    a,b=abcos(ϕ)2.4b
    a,a=a22.4c
    anda+c,b=a,b+b,c,2.4d
    where ϕ is the angle between a and b. The inner product is closely related to the vector projection
    projba=a,bb2b,2.5
    which is a vector colinear to b whose length measures how much vector a points in the direction of b. Thinking of F-statistics as projections also holds on trees: in e.g. a F4(X1, X4; X2, X3)-statistic (figure 1c), the internal branch is precisely the intersection of the paths from X1 to X4 and from X2 and X3. On trees, all disjoint paths are independent (i.e. orthogonal) from each other, and thus the external branches vanish under the projection.

    One issue with the geometric approach of Oteo-García & Oteo [49] is that each SNP (commonly in the millions) adds a dimension, but high-dimensional spaces are hard to visualize, interpret and analyse. Fortunately, it has been commonly observed that population structure is often quite low-dimensional, and only a few PCs frequently provide a good approximation of the covariance structure in human genetic variation data [30]. Therefore, we may hope that PCA could yield a reasonable approximation of the allele frequency space, and that F-statistics as measures of population structure may likewise be well approximated by the first few PCs.

    (c) Formal definition of principal component analysis

    PCA is a common way of summarizing genetic data, and so a large number of variations of PCA exist, e.g. in how SNPs are standardized, how missing data are treated or whether we use individuals or populations as units of analysis [1,30]. The version of PCA I use here is set up such that the similarities to F-statistics are maximized, and does not reflect how PCA is most commonly applied to genome-scale human genetic variation datasets. In particular, I assume that a PCA is performed on unscaled, estimated population allele frequencies, whereas many applications of PCA are based on individual-level sample allele frequency, scaled by the estimated standard deviation of each SNP [30]. The differences this causes will be addressed in the discussion.

    Let us again assume we have allele frequency data as above, but let us now assume we aggregate the allele frequency vectors Xi of many populations in a matrix X whose entry xil reflects the allele frequency of the ith population at the lth genotype. If we have S SNPs and n populations, X will have dimension n × S. Since the allele frequencies are between zero and one, we can interpret each population Xi of X as a point in RS.

    PCA allows us to approximate the points in the high-dimensional allele frequency space by a K-dimensional subspace of the data. If all PCs are considered, K = n − 1, in which case the data are simply rotated. However, the historical processes that generated genetic variation often result in low-rank data [55], so that Kn explains a substantial portion of the variation; for visualization, K = 2 is frequently used.

    There are several algorithms that are used to perform PCAs, the most common one is based on singular value decomposition (e.g. [50]). In this approach, we first mean-centre X, obtaining a centred matrix Y

    yil=xilμl,
    where μl is the mean allele frequency at the lth locus.

    PCA can then be written as

    Y=CX=(UΣ)VT=PL,2.6
    where C = I − (1/n)1 is a centring matrix that subtracts row means, with I, 1 the identity matrix and a matrix of ones, respectively. For any matrix Y, we can perform a singular value decomposition Y = UΣVT which, in the context of PCA, is interpreted as follows: the matrix of principal components P = UΣ has size n × n and contains information about population structure. The SNP loadings L = VT form an orthonormal basis of size n × S, its rows give the contribution of each SNP to each PC. It is often used to look for outliers, which might be indicative of selection (e.g. [56]). Alternatively, the PCs can also be obtained from an eigendecomposition of the covariance matrix YYT. This can be motivated from (2.6)
    YYT=PLLTPT=PPT,2.7
    since LLT = I.

    (d) Connection between principal component analysis and F-statistics

    (i) Principal components from F-statistics

    PCA, as defined above, and F-statistics are closely related. It is a classical result that PCA is equivalent to multidimensional scaling using squared Euclidean distances [57]. Since F2-distances are squared Euclidean, we calculate the pairwise F2(Xi, Xj) between all n populations, and collect them in a matrix F2. Multidimensional scaling then proceeds by double-centring it, so that its row and column means are zero, and perform an eigen decomposition of the resulting matrix

    PPT=12CF2C.2.8
    Although we arrive at P from a very different angle, as long as we make the same choices about normalization and units of analysis, we will get the exact same results.

    (ii) F-statistics in PCA space

    By performing a PCA, we rotate our data to reveal the axes of highest variation. However, the dot product is invariant under rotation, and F-statistics can be thought of as dot products [49]. What this means is that we are free to calculate F2 either on the uncentred data X, the centred data Y or any other orthogonal basis such as the principal components P. Formally,

    F2(Xi,Xj)=l=1L(xilxjl)2=l=1L((xilμl)(xjlμl))2=F2(Yi,Yj)=k=1n(pikpjk)2=F2(Pi,Pj).2.9
    A derivation of this change-of-basis is given in appendix A, equation (A1). As F3 and F4 can be written as sums of F2 terms (equations (2.2a) and (2.2b)), analogous relations apply.

    In most applications, we do not use all PCs, but instead truncate to the first K PCs, which explain most of the between-population genetic variation. Thus,

    F2(Pi,Pj)=k=1K(pikpjk)2F2^(K)(Pi,Pj)+k=K+1n(pikpjk)2ϵ(K)(Pi,Pj).2.10
    In this notation, F2^(K) is the approximation of F2 with only the first K PCs considered, and ε(K) is the corresponding approximation error. I will omit the superscript of F2^ when the exact number of PCs is not relevant. If we sum up the squared approximation errors over all pairs of populations in our sample, we obtain
    i,jϵ(K)(Pi,Pj)2=i,j(F2^(K)(Pi,Pj)F2(K)(Pi,Pj))2=F2F2^F2,2.11
    where the Frobenius-norm F2 of a matrix is defined as the square root of the sum-of-squares of all its elements. This is precisely the function that is minimized in MDS [50]. In that sense, F^2(K) is the optimal low-rank approximation of F2 for any K in that it minimizes the sum of approximation errors of all F2-statistics.

    (iii) F-statistics and samples projected onto PCA

    One of the easiest ways of dealing with missing data in PCA is to calculate the principal components (equation (2.6)) only on a subset of the data with no missingness, and then to project the lower quality samples with high missingness onto this PCA. The simplest way to do this is to note that

    YLT=PLLT=P,
    and so a new (centred) population Ynew can be projected onto an existing PCA simply by post-multiplying it with LT
    Pproj=YnewLT;
    the kth entry of Pproj gives the coordinates of the new sample on the kth PC. However, it is likely that Ynew lies outside the variation of the original samples. In this case, there is a projection error
    YnewPprojL2=F2(PprojL,Ynew).
    If we project with missing data, a similar projection can be used where we remove the rows from Ynew and L where data in Ynew is missing, and add a scaling factor [30].

    Thus, if we compare the F-statistic of a projected sample, we have

    F2(Xi,Xnew)=F2(Yi,Ynew)=F2(Pi,Pproj)+F2(PprojL,Ynew)=F2^(Pi,Pj)+ϵ(Pi,Pj)+F2(PprojL,Ynew).2.12
    The second row follows because the projection error and projection are orthogonal to each other. The main implication of equation (2.12) is that both truncation and projection introduce some error, and that F2^(Pi,Pj) will be a good approximation to F2(Pi, Pj) only if both errors are small.

    3. Material and methods

    The theory outlined in the previous section suggests that F-statistics have a geometric interpretation in PCA space, which can be approximated on PCA plots. In the next section, I explore this connection in detail, and illustrate it on two sample datasets that I briefly introduce here. Both are based on the analyses by Lazaridis et al. [58]. The data are from the Reich laboratory compendium dataset (v.44.3), downloaded from https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data, using data on the ‘Human Origins’-SNP set (597,573 SNPs). SNPs with missing data in any population are excluded. The code used to write this paper, create all figures and analyses is available on doi:10.5281/zenodo.6424178.

    (a) ‘World’ dataset

    This dataset is a subset of the ‘World Foci’ dataset of Lazaridis et al. [58], where I removed samples that are not permitted for free reuse. These populations span the globe and roughly represents global human genetic variation (638 individuals from 33 populations) As adjacent sampling locations are often thousands of kilometres apart, I speculate that gene flow between these populations may not be particularly common; and their structure may therefore be well approximated by an admixture graph. A file with all individuals used and their assigned population is given in electronic supplementary material, File S1.

    (b) Western Eurasian dataset

    This dataset of 1119 individuals from 62 populations contains present-day individuals from the Eastern Mediterranean, Caucasus and Europe. Lazaridis et al. [58] used this dataset as a basis of comparison for ancient genetic analyses of Western Eurasian individuals, and PCAs based on similar sets of samples have been used in many other ancient DNA studies (e.g. [59,60]). Genetic differentiation in this region is low and closely mirrors geography [45]. I thus speculate that gene flow between these populations is common [61], and a discrete model such as a tree or an admixture graph might be a rather poor reflection of this data. A file with all individuals used and their assigned population is given in electronic supplementary material, File S2.

    (c) Computing F-statistics and PCA

    All computations are performed in R. I use admixtools 2.0.0 (https://github.com/uqrmaie1/admixtools) to compute F-statistics. To obtain a PC-decomposition, I first calculate all pairwise F2-statistics, and then use equation (2.8) and the eigen function to obtain the PCs. The right-hand-side matrix of equation (2.8) is expected to have non-negative eigenvalues (i.e. −CF2C is positive-semidefinite). However, when F2-statistics are estimated from data, sampling noise might make some of them slightly negative, which would lead to imaginary PCs. I avoid this by setting all negative eigenvalues to zero.

    4. Results

    The transformation from the previous section allows us to consider the geometry of F-statistics in PCA space. The relationships we will discuss formally only hold if we use all PCs. However, the appeal of PCA is that frequently, only a very small number Kn of PCs contain most information that is relevant for population structure, in which case the geometric interpretations become very simple. Thus, throughout the schematic figures, I assume that two PCs are sufficient to characterize population structure. In the data applications, I evaluate how deviations of this assumption may manifest themselves in PCA plots.

    (a) F2 in PC space

    The F2-statistic is an estimate of the squared allele-frequency distance between two populations. On a tree (figure 1a), this corresponds to the branch between two populations. In allele-frequency space, it corresponds to the squared Euclidean distance, and thus reflects the intuition that closely related populations will fall close to each other on a PCA plot, and have low pairwise F2-statistics. However, since F2 can be written as a sum of squared (non-negative) terms for each PC (equation 2.9), the distance on a PCA plot will always be an underestimate of the full F2 distance. Thus, PCA might project two populations with high F2 distance very close to each other, which would indicate that these particular PCs are not suitable to understand and visualize the relationship between these particular populations, and likely more PCs need to be investigated to understand how these populations are related to each other. In converse, populations that are distant on the first few PCs are guaranteed to also have a large F2-distance, since the distance contributed from the omitted PCs cannot be negative.

    (b) When are admixture-F3-statistics negative?

    Consider again the admixture scenario in figure 2a, where population Xy is the result of a mixture of X2 and X3, and subsequent drift changes the allele frequencies of the admixed population from Xy to Xx. How is such a scenario displayed on a PCA? Since the allele frequencies of Xy are a linear combination of X2 and X3, it will lie on the line segment connecting these two populations (figure 2b), at a location predicted by the admixture proportions. Subsequent drift will change the allele frequencies of Xy (to say, Xx), and so in general it might fall on a different point on a PCA plot. An exception occurs when Xx (and no other populations related to Xx) are not part of the construction of the PCA, so that XxXy is orthogonal to all PCs, i.e.

    XxXy,XiXj=XxXy,Pi=0
    for all populations i, jn. In this case, Xx and Xy project to the same point, and the location on the PCA can directly be used to predict the admixture proportions [47,49,62]. However, if either Xx is included in the construction of the PCA, or if some gene flow occurred between Xx and any of the populations used to construct the PCA, Xx and Xy may project on different spots (figure 2b).

    Thus, a natural question to ask is given two source populations X2, X3, can we use PCA to predict which populations might have negative F3-statistics? This condition can be written as

    2F3(Xx;X2,X3)=2XxX2,XxX3=XxX22+XxX32X2X32<0.4.1
    By the Pythagorean theorem, F3 = 0 if and only if X2, X3 and Xx form a right-angled triangle. The associated region where F3 = 0 is an n-sphere (or a circle in two dimensions) with diameter X2X3¯ (the overline denotes a line segment). F3 is negative when the triangle is obtuse, i.e. Xx could be considered admixed if it lies inside the n-ball with diameter X2X3¯ (figure 1b, equation (A2)).

    (c) F3 and dimensionality reduction

    If we project this n-ball on a two-dimensional plot, X2X3¯ will usually not align with the PCs; thus the ball may be somewhat larger than it appears on the plot. This geometry is perhaps easiest visualized on the example of projecting F3(Xx; X1, X2) from a two-dimensional space onto a single dimension (figure 3). In that example, the distance between X1 and X2 has a substantial contribution from PC2, and so the negative region (light grey) is larger than what would be predicted from just one dimension (dark grey bar), but if F^3F3, the two areas would be very close. Thus, if considering a reduced-dimension PCA plot, some points (such as X3) may project inside the negative region, but have positive F3 because they are outside the n-ball in higher dimensions. The converse interpretation is more strict: if a population lies outside the circle on any projection, F3 is guaranteed to be bigger than 0 (see equation (A4) in the appendix). An intuitive example is given by X4 in figure 3: all points projecting to the same point on figure 3b as X4 lie outside the circle.

    Figure 3.

    Figure 3. Admixture-F3-statistic and dimensionality reduction. (a) The grey circle is defined by X1 and X2 and reflects the area for which F3 is negative, i.e. F3(Xx; X1, X2) < 0 < F3(X3; X1, X2) ≤ F3(X4; X1, X2). (b) Projection on one dimension. The full rejection region (light grey bar) is given by the boundaries of the circle in a, and is larger than that predicted from the distance between X1 and X2 (dark grey bar) on PC1 alone. Points like Xx with negative F3 are guaranteed to lie in the light grey (but not dark grey) region. X3 also projects onto the grey bar, even though it is outside the circle and F3 is positive, and it would also lie inside the positive area if projected onto PC2. However, points such as X4 that project outside the rejection region have F3 > 0, since the associated projection line does not intercept the circle. (Online version in colour.)

    (i) Example

    As an example, I visualize the admixture statistic F3(X; Sardinian, Finnish), on the first two PCs of the Western Eurasian dataset (figure 4a). In this case, the projected n-ball (light grey) and circle based on two dimensions (dark grey) have similar sizes. However, several populations that appear inside the circles (e.g. Basque, Canary Islanders) have, in fact, positive F3 values, so they lie outside the n-ball. This reveals that the first two PCs do not capture all the genetic variation relevant for European population structure. Consequently, approximating F3 by the first two or even 10 PCs (figure 4b) only gives a coarse approximation of F3, and from figure 4c we see that many higher PCs contribute to F3 statistics.

    Figure 4.

    Figure 4. PCA and F-statistics for the Western Eurasian dataset. (a) PCA biplot; the light grey circle denotes the region for which F3(X; Sardinian, Finnish) may be negative, the dark circle is based on just the first two PCs. Populations for which F3 is negative are coloured in red. (b) F3 approximated with two (blue) and 10 (red) PCs versus the full spectrum. (c) Boxplot of contributions of PCs 1–10 to each F3-statistic. (d) Projection angle and correlation interpretation of F4(X, Saudi; Sardinian, Finnish) based on two PCs (green), three PCs (blue) or full data (red). (e) Contribution of the first 10 PCs to select F4-statistics, with the first three PCs containing the majority of contributions. (Online version in colour.)

    However, many populations, particularly from Western Asia and the Caucasus, on the right-hand side of the plot, fall outside the circle. This allows us to immediately conclude that their F3-statistics must be positive, and we should not consider them as a mixture between Sardinians and Fins.

    (d) Outgroup-F3-statistics as projections

    A common application of F3-statistics is, given an unknown sample XU, to find the most closely related population among a reference panel (Xi) [63]. This is done using an outgroup-F3-statistic F3(XO; XU, Xi), where XO is an outgroup. The reason an outgroup is introduced is to account for differences in sample times and additional drift in the reference populations (figure 5a). The outgroup-F3-statistic F3(XO; XU, X3) represents the branch length from XO to the common node between the three samples in the statistic, and the closer this node is to XU, the longer the branch and hence the larger the F3-statistic.

    Figure 5.

    Figure 5. Outgroup-F3-statistics. Interpretation of outgroup-F3-statistic on a tree (a) and PCA plot (b). The highlighted segment represents F3(XO; XU, X3) and the dashed segment reflects F3(XO; XU, X1) and F3(XO; XU, X2), which have the same value. (Online version in colour.)

    To make sense of outgroup-F3-statistics in the PCA context, I use the association of F3-statistics to projections (equation 2.5): on a PCA plot, we can visualize this F3-statistic as the projection of the vector XiXO onto XUXO

    projXUXO(XiXO)=F3(XO;XU,Xi)XUXOF2(XO;XU).

    Of the right-hand-side terms, only the F3 term depends on the Xi. The fraction can be thought of as a normalizing constant, so the F3-statistic is proportional to the length of the projected vector. This means that the outgroup-F3-statistic is largest for whichever Xi projects furthest along the axis from the outgroup to the unknown population; in figure 5, this is X3.

    (i) Example

    In figure 6a, I use the World dataset to visualize the outgroup-F3-statistic F3 (Mbuti; Sardinian, Xi), in i.e. a statistic that aims to find the population most closely related to Sardinian (a Mediterranean island), assuming the Mbuti are an outgroup to all populations in the dataset. On a PCA, we can interpret this F3-statistic as the projection of the line segment from Mbuti to population Xi onto the line through Mbuti and Sardinians (black line). For each population, the projection is indicated with a grey line. In the full data space, this line is always orthogonal to the segment Mbuti-Sardinian, but on the plot (i.e. the subspace spanned by the first two PCs), this is not necessarily the case. The colouring is based on the F3-statistic calculated from all the data, with brighter values indicating higher F3-statistics. In this case, the first two PCs approximate the F3-statistic very well: particularly, the samples from East Asia and the Americas project almost orthogonally, suggesting that most of the genetic variation relevant for this analysis is captured by these first two PCs. We can quantify this and find that the first two PCs slightly underestimate the absolute value of F3 (figure 4c), but keep the relative ordering. I also find that many PCs, e.g. PCs 3–5, 7 and 10, have almost zero contribution to all F3-statistics (figure 4d), PCs 6, 8 and 9 having a similar non-zero contribution for almost all statistics, likely because these PCs explain within-African variation.

    Figure 6.

    Figure 6. PCA and F-statistics for the World dataset. (a) Visualization of outgroup-F3-statistic F3(Mbuti; Sardinian, X) on a PCA biplot. The colour of points correspond to the value of the F3-statistic, with brighter yellows indicating higher values, i.e. higher similarity to Sardinians. The F3-projection axis is given by a black line, the projection of populations onto this axis by thin grey lines. In the full allele frequency space, these projection are orthogonal to the axis. (b) Projection along the axis Sardinian-Mbuti (X-axis), and PCA on residual of this projection (PC1 on Y-axis, PC2 as colouring). (c) Approximation of F3(Mbuti; Sardinian, X) using the first two (blue) and first 10 (red) PCs, respectively. (d) Contributions of first 10 PCs to all statistics of the form F3(Mbuti; Sardinian, X). (e) Contributions of the first ten PCs to select F4-statistics. (Online version in colour.)

    (e) F4-statistics as angles

    One interpretation of F4 on PCA plots is similar to that of F3: as a projection of one vector onto another, with the difference that now all four points may be distinct. F4-statistics that correspond to an internal branch in a tree (as in figure 1c) can be interpreted as being proportional to the length of a projected segment on a PCA plot, again with the caveat that we need to scale it by a constant. If the F4-statistic corresponds to a branch that does not exist in the tree (figure 1d), then, from the tree interpretation, we expect F4(X1, X2; X3, X4) = 0 implying that the vectors X1X2 and X3X4 are orthogonal to each other, i.e. that X1 and X2 map to the same point on the projection axis X3X4¯. In the case of an admixture graph, this is no longer the case: Both population Xy and Xx in figure 2d do not map to the same point as X1 or X2 do, implying that statistics of the form F4(X1, Xx; X3, X4) ≠ 0.

    Since F4 is a covariance, its magnitude lacks an interpretation. Therefore, commonly, correlation coefficients are used, as there, zero means independence and one means maximum correlation. For F4, we can write

    Cor(X1X2,X3X4)=F4(X1,X2;X3,X4)X1X2X3X42=cos(ϕ),4.2
    where ϕ is the angle between X1X2 and X3X4. Thus, independent drift events lead to cos (ϕ) = 0, so that the angle is 90, whereas an angle close to zero (cos (ϕ) ≈ 1) means most of the genetic drift on this branch is shared.

    (i) Example

    To illustrate the angle interpretation, I return to the Western Eurasian data. The PCA biplot shows two roughly parallel clines (figure 4a), a European gradient (from Sardinian to Finnish and Chuvash), and an Asian cline from Arab populations (top right) to the Caucasus (bottom right). This is quantified in figure 4d, where I plot the angle corresponding to F4(X, Saudi; Sardinian, Finnish). For most Asian populations, using two PCs (green points) gives an angle close to zero, corresponding to a correlation coefficient between the two clines of r > 0.9. Just adding a third PC (blue), however, shows that the clines are not, in fact, parallel, and the correlation for most populations is low. The finding that three PCs are necessary to explain this data can also be seen from the spectrum of these F4-statistics (figure 4e), which have high contributions from the first three PCs. Both results indicate that adding a third PC would give a much better description of the data, and the relationship between within-European variation to Saudis in particular.

    (f) Other projections

    So far, I used equation (2.9) to interpret F-statistics on a PCA plot, but the argument holds for any orthonormal projection in the data space. This is useful in particular for estimates of admixture proportions, which are often done as projections into a low-dimensional reference space defined by F-statistics [38,40,49,53].

    For example, a common way to estimate admixture proportion α of X1 is the F4-ratio

    α=F4(R1,R2;XX,X1)F4(R1,R2;X2,X1)=proj[R1R2]XXX1proj[R1R2]X2X1,4.3
    which can be interpreted as projecting XXX1 and X2X1 onto R1R2, and the ratio of the lengths gives the proportion of Xx contributed by X1 [49].

    The admixture graph motivating this statistic is visualized in figure 7a, and the PCA-like interpretation in figure 7b. In both panels, the solid grey line is the projection axis, and the dotted line gives the residual, i.e. the branches or genetic variation that is ignored by the projection.

    Figure 7.

    Figure 7. Admixture proportion estimates. (a) Visualization of the admixture graph scenario used to estimate the proportion α contributed from X1 to XX, using references R1 and R2. The full grey line corresponds to the projection axis, and the dotted grey lines correspond to the branches ignored in the projections. The admixture proportion α corresponds to the length of the dashed red line relative to the black line between X1 and X2. (b) The same scenario, but in Euclidean space, X1, X2 and XX align on a line both in the (low-dimensional approximation of the) residual space and on the projection axis. (Online version in colour.)

    The PCA-like projection can be used to visualize admixture proportions, as the horizontal position of XX relative to X1 and X2 (red dashed line versus black line) directly represents the estimated admixture proportion α. In addition, the residuals can be used to verify assumptions of the admixture graph model. In particular, since XX arises as a linear combination of X1 and X2, if admixture is recent we might expect the three populations to be collinear; if they are not this means that either of the populations experienced gene flow from some other population which might bias results [53].

    In addition, the external tree branches X1X1′ and X2X2′ are disjoint which means they should be orthogonal. On a one-dimensional residual plot (figure 7b), this cannot be verified, but the statistic

    F4(X1,X1;X2,X2)=04.4
    can be calculated for all samples.

    (i) Example

    I use the World dataset as an example, using Sardinian and Mbuti as references populations (figure 6b). The data are the same as in the PCA (figure 6a), but it is now rotated such that the axis between the reference population (black line in figure 6a) is aligned with the x-axis. For any pair of populations X1 X2, their horizontal projection distance reflects F4 (Sardinian, Mbuti; X1, X2) and the relative horizontal distance corresponds exactly to F4-ratio admixture estimates. For many sets of populations, this is of course not sensible, and just looking at the first PC of the residual shows many examples where the populations are not collinear. For example, on the x-axis, the South American Surui are between Papuans and Georgians, but since the Surui clearly are not on the line between Papuans and Georgians, this cannot be the result of admixture.

    5. Discussion

    Particularly for the analysis of human genetic variation with a large number of individuals with heterogeneous relationships, F-statistics are a powerful tool to describe population genetic diversity. Here, I show that the geometry of F-statistics [49] leads to a number of simple interpretations of F-statistics on a PCA plot.

    (a) The geometry of admixture

    Previous interpretation of PCA in the context of population genetic models have focused on explicit models and aimed at directly interpreting the PCs in terms of population genetic parameters [16,23,46,48]. My interpretation here is different in that the utility of PCA is to simplify the geometry of the data, rather than attributing meaning to the produced PCs. One consequence is that the results here are less directly impacted by sample ascertainment, sample sizes or number of PCs, which are common concerns in the interpretation of PCA [23,4648]; adding more PCs will provide a successively better approximation of the F-statistics.

    The two datasets I analysed here suggest that two PCs (for the World dataset), and three PCs (for the Western Eurasian data), respectively, already provide a very good approximation for F4-statistics (figures 4e and 6e), reflecting the observation that frequently the first few PCs provide a good approximation of the overall population structure. On the other hand, for admixture F3-statistics, more PCs are needed (figure 4b). This is likely due to PCA approximating the global structure in the dataset; statistics that only involve distantly related populations will only require a few PCs for good approximations, whereas statistics that contain a term measuring local variation, such as F3-statistics or F4 between closely related populations will require more PCs for good approximations, because local variation is often found on higher PCs.

    My focus on the geometry of the data allows for direct and quantitative comparisons between F-statistic-based results and PCA biplots. As PCA is often ran in an early step in data analysis, this may aid in generation of hypotheses that can be more directly evaluated using generative models, typically using a lower number of populations. It also allows reconciling apparent contradictions between F-statistics and PCA plots. In many cases, differences between the two data summaries will be due to variation on higher PCs. In this case, plotting additional PCs, or further subsetting the data to a more local set of populations seems prudent.

    (b) Assumptions

    In addition to the selection of PCs, the other cause for disagreements between F-statistics and PCA are differences in assumptions. The version of PCA I use for my analyses is chosen such that the similarities to F-statistics are maximized. In particular, I assume here that (i) we have no missing data, (ii) SNPs are equally weighted, (iii) that individuals can be grouped into populations and (iv) we use estimated allele frequencies. By contrast, most data analyses have to grapple with missing data, SNPs are often weighted according to their allele frequencies, and observed, individual-level genotypes are used as the basis of PCA.

    (i) Missing data

    The matrix decompositions underlying PCA assume complete data, and thus cannot be used when some data is missing [50]. As missing data are a very common practical problem, there is a large number of algorithms for imputing missing data. The simplest approach is to replace missing data with zeros (as implemented e.g. in [30]), but more sophisticated algorithms exist to ‘learn’ the missing values from surrounding data (e.g. [64,65]). By contrast, missing data in F-statistics is most commonly handled by estimating a standard error by resampling along the genome [40], and so missing data results in larger standard errors.

    These strategies are distinct, and reflect the original purposes of the approaches. For statistical tests based on F-statistics, we wish to isolate a set of three or four populations and get our best guess based on just that subset of data. By contrast, methods for PCA can leverage the additional individuals, and thus will likely result in more accurate estimates.

    However, the way we handle missing data is not tied into the method. For example, we could evaluate the robustness of a PCA by resampling data. Similarly, the theory developed here suggests that we could obtain accurate F-statistics with missing data by first performing a PCA using a method that handles missing data, and then calculate F-statistics from these PCs.

    (ii) Normalization

    In PCA, SNPs are typically normalized to have expected variance of one, a step that is omitted in calculating F-statistics [40]. The F-statistic framework assumes that each SNP is an identically distributed (but not independent) random variable, which holds regardless of weighting. Thus, normalization of SNPs is largely a matter of convention; for F-statistics the dependency on additional samples (through mean allele frequencies) is often unwanted, but could be advantageous for tools that aim to do joint inference from many F-statistics such as qpAdm [38,40]. As genetic differentiation between human populations is low, the normalization used may matter little in practice, but could be explored in future work [28].

    (iii) Estimated versus observed allele frequencies

    The third difference between F-statistics and PCA is on the usage of estimated allele frequencies versus individual-based genotypes. The fact that PCA does not distinguish between sampling error and the underlying structure is a well-known drawback of PCA, and applying the theory presented here to individual-based PCA would result in F-statistics that incorporate some sampling noise. Probabilistic PCA is one class of approaches that aim to separate the population structure from sampling noise (e.g. [66]). It seems likely that probabilistic PCA would yield a representation of the data that is more closely aligned with F-statistics than regular PCA.

    (iv) Individual versus population-based analyses

    The final issue is that PCA is commonly run on individual-based data, whereas F-statistics often group individuals into populations. However, population-based PCA has been the default in the past [1], and F-statistics are often applied to individuals (e.g. [25,67,68]). Often, an individual-based PCA is used to justify grouping individuals into populations; i.e. individuals that form a tight cluster on a PCA plot have similar relationships to everyone else in the dataset, and can thus be treated as a unit of analysis. Thus, if the assumptions are satisfied, F-statistics for individual-based and population-based analyses are expected to be very similar. PCA, on the other hand, is strongly impacted by the number of individuals from each population (e.g. [47]); as each individual is weighted equally, variation related to populations with many samples will be overrepresented on the first PCs.

    (v) Summary

    Motivated by F-statistics, the PCAs I consider here are based on estimated population allele frequencies, whereas most genome-scale studies of human genetic variation use observed individual-based allele frequencies that are normalized by overall allele frequencies. Thus, some care will be required to directly extend the interpretations developed here to individual-based PCAs. However, the differences are largely due to conventions, and particularly for studies where the description of population structure is a major focus, results might be easier to interpret if conventions regarding missing data, normalization and estimation of allele frequencies are used consistently between F-statistics and PCA.

    (c) The apportionment of human diversity

    Most genetic variation in humans is shared between all of us, but the around 15% that can be explained by population structure can be leveraged to study our history and diversity in great detail [1,69,70]. For some datasets, it is possible to predict an individuals’ origin at a resolution of a few hundred kilometres [45,71], and direct-to-consumer-genetics companies are using this variation to analyse the genetic data of millions of customers.

    However, understanding, conceptualizing and modelling this variation is far from trivial, particularly in a historical context in which mistaken ideas about human variation have been used to justify racist, eugenic and genocidal policies. Lewontin’s landmark 1972 paper on the apportionment of human genetic diversity was one of the first to quantify how little of between-population genetic variation could be attributed to ‘racial’ continental-scale groupings [70]. Over the last five decades, this view has been corroborated, refined and extended many times [1,7274].

    From a practical perspective, formulating hypotheses and designing studies in terms of discrete populations with ‘uniform’ genetic backgrounds is often sensible, as it enables e.g. prediction of phenotypes [75,76], inference of demographic parameters, and schematic models of human genetic history [40]. In a similar vein, when interpreting F-statistics in the context of admixture graphs, we make the implicit assumption that populations are discrete, related as a graph, and that gene flow between populations is rare [38,40]. However, these simplifications do come at a cost, both in terms of model violations that may invalidate statistical results, and in terms of deemphasizing that people do not rigidly fall into predefined genetic groups.

    In many parts of the world, and particularly at more local scales, distinctions between populations begin to blur, and everyone could be considered admixed to some degree [77]. This provides a challenge for interpretation, as most F3 and F4-statistics will indicate departures from treeness. A naive interpretation of the F-statistics from my Eurasian example (figure 4a) would identify a substantial fraction of Europeans as (significantly) admixed between Finnish and Sardinians. By contrast, PCA reveals that the variation in this dataset is not due to a single event, and so an arguably better description of the dataset is one where Finnish and Sardinians lie on opposite ends of a more gradually structured population.

    Thus, a more general way we could think about modelling population structure using F-statistics is as identifying orthogonal drift components. In a tree model, orthogonality arises because changes in allele frequencies on distinct branches of the tree are independent from each other (and high-dimensional random vectors are almost surely orthogonal). The classical model of admixture as a result of contact between long-separated and isolated populations is one of potentially many demographic models that results in non-orthogonality; gene flow of any kind will result in correlated genetic drift, and hence in non-zero F3 or F4-statistics.

    Thinking of population structure in terms of orthogonal components may be abstract, but it is quite similar to how PCA is sometimes interpreted. In a PCA, we frequently make the informal observation that a particular PC is associated with variation within a specific region, or separates out distinct populations [1]. These associations are not exact, and are impacted e.g. by the size and composition of the analysed dataset. On the other hand, we could use F-statistics to formally test orthogonality, or to quantify correlations. Motivated by PCA, we could set up F4 as tests of orthogonality due to either space (distinct populations evolve independently) or scale (i.e. between-population diversity is independent of within-population diversity). This slight generalization of F-statistics could allow us to reframe many questions about gene flow or divergence that are currently asked as tests of orthogonality, without assuming that lineages or admixture events are discrete.

    6. Conclusion

    F-statistics and PCA have both proven to be tremendously useful to study, visualize and test aspects of population structure. Here, I show that these approaches are closely connected, and highlight a few implications. First, PCA and F-statistics should never be treated as independent analyses. If they agree, this can be used as a sanity check that no major assumptions about, e.g. population groupings, are violated, but the underlying biological relationships investigated are the same in both approaches.

    If sufficiently many PCs are considered, both F3 and F4-statistics do have simple interpretations in PCA space. If used as a test for admixture, F3 corresponds to testing whether the admixed population lies in an n-sphere between the potential source population in PCA space. This result makes the informal notion that an admixed population should lie between its sources on a PCA plot more precise. Furthermore, I show that while it is necessary for an admixed population to lie between its sources on a PCA plot, this is not a sufficient condition and unadmixed populations may also project inside the n-sphere. By contrast, if a population falls outside the n-sphere on any PCA plot, this is sufficient to determine that this population has a positive F3 statistic.

    Interpretations of the outgroup-F3-statistics and F4-statistics in a PCA framework rely heavily on the geometric concepts of projections and orthogonality: in a tree or admixture graph framework, we can loosely interpret F4(A, B; C, D) as the overlap of the path from A to B onto the path from C to D, or equivalently as the projection of AB onto CD; only edges shared between the two paths will contribute non-zero amounts to this statistic. This interpretation holds when we replace the populations by points in PCA space, and enables us to study population models beyond discrete graphs. Thus, the geometric framework enables us to expand the application of F-statistics beyond current uses, allows us to better understand the meaning of these statistics and helps us to avoid overinterpretations, particularly in cases when population structure is continuous.

    Data accessibility

    All results are based on public data. The code used to obtain all results is available at doi:10.5281/zenodo.6424178. The data are provided in electronic supplementary material [78].

    Competing interests

    I declare I have no competing interests.

    Funding

    Open access funding provided by the Max Planck Society.

    Appendix A. Derivations

    Depending on a reader's background in linear algebra, these results may appear elementary; I include them here for reference and because they were not obvious to me at the onset of this project.

    (a) F-statistics are invariant under a change-of-basis

    F2(Xi,Xj)=l=1S((xilμl)(xjlμl))2=F2(Yi,Yj)=l=1S(kLklPikkLklPjk)2=l=1S(kLkl(PikPjk))2=l=1S(kLkl2(PikPjk)2+2kkLklLkl(PikPjk)2)=k(l=1SLkl2)1(PikPjk)2+2kk(l=1SLklLkl)0(PikPjk)2=k(PikPjk)2.A 1

    In summary, the first row shows that F2 on the centred data will give the same results (as distances are invariant to translations); in the second row, we apply the PC decomposition. The third row is obtained from factoring out Llk. Row four is obtained by multiplying out the sum inside the square term for a particular l. We have k terms when for (k2) terms for different ks. Row five is obtained by expanding the outer sum and grouping terms by k. The final line is obtained by recognizing that L is an orthonormal basis, where dot products of different vectors have lengths zero.

    Note that if we estimate F2, unbiased estimators are obtained by subtracting the population heterozygosities Hi, Hj from the statistic. As these are scalars, they do not change above calculation.

    (b) The region of negative F3-statistics is an n-ball

    Without loss of generality, assume that X1 = (r, 0, 0, …) and X2 = (−r, 0, 0, …), and let us assume that Xx has coordinates (x1, x2, …, xS). Assuming F3(Xx; X1, X2) = 0, equation (4.1) becomes

    2F3(Xx;X1,X2)=XxX12+XxX22X1X22=0=[(x1r)2+i=2Sxi2]+[(x1+r)2+i=2Sxi2]4r2=2[i=1Sxi2+r2+x1rx1r]4r2F3(Xx;X1,X2)=r2+i=1Sxi2=r2+Xx2=0,A 2
    which is the equation of an n-sphere with radius r and centre at the origin, as assumed from the placing of X1 and X2. Now, assume that F3 is negative, i.e. F3(Xx; X1, X2) = −k < 0. Moving r2 to the left, we obtain
    r2k=Xx2,A 3
    which is another n-sphere with a smaller radius, showing that all points inside the n-sphere will have negative F3 values.

    (c) If a population lies outside the circle of this n-sphere in any two-dimensional projection, F3 is positive

    Assume the centre of the n-sphere C = (X1 + X2/2) = (c1, c2, …, cS), and Xx = (x1, x2, …, xS). Then,

    F3(Xx;X1,X2)=XxC2r2=(x1c1)2+(x2c2)2>r2+i=3S(xici)20r2>0.A 4
    The condition (x1c1)2 + (x2c2)2 > r2 is satisfied whenever Xx is outside the circle obtained from projecting the n-sphere on the first two dimensions. An analogous argument applies for any low-dimensional representation.

    Footnotes

    One contribution of 15 to a theme issue ‘Celebrating 50 years since Lewontin's apportionment of human diversity’.

    Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.5898677.

    Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.