Size distribution of function-based human gene sets and the split–merge model

The sizes of paralogues—gene families produced by ancestral duplication—are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families where genes are grouped by a similar function or share a common property. The size distribution of Human Gene Nomenclature Committee (HGNC) gene sets deviate from the power-law, and can be fitted much better by a beta rank function. We propose a simple mechanism to break a power-law size distribution by a combination of splitting and merging operations. The largest gene sets are split into two to account for the subfunctional categories, and a small proportion of other gene sets are merged into larger sets as new common themes might be realized. These operations are not uncommon for a curator of gene sets. A simulation shows that iteration of these operations changes the size distribution of Ensembl paralogues and could lead to a distribution fitted by a rank beta function. We further illustrate application of beta rank function by the example of distribution of transcription factors and drug target genes among HGNC gene families.


WL, 0000-0003-1155-110X
The sizes of paralogues-gene families produced by ancestral duplication-are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families where genes are grouped by a similar function or share a common property. The size distribution of Human Gene Nomenclature Committee (HGNC) gene sets deviate from the power-law, and can be fitted much better by a beta rank function. We propose a simple mechanism to break a powerlaw size distribution by a combination of splitting and merging operations. The largest gene sets are split into two to account for the subfunctional categories, and a small proportion of other gene sets are merged into larger sets as new common themes might be realized. These operations are not uncommon for a curator of gene sets. A simulation shows that iteration of these operations changes the size distribution of Ensembl paralogues and could lead to a distribution fitted by a rank beta function. We further illustrate application of beta rank function by the example of distribution of transcription factors and drug target genes among HGNC gene families.

Introduction
A narrowly defined gene family within a genome [1,2] refers to all working genes, thus not pseudo-genes, that were produced by duplication events [3,4]. Genes in such a gene family share sequence similarity owing to their common origin, and are called paralogues [5,6]. The role of such a defined gene family in functional biology, however, is less clear as paralogues may diverge in functions [7][8][9]. Alternatively, a group of genes can be connected in a variety of other ways, such as similar expression patterns [10], participating in the same biology pathway, carrying out similar biological process, contributing to the same phenotype, sharing a common motif, etc. These groupings of genes can be called gene sets [11][12][13][14]. Genes in a gene set, besides being paralogues or homologues, can be based on a relationship with respect to physical proximity, chemical interaction, biological function and phenotype or medical condition [14].
To clearly distinguish gene families created by gene duplication and those by grouping using any other possible criteria, here we use the term 'paralogues' and 'gene sets' to refer to the two situations, and 'gene family' when such a distinction is not important. The size of paralogues in a genome can be quite uneven, with some large ones but more small ones. If S denotes the number of genes (size) in a paralogue, and f (S) the number of paralogues with size S, f (S) as a function of S has a power-lawlike trend with negative exponents [15]. This statistical trend for paralogue sizes has been confirmed by other publications [16,17]. Similar trends have been extended to the protein universe for folds, domains, clusters, families and superfamilies [18][19][20][21][22][23].
There are three goals in this paper: (i) although the long-tailed size distribution of paralogues is well established with sufficient understanding, most analyses, with some exceptions [47], were focused on bacterial genomes. Here, we examine the human paralogues in more detail. (ii) For gene families not constructed by homology, which we call gene sets, we show that power-law rank function is not a good fitting model. A more flexible function called beta rank function can be used [48,49]. This greatly expands our toolbox in a quantitative analysis of gene families. (iii) Just as duplication operation is essential for the observed power-law distribution of paralogue sizes, we examine the effect of a combined operation of splitting and merging of gene sets. We show that these operations lead to a deviation from the power-law distribution, thus justifying the introduction of new types of fitting function such as the beta rank function.

Homology-based human gene family (paralogues) data
We use the Ensembl family (in human) which has a stable ID beginning with ENSFM [50]. The number of genes within a family (paralogue) is counted by the number of distinct Ensembl Gene IDs beginning with ENSG that share the same family ID. The data are obtained from the Ensembl Biomart (www.ensembl.org/biomart) [51] by selecting the following boxes: (i) Ensembl genes 82 in 'choose database'; (ii) Homo sapiens genes (GRCh38.p3) in 'choose dataset'; (iii) filtering 'region (chromosome)': 1-22,X,Y; (iv) filtering 'gene (gene type)': protein_coding and (v) filtering 'gene (multi species comparisons)': paralogous human genes.

Biological function-based human gene sets data
We use the Human Gene Nomenclature Committee (HGNC) gene families (www.genenames.org/cgibin/genefamilies/) [52,53]. The broad guideline for gene families (which we call gene sets here) is that they 'signify groups of genes related by function, by sequence or by phenotype caused' [54]. Although the hierarchical organization of gene sets is advocated [54], many high-level concepts for gene (super)families do not have an HGNC family ID. The number of genes in a gene set is counted by the number of gene names with the same gene family name or ID. Non-coding genes, RNAs, microRNAs and genes without an approved name are not counted. We examine the effect of including or excluding pseudo-genes.

Inverse power-law rank function and Zipf's law
Suppose there are n F gene families (paralogues or gene sets); these gene families are ranked from large (with many genes in a family) to small (few genes in a family), with the rank r = 1, 2, . . . n F . The gene family size S r as a function of rank r can be called a rank-size distribution. The Zipf's law or Pareto's law uses an inverse power-law function: log(S r ) = c − a log(r) to fit the data, in log-log scale [55,56].
In the linear regression framework, the relation can be written as log(S r ) ∼ log(r) with the purpose of minimizing the error between expected (from the regression line) and observed in log(S r ) scale. Without the log-log scale, that fitting S r = C/r a is to minimize the error between expected and observed in the original S r scale. The two regressions may result in different fitting parameters. It is well known that the power-law Zipf's law S r = c/r a corresponds to a power-law probability density function f (S) ∝ 1/S 1/a+1 [57], as the two are connected by a switching of x and y axes, and a derivative [58].

Beta rank function
The beta rank function [48,49,[58][59][60] is the following functional form for ranked data: with two free parameters (a and b, where c is constrained by the normalization condition). In the regression framework, there are again two versions. One is a multiple linear regression in log-log scale: log(S r ) = c + b log(n F + 1 − r) − a log(r) or log(S r ) ∼ log(n F + 1 − r) + log(r). The second is a nonlinear regression S r ∼ (n F + 1 − r) b /r a . The multiple linear regression aims at minimizing predicted-observed error in the log(S r ) scale, whereas the nonlinear regression minimizes that in the S r scale. The beta rank function becomes a power-law rank function when b = 0.

Measuring fitting performance of a rank function by mean-squared error and Pearson's correlation
The fitting performance of a function/regression can be measured in different ways. We use two measures here: (i) mean-squared error (MSE): where 'observed' O r can either be S r or log(S r ) and 'expected' E r can either be regression-predicted S r or log(S r ). The quantity without the normalization factor (1/n F ) is called sum of squared errors (SSE), or residual sum of squares (RSS), or sum of squared residuals (SSR). (ii) Pearson's correlation coefficient between O r and E r , again in either the S r scale or log(S r ) scale. The coefficient of determination R 2 , used in, e.g. Martínez-Mekler et al. [49], is the square of r OE [61, p. 29]: (2.4) (iii) Kolmogorov-Smirnov (KS) distance between the cumulative distribution function (cdf) from observed data and cdf from the fitted values. KS distance in cdf is defined as the maximum distance between the two cdfs in the cumulative probability axis. That axis is equivalent to the rank axis in a rank-frequency plot, normalized by the maximum rank.

Comparing the fitting performance of two functions with different number of fitting parameters by Akaike information criterion and Bayesian information criterion
This part is essential identical to page 185 of Venables & Ripley [62] and the appendix of Li & Miramontes [59]. If we assume the difference between observed and fitted values follow a normal distribution at all ranks, the likelihood of the fitting function is where the variance of the noise can be estimated from the data:σ 2 = MSE. The Akaike information criterion (AIC) [63] is defined as: AIC = −2 logL + 2p, whereL is maximized likelihood, p is the number of parameters in the statistic model and Bayesian information criterion (BIC) [64] defined as BIC = −2 logL + p log n F . Replacing σ by the estimatedσ , we obtained the maximized likelihood, which after log is The fitting function with the lowest AIC/BIC is a better function [65].

Measuring fitting performance of a rank function by empirical p-value through simulation
Following Clauset et al. [57], we compare the distance between the observed data and fitted function with the distances generated from re-samplings (or bootstraps [66]). The re-sampling uses the true function to generate data, and the produced distances are supposed to be noise. When the observed distance tends to be larger than those from the re-sampling, we may question the validity of the fitted function. The proportion of re-sampling distances larger than the observed one is called empirical p-value. Then, a smaller empirical p-value can be used to reject the fitted function being a good function for this data [57] (see also, e.g. the appendix of Moreno-Sánchez et al. [67]). We carry out these steps: (i) fitting a ranked data by a ranking function (e.g. Zipf's function or beta rank function equation (2.1)); (ii) use the fitted rank function to construct an empirical cdf, noting that rank plot can be directly converted to an empirical cumulative plot [58]. We may write y = cdf(x) with y ∈ (0, 1); (iii) randomly sampling ys from the uniform distribution in (0, 1), and through the empirical cdf, obtain a simulated dataset which is generated by the fitted rank function (x = cdf −1 (y)). The xs are rounded off to integers, and integer x = 1 is discarded as we require gene families to have at least one member; (iv) the simulated values xs are ranked and fitted by a fitted function (same functional form as the one used in the observed data, e.g. Zipf's law or beta rank function). We use the KS distances as the distance measure between the data and the fitted function. KS distance is the maximum difference between the observed and fitted points in the y-direction. The KS distances between the simulated set and the corresponding fitted function are calculated; (v) repeat steps ii-iv. The proportion of KS distances from re-sampling replicates larger than the KS distance from the original observed data is the empirical p-value.

Computer program used
We use the R statistical package (www.r-project.org), where the function lm is used for linear regression, cor.test for correlation coefficient, ks.test for KS distance, approxfun for empirical cdf, etc.

List of transcription factor genes
A list of 1987 transcription factor (TF) genes is obtained from the supplementary table 2 of Vaquerizas et al. [68], with the URL: http://www.nature.com/nrg/journal/v10/n4/extref/nrg2538-s3.txt.   We use the re-sampling approach described in the Method section to assess the fitting of Zipf's law. Of the 10 000 replicates, 9692 have larger KS distances with their own fitted function than the KS distance from the observed data, and 46 have equal KS distances. This leads to an empirical p-value of 0.97, indicating that Zipf's law is a very good fitting function of the observed data. Another indirect confirmation is by drawing an error bar (± √ y) [70] for the data point in figure 1d. The shaded area enveloped by error bars maintains the straight line trend. We use the same re-sampling procedure to test the performance of Zipf's law for the whole range (for gene set sizes excluding pseudo-genes). In one round of re-sampling with 10 000 replicates, two of the KS distances between the randomly generated data and their fitting function are larger than the observed one. This 0.0002 empirical p-value indicates that Zipf's law is a poor fitting function. The shaded area in figure 2d enveloped by error bars also shows that straight lines do not fit the data well.

Curated human gene family sizes are better characterized by the beta rank function
It is a common practice to dismiss the tail that does not conform to a power-law as finite-size effects, based on the assumption that we do not have enough samples to observe the true value [71]. Although we may gradually improve the fitting performance of Zipf's law by removing the smaller families (empirical p-values are 0.01, 0.16, 0.20, 0.48 if we only fit family sizes larger than or equal to 8,9,10,11), we actually remove common events by doing so, not rare events as implied in the idea of finite-size effects.
Not only could the finite-size-effect argument not justify smaller gene sets being rare events, we also have a general belief that all points in the rank-frequency plot such as figure 2d should be fitted, and not to use that argument of finite-size effects arbitrarily ( [72]). Previous studies illustrate great potential of the beta rank function in fitting observational data [48,49]. The fitted parameter values in equation (2.1) are a = 0.85 and b = 0.34 by linear regression of log size over log-rank (a = 0.82 and b = 0.32 with pseudo-genes). Because a > b, we may consider our fitting function a modification of the Zipf's law. Figure 2d shows a much better fitting of the data by the beta function (blue) than by the powerlaw function (red). The improvement of the fitting performance can be further quantified by MSE (0.0179 without pseudo-genes, 0.0155 with pseudo-genes) and R 2 (0.985 without pseudo-genes, 0.986 with pseudo-genes). Because it is not fair to compare the fitting performance between two functions of different numbers of fitting parameters, we require that the increase of likelihood (or decrease of the fitting error) is more than compensated by the cost of using one more parameter The first term in equation (3.1) and equation (3.2) is −1065.9 (without pseudo-genes) and −1086.4 (with pseudo-genes), proving that the rank beta function is better than the power-law function.

Simulating a beta rank function by a split-merge model
We propose a simple model to explain rank functional form such as equation (2.1) which describes the sizes of curated gene sets. We first assume gene set sizes to initially follow a power-law distribution. A curator of the gene sets may fine-tune the collection by (i) splitting the larger (and the second largest) gene sets into two smaller ones and (ii) merging 6% of randomly chosen sets into larger ones by a two-toone operation. The reason for not choosing the smaller gene sets, but any gene sets, is that once we chose a threshold gene set size to decide which one is small, a discontinuity is introduced to the distribution. The motivation for the split operation is that curators of gene family databases may find subtle functional difference between genes in the largest families. The motivation for merging gene families could be due to a realization of a common function between two families. These operations are distinctly different from duplication which is an intrinsic mechanism to grow gene family sizes. Our two operations can be considered to be extrinsic ones imposed by curators.
We use the paralogues from Ensembl as the starting gene sets. The ranked gene set sizes after every 10 rounds of operation according to our dynamic model are plotted in figure 3. At the 100th round, the ranked gene set sizes are fitted by a beta rank function (solid line in figure 3). Visual impression indicates that beta rank function fits well the data points at the 100th step. Simulation with other initial distributions has also been carried out, such as uniform distribution, absolute normal distribution, lognormal distribution, chi-square distribution, etc. The resulting distributions after 100 or so iterations  are all similar, although there are subtle differences. We also observe that the b > a in equation (2.1) holds true in our simulation of this particular model. Because the initial gene set sizes, which are the sizes of paralogues, should be a consequence of duplication and deletion [15] or duplication and mutation [24], the sizes of gene sets produced by our dynamic model is a consequence of all these operations: duplication, deletion, mutation, splitting and merging, though their relative contribution to the final distribution is unclear. Besides modelling of gene family sizes, there has been theoretical modelling of power-law distribution on various other entities in genomes [73][74][75][76][77][78][79][80][81][82][83][84][85]. Although these works do not model gene family sizes directly, the connection between a mechanism (duplication) and a feature of the distribution (power-law) is unmistakable. We expect the extra mechanisms discussed here could be a cause of deviating the power-law distribution in these entities also.

Gene family size information is important for enrichment tests
Gene set is an indispensable part of a microarray expression analysis [13,[86][87][88][89]. A typical question is whether the overlapping genes between a gene set with a specific function and a list of differentially expressed genes (between two phenotypes) from microarray data are more than expected by chance. If such an overlap is unlikely by chance, measured by the p-value from a statistical test on enrichment, one may infer the relevance of that gene set to the phenotype under study. Because the test result depends on the size of the gene set, enrichment results of two different gene sets may not be comparable. In Subramanian et al. [13], gene sets with size S smaller than 15 are not considered, and enrichment score is normalized by a permutation-obtained mean value which is gene-set-size-dependent. Thus, it is important to characterize the distribution of all gene sets and one should be aware of the gene set size before reaching conclusion on the relevance of a gene set.

Distribution of transcription factor genes among gene families
The beta rank function equation (2.1) used in figure 2d expands our ability to fit gene set size data. Here, we show two more examples to illustrate its flexibility. We examine how human TF genes are clustered in different HGNC gene sets. Towards this, we use the 1987 TFs listed in Vaquerizas et al. [68], and search them in the HGNC gene sets. Some HGNC gene sets are clearly exclusively TFs.    figure 4). A beta rank function is used to fit the data. Re-sampling with 10 000 replicates, 1618 with KS distance larger than observed, and 2180 replicates with the same KS distance as observed, leads to an empirical p-value of 0.38. As a comparison, an empirical p-value using the power-law rank function is 0.12.

Distribution of drug target genes among gene families
A similar analysis is carried out for the pharmacological or drug target genes [69]. We use the target list obtained from the The International Union of Basic and Clinical Pharmacology/British Pharmacological Society (IUPHAR/BPS) Guide to PHARMACOLOGY site, which only includes those with a UniProtKB/Swiss-Prot ID, thus also having an HGNC name. The first groups of targets annotated in detail include G protein-coupled receptors, ion channels and nuclear hormone receptors [90]. Many proteins labelled as receptor, channel, transporter, kinase, etc. are drug targets or target candidates. Figure 5 shows the distribution of these target genes in the HGNC gene families. The majority of genes in the solute carrier gene family (390/396) are IUPHAR/BPS targets. Solute carriers are transporters that transport a large variety of smaller molecules such as ions, amino acids, neurotransmitters, sugars, etc, and are known to be good candidates for drug targets [91]. G protein-coupled receptors (GPCRs) reside on the cell membrane (pass through it seven times), sense molecular signals from outside the cell and initiate signal transduction inside the cell. As a result, GPCRs are also candidates for drug targets [92]. Although CD molecules or cell surface molecules provide the second most numerous targets, 60% of CD proteins are still not listed as targets. As seen in figure 5, the beta rank function provides a qualitative trend of the distribution of drug target genes in the HGNC gene families. The systematic deviation from the fitting curve in figure 5 is very similar to that for TFs (figure 4), even though the gene families involved in the two examples are completely different.

The effect of including pseudo-genes on gene family sizes
There are more than 10 000 pseudo-genes in the human genome [93]. As most pseudo-genes do not have any known function, only a small proportion of them show up in the HGNC gene families/sets.   Excluding pseudo-genes reduces the total number of entries in the gene sets by 1396. In figure 6, we plot the increase of gene set size due to the inclusion of pseudo-genes, as a function of the gene set size. Pseudo-genes have the most dramatic impact on olfactory gene sets [94,95], as expected because the reduced importance of sense of smell to human survival leads to a larger amount of disabling mutations in these genes. The gene set vomeronasal receptors, which is related to the olfactory sense, is labelled individually: it contains three genes without pseudo-genes, whereas 130 genes with pseudo-genes are included. Pseudo-genes also cause large increase of size for immunoglobulin gene sets. On the other hand, pseudo-genes have either no or very little impact on zinc finger gene sets. In conclusion, we aim at bridging two aspects of biological fields, evolutionary biology and functional biology (as well as human biomedical research), by comparing two types of gene families: paralogues and gene sets. While the size distribution of paralogues is well studied and its understanding is through the dynamics of duplication and deletion/mutation, the size distribution of gene sets has rarely been studied and its functional form was unknown. We observed that the size distribution of functional gene sets does not need to follow a power-law distribution, and its deviation from the power-law can be fitted by beta rank function. We propose a mechanism to drive the size distribution away from a power-law function, by splitting the largest gene sets and by randomly merging a small proportion of other gene sets. These results are potentially useful in a gene enrichment analysis as gene set size affects a gene enrichment test result.