Abstract
Ecologists are often required to estimate the number of species in a region or designated area. A number of diversity indices are available for this purpose and are based on sampling the area using quadrats or other means, and estimating the total number of species from these samples. In this paper, a novel theory and method for estimating the number of species is developed. The theory involves the use of the Laplace method for approximating asymptotic integrals. The method is shown to be successful by testing random simulated datasets. In addition, several real survey datasets are tested, including forests that contain a large number (tens to hundreds) of tree species, and an aquatic system with a large number of fish species. The method is shown to give accurate results, and in almost all cases found to be superior to existing tools for estimating diversity.
1. Introduction
Estimating how many species occupy a region from sample data is an important research area in biodiversity science and ecology [1–5]. Such estimates have numerous applications, including the comparison of relative richness of different ecosystems [6], the evaluation of different sampling methodologies [7], the identification of ‘hot-spot’ areas for conservation action [8,9], the estimation of the total number of species on the planet [10,11] and monitoring changes in the number of species in time to allow for inference of climatic effects or other disturbances.
Various methods have been used to study these problems, including log–log [1] and log–linear [4] species area curves, the Chao statistic [12], bootstrap and extrapolation methods [13], jackknife estimates [14], and the abundance coverage-based estimator (ACE) [15,16]. Each method makes different statistical assumptions about the ways in which species form communities. Good reviews of the available methods may be found in earlier studies [6,13,17–19]. There is little ecological theory or evidence to support any one of these methods over the others [20] and the results of the various estimators can be substantially different. Without a well-grounded theory, species richness estimation methods can be compared only by empirically testing on benchmark ecosystem datasets that are mapped out in detail. There has been recent interest [6,15,17,18,21] in resolving the differences in the various estimation methods and it is now understood to be a research direction that needs much more attention.
Here, we present a new approach based on asymptotic analysis of sampling data. To set the stage, we suppose that the true number of species in the entire area under investigation is S*. Suppose also that we have accurate surveys of n quadrats that together only cover a small part of the total area. Given that the number of species is known in each quadrat, is it possible to estimate the total number of species, S*?
In answering this question, our starting point, as in many other studies (e.g. [13]), is the well-known expression

Previous methods typically assume specific functional forms for S(n), such as exponential or hyperbolic [13], and extrapolate from the data to estimate S*, which from equation (1.1) is equal to the large n limit of S(n). Our approach, on the other hand, assumes a general smooth but unknown scaling functional form for the pj probabilities, from which we deduce an asymptotic form for S(n) to estimate S* in the large n limit. A similar approach has been presented recently by Chao et al. [22] to determine asymptotic relations between low species abundance numbers (see also [17]). Although the asymptotic methods described are similar to ours, the underlying assumptions are quite different. Our asymptotic forms are derived in appendix A where comparisons are made with the work of Chao et al. [22].
This article is organized as follows. In §2, we outline the derivation of the estimation method and describe the main assumptions on which it is formed. In §3, we first show that the method gives good results for randomly generated datasets in various scenarios. Following this, we test our method in comparison with other estimation techniques on various real datasets. These results typically show that in most cases our method outperforms others. In §3b, we present an analysis of the asymptotic S(n) curves for a large number of samples. This analysis shows that the expected behaviour in the large n regime is as our theory predicts, therefore validating our underlying assumptions. Finally, in §4, we summarize our main findings.
2. Material and methods
The values of the pj in equation (1.1) are unknown. We can, however, assume without loss of generality that the species are ordered or labelled so that

We now assume that pj are independent of n and can be expressed in terms of a (monotonic) scaling function f(x) for 0 ≤ x ≤ 1 as

To illustrate this, figure 1 shows pj for quadrats of 30 × 30 m (0.09 ha) within the Barro Colorado Island (BCI) forest dataset [23–25]. Figure 1a shows pj for all species—it is evident that it has the convex form that we assume. Figure 1b is a zoom in on small j/S* (the rare species). The solid line is a quadratic fit (y = 0.42x2 − 0.008x + 0.0013). Note that the coefficient of the quadratic term is several orders of magnitude larger than the others, indicating that the dominant term is the quadratic and not the linear or constant terms. The general shape of the curve and the quadratic fit confirm our assumptions about f(x). We discuss this further in §3b.
Figure 1. Barro Colorado Island f(x). (a) We order the pj = f(j/S*), the probability that species j is found on a single quadrat, in the BCI dataset calculated with quadrats of 30 × 30 m. The pj are plotted here as a function of x = j/S*. The convex shape is evident. (b) A zoom in on the small rectangle in the bottom left corner of panel (a). The solid line is a quadratic fit (see text) to f(x) for small x; that is, the rare species.
To continue the outline of the derivation, we express equation (1.1) as

. For large S*, given our assumptions, we can approximate I(n), using equation (2.2), as an integral
and thus obtain an accurate estimate for the total number of species. Therefore, instead of calculating a complicated integral, we just sum the main contributions using the method of Laplace. The rigorous derivation is briefly outlined for the present application in appendix A.In ecological terms, the dominance of small x-values in the asymptotic evaluation of equation (2.4) means, from equation (2.2), that rare species are the prime determinants of total species richness (S*), irrespective of the precise form of the relative densities of species. The same statement in fact applies to the Chao statistic [12] and jackknife estimates [14,27] that are determined by species with low abundance numbers (specifically singletons and doubletons). Intuitively, individuals of the rarer species are far less likely to be observed in a random collection of quadrats that samples only a small part of the total region. Thus, in the derivation rarer species are given a higher weight when calculating the integral.
Although the functional form of f(x) is not known, based on our assumptions it is reasonable to assume that for small x

If α = 1, this would correspond to a linear form so that the probability that species j is found in a quadrat increases linearly with the species index j. If α = 2, this would correspond to a quadratic increase. For real datasets, we have found α = 2 is often the more realistic choice (see §3b). If one makes the additional assumption that f(x) is smooth around x = 0, and recalling that f(0) = 0, one can make the power series approximation

Inserting this expression in the integral of equation (2.4), Laplace's method allows us to find the asymptotic form of I(n) and thus S(n) in equation (2.3) (see appendix A for details). Because of the sharp maximum in the integrand, around x = 0, the terms for f(x) that contribute mostly to the final value of the integral are readily identified. As shown in appendix A, based on the simplest approximation for f(x), S(n) in equation (2.3) has the asymptotic form

We refer to S(n) as calculated by equation (2.7) as our first-order predictor for the total number of species S* (from here on ‘Laplace 1’). In practice, even better estimates of S* may be achieved by refining the model to include higher-order terms obtained directly from Laplace's method. The second-order approximation (from here on ‘Laplace 2’) which is derived in appendix A is given by

Assuming that the function f(x) is convex and analytic, the parameter α appearing in the equations of Laplace 1 and 2 is chosen to take the value α = 2, as discussed in §3b. As we have already noted, and as shown in figure 1, this choice works extremely well for the BCI data and other datasets we test. Thus, working with the species accumulation curve S(n) obtained from quadrat data, one can numerically fit model equation (2.8) to estimate the three parameters A, B and (most importantly) total species numbers S*.
As discussed shortly, Laplace 2 is our model of choice because of its accurate estimates and relative ease of implementation. In the following section, we present some case studies and compare our model predictions with those of others mentioned previously.
3. Results
(a) Analysis of real and simulated datasets
We first show that the method yields accurate and consistent estimates based on computer-generated random datasets. This requires the construction of a random set of nq quadrats which can be considered typical samples of the entire area. The species accumulation curve S(n) may then be obtained from the sampled quadrats. Although there are numerous ways to achieve this, the method suggested below appears to have the most advantages.
(i) First, it is necessary to choose the pj, the probabilities that species j is found on a given quadrat. To accomplish this, we select an arbitrary polynomial f(x) (polynomial of x = j/S*) satisfying our basic assumptions, so that f(x) is of the form: ![]() 3.1 | |||||
(ii) We then ‘build’ the nq quadrats by randomly selecting the species present on each quadrat. Essentially, this requires scanning over all S* species and allowing species j to reside on a given quadrat with probability pj. | |||||
(iii) Once the nq quadrats are generated, it is possible to construct the S(n) curve. As mentioned, S(n) is the number of species observed in n of the total nq quadrats. In our case, the order of the accumulation of species among the n quadrats is insignificant, and therefore, in order to get a smooth S(n) curve, we averaged over 100 permutations out of the | |||||
(iv) A standard Matlab routine (nlinfit) is then used to fit model equation (2.8) to the S(n) curve to obtain the estimate for S*, the number of species. Carrying out the fit actually means we are extrapolating the S(n) curve for n → ∞. | |||||
One should note that in the case of random data, which are generated according to the above procedure, the quadrat size is irrelevant as pj determines directly the probability that a species will be in a quadrat. However, when analysing real data, given that the number of quadrats is fixed, then the quadrat size will obviously have a large impact on the estimation capability.
Figure 2a–c demonstrates how the method works. The diamond markers are the values for S(n) as a function of the number of quadrats n. The curves are fitted with the first- and second-order models to obtain the estimated number of species. Figure 2a,b is for random datasets generated with f1(x) = 0.23(x2 + x3 + x4) with 320 species, while figure 2c is for the BCI dataset (also 320 species with a quadrat size of 25 × 25 m).
Figure 2. The three panels demonstrate how the method works. In panels (a,b), we used f1(x) = 0.23(x4 + x3 + x2). (a) Laplace 1 with random data, (b) Laplace 2 with random data, and (c) Laplace 2 with BCI data.
Looking more closely at figure 2a,b, based on n = 8 quadrats, one observes the curve begins with S(1) = 53 and ends with S(8) = 170. The goal of the predictor is to extrapolate the graph for large n. Clearly, by eye alone it is impossible to make an accurate guess for S* where the curve saturates. Yet the second-order predictor remarkably estimates S* within 5–10% of the true value (S* = 320).
Table 1 summarizes results for several scenarios including results for random datasets generated with two functions f1(x) = 0.23(x2 + x3 + x4) and f2(x) = 0.3x2. The general trends are that the method is more accurate (i) when using second order instead of first order, (ii) when the number of quadrats is increased and (iii) when the number of species is larger. Moreover, the method is robust to different choices of the function f(x). Overall, table 1 shows that our method gives accurate estimates with a relatively small number of quadrats.
| nq | order | random, f1(x) S* = 200 | random, f1(x) S* = 320 | random, f2(x) S* = 320 | BCI, S* = 320 |
|---|---|---|---|---|---|
| 8 | first | 144 ± 6 | 229 ± 8 | 197 ± 10 | 235 ± 10 |
| 20 | first | 160 ± 5 | 256 ± 6 | 247 ± 7 | 259 ± 7 |
| 8 | second | 184 ± 11 | 293 ± 15 | 303 ± 18 | 284 ± 16 |
| 20 | second | 189 ± 7 | 302 ± 9 | 328 ± 11 | 297 ± 10 |
In order to examine our method in detail, we used five forest datasets (BCI, Sherman, Cocoli, Luquillo and Oosting [23–25]), a fish survey dataset consisting of data of 243 hauls from the Mediterranean Sea [28], and a random dataset created as previously described. The parameters we used and general information regarding the datasets is summarized in table 2. For each dataset, we tested eight methods, of which three were parametric: Laplace 2, Negative exponential and Michaelis–Menten (the last two involve fitting the S(n) curve to functions A(1 − exp(−Bn)) and
, respectively). The other five were non-parametric: Chao 2, jackknife first and second order, bootstrap [13], and ACE [15,16], which are all based on different statistics of the quadrat data. Figure 3 shows the results of the comparison of the methods for all datasets except Oosting, which is found in the electronic supplementary material.
| dataset | area (ha) | location | species no. | nq | quad. size (m × m) | total area sampled (ha) | per cent sampled | no. individuals per quadrat |
|---|---|---|---|---|---|---|---|---|
| Barro Colorado Island | 50 | Panama | 320 | 20 | 30 × 30 | 1.8 | 3.6 | ≈660 |
| Sherman | 5.96 | Panama | 239 | 15 | 20 × 20 | 0.6 | 10 | ≈160 |
| Cocoli | 4 | Panama | 177 | 20 | 20 × 20 | 0.8 | 20 | ≈100 |
| Luquillo | 16 | Puerto Rico, USA | 90 | 40 | 20 × 20 | 1.6 | 10 | ≈30 |
| fish survey | – | Mediterranean Sea | 144 | 25 | – | 243 (hauls) | 10 | ≈310 |

Figure 3. We compared the performance of eight different estimation methods in five real datasets taken from the literature and with a randomly generated dataset. In all cases, the true number of species of the whole area was known. The methods are: A, Laplace 2; B, negative exponential; C, Michaelis–Menten; D, Chao 2; E, jackknife 1; F, jackknife 2; G, bootstrap; H, ACE. It is notable that Laplace 2 outperforms the other methods in the BCI, Sherman, Cocoli and random datasets, and does at least as equally well as jackknife 2 for the Luquillo and fish survey datasets.
We note from the results of the comparison that when the number of species is large (figure 3a–c,f) the Laplace 2 predictor outperforms the other methods tested. As the theory is based on an asymptotic expansion for a large number of species and for a large number of quadrats, this is expected. But even for a smaller number of species Laplace 2 appears at least as good as any other method, and with smaller confidence intervals. What also stands out is that the same pattern of relative success of the different predictors repeats itself for different datasets (e.g. negative exponential always underestimates by a large percentage, Chao 2 has relatively large confidence intervals, etc.).
In addition to comparing the various methods on different datasets, we performed an analysis comparing the different methods for different sampling areas (number of quadrats). The results of this analysis are in figure 4. It is apparent that increasing the area covered by the quadrats leads to better and more accurate estimates of the number of species. Again, the pattern of relative success of the different methods is preserved for different fractions sampled of the total area.
Figure 4. A comparison of the methods using the BCI data. Each panel analyses a different percentage of the area as covered by the sampled quadrats. A, Laplace 2; B, negative exponential; C, Michaelis–Menten; D, Chao 2; E, jackknife 1; F, jackknife 2; G, bootstrap; H, ACE.
(b) Asymptotic analysis of empirical data
In order to find the best-fitting α parameter, we analysed empirical datasets for their asymptotic behaviour. As we are interested in the limit of large n and large S*, we chose the datasets of BCI, Sherman and Cocoli, as they have a large number of species and contain a relatively large area, enabling us to use a large number of quadrats. For each dataset, we generated many repetitions of the S(n) curve for large n. In figure 5, we show the resulting S(n) curves plotted versus n−1/2 and below them versus n−1. It is evident that the linear fits in figure 5a–c are much more accurate than the linear fits in figure 5d–f. For this reason, throughout the analysis we chose α = 2. It is worthwhile to note also that the confidence intervals become smaller for a larger number of quadrats n (smaller n−1/2), as is expected by theory based on a large n approximation.
Figure 5. Panels (a–c) include S(n) medians and confidence intervals plotted versus n−1/2 (circles and error bars) for BCI, Sherman and Cocoli datasets. In panels (d–f), S(n) is plotted versus n−1. The dashed lines are linear fits. The figure empirically shows that when the number of quadrats is large S(n) approaches the behaviour we predict, that is equation (2.4) 
. It is notable that plotting versus n−1/2 gives much better linear fits than plotting versus n−1, therefore corroborating our theory.
4. Discussion
In this work, we presented a novel method to tackle the problem of species number estimation based on quadrat data. We have tested the method in several scenarios, showing that it gives accurate results and outperforms other existing methods in most cases, and performs as well as the jackknife second-order method when the number of species is small. In addition, we have empirically demonstrated that our model's underlying assumptions are fulfilled, therefore reinforcing its reliability. We found that our model is insensitive to details of quadrat choices and different pj probabilities, and therefore is a good candidate to be used in new areas that are being sampled.
Acknowledgements
We thank Dori Edelist for providing the comprehensive dataset of fish hauls in the Mediterranean.
Funding statement
M. Bode was funded by the ARC (DE130100572). The BCI forest dynamics research project was made possible by
Appendix A
Laplace's method [26] is concerned with the asymptotic evaluation of integrals of the form



)


Changing variables to
we then have


For comparison, the work of Chao et al. [22] is primarily concerned with asymptotic estimates for frequency counts fk of species represented by k individuals from a sample of n. In our notation, from equation (1.1), the expectation value of f0, for example, is given by

Chao et al. [22] assume that pj ≥ a for j = 1, 2,…, S* and that the pj have ‘a common marginal density g(x)’ so that one has the following equation (before eqn 6 in [22]):


) and the Laplace method becomes invalid. Comparison of equations (A 2), (A 3) and (A 11) shows in fact that equation (A 10) is essentially a special case of α = 1 in equation (A 7). In other words, the methods presented by Chao et al. [22] are confined to asymptotic expansions in powers of n−1, which we have shown in §3b to provide poorer fits to real data in comparison with equation (2.7).Footnotes
References
- 1
Arrhenius O . 1921Species and area. J. Ecol. 9, 95–99. (doi:10.2307/2255763). Crossref, Google Scholar - 2
Arrhenius O . 1923On the relation between species and area—a reply. Ecology 4, 90–91. (doi:10.2307/1929282). Crossref, Google Scholar - 3
Arrhenius O . 1923Statistical investigations in the constitution of plant associations. Ecology 4, 68–73. (doi:10.2307/1929275). Crossref, Google Scholar - 4
Gleason HA . 1922On the relation between species and area. Ecology 3, 158–162. (doi:10.2307/1929150). Crossref, Google Scholar - 5
- 6
Gotelli NJ& Colwell RK . 2001Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecol. Lett. 4, 379–391. (doi:10.1046/j.1461-0248.2001.00230.x). Crossref, ISI, Google Scholar - 7
Gardner TA, 2008The cost-effectiveness of biodiversity surveys in tropical forests. Ecol. Lett. 11, 139–150. (doi:10.1111/j.1461-0248.2007.01133.x). Crossref, PubMed, ISI, Google Scholar - 8
May R, 1988How many species are there on earth?Science 241, 1441–1449. (doi:10.1126/science.241.4872.1441). Crossref, PubMed, ISI, Google Scholar - 9
Wilson KA, McBride MF, Bode M& Possingham HP . 2006Prioritizing global conservation efforts. Nature 440, 337–340. (doi:10.1038/nature04366). Crossref, PubMed, Google Scholar - 10
May RM, Beverton R, May RM& Beverton R . 1990How many species? [and discussion]. Phil. Trans. R. Soc. Lond. B 330, 293–304. (doi:10.1098/rstb.1990.0200). Link, Google Scholar - 11
May RM . 1990Taxonomy as destiny. Nature 347, 129–130. (doi:10.1038/347129a0). Crossref, ISI, Google Scholar - 12
Chao A . 1984Nonparametric estimation of the number of classes in a population. Scand. J. Stat. 11, 265–270. ISI, Google Scholar - 13
Colwell RK& Coddington JA . 1994Estimating terrestrial biodiversity through extrapolation. Phil. Trans. R. Soc. Lond. B 345, 101–118. (doi:10.1098/rstb.1994.0091). Link, ISI, Google Scholar - 14
Burnham KP& Overton WS . 1979Robust estimation of population size when capture probabilities vary among animals. Ecology 60, 927–936. (doi:10.2307/1936861). Crossref, ISI, Google Scholar - 15
Chao A, Hwang W-H, Chen Y& Kuo C . 2000Estimating the number of shared species in two communities. Stat. Sin. 10, 227–246. Google Scholar - 16
Chazdon RL, Colwell RK, Denslow JS& Guariguata MR . 1998Statistical methods for estimating species richness of woody regeneration in primary and secondary rain forests of Northeastern Costa Rica. In Forest biodiversity research monitoring and modeling: conceptual background and Old World case studies (eds F Dallmeier, JA Comiskey), pp. 285–309. Pearl River, NY: Parthenon Publishing Group. Google Scholar - 17
Colwell RK, Mao CX& Chang J . 2004Interpolating, extrapolating, and comparing incidence-based species accumulation curves. Ecology 85, 2717–2727. (doi:10.1890/03-0557). Crossref, ISI, Google Scholar - 18
Colwell RK, Chao A, Gotelli NJ, Lin S-Y, Mao CX, Chazdon RL& Longino JT . 2012Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. J. Plant Ecol. 5, 3–21. (doi:10.1093/jpe/rtr044). Crossref, ISI, Google Scholar - 19
Williamson M, Gaston KJ& Lonsdale W . 2001The species–area relationship does not have an asymptote!. J. Biogeogr. 28, 827–830. (doi:10.1046/j.1365-2699.2001.00603.x). Crossref, Google Scholar - 20
Gaston KJ . 1996Species richness: measure and measurement. Biodiversity: a biology of numbers and difference(ed. KJ Gaston), pp. 77–113. Oxford, UK: Blackwell. Google Scholar - 21
Chao A, Colwell RK, Lin C-W& Gotelli NJ . 2009Sufficient sampling for asymptotic minimum species richness estimators. Ecology 90, 1125–1133. (doi:10.1890/07-2147.1). Crossref, PubMed, ISI, Google Scholar - 22
Chao A, Shen T-J& Hwang W-H . 2006Application of laplace's boundary-mode approximations to estimate species and shared species richness. Aust. New Zealand J. Stat. 48, 117–128. (doi:10.1111/j.1467-842X.2006.00430.x). Crossref, Google Scholar - 23
Hubbell SP, Condit R& Foster RB . 2005Colorado forest census plot data. See https://ctfs.arnarb.harvard.edu/webatlas/datasets/bci. Google Scholar - 24
Condit R . 1998Tropical forest census plots: methods and results from Barro Colorado Island, Panama and a comparison with other plots. Berlin, Germany: Springer. Crossref, Google Scholar - 25
Hubbell SP, Foster RB, O'Brien ST, Harms K, Condit R, Wechsler B, Wright SJ& De Lao SL . 1999Light-gap disturbances, recruitment limitation, and tree diversity in a neotropical forest. Science 283, 554–557. (doi:10.1126/science.283.5401.554). Crossref, PubMed, Google Scholar - 26
- 27
- 28
Edelist D, Rilov G, Golani D, Carlton JT& Spanier E . 2013Restructuring the sea: profound shifts in the world's most invaded marine ecosystem. Divers. Distrib. 19, 69–77. (doi:10.1111/ddi.12002). Crossref, ISI, Google Scholar





