USP: an independence test that improves on Pearson’s chi-squared and the G-test

We present the U-statistic permutation (USP) test of independence in the context of discrete data displayed in a contingency table. Either Pearson’s χ2-test of independence or the G-test is typically used for this task, but we argue that these tests have serious deficiencies, in terms of both their inability to control the size of the test and their power properties. By contrast, the USP test is guaranteed to control the size of the test at the nominal level for all sample sizes, has no issues with small (or zero) cell counts, and is able to detect distributions that violate independence in only a minimal way. The test statistic is derived from a U-statistic estimator of a natural population measure of dependence, and we prove that this is the unique minimum variance unbiased estimator of this population quantity. The practical utility of the USP test is demonstrated both on simulated data, where its power can be dramatically greater than that of Pearson’s test, the G-test and Fisher’s exact test, and on real data. The USP test is implemented in the R package USP.


Introduction
Pearson's $\chi^2$-test of independence [1] is one of the most commonly used of all statistical procedures. It is typically employed in situations where we have discrete data consisting of independent copies of a pair $(X, Y)$, with $X$ taking the value $x_i$ with probability $q_i$, for $i = 1, \ldots, I$, and $Y$ taking the value $y_j$ with probability $r_j$, for $j = 1, \ldots, J$. For example, $X$ might represent marital status, with four possible values, and $Y$ might represent level of education, with values 'Middle school or lower', 'High school', 'Bachelor's', 'Master's', 'PhD or higher', so that $I = 4$ and $J = 5$. From a random sample of size $n$, we can summarize the resulting data $(X_1, Y_1), \ldots, (X_n, Y_n)$ in a contingency table with $I$ rows and $J$ columns, where the $(i, j)$th entry $o_{ij}$ of the table denotes the observed number of data pairs equal to $(x_i, y_j)$; see table 1 for an illustration.
Writing $p_{ij} = P(X = x_i, Y = y_j)$ for the probability that an observation falls in the $(i, j)$th cell, a test of the null hypothesis $H_0$ that $X$ and $Y$ are independent is equivalent to testing whether $p_{ij} = q_i r_j$ for all $i, j$. Letting $o_{i+}$ denote the number of observations falling in the $i$th row and $o_{+j}$ denote the number in the $j$th column, Pearson's famous formula can be expressed as

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \tag{1.1}$$

where $e_{ij} = o_{i+} o_{+j}/n$ is the 'expected' number of observations in the $(i, j)$th cell under the null hypothesis. Usually, for a test of size approximately $\alpha$, the $\chi^2$ statistic is compared with the $(1 - \alpha)$-level quantile of the $\chi^2$ distribution with $(I - 1)(J - 1)$ degrees of freedom. For instance, for the data in table 1, we find that $\chi^2 = 23.6$, corresponding to a p-value of 0.0235. This analysis would therefore lead us to reject the null hypothesis at the 5% significance level, but not at the 1% level.
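As a minimal sketch of formula (1.1), the following Python snippet recomputes Pearson's statistic for the table 1 counts. The counts are transcribed, row by row, from the `matrix(...)` call given in the software subsection; the function names are illustrative choices and are not part of any package. No p-value is computed, since that would need a χ² distribution function from outside the standard library.

```python
# Observed counts for table 1, entered row by row (4 rows, 5 columns).
# Transcribed from the column-major R call matrix(c(18,12,6,3,...),4,5).
TABLE1 = [
    [18, 36, 21, 9, 6],
    [12, 36, 45, 36, 21],
    [6, 9, 9, 3, 3],
    [3, 9, 9, 6, 3],
]


def expected_counts(table):
    """Return e_ij = o_i+ * o_+j / n for every cell of the table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi2(table):
    """Return Pearson's statistic (1.1) and its degrees of freedom."""
    expected = expected_counts(table)
    stat = 0.0
    for obs_row, exp_row in zip(table, expected):
        for o, e in zip(obs_row, exp_row):
            stat += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


chi2, df = pearson_chi2(TABLE1)
print(round(chi2, 1), df)  # prints "23.6 12"
```

The value agrees with the χ² = 23.6 reported in the text, on (4 − 1)(5 − 1) = 12 degrees of freedom.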
Pearson's $\chi^2$-test is so well established that we suspect many researchers would rarely pause to question whether or not it is a good test. The formula (1.1) arises as a second-order Taylor approximation to the generalized likelihood ratio test, or G-test as it is now becoming known (e.g. [4]):

$$G = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} o_{ij} \log \frac{o_{ij}}{e_{ij}}.$$

The G-test statistic is compared with the same $\chi^2$ quantile as Pearson's statistic, and its use is advocated in certain application areas, such as computational linguistics [5]. There is also a second motivation for the statistic (1.1), which relies on the idea of the $\chi^2$ divergence between two probability distributions $P = (p_{ij})$ and $P' = (p'_{ij})$ for our pair $(X, Y)$:

$$\chi^2(P, P') = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(p_{ij} - p'_{ij})^2}{p'_{ij}}. \tag{1.2}$$

The word 'divergence' here is used by statisticians to indicate that $\chi^2(P, P')$ is a quantity that behaves in some ways like a (squared) distance, e.g. $\chi^2(P, P')$ is non-negative, and is zero if and only if $P = P'$, but does not satisfy all of the properties that we would like a genuine notion of distance to have. For instance, it is not symmetric in $P$ and $P'$: we can have $\chi^2(P, P') \neq \chi^2(P', P)$. Pearson's statistic can be regarded as the natural empirical estimate of the $\chi^2$ divergence between the joint distribution $P = (p_{ij})$ and the product $P'$ of the marginal distributions $(q_i)$ and $(r_j)$. This [...] there is more than one distribution that satisfies its constraints.
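The lack of symmetry of the χ² divergence is easy to see numerically. The sketch below uses two made-up distributions on two points (they are illustrative only and do not come from the paper) and evaluates the divergence in both directions.

```python
def chi2_divergence(p, p_prime):
    """Chi-squared divergence: sum of (p - p')^2 / p' over all cells."""
    return sum((a - b) ** 2 / b for a, b in zip(p, p_prime))


# Two hypothetical distributions on a two-point space.
P = [0.5, 0.5]
P_PRIME = [0.9, 0.1]

forward = chi2_divergence(P, P_PRIME)   # divides by the entries of P_PRIME
backward = chi2_divergence(P_PRIME, P)  # divides by the entries of P
print(forward, backward)
```

The two values differ because the small entry 0.1 appears as a denominator in only one direction, which is the same mechanism that makes Pearson's statistic sensitive to small expected counts.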
The first two concerns mentioned above are related to small cell counts, which are known to cause issues for both Pearson's $\chi^2$-test and the G-test. Indeed, elementary Statistics textbooks typically make sensible but ad hoc recommendations, such as:
'[Pearson's $\chi^2$-test statistic] approximately follows the $\chi^2$ distribution . . . provided that (1) all expected frequencies are greater than or equal to 1 and (2) no more than 20% of the expected frequencies are less than 5' (Sullivan III [9, p. 623]).
'The $X^2$ statistic has approximately a $\chi^2$ distribution, for large $n$ . . . The $\chi^2$ approximation improves as $\{\mu_{ij}\}$ increase, and $\{\mu_{ij} \geq 5\}$ is usually sufficient for a decent approximation' (Agresti [7, p. 35]).
Unfortunately, these recommendations (and others in different sources) may be contradictory, leaving the practitioner unsure of whether or not they can apply the tests. For instance, for the data in table 1, we obtain the expected frequencies given in table 2. From this table, we see that all of the expected frequencies are greater than 1 but four of the 20 cells, i.e. exactly 20%, have expected frequencies less than 5, meaning that this table just satisfies Sullivan, III's criteria, but it does not satisfy Agresti's.
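The comparison of the two textbook rules can be checked directly. The sketch below recomputes the expected frequencies for table 1 (counts transcribed from the R call in the software subsection) and evaluates both rules of thumb; the variable names are illustrative.

```python
# Observed counts for table 1, row by row (from the R matrix(...) call).
TABLE1 = [
    [18, 36, 21, 9, 6],
    [12, 36, 45, 36, 21],
    [6, 9, 9, 3, 3],
    [3, 9, 9, 6, 3],
]

n = sum(sum(row) for row in TABLE1)
row_totals = [sum(row) for row in TABLE1]
col_totals = [sum(col) for col in zip(*TABLE1)]

# Expected frequencies e_ij = o_i+ * o_+j / n, flattened into one list.
cells = [r * c / n for r in row_totals for c in col_totals]

n_small = sum(1 for e in cells if e < 5)

# Sullivan's rule: every e_ij >= 1 and at most 20% of cells below 5.
# The 20% check is done in integers to avoid floating-point edge cases.
sullivan_ok = min(cells) >= 1 and 5 * n_small <= len(cells)

# Agresti's guideline: every expected frequency at least 5.
agresti_ok = all(e >= 5 for e in cells)

print(n_small, len(cells), sullivan_ok, agresti_ok)  # prints "4 20 True False"
```

The four small cells are the expected frequencies 3.9 and 3.3 in each of the last two rows, so the table sits exactly on the boundary of the first rule while failing the second.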
Fortunately, there is a well-known, though surprisingly rarely applied, fix for the first numbered problem above, for both Pearson's test and the G-test: we can obtain the critical value via a permutation test. We will discuss permutation tests in detail in §2, but for now it suffices to note that this approach guarantees that both tests control their size at the nominal level α, in the sense that for every sample size n, each test has Type I error probability no greater than α.
Our second concern above would typically be handled by removing rows or columns with no observations. If such a row or column had positive probability, however, then this amounts to changing the test being conducted. For instance, if we suppose for simplicity that the Ith row has no observations, but q I > 0, then we are only testing the null hypothesis that p ij = q i r j for i = 1, . . . , I − 1 and j = 1, . . . , J. This is not sufficient to verify that X and Y are independent.
It is, however, the third drawback listed above that is arguably the most significant. When the null hypothesis is false, we would like to reject it with as large a probability as possible. It is too much to hope here that a single test of a given size will have the greatest power to reject every departure from the null hypothesis. If we have two reasonable tests, A and B, then typically Test A will be better at detecting departures from the null hypothesis of a particular form, while Test B will have greater power for other alternatives. Even so, it remains important to provide guarantees on the power of a proposed test to justify its use in practice, as we discuss in §2, yet the seminal monograph on statistical tests of Lehmann & Romano [3] is silent on the power of both Pearson's test and the G-test.
The aim of this work, then, is to describe an alternative test of independence, called the USP test (short for U-Statistic Permutation test), which simultaneously remedies all of the drawbacks mentioned above. Since it is a permutation test, it controls the Type I error probability at the desired level for every sample size n (see footnote 3). It has no problems in handling small (or zero) cell counts. Finally, we present its strong theoretical guarantees, which come in two forms: first, the USP test is able to detect departures that are minimally separated, in terms of the sample size-dependent rate, from the null hypothesis. Second, we show that the USP test statistic is derived from the unique minimum variance unbiased estimator of a natural measure of dependence in a contingency table.
To complement these theoretical results, we present several numerical comparisons between the USP test and both Pearson's test and the G-test, as well as another alternative, namely Fisher's exact test (e.g. [7, §2.6]), which provide further insight into the departures from the null hypothesis for which the USP test will represent an especially large improvement.
The USP test was originally proposed by Berrett et al. [10], who worked in a much more abstract framework that allows categorical, continuous and even functional data to be treated in a unified manner. Here, we focus on the most important case for applied science, namely categorical data, and seek to make the presentation as accessible as possible, in the hope that it will convince practitioners of the merits of the approach.

The USP test of independence
One starting point to motivate the USP test is to note that many of the difficulties of Pearson's $\chi^2$-test and the G-test stem from the presence of the $e_{ij}$ terms in the denominators of the summands. When $e_{ij}$ is small, this can make the test statistics rather unstable to small perturbations of the observed table. This suggests that a more natural (squared) distance measure than the $\chi^2$ divergence is the squared Euclidean distance

$$\sum_{i=1}^{I} \sum_{j=1}^{J} (p_{ij} - p'_{ij})^2.$$

Unlike the $\chi^2$-divergence, this definition is symmetric in $P$ and $P'$. In independence testing, we are interested in the case where $P'$ is the product of the marginal distributions of $X$ and $Y$, i.e. $p'_{ij} = q_i r_j$. We can therefore define a measure of dependence in our contingency table by

$$D = \sum_{i=1}^{I} \sum_{j=1}^{J} (p_{ij} - q_i r_j)^2.$$

Under the null hypothesis of independence, we have $p_{ij} = q_i r_j$ for all $i, j$, so $D = 0$. In fact, the only way we can have $D = 0$ is if $X$ and $Y$ are independent. More generally, the non-negative quantity $D$ represents the extent of the departure of $P$ from the null hypothesis of independence.

Footnote 3: In fact, this represents an important advantage of permutation tests over the bootstrap (another natural choice to obtain p-values) in independence-testing problems.
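To make the dependence measure concrete, here is a small sketch that evaluates D, the sum of squared differences between the joint cell probabilities and the product of the marginals, for two hypothetical 2 × 2 distributions: one that factorizes and one that puts all its mass on the diagonal. The function name and both example tables are illustrative assumptions.

```python
def dependence_measure(p):
    """D = sum over cells of (p_ij - q_i * r_j)^2 for a joint table p."""
    q = [sum(row) for row in p]        # row marginals q_i
    r = [sum(col) for col in zip(*p)]  # column marginals r_j
    return sum(
        (p[i][j] - q[i] * r[j]) ** 2
        for i in range(len(q))
        for j in range(len(r))
    )


# Product of marginals (0.4, 0.6) and (0.3, 0.7): independent.
independent = [[0.12, 0.28], [0.18, 0.42]]
# All mass on the diagonal: X determines Y.
dependent = [[0.5, 0.0], [0.0, 0.5]]

print(dependence_measure(independent))  # zero up to rounding error
print(dependence_measure(dependent))    # four cells, each off by 0.25
```

In the second table every cell differs from the product of its marginals by 0.25 in absolute value, so D = 4 × 0.25² = 0.25.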
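Note that $p_{ij}$, $q_i$ and $r_j$ are population-level quantities, so we cannot compute $D$ directly from our observed contingency table. We can, however, seek to estimate it, and indeed this is the approach taken by Berrett et al. [10]. To understand the main idea, suppose for simplicity that $X$ can take values from 1 to $I$, and $Y$ can take values from 1 to $J$. Consider the function

$$h\bigl((x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)\bigr) = \sum_{i=1}^{I} \sum_{j=1}^{J} \Bigl( 1_{\{x_1=i, y_1=j\}} 1_{\{x_2=i, y_2=j\}} - 2 \cdot 1_{\{x_1=i, y_1=j\}} 1_{\{x_2=i\}} 1_{\{y_3=j\}} + 1_{\{x_1=i\}} 1_{\{y_2=j\}} 1_{\{x_3=i\}} 1_{\{y_4=j\}} \Bigr),$$

where, for instance, the indicator function $1_{\{x_1=i, y_1=j\}}$ is 1 if $x_1 = i$ and $y_1 = j$, and is zero otherwise. Since the four data pairs are independent, the expectation of $h((X_1, Y_1), (X_2, Y_2), (X_3, Y_3), (X_4, Y_4))$ is $\sum_{i,j} (p_{ij}^2 - 2 p_{ij} q_i r_j + q_i^2 r_j^2) = D$, so this quantity is an unbiased estimator of $D$.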
However, $h((X_1, Y_1), (X_2, Y_2), (X_3, Y_3), (X_4, Y_4))$ on its own is not a good estimator of $D$, because it only uses the first four data pairs, so it would have high variance. Instead, what we can do is to construct an estimator $\hat{D}$ of $D$ as the average value of $h$ as the indices of its arguments range over all possible sets of four distinct data pairs within our dataset. In other words,

$$\hat{D} = \frac{1}{n(n-1)(n-2)(n-3)} \sum_{(i_1, i_2, i_3, i_4)} h\bigl((X_{i_1}, Y_{i_1}), (X_{i_2}, Y_{i_2}), (X_{i_3}, Y_{i_3}), (X_{i_4}, Y_{i_4})\bigr),$$

where the sum is over all distinct indices $i_1, i_2, i_3, i_4$ between 1 and $n$. Thus, we have $n$ choices for the first data pair, $n - 1$ choices for the second data pair, $n - 2$ for the third and $n - 3$ for the fourth, meaning that $\hat{D}$ is an average of $n(n - 1)(n - 2)(n - 3) = 4! \binom{n}{4}$ terms, each of which has the same distribution, and therefore in particular, the same expectation, namely $D$. It follows that $\hat{D}$ is an unbiased estimator of $D$, but since it is an average, it will have much smaller variance than the naive estimator $h((X_1, Y_1), (X_2, Y_2), (X_3, Y_3), (X_4, Y_4))$. Estimators constructed as averages of so-called kernels $h$ over all possible sets of distinct data points are called U-statistics, and the fact that there are four data pairs to choose means that $\hat{D}$ is a fourth-order U-statistic. For more information about U-statistics, see, for example, Serfling [11, ch. 5].
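The averaging construction can be sketched by brute force. The code below is an illustration under an assumption: the kernel used is the natural one whose expectation is the sum of p_ij² − 2 p_ij q_i r_j + q_i² r_j² (equal to D), written with the double sum over cells collapsed into equality checks. It enumerates all ordered 4-tuples of distinct observations, so it is only practical for very small n.

```python
from itertools import permutations


def kernel(z1, z2, z3, z4):
    """Assumed fourth-order kernel with expectation D.

    The sum over cells (i, j) of products of indicators collapses to
    simple equality checks between coordinates of the four pairs.
    """
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = z1, z2, z3, z4
    same_cell = (x1 == x2) and (y1 == y2)  # estimates sum of p_ij^2
    mixed = (x2 == x1) and (y3 == y1)      # estimates sum of p_ij q_i r_j
    product = (x1 == x3) and (y2 == y4)    # estimates sum of q_i^2 r_j^2
    return int(same_cell) - 2 * int(mixed) + int(product)


def d_hat_bruteforce(data):
    """Average the kernel over all ordered 4-tuples of distinct pairs."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least four observations")
    # permutations() works by position, so repeated values are handled.
    total = sum(kernel(*quad) for quad in permutations(data, 4))
    return total / (n * (n - 1) * (n - 2) * (n - 3))


# A tiny made-up sample in which X and Y always agree.
sample = [(0, 0), (0, 0), (1, 1), (1, 1)]
print(d_hat_bruteforce(sample))
```

For this four-point sample the 24 ordered tuples contribute 8 from the first term, 0 from the second and 8 from the third, giving 16/24 = 2/3. The cost grows like n⁴, which is why a closed-form expression is preferable in practice.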
The final formula for $\hat{D}$ does simplify somewhat, but remains rather unwieldy; it is given for the interested reader in appendix Ab. Fortunately, and as we explain in detail below, for the purposes of constructing a permutation test of independence, only part of the estimator is relevant. This leads to the definition of the USP test statistic, for $n \geq 4$, as

$$U = \frac{1}{n(n-3)} \sum_{i=1}^{I} \sum_{j=1}^{J} (o_{ij} - e_{ij})^2 - \frac{4}{n(n-2)(n-3)} \sum_{i=1}^{I} \sum_{j=1}^{J} o_{ij} e_{ij}. \tag{2.1}$$

This formula appears a little complicated at first glance, so let us try to understand how the terms arise. Notice that $o_{ij}/n$ is an unbiased estimator of $p_{ij}$, and, under the null hypothesis, $e_{ij}/n$ is an unbiased estimator of $q_i r_j$. Thus the first term in (2.1) can be regarded as the leading order term in the estimate of $D$. The second term in (2.1) can be seen as a higher-order bias correction term that accounts for the fact that the same data are used to estimate $p_{ij}$ and $q_i r_j$; in other words, $o_{ij}/n$ and $e_{ij}/n$ are dependent.
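A short sketch of the test statistic may help. It assumes the form of (2.1) in which the first term is the sum of (o_ij − e_ij)² divided by n(n − 3) and the correction term is 4 times the sum of o_ij e_ij divided by n(n − 2)(n − 3); the function name is illustrative and this is not the code of the USP package.

```python
def usp_statistic(table):
    """USP statistic for a table of counts (list of rows), n >= 4."""
    n = sum(sum(row) for row in table)
    if n < 4:
        raise ValueError("the statistic is only defined for n >= 4")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    squared = 0.0  # sum of (o_ij - e_ij)^2, with no division by e_ij
    cross = 0.0    # sum of o_ij * e_ij, used in the bias correction
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            squared += (o - e) ** 2
            cross += o * e
    return squared / (n * (n - 3)) - 4.0 * cross / (n * (n - 2) * (n - 3))


# Tiny made-up table: every expected count is 1.
print(usp_statistic([[2, 0], [0, 2]]))  # prints "-1.0"
```

Zero counts, and even empty rows or columns, cause no difficulty here because no term is divided by an expected count. The value of U on its own is not interpretable as an estimate of D; it is only meaningful relative to its permutation distribution.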
To carry out the USP test, we first compute the statistic $U = U(T)$ on the original data $T = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. We then choose $B$ to be a large integer ($B = 999$ is a common choice), and, for each $b = 1, \ldots, B$, generate an independent permutation $\sigma^{(b)}$ of $\{1, \ldots, n\}$ uniformly at random among all $n!$ possible choices. This allows us to construct permuted datasets $T^{(b)} = \{(X_1, Y_{\sigma^{(b)}(1)}), \ldots, (X_n, Y_{\sigma^{(b)}(n)})\}$, and to compute the corresponding test statistics $U^{(b)} = U(T^{(b)})$. The key point here is that, since the original data consisted of $n$ independent pairs, we certainly know for instance that $X_1$ and $Y_{\sigma^{(b)}(1)}$ are independent under the null hypothesis. Thus the pseudo-test statistics $U^{(1)}, \ldots, U^{(B)}$ can be regarded as being drawn from the null distribution of $U$. This means that, in order to assess whether or not our real test statistic $U$ is extreme by comparison with what we would expect under the null hypothesis, we can compute its rank among all $B + 1$ test statistics $U, U^{(1)}, \ldots, U^{(B)}$, where we break ties at random. If we seek a test of Type I error probability $\alpha$, then we should reject the null hypothesis of independence if $U$ is at least the $\alpha(B + 1)$th largest of these $B + 1$ test statistics.
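The permutation procedure can be sketched in a few lines of standard-library Python. This is an illustration rather than the implementation in the USP package: the statistic follows the assumed form of (2.1) described above, ties are counted against the observed statistic instead of being broken at random (a slightly conservative convention), and the function names, seed and data layout are all choices made for this example.

```python
import random

# Table 1 counts, row by row (from the R matrix(...) call in the paper).
TABLE1 = [
    [18, 36, 21, 9, 6],
    [12, 36, 45, 36, 21],
    [6, 9, 9, 3, 3],
    [3, 9, 9, 6, 3],
]


def usp_statistic(table):
    """USP statistic in the assumed form of (2.1)."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    squared = cross = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = rows[i] * cols[j] / n
            squared += (o - e) ** 2
            cross += o * e
    return squared / (n * (n - 3)) - 4.0 * cross / (n * (n - 2) * (n - 3))


def tabulate(xs, ys, n_rows, n_cols):
    """Rebuild the contingency table from paired category labels."""
    table = [[0] * n_cols for _ in range(n_rows)]
    for x, y in zip(xs, ys):
        table[x][y] += 1
    return table


def usp_permutation_test(table, n_perm=999, seed=0):
    """Return a permutation p-value for the USP statistic."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    # Expand the table into one (x, y) label pair per observation.
    xs, ys = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            xs.extend([i] * count)
            ys.extend([j] * count)
    observed = usp_statistic(table)
    at_least_as_large = 0
    for _ in range(n_perm):
        permuted = ys[:]
        rng.shuffle(permuted)  # uniformly random permutation of the Y labels
        stat = usp_statistic(tabulate(xs, permuted, n_rows, n_cols))
        if stat >= observed:
            at_least_as_large += 1
    # The "+1" counts the observed statistic among the B + 1 values.
    return (1 + at_least_as_large) / (n_perm + 1)


p_value = usp_permutation_test(TABLE1, n_perm=999, seed=0)
print(p_value)
```

Because the permutations are random, the p-value depends on the seed; the smallest value this scheme can return is 1/(B + 1), which is 0.001 for B = 999.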
It is a standard fact (e.g. [12, lemma 2]) about permutation tests such as this that, even when the null hypothesis is composite (as is the case for independence tests in contingency tables), the Type I error probability of the test is at most $\alpha$, for all sample sizes for which the test is defined ($n \geq 4$ in our case). Comparing (2.1) with the long formula for $\hat{D}$ in (A 2), we see that we have ignored some additional terms that only depend on the observed row and column totals $o_{i+}$ and $o_{+j}$. To understand why we can do this, imagine that instead of computing $U, U^{(1)}, \ldots, U^{(B)}$, we computed the corresponding quantities $\hat{D}, \hat{D}^{(1)}, \ldots, \hat{D}^{(B)}$ on the original and permuted datasets, respectively. Since the row and column totals $o_{i+}$ and $o_{+j}$ are identical for the permuted datasets as for the original data, we see that the rank of $U$ among $U, U^{(1)}, \ldots, U^{(B)}$ is the same as the rank of $\hat{D}$ among $\hat{D}, \hat{D}^{(1)}, \ldots, \hat{D}^{(B)}$. Therefore, when working with the simplified test statistic $U$, we will reject the null hypothesis if and only if we would also reject the null hypothesis when working with the full unbiased estimator $\hat{D}$.
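The rank argument can be checked numerically on a toy dataset. The sketch below computes the full U-statistic by brute force (using the assumed kernel with expectation D) and the simplified statistic in the assumed form of (2.1), then confirms that their difference is the same for several random permutations of the Y labels. The data and helper names are made up for illustration.

```python
import random
from itertools import permutations


def d_hat(data):
    """Brute-force fourth-order U-statistic (assumed kernel)."""
    n = len(data)
    total = 0
    for (x1, y1), (x2, y2), (x3, y3), (x4, y4) in permutations(data, 4):
        total += int(x1 == x2 and y1 == y2)      # same-cell term
        total -= 2 * int(x2 == x1 and y3 == y1)  # mixed term
        total += int(x1 == x3 and y2 == y4)      # product-of-marginals term
    return total / (n * (n - 1) * (n - 2) * (n - 3))


def usp_from_pairs(data):
    """Simplified statistic, summed over every row/column combination."""
    n = len(data)
    x_levels = sorted({x for x, _ in data})
    y_levels = sorted({y for _, y in data})
    squared = cross = 0.0
    for i in x_levels:
        for j in y_levels:
            o = sum(1 for x, y in data if x == i and y == j)
            row = sum(1 for x, _ in data if x == i)
            col = sum(1 for _, y in data if y == j)
            e = row * col / n
            squared += (o - e) ** 2
            cross += o * e
    return squared / (n * (n - 3)) - 4.0 * cross / (n * (n - 2) * (n - 3))


# Hypothetical sample of six observations.
xs = [0, 0, 0, 1, 1, 2]
ys = [0, 1, 0, 1, 1, 0]

rng = random.Random(0)
gaps = []
for _ in range(5):
    permuted = ys[:]
    rng.shuffle(permuted)
    data = list(zip(xs, permuted))
    gaps.append(d_hat(data) - usp_from_pairs(data))

# The gap depends only on the margins, which permutations leave unchanged.
print(max(gaps) - min(gaps) < 1e-12)  # prints "True"
```

Since the two statistics differ by a quantity that is constant across permutations, ordering the permuted datasets by either one gives the same ranks, and hence the same p-value.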
As mentioned in the introduction, Berrett et al. [10] showed that the USP test is able to detect alternatives that are minimally separated from the null hypothesis, as measured by $D$. More precisely, given an arbitrarily small $\epsilon > 0$, we can find $C > 0$, depending only on $\epsilon$, such that for any joint distribution $P$ with $D \geq C n^{-1}$, the sum of the two error probabilities of the USP test is smaller than $\epsilon$. Moreover, no other test can do better than this in terms of the rate: again, given any $\epsilon > 0$ and any other test, there exists $c > 0$, depending only on $\epsilon$, and a joint distribution $P$ with $D \geq c n^{-1}$, such that the sum of the two error probabilities of this other test is greater than $1 - \epsilon$. This result provides a sense in which the USP test is optimal for independence testing for categorical data.
To complement the result above, we now derive a new and highly desirable property of the U-statistic $\hat{D}$ in (A 2).

Theorem 2.1. The statistic $\hat{D}$ is the unique minimum variance unbiased estimator of $D$.
The proof of theorem 2.1 is given in appendix Ac. Once one accepts that $D$ is a sensible measure of dependence in our contingency table, theorem 2.1 is reassuring in that it provides a sense in which $\hat{D}$ is a very good estimator of $D$. Since $U$ is equally as good a test statistic as $\hat{D}$, as explained above, this provides further theoretical support for the USP test.

Numerical results

(a) Software
The USP test is implemented in the R package USP [13]. Once the package has been installed and loaded, it can be run on the data in table 1 as follows:
> Data = matrix(c(18,12,6,3,36,36,9,9,21,45,9,9,9,36,3,6,6,21,3,3),4,5)
> USP.test(Data)
As with all permutation tests, the p-values obtained using the USP test will typically not be identical on different runs with the same data, due to the randomness of the permutations. The default choice of B for the USP.test function is 999, which, in our experience, yields quite stable p-values over different runs. This stability could be increased by running
> USP.test(Data,B = 9999)
for example (although this will increase the computational time). Using B = 999 yielded a p-value of 0.001, so with the USP test, we would reject the null hypothesis of independence even at the 1% level. For comparison, the G-test p-value is 0.0205, while Fisher's exact test has a p-value of 0.02, so like Pearson's test, they fail to reject the null hypothesis at the 1% level.

(b) Simulated data
In this subsection, we compare the performance of the USP test, Pearson's test, the G-test and Fisher's exact test on various simulated examples. For each example, we need to choose the sample size n, as well as the number of rows I and columns J of our contingency table. However, the most important choice is that of the type of alternative that we seek to detect. Recall that the null hypothesis holds if and only if p ij = q i r j for all i, j. There are many ways in which this family of equalities might be violated, but it is natural to draw a distinction between situations where only a small number of the equalities fail to hold (sparse alternatives), and those where many fail to hold (dense alternatives). It turns out that the smallest possible non-zero number of violations is four, and our initial example will consider such a setting.
The starting point for this first example is a family of cell probabilities that satisfy the null hypothesis:

$$p_{ij} = \frac{2^{-(i+j)}}{(1 - 2^{-I})(1 - 2^{-J})}$$

for $i = 1, \ldots, I$ and $j = 1, \ldots, J$. A pictorial representation of these cell probabilities is given in figure 1, which illustrates that the cell probability halves every time we move one cell to the right, or one cell down in the table. The corresponding marginal probabilities for the $i$th row and $j$th column are $q_i = 2^{-i}/(1 - 2^{-I})$ and $r_j = 2^{-j}/(1 - 2^{-J})$, respectively. Now, to construct a family of cell probabilities that can violate the null hypothesis in a small number of cells, we will fix $\epsilon \geq 0$ and define modified cell probabilities

$$p^{(\epsilon)}_{ij} = \begin{cases} p_{ij} + \epsilon & \text{if } (i, j) \in \{(1, 1), (2, 2)\}, \\ p_{ij} - \epsilon & \text{if } (i, j) \in \{(1, 2), (2, 1)\}, \\ p_{ij} & \text{otherwise.} \end{cases}$$

Note that $p^{(\epsilon)}_{ij}$ differs from $p_{ij}$ only in the four top-left cells in the table; in fact, we can calculate that our dependence measure $D$ is equal to $4\epsilon^2$ in this example.
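The claim that D = 4ε² for the sparse alternative can be verified numerically. In the sketch below the null table is the product of the stated marginals, and the perturbation adds ε to cells (1, 1) and (2, 2) and subtracts it from cells (1, 2) and (2, 1). That sign pattern is an assumption made for illustration; the opposite pattern gives the same value of D, since either choice leaves the marginals unchanged.

```python
def null_probs(n_rows, n_cols):
    """Cell probabilities p_ij = q_i * r_j with geometric marginals."""
    norm = (1 - 2.0 ** -n_rows) * (1 - 2.0 ** -n_cols)
    return [
        [2.0 ** -(i + j) / norm for j in range(1, n_cols + 1)]
        for i in range(1, n_rows + 1)
    ]


def sparse_alternative(n_rows, n_cols, eps):
    """Perturb only the four top-left cells (assumed sign pattern)."""
    p = null_probs(n_rows, n_cols)
    p[0][0] += eps
    p[1][1] += eps
    p[0][1] -= eps
    p[1][0] -= eps
    return p


def dependence_measure(p):
    """D = sum over cells of (p_ij - q_i * r_j)^2."""
    q = [sum(row) for row in p]
    r = [sum(col) for col in zip(*p)]
    return sum(
        (p[i][j] - q[i] * r[j]) ** 2
        for i in range(len(q))
        for j in range(len(r))
    )


eps = 0.06
table = sparse_alternative(5, 8, eps)
print(sum(map(sum, table)))                     # total probability, 1 up to rounding
print(dependence_measure(table), 4 * eps ** 2)  # the two values agree up to rounding
```

With I = 5 and J = 8 the smallest of the four perturbed null probabilities is about 0.065, so under this sign pattern every cell stays positive for ε up to roughly 0.13.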
We first study how well our estimator $\hat{D}$ is able to estimate $D$. In figure 2, we present violin plots giving a graphical representation of the values of $\hat{D}$ obtained from 10 000 contingency tables generated with $I = 5$ and $J = 8$, for 11 different values of $\epsilon$ and for $n = 100$ and $n = 400$; we also plot the quadratic function $f(\epsilon) = 4\epsilon^2$. This figure provides numerical support for the fact that $\hat{D}$ is an unbiased estimator of $D$, and illustrates the way that the variance of $\hat{D}$ decreases as the sample size increases from 100 to 400.
Next, we turn to the size and power of the USP test, and compare them with those of Pearson's test, the G-test and Fisher's exact test. Figure 3 shows the way in which the power of these tests increases with $\epsilon$, for a test of nominal size 5%, with $n = 100$ (the corresponding plot with $n = 400$, which is qualitatively similar, is given in figure 8). For both Pearson's test and the G-test, we plot power curves for both the version of the test that takes the critical value from the $\chi^2$ distribution with $(I - 1)(J - 1)$ degrees of freedom, and the version that obtains the critical value using a permutation test, like the USP test. Here and below, for all permutation tests, we took $B = 999$.
The most striking feature of figure 3 is the extent of the improvement of the USP test over its competitors. When $\epsilon = 0.06$, for instance, the USP test is able to reject the null hypothesis in 89% of the experiments, whereas even the better (permutation) version of Pearson's test only achieves a power of 29%. The permutation version of the G-test and Fisher's exact test do slightly better in this example, achieving powers of 59% and 66% respectively, but remain uncompetitive with the USP test. The version of the G-test that uses the chi-squared quantile for the critical value performs poorly in this example, because it is conservative (i.e. its true size is less than the nominal 5% level). This can be seen from the fact that the leftmost data point of the purple curve on the right-hand plot in figure 3, which corresponds to the proportion of the experiments for which the null hypothesis was rejected when it was true, is considerably less than 5%. It is also straightforward to construct examples for which the versions of Pearson's test and the G-test that use the $\chi^2$ quantile are anti-conservative (i.e. do not control the size of the test at the nominal level), as in appendix Aa or figure 8 in appendix Ae, and for this reason, we will henceforth compare the USP test with the permutation versions of the competing tests.
To give an intuitive explanation of why Pearson's test struggles so much in this example, recall that the $\chi^2$ statistic (1.1) can be regarded as an estimator of the $\chi^2$ divergence (1.2). Since, when $\epsilon > 0$, the only departures from independence occur in the four top-left cells of our contingency table, we should hope that the contributions to the test statistic from these cells would be large, to allow us to reject the null hypothesis. But these are also the cells for which the cell probabilities are highest, so it is likely that the denominators $e_{ij}$ in the test statistic will be large for these cells. In that case, the contributions to the overall test statistic from these cells will be reduced relative to the corresponding contributions to the USP test statistic, for instance, which has no such denominator (or equivalently, the denominator is 1). In fact, the denominators in Pearson's statistic mean that it is designed to have good power against alternatives that depart from independence only in low probability cells. The irony of this is that such cells will typically have low cell counts, meaning that the usual ($\chi^2$ quantile) version of the test cannot be trusted.
Our second example is designed to be at the other end of the sparse/dense alternative spectrum: we will perturb all cell probabilities away from a uniform distribution. More precisely, for $\epsilon \geq 0$, we set [...]

(c) Real data
[...] To explore this example further, we repeatedly generated further tables of the same size using the empirical cell probabilities from the real data, and computed the proportion of times that the null hypothesis was rejected at the 5% level. Over 1000 repetitions, these proportions were 0.578, 0.491, 0.497 and 0.499 for the USP test, Pearson's test, the G-test and Fisher's exact test, respectively, giving further evidence that the USP test is more powerful in this example.

For a second example, we return to the marital status data in table 1. Since the powers for all tests were very high when we resampled as above, we instead repeatedly subsampled 150 observations uniformly at random from the table, again computing the proportion of times that the null hypothesis was rejected at the 5% level. Over 1000 subsamples, these proportions were 0.700, 0.583, 0.585 and 0.633 for the USP test, Pearson's test, the G-test and Fisher's exact test, respectively, so again the USP test has greatest power over the subsamples.

Conclusion
χ 2 -tests of independence are ubiquitous in scientific studies, but the two most common tests, namely Pearson's test and the G-test, can both fail to control the probability of Type I error at the desired level (this can be serious when some cell counts are low), and have poor power. The USP test, by contrast, has guaranteed size control for all sample sizes, can be used without difficulty when there are low or zero cell counts, and has two strong theoretical guarantees related to its power. The first provides a sense in which the USP test is optimal: it is able to detect alternatives for which the measure of dependence D converges to zero at the fastest possible rate as the sample size increases (i.e. no other test could detect alternatives that converge to zero at a faster rate). The second, which is the main new theoretical result of this paper, reveals that the USP test statistic is derived from the unique minimum variance unbiased estimator of D. This provides reassurance about the test not just in terms of the rate, but also at the level of constants. These desirable theoretical properties have been shown to translate into excellent performance on both simulated and real data. Specifically, while no test of independence can hope to be most powerful against all departures from independence, we have shown that the USP test is particularly effective when departures from independence occur primarily in high probability cells.
A further extension of our methodology is to the problem of testing homogeneity of the distributions of the different rows of our contingency table. Since the permutations used to generate our p-values preserve the marginal row totals, the USP test can be used without modification in this setting, in an analogous way to Pearson's test and the G-test.

Appendix A

(a) An example to show that Pearson's χ 2 -test and the G-test can have unreliable Type I error
The aim of this subsection is to show that both Pearson's $\chi^2$-test and the G-test can have highly unreliable Type I error, even in the simplest setting of a 2 × 2 contingency table, and for arbitrarily large sample sizes. Fix a sample size $n$, fix $0 < \lambda < n^{1/2}$, and let $p = \lambda/n^{1/2}$. Consider a 2 × 2 contingency table with cell probabilities given in table 4. It can be checked that this table satisfies the null hypothesis of independence, since each cell probability is the product of its row and column marginal probabilities: $p_{11} = p^2$, $p_{12} = p_{21} = p(1 - p)$ and $p_{22} = (1 - p)^2$, with both marginal distributions equal to $(p, 1 - p)$. Now suppose that we draw a random sample of size $n$ from this contingency table, obtaining the cell counts in table 5. It is convenient to write $\hat{p}_{i+} = o_{i+}/n$ and $\hat{p}_{+j} = o_{+j}/n$. Then by some simple but tedious algebra,

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(o_{11} - e_{11})^2}{e_{11} (1 - \hat{p}_{1+})(1 - \hat{p}_{+1})}. \tag{A 1}$$

We are now in a position to study the asymptotic distribution of the $\chi^2$ statistic in this model, when $n$ is large and $\lambda$ is fixed. First, note that $o_{11}$, the number of observations in the top-left cell, has a binomial distribution with parameters $n$ and $p^2 = \lambda^2/n$, so its limiting distribution is Poisson with parameter $\lambda^2$, by the law of small numbers (e.g. [14, pp. 2-3]). On the other hand, the other terms in the final expression in (A 1) are converging to constants: $\hat{p}_{1+}$, the proportion of observations in the first row of the table, is converging to zero in the sense that $P(\hat{p}_{1+} > t) \to 0$ as $n \to \infty$ for every $t > 0$, and likewise for $\hat{p}_{+1}$, the proportion of observations in the first column. Finally, we turn to $e_{11}$, and note that we can write $e_{11} = (n^{1/2} \hat{p}_{1+})(n^{1/2} \hat{p}_{+1})$. Now, $n^{1/2} \hat{p}_{1+}$ has the same distribution as $W/n^{1/2}$, where $W$ is a binomial random variable with parameters $n$ and $p = \lambda/n^{1/2}$. Thus $n^{1/2} \hat{p}_{1+}$ has expectation $\lambda$ and variance $(\lambda/n^{1/2})(1 - \lambda/n^{1/2})$, which converges to zero as $n \to \infty$. Since $n^{1/2} \hat{p}_{+1}$ has the same distribution as $n^{1/2} \hat{p}_{1+}$, we deduce that $e_{11} = \lambda^2 + E_n$, where $E_n$ converges to zero in the same sense as $\hat{p}_{1+}$.
These calculations allow us to conclude that the asymptotic distribution of the $\chi^2$ statistic in this example is that of

$$\frac{(Z - \lambda^2)^2}{\lambda^2},$$

where $Z$ has a Poisson distribution with parameter $\lambda^2$. We can immediately see from this that, even in the limit as $n \to \infty$, Pearson's test will not have the desired Type I error probability, because this distribution differs from the $\chi^2$ distribution with 1 d.f., which is what would be expected according to the traditional asymptotic theory where the cell probabilities do not change with the sample size. As another way of comparing the actual asymptotic Type I error probability with the desired level, see figure 6. Here, we plot the asymptotic Type I error probability $P\bigl((Z - \lambda^2)^2/\lambda^2 > c_\alpha\bigr)$ as a function of $\lambda$, where $c_\alpha$ is the $(1 - \alpha)$th quantile of the $\chi^2_1$ distribution. For an ideal test of exact size $\alpha$, this should produce a constant flat line at level $\alpha$, but in fact we see that the Type I error probability oscillates quite wildly, due to the discreteness of the Poisson distribution. For a test at a desired 1% significance level, we may end up with a test whose Type I error probability is 10 times larger!

These issues are not resolved by working with the G-test instead. Indeed, similar but more involved calculations, given in appendix Ad, reveal that in this example, the asymptotic distribution of the G-test statistic is that of

$$2 Z \log\frac{Z}{\lambda^2} - 2(Z - \lambda^2)$$

(with the convention $0 \log 0 = 0$), where $Z$ has a Poisson distribution with parameter $\lambda^2$. Since this asymptotic distribution is not a $\chi^2$ distribution with 1 d.f., we again see that the asymptotic size of the G-test will not be correct in general. The corresponding asymptotic size plots, which are presented in figure 7, reveal similarly wild behaviour as for Pearson's test. The biggest jumps in the Type I error probabilities in figure 7 occur when $\lambda = \sqrt{c_\alpha/2}$, because when $\lambda$ exceeds this level, we will reject the null hypothesis on observing $Z = 0$, whereas for smaller $\lambda$ we will not.
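The asymptotic Type I error probabilities described above can be evaluated with a short script. The sketch below assumes the two limiting forms just discussed: (Z − λ²)²/λ² for Pearson's statistic and 2Z log(Z/λ²) − 2(Z − λ²) for the G statistic, with Z Poisson with mean λ². It sums the Poisson probabilities of the values of Z that lead to rejection. The χ² quantiles for one degree of freedom are hard-coded, the Poisson sum is truncated far in the upper tail, and all names are illustrative.

```python
import math

# (1 - alpha)-quantiles of the chi-squared distribution with 1 d.f.
C_ALPHA = {0.05: 3.841458820694124, 0.01: 6.6348966010212145}


def poisson_pmf(z, mu):
    """P(Z = z) for Z ~ Poisson(mu), computed on the log scale."""
    return math.exp(-mu + z * math.log(mu) - math.lgamma(z + 1))


def pearson_limit(z, lam):
    """Limiting value of Pearson's statistic when o_11 = z."""
    return (z - lam ** 2) ** 2 / lam ** 2


def g_limit(z, lam):
    """Limiting value of the G statistic, using 0 * log(0) = 0."""
    mu = lam ** 2
    log_term = z * math.log(z / mu) if z > 0 else 0.0
    return 2.0 * log_term - 2.0 * (z - mu)


def asymptotic_size(limit_stat, lam, alpha):
    """Sum the Poisson mass of the values of Z that lead to rejection."""
    mu = lam ** 2
    c = C_ALPHA[alpha]
    z_max = int(mu + 20 * math.sqrt(mu) + 50)  # mass beyond this is negligible
    return sum(
        poisson_pmf(z, mu)
        for z in range(z_max + 1)
        if limit_stat(z, lam) > c
    )


for lam in (1.37, 1.40):
    print(lam, asymptotic_size(g_limit, lam, 0.05))
for lam in (2.55, 2.60):
    print(lam, asymptotic_size(pearson_limit, lam, 0.01))
```

The two pairs of λ values straddle the thresholds at which observing Z = 0 starts to cause rejection, namely the square root of c_α/2 for the G-test at the 5% level and the square root of c_α for Pearson's test at the 1% level, so the printed sizes jump upwards within each pair.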
A similar transition occurs when λ = √ c α for Pearson's test in figure 6, though this is barely detectable when α = 0.01, in which case √ c α is approximately 2.58. We conclude from this example that the sizes of both Pearson's test and the G-test can be extremely unreliable, even when the overall sample size in the contingency table is very large. Moreover, these problems can be even further exacerbated when we move beyond 2 × 2 contingency tables, with asymptotic Type I error probabilities that deviate even further from their desired levels.
To explain what is going on in this example in a more general but abstract way, let P denote the set of all possible distributions on 2 × 2 contingency tables that satisfy the null hypothesis of independence. By, e.g. Fienberg & Gilbert [15], all such distributions have cell probabilities of the form given in table 6 for some 0 ≤ s ≤ 1 and 0 ≤ t ≤ 1.
In our example, we simplified this general case by taking $s = t = p$. The justification for using $c_\alpha$ as the critical value for Pearson's $\chi^2$-test comes from the fact that for each $P$ in the set $\mathcal{P}$, we have that

$$\mathbb{P}_P(\chi^2 > c_\alpha) \to \alpha$$

as $n \to \infty$. Here, the notation $\mathbb{P}_P$ indicates that the probability is computed under the distribution $P$. On the other hand, the crucial point about our example is that the speed at which this probability converges to $\alpha$ may depend on the particular choice of $P$ that we make; more formally, this convergence is not uniform over the class $\mathcal{P}$:

$$\sup_{P \in \mathcal{P}} \bigl| \mathbb{P}_P(\chi^2 > c_\alpha) - \alpha \bigr| \not\to 0$$

as $n \to \infty$. It is this fact that allows us to find, for each $n$, a distribution $P_n$ in $\mathcal{P}$ for which the Type I error probability $\mathbb{P}_{P_n}(\chi^2 > c_\alpha)$ is not approaching $\alpha$ as $n$ increases.
By a Taylor expansion of the logarithms in the second, third and fourth terms, we conclude that the asymptotic distribution of $G$ is that of $2(\lambda^2 + Z_1) \log(1 + Z_1/\lambda^2) - 2 Z_1$, as claimed in appendix Aa.

(e) Additional simulation results
Here, we present further numerical comparisons between the USP test, Pearson's test, the G-test and Fisher's exact test. Figure 8 shows power functions for the first (sparse alternative) example in §2b, but with n = 400 instead of n = 100. The figure is qualitatively similar in most respects to figure 3, and reveals that the improved performance of the USP test is not diminished by increasing the sample size. One slight difference is that we can see that the version of Pearson's test with the χ 2 quantile is anti-conservative (fails to control the size at the nominal level) for this sample size. A feature of both our sparse and dense examples is that the perturbations from the null distribution are additive. An alternative mechanism for departing from the null distribution that is also of interest is where the perturbations are multiplicative. For example, for I = J = 4 and ε ≥ 0, consider the cell probabilities [...] where C = I i=1 J j=1 (1 + (−1) i+j /2 i+j ) is a normalization constant; see figure 9. Figure 10 shows the power curves of our four permutation tests with n = 100. Despite the fact that the perturbations here are dense, we see that the USP test is best able to detect the violations of independence for small and moderate values of ε, while Fisher's test slightly outperforms it for larger ε.
Data accessibility. This article has no additional data. Authors' contributions. T.B.B. conceived of the general framework and helped draft the manuscript. R.J.S. helped formulate the general framework and drafted the manuscript. Both authors gave final approval for publication and agree to be held accountable for the work performed therein.