Asymptotic convergence in distribution of the area bounded by prevalence-weighted Kaplan–Meier curves using empirical process modelling

The Kaplan–Meier product-limit estimator is a simple and powerful tool in time to event analysis. An extension exists for populations stratified into cohorts where a population survival curve is generated by weighted averaging of cohort-level survival curves. For making population-level comparisons using this statistic, we analyse the statistics of the area between two such weighted survival curves. We derive the large sample behaviour of this statistic based on an empirical process of product-limit estimators. This estimator was used by an interdisciplinary National Institutes of Health–Social Security Administration team in the identification of medical conditions to prioritize for adjudication in disability benefits processing.


Introduction
Survival analysis addresses the classical statistical problem of determining characteristics of the waiting time until an event, canonically death, from observations of their occurrence sampled from within a population. This problem is not trivial as the expected waiting time is typically dependent on the time-alreadywaited. For instance, a hundred-year-old can be more certain of surviving to his or her one hundred-and-first birthday than a newborn might reasonably be. However, the comparison may shift in the newborn's favour for the living to 121, particularly in light of medical advances that make survival probabilities & 2018 The Authors. Published by the Royal Society under the terms of the Creative non-stationary. Parametric approaches for assembling survival curves are usually not flexible enough to capture this complexity.
One simple approach to this problem was pioneered by the work of Kaplan & Meier [1]. Their product-limit estimator [2][3][4][5] is a non-parametric statistic that is used for inferring the survival function for members of a population from observed lifetimes. This method is particularly useful in that it naturally handles the presence of right censoring, where some event times are only partially observed because they fall outside the observation window. It was not, however, designed to account for varying subpopulations that may yield non-homogeneity in overall population survival (figure 1). For instance, in the example given above, subpopulations for survival characteristics may be defined by birth year or entry cohort of a subject in a particular study (figure 1).
Several existing statistical methods address variants of this limitation. A natural approach is to consider the varying subpopulations as defining underlying covariates, thus laying the framework for a proportional hazards model. The assumption of proportional hazards is quite strong. When considering time-dependent statistics (as in the motivational example), it is violated in all but a few specific cases. Likewise, frailty models, first developed by Hougaard (cf. [6]), and extended by Aalen (cf. [7]), assume multivariate event distributions, but also make assumptions on the underlying event distributions and assume proportional hazards.
Other existing methods, such as bivariate survival analysis (cf. [8]), consider the time to observation and the time to event as conditionally independent random times. Underlying these methods is the assumption that upon the time of observation, all individuals will then have a similar event time distribution, thus failing to acknowledge the temporal changes.
These complexities arose in the identification of new disorders to incorporate into the United States Social Security Administration (SSA)'s Compassionate Allowances (CAL) initiative. The CAL initiative seeks to identify candidate medical conditions for fast-tracking in the processing of disability applications. The intent of this initiative is to prioritize applicants who are most likely to die in the time-course of usual case processing so that they may receive benefits while still living.
At its inception, the CAL initiative identified conditions based on the counsel of expert opinion [9]. The SSA in collaboration with the National Institutes of Health (NIH) sought to expand the list of CAL conditions systematically, using a databased approach. Using in part the survival estimator described in this paper, the NIH identified 24 conditions for inclusion into the list of conditions [9].
The methodology used in CAL is related to that of the work of Pepe & Fleming (cf. [10,11]), where a class of weighted Kaplan-Meier statistics is introduced. Though these statistics exhibit the same limitations as in the standard Kaplan-Meier case, it should be noted that [11] introduces the stratified weighted Kaplan-Meier statistic. The statistic presented here is a priori quite similar, but instead of a weighting function, includes the empirical prevalence. In doing so, the weight is no longer independent of the event time estimate, and thus requires much different methods of proof.  Figure 1. Inhomogeneity of survival within populations can result due to at least two reasons. In (a), inhomogeneity results from a categorical covariate that influences survival statistics. In (b), inhomogeneity results from non-stationarity, where cohorts of individuals are sampled at different times. In this case, the problem of progressive censoring is apparent because later cohorts have not been observed as long.
rsos.royalsocietypublishing.org R. Soc. open sci. 5: 180496 We thus consider the overall survival distribution for a population of individuals with subpopulations that exhibit non-homogeneous survival distributions. Through this consideration, a new test statistic, based upon the empirical process of product-limit estimators is developed. Through constructive methods, this test statistic compares survival distributions among the distinct subpopulations, and weights according to distribution of the identified subgroups.

Statistical method
Suppose G (1) and G (2) are disjoint populations of individuals where each individual belongs to exactly one of d distinct cohorts labelled z [ Z d . For randomly selected individuals g [ G (i) within population i, we desire to understand the statistics of the event time T g under the assumption that survival is conditional on cohort z g and population.
One representation of the marginal survival probability for members of population i, where S (i) z,t represents the survival function for individuals of cohort z in population i, where each individual's cohort membership is known.
We use this representation of the survival probability as motivation to formulate an estimator for the population-average survival functionsû ( 2 :2) whereq (i) z andŜ (i) z,t are estimators of the cohort prevalence and cohort-wise survival, respectively. This weighted Kaplan-Meier method has appeared previously in the literature [12], and has been empirically validated against the pure Kaplan-Meier method [13], where the weighting procedure was found to reduce the bias in the construction of survival curves. The asymptotic convergence of the product-limit estimator and weighted variants is well established [11,14]. We use this survival curve reconstruction method as a base in constructing a new statistic for comparing populations. The focus of this paper is not the properties of this survival estimator but rather the asymptotic convergence of its bounding area and the use of such a quantity for evaluating a null hypothesis.
Our concern is the general situation where random samples of size n (i) are chosen from each of the respective populations. Within these samples, the number of individuals within each cohort, n (i) z , is counted, from which an estimator of the cohort distribution is obtained In turn, we assume that the cohort-level survival functionsŜ (i) z,t are estimated independently using the product-limit estimator. Note that since the product-limit estimator is not a linear functional of sampled lifetimes,û (i) t is distinct from the estimator obtained by applying the product-limit estimator directly on all n (i) samples of population i. To prevent confusion, we denote all direct applications of the product-limit estimator usingŜ and all instances of weighted sums of product-limit estimators using the Greek letterû : With these elements in place, we define our test statistiĉ where t ¼ inf {t z : z [ Z d }, and t z denotes the time at which cohort z is censored in observations. Note that in the absence of random prevalence this statistic is equivalent to comparison of mean lifetimes between the two populations [10]. We state here the main result of the paper-the large sample behaviour of this statistic within a null-hypothesis statistical testing framework.
z,t denote the probability that a z-type individual has not yet been censored at time t ! 0 (the survival probability relative to the occurrence of censoring), and q (i) z denote the probability that an individual in rsos.royalsocietypublishing.org R. Soc. open sci. 5: 180496 z,t is the product-limit estimate associated with the event of censoring for cohort z within population i,f z ;f z,0 , andŴ Note that this quantity is also well defined sinceĈ (z) z,t . 0 for all t t z . In appendix A, we provide a proof of theorem 2.1 in an empirical process framework. Note that since survival estimatesû andŜ are step functions, all integrals are exactly computable.

Numerical investigation
A computational implementation of the test statisticQ and weighted survival estimators is available in the form of a package for R. This package also contains a class to handle arithmetic involving rightcontinuous piecewise linear functions. In the appendices, we have provided source code that may be used for installing and invoking this package.
Here, we present a computational investigation of the weighted survival curve estimator and the corresponding test statistic. Using simulations, we investigated the statistical power ofQ, contrasted with that of existing non-parametric methods. Using a real dataset, we demonstrate the computation ofQ,û t , and evaluate type I error.

Evaluating statistical power through simulations
Using simulations, we explored the statistical power of the test statisticQ in a case where populations are difficult to distinguish based purely on mean survival time. As test populations, we examined admixtures of exponential and Weibull distributions for the event time, and compared survival in these mixture populations to survival of a population of purely exponential event times (figure 2). Population 1 consists of individuals having an exponentially distributed lifetime with a mean of l 21 ¼ 4 years. Population 2 consists of two types of individuals: those who have an exponentially distributed lifetime with a mean of 5 years (type z ¼ 1), and those of type z ¼ 2 who have a Weibull distributed lifetime with shape parameter k ¼ 5 and scale parameter l ¼ 1.
Since Population 1 is homogeneous, we only track subpopulations of Population 2-we drop the superscript and denote the proportion of Population 2's members of type 2 by q 2 . It is most instructive to examine our method in the neighbourhood where both populations have approximately the same expectation value for the event time, which occurs for q 2 % 0.245. For this reason, we chose values near 0.25 for our simulations. To compare the reweighted Kaplan-Meier estimator (equation (2.2)) to the standard Kaplan-Meier estimator, we estimated survival for the admixed population for q 2 ¼ 0.25, using various sample sizes. In figure 3, we present example reconstructions using these two methods. The estimator variance was approximated using 10 000 resamplings of sample size n of the admixed population, for each value   of n. The estimation error, as defined by mean-squared difference between the reconstruction and the true survival function, was approximated in the same manner.
To better understand the performance of the test statistic (equation (2.4)), we evaluated its statistical power against that of other test statistics in distinguishing between Population 1 and Population 2 for various values of q 2 . For samples of size n (i) [ f30, 50, 100, 200, 1000g taken from each population, we performed 1000 null hypothesis statistical tests using our method, the log-rank method [15], and the standard Kaplan-Meier Wilcoxon signed-rank difference-of-mean methods [16,17]. The power of the test, or the proportion of times that the null hypothesis was correctly rejected, is shown in figure 4.

Evaluating type I error in a real world example
We applied the survival estimator and statistic to NCCTG Lung Cancer data [18] available within the survival package for R. We compared the survival between male (n (1) ¼ 136) and female (n (2) ¼ 90) cancer patients, organized by ECOG performance score (z [ f0, 1, 2g) as cohort. Using males as population 1 and females as population 2, we arrived at the test-statistic estimate:Q ¼ À961, with 95% asymptotic confidence interval: (21527, 2396), which would support rejection (p % 0.0009) of the null hypothesis (û (1) t ¼û (2) t ) at a ¼ 0.05. For reference, both the Wilcoxon ( p % 0.0012) and log-rank (p % 0.0015) tests referenced in figure 5 also rejected the null hypothesis. In figure 5, cohort-level survival estimates are also shown.
In theory, the type I error is set by the significance level at study design. Whether a statistic controls type I error correctly depends on accurate evaluation of its sampling distribution. In the case ofQ, our main result is that the sampling distribution for this estimator converges asymptotically in distribution to a Gaussian with a definite variance. However, small-sample behaviour is not guaranteed. To evaluate type I error, we used the same dataset, restricted to male patients. For each of n [ f40, 80, 136g, we sampled without replacement the n male patients split into two groups so that n (1) ¼ n (2) ¼ n/2, and compared survival between the two random groups. Repeating this procedure 10 000 times,  we generated the observed distribution of p-values, presented in figure 6 in log-scale. The distributions computed using the three methods are similar. The three methods all rejected H 0 approximately 5% of the time except for the case ofQ at n ¼ 40, which rejected H 0 approximately 6% of the time. Essentially, asymptotic convergence as defined by the accurate evaluation of a ¼ 0.05 type I error occurs somewhere between 40 and 80 samples for this particular dataset. Probing deeper, we examined the sampling distributions ofQ for each of n [ f50, 60, 70g, in each instance compared to the Gaussian distribution stated in theorem 2.1, where the approximation is computed using only the first sample of size n. The results for these simulations are shown in figure 7, where it is seen that the sampling distribution ofQ is approximately the same as the computed asymptotic Gaussian distribution, which is traced out in red.
The R code used to compute these examples is available in appendix B.3.

Discussion and conclusion
In this paper, we have proposed a test statistic that uses a cohort-averaged survival function estimator in order to make cross-population comparisons of survival within a null hypothesis statistical testing  framework. The proposed survival estimator was an empirically weighted average of cohort-level product-limit estimates. The test statistic involved computation of the area between estimated survival functions for two populations. By invoking an empirical stochastic process, we proved asymptotic normality of this test statistic. Using simulations, we contrasted the weighted survival estimator against the pure Kaplan-Meier estimator. It is seen, in figure 3, that the survival curves generated from the two methods are distinct yet similar. In the second and third rows of figure 3, one sees that this reweighted estimator has comparable performance to the pure Kaplan-Meier estimator at large sample sizes. Asymptotically, both estimators converge to the true survival function, with variance converging to zero. At small sample sizes, there are differences. The reweighted estimator has reduced variance at the cost of larger bias, in a time-dependent manner. It also appears to have smaller variance at the cost of larger error at earlier times. This error at earlier times is mitigated by decreased error at later times (better reconstruction of tails); however, the estimator variance is lower at all times. Hence, dependent on costs, for small samples, this reweighted estimator may be preferable to the pure Kaplan-Meier estimator.
In simulations of the test statistic derived from the reweighted survival estimator, we saw superior performance compared to existing methods. In figure 4, it is seen that in all cases, the test statisticQ was better at distinguishing between the two populations than either the Wilcoxon signed-rank test or the log-rank test. The relatively high statistical power of this statistic is due to tighter variation in the test statistic. In nearly all cases (greater than 99.5%), the estimator variance for the tested method was less than that of the other two tests (not shown).
This paper derives the asymptotic convergence in distribution of theQ statistic. Numerically, we demonstrated convergence of the statistic in figures 6 and 7, where we verified that the asymptotic approximation respects type I error at a ¼ 0.05 and where we observe good match between the sampling distribution ofQ and the asymptotic Gaussian distribution provided by theorem 2.1.
A variant of this method was used in Rasch et al. [9] in order to classify physical disorders based on severity for the sake of prioritization of processing for disability claims. Since the underlying survival surface is non-stationary, and the fixed observation windows create progressive censoring, that paper illustrates the utility of this statistical method. In that paper, the cohorts were defined based on binned application entry times and a heuristic 'survival surface' was generated in order to get a single overall picture of the survivability of a given disorder. The censoring parameters t z varied due to the finite sampling window and the fact that more recent cohorts are not observed for as long a time period as older cohorts, as depicted in figure 1b. It was also expected that survival by cohort would vary due to differences in healthcare administration and treatment between entry cohorts. The use of the empirical prevalences (q z ) allowed the accounting for variability in disability application volume by sufferers of given disorders, conditional on entry date.
We note that a strong limitation of the presented method lies in its framing in terms of null hypothesis statistical testing. TheQ statistic only provides a p-value, as opposed to other tests such as the log-rank test which provide hazard ratios as well. As a trade-off for statistical power, one is sacrificing interpretability in the form of effect sizes.
Although the most direct and natural applications of the method that we have presented here involve discretely indexed covariates, it is possible to use this method for continuously indexed covariates such as time by employing the binning strategy used by Rasch et al. [9]. This approach is particularly fruitful if the sampling windows are coarse, and there is a clear separation between cohorts to maintain statistical independence. In this situation, it may be unreasonable to expect to construct a full continuous surface for survival. Nonetheless, a possible future extension of this method might involve replacing the sum of equation (2.1) with an integral and using statistical regularization tools [19] in order to infer true continuously indexed surfaces.
Data accessibility. All data in this paper are simulated, with R source code provided in appendix B.  To prove the main theorem, we use an empirical process modelling framework to develop the asymptotic properties of first deterministically proportionally weighted Kaplan-Meier estimators. We then replace the deterministic proportions with estimates given by the sample prevalences of the cohorts. Here, we restate the main theorem and prove it through a series of lemmata.
z,t denote the probability that a z-type individual has not yet been censored at time t ! 0 (the survival probability relative to the occurrence of censoring), and q (i) z denote the probability that an individual in population i is of cohort z, and let p (i) ¼ n (i) /(n (1) þ n (2) ). Suppose that u (1) t ¼ u (2) t : ThenQ À ! d N(0, s 2 ), as n (i) ! 1, with where for 0 t^t z , where t z is the time at which samples of cohort z are censored, f z,t ¼ Ð tz t dsS z,s , f z ; f z,0 , S z,t is the survival function for the pooled data of cohort z, and The variance s 2 may be consistently estimated bŷ where for 0 t^t z ,Ŝ z,t is the product-limit estimate of the pooled data for cohort ẑ C (i) z,t is the product-limit estimate associated with the event of censoring for cohort z within population i,f z ;f z,0 , and Overview of proof of theorem 2.1. To prove the main theorem, we turn to the modelling framework that we present in A.2. In general, we proceed by first assuming fixed sample proportions and then extending results to random proportions as given by empirical prevalence (equation (A 2)). The convergence ofQ follows directly from corollary A.9 and equation (A 18). The consistency ofŝ 2 follows from theorem 4.2.2 of [2], which provides for weak convergence of the product limit estimator to a Gaussian process, and the Glivenko-Cantelli theorem. B

A.1. Preliminaries and notation
Given any pair of random elements X, Y, we denote equality in a distributional sense by X % Y. Let P be a probability measure on the measurable space (X, A). The empirical measure generated by the sample of random elements x 1 , . . ., x n , n [ N is given by where for any x [ X, and any A [ A, Note that alternatively, when needed, one may write d x (A) as the indicator function 1 A (x) on the set A. Furthermore, in the case that Given H, a class of measurable functions h : X ! R, the empirical measure generates the map H ! R given by h 7 ! P n h, where for any signed measure Q and measurable function h, we use the notation rsos.royalsocietypublishing.org R. Soc. open sci. 5: 180496 Qh ¼ Ð dQh. Furthermore, define the H-indexed empirical process G n by and with the empirical process, identify the signed measure G n ¼ n À1=2 P n i¼1 (d x i À P). Note that for a measurable function h, from the law of large numbers and the central limit theorem, it follows that P n h À ! a:s: Ph, and G n h À ! d N(0, Ph À Ph 2 ), provided Ph exists and Ph 2 , 1, and where 'À ! d ' denotes convergence in distribution. In addition to the preceding notation, given the elements f, and f n , n [ N, we also denote, respectively, convergence in probability and in distribution, of f n to f, by f n À ! p f.
For any map x : H ! R k , k [ N, define the uniform norm kxk H by and in the case that H , R, write k Á k H ; k Á k 1 . A class H for which kP n À Pk H ! 0 is called a P-Glivenko -Cantelli class. Denote by ' 1 (H) the class of uniformly bounded functions on H. That is, for a general k [ N, If for some tight Borel measurable element G [ ' 1 (H), G n À ! d G, in ' 1 (H), we say that H is a P-Donsker class.

A.2. Empirical process framework
To prove theorem 2.1, we turn to an empirical modelling framework that will provide us the asymptotic statistics of the weighted product limit estimator. Consider a closed particle system, such that according to a predefined set of characteristics, the system can be subdivided into mutually exclusive subsystems. Each particle corresponds to the observed state of a particular individual in a fixed population cohort. Note that we will restrict this discussion to only a single population of particles. These arguments will extend to multiple populations as mentioned in this paper by treating separate populations as independent.
At any given time t ! 0, each particle will have exactly one associated state x in the set Z 4 , referring, respectively, to states of 0 dormancy 1 activity 2 inactivity 3 censored: Assume that the path of any particle is statistically dependent upon its particular subsystem, and that given the respective subsystems of any two particles, their resulting paths are statistically independent. Assume further that at a reference time t ¼ 0, all particles enter into the active state (x ¼ 1), and that particles are considered dormant for all t , 0. Let d [ N and t [ (0, 1) be fixed. We will assume the existence of a collection of individuals G, assumed to be infinite in size, where each individual g [ G exhibits a càdlàg path-valued state x g t , for t ! 0. For each g [ G, x g t is determined by the individual's particle type z g and a random jump time j g . The particle type z g is distributed in the population through the probability mass P( . ., S d,t ) be the survival vector S z,t ¼ P{T z . t}, which is assumed continuous for t ! 0. Suppose that it is desired to understand the event probabilities for randomly selected g [ G, unconditional on subgroup membership. We assume that members of each cohort are in the inactive (0) state at times t , 0.
Given a random sample g 1 , . . ., g n , n [ N of individuals, let n ¼ (n 1 , . . ., n d ) and where n z is the random number of drawn individuals of cohort z. In considering the event time probabilities of each subgroup, the random number of particles excludes the use of many wellestablished results in survival analysis. Therefore, we begin with a somewhat restricted framework, and assume a known number of initial individuals of each type. Assume the sample contains a known number n z ¼ a z n, a z [ (0, 1), of individuals of cohort z, and let m nz j,z,t ! 0 be the number of the cohort z individuals who are in state j [ Z 4 at time t, so that is conserved. Also, we assume that there exists t z , 1 when all particles either become inactive or censored so that t z is the infimum time where the condition holds. For the sample of size n z , we denote the z-type cumulative hazard by L z,t and, respectively, define the z-type cumulative hazard and survival estimates bŷ where t^t z ¼ min {t, t z }, and d (Á,Á) is the Kroenicker delta function.

A.3. Convergence theorems
In order to guarantee convergence of the estimator, we make the following assumptions (based upon an initially known sample size distribution n).
Assumption A.1. We assume that the initial sample is chosen large enough to ensure that individuals of cohort z, at state 1 (active), exist at all points t [ [0, t z ], z [ f1, . . ., dg. That is, inf z[N d m nz 1,z,tzÀ . 0, a.s.
Since any survival function is monotone, an immediate result that follows from the above assumption is c , S z,tz S z,t 1, t ! 0, ( for some constant c . 0.
Assumption A.2. It is assumed that as n becomes large, the sample size for each individual type will grow to infinity. That is, Note that in the case of fixed censoring, that is, in the case that censoring exists only at time t, the above is satisfied by m z,t ¼ S z,t . In the general case, m z,t can be seen as the probability that an individual of cohort z has not yet left state 1. That is, m z,t is the probability that an individual has not left due to censoring or death by time t, and so m z,t ¼ S z,t C z,t2 , where C z,t is the probability that censoring has not occurred by time t. for any arbitrary h, e . 0. Therefore, from assumption A.2, since n z ! 1 a.s., the desired result follows. B Turning momentarily to the situation where there are two populations denoted by superscripts (1) and (2), for any t ! 0, definê