Journal of The Royal Society Interface
Open AccessResearch articles

Spatio-temporal surveillance and early detection of SARS-CoV-2 variants of concern: a retrospective analysis

Massimo Cavallaro

Massimo Cavallaro

School of Life Sciences and Mathematics Institute, University of Warwick, Coventry, UK

The Zeeman Institute for Systems Biology and Infectious Disease Epidemiology Research, University of Warwick, Coventry, UK

[email protected]

Contribution: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

Google Scholar

Find this author on PubMed

Louise Dyson

Louise Dyson

School of Life Sciences and Mathematics Institute, University of Warwick, Coventry, UK

The Zeeman Institute for Systems Biology and Infectious Disease Epidemiology Research, University of Warwick, Coventry, UK

Contribution: Conceptualization, Writing – review & editing

Google Scholar

Find this author on PubMed

Michael J. Tildesley

Michael J. Tildesley

School of Life Sciences and Mathematics Institute, University of Warwick, Coventry, UK

The Zeeman Institute for Systems Biology and Infectious Disease Epidemiology Research, University of Warwick, Coventry, UK

Contribution: Conceptualization, Writing – review & editing

Google Scholar

Find this author on PubMed

Dan Todkill

Dan Todkill

Warwick Medical School, University of Warwick, Coventry, UK

Contribution: Conceptualization, Writing – review & editing

Google Scholar

Find this author on PubMed

Matt J. Keeling

Matt J. Keeling

School of Life Sciences and Mathematics Institute, University of Warwick, Coventry, UK

The Zeeman Institute for Systems Biology and Infectious Disease Epidemiology Research, University of Warwick, Coventry, UK

[email protected]

Contribution: Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing

Google Scholar

Find this author on PubMed



    The SARS-CoV-2 pandemic has been characterized by the repeated emergence of genetically distinct virus variants of increased transmissibility and immune evasion compared to pre-existing lineages. In many countries, their containment required the intervention of public health authorities and the imposition of control measures. While the primary role of testing is to identify infection, target treatment, and limit spread (through isolation and contact tracing), a secondary benefit is in terms of surveillance and the early detection of new variants. Here we study the spatial invasion and early spread of the Alpha, Delta and Omicron (BA.1 and BA.2) variants in England from September 2020 to February 2022 using the random neighbourhood covering (RaNCover) method. This is a statistical technique for the detection of aberrations in spatial point processes, which we tailored here to community PCR (polymerase-chain-reaction) test data where the TaqPath kit provides a proxy measure of the switch between variants. Retrospectively, RaNCover detected the earliest signals associated with the four novel variants that led to large infection waves in England. With suitable data our method therefore has the potential to rapidly detect outbreaks of future SARS-CoV-2 variants, thus helping to inform targeted public health interventions.

    1. Introduction

    With more than 600 million confirmed cases and more than six million associated deaths worldwide by November 2022, the COVID-19 pandemic caused by SARS-CoV-2 has proven to be a substantial threat to global public health [1]. Its community-level burden is revealed by epidemiological surveillance, that is, the systematic collection of health-related data and the real-time monitoring of trends of incidence through time [2]. Although testing is generally performed for patient-centred and public health reasons, the data on the number, location and type of infection are also of central importance as they provide information useful for the implementation of effective interventions to mitigate the overall burden on public health [3]. Mathematical and statistical modelling is often a critical part of this analysis, enabling us to translate the raw data into a projection of current and future trends with associated uncertainty bounds [4,5]. While vaccination campaigns have decreased the overall risk of severe outcomes for COVID-19 cases and many countries have been phasing out restrictions since 2022, ongoing active surveillance is still required due the rapid emergence of SARS-CoV-2 mutants with new, potentially threatening, phenotypic properties [6]. SARS-CoV-2 surveillance includes identifying new variants, assessing their outbreak potential, and timing and targeting interventions to geographical regions who need them the most [7]. This last step is extremely important as interventions, either pharmaceutical or non-pharmaceutical, are most effective when enacted early and yet they are both economically and socially costly, and should therefore be avoided where and when unnecessary. The quest for tools to detect increases in the number of infections had led the research community to develop a number of statistical surveillance systems (as reviewed in [811]), and, more recently, model-independent methods rooted in the theory of dynamical systems [1215].

    In this paper, we retrospectively apply a spatio-temporal statistical surveillance method called RaNCover (random neighbourhood covering) to study the relative prevalence of SARS-CoV-2 variants and flag the locations where a new variant was set to overtake previous lineages. RaNCover was recently developed to monitor invasive Group-A Streptococcal disease (iGAS) cases [16], but as we demonstrate here, can be easily applied to other epidemiological situations to detect a change in the pattern of cases.

    In general, the detection of a change in surveillance data is a two-step process: first a simple model is fitted to predict the expected behaviour and uncertainty (e.g. predicting the distribution of the number of cases) at each location and time; second, these projections are then statistically compared to the observed counts. If the observation is above a specific percentile of the prediction interval—a situation which we call exceedance—then an alarm is raised [17]. One of the major challenges encountered in the development of an effective surveillance system is that available data are often patchy, with missing information in certain geographical locations and time intervals, and with varying sampling efforts. Dealing with insufficient data is particularly common at the beginning of an outbreak (of a new pathogen or new variant), when the number of cases is still low but is growing fast. Such sparse observations are often prohibitive when monitoring local trends and are subject to strong fluctuations that can potentially trigger many false positive alarms. One approach to this issue is to improve the prediction of the observed number of cases in a given space–time region by integrating the information available from nearby regions [18]. The random neighbourhood covering method introduced in [16] is an alternative approach: instead of improving model prediction in a spatio-temporal region, it aggregates information from multiple weak exceedance tests performed in neighbouring areas to improve the final warning signal relative to an area of interest. In other words, while the individual tests performed on single locations are likely to perform poorly, their consensus estimates have far better properties—an idea which is commonly exploited in ensemble statistical learning [19].

    Here we considered SARS-CoV-2 polymerase-chain-reaction (PCR) test results collected in routine community testing in England (known as Pillar-2 testing) from August 2020 to February 2022. PCR results are obtained with a short delay (a few days in general) after testing and thus they offered a good opportunity to perform regular population-level surveillance. While these tests are principally used to inform about the aetiology of infection (distinguishing SARS-CoV-2 infection from other respiratory pathogens), the sum of all positive tests is a useful indicator of the scale of the epidemic. Here we focused on testing performed with the ThermoFisher TaqPath PCR Kit. This assay is built with redundancy as it targets three highly specific regions of the SARS-CoV-2 genome, including a region in the S-gene (the gene that encodes the spike protein). When a sample fails to amplify one or two of the three targets, it can still be attributed to SARS-CoV-2, but the failure may also signal the presence of a SARS-CoV-2 variant: the Alpha and Omicron BA.1 variants fail to trigger the S-gene targeting element of the TaqPath system, whereas wild-type, Delta and Omicron BA.2 do lead to S-gene positive tests [20,21]. While whole-genome sequencing is necessary to identify variants with certainty, S-gene detection has allowed a rapid assessment of the invasion of novel variants over time—without the delays (and expense) that are associated with genotyping (although some care is needed when the sample has a low viral concentration as indicated by a high cycle threshold, Ct, value). Historically, these data permitted the observation of SARS-CoV-2 Alpha variant spreading and becoming the dominant lineage in few months after being first detected in September 2020 [22,23]. At the time, the data on S-gene detection alone were unlikely have offered the means to raise early alarms on the invading variant. However, with the benefit of hindsight, we can now search for early warning signals in these data and study how timely invasions can be detected in the presence of general data on SARS-CoV-2 variants. Later, the Alpha variant was out-competed by the Delta variant which became dominant in England by late May 2021 [24]. A third group of variants with both increased transmissibility and immune escape, called Omicron, were first detected in England at the end of September 2021 [25]. Each of these novel variants (Alpha, Delta and Omicron) grew at a rate higher than those preceding, thus causing three major waves of infection during the study time.

    Variants are defined by their genetic characteristics, and are initially labelled as variants of interest or variants of concern by public health agencies, such as the World Health Organization (WHO) and the UK Health Security Agency (UKHSA) in UK, based on their epidemiological characteristics. Variants of interest (VOIs) are those with genetic changes expected to increased COVID-19 transmissibility, severity and diagnostic or therapeutic escape and actually observed to cause significant community transmission in multiple countries or other epidemiological repercussions, thus suggesting an emerging risk to global public health. Variants of concern (VOCs) are those where there are further detrimental changes in COVID-19 epidemiology, clinical disease presentation or in the effectiveness of public health measures and diagnostics. As such, variants are often first classified as VOIs before becoming VOCs, although not all will lead to a major increase in cases. The Alpha, Delta and Omicron variants were designated as VOCs by the WHO and the UKHSA [7,2628]. However, detecting a new variant is only a part of the surveillance effort as policy advisors need summarized information, such as how quickly and in which regions the variant is spreading. Here, we analysed the historical invasions of four variants—Alpha, Delta, Omicron BA.1 and Omicron BA.2—in England with RaNCover, highlighting the Lower Tier Local Authority (LTLA) districts with exceedingly high levels of a variant as time progresses, and thus anticipating country-wide waves. The application of this approach in the future has the potential to provide a rapid assessment of a variant’s epidemiological potential and could inform early targeted interventions.

    2. Methodology

    Here we outline the two basic steps in the RaNCover methodology. We first describe the simple predictive model used to estimate the distribution of SARS-CoV-2 positive samples that are S-gene positive/negative each day in each LTLA. In brief, we estimate the number of S-gene positive/negative samples based on the number of tests in a given location on a given day and the recent historic proportions. Second, we focus on the comparison of this prediction with the observed data and the formulation of a warning score. We compare the true number of S-gene positive/negative in multiple overlapping regions and times around our point of interest, and compare each to our projections; warning scores are raised if the true values are significantly above our projections. A more detailed and technical description is given in the following two subsections.

    2.1. Baseline levels of positive and negative tests

    We considered SARS-CoV-2 positive tests in England performed with the ThermoFisher TaqPath system, an assay based on reverse-transcription polymerase chain reaction (RT-PCR). Detection of S-gene was used as a proxy to identify a variant, and S-gene positive and negative count data were aggregated by a patient’s LTLA as a proxy for location. In general, LTLAs contain between 50 000 and 400 000 inhabitants, although the Isles of Scilly is the smallest LTLA with under 2500 residents while Birmingham with 1.1 million residents is the largest. To assess the potential for a variant to be concerning, we measured the exceedance in the number of S-gene positive and S-gene negative cases with respect to a baseline level, which we derived as follows. For a given LTLA location i and day t, the aggregated data consist of the total number Ni,t of strongly positive community PCR tests using the TaqPath system (with a cycle threshold less than 30) and the number Mi,t of these tests which are positive for the S-gene. The proportion of S-gene positive tests, Mi,t/Ni,t which is a proxy of the prevalence of a variant, is denoted as Pi,t. The day t corresponds to the date on which a swab sample was collected; in general, the TaqPath results were then available 1–2 days later, which would add a slight delay to any detection mechanism. The objective of our methodology is to find any deviations from a null model for Mi,t in the observed data. The null model does not need to describe the data accurately, but only formalize how the data would behave if the dynamics of new infections were stable. We therefore estimated the average proportion p^i,t for our null model as a running average using values observed in the previous T = 15 days,

    where the weight distribution gτ is taken to be exponential with mean 5.5 days set by the approximate time-scale of infection duration [2931]. This has the net effect of combining the observed statistics for each day and smoothing the test statistics over time. Conditioned on the total number of COVID-19 positive tests (Ni,t) and the null model, the expected number of S-gene positive tests observed is m^i,t=p^i,t×Ni,t. Analogous baseline values for the observed number of S-gene negative tests Mi,t := Ni,tMi,t are m^i,t=p^i,t×Ni,t, where p^i,t=1p^i,t.

    2.2. Exceedance probability and random neighbourhood approach

    We let

    denote the Poisson random variable that describes the number of S-gene positive cases detected under the null model in location i at time t. Let us define a threshold value θ(m^i,t) such that P[Xi,t>θ(m^i,t)]=α, i.e. the probability that Xi,t exceeds the threshold is α. In other words, θ is the quantile function evaluated at 1 − α; see also electronic supplementary material, text S1. The basic exceedance criterion consists of raising a warning flag at location i at time t if the number of S-gene positive cases exceeds the threshold defined by the null model,
    An analogous exceedance criterion for the number of S-gene negative cases is Mi,tθ(m^i,t). Exceedance probabilities under the Poisson model (2.2) and a more refined negative-binomial model are discussed in the electronic supplementary material, text S2.

    A limitation of this naive approach is that there is a probability α of triggering a warning signal by chance even when the data come from the null model. A second limitation of the naive approach is that the criterion (2.3) is only applied to a single data point (one location and one time) whereas outbreaks are typically spatially and temporally extended; therefore it is desirable to test exceedance over a number of locations and days simultaneously, albeit it is impossible to know the outbreak range prospectively. We detect aberrations from the baseline by means of the random neighbourhood covering (RaNCover) approach of Cavallaro et al. [16]. RaNCover tackles the limitations mentioned above by considering the exceedances over many overlapping spatio-temporal sets (chosen to be cylinders whose heights and circular bases correspond to the time and geography components, respectively) that cover a region of interest and its neighbours. By virtue of the Poisson assumption (2.2), the number of observed S-gene positive or negative cases in a spatio-temporal set C is also a Poisson random variable, with intensity given by the sum of the intensities at all points (i, t) ∈ C, i.e.

    This allows us to flag a cylinder if
    which is analogous to condition (2.3). A warning score wi,t specific to the location (i, t) is then defined as the percentage of flagged cylinders over all sampled cylinders C such that (i, t) ∈ C,
    where 1 is the indicator function. Cylinder dimensions were randomly sampled, with uniformly drawn pairs of radii and heights (ρ, h) such that the cylinder volumes are constant (π × h × ρ2 = c ≈ 4066 day × km2). Upper limits to the heights and radii were heuristically set to 16 days and ≈36 km, respectively, in order to cover short outbreaks over several LTLAs. Cylinder positions were also randomly chosen such that each cylinder contains an LTLA centroid location (this can be easily achieved by centring each cylinder at one of these locations and then shifting in space by an amount smaller than ρ; see also figure 1). To condition on cylinders not being empty, and therefore look at the neighbourhoods of identified test locations, the relation (2.2) is corrected to Xi,tPoi(m^i,t)+1 (see also [16]). The algorithm was recursively applied to each day within the study period. At each step, one of the bases of the cylinders was positioned on the plane corresponding to the current day, ensuring that the cylinders only extended towards the past and did not cover future time points. By doing so, all warning scores were computed using exclusively past information, thus emulating a real-time application. We drew 500 000 cylinders at each step. We expect that the statistical properties improve as the total number of cylinders and the sample size over which the sum in equation (2.6) is evaluated increases, but the estimation of the uncertainty associated with the warning scores requires additional theoretical studies (as an example, using confidence intervals for proportions, based on the number of cylinders covering a point event, as in the illustration S3 of Cavallaro et al. [16], could underestimate uncertainties due to correlated observations in overlapping cylinders). On the other hand, as also discussed in [16], the algorithm appears to be robust to changes in cylinders’ numbers and dimensions.
    Figure 1.

    Figure 1. Illustration of RaNCover approach. Many overlapping cylinders with circular bases enclosing several LTLA centres and heights corresponding to time intervals are drawn (a). For a generic cylinder C (e.g. the one with red circular basis), the observed number of cases from daily data in the enclosed LTLAs and the null-model intensities are aggregated (b, orange lines are positive test numbers in the LTLAs and red lines are total counts in the circle). In the highlighted cylinder, the total number of positive results is 1809, while the baseline prediction is ≈1957. According to criterion (2.5) the cylinder is not flagged.

    It is not desirable to have a very low baseline, since this could trigger a signal even with one or two cases. We therefore heuristically perform the substitution m^i,tmax(m0,m^i,t) to mitigate this effect. The cutoff m0 is chosen as the lowest value of intensity that does not trigger a warning if there is one S-gene positive case observed, with P(X>1|m0)=0.95. To evaluate the effect of decreased testing capacity, we considered a second scenario with half numbers of both S-gene positive and negative test results (rounded to the nearest integer for each time and location; see electronic supplementary material, text S3) and computed new warning scores using formula (2.6) with the new values of Mi,t and Mi,t. As a by-product of the procedure used to calculate the warning scores, we also obtain, for each cylinder, the fraction of positive (negative) tests over all test results in the area and time delimited by the cylinder. A moderated estimate of Mi,t (or Mi,t) is the average of the ratio in all cylinders that contain (i, t). This generates a spatially and temporally averaged value for the proportion of all suitable tests that are S-gene positive (or S-gene negative) providing a measure of the invasion of new variants over time (see electronic supplementary material, movie S1).

    3. Results

    For each of the main four SARS-CoV-2 variants that emerged during the study period (Alpha, Delta, Omicron BA.1 and Omicron BA.2), we study patterns of S-gene test results and warning scores over time in all of England; aggregated test data are illustrated in figure 2, along with key dates of non-pharmaceutical interventions and VOC definitions as reported in [7,32].

    Figure 2.

    Figure 2. Pillar-2 test results reported in England over the study period. The solid grey line shows the total number of SARS-CoV-2 positive tests, which tested either negative or positive to S-gene target (blue or orange lines, respectively). This dataset gives a proxy for the relative prevalence of a variant, with peaks corresponding to intensified testing or increased SARS-CoV-2 prevalence. Increased S-gene negative tests at the beginning of 2021 corresponds to Alpha invasion. The S-gene negative peak in July 2021 is due to the Delta variant and the growth in S-gene negative tests at the end of 2021 corresponds to the first Omicron invasion. The horizontal bar highlights the introduction and relaxation of restriction measures (shades of red) on the following key dates: (a) 14 October 2020, introduction of the 3-tier system to control local outbreaks; (b) 5 November 2020, enforcement of the second national lockdown; (c) 2 December 2020, lockdown ends and reintroduction of a (stricter) 3-tier system; (d) 21 December 2020, introduction of the 4-tier system to control Alpha outbreak; (e) 6 January 2021, the third national lockdown starts; (f) 8 March, 12 April, 17 May and 19 July 2021, 4-step roadmap to full relaxation of restrictions; (g) 20 December 2021, mask mandate for public spaces to control Omicron. The three vertical dashed lines mark the dates when Alpha, Delta and Omicron were declared variants of concern (VOCs).

    3.1. Alpha variant

    The SARS-CoV-2 variant of concern named B.1.1.7 (PANGO nomenclature [33]), also referred to as Alpha, has a specific mutation in the S-gene which results in a deletion of two amino acids of the spike protein (69-70del) and failure of S-gene target amplification in the TaqPath assay [34]. The proportion of S-gene target failures (SGTFs) over SARS-CoV-2 positive TaqPath PCR tests is thus a proxy for the relative prevalence of Alpha compared to other non-SGTF variants. Alpha was first detected in the COVID-19 Genomics UK Consortium (COG-UK) [35] genome data in Kent on 20 September 2020 and spread quickly in London and then in other parts of England. Based on SGTFs, it is estimated that at the beginning of November 2020 5% of all cases were ascribed to this variant. Alpha was noted as a potential variant of concern and was designated a VOC on 18 December 2020 [7,26], although the likely implications of this first (major) new variant were not yet fully understood on 2 December 2020, when the second lockdown period ended in England. By the end of January 2021, more than 95% of all new COVID-19 cases in England were likely caused by Alpha. While this VOC had greater transmissibility than its predecessors, its spread was driven by the increased movement of people across dominant source locations such as London [36]. As Alpha was first detected in South East England, and so effectively spread from a point location, our dataset provides an opportunity to test how quickly its potential as a VOC could have been assessed given epidemiological data from the variant’s emergence up until its designation as a VOC.

    Some degree of geographical spread can be observed in data aggregated at regional level (solid lines in figure 3a,b); indeed, by 19 November 2020 a quarter of all cases in the South East were attributable to Alpha, followed by reaching a quarter SGTF in the East of England (21 November), London (24 November) and South West (6 December) NHS regions. Aggregating data at such large scales hides local details, such as which LTLAs are invaded earlier than the others. However, with fewer tests available in a small region, detecting a local outbreak early can be difficult. This is therefore an ideal test of the RaNCover algorithm to highlight patterns on a finer spatial scale. A key Alpha hot spot appeared to be Folkestone and Hythe (an LTLA in the South East region of England) where the warning score first peaked (w>90%) on 15 October 2020 (with only 5 SGFTs and 10 SARS-CoV-2 positive tests on that day) as illustrated in figure 3c. In early November 2020, the warning scores increased in a large area including Greater London and surrounding LTLAs (e.g. figure 3d), forewarning an outbreak in this area even though the fraction of detected S-negative tests in single LTLAs was still below 17% (lower than the levels reached in Folkestone and Hythe mid-October). The timeline of when each LTLA achieved a 90% warning score threshold, and hence a measure of the geographical spread of the Alpha variant is summarized in figure 4. During the initial phase of Alpha spread, anomalies were detected in different locations across England, presumably corresponding to S-negative variants (Alpha or others) which were not able to locally out-compete the wild-type (see electronic supplementary material, movie S1). Since 69–70del mutation also existed in other variants, these signals could also indicate variants which did not fix in the population. At the end of 2020 almost all LTLAs in England were flagged by high warning scores (figure 4); later, warning scores began to reduce as the presence of Alpha became the norm, starting from the most southern and eastern LTLAs, although some LTLAs in the NorthWest (around Manchester) flagged up until the end of January 2021 (see electronic supplementary material, movie S1).

    Figure 3.

    Figure 3. Progression of the Alpha variant in selected regions and LTLAs. (a) Green dots are the proportion of SGTFs in Folkestone and Hythe (F&H), with shaded areas representing 0.25–97.5% Wilson CIs; uncertainly is substantial due to the low number of SARS-CoV-2 positive tests. For comparison, proportions of SGTFs per region average in South East, East of England, London and South West are reported in green, magenta, blue and grey solid lines, respectively. The green dashed line represents the expected proportion in F&H according to the null-model baseline. (b) Proportion of SGTF tests in the City of London (blue dots) compared with local baseline (dashed blue line) and SGTF proportions per region average in South East, East of England, London and South West reported for comparison. (c) RaNCover detects anomalies in F&H (red line is the warning score temporal trend) thus suggesting that the observed proportion of SGTFs is significantly higher than the null model baseline. (d) Increased warning scores (w>90%) in City of London LTLA (red line) clearly flags the presence of an outbreak from 7 November 2020, even if the proportion of SGTFs is still below the level reached in F&H. As in figure 2, the dashed vertical lines mark the Alpha VOC definition date.

    Figure 4.

    Figure 4. Progression of the Alpha variant and warning scores. (a) Spatial diffusion of warning flags for SGTFs during the Alpha invasion. The colour map from yellow to blue indicates the date when an LTLA warning score first exceeded a threshold value of 90%, starting from 15 October 2020, when Folkestone and Hythe LTLA is first flagged (figure 3). LTLAs in the South and Greater London are flagged first. Temporal trends in Coventry (b) and Dorset LTLAs (c). The observed proportions of SGTFs are marked by dots, with dashed lines representing null-model expectations and shaded areas 0.25–97.5% Wilson CIs for proportions. Warning scores are represented by solid red lines. Vertical dashed lines mark the WHO declaration of Alpha as VOC on 11 May 2021.

    The strength of the warning-score patterns depends on the total number of SARS-CoV-2 tests. With fewer tests, the proportion of SGFTs is estimated with wider confidence intervals and the increased uncertainty is also reflected in the warning scores; indeed the LTLAs in less populated regions, e.g. Folkestone and Hythe (figure 3c) and Dorset (figure 4c), display fluctuating warning scores, especially at beginning of the outbreak. Nevertheless, warning flags are explicitly raised. Using down-sampled data (§2.2; see also figure S3 in electronic supplementary material, text S3) yields slightly delayed signals. Folkestone and Hythe exceeds the 90% threshold on 26 October 2020 (11 days later than the full data result and with 68% of SGFTs, but with only 16 SARS-CoV-2 positive tests). City of London is flagged only two days after the full sample date, with three SGFTs over a total of 26 SARS-CoV-2 tests (instead of three negatives over 18 tests from the full-sample warning date).

    3.2. Delta variant

    Starting from 6 January 2021 (when the third lockdown began), England experienced a period with no spread of novel variants of concern and decreased levels of SARS-CoV-2 positive tests. The VOCs that emerged directly after Alpha did not have the same spike-protein deletion as Alpha, thus testing S-gene positive in the TaqPath assay, and providing a means of rapid identification against a background of Alpha cases. The most common of these variants were B.1.351 (first identified in South Africa), P.1 (in Brazil) and B.1.617 PANGO lineages—with sublineage B.1.617.2 (also referred to as Delta) first identified in India, escalated in the UK on 6 May 2021 [27] and designed VOC by the WHO on 11 May 2021 [7]. S-gene positive results provided rapid signals for investigating community spread of these variants [21]. Phylogenetic studies demonstrated that Delta sublineages were introduced more than 1000 times from India before travel restrictions were introduced on 23 April 2021 [37]. When COVID-19 restrictions were relaxed on 17 May 2021 [38] (as England entered Step 3 of the relaxation process), all sublineages spread from independent seeding events and grew at a similar pace, with associated Delta clusters initially documented in the North West region [37]. Our data show that in England Delta overtook Alpha in less than three months, from May to June 2021. To detect local anomalies in S-gene positive test incidence we again used RaNCover. The NorthWest area including and surrounding Greater Manchester was acknowledged at the time as a hotspot [39]; our warning score in this region first exceeded 90% in Warrington LTLA on 2 May 2021 (with Manchester, the most populated LTLA in the region, exceeding 90% on 7 May; figure 5a). Interestingly, RaNCover flagged the South West LTLA of Somerset West and Taunton significantly earlier than elsewhere, with w>90% on 23 April 2021. This suggests that an early outbreak of Delta (or another S-gene positive variant) might have occurred in that area unreported (figure 5b,c). A London outbreak followed, with warning score exceeding 90% in the City of London LTLA on 19 May 2021.

    Figure 5.

    Figure 5. Progression of the Delta variant and warning scores for the proportion of S-gene positive tests. Temporal trends in Manchester (a) and Somerset West and Taunton LTLAs (b). The observed proportions of tests are marked by dots, with dashed lines representing the expectations according to the null-model baseline. Shaded areas are 0.25–97.5% Wilson CIs for proportions. The warning score trend in Somerset West and Taunton LTLA flags the presence of aberrations from the null model, with w>90%, on 23 April 2021. Manchester LTLA (North West region) has w>90% on 7 May 2021; on that day the fraction of S-positive tests was reported as zero. The vertical lines mark the WHO declaration of Delta as VOC on 11 May 2021 (dashed line). (c) Spatial diffusion of warning flags for S-gene positive tests during the Delta invasion. The colour map indicates the date when an LTLA warning score first exceeded a threshold value of 90%, starting from 23 April 2021. LTLAs in Somerset West and Taunton, East Devon, Mid Devon, and Torbay (yellow colour) are flagged first, followed by the North West (starting from Warrington LTLA) and London areas.

    Delta invasion occurred in a period of decreased SARS-CoV-2 positive tests, as also illustrated in figure 2 (note the minima in the total number of positive tests at the beginning of May 2021). This obviously affects the quality of an early warning signal system. Using only half of the available data, warning scores did not reach the 90% level in certain LTLAs, yet they still reported the presence of anomalies. For example, in Warrington LTLA and Somerset West and Taunton LTLA, the warning scores peaked at w=0.88% on 8 May and at w=0.86% on 26 April 2021, respectively. In this scenario, Manchester and City of London LTLAs reached w>90%, respectively, 12 and 5 days after the dates obtained from processing all available tests, thus demonstrating, on the one hand, the importance of widespread testing coverage; on the other hand, it is worth noting that RaNCover provides useful information even in this low-testing scenario (see figure S4 in electronic supplementary material, text S3).

    3.3. Omicron

    The Delta variant dominated in England throughout the summer and autumn of 2021. On 27 November 2021, the S-gene negative Omicron variant (BA.1) was first detected in the UK after being imported from South Africa. At the time of detection, it was deemed that the variant was likely to result in a rapid increase in cases due increased transmissibility and immune evasion and it was declared a VOC on 26 November 2021 [7,28]. It was also observed that this VOC was associated with decreased severity [7,40]. To slow down Omicron spread in England’s largely vaccinated population, a mask mandate was re-introduced on 20 December 2021. The Omicron wave in England was also characterized by a peaking number of SARS-CoV-2 positive tests (figure 2; this may be due to increased prevalence, increased testing capacity and increased awareness). The Omicron lineages have more than 30 mutations in the spike protein alone. In particular, the original Omicron variant, called BA.1, also shows 69-70del, thus resulting in SGTFs in the TaqPath tests. Omicron BA.2 has fewer mutations than BA.1 and lacks the genetic deletion on the spike protein that causes SGTFs; it has also been reported to be more transmissible than BA.1 [41,42].

    In England the S-gene negative Omicron (BA.1) variant spread during November and December 2021, overtaking Delta with more than 50% of all cases on 12 December 2021 and reaching 95% prevalence on 26 December 2021 according to the TaqPath testing dataset. The LTLAs flagged first by the application of RaNCover are Bury, in the Manchester area (figure 6a), and Northampton, in the Midlands, which exceeded a 90% warning level on 1 December 2021. This was a clear early warning, anticipating the Omicron wave, especially given that total number of BA.1 Omicron cases confirmed by genotyping across the whole of England by this time was only 22 [43]. Owing to the large number of tests during December 2021, the proportion of S-gene negative tests within an LTLA is estimated with high confidence, which yields strong early warning signals with RaNCover achieving w>90% with only 0.03%, 0.26% and 0.17% of SGTF in Bury, Northampton and City of London LTLAs, respectively (figure 6ac). In these three LTLAs, the warning scores obtained from the down-sampled dataset, respectively, exceeded the 90% threshold only four, one and two days after the dates marked by the full dataset, respectively (see also figure S5 in electronic supplementary material, text S3). The short delay is ascribed to an exceptionally large number of SARS-CoV-2 positive tests during this phase of the epidemic (figure 2).

    Figure 6.

    Figure 6. Progression of the S-gene negative Omicron variant (BA.1). Temporal trends and warning flag in Bury (a) and City of London (b) LTLAs. The observed proportions of tests are marked by dots, with dashed lines representing the expectations according to the null-model baseline. Shaded areas are 0.25–97.5% Wilson CIs for proportions. The warning scores (solid red lines) display abrupt departure from the null-model expectations, with w>90% on 2 December 2021 in the City of London and on 1 December 2021 in Bury and Northampton LTLAs. At those dates the reported fractions of SGTF in the single LTLAs were very low. The grey dashed vertical line marks the WHO declaration of Omicron as VOC on 26 November 2021 (see also figure 2). (c) Spatial diffusion of warning flags during the Omicron BA.1 invasion. The colour map indicates the date when an LTLA warning score first exceeded a threshold value of 90% starting from 1 December 2021. The areas flagged first are Manchester, Northampton and London.

    A second Omicron invasion began in early 2022, with increasing numbers of S-gene positive (Omicron BA.2) cases in LTLAs in the London area; see figure 7. As of 10 January 2022, 53 sequences of the S-gene positive BA.2 sub-lineage of Omicron had been identified in the UK [43]. RaNCover raises a warning signal in the City of London with w>90% on 17 January 2022 (18 January using the down-sampled data), when only the 0.01% of positive SARS-CoV-2 tests in that LTLA were positive to the S-gene target (figure 7a). By the end of the study period on 20 February 2022, the Omicron BA.2 variant was set to replace the S-gene negative Omicron.

    Figure 7.

    Figure 7. Progression of the S-gene positive Omicron variant (BA.2). Temporal trends in the City of London (a) and Manchester (b) LTLAs. The warning scores reach the threshold value (w>90%) on 17 January 2022 (City of London) and on 23 January 2022 (Manchester), despite the fraction of S-gene positive tests being still very low (0.01 and 0.04, respectively). Shaded areas are 0.25–97.5% Wilson CIs for proportions. (c) Spatial diffusion of warning flags for SCGF during the Omicron BA.2 invasion. The colour map indicates the date when an LTLA warning score first exceeded a threshold value of 90%, starting from 17 January 2022.

    3.4. Local anomalies of the dominant variant

    So far we have generated warning scores by considering statistically significant increases in the non-dominant S-gene type (e.g. increases in the proportion of S-gene negative tests when the dominant variant was S-gene positive). This provided a signal for invasion of a new variant (assuming it is different in terms of S-gene detection). However, our methodology can also be applied to anomalies in the S-gene type of the dominant variant, due to increases above the expected baseline which never achieves 100% dominance (figures S7 and S8 in electronic supplementary material, text S4). As an example, we initially consider the proportion of S-gene positive tests during October 2020 (figure S7A and D in electronic supplementary material, text S4), when we observe widespread and diffuse warning scores due to an increase in the proportion of S-gene positive samples generating multiple exceeding warning scores (w>90%). In the absence of detailed genomic investigation, it is difficult to state whether these correspond to local outbreaks in wild-type SARS-CoV-2 (or other non-documented S-gene positive variants) or to random fluctuations in the early testing regimen. In early October 2020, we also observed a transient warning score for S-gene negative variants in the North West of England (figure S7C in electronic supplementary material, text S4). This is the only clear false positive signal of variant change observed in our study. It is again unclear as to the reasons behind this signal: it could be due to a short-lived invasion of a S-gene negative variant, it could be a consequence of higher than normal levels of S-gene positive tests in the rest of the country (as illustrated by figure S7A in electronic supplementary material, text S4), or it could be a random fluctuation due to low levels of testing at this time. Patterns for local variations in the dominant S-gene type can observed throughout 2021. In March 2021, during the Alpha wave, there are raised warning scores (w>90%) in East Anglia with an increase in the dominant S-gene negative tests (figure S8A-B in electronic supplementary material, text S4). During the Delta wave there are high warning scores due to an increase in the proportion of S-gene positive tests (figure S8C-D in electronic supplementary material, text S4), clustered in four separate occasions associated with periods of increase in testing.

    4. Discussion

    Here we have used data from community PCR testing for SARS-CoV-2 generated by the TaqPath system [20,21] to explore the early invasion of new variants of concern. The TaqPath system was designed to accurately diagnose SARS-CoV-2 infections. It is also able to detect the S-gene present in the ancestral wild-type variant, the Delta variant, and the Omicron BA.2 variant, while it fails to amplify the S-gene target in Alpha and Omicron BA.1 variants, thus providing a way to discern if the infection is due to one of these variants. In fact, the system has been used as a rapid and reliable means of tracking new variants [2022,24,36], which operated more rapidly and at a greater scale than genomics testing [44]. The fortuitous pattern of invasion of SARS-CoV-2 variants with alternating S-gene target results, spanning over 1 year from September 2020 up until February 2022, also provided an ideal avenue for studying the spread of SARS-CoV-2 variants and evaluating outbreak detection methodologies. The growth and detection of new variants are generally subject to considerable stochastic fluctuations; as such, simple methods of estimating the early growth rate or looking for changes are error-prone in the early stage of invasion. Simple methods may also be influenced by changes in testing or contact behaviour, which will affect the observed growth of both resident and invading variants. Here we use the RaNCover methodology, previously developed to detect anomalies in disease patterns subject to strong fluctuations due to low count numbers (with the initial case study focused on the detection of invasive group A Streptococcus outbreaks [16]). We apply this to the proportion of positive cases which are S-gene negative (for invasion of Alpha and Omicron BA.1) or positive (for Delta and Omicron BA.2), thereby reducing the impact of changes in testing behaviour.

    RaNCover is a method to highlight aberrations from a baseline level that uses as its null model the observations expected in an idealized, steady-state, situation. The general approach consists of drawing, inspecting and aggregating the information in a large number of spatio-temporal cylinders. In the original formulation of RaNCover [16], which was tailored to detect clustered outbreaks in sparsely distributed cases of infection, our null model assumed cases occurred at a Poisson rate proportional to population sizes in each location. Here we used a different baseline (null model) to study the relative proportion of SARS-CoV-2 variants in LTLAs and detect early the invasion of a novel variant. We defined the baseline as the expected number of S-gene positive or negative tests (indicating the presence of a variant) under the assumption that the proportion of S-gene positive (or negative) samples detected is stationary. Therefore, an aberration from this baseline is an early warning signal that a new variant is increasing (relative to the resident variant) in a geographical area and that a local outbreak might occur. As this approach detects increases in proportion relative to other SARS-CoV-2 sublineages, it is relatively unaffected by changes in the overall number of detected cases (such as may be driven by variations in testing behaviour, social mixing or changes in population immunity). As a second step, we process the information in these cylinders and obtain a continuous warning score w (the larger the value of w the higher the warning level) for each location.

    In this retrospective application to S-gene tests, we were able to rapidly detect each new variant as its prevalence started to increase within local populations. We detect signals of the rise of the Alpha variant in Folkestone and Hythe LTLA (in the South East of England) on 15 October 2020, nearly two months before it was declared a variant of concern. We detect early warning signals for the Delta variant from the area around Manchester on 6 May 2021, as well as for an earlier S-gene positive anomaly (potentially also due to Delta) around Somerset on 23 April 2021 which, to the best of our knowledge, has not been documented before. Again, these signals pre-date 11 May 2021, when the Delta variant was first declared a variant of concern. Given the large amount of testing in late 2021 and early 2022, we are able to rapidly detect the relative growth of Omicron variants (as shown by the tighter confidence intervals in figures 6 and 7 than in figure 3). Warning signals for Omicron BA.1 are detected in the North East and Greater London on 1 December 2021 and 2 December 2021, respectively, when the prevalence detected by other means was still low. Finally, the Omicron BA.2 variant generates a warning in Greater London on 17 January 2022. We only find one example of a potential false positive from the RaNCover methodology, in October 2020 when a slight increase in S-gene negative tests was detected in the North West of England centred around Kirklees, Oldham and Rochdale (figure S7C in electronic supplementary material, text S4). It is impossible to know whether this warning is a statistical fluctuation or a short-lived invasion without a detailed genomic investigation of cases over that time period.

    These detection events, based on S-gene tests, benefit from hindsight. In October 2020, we were not expecting to observe new variants, and any changes to S-gene detection would not have been an immediate cause for alarm. We therefore recognize that detailed genomic investigation is needed to interpret data on circulating SARS-CoV-2 variants. PCR testing in England declined from its maximum of over a hundred thousand detected cases per day in early 2022 (when over half a million tests were performed each day), to under a thousand a day since 2 April 2022 and continued to decline. This drop corresponded to the change in testing policy [45], when symptomatic testing was restricted to vulnerable and at-risk populations and extensive PCR-based testing was abandoned in favour of integrated genomic surveillance, as per WHO recommendations [2,46].

    Genomic sequencing remains the gold-standard method to identify which variant of SARS-CoV-2 is present in a specimen. A range of different Omicron variants have been found in genomically sequenced specimens after the end of the Omicron BA.2 outbreak in May 2022, denoted BA.4, BA.5 and BA.2.12.1, with BA.5 becoming dominant around August 2022 in the UK. At the time of writing, a number of additional variants, i.e. BA.2.75, BQ.1, XBB, XBB.1.5 and the so-called recombinant variants (created by the combination of genetic material from two different variants), are simultaneously circulating [43,47], thus demonstrating increasingly complex infectious disease dynamics, arguably resulting from interactions between multiple variants, immune escape, vaccination and other public health interventions, and behavioural changes. The presence of multiple variants also invalidates identification methods based on S-gene target only. However, the RaNCover approach presented here can be easily adapted to more comprehensive genomic data. In fact, genomic sequencing can be used to determine the precise viral variant (or sub-variant) that caused an infection case and this information can be aggregated to estimate proportion of each variant among all the sequenced samples at a particular time (as required by our methodology; see §2.1), although this would introduce additional delays into the system and would reduce the accuracy of statistical approaches as the number of samples would likely be lower. Genomic surveillance has also the limitation that the observed frequency of variants may not be representative of population variant frequencies, mainly due to biased selection of samples sent for sequencing, especially after April 2022. However, effective surveillance does not require the sequencing of a specimen from every COVID-19 case, or even from random representative samples, but may perform well using just selected samples, e.g. patients in hospitals [48]. In this respect, our formulation of the baseline, conditioned on the number of tests per area, remains helpful and can be arguably extended to genomic surveillance contexts.

    We reported that the generation of early warning signals greatly benefits from increased volumes of data generated by the large number of positive PCR tests obtained during the SARS-CoV-2 pandemic. As disease surveillance is rapidly shifting towards using genomics data, which permit simultaneous monitoring of more variants but have lower throughput than PCR, meticulous validation studies of early warning systems based on these sparser sources of data are required. For comparison, the application of genomic surveillance of SARS-CoV-2 in England has produced on average more than 38 000 sequenced SARS-CoV-2 samples per month, from January 2022 to February 2023, according to the data available at [49]—this is much less than the average ≈420 000 community cases per month detected with PCR during our study period. On the other hand, it is also worth noting that we encountered and studied a low-count scenario at the beginning of the Delta wave (see figure 2, with less than 1000 positive PCR tests per day at the end of April 2021) and RaNCover was still able to get early warning signals during this phase, thus suggesting the wider applicability of our approach in genomic surveillance. As also discussed in Cavallaro et al. [16], RaNCover is actually well suited to extract warning signals from sparse data.

    One interesting spatial pattern that can be pulled from the continuous warning scores is the timing at which warning flags (e.g. with w>90%) are first raised, before a variant predominantly spreads. Comparing figures 47 several factors in common emerge. Firstly, London is usually one of the first areas to generate flags; even for the Alpha and Delta variant that were initially centred elsewhere, London was the second and third geographical hotspot to be identified, respectively. This is presumably due to the number of transport links to London helping to facilitate the spread of infection, and the amount of international travel associated with the city bringing imports of novel variants. Secondly, there is generally a wave-like spread of warnings with the occasional long-range jumps, as would be expected from a spatial transmission kernel with a heavy tail [50]. Finally, the South West and to a lesser extent East Anglia are generally the last to experience warnings, which can be attributed to their lack of major cities and long journey times which mitigate the spread of infection. Data on the spatial spread of warning scores over time (as exemplified by figures 47) may be helpful to predict the likely time-scales for any new variant to spread before it becomes dominant, with Alpha and Delta taking around two months and Omicron BA.1 and BA.2 only requiring about two weeks (likely due to higher transmission and immune evasion of the variants and fewer public health interventions at the time).

    The ability to rapidly detect the relative growth of a new variant in a region also has important public health implications. Any information on the heterogeneous distribution of infection allows resources to be targeted if necessary, helps inform local public health messaging and awareness or allows authorities to prepare local health services for increased burden. In England for example, a programme of spatial targeting of controls (known as Tiers) was implemented from 14 October to 15 November 2020 and from 2 December 2020 to 6 January 2021 (see also figure 2); during this period different areas of the country were evaluated as being at ‘medium’, ‘high’ or ‘very high’ alert levels, with different restrictions imposed depending on the severity of the outbreak and allocations regularly reviewed by the Joint Biostatistics Centre. From 21 December 2020, England also included a fourth ‘stay at home’ tier, imposing additional restrictions to people’s movements in high-risk local authorities [32,51]. More generally, by means of early warning signals specific to precise geographical locations, public health officials can quickly take action to contain or slow down a disease and reduce its overall impact if necessary.

    One potential limitation of this study is that, in the absence of a ground truth which determines the scale of epidemic and the component variants with high certainty, it is difficult to quantify the accuracy of our results. Such comprehensive validation studies are often required for deployment of new methodologies (such as RaNCover) in real-world routine surveillance [52]. However, we have already shown and discussed the accuracy of RaNCover’s accuracy in simulation experiments [16], and its performance with COVID-19 data adds extra support to our claims of accuracy and robustness. In fact, RaNCover’s approach offers an improvement from the usual viewpoint of monitoring and catching exceedances in individual locations, which are subject to important fluctuations that can cause false positive and false negative detection events. With its ability to aggregate signal from multiple regions, resulting in earlier identification of areas of concern, where a new variant is set to grow before it becomes widespread, we believe the random neighbourhood covering approach may be a valuable tool in the efforts to mitigate the impact of emerging infectious diseases.


    Data were supplied after anonymization under strict data protection protocols agreed between the University of Warwick and UK Health Protection Agency, with public data deposition non-permissible for privacy reasons. The ethics of the use of these data for these purposes was agreed by UK Health Protection Agency with the Government’s SPI-M-O/SAGE committees.

    Data accessibility

    SARS-CoV-2 data were supplied under strict data protection protocols agreed between the researchers and UK Health Protection Agency. Interested readers wanting similar data may wish to contact , as detailed from: Spatial maps created with Sf [53] using digital vector boundaries available from the Office for National Statistics, licensed under Open Government Licence v.3.0. Crown copyright and database right 2022. Terms and conditions of supply for the digital boundaries and reference maps are provided from: The scripts used in this project can be found from the GitHub repository: [54].

    Electronic supplementary material is available online [55].

    Declaration of AI use

    We have not used AI-assisted technologies in creating this article.

    Authors' contributions

    M.C.: conceptualization, data curation, formal analysis, investigation, methodology, software, validation, visualization, writing—original draft, writing—review and editing; L.D.: conceptualization, writing—review and editing; M.J.T.: conceptualization, writing—review and editing; D.T.: conceptualization, writing—review and editing; M.J.K.: conceptualization, data curation, funding acquisition, investigation, methodology, visualization, writing—original draft, writing—review and editing.

    All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

    Conflict of interest declaration

    We declare we have no competing interests.


    This work was supported by Health Data Research UK, which is funded by the UK Medical Research Council, EPSRC, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation and the Wellcome Trust (M.C. and M.J.K.). M.J.K. was also supported by the JUNIPER modelling consortium (grant no. MR/V038613/1) and by the National Institute for Health Research (NIHR) (Policy Research Programme, Mathematical and Economic Modelling for Vaccination and Immunisation Evaluation, and Emergency Response; grant no. NIHR200411). M.J.K. is affiliated to the National Institute for Health Research Health Protection Research Units (NIHR HPRUs) in Gastrointestinal Infections and in Genomics and Enabling Data. This research is supported by the National Institute for Health and Care Research (NIHR) Applied Research Collaboration (ARC) West Midlands (D.T.). The views expressed are those of the author(s) and not necessarily those of the NIHR, the Department of Health and Social Care or Public Health England. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.


    The authors would like to thank Andre Charlett and the data teams at UKHSA for collecting and making available the data used in this study.


    Electronic supplementary material is available online at

    Published by the Royal Society under the terms of the Creative Commons Attribution License, which permits unrestricted use, provided the original author and source are credited.