Validating CircaCP: a generic sleep–wake cycle detection algorithm for unlabelled actigraphy data

Sleep–wake (SW) cycle detection is a key step for extracting temporal sleep metrics from actigraphy. Various supervised learning algorithms have been developed, yet their generalizability from sensor to sensor or study to study is questionable. In this paper, we detail and validate an unsupervised algorithm—CircaCP—for detecting SW cycles from actigraphy. It first uses a robust cosinor model to estimate circadian rhythm, then searches for a single change point (CP) within each circadian cycle. Using CircaCP, we estimated sleep/wake onset times (S/WOTs) from 2125 individuals’ data in the MESA sleep study and compared the estimated S/WOTs against self-reported S/WOT event markers, using Bland–Altman analysis as well as variance component analysis. On average, SOTs estimated by CircaCP were 3.6 min behind those reported by event markers, and WOTs by CircaCP were less than 1 min behind those reported by markers. These differences accounted for less than 0.2% variability in S/WOTs, considering other sources of between-subject variations. Rooted in first principles of human circadian rhythms, our algorithm transferred seamlessly from children’s hip-worn ActiGraph data to ageing adults’ wrist-worn Actiwatch data. The generalizability of our algorithm suggests that it can be widely applied to actigraphy collected by other sensors and studies.


Introduction
Accurate sleep-wake (SW) cycle detection is essential for extrapolating sleep patterns from actigraphy data (Meltzer and others (2012)).Actigraphy records one's activity levels at 30-second or 60-second intervals continuously for a prolonged time (e.g. a few days to several weeks).Using such records, researchers can estimate circadian cycles and SW cycles, and extrapolate sleep metrics such as sleep/wake onset times (S/WOTs), sleep durations, and the withinsubject variability of these sleep metrics.
Numerous algorithms for estimating SW cycles from actigraphy data have been developed and such research attempts date back to the 1980s (Mullaney and others (1980) Webster and others (1982)).Mullaney et al were the first group to devise such algorithms and then Webster et al proposed a scoring algorithm (Webster and others (1982)).
Other researchers extended Webster's scoring method and devised their own scoring algorithm customized to their data (Cole and others (1992) Sadeh and others (1994), Oakley (1997)).These scoring algorithms have been widely used to identify SW states due to their accessibility and ease of use.In recent years, machine-learning techniques have also been applied in S/WOT detection.Most of these techniques are binary classification algorithms such as logistic regression Sazonov and others (2004) Paquet and others (2007) Domingues and others (2013), linear discriminant classifiers (Long and others (2017)), support vector machines (Palotti and others (2019)), and random forests (Angelova and others (2020) El-Manzalawy and others (2017)).
These algorithms have several disadvantages.First, algorithms based on generalized linear models or supervised machine-learning algorithms require ground-truth labels to train models for SW cycle detection.Labels for training models can be expensive to collect (e.g. from polysomnography (PSG) studies), contain inaccuracies (e.g.proxy reports), or be inconvenient to collect (e.g.researchers may elect not to collect labels in order to maximize the compliance rate).Secondly, since the debut of actigraphy sensors in the 1980s, activity levels have been measured in relative scales without a consistent unit.Thus, actigraphy data collected by different wearable activity trackers for the same types of activity have different ranges and distributions.As the parameters of these algorithms are fine-tuned to maximize the detection accuracy for a particular dataset, their generalizability to other datasets is questionable.
Thirdly, these algorithms use fixed-length time windows to extract features, which cannot adequately capture the structural breaks in data distributions that may happen on various timescales (Auger and Lawrence (1989)).Moreover, these algorithms assumed that the distributions of diurnal and nocturnal actigraphy are identical.As a result, the SW cycles estimated by these algorithms tend to have multiple mis-identified sleep/wake states, resulting in multiple fragments of sleep/awake states.These estimated SW cycles may be further smoothed by an ad hoc temporal filter (e.g. a median filter) (Cao and others (2012)) or heuristic rules (Webster and others (1982)).Although such filters may consolidate some of the fragments and improve detection accuracy (Palotti and others (2019)), they can also further reduce the temporal resolution of the detection.Figure 1 illustrates this common issue, based on the application of three of these scoring algorithms to the actigraphy data of one subject.
Recognizing these issues, some researchers have devised algorithms focusing on general SW cycle detection.Van Hees et al descried a accelerometer-based estimation method without sleep labels; instead, heuristic cutoffs were used to estimate various postures and corresponding SW states (van Hees and others (2018)).Cakmak et al fused wristworn sensor data collected at higher sampling rates, applied a binary segmentation algorithm to identify the change points, and integrated three types of segmented time series by training generalized linear models (Cakmak and others (2020)).These algorithms were designed in particular for wearable sensor data stored at higher temporal resolutions (e.g.10Hz), and cannot be readily applied to actigraphy data that are aggregated at 30-second or 1-minute intervals.
Chen et al (Chen and others (2019)) proposed a generic, unsupervised algorithm to detect sleep-wake cycles and estimate S/WOTs based on common characteristics in human circadian rhythms.The algorithm leverages the prior information obtained from a nonlinear parametric model, and detects the S/WOTs with higher precision using a parametric change point (CP) detection algorithm.This algorithm was tested on 112 children's data collected from hip-worn ActiGraph TM sensors without ground-truth labels, and its generalizability has not yet been tested.In this paper, we detail the mathematical foundation of Chen's algorithm (Chen and others (2019)), improve its computation time and robustness, and validate the algorithm on a large actigraphy dataset collected by wrist-worn Actiwatch sensors used in the Multi-Ethnic Study of Atherosclerosis (MESA) Sleep study (Zhang and others (2018) Chen and others (2015)).This dataset contains week-long actigraphy data from over 2000 adults of diverse ages, ethnicities and work schedules, and are accompanied with self-reported S/WOTs recorded as event markers, making them perfect for external validation.

Data Collection
A total of 2237 subjects aged from 45 to 84 were enrolled in the MESA sleep study between 2010 and 2012.During the study, all 2237 subjects were instructed to place an Actiwatch (Philips Respironics, Inc) on the non-dominant wrist, and wear the same sensor for at least five days in the habitual living environment.Subjects were also asked to press the small button located on the right side of the Actiwatch every time they went to sleep or woke up, creating event markers.After the study days, the devices were returned to the study staff, and the data from these devices were downloaded by Respironics Actiware 5 Software, and aggregated as activity counts in 30-second intervals using the Actiware-Sleep software.The quality of these data was scored by the Sleep Reading Center at the Brigham and Women's Hospital.

Measures
A total of 2159 subjects in the MESA Sleep study had actigraphy data, of whom 2125 had a minimum of three days of data that was no less than 50% reliable.These 2125 subjects entered our actigraphy analysis.The quality of these actigraphy data were rated from 2 to 7 (2 being the poorest and 7 being the best, named as G2, G3,..,G7 in the rest of this paper) according to consistency with PSG data, event markers, sleep diary and light levels (more details can be found in the MESA Exam 5 Sleep Data Documention Guide from sleepdata.org/datasets/mesa).Note that quality grades do not necessarily reflect the quality of the actigraphy data alone, but also depend on the completeness and quality of sleep diaries.We used a one-dimensional time series of the actigraphy data -the vector magnitude of activity counts -for SW cycle detection, and compared the detection results with the event marker data to validate the SW cycles detected by our algorithm.

Screening of Actigraphy Data
To reliably estimate circadian rhythms, we included subjects who had worn the sensor continuously for at least four days.To do so, we first excluded subjects with actigraphy of less than 5,760 minutes.Then among the remaining subjects, we detected the continuous wearing periods, defined as having no consecutive zeros more than 120 minutes.
We included subjects with a longest continuous wearing period of at least 5,760-minutes.Lastly, we aggregated the included 30-second interval actigraphy into 60-second interval actigraphy.

Sleep Detection Algorithm
This section details Chen's algorithm for SW cycle detection.The algorithm consists of two steps, 1) segmenting the time series of actigraphy data into circadian cycles using a non-linear parametric model; each cycle consists of a diurnal and a nocturnal period, and 2) identifying more accurate sleep-to-wake and wake-to-sleep CPs within each segment.
The time series of human activity data demonstrate a strong pseudo-periodical pattern (i.e.circadian rhythm), which can thus be modeled by functions with periodical patterns, e.g. a cosine function.Halberg et al (Halberg and others (1967)) proposed a cosinor model to fit data with circadian patterns.It has three parameters, amplitude amp which is the peak of the rhythm, mes which indicates the midline estimating statistic of rhythm, and acrophase phi which is the time of the peak of the rhythm, to be estimated to identify the circadian rhythm.The cosinor model is shown in Equation 1: where t = 1, 2, ..., n, t ∈ N, is the time index for the actigraphy sequence.T is the period of one's circadian rhythm, often set as a constant of 24 (hours), or in our case 1440 (minutes).Marler's extension of this model, the sigmoidallytransformed cosine model, can also be used to capture the circadian rhythm.Since the cosinor model takes significantly less computation time without sacrificing much accuracy, our study implemented the cosinor model to fit the nonlinear curve.We estimated the parameters of the non-linear curve by minimizing the sum of squared residuals.The objective function we optimized is displayed in the formula below: where p is a 3-parameter vector containing mes, amp and φ in Equation 1, the initial value of p is set as (500,550,227).F (p, t) is the fitted curve and Y the raw activity counts with sequence index t, t ∈ N.After the non-linear curve fitting, we dichotomize the fitted curve F by treating the lower 18% of its range as nocturnal periods and the upper 82% as diurnal periods.This dichotomized curve gives an estimate of one's circadian cycles whose transition edges are rough estimates of S/WOTs.
The fixed period of the cosinor model imposes consistent S/WOT over the days and hence cannot capture day-today variation in S/WOT for each individual.Therefore, the precise sleep/wake times must be determined with a more refined searching algorithm.Next, we locate more precise S/WOTs for a particular circadian cycle identified by the cosinor model.If a subject's actigraphy sequence has clear circadian patterns, we assume the subject gets up at least once between two consecutive SOTs, or goes to sleep at least once between two consecutive WOTs, during a day.As a subject's activity pattern shifts most drastically from night to day or day to night, we can assume that S/WOTs are the most significant structural break points before and after which the data distributions were drastically different.
Such structural CPs can be detected using a parametric change point detection method.
where k is the location of the change point, n the length of the data sequence, and δ a real number.Under H 0 , the likelihood function is: whereas under H 1 , the likelihood function is where j = 1, 2, ..., k, ..., n − 1, and represents all the possible locations of a change point k.Thus, to estimate the location of the change point k in a time series, our goal is to find the maximum likelihood ratio L 1 L 0 , or the minimum of the Bayesian Information Criteria (BIC).Chen et al Chen and others (2006) proposed an improved criteria, called modified information criteria (MIC), which is particularly suited for assessing model complexity in change-point models.Thus, here we adapted the MIC term with a scaling factor λ, λ controls the penalty of the change point being detected near the beginning or near the end of a sequence: Since the third and fifth terms in the equation of M IC(k) do not contain location parameter k, we dropped the fifth and sixth terms from the model and simplified the fourth term.Therefore, Equation 2.8 reduces to: (2.9) k is the estimated location of the change point in sequence Y .λ is empirically determined as 50.The whole algorithm for sleep-wake cycle detection is detailed in the Algorithm 1.A Matlab implementation of CircaCP can be found at https://github.com/ShanshanChen-Biostat/CircaCP.

Error Detection
To identify detection errors without ground-truth labels, we adopted a metric for evaluating clustering performancethe Calinski-Harabasz (CH) criterion (Caliński and Harabasz (1974)).Also known as the variance ratio criterion, this metric measures the ratio of total between-cluster sum of squares (SSB) to total within-cluster sum of squares (SSW).
where n is the total number of samples in an actigraphy sequence, k is the number of clusters, in our case, k = 2 (i.e.sleep and awake periods).We evaluated both the CH for SW cycles estimated by the cosinor model ( LabelCP 1 = SOT 13: else 14: for each k in [1 : l] do detection errors and removed from the subsequent analysis.

Validation Procedure
Self-reported event markers contain inherent errors.For example, the subject may forget to click the button before falling asleep or after waking up, or click the button multiple times for a true S/WOT, or even at a random time of day which was not bed time or wake-up time.Thus, for validation, we paired up the S/WOTs detected by the CircaCP algorithm with the most likely corresponding event markers.First, sequence alignment methods such as the Needleman-Wunsch algorithm (Needleman and Wunsch (1970)) and Dynamic Time Warping (Mueen and others (2016)) were applied to match event markers and the estimated S/WOTs.However, these existing alignment methods did not work well because long gaps between event markers and S/WOTs are quite common when the subject forgets to record S/WOTs.On the other hand, the correct event markers can always be detected close to true S/WOTs.Hence, we searched for event markers within the proximity of the estimated S/WOTs directly.Here, we defined a marker as valid if it fell within 180 minutes of the estimated S/WOT.If multiple markers matched for a certain S/WOT, we kept only the latest marker for SOT and the earliest marker for WOT.
Once the event markers and the estimated S/WOTs were matched, we converted the index-based S/WOTs and event markers to minute-based S/WOTs, defined as the time elapsed since midnight, for quantitative analysis.SOTs occurring after midnight had 1440 minutes added to the original elapsed time.
To assess the variability in S/WOTs from the two sources (i.e.event markers and CircaCP estimation) and from actigraphy of different quality, we conducted variance component analyses to decompose the total variance in the data into the percentage contributions of various random effects (i.e.algorithm, subject and within-subject, day-to-day variability).We also assessed the effects of these factors on the outcome -SOT or WOT -using linear mixed-effects models.Specifically, actigraphy quality group, algorithm, work schedule, age, gender and ethnicity were modeled as fixed effects.The reference levels were the highest actigraphy quality group (i.e.G7), event markers, being white, female, and youngest, and having a day job.Day-to-day variability was modeled as random effects nested within subjects.Lastly, we assessed the agreement between the two methods using Bland-Altman analysis.

Results
Out of the 2125 subjects' actigraphy data, 1957 subjects' actigraphy passed the screening criteria (i.e.having a continuous wearing time of at least four days) and were used for SW detection (Table 1).Our screening criteria kept more than 90% of subjects' actigraphy of varying quality, with quality group G7 having the most subjects included.Among these 1957 subjects, 34 subjects with poor SW detection results (evaluated by the error detection procedure) were excluded.Of the remaining 1923 subjects, 1857 had at least one matched marker and algorithm-detected S/WOT pair and entered the validation analysis.The demographics of these 1857 subjects are also listed in Table 1.The majority of these subjects were retired, aged between 66 and 71, white, and female.In the group with the lowest actigraphy quality (G2), subjects were older and mostly African-American, and had a higher percentage of shift jobs than other groups.The G2 group had the lowest percentage (88.9%) of actigraphy that entered the validation analysis whereas in the G4 to G7 group, almost all actigraphy sequences for SW detection entered the validation analysis.
The SW cycle detection results are visualized in Figure 3.Despite the variety of work schedules, the CircaCP algorithm correctly identified SW cycles, where the transition edges are the estimated S/WOTs.These S/WOTs are very close to the S/WOTs reported by the event markers.Bland-Altman plots in Figures 4 and 5 show the limits of agreement between the estimated S/WOTs and those recorded as the event markers.Although our marker matching procedure implied that the maximum difference between matched markers and estimated S/WOTs is ±180 minutes, the limits of agreement between the two were well under 180 minutes (about 100 minutes for SOT and 90 minutes for WOT).Our variance-component model further analyzes the factors that impact the S/WOT outcomes (Table 3).
Overall, the factor of S/WOT estimation method contributes less than 0.2% to the total S/WOTs variance, suggesting that the CircaCP algorithm is equivalent to the event markers when analyzing the impact of the estimation algorithm on the outcomes.The majority of the variation comes from the subject-to-subject variability and day-to-day variability.
Results from linear mixed-effects models are shown in Table 2. SOTs estimated by our proposed algorithm are about 6 minutes behind SOTs reported by markers in the highest-quality actigraphy group, and more behind (though nonsignificantly) in worse-quality actigraphy groups.WOTs detected by CircaCP are almost identical to those recorded by markers, with a non-significant difference of less than one minute.On average, the subjects who are the youngest white females with a day job, and have the best actigraphy quality slept at 23:23 and woke up at 6:45.Having a non-regular work schedule significantly affects the S/WOT outcomes, and working night shifts delays WOTs the most among all types of schedule.Among the four race groups, African Americans had significantly delayed SOTs in comparison to white Caucasians.Since the quality grades do not entirely reflect raw actigraphy quality, this factor (i.e. the quality groups) did not impact the S/WOT outcomes significantly.

Discussion
In this study, we improved and validated a generic, unsupervised algorithm for detecting S/WOTs from a large actigraphy dataset.Originally developed for children's actigraphy data collected by hip-worn ActiGraph sensors, this algorithm was seamlessly applied to adults' actigraphy data collected by wrist-worn Actiwatch sensors.Validated against the self-reported S/WOT event markers, our proposed algorithm showed good consistency with the markers, resulting in a small bias of less than 6 minutes in comparison with the markers.Using variance component models, we found that the between-subject variability and day-to-day variability within subjects contributed by far the largest percentage to the total variances in S/WOTs.This highlights the necessity of SW cycle detection when analyzing actigraphy data.
While the between-subject variation can be partially modeled by certain covariates (e.g.age), individual habitual rhythms are largely random and cannot be learned well via population statistics.The large within-subject variation suggests that reporting average S/WOTs will lose considerable information about an individual's sleep patterns.
Our proposed algorithm has several advantages.First, this top-down approach converts the SW cycle detection problem into a change-point detection problem within bounded regions, which are the estimated circadian cycles.
In contrast to previous methods that focused on supervised learning methods, the assumptions of CircaCP focus on the commonality of sleep patterns in humans.Following these assumptions, CircaCP captures the circadian cycles via a cosinor model and the day-to-day variability via CP detection bounded by each circadian cycle.Therefore, this method is directly transferable to other types of actigraphy, given apparent circadian rhythms and reasonably identified data distributions.For example, activity counts from Actiwatch and ActiGraph, although based on different proprietary algorithms, are continuous random variables with long-tail distributions, hence can both be modeled well by Gamma distributions where the dispersion (i.e. the inverse of the shape parameter) is explicitly estimated.For activity count data that are discrete and generated from a counting process (e.g.activity counts measured by infrared room occupancy sensors, or step counts from FitBit ® sensors), change-point detection algorithms based on Poisson distributions may achieve better detection accuracy.
Secondly, the only parameter that needs to be chosen in advance is the threshold to dichotomize the cosinor curve.
Although we used a set of priors to initialize the cosinor model, those values are related to the circadian rhythms of human beings, thus can be applied to other datasets without tuning.Since the proposed algorithm circumvents the laborious collection of sleep labels and training classifiers for various actigraphy datasets collected by various devices with different configurations, this automated approach reduces the overall overhead in developing SW cycle detection algorithms.
Lastly, we elected to use the less computationally-demanding cosinor model instead of the Hill-transformed cosinor model in our previous publication (Chen and others (2019)), so that the curve fitting step was more robust and much faster.Moreover, we improved the previous algorithm by a second round CP searching procedure, so that the misidentified CPs could be better located in the second round.Since all of these steps can be done in linear time (O(n)), with the algorithms implemented in Matlab 2021a using Intel i7 4-core CPU at 3.4GHz with 16.0GB RAM, it takes less than 15 minutes to complete the SW cycle detection on 1930 subjects' actigraphy data.
One limitation of using actigraphy for S/WOT detection is that the S/WOTs are effectively time-to-bed and timeto-get-up, whereas the precise time when the body physiologically and/or cognitively switches off is unattainable.For subjects who have afterhours sedentary behavior, actigraphy-based SW detection can underestimate their SOTs by a large margin (e.g. 7 pm can be identified as the SOT if the subject stays in bed after dinner until falling asleep).
underestimation.Still, the framework of CircaCP (i.e.identifying CPs within a circadian cycle) can also be applied to synchronous heart rate data for even more precise S/WOT estimation.Another limitation is that for certain subjects with severely interrupted circadian rhythms (mostly subjects who worked night shifts), the detection results contained large errors and were identified by the error detection method.Although this is a small percentage (less than 2%) of the cohort, their poor regularity may be of greater interest in studies investigating the impact of occupation on health.
For such actigraphy data, algorithms that take into account the characteristics of long sequences can be used (e.g.Gamma distribution-based Hidden Markov Models).Our future work will focus on fusing information from different sensor modalities and other types of estimation method.
The applications of our proposed algorithm are wide.In essence, it provides a unified framework for accurately detecting SW cycles and S/WOTs.In sleep research, multiple sleep-related metrics, such as sleep duration and S/WOT variability, can be derived based on accurate S/WOTs.For actigraphy analysis in general, CircaCP can also serve as a pre-processing step by segmenting the long sequences of actigraphy into two regions, within which time-series are relatively stationary, and thus time series models can be validly applied and metrics related to physical activity profiles can be accurately extrapolated.

Financial disclosure
The authors had no financial arrangements or connections to disclose.

Non-financial disclosure
The authors have no potential conflicts of interest.

Fig. 1 .
Fig. 1.Sleep-wake states detected by three popular algorithms for one subject's actigraphy data from the MESA Sleep study

FigFig. 4 .Fig. 5 .
Fig. 3. Visualization of SW detection results Thus, we describe the time series of actigraphy as a pseudo-cyclostationary random process Y, which collects a sequence of independent and identically distributed random variables, Y = [y 1 , y 2 , ..., y k , ..., y n ], that follow Gamma is the scale parameter for the random variable at time k, y k , and ξ the shape parameter shared by the random variables [y 1 , y 2 , ..., y k , ..., y n ].A structural change point in such a time series means the scale parameter of the sequence before and after this point is significantly different.To test if such a structural change point exists in Although count data are traditionally modeled by Poisson distributions, activity "counts" are not natural counts that are generated from a counting process, but intensity measures of activity based on various algorithms.Distributions of such intensity data (e.g.time series of seismic moments and precipitation intensity) usually show long-tail, a segment of time series data, we perform a test based on likelihood ratios.The hypothesis can be expressed by the following mathematical notation:

Table 1 .
Summary of Actigraphy Data in MESA Sleep Study.