Long-tailed distributions of inter-event times as mixtures of exponential distributions

Inter-event times of various human behaviour are apparently non-Poissonian and obey long-tailed distributions as opposed to exponential distributions, which correspond to Poisson processes. It has been suggested that human individuals may switch between different states, in each of which they are regarded to generate events obeying a Poisson process. If this is the case, inter-event times should approximately obey a mixture of exponential distributions with different parameter values. In the present study, we introduce the minimum description length principle to compare mixtures of exponential distributions with different numbers of components (i.e. constituent exponential distributions). Because these distributions violate the identifiability property, one is mathematically not allowed to apply the Akaike or Bayes information criteria to their maximum-likelihood estimator to carry out model selection. We overcome this theoretical barrier by applying a minimum description principle to joint likelihoods of the data and latent variables. We show that mixtures of exponential distributions with a few components are selected, as opposed to more complex mixtures in various datasets, and that the fitting accuracy is comparable to that of state-of-the-art algorithms to fit power-law distributions to data. Our results lend support to Poissonian explanations of apparently non-Poissonian human behaviour.


NM, 0000-0003-1567-801X
Inter-event times of various human behaviour are apparently non-Poissonian and obey long-tailed distributions as opposed to exponential distributions, which correspond to Poisson processes. It has been suggested that human individuals may switch between different states, in each of which they are regarded to generate events obeying a Poisson process. If this is the case, inter-event times should approximately obey a mixture of exponential distributions with different parameter values. In the present study, we introduce the minimum description length principle to compare mixtures of exponential distributions with different numbers of components (i.e. constituent exponential distributions). Because these distributions violate the identifiability property, one is mathematically not allowed to apply the Akaike or Bayes information criteria to their maximum-likelihood estimator to carry out model selection. We overcome this theoretical barrier by applying a minimum description principle to joint likelihoods of the data and latent variables. We show that mixtures of exponential distributions with a few components are selected, as opposed to more complex mixtures in various datasets, and that the fitting accuracy is comparable to that of state-of-the-art algorithms to fit power-law distributions to data. Our results lend support to Poissonian explanations of apparently non-Poissonian human behaviour. record various human behaviours in quantitative manners. Such quantitative understanding often challenges traditional assumptions underlying mathematical modelling of human behaviour, a major instance of which is a lack of Poissonian properties in a range of human behavioural data. In other words, when a sequence of time-stamped events, such as email correspondence, is observed from a human individual, the event sequence often deviates from Poisson processes, in which inter-event times independently obey an exponential distribution. Rather, empirical inter-event times from human behaviour and other data often obey long-tailed distributions [1,2]. Such non-Poissonian processes on nodes and edges are building blocks of temporal networks on which dynamical processes such as epidemic processes often behave differently from the same processes on static networks [3,4].
Two classes of models that generate long-tailed distributions of inter-event times are priority queue models and modulated Markov processes [3,4]. In priority queue models, one assumes that a human individual is a queue that receives tasks across time and prioritizes some particular tasks for execution [1]. By contrast, in modulated Markov processes, one assumes that a human individual generates events according to a Poisson process (i.e. one determines whether to have an event right now at a constant rate without memory) but modulates the event rate. The event rate may be assumed to take one of a few values [5][6][7][8][9][10] or continuously many values [11,12]. When we assume a few states, where each state corresponds to a Poisson process, an underlying assumption about human behaviour is that an individual transits among a few distinguishable states, such as an active state and an inactive state. If this is the case, we should be able to fit a mixture of exponential distributions rather than powerlaw distributions to empirical distributions of inter-event times with a reasonable accuracy. This is because a subset of inter-event times, i.e. those produced in each state, is expected to obey an exponential distribution.
In fact, a mixture of a small number of exponential distributions and a power-law distribution with an exponential cut-off can look similar. In the present study, we fit mixtures of exponential distributions to several datasets of inter-event times. The idea of approximating empirical distributions resembling power-law distributions by mixtures of exponential distributions is not a new idea [13]. We perform model selection to determine the number of component exponential distributions fitted to data and also compare the mixture of exponential distributions with two types of conventionally used powerlaw distributions. A technical challenge is that the maximum-likelihood estimator of the mixture of exponential distributions does not satisfy the asymptotic normality such that it is not allowed to use the Akaike information criterion (AIC) or Bayes information criterion (BIC) (e.g. [14, p. 203]). Therefore, based on the minimum description length (MDL) criterion [15,16], we perform the procedure so-called latent variable completion [17][18][19][20] to derive variants of MDL justified for mixtures of exponential distributions. We find that mixtures of up to three exponential distributions are the best performer in many cases. Python codes for estimating and selecting the mixture of exponential distributions are available at Github (https://github.com/naokimas/exp_mixture_model).

Mixture of exponential distributions and its maximum-likelihood estimation
We fit mixtures of exponential distributions, which we refer to as the exponential mixture models, abbreviated as EMMs, to distributions of inter-event times, denoted by τ ∈ [0, ∞), and compare the fit with the case of power-law distributions. We denote by k the number of the exponential distributions to be mixed, which we refer to as the number of component distributions. The mixing weight, i.e. the probability that the jth exponential distribution is used, is denoted by π j (1 ≤ j ≤ k). The probability density function for an EMM is given by where p ; {p 1 , . . . , p k }, m ; {m 1 , . . . , m k } and μ j is the mean of the jth exponential distribution. Given a series of inter-event times, t ; {t 1 , . . . , t n }, and the number of components, k, we estimate the values of the model parameters, u ¼ (p, m), as follows.
First, we run the expectation-maximization (EM) algorithm [21]. As an initial condition, we set where the superscript indicates the iteration number in the EM procedure. We also draw m (0) j (1 ≤ j ≤ k) such that log 10 m (0) j independently obeys the uniform density on [log 10 τ min , royalsocietypublishing.org/journal/rsos R. Soc. open sci. 7: 191643 log 10 τ max ], where τ min and τ max are the minimum and maximum of the inter-event time in the given dataset, respectively. In this manner, we intend to generate a sufficiently broad initial distribution of m (0) j while a majority of the m (0) j values is relatively small. Intuitively, a large value of m (0) j , which sometimes occurs, reflects the long tail of the distribution of inter-event times. We iterate the following EM steps t max times by alternating the expectation (E) and maximization (M) steps. The E step is given by where t = 0, 1, 2, …, t max − 1. One can interpret g (t) ij as the probability that τ i is generated from the jth exponential distribution. The M step is given by and ij : (2:3b) After the iteration of the E and M steps t max times, one obtains p (tmax) k , which is an approximate maximum-likelihood estimator. This estimator corresponds to the case in which one does not estimate the latent variables that explicitly indicate which exponential distribution out of the k exponential distributions has generated each τ i .
When the values of the latent variables are required, we compute them as follows. We denote the latent variable corresponding to each τ i (1 ≤ i ≤ n) by z i (1 ≤ z i ≤ k). In other words, τ i is estimated to be generated from the z i th exponential distribution. We first estimate the latent variables bŷ We then estimate the other parameters byp j (ẑ) ¼ n j n , ( 2 :5) whereẑ ¼ {ẑ 1 , . . . ,ẑ n }, n j is the number of inter-event times whose estimated latent variableẑ i is equal to j, andm ). To cope with the problem of local maxima, we estimate the values ofp j ,m j , andẑ i (1 ≤ j ≤ k, 1 ≤ i ≤ n) 10 times starting from different initial conditions. Then, among the 10 maximum-likelihood estimators, we employ the one that has yielded the largest likelihood of the joint distribution of t andẑ.

Model selection criteria for mixtures of exponential distributions
We consider the following six model selection criteria.

AIC and BIC
For an EMM with k components without an explicit consideration of the latent variables, the AIC [22] and the BIC [23] are given by

AIC and BIC with latent variable completion
An EMM is non-identifiable, which by definition dictates that a distribution does not correspond to a parameter set in a one-to-one manner [14]. For example, if k = 2 and μ 1 = μ 2 , the distribution of τ is the same exponential distribution regardless of the value of π 1 ( = 1 − π 2 ). When a distribution is nonidentifiable, the maximum-likelihood estimator is not asymptotically normal, and the conventional AIC and BIC, which are derived on the basis of asymptotic normality, lose justification [14].
A method for overcoming this theoretical barrier is to apply model selection criteria to a complete variable model, in which one completes the values of the latent variables, rather than to use a marginalized model [17][18][19][20]. We call this method the latent variable completion [20] throughout this paper. Under latent variable completion, one explicitly incorporates the likelihood of the latent variables, z i (1 ≤ i ≤ n), without marginalizing them, into a model selection criterion. Then, the joint distribution of τ i and z i (1 ≤ i ≤ n) is an identifiable distribution such that the use of the AIC and BIC is justified. Also in practice, latent variable completion is better at describing some empirical data (for example, at estimating an appropriate number of components, k) than model selection criteria that do not use latent variable completion [17,18].
We refer to the corresponding AIC or BIC as the AIC or BIC with latent variable completion [17,19] and denote them by AIC LVC and BIC LVC , respectively. We obtain AIC LVC (t,ẑ) ¼ À log p(t,ẑ;û (t,ẑ)) þ 2k Ã À 1 ( 2 :9) and BIC LVC (t,ẑ) ¼ À log p(t,ẑ;û (t,ẑ)) þ k Ã À 1 2 log n þ 1 2 log n j : (2:10) In equations (2.9) and (2.10),û(t,ẑ) ¼ {p 1 (ẑ), . . . ,p k Ã (ẑ),m 1 (t,ẑ), . . . ,m k Ã (t,ẑ)}, derived in equations (2.5) and (2.6), is the maximum-likelihood estimator for the joint distribution of inter-event times t and latent variablesẑ. The likelihood in equations (2.9) and (2.10) is given by (2:11) In equations (2.9), (2.10) and (2.11), k Ã is the number of components that are used at least once under the estimated latent variable values. The latent variable specifies which of the k exponential distributions has produced τ i for each i. Therefore, as a result of estimating the latent variables, a jth exponential distribution may not have any inter-event times τ i belonging to it (i.e. n j = 0). We exclude such exponential distributions, i.e. those with n j = 0, and only consider the remaining k Ã exponential distributions. To calculate AIC LVC and BIC LVC , we use k Ã instead of k because k Ã is the actual number of components given the estimated latent variable values. For the same reason, we will use k Ã instead of k in the two model selection criteria introduced in the following sections, which also explicitly use the estimated latent variables and latent variable completion.

Normalized maximum-likelihood codelength
Here, we introduce model selection criteria on the basis of the MDL principle [16]. This principle asserts that the best model should be the one that attains the shortest codelength required for encoding the data as well as the model itself. The codelength is the number of bits required for encoding the data into a binary sequence under the information-theoretic requirement that each codeword can be uniquely decodable even without commas. In fact, MDL approaches to mixture modelling have a long history, as represented by the Snob program [24,25]. The normalized maximum-likelihood (NML) codelength is a type of MDL [16,26]. The NML codelength minimizes the worst-case (i.e. in terms of data x) regret value, which is equal to the actual royalsocietypublishing.org/journal/rsos R. Soc. open sci. 7: 191643 codelength minus the ideal codelength, i.e. min u[Q (Àlog p(x; u)) [27]. The other two model selection criteria, which we will explain in the next two sections, are based on the NML. Therefore, we explain the NML codelength in this section. Although we assume that inter-event times τ i are continuousvalued, we first explain the NML codelength for a sequence of discrete-valued variables x ¼ {x 1 , . . . , x n } for expository purposes.
To shorten the codelength for the given data x, one should in principle minimize the Shannon information given by Àlog p(x; u), where we assume throughout the present paper that log is in base e unless we specify the base. The minimizer is obviously given by the maximum-likelihood estimator, u(x). However, the maximum-likelihood estimator generally yields X becauseû depends on x. In equation (2.12), the summation is taken over all possible values of x. Therefore, we use the normalized probability distribution given by to encode the data [28]. In other words, we set where is called the parametric complexity. In equation (2.14), the first term on the right-hand side is small when the model fits the data well, and the second term represents the complexity of the model. In general, C(n) tends to increase as k increases because an increase in k leads to the expansion of the parameter space in which one searchesû, which leads to an increase in C(n) [29]. One can similarly derive the NML codelength for a sequence of continuous-valued variables by replacing the summation by the integral. For example, the equivalent of equation (2.15) in the case of continuous-valued variables is given by In the remainder of this section, we explain the NML codelength for an exponential distribution [19,30], not for an EMM, for two reasons. First, analytically calculating the NML codelength for an EMM seems formidable. Second, we will use the NML codelength for single exponential distributions in deriving the two types of the NML-based codelengths for EMMs in the following sections.
Consider an exponential distribution given by For this distribution, we obtain and where T (m min , m max ) is a region of t 0 ; {t 0 1 , . . . , t 0 n } [ R n such that the maximum-likelihood estimator given bym We have to confine the domain of integration to T in this manner because the royalsocietypublishing.org/journal/rsos R. Soc. open sci. 7: 191643 integral in equation (2.18b) diverges if T is replaced by R n . In practice, if we take μ min ( > 0) small enough and μ max large enough,m [ [m min , m max ] would be satisfied for empirical sequences of inter-event times t 0 . The standard method to evaluate the integral in equation (2.18b) is to transform the integral to that in the parameter space [19,31]. This method yields an integral of the form similar to Ð m max m min (1=m)dm, which is manageable [19]. Region T is a useful heuristic for avoiding such a divergence in NML-related calculations [31].
To incorporate the codelength necessary for encoding the value of μ min and μ max into the NML codelength, we rewrite μ min = exp(m min ) and μ max = exp(m max ), where m min and m max are integers. Then, we encode m min and m max instead of μ min and μ max .
The first term on the right-hand side of equation (2.18a) is given by 20) wherem is given by equation (2.19). To obtain the second term on the right-hand side of equation (2.18a), we substitute equation (2.17) into equation (2.18b), which leads to where Γ(n) is the gamma function. By substituting equations (2.20) and (2.21) into equation (2.18a), one obtains To carry out model selection, we add the codelength for m min and m max , denoted by ℓ(m min ) and ℓ(m max ), to L NML (t) to obtainL The derivation of ℓ(m min ) and ℓ(m max ) is given in the electronic supplementary material.

Normalized maximum-likelihood codelength with latent variable completion
As a first NML type of the model selection criterion for EMMs, we consider a latent variable completion of the NML codelength. We refer to the resulting criterion by NML LVC and the codelength by L LVC .
Although an analytical expression for the NML codelength for an EMM without latent variable completion is unavailable, we can analytically calculate the NML codelength for the joint distribution of inter-event times and the latent variables. Similar to the derivation in the case of a mixture of Gaussian distributions [19], we derive the codelength for EMMs as follows: We remind that p(t 0 ,ẑ 0 ;û(t 0 ,ẑ 0 ) is given by equation ( and for two reasons. First, log C EMM (n, k Ã ) is small when m 0 min is large or m 0 max is small. Second, encoding of m 0 min and m 0 max is facilitated by the introduction of integer variables, as we did for the NML codelength ( §2.2.3).
By substituting Àq j log q j represents the entropy, one obtains By substituting equation (2.11) in equation (2.25), one obtains the parametric complexity as follows: (2:33) Note that we have used equation (2.21) to derive equation (2.33) and that C EMM (n, 1) coincides with C exp (n). Because we cannot calculate C EMM (n, k Ã ) using equation (2.32) due to combinatorial explosion, we calculate C EMM (n, k Ã ) using the recursive relationship [19] given by The calculation of C EMM (n, k Ã ) using this recursive relationship, which dominates the computation time for calculating L LVC (t,ẑ), requires O(n 2 k Ã ) time. Finally, we add the codelength to encode integers m 0 min and m 0 max to obtaiñ

Decomposed NML codelength
The second type of the NML codelength with latent variable completion, which we call the decomposed NML (DNML) codelength [20,32], also completes the latent variable and then calculates the NML codelength. The DNML does so after decomposing the joint distribution into the distribution of the latent variables, z, and that of the observables, t, conditioned on the value of z [20]. The DNML codelength was originally formulated for the case in which both observables and latent variables were royalsocietypublishing.org/journal/rsos R. Soc. open sci. 7: 191643 discrete [20]. Here, we regard the inter-event time, τ i , as continuous-valued and the latent variable,ẑ i , as discrete-valued. We start by considering {n j logm j þ n j log n j À log G(n j )} þ k Ã log (m 0 max À m 0 min ), (2:37) where m 0 min and m 0 max are given by equations (2.27) and (2.29), respectively. It should be noted that we could prepare integers m min and m max for each j (1 ≤ j ≤ k Ã ) to code {t i } n i¼1,ẑ i ¼j throughm j , similarly to equation (2.23), to derive an alternative of equation (2.37). However, that coding method requires 2k Ã integers and therefore is less efficient than using only two integers m 0 min and m 0 max . The NML codelength L NML (ẑ) on the right-hand side of equation (2.36) is for a multinomial distribution ofẑ i and is given by In equation (2.38), C mult (n, k Ã ) is the parametric complexity for the multinomial distribution having k Ã elements. Using equation (2.5), one obtains where n 0 j is the number of inter-event times whose latent variableẑ 0 i is equal to j. One can recursively calculate C mult (n, k Ã ) [33] by n! t!(n À t)! t n t n À t n nÀt (2:40b) and C mult (n, k) ¼ C mult (n, k À 1) þ n k À 2 C mult (n, k À 2) (k ! 3): (2:40c) The computational time for solving the set of recursive equations is given by O(n + k Ã ) and dominates the computational time for L DNML (t,ẑ). Finally, similarly to the case of NML LVC , we add the codelength to encode integers m 0 min and m 0 max to obtainL

Power-law distributions
We also fitted two types of power-law distributions to the empirical distributions of inter-event times. The first type is the Pareto distribution whose probability density function is given by which is defined for τ ≥ b. We use the maximum-likelihood estimator given by [34,35] b(t) ¼ min : (2:44) The second estimator was the one proposed by Clauset et al. [36]. In this method, which we refer to as the PLFit algorithm, one selects theb value such that the distribution of the data points satisfying t i !b is as close as possible to a power law. The power-law exponent given theb value is royalsocietypublishing.org/journal/rsos R. Soc. open sci. 7: 191643 estimated by equation (2.44). It should be noted that the PLFit does not intend to fit a power-law or different distribution to the data points whose values are less thanb. We used a publicly available Python implementation of the PLFit [37].

Data
We used the following seven datasets of time-stamped event sequences obtained from human behaviour. Basic properties of each dataset are shown in table 1.
First, we used the data that the SocioPatterns project collected from individuals in an office building that hosted scientific departments [38]. We refer to this dataset as the Office dataset. The individuals wore a sensor on their chest, with which physical proximity between pairs of individuals was detected. The recording lasted for two weeks. A time-stamped event was defined as a contact that lasted at least 20 s, which was the time resolution of the data. For each individual, we ignored the partner of the contact and used the beginning time of each contact as the time of the event. In this manner, we examined a sequence of time-stamped events for each individual in this and the following datasets. Note that, because a contact is a symmetric relationship between two individuals, any event between nodes i and j is considered for both i and j.
The inter-event time was defined as the difference between the starting time of the two consecutive events. There was no event during the night-time due to the circadian rhythm. The effect of the circadian rhythm on analysis of inter-event times may be considerable but beyond the scope of the present study. Therefore, we excluded the inter-event times when the two events belonged to different days unless otherwise stated. When there were multiple events in consecutive 20 s time windows between the same pair of individuals, we regarded that they constitute a single event lasting over 20 s rather than a sequence of events with inter-event times equal to 20 s. In practice, in this case, we discarded all but the first event in each sequence of the multiple events in consecutive 20 s time windows and then calculated the inter-event times. We also aggregated events for a focal node that occurred at the same time with different partners into one event. Therefore, the minimum inter-event time is equal to 20 s. For the individuals that have at least 100 inter-event times, the mean and standard deviation of the inter-event times are shown in table 1.
Second, we used the Hypertext 2009 dynamic contact network (Hypertext for short) [39]. Under the SocioPatterns project, the data were collected from approximately 75% of the participants in the HT09 conference in Torino over three days. As was the case for the Office dataset, we ignored inter-event times that span multiple days.
The third dataset originates from the Reality Mining Project (Reality for short), which provides timestamped physical proximity relationships between the participants of an experiment, who are students or Table 1. Properties of the datasets. N represents the number of individuals with at least one inter-event time (therefore, at least two events). The edges are directed in the Bitcoin, Email, College and Sexual datasets. For these datasets, we only considered the individuals as sender but not recipient of the edges. N ≥100 represents the number of individuals with at least 100 interevent times. The mean and standard deviation (abbreviated as s.d.) are for inter-event times and calculated on the basis of the individuals having at least 100 inter-event times. For the Sexual dataset, N ≥100 , the mean and standard deviation are based on the individuals having at least 50 inter-event times. faculty members of MIT. The data were obtained from Bluetooth logs of mobile phones [40]. We did not skip inter-event times overnight because the data were collected over months and the mean inter-event time was much longer than that for the Office and Hypertext datasets (table 1). We used the data from September 2004 to May 2005, which were the months in each of which there were at least 10 3 events in total. Fourth, Bitcoin OTC is an over-the-counter marketplace where users trade with bitcoin [41]. In Bitcoin OTC, users rate other users regarding trustworthiness. We neglected the rate score and who a focal user i rated. For each user i, we examined a series of time-stamped events, where an event was defined by rating behaviour by user i toward anybody. We refer to this dataset as Bitcoin.
The Email-Eu-core (Email for short) dataset was collected from a large European research institution between October 2003 and May 2005 [42]. A node was an email address. An edge was defined between two nodes if and only if they sent email to each other at least once. A time-stamped event for node i in the network was defined by an email that i sent to anybody, disregarding the identity of the recipient of the email.
The CollegeMsg (College for short) data were collected from students belonging to an online community at the University of California, Irvine, between April and December 2004 [43]. The event was defined as the message sent from a user to another user. We downloaded the Bitcoin, Email and College data from the Stanford Network Analysis Project (SNAP) website (http://snap.stanford.edu/).
The sexual contact (Sexual for short) dataset provides times of online sexual encounters between escorts and sex buyers [44]. We used event sequences for each sex buyer by regarding a commercial sexual activity with any escort as an event.

Fitting of the EMM to the individual with the largest number of events and comparison with power-law distributions
We fitted EMMs with different numbers of components, k, to the sequence of inter-event times obtained from each individual in each dataset. We examined k = 1, 2, …, 9, 10, 20, 50 and 100 in each case. Then, according to each of the six model selection criteria, AIC, BIC, AIC LVC , BIC LVC , NML LVC and DNML, we selected the best k value. Results of model selection for the individual with the largest number of events in each dataset are shown in table 2. Table 2 indicates that the effective number of components finally chosen, k Ã , is at most four in all but three out of the 42 combinations of the dataset and criterion. Furthermore, for four out of the seven datasets, k Ã is at most three regardless of the criterion. These results suggest that a mixture of a small number of exponential distributions may be a reasonable approximation to empirical distributions of inter-event times in many cases. Table 2. Number of components of the EMM selected by each model selection criterion. Each entry of the table shows the selected (k, k Ã ) pair. For each of the seven datasets, the individual with the largest number of inter-event times is used. The number of inter-event times for that individual is shown in the second column of the table. We compared the EMMs with k = 1, 2, …, 9, 10, 20, 50 and 100 under each criterion. For the Sexual dataset, k = 1, 2 or 3 were ties when AIC LVC , BIC LVC , NML LVC or DNML was used. However, these four k values effectively produced the same exponential distribution with k Ã = 1. Therefore, in the table, we wrote k = 1 for these four criteria. Note that, for any criterion, the same k Ã value induced by different initial k values may yield different criterion values because the estimated EMM may be different between the different k values even if the k Ã value is the same. On the other hand, if different criteria are maximized at the same k value, they are maximized by the same EMM in addition that they share the k Ã value. Next, we compared the performance between the estimated EMMs and power-law estimators in approximating the empirical distributions. We have fitted power-law distributions because they have widely been applied to empirical distributions of inter-event times [1][2][3][4]. For the same individuals as those used in table 2, the empirical and estimated distributions of inter-event times are compared in figure 1. When the different model selection criteria yielded different EMMs (i.e. EMMs with different values of k or k Ã ), we showed all the selected EMMs overlaid in the figure. We compared the different distributions in terms of the survival probability (i.e. the probability with which the inter-event time is larger than τ, i.e. Ð 1 t p(t 0 )dt 0 ) and the odds ratio defined by OR(t) ¼ [1 À Note that the odds ratio is particularly good at capturing differences between distributions at small values of τ [45]. Figure 1 elicits the following observations. First, in five datasets (i.e. except Bitcoin and Email) the right tails of the distribution seem to be more accurately approximated by an EMM than the powerlaw distributions. This may be because the two power-law estimators employed here assume a strictly power-law tail, whereas empirical data usually have exponential cut-offs. For the Bitcoin and Email datasets, the power-law estimate obtained by the PLFit algorithm seems to be roughly as accurate as EMMs in approximating right tails of the distribution. Second, the odds ratio plots show that, at small τ values, the approximation looks the most accurate with the Pareto distribution in all but the College dataset.
To examine whether these casual observations extend to the entire set of inter-event times from the same individuals, we investigated the likelihood of the empirical data in each model. The likelihood for the three models, i.e. the selected EMM, maximum-likelihood estimator of the Pareto distribution, and PLFit, is compared in table 3. Because the PLFit discards small inter-event times, we start by comparing the EMM and Pareto distribution when all inter-event times are used. The table indicates that the EMM realizes a larger likelihood value than the Pareto distribution in five datasets and vice versa in the other two datasets (i.e. Hypertext and Reality). The EMM has more parameters than the Pareto distribution, which one may suspect as a reason why the EMM fits the data better than the Pareto distribution in more datasets than vice versa. However, the results are qualitatively the same when the two models are compared in terms of the AIC and BIC (electronic supplementary material, table S1). We remark that the maximum-likelihood estimator for the Pareto distribution is not asymptotically normal such that the application of the AIC and BIC to the Pareto distribution as well as to EMMs is not justified. It should be noted that the other four criteria are not relevant to the Pareto distribution because it is not a latent variable model. For the Hypertext and Reality datasets, for which the likelihood is larger for the Pareto distribution than the optimal EMM, there are relatively many data points with τ i = τ min (table 3). The Pareto distribution beats the EMM probably because it avoids devoting the probability mass to values less than τ min by definition, and it is accurate at many data points that have τ i = τ min . To clarify this point, we compared the likelihood when the data points with τ i = τ min were excluded. The results are shown in table 3. To simplify the discussion, for the EMM we employed the value of k that minimized the AIC. For the Reality data, the EMM fits the data better than the Pareto distribution when we only consider the interevent times larger than τ min . For the Hypertext dataset, the Pareto distribution is still better than the EMM, but the difference in the likelihood is much smaller than when all the inter-event times are used. Note that the Reality dataset has a larger fraction of inter-event times with τ = τ min than the Hypertext dataset (table 3).
The PLFit only fits a power-law distribution to relatively large inter-event times. Therefore, we compared the likelihood of the EMM, Pareto distribution, and PLFit for the set of inter-event time values satisfying t i !b, whereb is the threshold estimated by the PLFit. The PLFit is the best Table 3. Likelihood of the entire and truncated datasets. The results for the individual with the largest number of inter-event times in each dataset are shown. The number of inter-event times larger than min n i¼1 t i andb as estimated by the PLFit is denoted by n 00 and n 0 , respectively. The columns labelled all, τ > min i τ i , and t !b are the likelihood values for the n, n 00 and n 0 inter-event times, respectively. performer among the three fitting schemes (table 3). This is probably because the threshold is optimized for the PLFit in this comparison and the PLFit is allowed to discard small inter-event times. By contrast, the EMM intends to fit the entire set of inter-event times. Furthermore, in this comparison, the EMM is disadvantageous by the amount of probability mass that it has to devote to inter-event times smaller than b. It should be noted that the PLFit discards a considerable fraction of data points, ranging between 27 and 94%, except for the College dataset (table 3). It should also be noted that the EMM yields a larger likelihood value than the Pareto distribution for all the datasets with thisb value. We do not conclude whether or not the EMM outperforms the Pareto distribution or PLFit. Both the Pareto distribution and PLFit benefit from carefully setting the lower bound on the inter-event times,b. When all inter-event time data have to be modelled and we do not know the smallest possible value of the inter-event time, EMMs may be preferred to the Pareto distribution and the PLFit.
So far, we have excluded the inter-event times across consecutive days in the Office and Hypertext datasets. We also ran the same analysis for these two datasets without excluding such long inter-event times. The results were qualitatively the same except that the EMM estimated the number of components larger by one (electronic supplementary material). The new component (i.e. a single exponential distribution in the EMM) corresponded to the across-day inter-event times, with a considerably longer mean inter-event time than that for the other estimated components. Furthermore, the mean inter-event time of the constituent exponential distributions that exist both when across-day inter-event times are included and when excluded are almost the same in the two cases (electronic supplementary material). We conclude that inter-event times on the timescale of a day or longer contribute one exponential distribution with a large mean to the entire EMM without interfering with the constituent exponential distributions on smaller time scales.

Population results
In the previous section, we showed that mixtures of a small number of exponential distributions tended to be selected according to the different model selection criteria. To assess the generality of this result, we carried out the model selection for all individuals having at least 100 inter-event times. Because the Sexual dataset has only two individuals with at least 100 inter-event times, we instead used the 56 individuals that had at least 50 inter-event times for this dataset. The histogram of the optimal k Ã value for each combination of the dataset and criterion is shown in figure 2. Consistently with the results for the individuals with the largest number of inter-event times in each dataset, the effective number of components, k Ã , was estimated to be at most three in four out of the seven datasets (i.e. Office, Hypertext, Reality and Sexual). For the Bitcoin dataset, the distributions of k Ã had the mode at 3, and we obtained 2 ≤ k Ã ≤ 4 in most cases. For the other two datasets (i.e. Email and College), the AIC and BIC yielded distributions of k Ã whose mode was four and the largest value was seven. Under the other four criteria, which are the ones justified for EMMs, the mode was equal to three, and the largest k Ã was equal to five.
To exclude the possibility that longer data (i.e. larger n) tend to yield a larger k Ã value, we calculated the Pearson correlation coefficient between n and the optimal k Ã for each combination of the dataset and criterion. To calculate the Pearson correlation coefficient, we only used the individuals with at least 100 inter-event times (and at least 50 inter-event times for the Sexual dataset), similar to figure 2. The Pearson correlation coefficient values, denoted by r, and its p-value are shown in table 4. The correlation coefficient was small and insignificant except for the Email dataset combined with all criteria and the College dataset combined with the AIC or BIC. When we confined the analysis to the individuals with larger numbers of inter-event times, i.e. at least 500 and 200 inter-event times for the Email and College datasets, respectively, the Pearson correlation coefficient was smaller. In particular, for the AIC LVC , BIC LVC , NML LVC and DNML, the correlation was close to zero and insignificant when we only considered the individuals with many inter-event times. Note that the largest number of inter-event times for these two datasets were sufficiently larger than the new threshold values (Email: n = 3948, College: n = 1090) and that there remained sufficiently many individuals with the new threshold values (Email: 117 individuals, College: 61 individuals). The results suggest that our main result that the optimal number of components, k Ã , tends to be small is expected to remain true for long sequences of inter-event times.

Discussion
We showed that EMMs (i.e. mixtures of exponential distributions) with a small number of components, up to three in many cases and four in some cases, were a reasonably good fit to various empirical royalsocietypublishing.org/journal/rsos R. Soc. open sci. 7: 191643 distributions of inter-event times. In general, there are various mechanisms behind power-law distributions of an observable, and many of them are believed to be relevant as generative mechanisms of long-tailed degree distributions of empirical networks [46]. By contrast, the present results suggest that empirical long-tailed distributions can be also approximated by EMMs. Furthermore, EMMs can be obtained as an outcome of stochastic point processes in which the system's state switches between a small number of discrete states in each of which the process generates events as a Poisson process. In practice, a human or animal individual may maintain a  small number of relatively discrete states, such as active and rest. The event rate may depend on the discrete state. It should be noted that, within each state, the mechanism to generate event sequences is maximally simple: Poisson process with a constant rate. To the best of our knowledge, Poissonian views of long-tailed distributions of inter-event times were first introduced in [5]. The contribution of the present work is to have introduced a formal model selection framework to directly compare EMMs with different numbers of components and having carried out some comparisons with powerlaw distributions. There is a recent debate regarding the abundance of scale-free (i.e. power-law) degree distributions in empirical networks. When a pure power-law tail is assumed for the degree distribution, there are empirically few scale-free networks [47]. By contrast, scale-free degree distributions are abundant when impurity in the power-law distributions via the regularly varying distributions is considered [48]. In the present study, we fitted the pure power-law distributions via the PLFit algorithm [36]. The PLFit was more accurate than EMMs while the PLFit discarded small inter-event times, which occupied a non-negligible fraction of inter-event times in each dataset. We did not examine the method presented in [48]. This is because the purpose of the present study was not to find an accurate estimator or to test if a power-law distribution of inter-event times was a good fit. Rather, our purpose was to test the hypothesis that the system generating time-stamped events, such as humans, transit between a small number of discrete states each of which is a Poisson process generator.
A reasonably accurate fit of the EMMs to the empirical data revealed in the present study mainly comes from a high accuracy at small τ values, because there are many data points with small τ. Such short inter-event times yield a component exponential distribution with a small mean inter-event time and may be produced as a result of the visit of the individual to an active state. Given these considerations, looking at the entire distribution of the data rather than focusing on the distribution's tail may help us to understand mechanisms generating the data.
The authors of [10] reached a similar conclusion to ours, where they split the inter-event times into two groups by thresholding. Then, they fitted an exponential distribution to the inter-event times exceeding the threshold and another exponential distribution to the inter-event times less than the Table 4. Relationship between the best k Ã value and the number of inter-event times (i.e. n) across the individuals within each dataset. We calculated the Pearson correlation coefficient, r, between k Ã and n, and its p-value. The calculation was based on the individuals having at least 100 inter-event times unless we state the threshold value in the first column. The cells without results are the cases in which all the individuals yielded the same k Ã value and hence one cannot calculate the correlation coefficient. threshold. Their models are EMMs with two components. In contrast to their study, we systematically fitted EMMs and inferred the number of components using model selection criteria valid for EMMs. Testing our method on various other distributions of inter-event times to clarify the reason why EMMs with a small number of components apply accurately to some datasets and not others warrants future work.

Data accessibility.
The data are open resources. Python codes for estimating and selecting the mixture of exponential distributions are available at Github (https://github.com/naokimas/exp_mixture_model).