Journal of The Royal Society Interface

    Abstract

    Understanding how people move within a geographical area, e.g. a city, a country or the whole world, is fundamental in several applications, from predicting the spatio-temporal evolution of an epidemic to inferring migration patterns. Mobile phone records provide an excellent proxy of human mobility, showing that movements exhibit a high level of memory. However, the precise role of memory in widely adopted proxies of mobility, such as mobile phone records, is unknown. Here we use 560 million call detail records from Senegal to show that standard Markovian approaches, including higher order ones, fail to capture real mobility patterns and introduce spurious movements never observed in reality. We introduce an adaptive memory-driven approach to overcome such issues. Unlike Markovian models, it is able to realistically model conditional waiting times, i.e. the probability of staying in a specific area depending on individuals' historical movements. Our results demonstrate that in standard mobility models the individuals tend to diffuse faster than observed in reality, whereas the predictions of the adaptive memory approach are in significant agreement with observations. We show that, as a consequence, the incidence and the geographical spread of a disease could be inadequately estimated when standard approaches are used, with crucial implications for resource deployment and policy-making during an epidemic outbreak.

    1. Introduction

    People move following complex dynamical patterns at different geographical scales, e.g. among areas of the same city, among cities and regions of the same country or among different countries. Such patterns have been recently revealed by using human mobility proxies [15] and, intriguingly, some specific patterns tend to repeat more than others, with evidence [6,7] that memory of meaningful locations plays a fundamental role in our understanding of human mobility. In fact, human dynamics might significantly affect how epidemics spread [2,6,8–10] or how people migrate from one country to another [4].

    The collaboration between researchers and mobile operators recently opened new promising directions to gather information about human movements, country demographics and health, faster and cheaper than before [1,10–18]. In fact, mobile phones have heterogeneously penetrated both rural and urban communities, regardless of wealth, age or gender, providing evidence that mobile technologies can be used to obtain real-time information about individuals' location and social activity, in order to build realistic demographic and socio-economic maps of a whole country [19]. Mobile data have been successfully used in a wide variety of applications, e.g. to estimate population densities and their evolution at national scales [13], to confirm social theories of behavioural adaptation [20] and to capture anomalous behavioural patterns associated with religious, catastrophic or massive social events [21]. Even more recently, the public availability of mobile phone datasets further revolutionized the field, e.g. by allowing ubiquitous sensing to map poverty, to monitor social segregation and to optimize information campaigns to reduce epidemic spreading [14,22], to cite just some of them [18].

    Despite some limitations, mobile phone data still provide one of the most powerful tools for sensing complex social systems and represent a valuable proxy for studies where human mobility plays a crucial role [1–4,6,8–10,15,22–24]. Milestone works in this direction have shown that human trajectories exhibit more temporal and spatial regularity than previously thought. Individuals tend to return to a few highly frequented locations and to follow simple reproducible patterns [1,5], allowing a higher accuracy in predicting their movements [3] and significantly affecting the spreading of transmittable diseases [6]. However, the increasing interest in using mobile phone data in applications should be accompanied by a wise usage of the information they carry. In fact, an inadequate model, accompanied by incomplete data and scarce knowledge of other fundamental factors influencing the model itself, might lead, for instance, to a wrong estimation of the incidence of an epidemic and its evolution [25].

    Here, we used high-quality mobile phone data, consisting of more than 560 million call detail records, to show that standard approaches might significantly overestimate mobility transitions between distinct geographical areas, making it difficult to build a realistic model of human mobility. To overcome this issue, we developed an adaptive memory-driven model based on empirical observations that better captures existing correlations in human dynamics, showing that it is more suitable than classical memoryless or higher-order models to understand how individuals move and, for instance, might spread a disease.

    2. Material and methods

    2.1. Markovian model of human mobility

    Let us consider a physical mobility network composed of nodes, representing geographical areas, connected by weighted edges, representing the fraction of individual movements among them. Usually, the weights are inferred from geolocated activities of individuals, e.g. the consecutive airports where a plane departs and lands or, as in this work, the cell towers where a person makes consecutive calls.

    A standard approach to modelling mobility dynamics [3,4,6,10,15,16,26] is to consider each node as a state of a Markov process, obtaining the flux between any pair of nodes from consecutive calls, and to build a mobility matrix Fij encoding the probability that an individual in node i will move to node j (i, j = 1, 2, … , n). Here, we use a similar approach to build the mobility matrix for each individual ℓ separately and we then average over the whole set of mobility matrices, to obtain the transition probability of an individual, on average:

    Display Formula
    2.1
    where Inline Formula is the number of times the individual ℓ makes at least one call in node j after making at least one call in node i.
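
    The averaging procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that transitions are counted over consecutive call pairs, that each individual's counts are row-normalized, and that each row is averaged over the individuals observed at least once in the origin node. The function names and the two call sequences are hypothetical.

```python
from collections import defaultdict

def individual_matrix(calls):
    """Row-normalized first-order transition frequencies for one call sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for origin, destination in zip(calls, calls[1:]):
        counts[origin][destination] += 1
    return {
        i: {j: c / sum(row.values()) for j, c in row.items()}
        for i, row in counts.items()
    }

def average_matrix(sequences):
    """Average the individual matrices row by row.

    Assumption: row i is averaged only over the individuals who have at
    least one observed transition out of node i.
    """
    sums = defaultdict(lambda: defaultdict(float))
    observers = defaultdict(int)
    for calls in sequences:
        for i, row in individual_matrix(calls).items():
            observers[i] += 1
            for j, p in row.items():
                sums[i][j] += p
    return {
        i: {j: s / observers[i] for j, s in row.items()}
        for i, row in sums.items()
    }

# Hypothetical call sequences of two individuals over nodes A and B.
sequences = ["AABAB", "ABBBA"]
F = average_matrix(sequences)
```

    With this choice every row of the averaged matrix sums to one; other normalizations would change the averaging step only.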

    We did not impose a specific time window to calculate transitions, to avoid introducing biases and undesired effects due to the choice of the temporal range. It is worth remarking that other normalizations can be considered depending on data and metadata availability [15]. Where not otherwise specified, we considered the mobility matrix obtained from the whole period of observation. This model is known as ‘first-order’ (or 1-memory) because the present state is the only information required to choose the next state. Although very useful, this has the fundamental disadvantage that it does not account for mobility memory. In fact, it is very likely that an individual moves to a neighbouring area (by means of a car or public transportation) to work and after a few hours he or she will go back to the original position. This effect has been shown to be relevant, for instance, at country level, where individuals fly from one city to another and often go back to their origin instead of moving towards a different city [7]. This memory is an intrinsic property of human mobility and must be taken into account for a realistic modelling of people's movements between different geographical areas. When memory is taken into account, each physical node (e.g. Inline Formula) is replaced by the corresponding state nodes (e.g. Inline Formula if memory is of order 2) encoding the information that an individual is in node i when he or she comes from j. While F encodes information about the network of n physical nodes, we need to introduce a new matrix H to encode information about the network of n^2 state nodes, accounting for the allowed binary combinations (e.g. Inline Formula, j, k = 1, 2, … , n) between physical nodes. Similarly, higher order memory can be taken into account by building appropriate matrices.

    We use different mobility matrices to build different mobility models. Let Ni(t) indicate the population of the physical node i at time t, then the n mobility equations describing how the flux of people diffuses through the network are given by

    Display Formula
    2.2
    In the case of τ-memory, we indicate by Inline Formula the population of the state-node Inline Formula at time t and the n^τ mobility equations required to describe the same process are given by
    Display Formula
    2.3
    The population in each physical node at time t is given by the sum of the population in the corresponding state nodes. It is worth remarking that, in general, the matrix H can also be a function of time and the equations would keep their structural form.
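
    As a rough illustration of how such mobility equations can be iterated, the sketch below assumes a discrete-time update in which the population of node j at the next step is the sum over i of F[i][j] times the population of node i. This form is an assumption consistent with the definition of the mobility matrix given earlier, not a transcription of the displayed equations. The same update applies to state nodes if the keys are (current, previous) pairs and the matrix is H. All names and numbers are hypothetical.

```python
def mobility_step(population, matrix):
    """One step of N_j(t+1) = sum_i matrix[i][j] * N_i(t) (assumed form)."""
    nxt = {node: 0.0 for node in population}
    for i, n_i in population.items():
        for j, p in matrix[i].items():
            nxt[j] += p * n_i
    return nxt

def physical_population(state_population):
    """Sum state-node populations keyed by (current, previous) per physical node."""
    totals = {}
    for (current, _previous), n in state_population.items():
        totals[current] = totals.get(current, 0.0) + n
    return totals

# Hypothetical row-stochastic first-order matrix on two nodes.
F = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.25, "B": 0.75}}
N = {"A": 100.0, "B": 0.0}
for _ in range(3):
    N = mobility_step(N, F)

# Hypothetical state-node populations collapsed to physical nodes.
collapsed = physical_population({("A", "B"): 3.0, ("A", "A"): 2.0, ("B", "A"): 5.0})
```

    Because each row of the matrix sums to one, the total population is conserved at every step.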

    2.2. Adaptive memory model of human mobility

    However, spatial human mobility is quite complex and (higher order) Markovian dynamics might not be suitable to model peculiar patterns such as returning visits and conditional waiting times, i.e. the probability to stay in a location depending on the origin of the travel.

    We discuss this point in more detail below. Let us consider, for instance, the call sequence BBBBCCCSSS made by an individual travelling between three American cities: Chicago, Boston and San Antonio. The main drawbacks of Markovian models of order lower than 3 become evident in a scenario like this one, because the number of consecutive calls in the same city exceeds the memory of the model and the spatial information about previously visited locations is lost. Clearly, in the presence of more complicated patterns, increasing the order of the model will not solve the issue and some information will inevitably be lost. Alternatively, we could aggregate consecutive calls in the same place into a single identifier, e.g. the previous sequence would be reduced to BCS. In this case, a Markovian model would preserve the spatial information and correctly identify the transitions between the three cities, at the price of losing information about how many calls have been made in each place.

    In the absence of detailed temporal information about calling activity, the number of consecutive calls in a specific location can be used as a proxy: the higher the number of calls the larger the waiting time. The temporal information about the amount of time spent in each location is critical for many dynamical processes like spreading or congestion. We assert that this time, like the next visited location, is conditioned by previous movements of the individuals. To illustrate this, we use the example shown in figure 1, where people from three different places (nodes blue, green and orange) go to the same destination (node red), stay some time in there, and come back to the origin of their trip. The self-loops in the central (red) node represent the time spent there, the colour encoding individuals coming from different origins and the size encoding the amount of time spent. For instance, individuals coming from the blue node wait more than individuals coming from the green node. This type of dependence is what we call conditional waiting time.
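
    The two ideas above, aggregating consecutive calls and using their number as a proxy for waiting time conditioned on the origin, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the second call sequence (nodes B, G and R standing for blue, green and red) is invented to mimic the situation described for figure 1.

```python
from itertools import groupby
from collections import defaultdict

def runs(calls):
    """Collapse consecutive calls in the same place into (place, n_calls) pairs."""
    return [(place, len(list(group))) for place, group in groupby(calls)]

def conditional_waiting(calls):
    """Mean number of consecutive calls at a place, keyed by (origin, place)."""
    visits = runs(calls)
    lengths = defaultdict(list)
    for (origin, _), (place, n_calls) in zip(visits, visits[1:]):
        lengths[(origin, place)].append(n_calls)
    return {key: sum(v) / len(v) for key, v in lengths.items()}

# Sequence from the text: aggregation keeps the places and the call counts.
visits = runs("BBBBCCCSSS")
collapsed = "".join(place for place, _ in visits)

# Hypothetical sequence: visits to R last longer when arriving from B than from G.
waiting = conditional_waiting("BRRRRBGRG")
```

    Keeping the run lengths next to the collapsed sequence retains the information that plain aggregation to BCS would discard.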

    Figure 1. Conditional waiting times. An example of human mobility between four different places. Individuals from green, blue and orange nodes move to the red central node and, after some time, go back to their previous location. The amount of time spent in the red node by individuals coming from the other nodes depends on their previous location, and it is represented by self-loops of different size.

    To better appreciate this fact, let us consider holiday trips. Individuals making expensive intercontinental trips tend to spend more time visiting the destination than individuals making cheaper trips, achieving a good trade-off between the travel cost and the time spent. Another emblematic case is urban mobility. For instance, the red node might be an expensive commercial area, the green node a wealthy neighbourhood and the blue node a less wealthy area. In this scenario, which should be considered only for illustrative purposes, individuals coming from the less wealthy area are more likely to be qualified workers in the commercial one, with long and frequent visits. Conversely, individuals from the wealthy area are more likely to make infrequent and shorter visits, for instance for shopping.

    The importance of accounting for conditional waiting times will become evident later, when we consider the spreading of epidemics in a country.

    Here, we propose a mobility model, which we name adaptive memory, that is able to account for conditional waiting times. At first order, the method is equivalent to a classical first-order Markovian model, whereas significant differences with respect to standard approaches emerge for increasing memory. For instance, at second order, the 2-memory mobility matrix is built between all possible pairs of nodes (two-states), as in a standard second-order Markovian model. However, instead of considering transitions between areas in the sequence of calls, as a second-order Markovian model does, transitions in the sequence of distinct geographical areas are considered. This point is crucial, and we clarify it with the example shown in figure 2, which reports the differences between adaptive memory and Markovian models in terms of the probability assigned to different mobility patterns.

    Figure 2. Comparing different mobility models. Mobility models built from a representative sequence of mobile phone calls (BBBBCCCSSS) made, for instance, by an individual during travels between three American cities, namely Chicago (C), San Antonio (S) and Boston (B). Let us focus on the pattern S ← C ← B that is the real sequence of movements in the geographical space. The first-order model predicts a probability of 1/64, the second-order model a probability of 1/49, whereas the adaptive 2-memory estimates a probability of 1/7, closer to observation.

    The importance of such differences is reflected in the ability of each model to predict successive individual movements. In fact, the presence of spurious or under-represented patterns might significantly affect the results, as shown in figure 3. In this example, two sequences of phone calls generated by two different users moving between three cities—B, C and S—are considered. Markovian models generate spurious patterns that are never observed in the data, an issue not affecting the adaptive memory model by construction. Moreover, our approach predicts the next movement with more accuracy than Markovian ones, because it correctly takes into account conditional waiting times.

    Figure 3. Predicting individual mobility. Using the sequence of calls made by two different users (1 and 2), starting from two different locations (B and C) and visiting a new location S, we build first- and second-order Markov models, as well as the adaptive memory one. We use each mobility model to generate the possible mobility sequences. Given that there are two empirical starting points, we originated the sampled sequences in B and C, respectively. In the figure, for each sample, we report the fraction of times it reproduces the observation (correct), it is a non-observed mobility pattern (spurious pattern) and it underestimates or overestimates waiting times (longer/shorter conditional waiting time).

    The difference between the adaptive memory and Markovian models becomes more evident when the corresponding transition matrices are compared. There is no difference at the first order; thus, we will focus on the comparison between τ-order Markovian and adaptive τ-memory models, in the following.

    In both models, the number of possible transitions between state nodes is the same and equals n^(τ+1), where n is the number of physical nodes. For instance, in second-order models, the transition matrix has size n^2 × n^2, with n^3 possible transitions between state nodes, as shown in figure 4. However, the way that each model stores repeating calls in the same physical node is very different. While adaptive memory stores this information in the n^τ diagonal elements of the matrix, encoding the conditional waiting times discussed in the previous section, Markovian models redistribute this information among off-diagonal entries, because they do not allow this type of self-loop by construction.
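
    The different treatment of repeated calls can be illustrated on the sequence BBBBCCCSSS. The sketch below is one simplified reading of the description, not the authors' construction of the matrices: a second-order state is taken as (current location, location of the previous call), and an adaptive 2-memory state as (current location, previous distinct location). Calls made before any movement is observed are skipped in the adaptive case, which is a further assumption. All function names are hypothetical.

```python
from collections import Counter

def second_order_states(calls):
    """State = (current location, location of the previous call)."""
    return [(cur, prev) for prev, cur in zip(calls, calls[1:])]

def adaptive_states(calls):
    """State = (current location, previous distinct location)."""
    states, previous_place = [], None
    for prev, cur in zip(calls, calls[1:]):
        if cur != prev:
            previous_place = prev  # a movement updates the remembered origin
        if previous_place is not None:
            states.append((cur, previous_place))
    return states

def transition_counts(states):
    """Count transitions between consecutive state nodes."""
    return Counter(zip(states, states[1:]))

calls = "BBBBCCCSSS"
markov = transition_counts(second_order_states(calls))
adaptive = transition_counts(adaptive_states(calls))

stay_in_c_from_b = (("C", "B"), ("C", "B"))  # self-loop on state "C, coming from B"
```

    In this reading, the repeated calls in C after arriving from B appear as a self-loop on the state (C, B) in the adaptive counts, whereas the second-order counts route them through the state (C, C) and lose the origin B.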

    Figure 4. Mobility matrix. Second-order transition matrix for three physical nodes A, B and C. The state nodes are represented with the notation Inline Formula meaning that walkers in this node have travelled from node y to node x. The cells in red are not used by either the second-order Markovian model or the adaptive memory model. The cells in blue are used only in the adaptive memory model, while the cells in orange are used only in the second-order Markovian model. The cells in white are used by both models.

    More specifically, the information is redistributed among transitions between state nodes of the same physical node. The entries of off-diagonal blocks—corresponding to transitions between state nodes of different physical nodes—are the same in both models. Therefore, while the stationary probability of finding a random walker in a physical node is not different in the two models, it is different at the level of state nodes and, as we will see later, this significantly affects diffusion processes such as epidemic spreading.

    2.3. Overview of the dataset

    In the next section, we will quantify the impact of adaptive memory on human mobility modelling by using datasets provided by the Data for Development Challenge 2014 [27] and some supplementary datasets provided by partners of the challenge. Mobile phone data consist of communications among 1666 towers distributed across Senegal. We exploit this information to map communication patterns between different areas of the country (i.e. the arrondissements). Another subset consists of 560 million call records of about 150 000 users over 1 year at the spatial resolution of arrondissements. We use this information to map individuals' movements among different arrondissements. Demographic information has been obtained from the Senegal data portal (http://donnees.ansd.sn/en/), an official resource. It is worth noting that the information has been manually checked for inconsistencies and that population data for the arrondissements of Bambilor, Thies Sud, Thies Nord, Ndiob and Ngothie were not available. Such arrondissements did not exist at the time when the population census was obtained, because they were part of larger administrative areas. Information is available for older arrondissements; therefore, we devised a procedure to infer the population in the new areas by combining mobile phone activity and available demographic data, using phone calls as a proxy for population density (figure 5).

    Figure 5. Inferring men and women populations. Second-order polynomial model (solid line) fitting the log–log relationships between the observed mobile phone data and demographics data (points). Men (a) and women (b) populations were fitted separately, thanks to data availability, and have been used to infer the populations in the arrondissements of Bambilor, Thies Sud, Thies Nord, Ndiob and Ngothie. (Online version in colour.)

    We have also used the data to infer more realistic contact rates to be used in viral spreading simulations. The contacts among individuals are generally quite difficult to track at country level. Their rate varies depending on several social and demographic factors such as age, gender, location, urban development, etc. [28,29]. Nevertheless, there is evidence from European and African countries that, on average, the number of daily physical contacts among individuals ranges from 11 to 22 [28,29]. There are no available data about the contact rate in each arrondissement of Senegal; therefore, we need to infer this information from available sources. We first estimate the population density for each region, an administrative level coarser than arrondissement, using available data about number of inhabitants and area. As a plausible range of contact rates, we consider values between 10 and 25. Under the assumption that the contact rate is proportional to the population density, we assign a value to each region that ranges between 10 and 25, with extremal values assigned to the regions with the lowest and highest population density, respectively. We then assign the same contact rate to all arrondissements pertaining to the same region. We obtain a contact rate between 10 and 11 for all regions, except Dakar which has the highest population density.
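
    The assignment of contact rates described above can be sketched as a linear rescaling of population density onto the interval from 10 to 25. The linear (min-max) mapping is an assumption about how the proportionality was implemented, and the region labels and density values below are hypothetical placeholders, not census figures.

```python
def contact_rates(density, low=10.0, high=25.0):
    """Map densities linearly so the sparsest region gets `low` and the densest `high`."""
    d_min, d_max = min(density.values()), max(density.values())
    span = d_max - d_min
    if span == 0:
        # All regions equally dense: fall back to the lower bound (assumption).
        return {region: low for region in density}
    return {
        region: low + (high - low) * (d - d_min) / span
        for region, d in density.items()
    }

# Hypothetical densities (inhabitants per square kilometre).
density = {"sparse_region": 10.0, "typical_region": 250.0, "dense_region": 5000.0}
rates = contact_rates(density)
```

    When one region is far denser than all the others, as in this made-up example, the remaining regions end up close to the lower bound.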

    3. Results

    3.1. Understanding human mobility flow

    We show in figure 6 the significant differences in modelling the mobility flow using first-order (FO), second-order (SO) and adaptive memory (AM) models. Markovian models provide very similar transition patterns, whereas adaptive memory provides very different results. The adaptive memory model exhibits significantly fewer returning transitions than Markovian models, but, on average, with a much higher probability of observing them. In fact, 47.4% of patterns captured by the first-order approach and 43.4% captured using second-order are spurious because they are not observed in reality. Remarkably, the probability that an individual comes back to her origin is on average six times higher using adaptive memory models than using first order, and five times higher than using second order. In the electronic supplementary material, we show the result of the same analysis for the gravity model [30,31] and the more recent radiation model [4,32], two widely adopted approaches to model human mobility.

    Figure 6. Mobility flow among a subset of Senegal's arrondissements. For simplicity, we illustrate the effects of each model by considering a subset of 13 arrondissements and patterns that go through one specific arrondissement (Kael, in this example) after departing from their origin and before reaching their destination. The figure shows the mobility modelled by means of first-order (a), second-order (b) and adaptive 2-memory (c), highlighting the different mobility patterns between Markovian models and adaptive memory. For instance, the adaptive memory model captures returning patterns (i.e. movements like X → Kael → X) better than the first-order model. See electronic supplementary material for results obtained from gravity and radiation models. (Online version in colour.)

    To compare the accuracy of the models against the mobility behaviour observed in data, we use the coverage, defined as the fraction of nodes visited by an individual within a given amount of time. We calculate the coverage for each individual in the data, over a period of one month, and then we average over all arrondissements to obtain a measure at country level. For the same period of time, we generate three transition matrices F, H and A encoding the mobility dynamics for first-order, second-order and adaptive memory models, respectively. To better replicate the calling behaviour of the individuals in the dataset, we extract information about the distribution of time between calls and we use this information in our simulations (see the electronic supplementary material).
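
    The coverage defined above is straightforward to compute from a trajectory. The sketch below is illustrative only: the function names and the toy trajectories are hypothetical, and the plain mean over individuals stands in for the averaging procedure, whose details are not reproduced here.

```python
def coverage_over_time(trajectory, n_nodes):
    """Fraction of distinct nodes visited after each observation of one individual."""
    seen, curve = set(), []
    for node in trajectory:
        seen.add(node)
        curve.append(len(seen) / n_nodes)
    return curve

def mean_coverage(trajectories, n_nodes):
    """Average final coverage over a collection of individual trajectories."""
    finals = [len(set(t)) / n_nodes for t in trajectories]
    return sum(finals) / len(finals)

# Hypothetical trajectories on a network of four nodes.
curve = coverage_over_time("AABCA", n_nodes=4)
average = mean_coverage(["AABCA", "AAAA", "ABAB"], n_nodes=4)
```

    Applying the same functions to observed and simulated trajectories gives curves that can be compared directly, as in figure 7.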

    In figure 7a,b, we show that people diffuse in the country too fast using Markovian models, whereas significantly slower diffusion is found with adaptive memory, in agreement with empirical observation. In the electronic supplementary material, we show the result of the same analysis for both the gravity and the radiation models. We observe that the gravity model is not suitable to reproduce the observation, whereas the radiation model provides results comparable with the adaptive memory model proposed in this study.

    Figure 7. Observed human mobility and theoretical predictions. (a) Temporal evolution of the global mean coverage calculated from real data and from simulations using first-order, second-order and adaptive memory models. (b) Relative difference between the coverage observed in real human mobility and the one obtained from simulations. See electronic supplementary material for results obtained from gravity and radiation models. (Online version in colour.)

    These results have deep implications, for instance, in short-term or long-term predictions of epidemic spreading or national infrastructure planning.

    3.2. Impact of human mobility models on the spreading of epidemics

    Here, we focus on epidemic spreading. How infectious individuals move among different locations has a strong influence on how diseases diffuse in a population. We considered each arrondissement as a meta-population where any individual can interact with a limited number of other individuals. We use an SEIR compartmental model [33] to characterize the epidemic evolution within each arrondissement and mobility models to simulate people travelling in the country.

    The discrete time step of the following models is Δt ≈ 1 h, approximately the observed median between two successive calls from the same individual. The parameters are demographical and epidemiological. Demographics parameters include the birth Inline Formula and death Inline Formula probability, whereas epidemiological parameters correspond to the latent period τE of the infection, from which the probability Inline Formula to pass to the infectious state is calculated, and the infectious period τI, from which the probability γ = Δt/τI to recover from or die because of the infection is calculated. The last parameter is the effective transmission probability

    Display Formula
    3.1
    an arrondissement-dependent parameter that depends on the average number of contacts per unit of time ci experienced by an individual in node i, the fraction of infected individuals in that node and the transmission risk Inline Formula in the case of contact with an infectious individual. In fact, the definition of βi(t) induces a type-II reaction–diffusion dynamics [9] accounting for the fact that each individual does not interact with all the other individuals in the meta-population, but only with a limited sample. If the number of infected agents is small (i.e. Ii(t) ≈ 0) the Taylor expansion of βi(t) truncated at the first order gives the classical factor Inline Formula [33]. It follows that the equations describing the average spreading of a disease according to a SEIR model coupled to first-order mobility are given by
    Display Formula
    3.2
    whereas the coupling to the second-order model is given by
    Display Formula
    3.3
    where Inline Formula is the total population in the country at time t, Inline Formula indicates the floor function and is used to identify the subset of state nodes corresponding to the same physical node the population Inline Formula belongs to. The equations for the adaptive memory model are the same, except that the transition matrix A is used instead of H.
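
    To make the small-incidence limit of the effective transmission probability more concrete, the sketch below assumes a commonly used type-II form, one minus the probability of escaping infection in every contact of a time step. This functional form, the function names and the numerical values are assumptions for illustration and are not taken from the displayed equations.

```python
def effective_transmission(contacts, risk, infected, population):
    """Assumed form: 1 - (1 - risk * I / N) ** contacts, with contacts per time step."""
    return 1.0 - (1.0 - risk * infected / population) ** contacts

def linearized_transmission(contacts, risk, infected, population):
    """First-order truncation for small I / N: contacts * risk * I / N."""
    return contacts * risk * infected / population

# Hypothetical values: few infected individuals in a large meta-population.
small_exact = effective_transmission(10, 0.05, 5, 100_000)
small_linear = linearized_transmission(10, 0.05, 5, 100_000)

# Hypothetical values: half of the meta-population infected.
large_exact = effective_transmission(10, 0.05, 50_000, 100_000)
large_linear = linearized_transmission(10, 0.05, 50_000, 100_000)
```

    Under this assumption the two expressions coincide when few individuals are infected, while at high incidence the exact form saturates below the linear one because each individual meets only a limited sample of the meta-population.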

    We initiate the simulation by infecting five individuals in Barkedji, at the centre of Senegal. The differences between the diffusion of the infective process using each mobility model are clearly visible in figure 8. The spreading is faster for Markovian models, with some arrondissements hosting more infected individuals than with adaptive memory. The incidence, i.e. the fraction of infected individuals in an arrondissement, follows different spatial patterns in the three models (figure 8a–c), with a higher incidence observed at the origin of the infection that decreases as we move away from it. This effect is significantly stronger using adaptive memory because it tends to concentrate more infectious individuals close to the origin (figure 8d).

    Figure 8. Spreading of an influenza-like outbreak in Senegal. We show the incidence of an influenza-like virus over Senegal arrondissements a week after the infection onset, using first-order (a), second-order (b) and adaptive 2-memory (c) mobility models. The infection started in Barkedji (centre of Senegal), where three individuals are initially infected. An SEIR compartmental dynamics with parameters β = 0.05, Inline Formula, γ = 0.5 is used to simulate the spreading of the disease within each arrondissement. We found that the number of arrondissements with infected individuals is higher using Markovian dynamics. Conversely, the adaptive memory favours a higher concentration of infected individuals in the arrondissements around the initial location of the infection. In fact, the location of the onset of the epidemic can be better identified using adaptive memory rather than Markovian models. (d) Relation between the incidence in a region and the distance from the hotspot of the infection using the three models. Adaptive memory models concentrate the incidence in regions closer to the hotspot and this effect is even more evident when higher memory is used.

    4. Discussion

    Modelling how people move among different locations is crucial for several applications. Given the scarcity of information about individuals' movements, human mobility proxies such as call detail records, GPS, etc., are often used instead. Here, we have shown that dynamical models built from human mobility proxies can be significantly wrong, underestimating (or overestimating) real mobility patterns or predicting spurious movements that are not observed in reality. We have proposed a general solution to this issue, by introducing an adaptive memory model of human mobility that better captures observed human dynamics and dramatically reduces spurious patterns with respect to memoryless or higher order Markovian models. However, it is worth remarking that this approach, like all other methods in the literature, is based on the assumption that an individual makes a call in each place he or she visits. In fact, this is not always true and care must be taken when interpreting the results. Fortunately, an appropriate choice of the spatial granularity, for instance at administrative levels corresponding to cities or larger areas, reduces this unavoidable effect. We have validated our model on a dataset consisting of 560 million call detail records from Senegal. We have found that individuals tend to diffuse faster with standard mobility models than observed in reality, whereas the adaptive memory approach reconciles empirical observations and theoretical expectations. Our findings have, for instance, a deep impact on predicting how diseases spread in a country. While standard approaches tend to overestimate the geographical incidence of the infection, the more realistic modelling obtained by means of adaptive memory can improve the inference of the hotspot of the infection, helping to design better countermeasures, e.g. more effective quarantine zones, improved resource deployment or targeted information campaigns.

    Data accessibility

    The data used in this paper were made publicly available during the D4D Senegal Challenge organized by Orange. More information about data can be found here: http://www.d4d.orange.com/en/presentation/data.

    Authors' contributions

    J.T.M. and M.D.D. contributed equally to this work. J.T.M. and M.D.D. developed the theoretical model and carried out the statistical analyses; M.D.D. and A.A. conceived of the study, designed the study and coordinated the study. All authors wrote the manuscript and gave final approval for publication.

    Competing interests

    The authors declare that they have no competing interests.

    Funding

    M.D.D. acknowledges financial support from the Spanish programme Juan de la Cierva (IJCI-2014-20225). J.T.M. was supported by Generalitat de Catalunya (FI-DGR 2015). A.A. acknowledges financial support from ICREA Academia, James S. McDonnell Foundation and Spanish MINECO FIS2015-71582.

    Footnotes

    Published by the Royal Society. All rights reserved.

    References