Towards powerful experimental and statistical approaches to study intraindividual variability in labile traits

There is a long-standing interest in behavioural ecology, exploring the causes and correlates of consistent individual differences in mean behavioural traits (‘personality’) and the response to the environment (‘plasticity’). Recently, it has been observed that individuals also consistently differ in their residual intraindividual variability (rIIV). This variation will probably have broad biological and methodological implications to the study of trait variation in labile traits, such as behaviour and physiology, though we currently need studies to quantify variation in rIIV, using more standardized and powerful methodology. Focusing on activity rates in guppies (Poecilia reticulata), we provide a model example, from sampling design to data analysis, in how to quantify rIIV in labile traits. Building on the doubly hierarchical generalized linear model recently used to quantify individual differences in rIIV, we extend the model to evaluate the covariance between individual mean values and their rIIV. After accounting for time-related change in behaviour, our guppies substantially differed in rIIV, and it was the active individuals that tended to be more consistent (lower rIIV). We provide annotated data analysis code to implement these complex models, and discuss how to further generalize the model to evaluate covariances with other aspects of phenotypic variation.


Introduction
Ecologists and evolutionary biologists are often interested in studying the variation among individuals within a population and the consistency of these differences, particularly in the study of labile traits. To do so requires resampling individuals and assessing the relative proportion of overall variance that can be attributed to individual differences, most often quantified using 'repeatability' to estimate the consistency of scores through time [1][2][3][4]. Individuals often also vary in their response to their internal or external environment ('plasticity'), and thus vary in temporal trends and/or responses to obvious contextual variation (e.g. [5,6], reviewed in [7]). However, even after accounting for both these hierarchical levels of systematic variation, considerable residual variation remains in labile traits (hereafter termed 'residual intraindividual variability (rIIV)'; [8]), which can conceal important biological processes [9].
Behaviour is a particularly labile trait, which varies at multiple phenotypic levels, and is flexible in response to a changing environment. It has long been appreciated that individuals commonly differ in their mean level behaviour (often termed 'personality'), and there is broad consensus that behaviour is, in general, repeatable [2]. However, the repeatability of behaviour is low on average, in that the majority of behavioural variation occurs within individuals, rather than among individuals [2]. The same can be said for physiological traits, such as metabolism, which are similarly labile [3]. Thus, the challenge is to better understand factors that affect this large but understudied component of labile trait variation.
Recent studies indicate that this rIIV can also significantly differ between individuals [8]. These differences are frequently observed under highly controlled conditions and after accounting for systematic temporal and/or contextual plasticity among individuals [8,10,11]. Residual IIV is typically viewed as short-term unpredictable fluctuations in an individual's trait scores in a given context [7]. Although the few studies that exist are predominantly behavioural, there are recorded examples of individual differences in rIIV in hormone physiology [12] and cattle milk yields [13]. Interestingly, rIIV itself has been shown to be repeatable, in the sense that rIIV has some within-individual consistency across months under controlled laboratory conditions [10] and across seasons in the field [14]. Furthermore, rIIV in non-behavioural traits and residual intragenotypic variability have heritable components [13,15] which demonstrates rIIV could be viewed as a trait that could be selected upon.
The study of individual differences in rIIV is clearly still in an exploratory stage, and we need more research to quantify rIIV effectively and to start building a consensus on the magnitude and correlates of rIIV. In turn, this will guide empirical and theoretical research into its proximate causes and ultimate functions. One reason this topic is rarely addressed, at least in part, is due to the large sample sizes required at both the within-individual level (repeated measures per individual) and between-individual levels (replicate individuals) to estimate effect sizes with precision [16]. Another possible reason this topic is relatively unexplored is the complexity of the analyses needed to quantify these levels of variation in the best possible way [11,16,17]. Under simple model scenarios, a statistically powerful analysis requires in the region of at least N = 50 individuals with 10 repeated measures [16]. To the best of our knowledge, no existing study has yet met even these conservative estimates of minimum sample sizes needed, and so we argue that rigorous estimates of the magnitude of variation in rIIV are lacking, thereby impeding progress on this topic. Ideally, such data need to also be measured in controlled conditions to negate the effect of unquantified contextual environmental variation that could affect some individuals more than others [8,9]. However, that is not to say we should not study rIIV in the field, but individual differences in rIIV will probably be hard to distinguish from responsiveness to unobserved sources of temporal or contextual variation [14,17,18].
Individual differences in rIIV, if widespread and substantial, have significant implications for the design of experiments and the statistical analysis of those data. These issues have only just begun to be noted from a design [8] and statistical perspective [16,19], and so the major aim of this study is to provide a model example of experimental design, sampling effort and data analysis to best study rIIV. To further our understanding of rIIV, studies should analyse the covariance of individual differences in rIIV with individual mean levels of that trait. Such correlations could help form predictions on how rIIV relates to better known axes of phenotypic variation, which may help to inform on the function or proximate causes of rIIV. There are some early indications that individual differences in rIIV can be related to individual mean trait values [8,14,20], but these correlations can be estimated more rigorously and powerfully using methods described below, compared with the two-step approach these studies employed. More generally, the presence of individual variation in rIIV indicates a violation of model assumptions that can potentially affect our inferences from any data of a labile trait. Indeed, a recent study demonstrated how ignoring this level of behavioural variation altered the biological conclusions drawn from the data, severely reducing the power of the study [18].
Instead of viewing rIIV as being due to unexplained stimuli and continuing to try to better control experiments, it may be better to view this variation as a trait in itself, on which evolution acts. It is thought that rIIV, in behaviour, may represent different ways to facilitate learning, social interactions or avoid predation. For example, hermit crabs on average exhibited higher latency to emerge from their shell and higher rIIV when exposed to predator cues [21], suggesting that high levels of rIIV could potentially reduce the predictability of prey behaviour to predators. In terms of learning, large declines in rIIV have been observed across adolescents in humans during a time of rapid learning and brain development [22]. Similarly, average rIIV in activity rates decreased over the course of months in mosquitofish when held in isolation [10]. Consequently, we could predict exploratory and reactive individuals to show higher rIIV to facilitate learning.
Here, we conducted a controlled laboratory study that quantified variation in activity rates at the among-and within-individual levels. We used N = 104 male guppies (Poecilia reticulata), and measured them approximately five times each per burst, across three distinct bursts of sampling, for a total of 15 repeated measures per individual. This level of sampling provides the most robust analysis and description of rIIV to date, yielding good precision in estimating among-individual variation in rIIV. In addition to providing a model example of burst sampling methods and experimental design required to rigorously study rIIV, we extend recent statistical methods [16] to assess whether individual differences in rIIV are related to individual predicted mean values. Our sampling design also permits us to estimate the repeatability of individual mean values over time, in addition to the familiar estimation of repeatability of individual scores (traditional repeatability or intraclass correlation). We also discuss how the statistical models we use can be further extended to assess how rIIV may covary with individual differences in plasticity, or how rIIV may covary with other aspects of phenotypic variation. We provide annotated code to assist those wishing to explore these questions on their own data.

Study species, housing, husbandry
We randomly selected 111 male guppies (P. reticulata) from laboratory stock. Guppies used were laboratory-raised descendants of fish collected from an invasive population in northern Queensland 4 years prior to this study (see [23] for more information). Data from sick fish were discarded, though data from fish which jumped out of the tank were retained in the analyses, and dead fish were replaced. Our final sample size was 104 individuals. The experiment was run in two batches of roughly 50 fish, separated by one week (and accounted for in our model, see equation (2.1)). A batch of fish was processed over 2 days and then left to recover and acclimate overnight before the behavioural trials began. Guppies were first anaesthetized with MS-222 (Sigma E10521, 0.2 g l −1 , buffered to pH 7.6), gently dabbed with a Kimwipe to remove excess water, and weighed to the nearest 0.001 g. Fish were then placed individually in a 3 l tank with a 1 cm layer of gravel and an air-stone. Two tank walls were covered in opaque white plastic, so fish had no visual contact with neighbouring fish. A 4 cm square grid (5 wide, 3 high) was drawn on the back of the tank to aid observations of activity. Fish were fed once daily a drop from a pipette containing either finely crushed commercial flake food or twice weekly brine shrimp nauplii, and kept in a 12 L : 12 D photoperiod, consistent with the stock tanks. At day 9-10 post-processing, 50% of the water was changed, and positions of tanks were shuffled to control any position effect.

Behavioural measurements
To measure spontaneous activity, a single observer (D.J.M.) counted the number of lines crossed over a 2 min time period, from a distance of 1 m from the tank. Fish were deemed to have crossed a line when the head and pectoral fin had crossed. After assaying 25 fish, the temperature for each tank was measured to the nearest 0.1°C. Over the entire duration of the experiment, temperatures ranged from 23.8 to 25.6°C in a temperature-controlled laboratory. The range of temperatures an individual experienced across the experiment was minimal (median range = 0.65°C), and the range experienced within a week was lower still (median range = 0.2°C). Because subtle fluctuations in temperature can have a significant effect on the behaviour of ectotherms [5], we statistically controlled for temperature at the population level (see equation (2.1)).

Sampling design
Following recent suggestions, we employed a 'burst' sampling design [8,10]. These bouts of intensive data collection were taken over 2-3 days in each of three consecutive weeks (= three bursts per individual). Each individual was assayed on average five times per burst (range = 4-6). This form of burst sampling allows for the consistency of the mean and rIIV to be analysed through time [10,24], and allows for more efficient modelling of time-related change as it does not assume linear trends (details below). The first burst was run for two days after processing (Thursday-Friday), and bursts 2 and 3 were run for three days each (Wednesday-Friday). Activity was measured once in the morning (between 09.00 and 12.00) and once in the afternoon (between 13.00 and 16.00) on trial days. In total, we used data from 104 individuals with typically three bursts each and a total of 15-16 activity assays each (N obs = 1477).

Statistical methods
Despite the potential importance of individual differences in rIIV to the study of labile traits, the requisite statistical tools have been introduced only recently to the field of ecology [16,17]. Previous analyses of rIIV involved either a two-step approach of fitting a model and extracting the residuals for use in a second analysis [8,10,21] or through fitting individual specific residual variances, without assuming a distribution [12,25]. However, such methods allow for only limited power and comparability between studies, whereas the doubly hierarchical GLM (DHGLM) offers a more parsimonious analysis of rIIV [16].
The DHGLM allows for the simultaneous analysis of a mean level model and a residual level model, the latter in log-linked standard deviations. Each is a linear model that can be specified to include both fixed and random effects, thus allowing one to ask whether individuals differ in their within-individual residual variances (rIIV). Such methods are explained in detail elsewhere [16]; therefore, the DHGLM will not be discussed at length here. However, further to the methods previously employed, we fit a random effect covariance matrix to analyse the correlation between individual variation in means and individual variation in rIIV. This covariance matrix can also be easily generalized, if desired, to evaluate whether the intercept and rIIV covary with individual variation in plasticity (random slope effects; details are given below).
The mean model (equation (2.1)) for the guppy data was fitted with the fixed effects of the batch (0 or 1), mass (continuous), time post-processing in weeks (integer), time of day (morning or afternoon) and temperature (continuous). The mean model also included the random intercept effects of individual identity (ID j ) and burst nested within ID (equation (2.1)). The random intercept of ID creates a predicted value for the intercept of each ID j (j = 1 : N ID ). The nested random intercept effect of burst was created as a new dummy variable, representing the interaction between ID (N ID = 104) and week (categorical; N = 3), creating a new factor ('ID * Burst'; N ID * Burst = 294). A predicted ID * Burst deviation is then given from the predicted value of the random intercept of ID plus the predicted value for each ID * Burst k (k = 1 : N ID * Burst ), which is normally distributed (equation (2.4)).
The residual model (equation (2.2)) was fitted with the fixed effect of mass and time post-processing to test for effects of experience on rIIV, and a random intercept effect of ID, giving a predicted standard deviation (i.e. rIIV) for each ID j ( j = 1 : N ID , equation (2.2)). Finally, we allowed for a covariance (equation (2.3)) between predicted mean values of activity (ID µj , equation (2.1)) and predicted log-standard deviation (ID σ εj , equation (2.2)) among individuals.
Activity rates were log-transformed to achieve normality and then Z-transformed (standardized to mean = 0, standard deviation = 1) to aid model fitting and to facilitate comparison of variance parameters. The variances of random intercepts in the mean model can, therefore, be interpreted as proportions of the total variances of the dataset. Mass and temperature were also Z-transformed to centre predictors and aid model fitting, whereas week was centred on the second week of trials. The covariance between the random intercepts in the mean and residual models can be highly sensitive to different transformations, and care must be taken when choosing a transformation [26]. In the case of our data, these were counts, which predict a covariation between the mean and variance proportional to the log-scale, hence we chose to first log-transform the data. Although we fit the data to a normal distribution opposed to a Poisson for statistical convenience, all assumptions of normality and linearity at each ID, ID*Burst and residual levels were checked post hoc, and all assumptions were satisfied. To further evaluate model performance, we also simulated data of known structure.
Data were analysed in the Bayesian, Markov Chain Monte Carlo software Stan [27], through the 'RStan' interface [28]. All parameters were given uninformative priors (see electronic supplementary material). Importantly, the Hamiltonian Monte Carlo sampler, employed by Stan, allows for an LKJ prior to the covariance of random effects [29], opposed to the inverse-Wishart priors typically used by the common Gibbs samplers. The inverse-Wishart priors may yield biases in estimations where the magnitude of variance parameters differs greatly, owing to the assumption of equal degrees of freedom across all dimensions in the covariance matrix [30]. We set the LKJ prior for the correlation to 2, equal to the number of dimensions in the matrix, which causes a small peak in the prior probability of independence of the random effects (COV int,rIIV = 0). Three chains were run in parallel to evaluate convergence, for a posterior of 50 000 iterations with a 10 000 iteration warm-up. Plots of fitted values versus residuals were checked for goodness of fit as well as checks for normality at each of the levels of the analysis (predicted values of ID and ID*Burst). Stan and R code for data specification, the model and goodness-of-fit diagnostics are provided in electronic supplementary material; we also provide model code for the more commonly used JAGS program [27] to aid accessibility of such an analysis.
Through the addition of the ID*Burst interaction into the residual model, it is possible to assess the consistency of individual differences in rIIV through time, with methods analogous to those proposed to quantify the repeatability of the reaction norm intercept (equation (2.5), [24]). However, in quantifying the consistency of rIIV, it is important to have large numbers of repeated measures at the burst level.
Owing to the relatively low number of repeats per burst (N obs = 4-6), there was limited power to test for an effect of ID*Burst in the residual model, and inclusion was deemed to over-fit the model. We do, however, evaluate the consistency of mean values between individuals as the 'repeatability' of intercepts (R int ; equation (2.5)). Variations in reaction norms indicate that mean differences may not be maintained through time, or across contexts [4,31]. Using reaction norms, one can calculate the correlation between trait means across contexts or time [31], and similarly, R int (i.e. the 'intraclass correlation coefficient') quantifies the consistency of the mean values across bursts.
We also quantified the 'short-term' (conditional) repeatability (sensu [24]) which accounts for individual differences in time-related change. This repeatability, thus, quantifies the consistency of scores within the short timeframe of bursts (equation (2.6), [24]). We also estimated the 'long-term' (unconditional) repeatability (equation (2.7)), which, by contrast, estimates the repeatability of scores without accounting for time-related change [4], and gives the overall consistency of scores expressed across all bursts, for the duration of the study. Finally, as a standardized metric of among-individual variation in rIIV, we transformed the variance parameter in the residual model to the coefficient of variation in predictability (CV P , equation (2.8)), as prescribed by Cleasby et al. [16].
and CV P = exp(ω 2 ID σε ) − 1. (2.8) To evaluate the importance of heterogeneous rIIV to the model fit, we ran a model with the random effect of ID removed from the residual model, and calculated the change in fit using Watanabe-Akaike information criterion (WAIC) with the 'loo' package [32].

Discussion
Here, we demonstrated consistent individual differences in the activity rates of guppies, across multiple time periods, and at several hierarchical levels of variation. First, we observed that individuals showed pronounced differences in their rIIV. These differences were evident under highly controlled conditions, after accounting for both significant and substantial individual differences in mean values and accounting for change in activity over time. While the model fit does not enable us to calculate the repeatability of rIIV per se, the burst sampling experimental design does allow for the rejection of the null hypothesis of no consistency in rIIV, which corroborates previous work, quantifying the repeatability of rIIV [10,14]. Had we collected even more data, we could have estimated the consistency of rIIV in the same way as we estimated the consistency of individual mean values over time (see equation (2.5)). To do so would require adding an ID*Burst term to the residual model and adding a second covariance matrix at the burst level.
We believe that our study demonstrates well how burst sampling offers a powerful approach to partition the variance among individuals, and within-individual variance explained by temporal trends, to appropriately control for these temporal patterns in behavioural data [4]. This approach revealed that individual mean values were moderately maintained over time (R int = 0.55), a key assumption of animal personality, though with considerable fluctuations in mean values through time ( figure 1d). Furthermore, examining the repeatability of activity scores revealed substantially higher within-burst consistency than across bursts (0.56 versus 0.31), further indicating the extent to which individuals differed in temporal trends [4]. This reduction in repeatability with time is consistent with results from meta-analyses of behavioural repeatability [2] and metabolic repeatability [3] which showed longer time intervals over which repeated measures are taken to reduce the repeatability of scores. Importantly, the multiple burst sampling design allows for modelling of time as a categorical variable, and therefore does not assume individual temporal trends to be linear (figure 1d), in contrast to more familiar random regression methods.
By quantifying the correlation between predicted mean level trait and rIIV among individuals, we can begin to evaluate whether rIIV relates to better known behavioural syndromes. There was a moderate negative correlation between mean activity rate and rIIV (figure 1e). As active individuals may be typically bolder, more aggressive and less responsive to environmental change [33][34][35], high rIIV could represent a more responsive phenotype and fit into the reactive syndrome, which predicts positive covariations between these traits and being fast explorers and more aggressive. Such a prediction would be consistent with a correlation between high rIIV and low boldness in hermit crabs [8]. Conversely, in a study on flight initiation distances, bolder agama lizards (Agama planiceps) were also more variable, potentially offsetting the risks associated with boldness by reducing predictability to the predator [14].
While there are biological reasons to predict a mean-variance covariance among individuals, one could also argue any such covariance to be an artefact of the scaling of the data, and another transformation would have yielded different results. This is because the log-transformation, as with other power function transformations (e.g. square-root) each predict a different positive relationship between the mean and variance. Such scaling issues could be addressed using methods such as a Box-Cox transformation whereby the data are raised to a series of power functions in search of the most likely fit to a normal distribution [16,26,36]. However, we argue that this is an unadvisable and arbitrary practice, which will further confuse the understanding of residual variation. These methods may not be reproducible in a replicated experiment, yielding different transformations, and the model will not account for this uncertainty in scaling. Quantifying the covariance between the mean intercept and rIIV will also allow for more critical and formal appraisal of the scaling of the data.
Our inclusion of a covariance matrix linking random effects in the mean and residual sides of the model is an important extension to the DHGLMs previously described and used in behavioural ecology [11,16,17]. Not only does this extension allow for quantifying the among-individuals correlation between the mean and rIIV in a more powerful and parsimonious way than the two-step process previously used [10,14,21], the covariance matrix can also be easily generalized to include reaction norms. Environments are often highly dynamic, with individuals responding to multiple environmental factors, which may interact in complex ways [6]. More environmentally responsive (i.e. more 'plastic') individuals could, therefore, have greater rIIV, owing to greater organismal error in calculating the ideal trait value from moment to moment [9,37,38]. Individuals may also be differentially constrained in the range of behavioural or physiological scores they can express, and their ability to gain knowledge on the current environment, which may also affect the amount of organismal error [37]. Limitations to an individual's ability to express a range of scores could limit both contextual plasticity and rIIV, while limiting the reliability of information increases rIIV. Theoretical models of plasticity based on Bayesian updating processes predict rIIV and residual intragenotypic variability to inform the prior distribution for the phenotypic scope for developmental plasticity [39], and these covariance matrices will prove useful in empirically testing these predictions. An important caution though, where individuals differ in both the magnitude and direction of their reaction norms, the most plastic individuals will be those with the largest positive and negative slopes, and the relationship between plasticity and reaction norms will not be a simple linear function [7].
Such a covariance matrix can also be specified to evaluate how rIIV relates to other aspects of the phenotypic variation. The matrix explaining the multivariate distribution of an individual phenotypic variation can link models with different response variables (phenotypic traits) at the individual level to assess how rIIV fits into better known axes of phenotypic variation (e.g. whether rIIV in activity correlates with boldness), and whether individuals are consistently variable across traits (e.g. whether high rIIV in activity correlates with high rIIV in boldness). However, as dimensionality in the covariance matrix increases, the positive definite assumption becomes more problematic in Bayesian models [30], and the recommended priors here will increase the prior probability of independence, making the analysis more conservative.
The social environment may be an important cue for individuals in the level of rIIV to express, and in this experiment, the effect of socially isolating individuals may have altered their mean behaviour and rIIV. This is because individual differences in rIIV can create differences in the predictability to predators or social partners. Where animals repeatedly interact, past experience informs later encounters [40], and being predictable can aid in social interactions where individuals aim to cooperate or coordinate behaviours [41]. Feedback between socially responsive and consistent individuals could maintain variation in both traits [42][43][44]. Conversely, in the short term, social interactions may homogenize behavioural variation among individuals and stabilize activity patterns to facilitate social functioning [41]. However, male guppies are much less social than females in this species [45], which may minimize this effect.
From comparisons with the psychology literature, one would predict rIIV to differ between learning types and to decrease with increasing experience and across ontogeny [39]. However, this did not appear to be the case in this experiment, and rIIV did not change with time over the three weeks of observations. In another short-lived Poeciliid (mosquitofish, Gambusia holbrooki), rIIV in activity rates decreased across the course of months, perhaps representing ontogenetic change or continued acclimation to being held in isolation [10]. Furthermore, as experience builds and animals acclimate (or habituate) to a new environment, a similar decrease in rIIV could be expected on a shorter timeframe, and such an effect is perhaps evident in the initial highly erratic and unpredictable responses of damselfish acclimating to the laboratory environment [46].
While the biological implications of individual differences in rIIV are at this stage speculative, the methodological implications are well established. The mixed-effect models typically used in the study of labile traits have the assumption of equal variances within-individuals. A comparison of individual predicted values taken from two standard deviations either side of the mean rIIV would yield a 28-fold difference in residual variances. This degree of heterogeneity would demonstrate a clear violation of statistical assumptions if ignored in the model, and highlights the importance of considering rIIV in the study of behavioural traits. Such variation should be visualized in the plots of raw data (figure 1a-c), and plotting out residuals by individuals is an important diagnostic step in the analysis of labile traits.
In conclusion, in this experiment, we have provided a model example of experimental design and analysis to study rIIV, and a highly robust analysis of individual differences in rIIV. Male guppies were shown to differ greatly in their residual variation, the consistency of mean values over time and consistency of scores over time that were (not surprisingly) dependent on whether or not we accounted for time-related change. We build on previously discussed statistical approaches [16] to evaluate how rIIV correlates with the mean trait, and discuss how to further extend the model to incorporate covariances between rIIV and plasticity (given by reaction norms). These additions will open novel avenues of research, and we suggest that future work should use these statistical methods and burst sampling strategies to test predictions of the causes and consequences of rIIV in behaviour.
Ethics. All data collection was approved by Deakin University Animal Ethics Committee (permit no. A93-2011). Data accessibility. All data and data analysis code are available in the electronic supplementary material.