Human voice pitch measures are robust across a variety of speech recordings: methodological and theoretical implications
Abstract
Fundamental frequency (fo), perceived as voice pitch, is the most sexually dimorphic, perceptually salient and intensively studied voice parameter in human nonverbal communication. Thousands of studies have linked human fo to biological and social speaker traits and life outcomes, from reproductive to economic. Critically, researchers have used myriad speech stimuli to measure fo and infer its functional relevance, from individual vowels to longer bouts of spontaneous speech. Here, we acoustically analysed fo in nearly 1000 affectively neutral speech utterances (vowels, words, counting, greetings, read paragraphs and free spontaneous speech) produced by the same 154 men and women, aged 18–67, with two aims: first, to test the methodological validity of comparing fo measures from diverse speech stimuli, and second, to test the prediction that the vast inter-individual differences in habitual fo found between same-sex adults are preserved across speech types. Indeed, despite differences in linguistic content, duration, scripted or spontaneous production and within-individual variability, we show that 42–81% of inter-individual differences in fo can be explained between any two speech types. Beyond methodological implications, together with recent evidence that inter-individual differences in fo are remarkably stable across the lifespan and generalize to emotional speech and nonverbal vocalizations, our results further substantiate voice pitch as a robust and reliable biomarker in human communication.
1. Introduction
Largely inspired by acoustic communication research in other animals ([1], review), fundamental frequency (fo) is arguably the most intensively studied voice parameter in human nonverbal communication. Produced by vibration of the vocal folds in the larynx (the source of vocal output in most terrestrial mammals [2]), fo and its harmonics are perceived as voice pitch and are highly perceptually salient. While fo signals static individual differences such as sex, size and identity in many vertebrate species [1,3], it is more sexually dimorphic in human adults than in any other extant great ape [4] and has been repeatedly linked to testosterone levels, masculinity, dominance and social power ([5,6], reviews), as well as to mate preferences across diverse human cultures ([7], review). Thousands of studies have uncovered the communicative relevance of this source signal in humans, from predicting the outcomes of competitive contests (e.g. fights [8], sports [9] and elections [10]), to predicting reproductive success ([6,7], reviews), thus providing strong converging evidence that voice pitch, particularly in men, has been intensely shaped by sexual selection to index biologically and socially relevant information [4,5,11].
Importantly, a person's voice pitch can vary considerably as they speak (i.e. intra-individual differences [12]), and people can readily modify their pitch by tensing their vocal folds or modulating airflow from their lungs [13], for instance for prosodic emphasis [14], to communicate emotion and motivation in nonverbal vocalizations [15], or even to exaggerate biological traits like body size [16]. Nevertheless, there remain sizeable inter-individual (between-individual) differences in baseline or habitual voice pitch. These individual differences are largely imposed by anatomical and physiological constraints on vocal production. Most notably, at puberty, a surge of testosterone permanently enlarges the male vocal folds and causes male fo to drop to nearly half the frequency of female fo, producing significant pitch differences between adults and children and between men and women [17].
Yet, habitual speech fo can vary substantially even among adults of the same sex. In this study, for example, men's average fo ranged from 78 to 182 Hz and women's from 126 to 307 Hz. Thus, the magnitude of pitch differences within sexes parallels that observed between sexes and greatly exceeds just-noticeable differences in pitch perception from speech [18]. It is the vast inter-individual, within-sex differences in voice pitch that have been repeatedly linked to individual differences in various biosocial traits. Recently, studies have further revealed that between-individual differences in mean fo emerge early in life and remain remarkably stable thereafter, as the cries of 4-month-old infants predict their speech fo in childhood [19], and pre-pubertal fo predicts post-pubertal fo in males across the lifespan [20].
Given the broad ecological relevance of voice pitch and its popularity in the human behavioural sciences, it is important to ascertain whether the vastly different types of voice stimuli used to measure and study fo are valid and comparable. Researchers interested in the functions of human nonverbal vocal parameters have long relied on monophthong vowels (a e i o u), consonant–vowel–consonant (CvC) words or counting [18,21,22]. These standard speech types enjoy linguistic neutrality, high cross-cultural comparability, steady fo and standardized formant contrasts. However, with the desire to increase ecological validity in the voice sciences came an increase in the complexity of speech stimuli used for acoustic analysis and playback experiments, from short greetings (e.g. [23,24]) to longer scripted paragraphs (e.g. [25]) and, finally, bouts of spontaneous or conversational free speech (e.g. [26,27]), sometimes extracted from online videos of real-life vocal exchanges [10,20]. Are fo measures comparable across this wide variety of speech utterances? While one recent study showed that listeners' judgements of dominance, trustworthiness and competence were similar for scripted words versus sentences produced by the same vocalizers [28], the authors did not objectively measure fo nor compare short utterances with longer speech. Here, we use a validated supervised extraction method to measure fo from nearly 1000 affectively neutral speech utterances in over 150 adult men and women, each producing single words, vowels, counting, greetings, read paragraphs and free spontaneous speech.
2. Methods
(a) Participants
We audio recorded 154 adults (n = 83 women, mean age 35.2 ± 1.3, range 19–67; n = 71 men, mean age 29.9 ± 1.4, range 18–65). Participants were recruited from the local community in a large European city (Wroclaw, Poland) using online and public adverts. While, based on previous work, strong effect sizes may be obtained with 25 vocalizers per sex and of a similar age [29], sample sizes here were increased to accommodate a much broader age range (18–67 years). No participant reported acute conditions that could affect their voice (e.g. cold, sore throat) and all provided informed consent.
(b) Voice recording
Participants were audio recorded in private sessions in a quiet room using a Zoom H4n microphone positioned 10 cm from the mouth. Voice recordings were saved as WAV files at 96 kHz sampling frequency and 16-bit resolution. Participants first familiarized themselves with a script containing six items, presented in a randomized order between participants, and were then asked to speak each item aloud. The six randomized items included: a series of five monophthong vowels (/a/ as in ‘bra’, /ɘ/ as in ‘bird’, /i/ as in ‘bee’, /ɔ/ as in ‘bot’, /u/ as in ‘boot’); a series of five words (containing the same five vowels as above, but in LvT context, ‘lot, lat, lej, lit, lud’); counting from 1 to 10; a greeting (‘Hello, I am from …’); a read paragraph (five neutral sentences regarding the weather); and free speech, in which participants were instructed to say several spontaneous sentences about the weather. Weather is a relatively affectively neutral topic and it standardized content between the read paragraph and free speech. Original and translated recording scripts are available as electronic supplementary material. In addition to voice recording, participants completed a short demographic questionnaire and their height was measured with a metric tape.
(c) Acoustic analysis
Acoustic editing and analyses were performed in Praat v. 6.1.08 [30]. Voice recordings were segregated by speech type, resulting in six utterances per vocalizer (figure 1a) for a total of 924 utterances. To retain only a single item for the ‘word’ category, we analysed the central, steady-state word from the series. The duration was measured from the beginning to end of voicing and ranged from an average of 0.3 s (word) to 21.5 s (read paragraph; see electronic supplementary material, table S1).
Figure 1. Voice pitch differences across vocalizers far outweigh differences across speech types. (a) Examples of the six speech types illustrated with waveforms and spectrograms (y-axis 0–5 kHz) from a single adult male, whose mean fo ranged from 105 Hz to 125 Hz across speech types. The fo contour (path) obtained using Praat's pitch tracking and path finder functions is shown in pink below each spectrogram (range 50–250 Hz). (b) Horizontal black bars indicate mean voice pitch (fo, Hz) averaged across all vocalizers for a given sex and speech type. Estimated marginal means and pairwise comparisons derive from a linear mixed model (LMM), where ***p < 0.001, **p < 0.01 and *p < 0.05 following Šidák correction. Overlaid dot plots (jittered along the x-axis for improved visualization) show the mean fo of each vocalizer plotted along the y-axis, females in (i) (n = 83, orange circles), males in (ii) (n = 71, blue squares).
Fundamental frequency parameters were measured from the full voiced duration of each utterance using a validated custom script and Praat's pitch-extraction algorithm and path finder function, with the recommended search range of 60 to 300 Hz for men and 100 to 600 Hz for women, and a 0.01 s time step. The fo contour (path) was systematically extracted and manually inspected (figure 1a), and any erroneous frequency candidates in the selected path (e.g. arising from octave jumps) were de-selected or corrected in the Pitch editor window before computing mean fo (average pitch across the utterance) and foCV (coefficient of variation, a measure of pitch variability that controls for baseline fo, computed as fo s.d./fo mean). These established protocols have been successfully applied in numerous studies (e.g. [29,31,32]). All acoustic parameters are summarized in the electronic supplementary material, table S1.
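As an illustration of the two summary measures defined above, the following minimal Python sketch computes mean fo and foCV (fo s.d./fo mean) from an already-extracted pitch contour. It is not the custom Praat script used in the study: the function name `summarize_fo`, the convention of marking unvoiced frames as `None` or 0, the use of the sample standard deviation (n − 1), and the contour values themselves are all assumptions made for the example.

```python
from statistics import mean, stdev


def summarize_fo(contour_hz):
    """Return (mean fo in Hz, foCV) for a pitch contour.

    Frames marked as None or 0 are treated as unvoiced and skipped
    (an assumed convention, not taken from the study).
    """
    voiced = [f for f in contour_hz if f]  # keep voiced frames only
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    mean_fo = mean(voiced)
    # foCV = fo s.d. / fo mean; the sample s.d. (n - 1) is an assumption.
    fo_cv = stdev(voiced) / mean_fo
    return mean_fo, fo_cv


# Hypothetical contour sampled every 0.01 s; values are illustrative only.
contour = [0, 118.0, 120.0, 122.0, None, 121.0, 119.0, 0]
mean_fo, fo_cv = summarize_fo(contour)
print(f"mean fo = {mean_fo:.1f} Hz, foCV = {fo_cv:.3f}")
```

Because foCV divides the standard deviation by the mean, it is unitless, which is what allows pitch variability to be compared between low-pitched and high-pitched vocalizers.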
3. Results
Corroborating previous meta-analyses [33], we first confirmed that mean fo (averaged across speech types within individuals) did not explain a significant amount of variance in the heights of men (5%) nor women (3%; figure 2b), and thus we did not control for vocalizer height in further analyses. We did observe a known small and gradual decrease in women's but not men's mean fo with age (figure 2b), which may be attributed to a number of biological factors (see e.g. [34,35]). Nevertheless, controlling for vocalizer age in our models, the results of which are given below, did not significantly affect the strength of inter-individual fo relationships across speech types (see electronic supplementary material, table S2).
Figure 2. Inter-individual differences in voice pitch are strongly preserved across different speech utterances. (a) Correlation matrix comparing mean fo between all speech types for female (n = 83, orange circles) and male (n = 71, blue squares) vocalizers (Pearson's r, two-tailed, alpha 0.05). Mean fo plotted in Hz on both axes. All p < 3.17 × 10−11 following Benjamini–Hochberg correction. (b) Correlations between mean fo and vocalizer age or height for female (n = 83, orange circles) and male (n = 71, blue squares) vocalizers (Pearson's r, two-tailed, alpha 0.05), where only age predicted fo, and only in women (**p < 0.01). For all regressions, variance explained (R2) between two speech types is given beside the regression line, followed by 95% CI derived from bootstrapping based on 1000 samples.
An omnibus linear mixed model (LMM) fitted by restricted maximum-likelihood estimation was then used to test for differences in mean fo across speech types. Speech type and sex of vocalizer were entered as fixed variables, and vocalizer identity and age as random variables with random intercept. The omnibus model showed significant effects of speech type (F5,760 = 15.7, p < 0.001) and sex (F1,148.1 = 689.5, p < 0.001) on mean fo, the latter owing to strong sexual dimorphism in voice pitch between adult men (average fo 121 Hz) and women (207 Hz), but no interaction (F5,760 = 1.5, p = 0.194). We thus conducted analogous LMMs separately for each sex. These models confirmed that men's mean fo (F5,350 = 8.1, p < 0.001) and women's mean fo (F5,410 = 9.7, p < 0.001) both varied systematically across speech types.
Pairwise tests with Šidák correction for multiple comparisons revealed that, in both sexes, counting was characterized by the lowest pitch (male fo 116.4 Hz, female 200.2 Hz) and greetings by the highest pitch (male fo 124.5 Hz, female 212.6 Hz), with intermediate differences among other speech types (figure 1b). However, as further illustrated in figure 1b, although fo differed significantly between several speech types by 4 to 12 Hz (about one or two times the just-noticeable difference in pitch perception from modal speech [18]), these differences were nevertheless small relative to the much larger inter-individual variability in fo observed across vocalizers within each speech type. For instance, whereas women on average spoke with a voice pitch 12.4 Hz higher in greetings than while counting, there was a 132 Hz difference between the lowest-pitched woman (139.6 Hz) and highest-pitched woman (271.4 Hz) within the greeting category itself. Notably, there were no differences in mean fo between reading a paragraph and producing free speech, nor between single words and a series of words (counting from 1 to 10; figure 1b).
Analogous LMMs showed that voice pitch variability (foCV) also varied across speech types in men (F5,348.2 = 8.7, p < 0.001) and women (F5,411.8 = 45.5, p < 0.001); however, pairwise tests showed that this effect was largely driven by low pitch variability in single word utterances (see electronic supplementary material, table S2; figure 1a). While women spoke with a more dynamic pitch during greetings, pitch variability did not differ substantially among all other speech types, particularly among men, for whom vowels, counting, greeting, read paragraph and free speech were all characterized by similar pitch variability (electronic supplementary material, table S3).
To test our key hypothesis that the large inter-individual differences in mean fo observed within each speech type are preserved across speech types, such that individuals who produce the lowest/highest-pitched speech in one category likewise produce the lowest/highest-pitched speech in all other categories, we conducted a series of simple two-tailed regressions (Pearson's correlations, r). As illustrated in figure 2a, inter-individual differences in fo were indeed strongly preserved across all speech types. In both sexes, the strength of bivariate relationships between values of fo measured from two different speech types exceeded r = 0.65 in all cases and reached r = 0.90 (all p < 3.17 × 10−11; 95% bootstrapping confidence interval (CI) bounds ranging from r = 0.45 to 0.94; figure 2a). All correlations remained highly significant following Benjamini–Hochberg correction for multiple comparisons ([36], where m = 15 comparisons per sex, q = 0.05), and when controlling for vocalizer age (electronic supplementary material, table S2). These robust relationships indicate that typically more than half and as much as 80% of the variance in inter-individual fo measured from a given speech utterance could be explained by the fo measured from any other utterance, within the same sample of vocalizers (figure 2a). While effect sizes were uniformly strong, the read paragraph explained the most variance in the fo of other speech types for both sexes (63–81%), whereas free spontaneous speech explained the least variance, particularly in word, vowel and greeting fo (42–60%, figure 2a). Our LMM results suggest this latter result is not likely due to differences in pitch variability, as foCV did not differ between free speech and greetings or vowels in either sex (electronic supplementary material, table S1).
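The link between the correlation coefficients and the percentages of variance explained reported above (R2 = r squared, so r = 0.65 corresponds to roughly 42% and r = 0.90 to 81%), together with the Benjamini–Hochberg step-up rule for m = 15 comparisons, can be sketched in plain Python as follows. This is an illustrative sketch only, not the analysis code used in the study: the function names `pearson_r` and `benjamini_hochberg`, the five-speaker fo values and the list of p-values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("samples must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True if rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical mean fo (Hz) of five speakers in two speech types.
counting = [105.0, 112.0, 118.0, 125.0, 140.0]
greeting = [112.0, 118.0, 129.0, 131.0, 150.0]
r = pearson_r(counting, greeting)
print(f"r = {r:.2f}, variance explained = {100 * r ** 2:.0f}%")

# Variance explained at the reported bounds of r.
for r_reported in (0.65, 0.90):
    print(f"r = {r_reported:.2f} -> R2 = {r_reported ** 2:.2f}")

# Fifteen hypothetical p-values, one per pair of speech types within a sex.
p_values = [3.17e-11] * 15
print(all(benjamini_hochberg(p_values, q=0.05)))
```

With p-values this small, every comparison falls below even the strictest Benjamini–Hochberg threshold (q/m, about 0.0033 for m = 15), which is why the correction leaves all correlations significant.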
4. Discussion
We show that inter-individual differences in mean voice pitch (fo) can be reliably and robustly measured from a variety of affectively neutral speech utterances including a single word, a series of vowels, counting, a short greeting, a longer scripted paragraph or several sentences of spontaneous free speech. Despite differences in linguistic content and duration (500 ms to 20 s) and minor differences in speech variability (foCV) across these speech types, mean voice pitch measured from any of these utterances strongly predicted the pitch of every other speech utterance produced by the same individuals, explaining as much as 80% of the variance. These results suggest that studies on human voice pitch are likely to produce comparable results whether fo measures are obtained from short, long, scripted or spontaneous speech. The results also support the validity of longitudinal analyses of mean fo measured from the same individuals at different time points, often from different speech utterances (e.g. [19,20,26]). It is important to underscore that while this finding may allow a certain flexibility in the kinds of voice stimuli researchers can use to study between-individual differences in voice pitch, many other biologically relevant acoustic parameters cannot be compared between different kinds of voice stimuli, most notably formant frequencies.
Beyond these methodological implications, our results corroborate a growing number of studies showing that individual differences in voice pitch emerge early in life and are remarkably stable across an individual's lifetime [19,20], across diverse neutral speech utterances (this study), and even when comparing neutral speech with singing [37], with emotional speech [29,38] or with volitional nonverbal vocalizations such as screams and aggressive roars [29]. Thus, while the present study focuses on affectively neutral speech, past studies provide further evidence that between-person differences in voice pitch also generalize to emotional voice stimuli and remain stable as people age. This has theoretical implications for our understanding of the functions of voice pitch, a remarkably information-rich social and biological signal with clear evolutionary underpinnings [4,5] and real-life predictive power [9,10,26].
Stability in individual differences in voice pitch may also help to explain how human listeners can recognize vocalizers, even from extremely high-pitched volitional screams [39]. However, although our earlier work has shown that inter-individual differences in mean fo are preserved in emotional speech and vocalizations [29], the relationships are considerably less robust than those observed among modal speech types in the present study. Indeed, playback experiments have found that identity recognition is likewise degraded for emotional vocalizations [38,40], particularly authentic vocalizations such as spontaneous laughs, compared with volitional (acted) laughs [40], the two being characterized by different pitch profiles [41]. Of course, recognizing speaker identity from the voice relies on much more than source characteristics (e.g. formant frequencies and temporal patterns also play a role [42]), or any single acoustic parameter [12]; however, that research does raise the possibility that inter-individual differences in fo may be less preserved in spontaneous than volitional vocal signals, a key prediction to test in future work. Moreover, while we show here that fo is preserved between scripted and spontaneous speech, researchers may also test the extent to which this stability generalizes to longer more naturalistic bouts of conversational speech produced in real-life contexts.
Ethics
The study was performed in accordance with the American Psychological Association's ethical standards for the treatment of human participants, including obtaining informed consent from all participants, and was approved by the Ethical Committee of the Institute of Psychology, University of Wroclaw (project 2016/23/B/HS6/00771).
Data accessibility
The data are provided in the electronic supplementary material.
Authors' contributions
P.S., K.P. and A.G.-B. conceptualized, designed and conducted the study. K.P. performed statistical analyses, wrote the original manuscript and created figures and tables. All authors revised the manuscript, approved the final version, and agree to be accountable for all aspects of the work.
Competing interests
We declare we have no competing interests.
Funding
This work was supported by a grant from the National Science Centre (grant no. 2016/23/B/HS6/00771). A.G.-B. was supported by the Foundation for Polish Science (FNP).
Acknowledgements
The authors thank Katarzyna Bugaj and Tomasz Frackowiak for assisting with voice recordings.