Orangutan information broadcast via consonant-like and vowel-like calls breaches mathematical models of linguistic evolution

The origin of language is one of the most significant evolutionary milestones of life on Earth, yet it remains one of the most enduring scientific unknowns. Two decades ago, game theorists and mathematicians predicted that the first words and grammar emerged as a response to transmission errors and information loss in language's precursor system; however, empirical proof is lacking. Here, we assessed information loss in proto-consonants and proto-vowels in human pre-linguistic ancestors as proxied by orangutan consonant-like and vowel-like calls that compose syllable-like combinations. We played back and re-recorded calls at increasing distances across a structurally complex habitat (i.e. adverse to sound transmission). Consonant-like and vowel-like calls degraded acoustically over distance, but no information loss was detected regarding three distinct classes of information (viz. individual ID, context and population ID). Our results refute prevailing mathematical predictions and herald a turning point in language evolution theory and heuristics. Namely, explaining how the vocal–verbal continuum was crossed in the hominid family will benefit from future mathematical and computational models that, in order to enjoy empirical validity and superior explanatory power, will be informed by great ape behaviour and repertoire.

have predicted that the first words and grammatical rules emerged to minimize error and information loss in language's precursor channel. Regarding word origin, this argument asserts that the lengthier a signal combination, the lower the probability of mistaking signals for each other. Regarding syntax origin, it asserts that the more varied a sequence of signal combinations, the lower the probability of mistaking the events being referred to, with words and syntax having, thus, developed in the human lineage to decrease transmission errors. Without basic knowledge about the communication channel used by our ancestors to broadcast information and its 'error limit' [21][22][23], it is impossible, however, to validate these models or their proposed evolutionary scenario.
Human evolution unfolded in parallel with acute climate and ecological changes in the African continent [24]; however, it is unclear when and where the first forms of language manifested among human ancestors. Regardless of whether proto-language originated in the rainforest, woodland or savannah, the hypothesis that the first linguistic structures emerged to avert error can be best tested in forested habitats, which pose the most adverse conditions to sound transmission, and thus, where signal and information limits can be assessed.
To implement an empirical test of the currently prevailing mathematical models of linguistic evolution, we assessed information loss in wild orangutan voiceless consonant-like and voiced vowel-like calls [7]. These calls exhibit articulatory homology with their human counterparts and therefore represent living proxies of spoken language's putative prelinguistic units [25][26][27]. Namely, we played back consonant-like 'kiss-squeaks' and vowel-like 'grumphs' [28] and re-recorded these calls at increasing distances. Critically, bar humans, orangutans are the only known great ape to produce consonant-like and vowel-like calls combined into syllable-like combinations [29], thereby presenting a privileged hominid model for this study [30].

Material and methods
(a) In brief
Calls were originally recorded from wild orangutan individuals across contexts and populations of Sumatran (Pongo abelii) and Bornean orangutans (Pongo pygmaeus). Only consonant- and vowel-like calls that were from the same syllable-like combination were used for playback. We extracted four acoustic parameters over distance. We used individual, contextual and geographical acoustic signatures [25] to assess information loss. This set-up mimicked the putative proto-combinatoric conditions at the moment of language origin. Methodologically, this allowed us to control for biasing factors between consonant- and vowel-like calls (e.g. individuals, context, recording settings).

(b) Study site
Playback experiments were conducted at the Sikundur Research Station (3°55′48.07″ N; 98°2′31.17″ E), Leuser Ecosystem, North Sumatra, Indonesia. The Sikundur forest is located on the eastern forest margin of the Alas River dividing the Leuser Ecosystem along its north-south axis and constituting a major dispersal barrier for orangutans at this altitude [31]. Presently, the forest is a dipterocarp tropical rainforest, comprising disturbed primary forest and secondary/regrowth forest, which was the target of previous logging operations (between 1970 and 1980, and later during the 1990s [32]).

(c) Data collection
Recordings for the playback playlist were previously collected at three research stations: Tuanan and Gunung Palung (Central and West Kalimantan, respectively, Indonesian Borneo) and Sampan Getek (North Sumatra, Indonesia). The playback playlist included 120, 118 and 249 calls to assess individual ID, context and population ID information, respectively (see more in electronic supplementary material). Orangutan kiss-squeaks [28] were used as living proxies of voiceless proto-consonants, orangutan grumphs [28] as living proxies of voiced proto-vowels.
All kiss-squeaks and grumphs were selected from call combinations composed of the two calls, specifically kiss-squeak + grumph (see 'Data analyses' (§2e) and electronic supplementary material). All recordings were set to the same peak amplitude prior to playback using Raven interactive sound analysis (v. 1.2.1, Cornell Lab of Ornithology, Ithaca, New York). No further signal transformations were conducted.
Playbacks were conducted using a Marantz Digital Recorder PMD-660 (D&M Holdings, Kawasaki, Japan) connected to a Nagra DSM speaker (Audio Technology Switzerland S.A., Romanel, Switzerland). The speaker was set at 1-1.5 m from the ground. Because Sikundur is partially a regrowth/secondary forest, with abundant undergrowth below the understorey, this height offered a suitable means to explore the effects of complex habitat structure on broadcast performance. Playback volume was set at approximately 100 dB SPL at 1 m distance to facilitate assessment of sound degradation over distance and was not meant to emulate orangutan natural vocal loudness. Playbacks were conducted between 5.30 and 6.30 local time in the absence of wind and with no rain during the previous 48 h. This time was elected for playbacks because, in this habitat, early mornings were the time of day with the least biotic noise. We made no presumptions as to whether early human ancestors communicated predominantly at this time. All recordings along the same transect were conducted in the same morning.
Playbacks were conducted at two locations (i.e. along two transects), once at each location. Re-recordings were conducted every 25 m along the two transects across the forest up to 100 m away, at which point playbacks became too faint to be analysed. Transects started within 10 m of each other and diverged obliquely from one another. Using different transects allowed us to assess the impact of particular physical features (e.g. larger tree trunks, leaf density) on broadcast performance. Transects were straight, flat and included no obvious canopy openings or clearings. Playbacks were re-recorded using a ZOOM H4next Handy Recorder (ZOOM Corporation, Tokyo, Japan) connected to a RØDE NTG-2 directional microphone (RØDE LLC, Sydney, Australia). Audio data were recorded using the WAVE PCM format at 16 bits. The microphone was set at 1-1.5 m from the ground. Data for distance zero were extracted from the original playback recordings. In total, 7826 calls (incl. original at 0 m and re-recordings up to 100 m) were collected (see electronic supplementary material for sample breakdown). For each transect, three playback sessions were conducted, one for each information type: one playlist comprised recordings varying in individual subjects, another in context and another in population.

(d) Data measurements
We manually measured four acoustic parameters from all calls in the spectrogram window of Raven interactive sound analysis (v. 1.2.1, Cornell Lab of Ornithology, Ithaca, New York; window type: Hann; 3 dB filter bandwidth: 124 Hz; grid frequency resolution: 2.69 Hz; grid time resolution: 256 samples): duration (s), maximum frequency (Hz), maximum power (uncalibrated dB) and maximum time. Duration was the time difference between call end and onset. Maximum frequency was the frequency with maximum energy (i.e. power, dB) in a call. Maximum power was the power of the maximum frequency. Maximum time was the moment when the maximum power occurred, expressed as a proportion of the total duration of a call (e.g. max. time = 0.5 means it occurred half-way through the call's duration). These parameters have been found to be strong descriptors of orangutan calls and their informational content [25,28,33]. Critically, they were extractable from both consonant- and vowel-like calls, enabling direct comparison of acoustic and information broadcast performance between the two call categories.
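The four definitions above can be illustrated with a minimal Python sketch. It is not the authors' procedure, since measurements were taken manually in Raven; it assumes a spectrogram already exported as a grid of power values, and the function name `measure_call` and all toy values are hypothetical.

```python
def measure_call(spec_db, times, freqs, onset, offset):
    """Return the four parameters for one call.

    spec_db[i][j] is power (uncalibrated dB) at time frame i and frequency
    bin j; times[i] and freqs[j] give the frame time (s) and bin frequency (Hz).
    onset/offset delimit the call (hypothetical, hand-picked selection bounds).
    """
    frames = [i for i, t in enumerate(times) if onset <= t <= offset]
    if not frames or offset <= onset:
        raise ValueError("selection contains no spectrogram frames")
    # Locate the spectrogram cell with the highest power inside the call;
    # max() returns the earliest such cell if several are tied.
    cells = ((i, j) for i in frames for j in range(len(freqs)))
    best_i, best_j = max(cells, key=lambda ij: spec_db[ij[0]][ij[1]])
    duration = offset - onset
    return {
        "duration_s": duration,
        "max_frequency_hz": freqs[best_j],
        "max_power_db": spec_db[best_i][best_j],
        # Proportion of the call elapsed when maximum power occurs (0 to 1).
        "max_time": (times[best_i] - onset) / duration,
    }


# Toy example (made-up values): a 0.4 s call whose peak sits half-way through.
toy_times = [0.0, 0.1, 0.2, 0.3, 0.4]
toy_freqs = [250.0, 1000.0, 4000.0]
toy_spec = [[-60.0, -60.0, -60.0] for _ in toy_times]
toy_spec[2][2] = -10.0
toy_measures = measure_call(toy_spec, toy_times, toy_freqs, 0.0, 0.4)
print(toy_measures)
```

Expressing maximum time as a proportion rather than in seconds makes it comparable across calls of different duration, which is why it can be measured on both call types.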

(e) Data analyses-acoustic performance
To assess acoustic broadcast performance during transmission, linear mixed models (LMMs) (model type: III sum of squares; test model terms: Satterthwaite, using restricted maximum-likelihood) were conducted using JASP [34] (v. 0.14.1). One model was generated per acoustic parameter (×4) per call type (×2), with a total of eight models. Per model, the acoustic parameter was inserted as dependent variable (N = 3560 per call type). Distance (treated as ordinal: 0, 25, 50, 75, 100 m), transect (two levels), context (three levels: towards human observers, tiger-patterned predator-model, plain-white predator-model) [29] and population (three levels: Tuanan, Gunung Palung, Sampan Getek) were inserted as fixed effect variables. Individual (20 levels) and call number (N = 249 per call type) were inserted as random effects, since some calls were re-used for different playbacks and several calls came from the same individual. Random slopes for distance and transect were allowed to vary per individual. No explicit indication of nested variables (e.g. individual within population) was provided since this is automatically identified by the model (see [25] and electronic supplementary material).

(f ) Data analyses-information performance
To assess information broadcast performance, we conducted discriminant function analyses (DFAs) per distance [33]. All analyses were based on the four measured acoustic parameters simultaneously. Six analyses were conducted to test information content (×3; individual ID, context, population ID) for each call type (×2). LMM results indicated that 'transect' had a significant effect on acoustic performance over distance; hence, all (p)DFA analyses were conducted using one transect only. We conducted DFAs with leave-one-out procedure using SPSS (IBM SPSS Statistics, v. 27; electronic supplementary material) to assess information content about individual ID (same context used across individuals). To assess information content about context and population, we performed permuted DFAs (pDFAs) with cross-classification [35]: crossed pDFA for context (to control for individual variation) and nested pDFA for population (individual variation nested within population; electronic supplementary material). pDFAs were conducted in R [36] with MASS [37] and using a function provided by Mundry & Sommer [35]. Because crossed pDFAs do not tolerate null data, only three individuals with calls in all contexts were included. Figures were prepared using ggplot2 [38] and gridExtra [39]. A script example was: pdfa.res = pDFA.crossed(test.fac = 'Context', contr.fac = 'Individual', variables = c('Duration', 'Max frequency', 'Max time', 'Max power'), n.to.sel = NULL, n.sel = 100, n.perm = 1000, pdfa.data = test.data).
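The logic of leave-one-out classification and of permutation testing can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a reimplementation of the SPSS or R analyses: a nearest-centroid classifier stands in for the discriminant functions, labels are shuffled freely (so the crossed and nested designs handled by the Mundry & Sommer function are not respected), and all function names and data are hypothetical.

```python
import random


def nearest_centroid_predict(train_X, train_y, x):
    """Assign x to the class whose mean feature vector is closest (stand-in for a DFA)."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, value in enumerate(row):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((x[k] - sums[label][k] / counts[label]) ** 2 for k in range(len(x)))

    return min(sorted(sums), key=squared_distance)


def loo_accuracy(X, y):
    """Leave-one-out: classify each call using a model built without that call."""
    hits = 0
    for i in range(len(X)):
        predicted = nearest_centroid_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i])
        hits += predicted == y[i]
    return hits / len(X)


def permutation_p_value(X, y, n_perm=1000, seed=0):
    """Share of label shuffles that classify at least as well as the real labels."""
    rng = random.Random(seed)
    observed = loo_accuracy(X, y)
    shuffled = list(y)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if loo_accuracy(X, shuffled) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


# Made-up, well-separated two-feature data for two hypothetical individuals.
# Real parameters (s, Hz, dB) would need standardising before distances are compared.
toy_X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],
         [5.0, 5.1], [5.1, 5.0], [5.2, 5.1], [5.1, 5.2]]
toy_y = ["A"] * 4 + ["B"] * 4
toy_accuracy, toy_p = permutation_p_value(toy_X, toy_y, n_perm=200, seed=1)
print(toy_accuracy, toy_p)
```

Comparing the observed accuracy against accuracies obtained with shuffled labels gives an empirical chance level, which is the sense in which classification can be said to be above chance.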

Results
(a) Acoustic performance over distance
Consonant-like and vowel-like call acoustic parameters changed significantly during transmission (table 1 and figure 1, electronic supplementary material). This was expected since different parameters interact differentially with the environment (e.g. max. power declines over distance following the general inverse square law of sound attenuation). Several significant differences were found between transects (electronic supplementary material), confirming that acoustic performance was (partly) dictated by the physical structure of the transmission channel. The context had a significant effect on the acoustic performance of some parameters (electronic supplementary material). Given that both call types are known to exhibit marked contextual variation [25], this shows that the acoustic features of different contextual subtypes affect how their transmission plays out. For both consonant-like and vowel-like calls, population had a significant effect on some acoustic parameters (electronic supplementary material), suggesting that geographical accents [25] may endow calls with better transmission properties. Given that forest structure is no longer pristine across virtually all orangutan sites, it is unclear whether these gains can be attributed to adaptive selection in some populations.
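As a worked illustration of the inverse square law mentioned above, the following Python sketch computes the level expected from geometric (spherical) spreading alone, starting from the nominal playback level of approximately 100 dB SPL at 1 m. It is an idealised free-field calculation: excess attenuation from vegetation, ground effects and atmospheric absorption is deliberately left out, and the function name is hypothetical.

```python
import math


def spl_at_distance(spl_ref_db, distance_m, ref_distance_m=1.0):
    """Level after spherical spreading only: a drop of 20*log10(r / r_ref) dB."""
    return spl_ref_db - 20.0 * math.log10(distance_m / ref_distance_m)


# Re-recording distances used along the transects (0 m is the original recording).
for distance in (25, 50, 75, 100):
    print(distance, "m:", round(spl_at_distance(100.0, distance), 1), "dB")
# 25 m: 72.0 dB, 50 m: 66.0 dB, 75 m: 62.5 dB, 100 m: 60.0 dB
```

Under this idealisation each doubling of distance costs about 6 dB, so any additional loss measured in the forest reflects the structure of the habitat and not distance alone.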

(b) Information performance over distance
Despite poor acoustic performance, informational performance of consonant- and vowel-like calls was not affected during transmission (figure 2). Both call categories allowed correct assessment of information about individual identity, context and population well above chance levels (figure 2). Information loss was only observed for individual identity when transmitted by vowel-like calls; however, this effect was only observed when computing a leave-one-out DFA procedure (a more stringent model) and information performance remained overall above chance (table 2; electronic supplementary material). Information performance was equivalent between consonant- and vowel-like calls; their trend lines remained relatively parallel over distance (figure 2).
Consonant-like calls tended to exhibit higher percentage of correct assignments, suggesting heavier information load (figure 2).
[Figure caption, beginning missing] ... and acoustic performance during transmission (c-f), based on raw data. uncal., uncalibrated. Box plots represent median and 25-75% interquartile range; whiskers represent lowest/highest value within 1.5 times interquartile range below/above; outliers omitted for clarity. Linear trend lines represented across distance are for visual aid only (based on raw data). *p < 0.001 (LMM ANOVA; see table 1).
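Table 2 summarises information performance over distance with Spearman's correlation across the five recording distances (n = 5). A minimal standard-library Python sketch of that statistic is given below; the classification percentages are invented for illustration and are not values from this study, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation computed on ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


distances_m = [0, 25, 50, 75, 100]
# Hypothetical % of correctly classified calls at each distance (not study data).
toy_percent_correct = [80.0, 78.0, 79.0, 75.0, 74.0]
toy_rho = spearman_rho(distances_m, toy_percent_correct)
print(round(toy_rho, 3))  # -0.9
```

A rho near zero would indicate that classification success does not change systematically with distance, whereas a strongly negative rho would indicate information loss.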

Discussion
We found no evidence for information loss in the only nonhuman living hominid that combines consonant-like and vowel-like calls to produce syllable-like combinations. Information content remained uncompromised until either call type became inaudible, indicating that homologous proto-linguistic units would have remained functionally discriminable as long as they could be heard. Results refute, therefore, mathematical predictions for linguistic evolution.
Orangutan consonant-like calls exhibited extreme spectral differences compared with their vowel-like counterparts (i.e. frequency centred at approx. 4000 versus 250 Hz, respectively, figure 1a,d). However, both can be information-dense [25] and their information performance was equivalent. This suggests that similar results would have been likely if other nonhuman hominid consonant- and vowel-like calls had been selected. Our analyses covered a wide frequency band wherein the actual (but now extinct) proto-linguistic units of language probably lay.
Information loss was assessed by measuring calls' biometric information content (i.e. about individual ID, context and population ID). There is no evidence that other types of informational content (e.g. culturally conventionalized arbitrary information, such as a word's meaning) transmit differently via the same acoustic signals. Some orangutan consonant-like calls exhibit arbitrary function [40] and other great ape consonant-like and vowel-like calls are transmitted culturally [7,10,11,41–46]. Thus, these calls are not inescapably limited to the transmission of biometric information, even though this was the information used for our empirical validation.
Table 2. Information performance over distance: Spearman's correlation summary (n = 5). norm: correlation based on % correctly classified selected cases using DFA; L1out: correlation based on % correctly cross-classified using DFA with leave-one-out procedure; selec.: correlation based on % correctly classified selected cases using pDFA; cross: correlation based on % correctly cross-classified cases using pDFA. Italic type indicates p < 0.05.
Findings offer three insights into language origin and linguistic evolution. First, proto-consonants and -vowels encoded ample information [25] and were resilient against information loss up to 100 m distance across channels adverse to signal transmission. Second, the structural complexity of our first linguistic ancestors' habitat was an unlikely source of transmission error and information loss. Palaeo-climate change across African habitats brought about major habitat structural changes, and with them, new soundscapes. Open habitats (e.g. savannah) offer few physical obstructions to signal transmission; thus, ecological changes happening across Africa are predicted to have diminished channel noise in language's precursor system, not the opposite. Systematic assessment will be required for conclusive resolution.
Third, mathematical and computational approaches to language evolution have not, thus far, explicitly or implicitly modelled hominid behaviour. Theoretically, current models could apply to any communication system transitioning to a combinatorial state, not necessarily within the hominid family. The fact that language emerged in the human clade, and in no other, thus implies that 'being a hominid' cannot be discounted from theoretical efforts that might enlighten us as to how linguistic evolution ensued from the repertoire of an ape-like ancestor [47]. While current models assuredly encapsulate a possible evolutionary scenario, this was likely not the one that catalysed language. The most beneficial future theoretical models will be those that conform with, and factor in, the (consonant-vowel-based) combinatorics shared between great apes and humans.
Ethics. As this project did not involve living animals, no ethical considerations apply. No animal observation, handling, contact or interaction took place during this study. Research at the station is managed by the Sumatran Orangutan Conservation Programme (SOCP) PanEco Foundation. The study was performed in agreement with regulations and permissions from the relevant Indonesian authorities.
Data accessibility. All data needed to evaluate the conclusions in the paper are present in the paper and/or the electronic supplementary material.
Authors' contributions. A.A. and R.V. conducted experiments and acoustic measures and drafted the manuscript. M.G. conducted analyses and wrote the paper. M.G.N. and S.W. provided materials, conducted analyses and wrote the paper. A.R.L. conceived the study, provided materials, conducted analyses and wrote the paper. All authors approved the final version of the manuscript and are accountable for the content.