Abstract
Phonology and syntax represent two layers of sound combination central to language's expressive power. Comparative animal studies represent one approach to understanding the origins of these combinatorial layers. Traditionally, phonology, where meaningless sounds form words, has been considered a simpler form of combination than syntax, and thus expected to be more common in animals. A linguistically informed review of animal call sequences demonstrates that phonology in animal vocal systems is rare, whereas syntax is more widespread. In the light of this and the absence of phonology in some languages, we hypothesize that syntax, present in all languages, evolved before phonology.
1. Introduction
Human language and its origins have intrigued philosophers and scientists since early antiquity [1]. This is unsurprising, as language is responsible for much that distinguishes humans from other species and makes us so successful, including the transmission of knowledge [2–5]. Unfortunately, the search for the origins of language is complicated by the fact that language, unlike other biological traits, does not fossilize or leave any traces from which its cumulative evolution could be studied. Empirical studies must therefore circumvent this problem, and a variety of approaches have been taken to unpack the evolution of language [6–8]. These include, among others, the study of child language acquisition [9], hominid morphology [10–12], genetics [13] and the use of computer simulations to test specific hypotheses [14–16].
One method that has received particular attention is the exploration of similarities and differences between human language and animal communication systems [6,17]. If similarities are found between humans and a closely related species, then it is possible that they are derived from the same feature present in their common ancestor, representing homologues [18]. If, on the other hand, similarities are found between humans and more distantly related species, these features represent analogues and hence do not give any information on the phylogenetic origins of the feature, but can help elucidate the environmental or social conditions favourable to its convergent evolution [4].
One particular feature of human language that has received considerable attention from both linguists and animal communication researchers, and been highlighted as a ‘fundamental universal structural characteristic’ [19], is duality of patterning [20,21]. Also known as double articulation [22], duality of patterning is a property of language that allows a combinatorial structure on two levels: (i) phonology, where meaningless sounds called phonemes (i.e. the smallest meaning-differentiating elements of a language that do not themselves have meaning) can be combined into morphemes (i.e. the smallest meaningful elements) and words; and (ii) syntax, in which these morphemes and words can be combined into larger structures [23]. Critically, duality of patterning is the property that allows human languages to create a large lexicon from a few distinct signals [21,24–26]. Unpacking the evolutionary route that led to duality of patterning is thus considered central to a more holistic understanding of language evolution.
Researchers of vocal communication in animals have emphasized the fact that animals are also capable of forming different types of sound combinations that could potentially be analogous or homologous to one or both levels of duality of patterning found in human languages [27–31]. Peter Marler played an important role in establishing the link between the levels of patterning found in human language and the different types of call combinations found in animal communication by introducing the terms phonological and lexical syntax, loosely based on the two levels of duality of patterning [32]. Marler defined phonological syntax (or phonocoding) as the level at which meaningless sounds are combined into sequences, and lexical syntax (or lexicoding) as the higher level at which the meaningful elements are combined. More recently, Hurford has used the terms combinatorial syntax (or combinatoriality) and compositional syntax (or compositionality) to designate the same phenomena as phonological and lexical syntax, respectively [26] (see table 1 for the terms and definitions of sound combinations used in animal communication research and their linguistic equivalents). Our goal here is to examine several examples of animal call combinations from a linguistic perspective and determine which level of duality of patterning they most resemble.
2. Examples of combinations in animal communication systems
(a) Winter wrens: phonological syntax?
Some of the best-studied examples of animal sound combinations come from birdsong [33]. One classic example of phonological syntax noted by Marler is the song of the winter wren (Troglodytes troglodytes) [32]. Kroodsma & Momose [34] describe the songs of a Japanese population of winter wrens whose song types consist of a highly predictable sequence of notes or syllable types (a note being a continuous trace on a sonogram and a syllable being a repeated unit of identical notes or groups of notes). In their study population, the typical repertoire for a male includes six or seven song types. These different song types are obtained by reusing many of the same syllables or syllable sequences in a different order. However, as Marler noted, these syllables do not differentiate the song types from one another. In fact, all six or seven song types in a male wren's repertoire convey the same ‘message’ and none of them have any referential meaning [32]. Therefore, while superficially there seem to be structural similarities between bird song and human phonology, there are important differences when it comes to meaning differentiation. For the wren's song to have phonology in the linguistic sense, the different order of syllables in the different song types would have to bring about a change in meaning between the song types, just as in English pat and tap differ in meaning but are made up of the same sounds in a different order. Because of this, the structure of the wren's song (and that of most other bird and whale songs) would be better described not as phonological syntax but as phonetic patterning. Phonetics describes the physical properties of sound and, unlike phonology, it does not presuppose that sound patterns carry any function that serves to differentiate meanings.
Despite these critical differences, the search for comparative examples of phonology in animal communication has, in a similar way to Marler, continued to focus on bird [35,36] or whale song [37,38]. However, a more phoneme-focused approach could be taken by searching for the use and comprehension of minimal pairs (pairs of meaningful signs or words distinguished by only one element drawn from a finite list; such as tap versus lap in English) in animal communication systems [39].
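The minimal-pair criterion described above lends itself to a simple mechanical search. The following is a minimal sketch, assuming a lexicon represented as a mapping from element sequences to meaning labels; the function names, the toy lexicon and the restriction to single-element substitutions in equal-length forms are illustrative assumptions, not a procedure taken from the studies cited.

```python
from itertools import combinations


def is_minimal_pair(form_a, form_b):
    """True if two equal-length element sequences differ in exactly one position."""
    if len(form_a) != len(form_b):
        return False
    return sum(x != y for x, y in zip(form_a, form_b)) == 1


def find_minimal_pairs(lexicon):
    """Return pairs of forms that differ in one element AND in meaning.

    `lexicon` maps a tuple of elements (e.g. sounds) to a meaning label.
    """
    pairs = []
    # Sorting makes the output order deterministic.
    for (form_a, meaning_a), (form_b, meaning_b) in combinations(sorted(lexicon.items()), 2):
        if meaning_a != meaning_b and is_minimal_pair(form_a, form_b):
            pairs.append((form_a, form_b))
    return pairs


# Toy lexicon: English words spelled as letter tuples (a simplification).
toy_lexicon = {
    ("t", "a", "p"): "tap",
    ("l", "a", "p"): "lap",
    ("p", "a", "t"): "pat",
}
minimal_pairs = find_minimal_pairs(toy_lexicon)
```

In this toy lexicon only tap and lap qualify; pat and tap reuse the same elements in a different order but differ in two positions. For an animal system, the meaning labels would have to come from behavioural evidence such as playback responses, which the sketch simply takes as given.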
(b) Campbell monkeys: lexical syntax
Both Marler and Hurford argue that lexical syntax is only found in human language [26,32]. However, at least one example of call combination in an animal communication system could correspond to its definition. This is the use of an affixation system by Campbell monkeys (Cercopithecus campbelli campbelli) [40]. Campbell monkeys have two main predators: leopards (Panthera pardus) and crowned eagles (Stephanoaetus coronatus). The Campbell monkeys give a ‘krak’ call when they detect a leopard and a ‘hok’ call when they detect a crowned eagle. They can also add an affix ‘-oo’ to both of these calls to produce two new calls: ‘krak-oo’ and ‘hok-oo’. The ‘krak-oo’ call is given to any general disturbance and the ‘hok-oo’ call is given to any disturbance in the canopy. The critical aspect here is that the same ‘-oo’ is affixed to both calls (‘krak’ and ‘hok’). It is this use of the same elements, with the same meanings, in different sequences, that makes them compositional rather than combinatorial. The affixation modifies the meaning of the stem calls in a predictable way: changing a call designating a specific predator into a call designating a less specific disturbance in the same general physical space. Perhaps the closest language analogy would be the suffix ‘-like’, changing the meaning of the call from ‘leopard’ to ‘leopard-like (disturbance)’. The meaning of this suffix is fairly abstract: it does not refer to a concrete entity of its own, but directs the hearer to imagine a general situation that is disturbing in a similar way to the presence of a predator yet is not as dangerous as a real appearance of the predator. Abstract meaning operators of this kind are ubiquitous in human languages. Here, Campbell monkeys put together elements that conserve their meaning no matter what sequence they are part of, and obtain assemblies whose meaning reflects the meaning of their parts. 
This fits Hurford's definition of compositional syntax [26] and so deserves the name syntax, even if it is only a very rudimentary one.
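The compositional character of this affixation can be made explicit with a small sketch. It assumes, purely for illustration, that each stem meaning can be written as a (referent, scope) pair and that ‘-oo’ is a function replacing the specific predator with a generic disturbance while keeping the scope; the data layout, the scope labels and the function names are hypothetical.

```python
# Stem meanings as (referent, scope); the scope labels are our simplification
# of "general disturbance" versus "disturbance in the canopy".
STEM_MEANINGS = {
    "krak": ("leopard", "general"),
    "hok": ("crowned eagle", "canopy"),
}


def apply_oo(meaning):
    """The '-oo' affix: keep the scope, drop the specific predator."""
    _predator, scope = meaning
    return ("disturbance", scope)


def interpret(call):
    """Derive the meaning of a call from its stem and optional '-oo' affix."""
    if call.endswith("-oo"):
        stem = call[: -len("-oo")]
        return apply_oo(STEM_MEANINGS[stem])
    return STEM_MEANINGS[call]


krak_oo = interpret("krak-oo")
hok_oo = interpret("hok-oo")
```

The point of the sketch is that a single rule for ‘-oo’ suffices for both stems: no separate entry for ‘krak-oo’ or ‘hok-oo’ has to be stored, which is what distinguishes a compositional system from a merely combinatorial one.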
(c) Putty-nosed monkeys: a less clear-cut example
The putty-nosed monkey's (Cercopithecus nictitans) combinatorial system is not so easy to categorize. In their communication system described by Arnold & Zuberbühler, putty-nosed monkeys produce two different loud calls: ‘pyows’ and ‘hacks’ [41]. These calls can be used as alarm calls when a predator is detected. If the predator is a leopard, the putty-nosed monkeys use ‘pyows’, and if it is a crowned eagle, they use ‘hacks’. In addition to this, the monkeys can combine these two calls into another structure, the ‘pyow-hack sequence’. This sequence normally consists of two to three ‘pyows’ followed by up to four ‘hacks’. The ‘pyow-hack sequences’ elicit the movement of the group. While the components of this sequence bear meaning individually, the meaning of the sequence does not appear to derive from the meaning of these components, and so this combination does not conform to Marler's definition of lexical syntax [32]. There do, however, exist three alternative analyses that can be invoked to linguistically categorize and understand this call combination in relation to human language.
First, this communication system can be interpreted as a simple phonological system. Under this analysis, the ‘pyows’ and the ‘hacks’ of the putty-nosed monkeys would be considered as phonemes in the linguistic sense, elements carrying no meaning per se but allowing the differentiation between the two single-segment morphemes (i.e. meaningful elements made up of only one sound) ‘pyow’ (‘leopard’) and ‘hack’ (‘eagle’), and a morpheme composed of a sequence, ‘pyow-hack sequence’ (‘let's go’). Thus, the element ‘pyow’ in the single call ‘pyow’ and in the ‘pyow-hack’ would be comparable to, say, the sound s in the single-segment morpheme s (as in John's) and in the sequence so or us—with no meaning in common, but serving as a diacritic for distinguishing meanings.
However, the data also allow alternative analyses that do not assume phonology and duality of patterning. Under one analysis, it would be possible to analyse the ‘pyow-hack sequences’ as idioms, where the original meanings of ‘pyow’ and ‘hack’ have become blurred. A possible etymology is this: the sequence first meant ‘leopard and eagle’ and then, derived from this by implication, ‘danger all over’. This in turn came to mean ‘danger all over, therefore let's go’ and finally just ‘let's go’. The human language analogue would be expressions like kick the bucket, the meaning of which is no longer transparently related to the meaning of the components, but has undergone complex etymological developments.
Alternatively, under another analysis, one could ascribe much more abstract meanings to ‘pyow’ and ‘hack’, such as ‘move-on-ground’ and ‘move-in-air’. When produced on their own, listeners would seek the contextually most relevant and most suitable interpretation of these calls, possibly using similar heuristic processes such as are well established for human communicators in the theory of implicature inferences [42–44]. A default and common implicature would be, in the case of ‘pyow’, inference to a prototypical danger on the ground, a leopard; and, in the case of ‘hack’, a prototypical danger in the air, an eagle. Since under this analysis the calls themselves have very abstract meanings, it is possible to analyse pyow-hack sequences as lexical compositions: meanings like ‘move-on-ground’ and ‘move-in-air’ combine to a general meaning like ‘we move; let's go’ since putty-nosed monkeys themselves move both in the tree canopy and, though more rarely, on the ground [45].
Under either of these last two analyses, putty-nosed monkeys would, contrary to Arnold & Zuberbühler's conclusions [28], have lexical syntax in Marler's sense. At first sight, these alternative analyses are perhaps less plausible than positing phonology because they ascribe more complex cognitive processing to the monkeys: language change in the idiom-based analysis or abstract semantics and a well-tuned pragmatic inference machinery in the compositionality-based analysis. However, the communication system of the Campbell monkeys, a species closely related to the putty-nosed monkeys, suggests that their possible use of lexical syntax with abstract semantics is especially worth considering and should not be ruled out a priori.
(d) Banded mongooses: a non-primate example of lexical syntax?
Potential examples of lexical syntax are not limited to primate species: there are also examples from species more distantly related to humans, such as the close calls of the banded mongoose (Mungos mungo) [29]. Banded mongooses emit close calls while looking for food and these calls differ in structure depending on the exact nature of the behaviour: digging, searching in the same foraging patch or moving between two patches. In all these contexts, the close call begins with an initial noisy segment that encodes the caller's identity, which is stable across all three contexts. Additionally, in the searching and moving context, there is a second tonal harmonic segment that does not encode identity; however, its length varies consistently with context, the segment being longer when the mongoose is moving rather than searching. These two segments, noisy and harmonic, come together in the call and indicate both the caller's identity and his activity.
As with the putty-nosed monkeys, it is possible to interpret these calls as a simple phonological system, with the noisy segment and short and long harmonic segments being three distinct phonemes. The noisy segment can then be produced alone as a single-segment morpheme when digging, or in combination with one of the other ‘phonemes’, which allow distinguishing between the different two-segment morphemes for searching or for moving.
In another interpretation, the banded mongoose close calls can act in an analogous way to short sentences: noisy segment + Ø → ‘I (Fred) dig’; noisy segment + short harmonic segment → ‘I (Fred) search’; noisy segment + long harmonic segment → ‘I (Fred) move’; with the noisy segment acting as a referential expression that also encodes individual identity (somewhat like the caller's name) and the tonal segment as the ‘predicate’ that can be compared to simple subject–predicate compositions in human languages. Indeed, some human languages also use individually distinct expressions (i.e. personal names) in lieu of first-person pronouns. This is the case, for example, in Thai, where the use of first-person pronouns equivalent to ‘I’ is rude. People routinely use their personal name instead, for example saying ‘Bill is cooking’ while referring to oneself [46]. Under this analysis, the meaning of the assemblies produced by banded mongooses directly reflects the meaning of their different components, making these combinations, in a similar way to the Campbell monkeys', syntactic.
For now, either interpretation is possible, particularly because, in the absence of playback experiments, it is not clear what information listeners extract from these calls. Such experiments are therefore vital in helping shed light on whether banded mongoose close calls represent a syntactic or phonological system.
3. Examples from human languages where phonology is absent
While in animal communication systems sound combinations seem to be the exception, in human language they are the rule: all human languages combine words at the syntactic level and nearly all human languages, spoken or signed, have phonology, or cherology as it is known for sign languages. However, there do exist some languages possessing features without phonology, or lacking phonology altogether. Understanding why this structural feature of language is and can be absent could shed important light on the origins of syntax and phonology in human languages.
(a) Al-Sayyid Bedouin Sign Language
Most sign languages have phonology (cherology). This was first determined by Stokoe [47] in his work on American Sign Language (ASL). Stokoe specifically demonstrated that ASL has three major categories (hand shape, location and movement) and that they each contain a certain number of features. Replacing one of these features by another causes a change in the meaning of the sign. This allowed Stokoe to conclude that ASL was not made up of holistic signs but of meaningless elements that are recombined into words.
Currently, one sign language is known that does not have phonology, or at least phonology has not fully developed throughout its entire lexicon. This is the Al-Sayyid Bedouin Sign Language (ABSL) described by Sandler et al. [48]. ABSL is a relatively new language used in the Al-Sayyid Bedouin group of the Negev region of Israel. The first deaf members of the group were four siblings born around 75 years ago. Over the next two generations, the number of deaf members increased as more were born into the community, most probably due to recessive congenital deafness [48]. There are now around 120–150 deaf members for a total of around 4000 members. ABSL is also used by a significant proportion of hearing members of the community.
Sandler et al. [48] looked for phonology in ABSL by searching for minimal pairs. For sign language, these can be distinguished by location, orientation, hand shape or movement. The authors did not find minimal pairs in ABSL [48]. On the contrary, they found a great variety in the signs for single words. For example, the sign for ‘tea’ can be represented by three different hand shapes and the sign for ‘dog’ can be made either in front of the mouth or in front of the torso (difference in location), depending on the signer. This lack of minimal pairs led Sandler et al. to conclude that ABSL has no phonology and thus no duality of patterning [48]. Despite its lack of duality of patterning, from a linguistic point of view ABSL is a fully operational language, both in its function, allowing users to have conversations, make plans, tell stories and give instructions, and linguistically, having grammatical regularity at the syntactic, morphological and prosodic levels.
(b) Spoken languages
Of course, it could be that absence of duality of patterning is a peculiarity of an emerging communication system such as ABSL. However, although the spoken languages studied so far undeniably present duality of patterning, it is not implausible to assume, as does Blevins, that duality of patterning is not an absolutely universal property. Blevins discusses segment-sized morphemes in a number of languages [49]. An example is the English morpheme s, which can mean ‘plural’ (book-s), ‘third-person singular present’ (she look-s) or ‘possessor’ (Rik's). The reason we analyse s as three morphemes with a phoneme /s/ is because the same phoneme recurs in a great number of other morphemes (soup, test, miss, etc.). If this were not the case, one could just as well say that we have a meaning-bearing segment s that happens to be three-ways ambiguous. If a language has a large inventory of such meaning-bearing segments and the meanings are sufficiently abstract, this would easily allow a sizeable expressive power without duality of patterning. The two critical requirements for this—abstract meanings and large inventories of segments—are both well established in extant languages.
First, there are languages whose lexicon is composed of words with highly abstract meanings. Consider, for example, words like st'uswalíć ‘I picked up the rag’ in the North American language Atsugewi, which is composed of a prefix s'w- for ‘I’ followed by the three morphemes tu ‘do something by hand’, swal ‘for limp (not stiff or resilient) material to move or be located’ and ić ‘upward’ [50]. In such a system, a limited number of abstract meanings are strung together and then subjected to a rich machinery of pragmatic inference, deriving concrete meaning effects.
Second, there are languages with impressively large segment inventories. The known maximum is found in !Xõõ in Botswana, with 164 segmental phonemes [51]. Many languages in addition have suprasegmental features like tone (as also found in !Xõõ), vowel and consonant lengthening, nasalization (e.g. owoku ‘house’ versus õwõŋgu ‘my house’ in the Terena language of Brazil [52]) and holistic sound sequences such as are found in interjections (e.g. ʔṃ 'hṃ for ‘yes’ and ʔṃ 'ʔṃ for ‘no’ in English). It is easy to imagine that all these possibilities co-occur in a single language, so that inventories quickly reach between 160 and 180 units, each carrying its own abstract meaning.
Furthermore, Blevins notes that in many languages, meanings depend on position and context [49] (just as the English -s means different things depending on whether it follows a noun or a verb stem; cf. above). Even just distinguishing word-initial and word-final positions in two-segment words would thus already yield a potential for more than 300 meanings; adding a noun versus verb distinction could double this number again. Finally, as Blevins also observes, many languages have what are called bi-partite or tri-partite stems, where stems are non-transparently composed of morphemes, like idioms (cf. e.g. in Andi, a language of the Caucasus, abcho ‘someone washed it’, with the bipartite stem a-ch ‘wash’, interrupted by an agreement marker b- and followed by a past tense marker -o [53]). This quickly adds a few hundred other meanings (in fact, with 180 units that can freely combine with each other in first and second part, a language could potentially have up to 180² (that is, 32,400) bipartite stems, which is already beyond an average speaker's lexicon in daily use). Given all these possibilities, it is perfectly possible that there might have been (or will be) a spoken language in the world that lacks duality of patterning. The lexicon of such a language might not (easily) allow growth on the scale of languages with duality of patterning, but if we also allow for borrowing words from other languages, even these limitations are not as detrimental as one might think.
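The counts in this paragraph follow from simple multiplication, which can be checked with a short sketch. It assumes the inventory range of 160 to 180 meaning-bearing units given above, two word positions, an optional noun versus verb split, and free combination of units in the two slots of a bipartite stem; the function and parameter names are ours.

```python
def positional_meanings(units, positions=2, word_classes=1):
    """Meanings available if each unit's meaning depends on position and word class."""
    return units * positions * word_classes


def bipartite_stems(units):
    """Stems available if any unit can fill the first and the second slot."""
    return units ** 2


low, high = 160, 180

# Word-initial versus word-final position only.
by_position = (positional_meanings(low), positional_meanings(high))

# Adding a noun versus verb distinction doubles the figure.
by_position_and_class = (
    positional_meanings(low, word_classes=2),
    positional_meanings(high, word_classes=2),
)

# Upper bound on bipartite stems for the largest inventory.
max_stems = bipartite_stems(high)
```

Under these assumptions the positional count lies between 320 and 360, consistent with ‘more than 300 meanings’, and the bipartite-stem bound for 180 units is 32,400. These are upper bounds on what is combinatorially available, not estimates of any attested lexicon.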
4. Discussion
(a) Syntax before phonology
The examples discussed in this review demonstrate that (i) while phonology in the linguistic sense seems to be rare in animal communication systems, lexical syntax seems to be more widespread than previously thought, and (ii) while there is no human language without syntax, it seems possible for some human languages to lack phonology. This appears to indicate that a single layer of compositional structure (syntax) is less complex to develop than adding to this an extra layer of phonological structure. This leads us to hypothesize that, contrary to the traditional view in both linguistics and animal communication research [54], syntax developed before phonology in human languages.
This hypothesis seems to be further supported by the fact that human languages lacking phonology but possessing syntax, such as ABSL, are emerging languages that do not yet seem to be fully formed. This suggests that syntax develops first to allow the expression of more concepts with only a few words, while phonology appears later on in the development of a language, when the need for a larger vocabulary makes it a more efficient way to produce an increased number of words. If this is the case, we would expect any new emerging languages to present a similar pattern, with syntax developing before phonology. Preliminary surveys suggest that this may be the case for most spontaneous sign languages [55]. In terms of spoken languages, it is harder to search for similar developmental patterns, as emerging spoken languages such as pidgins and creoles are created when people who speak different languages need to communicate. Therefore, these languages are not created from scratch and their sound system is most often taken from one of the original languages [56].
Why syntax developed before phonology is of course open to discussion, but it could be that, from a cognitive perspective, syntax is simpler to process than phonology. Intuitively, it would seem that syntactical combinations would require less memorizing, as only the meanings of the individual signals would need to be learned and remembered, the meaning of the combination being derived from them. For phonological combinations, on the other hand, it would seem that a new meaning has to be learned for each different sequence of sounds.
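This memory argument can be illustrated with a rough count of the form-meaning mappings a learner would have to store. The sketch below rests on strong simplifying assumptions that are ours rather than claims from the literature: every ordered sequence of a given length is a usable signal, compositional meanings are fully predictable from their parts, and the cost of learning the combination rule itself is ignored.

```python
def items_to_memorize(n_elements, sequence_length, compositional):
    """Count the form-meaning mappings a learner must store.

    Compositional (syntactic) system: one meaning per element; the meaning
    of a sequence is derived from its parts.
    Combinatorial (phonological) system: elements are meaningless, so each
    distinct sequence needs its own stored meaning.
    """
    if compositional:
        return n_elements
    return n_elements ** sequence_length


# Ten elements combined into two-element sequences.
syntactic_load = items_to_memorize(10, 2, compositional=True)
phonological_load = items_to_memorize(10, 2, compositional=False)
```

With ten elements and two-element sequences, the compositional learner stores 10 mappings and the combinatorial learner up to 100. The same figures also show the other side of the trade-off: the combinatorial system yields far more distinct meanings from the same elements, which is why it may pay off once a larger vocabulary is needed.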
(b) Insights into the origins of syntax and phonology
While the examples analysed in this review can give some insight into the order of development of different types of sound combinations, they also allow us to formulate hypotheses regarding the conditions favouring their evolution. One obvious similarity between the species demonstrating combinations of meaningful calls is that they all reside in groups characterized by high sociality. This social dimension may well require such species to express more concepts than would be possible with only the individual calls from their anatomically constrained vocal repertoire. One solution to this constraint is to develop a more open-ended vocal repertoire through learning, as is the case in a number of bird species and social mammal species [57]. Alternatively, as we see here, calls could be flexibly combined to express related (compositional syntax) or even unrelated (combinatorial syntax) meanings [41].
Furthermore, of the three major examples we present, two represent call combinations used in less urgent situations. In the case of the Campbell monkeys, signallers use single alarm calls on their own to indicate a predator, whereas they use the affixed call for a more general, less immediately threatening disturbance. In a similar way, banded mongoose call combinations occur while foraging rather than in immediate predation contexts. As a shorter time between the perception of the danger by the emitter and the reaction of the receiver would be more advantageous in urgent situations, one might predict clearer evidence for syntax in more relaxed, social contexts [58]. Indeed, for human language, it is well established that more complex and elaborate kinds of syntax are better represented in written than in spoken language [59] (i.e. in a mode of language use that is removed from the rapid and socially challenging interactions that characterize spoken language).
Given the current absence of unambiguous examples of phonology in the linguistic sense in animal communication systems (i.e. there is no clear evidence of patterns of communication that cannot be explained without assuming phonology), variation among human languages may provide additional insight into the origins of this feature. First, the examples of human language features lacking phonology, such as segment-sized morphemes or holistic sound sequences, suggest that duality of patterning is an empirically observed correlation and not a logically necessary property of language [48]. New observations are constantly providing additional empirical data to be interpreted. Second, the absence of phonology in certain aspects of languages, or even in whole languages, points towards a non-genetic basis for this feature in human language. Like songbirds [35] and some mammal species (cetaceans [60], pinnipeds [61], elephants [62], bats [63]), humans are vocal learners capable of producing a large number of different sounds. However, humans are, as far as we know, the only species that uses these sounds phonologically to distinguish between the meanings of two sequences. This suggests that vocal learning and the capacity to produce a large number of different sounds alone are not sufficient to induce the emergence of a phonological level. We therefore argue that the constraints leading to the use of a phonological level are more likely to be cognitive in nature than linked to the production capacity of a given species. Specifically, once humans developed the cognitive capacities to memorize phonological combinations and their meanings, phonology itself could become subject to cultural, as opposed to biological, evolutionary processes [23,64]. If this is the case, it might explain why phonology in the linguistic sense is so rare in the communication systems of other species.
The constraints driving the cultural evolution of phonology should be widespread across human cultures, reflecting the distribution of the property itself. These constraints could include the need for distinctiveness and learnability, as well as a tendency to keep meaningful distinctions while trying to make an utterance sound similar to other utterances in a population [23]. As Hockett noted, phonology is most efficient when there is a large set of meanings to be expressed, because the combination of phonemes is generally less constrained than the combination of morphemes: the combination of morphemes must ‘make sense’ [24]. ABSL may lack phonology because it does not currently have these constraints. It is a small community language and its users know each other, potentially making pragmatics and inference an important part of their communicative understanding. However, if the use of ABSL were to spread to a larger population of signers, we could expect a gradual emergence of phonology. In fact, ABSL already seems to have a blueprint for the development of phonology, with the emergence of categories, the regularization of signs within familylects and young signers using conventionalized signs rather than iconic ones [48].
(c) Conclusion
Duality of patterning is considered an important feature of language. From a comparative perspective, this has led to great interest in animal call combinations and their similarities to the two levels of structure found in duality of patterning: phonology and syntax. In this review, we have shown that there exist no clear examples of phonology in the linguistic sense in animal communication systems, and that, contrary to traditional thought, syntax or compositionality is actually more widespread. When also analysing the structure of human languages, we found that some parts of some languages, and at least one entire language, do not display phonology. From these observations, we argue instead that syntax developed before phonology and that the former seems to be a cognitively simpler process, with the latter possibly being the product of cultural evolution. This could be taken into account in future research on meaningful animal call combinations by assuming lexical, and not phonological, syntax as the simplest explanation.
If a certain language property, such as phonology, is not universally present in all human languages, then it is probably unsurprising that it is non-existent in a large number of animal communication systems. However, if the factors leading to the presence (or absence) of this property can be determined, they may allow us to make predictions on which species or social contexts to focus our research effort to find these analogous or homologous properties in animal communication systems if they do exist. This focus fits with recent developments in linguistics that increasingly challenge the idea of a given set of properties defining all and only human languages, and instead probe into the social and biological factors that condition how specific properties of language arise, develop and disappear again in the course of time [65–67].
Acknowledgements
We thank Per Lundberg and two anonymous reviewers for helpful comments on a previous version of the manuscript, and Sabrina Engesser for discussions.
Funding statement
This work was funded by a collaborative University of Zurich Research Priority Program grant (