Philosophical Transactions of the Royal Society B: Biological Sciences
Review articles

The neurobiology of innate, volitional and learned vocalizations in mammals and birds

Andreas Nieder

Animal Physiology Unit, Institute of Neurobiology, University Tübingen, Auf der Morgenstelle 28, 72076 Tübingen, Germany

Richard Mooney

Department of Neurobiology, Duke University School of Medicine, Durham, NC 27710, USA

    Vocalization is an ancient vertebrate trait essential to many forms of communication, ranging from courtship calls to free verse. Vocalizations may be entirely innate and evoked by sexual cues or emotional state, as with many types of calls made in primates, rodents and birds; volitional, as with innate calls that, following extensive training, can be evoked by arbitrary sensory cues in non-human primates and corvid songbirds; or learned, acoustically flexible and complex, as with human speech and the courtship songs of oscine songbirds. This review compares and contrasts the neural mechanisms underlying innate, volitional and learned vocalizations, with an emphasis on functional studies in primates, rodents and songbirds. This comparison reveals both highly conserved and convergent mechanisms of vocal production in these different groups, despite their often vast phylogenetic separation. This similarity of central mechanisms for different forms of vocal production presents experimentalists with useful avenues for gaining detailed mechanistic insight into how vocalizations are employed for social and sexual signalling, and how they can be modified through experience to yield new vocal repertoires customized to the individual's social group.

    This article is part of the theme issue ‘What can animal communication teach us about human language?’

    1. Introduction

    Vocalizations play a fundamental role in intraspecific communication in birds and mammals. As in any communication system, vocal signals need to be produced by a sender and deciphered by a receiver. Thus, both the production and the perception of vocalizations are integral parts of vocal communication. This article focuses on the neural mechanisms of vocal production from a comparative point of view in songbirds, in primates and laboratory mice, the latter species providing the most genetically tractable vertebrate organism in which to map and manipulate central circuits for vocal control.

    In discussing the behavioural and neural foundations of vocal production, two issues are of special importance. The first issue concerns whether vocalizations are innately programmed, or instead acquired through learning. The second issue is whether vocalizations are elicited solely through emotional (affective) mechanisms or can also be volitionally controlled. These two aspects constitute orthogonal axes in a ‘vocal feature space’ because any combinations between degrees of learning and control are possible, from innate-affective (e.g. alarm calls), innate-volitional (instructed calls), learned-affective (birdsong), to learned-volitional (human speech).

    (a) Innate versus learned vocalizations

    The vast majority of avian and mammalian species produce only innate vocalizations, defined here as vocalizations that are closely controlled by hard-wired brain structures and their underlying genetic programmes with little or no environmental influence. These innate vocalizations occur spontaneously in all healthy members of a species whenever they are exposed to a certain stimulus, and are not learned through cultural experience or through vocal practice and performance evaluation. They help to communicate about food, mediate social interactions or signal the presence of different predators. For instance, chickadees produce a high-frequency, low-amplitude ‘seet’ alarm call in response to flying raptors, but a loud, broadband ‘chick-a-dee’ alarm call in response to a perched or stationary predator [1]. Similarly, vervet monkeys are well known to produce three types of predator-specific alarm calls: ‘leopard alarm calls’ are short tonal calls produced in a series of inhalations and exhalations, ‘eagle alarm calls’ are low-pitched grunts, while ‘python alarm calls’ are high-pitched ‘chutters’ [2] (but see [3] for evidence that both cognitive appraisal of the situation and internal state contribute to the variation in call usage and structure).

    Despite the complex ways that monkeys can use different vocalizations, Kaspar-Hauser experiments in squirrel monkeys [4], hybridization studies in gibbons [5] and cross-fostering experiments in macaques [6] underscore that the underlying vocal patterns produced by non-human primates are innate. Cross-fostering experiments and early deafening experiments in mice also indicate that their vocal repertoires are innate ([7–9], but see [10]). Humans also produce a wide range of vocalizations, such as cries or laughter, even when they are congenitally deaf, indicating that these vocalizations are innate rather than learned [11].

    While vocal utterances in non-human primates are innate, recent behavioural experiments in the marmoset, a new-world monkey, demonstrate the influence of social experience on vocal maturation. Marmoset infants initially produce a high ratio of ‘cries’ to ‘phee’ calls in the first month of postnatal life, then transition to producing a repertoire comprising predominantly phee calls. This transition occurs earlier in development in infants whose phee calls more frequently elicited phee calls from their parents [12–14]. Moreover, this change cannot be attributed to effects of social experience on overall growth rates; rather, an infant marmoset appears to select and refine elements from its innate vocal repertoire via social–auditory reinforcement from its parents. In support of this view, a study of twin marmoset infants found that the twin that received greater contingent ‘phee’-back played from a speaker transitioned to phee calls earlier in life [15]. Caveats include that infant marmosets with less responsive parents still transition to adult proportions of phee calls and that the effects of social–vocal experience on factors other than gross body weight that could influence vocal development, such as steroid levels in the juvenile's brain and larynx, have yet to be explored. Nonetheless, the social reinforcement-based change in marmoset vocalizations may be regarded as vocal production learning in a very broad sense [16–18]. In that respect, it resembles aspects of early (prelinguistic) vocal development in humans [19,20], songbirds [21,22] and bats [23,24]. This simpler form of vocal plasticity suggests that marmosets may serve as valuable models for rudimentary aspects of speech development and evolution, in addition to their established utility for studying how vocalizations are used for social communication and recognition (see table in [25] for a list of studies of plastic primate vocal behaviours). However, such a broad definition of vocal production learning needs to be clearly differentiated from vocal learning via imitation.

    As more narrowly (and accurately) defined, vocal production learning involves imitation of the vocal patterns produced by another individual, including that of another species or even artificial (man-made) sounds. Vocal imitation requires that the ‘pupil’ hear an appropriate model and engage in extensive vocal practice and evaluation through auditory feedback, and involves enhanced forebrain control of the vocal organ [26]. In fact, surprisingly few groups of vertebrates learn their species-typical vocalizations: other than for speech learning in humans, evidence of vocal learning exists only in a few different orders of birds (oscine songbirds [27–29], parrots [30–32] and certain hummingbirds [33]) and of mammals (pinniped carnivores [34], cetaceans [35], bats [36] and elephants [37]). Even then, systematic experimental evidence of the necessity of models, practice and auditory feedback-dependent performance evaluation is lacking in pinnipeds, whales, bats and elephants. In conclusion, despite the fact that the vocalizations of non-human primates change over development, the limited vocal modifications seen in non-human primates, including chimpanzees, are categorically different from vocal production learning in this more narrow and traditional sense [38].

    Although the capacity for vocal learning emerged independently in songbirds and humans, birdsong and speech learning share many striking similarities. Most notably, songbirds and humans learn their species-typical vocalizations during a juvenile sensitive period in a process that depends on auditory experience of an appropriate vocal model, extensive vocal practice, and self-evaluation and adaptive modification of vocal performance through auditory feedback [39–43]. In addition, both humans and songbirds have evolved a complex hierarchy of specialized forebrain sensorimotor structures important to the acquisition and production of learned vocalizations, and that exert their influence on vocalizations through the same brainstem regions important to the production of innate vocalizations [44–47]. Indeed, innate and learned vocalizations depend on the activity of common pools of motor neurons and a single vocal end organ, as will be discussed later in this review in §2. Thus, an understanding of the neural mechanisms underlying innate and learned vocalizations in songbirds and humans can inform how innate and learned motor patterns coexist and cooperate with each other to generate a full vocal repertoire.

    (b) Internal state-based versus volitional control of vocalizations

    Many and perhaps most vocalizations uttered by birds and mammals are a consequence of specific internal states. The alarm calls made by chickadees and vervet monkeys just described are illustrative examples of such internal state-based vocalizations. Here, the internal state change in the presence of a predator is sufficient to explain the vocal utterances. The internal state is used as an umbrella term that contains different intrinsic factors. According to one framework, the internal state consists of an affective component (related to the individual's evaluation of the environment), a motivational component (related to the individual's action tendencies) and an arousal component (related to the individual's likelihood and urgency to respond) [48]. Thus, affect, motivation and arousal all describe different, sometimes orthogonal internal states that might elicit vocalizations.

    Without a direct measure of physiological parameters in controlled vocalization contexts, it is virtually impossible to figure out how these various internal states relate to the production of vocalizations [49]. In one rare attempt in which a direct link between specific call types and internal state could be established in squirrel monkeys, the monkeys were trained to increase or avoid electric stimulation of specific brain areas by switching between different compartments in a cage [50]. Aversive situations in which the monkeys avoided stimulation were predictably associated with specific call types, whereas appetitive situations reliably elicited other innate types of vocalizations [50]. The results indicate that calls are loosely tied to different affective or motivational states.

    A recent study in marmosets used noninvasive surface electromyography to measure heart rates as a proxy for arousal while the monkeys produced different vocalizations that were elicited by systematically manipulating social context [51]. It was found that arousal levels were correlated with changes in different acoustic features and the production of different vocalizations. These findings are in agreement with the notion of arousal level changes resulting in call type differences. However, this study also found that the production of these different call types is affected by extrinsic factors such as the timing of a conspecific's vocalization in contexts where marmosets are interacting.

    While internal state is a natural driver of many human vocalizations, we can also vocalize under volitional control, such as when we engage in spoken conversation. In fact, a distinction between volitional and affective (spontaneous, emotional) movements has long been recognized in clinical neurology, especially for orofacial movements that are indispensable to vocalization [52,53]. Patients with facial paralysis owing to damage to the descending pathways from the motor cortex have considerable difficulty smiling or frowning in response to a neurologist's commands (a condition called ‘voluntary facial paresis’), even though they smile or frown naturally in response to their emotional state. Similar dissociations have been observed for vocalizations, where some patients with neurological insults may lose volitional control of their speech (despite intact speech comprehension), but can still laugh, scream or groan when they are happy, frightened or in pain. For example, following damage to the inferior frontal gyrus, non-verbal vocal utterances remain intact despite devastating impairments in speech and language production (Broca's aphasia) [54–57]. Moreover, patients with a clinical diagnosis of primary progressive aphasia develop abnormal laughter-like vocalizations that increasingly replace speech in the context of progressive speech output impairment leading to mutism, until ultimately laughter-like vocalizations are the only extended utterance produced by these patients [58]. Neuropsychological studies in these patients further help to show that affective and volitional vocalizations are mediated by distinct neural pathways, as will be discussed in further detail in §5.

    Whether non-human animals share the capacity for volitional control of vocalization is a matter of intense debate. Claims of volitional vocal control in some animals, particularly in non-human primates [59], remain controversial. Without specific experimental efforts to control the affective status of an animal under investigation, for instance, affective causes will always remain the simplest explanation for vocal initiation, particularly when working with spontaneously behaving animals.

    A key issue is how to assess whether non-human animals can exert volitional control of their vocalizations. Clinical neurology provides three criteria useful for making such an assessment. First, vocalizations need to be uttered in response to an arbitrary instruction stimulus that is neutral in its value or emotional valence. Second, vocalizations need to be uttered in a manner that is temporally contingent on the instruction stimulus. Third, vocalizations need to be produced reliably after the presentation of the instructive stimulus, and withheld in its absence. Note that these criteria for volitional vocalization are not identical to current definitions of ‘intentional communication’ in animals, which primarily rely on the effects that communicative signals may exert on a potential recipient [59].

    Indeed, recent studies determined that monkeys and crows (corvid songbirds) can be trained in a manner that satisfies these three criteria, and thus display volitional vocal control [60,61]. There, the animals were trained in a computer-controlled Go/Nogo-protocol to vocalize as soon as they saw a visual stimulus, which was delivered with an unpredictable onset time. The red or blue squares used as instructive stimuli were arbitrary and contained no inherent hedonic value that might trigger vocalization through an affective process, satisfying the first criterion. Second, the animals vocalized promptly after the onset of the instruction stimulus (approx. 1.5 s). Third, the animals responded reliably to the instruction stimulus with high hit rates after presentation of the instruction stimulus, and displayed low false alarm rates in the absence of the instruction stimulus [60,61]. Finally, one monkey even learned to switch between two distinct call types from trial to trial in response to different visual cues [60].
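
    The third criterion is typically quantified from trial counts. The following is a minimal sketch, not the analysis used in [60,61]: the function name `go_nogo_summary` and the example counts are hypothetical, and the sensitivity index d' is a standard signal-detection measure added here for illustration only (the studies above are described in terms of hit and false alarm rates).

```python
from statistics import NormalDist


def go_nogo_summary(hits, misses, false_alarms, correct_rejections):
    """Summarize a Go/Nogo vocalization session from trial counts.

    Go trials: instruction stimulus shown (a call counts as a hit).
    Nogo trials: no instruction stimulus (a call counts as a false alarm).
    """
    n_go = hits + misses
    n_nogo = false_alarms + correct_rejections
    if n_go == 0 or n_nogo == 0:
        raise ValueError("need at least one Go and one Nogo trial")

    hit_rate = hits / n_go
    fa_rate = false_alarms / n_nogo

    def clip(rate, n):
        # Keep rates away from 0 and 1 so the z-transform stays finite
        # (a common convention; an assumption, not taken from the source).
        return min(max(rate, 0.5 / n), 1 - 0.5 / n)

    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(clip(hit_rate, n_go)) - z(clip(fa_rate, n_nogo))
    return hit_rate, fa_rate, d_prime


if __name__ == "__main__":
    # Hypothetical session: 100 Go trials and 100 Nogo trials.
    hr, fa, dp = go_nogo_summary(90, 10, 5, 95)
    print(f"hit rate={hr:.2f}, false alarm rate={fa:.2f}, d'={dp:.2f}")
```

    A high hit rate combined with a low false alarm rate yields a large d', which is the pattern expected if calls are produced reliably after the cue and withheld in its absence.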

    These findings indicate that monkeys and songbirds are able to volitionally initiate vocal production and thus can instrumentalize their vocal behaviour in a goal-directed manner. Interestingly, however, this volitional control was developmentally restricted in monkeys: while both monkeys reliably vocalized on command during juvenile periods, they discontinued this controlled vocal behaviour in adulthood [62]. This developmental loss of volitional vocal production contrasts with persistent affective vocal production, as both monkeys continued to vocalize spontaneously as adults. Furthermore, both monkeys could continue to make volitional manual gestures as adults. We speculate that the loss of voluntary vocalizations reflects maturational changes in the monkeys' brains that are unrelated to vocal practice or production per se. For instance, hormonal changes associated with sexual maturation contribute to adolescent-typical behavioural changes that necessarily have an impact on large-scale networks, so that functions beneficial during childhood may become inhibited during adulthood. Moreover, synaptic elimination during adolescence likely involves adjustment of the excitatory/inhibitory balance on individual neurons and within networks, given that excitatory synapses are selectively pruned whereas inhibitory synapses are spared [63]. Whatever the reasons, volitional vocal control is a recent evolutionary invention in primates, and only fully available throughout life in humans.

    2. Peripheral vocal production mechanisms

    Vocalization in birds and mammals requires precisely coordinated activity of many different respiratory and vocal muscles, the latter of which includes the muscles of the sound-generating organ (the intrinsic muscles of the larynx in mammals or of the syrinx in birds) as well as those of the upper vocal tract, which can further filter and amplify components of these vocal sounds [64–69]. This integration can be enormously complex: human speech is estimated to involve the coordinated activity of approximately 100 distinct vocal, orofacial and respiratory muscles [47]. Moreover, the intrinsic vocal (laryngeal or syringeal, respectively) muscles are active not only during vocalization but also during quiet breathing [70–72], where they help regulate and gate respiratory airflow [73]. In fact, vocal motor networks are embedded in and likely derived from brainstem respiratory pattern-generating networks [74–77]. Thus, vocal and respiratory pattern generating networks are strongly intertwined and may not be wholly separable.

    One important consideration when comparing vocal communication in birds and mammals is that they vocalize through functionally similar but anatomically distinct vocal end-organs [67,73] (figure 1). Birds sing and call through their syrinx, a bipartite muscular organ located at the junction of the trachea and the two bronchi, whereas mammals vocalize through the larynx, a unipartite structure rostral to this junction, closer to the cranium. Notably, many songbird species exploit the bipartite architecture of the syrinx to simultaneously generate two independent and often harmonically unrelated voices [78,79], something that is not possible to achieve with laryngeal vocalizations (although harmonically related two-voice effects can be achieved by skilled human vocalists who learn to accent the fundamental and certain of its harmonics through upper vocal tract filtering [80]).

    Figure 1. Songbird syrinx (a) and mammalian larynx (b) shown in en face (top) and frontal cross-section (bottom) views. The syrinx is located at the juncture of the trachea with bronchi. Abbreviations, left two panels: TL, m. tracheolateralis; ST, m. sternotrachealis; vTB, m. tracheobronchialis ventralis; dTB, m. tracheobronchialis dorsalis; dS, m. syringealis dorsalis; vS, m. syringealis ventralis; T, tracheal ring; Ty, tympanum; P, pessulus; ML, medial labium; MTM, medial tympaniform membrane; LL, lateral labium; S, syringeal muscle; A1–A3, tracheo-bronchial semi-rings; B, bronchial rings. Abbreviations, right two panels: E, epiglottis; Hy, hyoid bone; LTL, lateral thyrohyoid ligament; Th, thyroid cartilage; MCL, median cricothyroid ligament; CT, cricothyroid muscle; Cr, cricoid cartilage; T, tracheal rings; FF, ventricular or false folds; V, lateral laryngeal ventricle; TA, thyroarytenoid muscle; VF, vocal folds. Figures reproduced with permission from [67].

    Despite structural differences in their vocal end organs, many functional aspects of the vocal periphery are similar in birds and mammals, as reviewed in depth by [67]. In most instances, vocalizations are emitted during the expiratory phase of the respiratory cycle. Furthermore, to phonate, mammals and birds must simultaneously increase airway pressure and contract the intrinsic laryngeal or syringeal muscles to apply tension to the vocal folds and the labia, respectively. This synergistic respiratory and vocal activity causes periodic vibrations of the vocal folds or labia, which in turn cause periodic vibrations in surrounding air molecules. These airborne vibrations manifest as relatively low frequency (less than 10 kHz) ‘voiced’ harmonic sounds, the fundamental frequency of which (100–200 Hz in adult humans) directly correlates with the activity of certain intrinsic laryngeal or syringeal muscles (i.e. the cricothyroid muscle in mammals and the ventral syringeal muscle in birds) [68,81]. Other similarities include the ability to emit a wide range of vocal frequencies by exploiting different vocal tract dynamics, spanning from very low frequencies (‘vocal fry’ or ‘pulse tone register’) to higher frequencies (‘modal register’); the capacity to utter multiple vocal notes, elements, calls or syllables during a single exhalation (a ‘fusion’ event that figures prominently in human speech); and the accenting and filtering of certain harmonics through the active manipulation of the resonant properties of the upper vocal tract.

    Upper vocal tract filtering (‘articulation’) plays an especially prominent role in speech and birdsong, and thus warrants further attention. When humans or birds produce voiced sounds in a helium-oxygen (heliox) atmosphere, which increases the resonant frequency of the upper vocal tract without affecting the oscillatory frequency of the vocal folds or syringeal labia, previously suppressed harmonics can be detected, providing experimental evidence of this upper vocal tract filter [82,83]. In humans, the articulatory muscles of the face, tongue, jaw and pharynx contribute to the filter; in birds, the muscles that control the oropharyngeal-esophageal cavity (OEC) appear to figure prominently in upper vocal tract filtering [67,84,85]. Songbirds also actively modulate the gape of their beaks during singing, although most likely to optimize acoustic transmission, rather than for spectral filtering [86].

    In addition to producing audible vocalizations (‘squeaks’) through vibrations of their vocal folds, many rodents, including rats and laboratory mice, emit ultrasonic vocalizations (USVs; greater than 30 kHz) [87–89]. Notably, USVs figure prominently in social affiliation and courtship, whereas lower frequency vocalizations are produced more commonly in aggressive encounters, or in response to stress or pain [90]. While rodent ‘squeaks’ are produced by mechanical vibrations of the vocal folds, USVs are produced by an aerodynamic whistle. Support for this idea comes from the knowledge that the fundamental frequency of an aerodynamic whistle depends on the density of the surrounding gases, whereas the fundamental frequency of a mechanical oscillator (such as the vibrating vocal folds) does not. Consistent with this distinction, the fundamental frequency of rodent USVs shifts upward in a heliox atmosphere, whereas the fundamental frequency of voiced sounds remains unchanged [89]. Moreover, structural analysis of the rodent airway supports the idea that USVs result from an edge-tone whistle effect generated by airflow over a ventral laryngeal pouch [91]. Despite their distinct physical mechanisms, both ‘whistled’ USVs and voiced vocalizations originate in the larynx and depend on a precise coordination of respiratory and vocal activity. In fact, to produce USVs, rats (and presumably mice) must maintain respiratory pressure below the level that induces vocal fold oscillations, and must precisely adjust the position of the larynx to sustain airflow necessary for generating the ultrasonic whistle [92].
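
    The logic of the heliox test can be illustrated with a rough calculation. This is a sketch under stated assumptions rather than a model from [89]: it assumes ideal gases, a heliox mixture of 80% helium and 20% oxygen (the source does not specify the mixture), a body temperature of 310 K, and the simplification that a whistle's frequency scales in proportion to the speed of sound in the gas while a vocal-fold oscillator's frequency does not change. The function names and the 70 kHz example call are hypothetical.

```python
import math

R = 8.314462618  # universal gas constant, J/(mol K)

# Per gas: (molar mass in kg/mol, Cp/R, Cv/R), ideal-gas values.
GASES = {
    "He": (0.004003, 2.5, 1.5),  # monatomic
    "O2": (0.031998, 3.5, 2.5),  # diatomic
    "N2": (0.028014, 3.5, 2.5),  # diatomic
}

AIR = {"N2": 0.79, "O2": 0.21}     # simplified dry air
HELIOX = {"He": 0.80, "O2": 0.20}  # assumed mixture, not from the source


def sound_speed(mixture, temperature_k=310.0):
    """Speed of sound (m/s) in an ideal-gas mixture given mole fractions."""
    molar_mass = sum(x * GASES[g][0] for g, x in mixture.items())
    cp = sum(x * GASES[g][1] for g, x in mixture.items())
    cv = sum(x * GASES[g][2] for g, x in mixture.items())
    gamma = cp / cv  # heat capacity ratio of the mixture
    return math.sqrt(gamma * R * temperature_k / molar_mass)


def predicted_frequency(f_air_hz, mechanism):
    """Predicted fundamental in heliox under the simplifying assumption above."""
    if mechanism == "whistle":
        return f_air_hz * sound_speed(HELIOX) / sound_speed(AIR)
    if mechanism == "vocal_fold":
        return f_air_hz  # mechanical oscillator: no dependence on the gas
    raise ValueError("mechanism must be 'whistle' or 'vocal_fold'")


if __name__ == "__main__":
    ratio = sound_speed(HELIOX) / sound_speed(AIR)
    print(f"sound speed ratio heliox/air: {ratio:.2f}")
    print(f"70 kHz whistle in heliox: {predicted_frequency(70e3, 'whistle'):.0f} Hz")
    print(f"70 kHz voiced call in heliox: {predicted_frequency(70e3, 'vocal_fold'):.0f} Hz")
```

    Real whistles need not scale exactly with sound speed, so the sketch captures only the qualitative contrast: an upward shift for a whistle and no shift for a vibrating vocal fold.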

    In summary, many striking parallels exist in how vocal sounds are produced in different mammals and birds. Thus, a comparative approach using non-human primates, rodents and birds is likely to be relevant to understanding the neuromuscular control of vocalization more broadly, including for human speech. Even the whistles that give rise to rodent USVs have their analogues in the whistle languages used by humans living in mountainous regions [93]. And while to date the cognitive and semantic roles served by human speech are without obvious parallels in other animal vocalizations, research in songbirds and non-human mammals can provide insights into many neural processes relevant to human speech, including respiratory–vocal integration in the brainstem, the modulation of these brainstem networks by the forebrain to facilitate vocal learning and volitional control of vocalizations, and the control of vocal sequences necessary to phonological syntax.

    3. Brainstem networks for vocalizations

    Humans, other primates, mice and songbirds can produce innate vocalizations of normal acoustical structure even in the absence of forebrain inputs to the brainstem. Indeed, while the motivation and ability to gate innate vocalizations as a function of social or environmental context may require an intact forebrain in some species, the pattern of these innate vocalizations is unaffected when the forebrain inputs are removed by surgical transection, focal forebrain lesions or genetic mutations that result in anencephaly [8,55,94,95]. In contrast, speech and birdsong depend intimately on the forebrain: focal lesions in speech- or song-related regions of the human or songbird cortex render individuals unable to speak or sing, respectively, while their ability to utter acoustically normal innate vocalizations remains intact [8,55,94–96]. Here we take a ‘bottom up’ approach to exploring these brainstem networks.

    (a) Vocal and respiratory motor neurons

    Fluent vocalization depends on the precise coordination of numerous vocal and respiratory motor neuron pools. Motor neurons for breathing are largely located in the spinal cord and serve similar functions in mammals and birds, except that birds lack a diaphragm and the associated phrenic motor nucleus [97]. Indeed, one important distinction is that avian respiration involves active inspiration and active expiration (rather than passive expiration, as in mammals), mediated in part by a collection of bellows-like air sacs that perfuse a relatively rigid lung [65,97]. Vocal motor neurons in both birds and mammals are located in the caudal medulla, with the intrinsic laryngeal muscles in mammals receiving input from motor neurons in the nucleus ambiguus and syringeal muscles in songbirds receiving their innervation from motor neurons located in the tracheosyringeal part of the hypoglossal motor nucleus (interestingly, the muscles of the tongue, which play a prominent role in human speech (but not in birdsong), are innervated by the lingual part of the hypoglossal nucleus) [97,98]. A notable feature of vocal motor neurons in mammals and birds is that they are also active in normal respiration and, at least in mammals, in swallowing [76,99].

    Motor neurons important to the muscles of the upper vocal tract originate from multiple sources, including from motor neurons in the nucleus ambiguus, the hypoglossal motor nucleus, the facial motor nucleus and the trigeminal motor nucleus [98,100,101]. Because all of these motor neurons lack axon collaterals and only innervate their respective target muscles, they serve only to read out rather than directly participate in vocal pattern-generating circuits. Consequently, vocal pattern generators must synchronize activity of motor neurons located over a wide extent of the brainstem, rather than relying on reciprocal connections between disparate motor neuron pools to achieve vocal and respiratory coordination.

    (b) Brainstem pattern-generating networks for innate vocalizations

    An operational framework that enables the experimental identification of vocal pattern-generating circuits includes: (i) the capacity to trigger vocalizations when artificially stimulated with electrical, chemical, chemogenetic or optogenetic methods; (ii) the ability to alter, degrade or entirely suppress spontaneous vocalizations when stimulated, partially lesioned or pharmacologically inactivated; (iii) functionally direct (mono- or paucisynaptic) linkage to vocal and respiratory motor neurons important to phonation; and (iv) component neurons that are active before and during vocalization and that display firing patterns that correspond to acoustic features of the vocalization, such as duration, frequency or call type.

    In mammals, brainstem structures that meet these criteria include portions of the lateral reticular formation (LRF) and the nucleus retroambiguus (NRA; or the caudal ventral respiratory group (VRG)) (reviewed in [75,102]). Both the LRF and the NRA contain neurons that project to laryngeal motor neurons, to the motor neurons that innervate the articulatory muscles, and to expiratory motor neurons, thus providing the highly divergent architecture needed to coordinate vocal and expiratory motor activity during vocalization. Moreover, electrical stimulation in these structures is sufficient to elicit a range of vocalizations that resemble the innate vocalizations of the subject species. Conversely, inactivating the NRA abolishes spontaneous vocalizations as well as vocalizations that would otherwise be elicited by stimulating upstream regions, such as the periaqueductal grey (PAG; see §4). Finally, both the LRF and the NRA contain neurons that are active before and during vocalization and exhibit firing patterns that correspond to specific acoustic features of innate vocalizations. These features advance the LRF and NRA as two crucial components for the patterning of innate vocalizations in mammals.

    Although the functional interrogation of the brainstem circuits for innate vocalizations has been less extensive in birds, a likely avian homologue to the mammalian NRA (or caudal VRG) is the nucleus retroambigualis (RAm), which projects to expiratory and syringeal vocal motor neurons (figure 2) (the avian counterpart to the rostral VRG is the nucleus parambigualis, which contains inspiratory premotor neurons and is reciprocally connected to the forebrain song system (discussed in further detail in §5)) [103]. Both the NRA in mammals and RAm in birds project bilaterally onto their respective vocal motor neuron pools [98,103]. This feature may facilitate bilateral coordination of the vocal muscles when upstream regions are highly lateralized in their vocal functions, as with speech cortical regions in humans [96], or are anatomically lateralized, as with descending projections from forebrain song nuclei to the brainstem in certain songbirds [104–107]. Finally, because NRA and RAm receive highly convergent input from upstream regions in the brainstem and forebrain that are important to vocal gating, respiratory patterning and the volitional and learned control of vocalization, they likely represent a final common node in the mammalian and avian brainstem for vocal–respiratory patterning and coordination [98,103,108,109].

    Figure 2. Schematic of the vocal control system of the songbird, emphasizing descending and ascending (recurrent) pathways important to learned song production. Abbreviations: HVC (used here as a proper name); RA, robust nucleus of the arcopallium; DM, dorsomedial nucleus of the intercollicularis (the avian equivalent to the PAG); PBvl, ventrolateral part of the parabrachial nucleus; IOS, superior infra-olivary nucleus; RVL, rostroventral lateral medulla; PAm, nucleus parambigualis; RAm, nucleus retroambigualis; Uva, nucleus uvaeformis; nXIIts, tracheosyringeal part of the hypoglossal motor nucleus; nTS, nucleus of the solitary tract; INSP, inspiratory motor neurons; EXP, expiratory motor neurons; AFP, anterior forebrain pathway. The AFP and auditory pathways are highly simplified in this diagram. Descending connectivity is emphasized on the left side, but is not depicted on the right to highlight the role of Uva in providing interhemispheric connections important to bilateral coordination of HVC and central control of song. Figure courtesy of Marc Schmidt. (Online version in colour.)

    4. The contributions of the midbrain periaqueductal grey to vocalizations

    A crucial question is how the vocal pattern-generating networks in the caudal brainstem are switched on and off to ensure that vocalizations are produced in a manner that is appropriate to the environment, social context and the internal (emotional, developmental or hormonal) state of the animal. In this light, the lateral part of the caudal PAG has emerged as an essential hub for producing innate vocalizations in mammals and birds [109–112]. In primates, the PAG receives inputs from various forebrain structures important to reproductive and social behaviours, including the hypothalamus, amygdala, anterior cingulate cortex, preoptic area and the bed nucleus of the stria terminalis [113] (figure 3). In songbirds, the dorsolateral PAG receives input from forebrain nuclei important to producing learned song and from hypothalamic regions, providing a link between the hormonal state of the animal and courtship vocalizations [94,114,115]. Finally, the PAG projects to the NRA and to the rostral VRG (PAm in songbirds), which contains inspiratory premotor neurons and is in turn reciprocally connected to respiratory pattern-generating circuits [109,113]. Consequently, even though the PAG in the primate provides at best only sparse input onto phonatory motor neurons [113], it is nonetheless well-situated to rapidly influence and coordinate vocal and respiratory activity.

    Figure 3. Primate vocal systems. Voice control in humans and nonhuman primates is accomplished by two hierarchically organized vocal systems. The phylogenetically ancient system (‘primary vocal motor network’, black colour) is responsible for innate and affective vocalizations. The phylogenetically new system (‘volitional articulatory motor network’, grey colour) is responsible for volitional vocalizations and learned speech in humans. The projections in nonhuman primates, in particular the indirect projection from the motor cortex to nucleus ambiguus, are indicated by dotted lines; solid lines represent connections in the human brain.

    (a) The mammalian periaqueductal grey ‘gates’ vocalization

    Perhaps not surprisingly given its anatomical relationship to vocal patterning circuits, the mammalian PAG plays an obligatory role in vocal production. Indeed, bilateral lesions of the PAG induce mutism in a wide range of mammals and, in humans, abolish both speech and innate vocalizations [113,116]. In monkeys, the mutism induced by PAG lesions has been shown to be specific to the generation of vocal responses rather than a consequence of accessory deficits, because the vocal folds show normal respiratory movements (there is no paresis of the vocal folds) [108]. Interestingly, partial lesions of the caudolateral PAG lead to a loss of some vocalization types while others remain intact, suggesting that subsets of PAG neurons regulate specific vocalizations.

    Further underscoring the important role of the PAG in vocalization, electrical stimulation of the PAG can elicit non-habituating vocalizations at short latencies (50 ms or less) [117] in apes, monkeys, cats, bats and rodents, with acoustic features that are typical of innate vocalizations produced by that species (primates [118–121]). Furthermore, stimulation applied at different sites in the PAG can elicit different vocalization types, complementing the selective loss of certain vocal types following partial PAG lesions. And while electrical stimulation in some parts of the PAG can produce appetitive or aversive effects, self-stimulation experiments in monkeys support the idea that vocal effects of PAG stimulation are more direct and not attributable to stimulation-induced affective changes. Finally, the pattern of electrical stimulation in the PAG has little relationship to the pattern of vocalizations that are elicited, supporting a model in which the PAG gates downstream vocal pattern-generating networks, rather than contributing directly to the vocal patterning process itself [122].

    (b) A genetic approach to understanding the structure and function of the periaqueductal grey

    The functional and structural complexity of the PAG presents a particularly vexing challenge to understanding its role in context-dependent vocal gating. In addition to vocalization, the PAG helps to regulate respiration, nociception, and defensive and sexual behaviours, and vocalization-related PAG neurons are intermingled with the neurons that serve these various other functions [112,123,124]. This multifunctional and interwoven organization of the PAG has provided something of a roadblock in extending our understanding of the PAG's function in vocalization beyond classical lesion- and stimulation-based approaches. Consequently, whether a distinct and specialized subset of PAG neurons plays a dedicated role in vocal gating has remained unclear. Resolving this issue is important for understanding how vocalizations are integrated into more holistic behaviours, such as courtship or territorial defence, and for more explicitly mapping the circuits for social communication and courtship.

    In this regard, an important recent advance is the use of genetic methods to identify and manipulate vocalization-related neurons in the mouse PAG [125]. Tschida and her co-workers labelled PAG neurons that were active in male mice producing USVs (PAG-USV neurons) by using an intersectional strategy employing a knock-in mouse line in which highly active neurons express a specialized receptor that binds an engineered virus [126]. Injecting the virus into a region of the male mouse's PAG that expressed high levels of neuronal activity during vocalization genetically ‘tagged’ these ‘vocal’ neurons [125]. A distinct advantage of this approach is that it enables functionally specialized neurons to be genetically labelled and manipulated even when they are anatomically intermingled with neurons that serve other roles.

    This approach shows that a subset of neurons in the caudolateral PAG is specialized for producing USVs: genetically silencing these ‘PAG-USV’ neurons with tetanus toxin prevented males from producing USVs in response to female mice without suppressing other aspects of the male's courtship behaviour; conversely, chemogenetic or optogenetic activation of PAG-USV neurons evoked abundant USVs in socially isolated male mice, which typically utter USVs only when a female is present [125]. Notably, in optogenetic experiments, the pattern of USVs that mice produced was similar with tonic or pulsed (10 Hz) light, further supporting the idea that the PAG gates rather than patterns vocalizations. The specialized and selective role of PAG-USV neurons is further underscored by the finding that optogenetic stimulation of PAG-USV neurons did not evoke the running or ‘ballistic’ escape movements that are routinely elicited by pan-neuronal stimulation in the PAG. Moreover, these experiments suggest that functionally distinct vocalizations are gated by anatomically distinct subsets of PAG neurons, because genetically silencing PAG-USV neurons did not prevent mice from producing audible ‘squeaks’ in response to foot shock. Lastly, anterograde tracing reveals that PAG-USV neurons make efferent projections to NRA and the ventral parabrachial complex, providing an experimental avenue for understanding how the PAG gates vocalization [125].

    (c) Birds also have a vocal periaqueductal grey

    Although the precise homology between the mammalian and avian midbrain PAG still awaits a full accounting [127], a dorsolateral part of the avian PAG just medial and ventral to the inferior colliculus (i.e. the dorsomedial nucleus of the intercollicular complex, or simply DM) serves an important role in producing innate calls [111] (figure 2). Similar to the behavioural effects of stimulating the mammalian PAG, electrical stimulation in DM elicits species-typical calls [72,109]. One potentially important distinction is that songbirds with bilateral DM lesions can still sing their learned songs [94], which contrasts with the loss of speech in humans with analogous brainstem injuries [116]; further experimental confirmation of this important distinction is warranted. Anterograde tracing from the avian DM reveals a pattern of descending projections remarkably similar to those described for the PAG in mammals, with DM axons terminating in RAm (caudal VRG), the ventrolateral parabrachial nucleus, as well as the nucleus parambigualis (rostral VRG), a region that contains inspiratory premotor neurons and that is reciprocally connected to the respiratory pattern-generating network [109]. One distinction from the mammalian PAG is that DM axons also terminate in the syringeal part of the hypoglossal motor nucleus, providing the ‘vocal’ midbrain direct access to the phonatory motor neuron pool [109].

    5. Forebrain inputs to brainstem vocal networks and their contributions to vocalization

    The effects of cortical lesions on human speech can be so devastating that it is easy to forget that significant forebrain (telencephalic and diencephalic) contributions to vocalization are the exception rather than the rule in almost all other vertebrates. Given this exceptionalism, a reasonable assumption is that the ancestral circuits for vocalization reside entirely in the brainstem and the involvement of the forebrain in vocal behaviours has only recently evolved in just a few distantly related species. Nonetheless, numerous forebrain regions make connections with brainstem vocal–respiratory circuits, even in those species that produce only innate, affective vocalizations, presumably providing the scaffolding from which volitional and learned vocal control could be derived. A matter of current debate is whether the forebrain has largely usurped the role of pattern generation in those vocalizations highly dependent on cortical input, such as speech or birdsong. Alternatively, the forebrain might be engaged in a reciprocal interaction with the brainstem, mediated in part by the recurrent connections that brainstem respiratory–vocal regions make with the forebrain, to affect the patterning of these learned vocalizations.

    (a) Forebrain vocal circuitry in primates: two parallel, segregated pathways

    Two distinct and parallel vocal systems are recognized in primates [47,122,128]. The primordial brainstem vocal system called ‘primary vocal motor network’ [128] comprises projections from the anterior cingulate cortex (ACC) and other forebrain limbic structures to the PAG, as well as the reticular formation, and participates in stereotypic innate vocalizations driven by affect. The evolutionarily more recent cortical vocal system termed ‘volitional articulation motor network’ [128] includes projections from the laryngeal motor cortex (LMC) directly to phonatory and respiratory motoneuron pools and is the dominant system for human speech and language (figure 3).

    (i) The primary vocal motor network

    This system consists of two structurally and functionally distinct parts: the previously described midbrain PAG and brainstem vocal pattern-generating system, and an upstream limbic vocal initiating network that switches vocal pattern generation on and off based on affective state. This limbic network includes the ACC as well as other telencephalic and diencephalic structures, as detailed in the following paragraphs.

    The ACC is typically considered to sit atop the limbic vocal initiating hierarchy. Three cortical areas in the ACC are involved in vocal behaviour and all three of these areas also project to the LMC of the modern volitional articulation motor network [129]. These areas include the rostral (CMAr/M3/area 24c) and caudal (CMAc/M4/areas 24d, 23c) cingulate motor areas (CMA), located in the cingulate sulcus [130], and area 24b rostral to the genu of the corpus callosum [129].

    Several lines of evidence demonstrate a role of these ACC regions in vocalization. First, electrical stimulation in the ACC elicits species-specific vocalizations in monkeys as well as other mammals [131–134]. Second, neurons in the macaque CMA modulate their discharge in association with vocalization [135,136]. Finally, bilateral ablation of the ACC vocalization region has mild effects on spontaneous vocalizations, usually characterized by a decrease in the vocalization rate [137–140], although call amplitudes and durations may also decrease [141,142]. Together, these findings suggest that the ACC plays a modulatory role in vocalization.

    Several clinical studies implicate the ACC in the initiation of non-verbal vocal utterances and emotional speech in humans. In a type of frontal lobe epilepsy characterized by involuntary and stereotyped bursts of laughter (‘gelastic seizures’; [143]), the cingulate gyrus appears to be the most commonly disrupted site [144]. Indeed, electrical stimulation of the rostral ACC can evoke uncontrollable but natural-sounding laughter [143,145,146]. Besides controlling non-verbal vocalizations, the ACC is also important for speech. In humans, bilateral infarction of the ACC near the rostrum of the corpus callosum results in akinetic mutism [147,148], which over a longer period can resolve to speech characterized by monotonous intonation [149]. These clinical findings together with functional imaging in vocalizing humans [150] suggest that the ACC is involved in the emotional intonation of human speech [149].

    In addition to the ACC, focal electrical stimulation applied in several other telencephalic and diencephalic structures of the primate can elicit species-specific vocalizations. Such ‘evocative’ foci are found in the amygdala, the BNST (bed nucleus of the stria terminalis), the substantia innominata (including the basal nucleus of Meynert), nucleus accumbens, septum, preoptic area of the hypothalamus, hypothalamus, midline thalamus and zona incerta in the subthalamus [120,131]. Interestingly, stimulation in different limbic areas produced different call types, while lesions in the amygdala or hypothalamus suppressed distinct types of spontaneously uttered vocalizations [108,151].

    The relatively long latency of the vocal responses elicited from these sites (greater than 1 s) and their fast habituation are consistent with stimulation-induced affective changes rather than primary motor responses [120]. In fact, studies of freely moving squirrel monkeys that could either seek or avoid electrical stimulation support the idea that the vocalizations elicited from many of these limbic areas result from stimulation-induced changes in affective state [120]. A notable exception was the ACC, for which elicited vocalization and specific affective states were not correlated. Compared to the other limbic forebrain areas, the influence of the ACC on vocal behaviour thus seems to be more direct.

    (ii) The volitional articulation motor network

    The dominant vocal network of humans is a more recently evolved cortical vocal system that is largely distinct from the primordial vocal motor system. This modern cortical vocal system consists of cortical structures that are essential for human speech (figure 3). Although this system is also present in non-human primates, it is anatomically and functionally underdeveloped relative to the human cortical vocal system. The three major cortical components of this system are the laryngeal motor cortex, the supplementary motor area and prefrontal cortical regions that include Broca's area in the human brain.

    In humans, the laryngeal motor cortex (LMC), which is located in the face region of the primary motor cortex (M1), provides direct cortical control over volitional vocalizations, most notably speech. This control is exerted through projections to the phonatory motor neurons of the cranial nerve nuclei: the trigeminal motor nucleus (V), facial nucleus (VII), nucleus ambiguus (IX, X, XI) and hypoglossal nucleus (XII) [47,152,153]. These projections are functionally significant, as the ability to produce speech depends on the LMC [47].

    In humans, bilateral damage to the LMC causes complete loss of voluntary control over the speech apparatus, rendering patients unable to speak or sing [55,154]. Although such patients are occasionally able to utter non-verbal vocalizations, such as grunts, wails and laughs, presumably mediated by an intact primary brainstem vocal system, they cannot voluntarily modulate the pitch, intensity or the harmonious quality of their vocalizations [98]. In contrast, bilateral lesions of the LMC in nonhuman primates have no effect on calling behaviour, emphasizing that the pronounced role of the cortical vocal system in human speech is an evolutionarily recent enhancement [139,141,142,155].

    The human LMC is located in the posterior part of the primary motor cortex (M1, BA 4) [156] and may consist of two distinct laryngeal representations (figure 4) [157,158]. The first laryngeal representation, which is the likely homologue of the non-human primate LMC [159], is found at the ventral extreme of the orofacial motor cortex, a region where electrical stimulation can elicit or disrupt the production of a range of human vocalizations [160,161]. A more dorsal laryngeal representation in the dorsal part of the orofacial M1 has recently been located that may be specific to humans [157,158]. Electrical stimulation applied in this dorsal laryngeal representation results in an involuntary forced exhalation and prolonged utterance of vowel-like sounds [158].

    Figure 4. Putative evolution of the modern cortical vocal system (‘volitional articulatory motor network’) in primates with respect to the laryngeal motor cortex and its projections to phonatory motor nuclei. During the course of primate evolution (left to right), the laryngeal motor cortex (LMC; red concentric circles) shifted from the premotor cortex (PMC; area 6) in monkeys (a) and apes (b) to the primary motor cortex (area 4) in humans (c). This shift was accompanied by a gradual transition from an indirect (via the reticular formation; green colour) to a direct projection of the LMC to nucleus ambiguus (blue colour). In addition, the LMC of humans comprises not one, but two parts in the primary motor cortex. (Online version in colour.)

    The different locations of the (ventral) LMC between primate species suggest an anterior-to-posterior shift during primate evolution. In apes, the LMC region is positioned more anteriorly at the border between the primary motor cortex (M1) and the ventral premotor cortex (PMv, BA 6). In macaque monkeys and squirrel monkeys, the LMC sits even more anteriorly in the premotor cortex (PMv, BA 6), between the inferior arcuate sulcus rostrally and the subcentral dimple caudally (figure 4). Unlike the reliable vocal effects of LMC stimulation in humans, electrical stimulation in the LMC of apes and monkeys results only in vocal cord movements, but not in vocalizations [152,162–166].

    Important differences in connectivity also distinguish the human LMC from that of other primates. Only LMC neurons in humans make direct projections to nucleus ambiguus (figures 3, 4) [152,167–169]. In contrast, the monkey LMC is connected only indirectly with the nucleus ambiguus via the dorsal and parvicellular nuclei of the reticular formation of the brainstem [152,166,167]. Apes take an intermediate position and show a sparse monosynaptic pathway from the LMC to the nucleus ambiguus [169]. Another noteworthy distinction of the human relative to the macaque LMC is a nearly sevenfold stronger connectivity with somatosensory and inferior parietal cortices [170]. These enhanced LMC–parietal connections in humans likely allowed for more refined sensorimotor integration of information necessary for learned vocal production. Conversely, the monkey LMC has greater connectivity with the ACC, which may be important for the cortical initiation of innate vocalizations [98].

    An intriguing idea is that over the course of hominid evolution, the LMC shifted caudally from an ‘old’ motor cortex present in all primates to a phylogenetically ‘new’ motor cortex found only in humans [171]. Presumably, the precise, volitional control of laryngeal movements necessary for human speech is the result of a bipartite LMC, the LMC's direct access to laryngeal motoneurons and the LMC's enhanced intracortical connectivity [156,172]. In contrast, the LMC in monkeys may primarily serve evolutionarily conserved non-vocal laryngeal functions, such as breathing, coughing and swallowing [129].

    Of course, the LMC does not operate in isolation. Neuroanatomical tract tracing in the rhesus monkey reveals that the LMC is connected with several cortical motor regions [129]. These cortical regions include surrounding (non-laryngeal) parts of M1, the ventrolateral premotor cortex (vlPM), the caudal ventrolateral prefrontal cortex (vlPFC), which includes Broca's area in humans, the supplementary motor area (SMA) and the ACC.

    The SMA in the frontal agranular cortex is an important source of input to the LMC in primates that has acquired a prominent role in human speech production [129]. Since the SMA does not project to the PAG [173], but does form a direct projection to the LMC, it can be regarded as part of the volitional articulation motor network. Once defined as a single motor area, the SMA is now divided into the SMA proper (also called F3) and a rostral part termed preSMA (F6) [174]. Only SMA proper is intensely connected with M1, contains cortico-spinal projection neurons [175–177] and readily elicits body movements after electrical stimulation. The preSMA, instead, is heavily interconnected with regions of prefrontal cortex (PFC) [178,179] and is therefore not considered a premotor area [180].

    Various functional studies support a role for the SMA in human speech. For example, electrical stimulation of the SMA in humans reliably elicits vocalizations [181], whereas lesions in SMA severely reduce the motivation to speak [182,183]. Moreover, functional imaging shows that the SMA is involved in singing, word selection and word production [184–186]. In contrast, the SMA appears to play a much more modest role in vocalizations of other primates, because electrical stimulation of the SMA in monkeys does not elicit vocal utterances [120]. However, bilateral lesions in the SMA decrease the number of spontaneous vocalizations in monkeys, while also increasing the response latencies of instructed vocalizations [139,187]. Moreover, vocalization-related neuronal activity is found in the SMA prior to call onset [136].

    A key structure endowing humans with volitional speech control is Broca's area in the PFC. Broca's area classically comprises Brodmann areas 44 and 45 in the inferior frontal gyrus of the granular ventro-lateral PFC (vlPFC) [188]. Broca's pioneering work on the brains of aphasics revealed that areas 44 and 45, usually on the left side of the brain, are instrumental for producing speech and language [54], a finding that has been confirmed since then [189,190].

    The primate PFC is regarded as the central executive of the brain [191] and hosts a variety of cognitive functions necessary for the evolution of semantic [192–194] and syntactical [195–197] language functions in humans [128]. In macaques, areas 44 and 45 in the rostral parts of the vlPFC have been identified as an anatomical homologue of Broca's area in the human brain [198,199]. Whereas bilateral ablation of ventrolateral aspects of the frontal lobe was reported to have no significant impact on discriminatively conditioned vocal behaviour, the variation and lack of complete symmetry in lesion locations complicate interpretation of these negative results [141,142]. In contrast, it has been reported that bilateral removal of prefrontal and orbitofrontal regions (including areas 44 and 45) produced marked vocal deficits, often causing near total and permanent muteness [200]; this mutism may, however, partly be attributed to a disruption of social behaviour rather than a specific vocal deficit. Interestingly, anatomically precise electrical stimulation in area 44 of monkeys elicits orofacial and laryngeal movements, suggesting that this area might have originally enabled volitional control over orofacial actions, including those related to social communication [201], properties that one could readily imagine were exploited over hominid evolution to enable speech and verbal expression.

    In macaques trained to vocalize, neurons in vlPFC respond specifically in preparation of volitional calls [202,203], and the pre-vocalization activity of many call-related vlPFC neurons correlates with the acoustic parameters of the ensuing vocalization. Several aspects of this vocalization-related activity indicate that the vlPFC plays a prominent role in volitional call initiation. First, call-related neurons in vlPFC showed the strongest and earliest pre-vocal modulation compared to neurons in the ACC and SMA [136]. Second, neuronal modulation in the vlPFC was absent whenever the monkeys missed a cued vocalization, or when they vocalized spontaneously between trials. Third, and most importantly, vlPFC neurons showed a strong correlation between the onset of neuronal activity and the timing of vocal output: irrespective of the monkeys' call reaction times, vlPFC neurons showed ramping-onset activity approximately 1.2 s prior to the call [136]. These various observations point to a direct involvement of the vlPFC in forming a decision signal for initiating vocalization. A realistic scenario is that the vlPFC, which does not directly project to M1, gains control over the vocal motor network via the premotor cortex (PM), which in turn projects directly to M1 [176,204,205]. In fact, the corticobulbar (for orofacial and laryngeal movements) and corticospinal (for thoracic and diaphragm movements [160]) pathways that originate from both M1 and PM are ideal substrates for volitional vocal control.

    In new-world monkeys, the role of the frontal lobe in volitional vocal control has not yet been investigated, but there is emerging evidence for the general involvement of frontal cortical areas in vocal production [206]. Microelectrode recordings of prefrontal (PFC) and premotor (PMC) cortical areas during antiphonal calling have found vocalization-related activity that often precedes or is phase locked to the onset of vocal production [207,208]. In addition, immediate early gene studies in marmosets demonstrated vocal production-related expression particularly in dorsal PFC/PMC [209,210], while other studies have also found expression in ventral PFC [211]. In apes, functional imaging similarly showed activation in the chimpanzee left inferior frontal gyrus, a homologue to Broca's area in humans, during communicative gestural and vocal signalling [212]. Together, these studies suggest an involvement of the lateral frontal cortex in vocal production in non-human primates, with evidence for a role of the vlPFC in volitional vocal control in old-world monkeys (macaques).

    Although the vlPFC, and more specifically Broca's area, clearly plays a privileged role in human speech, this role may be more modulatory and preparatory in nature. A classical observation in this regard is the finding that electrical stimulation of Broca's area leads to speech arrest rather than speech production [189,213,214]. Moreover, direct cortical surface recordings in neurosurgical patients indicate that Broca's area is predominantly activated before the utterance of a speech element, but is silent during the corresponding articulation [215]. Thus, Broca's area may play a more indirect role in coordinating speech instead of a direct role in speech production [57,216]. In support of this hypothesis, cooling of Broca's area in awake neurosurgical patients slows speech without affecting articulation, whereas focal cooling in the LMC leads to slurring [217], consistent with the idea that Broca's area may control aspects of vocal timing without directly regulating speech articulation (figure 5).

    Figure 5. Cooling telencephalic vocal structures slows down vocal output in humans and songbirds. (a) Cortical sites on the left hemisphere that showed effects on speech timing (yellow colour) or deterioration of speech quality (blue colour) when a cooling probe (inset) was placed on Broca's area or primary motor cortex on the precentral gyrus. Cooling of Broca's area primarily affected speech timing, whereas speech quality was mostly affected during cooling of the primary motor cortex. Adapted from [217]. (b) Top panel: a Peltier device was used to cool down neurons in HVC of singing zebra finches. Bottom panel: sonograms recorded from a bird during the cooling experiment show that the song became progressively dilated (indicated by the percentage value to the right) with cooler temperatures relative to the control. Slightly increasing the temperature even mildly shortened the song (top sonogram). Adapted from [218]. (Online version in colour.)

    One idea is that Broca's area controls articulation by disinhibition of articulatory motor activity briefly before vocal output [219], which would predict that non-verbal, emotional vocalizations might emerge once the modulatory (and/or inhibitory) influence of the voluntary articulation network vanishes [220]. Indeed, after damage to Broca's area, non-verbal vocal utterances remain intact despite devastating impairments in speech and language production [54–57]. Moreover, patients with a clinical diagnosis of primary progressive aphasia develop abnormal laughter-like vocalizations that increasingly replace speech in the context of progressive speech output impairment leading to mutism, until ultimately laughter-like vocalizations are the only extended utterance produced by these patients [58]. Finally, some non-verbal vocal utterances are more common during conversational speech of some aphasic patients [221]. Together, these observations suggest that the volitional articulation motor network and primary vocal motor network may compete to some extent for access to the brainstem vocal–respiratory network.

    (b) Forebrain contributions to vocalizations in rodents

    (i) Descending inputs to the nucleus ambiguus

    In contrast to the primate, the consensus from numerous anatomical studies is that laryngeal motor neurons in rodents receive no direct (monosynaptic) input from the forebrain, and instead receive direct inputs only from medullary, pontine and midbrain regions. For example, injections of horseradish peroxidase into the nucleus ambiguus of the rat retrogradely label afferent neurons extensively in the medulla and pons [222]. These afferents to the nucleus ambiguus arise from the lateral aspect of the medial parabrachial nucleus, the Kölliker-Fuse nucleus, the reticular formation and the nucleus of the solitary tract, as well as from midbrain structures such as the red nucleus and the superior colliculus, but not from any neurons rostral to the midbrain [222]. More recently, transsynaptic viral tracers injected into the laryngeal muscles of rats and mice have helped to map central regions that may be involved in vocalization [10,223–225]. Specifically, pseudorabies virus (PRV, an alpha-herpesvirus) injected into the laryngeal muscles first infects laryngeal motor neurons (i.e. first-order neurons) and then jumps retrogradely into neurons that are directly (second-order) or indirectly (third-order) connected to these motor neurons [223]. Although highly sensitive, PRV and other herpes simplex virus (HSV)-based retrograde synaptic tracers ‘jump’ synapses rapidly yet variably, and thus this method must be carefully monitored and accompanied by internal controls to establish which labelled neurons synapse directly onto the motor neurons [223,225–227].

    Indeed, at the shortest times at which central labelling can be detected after PRV injection into the laryngeal muscles, central labelling is only found in the nucleus ambiguus, a first-order structure, and in second-order structures including the nucleus retroambiguus and the nucleus of the solitary tract [223,224]. Notably, third-order structures, such as the midbrain PAG, are only labelled at longer latencies [223], consistent with anterograde tracing studies that fail to reveal a direct projection from the PAG to the laryngeal motor neurons [113]. And although various forebrain neurons, including hypothalamic, amygdalar and cortical neurons, are also labelled at longer latencies following PRV injections, these almost certainly reflect higher-order and thus relatively indirect projections to the phonatory motor neurons [10,223]. A further caveat is that these retrograde tracing studies only reveal pathways that ultimately access laryngeal motor neurons, but cannot distinguish whether these pathways contribute to vocalization or to other behaviours that engage the intrinsic laryngeal muscles, such as respiration or swallowing.

    (ii) Forebrain afferents to PAG-USV neurons in mice

    As previously discussed, the multifunctional nature of the PAG presents a challenge to gaining further insight into how the forebrain modulates the vocal ‘gate’ in the PAG. A major recent advance in this regard has been to combine genetic tagging of PAG-USV neurons in the mouse with modified rabies virus tracing methods (K. Tschida 2019, unpublished). This combinatorial approach enables the identification of neurons that make monosynaptic inputs onto PAG neurons that gate USVs, rather than afferents to the PAG more broadly, effectively focusing the hunt for forebrain neurons that are most likely important to vocal communication. Consistent with prior analysis of vocalization-related forebrain afferents to the PAG, this approach emphasizes the high degree of forebrain convergence onto PAG-USV neurons, including from neurons in the cingulate cortex, motor (M1 and M2) cortex, insular cortex, the BNST, nucleus accumbens, central amygdala, ventral pallidum, the preoptic area and the lateral hypothalamus.

    Such highly convergent architecture presumably reflects the extensive and often competing demands to either generate or suppress vocalizations as a function of specific social, reproductive and environmental cues. These results also elevate the mouse as a suitable model for understanding how these various cues are integrated in the forebrain to ultimately engage the vocal machinery of the brainstem in the service of social and sexual signalling, as well as territorial defence. Thus, a profitable future line of enquiry will be to monitor and manipulate the activity of various forebrain neurons afferent to PAG-USV neurons during naturalistic encounters in which USVs figure prominently, including male–female courtship, and same-sex as well as adult–pup interactions.

    (iii) The mouse cortex plays little or no role in vocal patterning

    Although there are many forebrain inputs to the vocal PAG, their vocal functions are either less obvious or simply not yet known. Notably, mice with bilateral lesions of the motor cortex or that have been genetically engineered to lack a cortex altogether still produce a normal repertoire of USVs [10,228]. And while slight differences in the relative abundance of certain syllables can be detected in genetically decorticate mice [228], and syllable variability in mice with motor cortical lesions may increase very slightly [10], the cortex of the mouse, as in monkeys and most other mammals outside of humans, simply does not appear to be necessary for vocal patterning. And while the vocal role of the many other forebrain inputs to the PAG-USV neurons has yet to be tested, we speculate here that in rodents, as in most mammals, the brainstem contains all of the neural machinery necessary to produce a normal and complete repertoire of innate vocalizations.

    (iv) The mouse cortex may play a role in the contextual control of vocalization

    Independent of a role in vocal patterning, the forebrain of the mouse may still play an important role in motivational and emotional aspects of vocalization, for example by regulating vocal output as a function of social or reproductive context. In fact, support for a motor cortical role in the social control of vocalization comes from studies of a wild species of muroid rodent, the short-tailed singing mouse (Scotinomys teguina) [229,230]. Male Scotinomys produce long strings (tens) of voiced frequency-modulated syllables, typically as part of a ‘call and response’ behaviour with other males. When two males first encounter each other, they take turns at singing, minimizing temporal overlap between their songs. Although male Scotinomys can still produce songs with normal patterns following motor cortical lesions, they lose the ability to sing antiphonally in response to hearing another male's song, suggesting that the role of the motor cortex is to help control vocalization as a function of socially salient auditory cues [230].

    Although male Mus do not engage in such antiphonal calling with other males, they do produce USVs in the presence of females or female odorants [125]. This ‘sexual’ gating of vocalization implicates reproductive structures, such as the preoptic area (POA) and the lateral hypothalamus, both of which project to the PAG and where electrical stimulation evokes vocalizations in many mammalian species [98]. Moreover, in socially isolated male mice, optogenetically stimulating POA axons in the caudolateral PAG is sufficient to evoke USVs (V. Michael and K. Tschida 2019, unpublished observations), suggesting that POA activity during social encounters provides a contextual signal to the PAG that is sufficient to trigger USV production. In the future, a similar approach can be applied to other forebrain afferents to the PAG to better understand their role in vocalization, whether different afferents recruit different types of vocalizations, or conversely act to suppress vocalization in response to stress or threat.

    (c) Forebrain contributions to birdsong: the song system

    Other than the essential cortical involvement in human speech, the executive role of the telencephalon in producing learned birdsong has few parallels. Following the revolutionary insights of Nottebohm and his colleagues four decades ago [94,231], a slew of functional and anatomical studies have established that birdsong results from the interaction between a specialized network of forebrain nuclei and the more ancestral vertebrate brainstem vocal–respiratory network described in previous sections (for extensive reviews, see [41,43,45,46,76,94,231–233]). The forebrain components of this ‘song system’ can be further divided into a song motor pathway, which is obligatory for singing and song learning, and an anterior forebrain pathway that is necessary for song learning but may be largely dispensable for producing songs that a bird has already learned (figure 2). For the purposes of this review, we focus largely on the role of the song motor pathway in adult song production.

    Although precise homologies between the avian and mammalian forebrain are still a matter of fascinating debate, the song motor pathway consists of a serially connected chain of cortical nuclei (nucleus interface of the nidopallium (NIf), HVC and RA), the latter of which projects directly onto the syringeal motor neurons, the respiratory premotor complex (nucleus retroambiguus and parambiguus), as well as DM (figure 2) [94,103,234]. This direct projection of RA onto the vocal motor neurons may distinguish songbirds from other bird species that only produce innate vocalizations as well as all non-human mammals, where forebrain projections to the vocal motor neuron pool are typically di- or polysynaptic. In fact, bilateral lesions of HVC and RA abolish learned vocalizations in songbirds, pointing to heavy investment of telencephalic control in birdsong [94,95]. Further, songbirds with HVC lesions can still ‘babble’ and produce other innate calls, paralleling the specialized role of the human cortex in the utterance of speech but not innate vocalizations [94,95]. However, bilateral lesions placed in RA do disrupt antiphonal calling in zebra finches [235], an expressive deficit similar to that reported in short-tailed singing mice, a parallel property that may point to a generalized role for motor cortex in regulating innate vocalizations as a function of social context.

    (i) Top–down versus recurrent models of song control

    The telencephalic control of birdsong has proven to be an extremely rich system for experimentalists to mine, resulting in wide-ranging studies employing correlative physiology and imaging, as well as causal manipulations such as electrical stimulation, focal cooling and reversible inactivation [95,218,236–244]. Taken together, these various studies underscore that HVC and RA directly contribute to the temporal and spectral patterning of birdsong. Of great current interest is whether song results from the ability of HVC and RA to effectively override or supplant the brainstem vocal and respiratory pattern generators [218,245], or instead arises from recurrent interactions between the forebrain and the brainstem [76,246]. In this latter model, song patterning results not only from the descending projections RA makes to the vocal–respiratory brainstem, but also from the ascending projections that certain neurons in nucleus parambiguus make with NIf and HVC via a thalamic intermediary, the nucleus Uva. As such recurrent architecture is present in most mammalian forebrain–brainstem circuits, resolution of this debate is likely to have ramifications that extend beyond the neural control of birdsong.

    (ii) Temporally precise bursts of activity underlie song motor codes

    Single-unit extracellular recordings and, in some cases, intracellular recordings, made in the HVC of singing birds reveal that neural activity at these sites can precede song onset by tens to hundreds of milliseconds and is insensitive to auditory feedback perturbations, both of which are features consistent with premotor activity [236,239,242,247,248]. Moreover, during song, HVC neurons that project to RA (i.e. HVCRA neurons), as well as RA neurons that innervate the brainstem, show clock-like precision and stereotypy of firing patterns that are time-locked to individual notes and syllables [239,241]. In the most closely studied species, the zebra finch, individual HVCRA neurons typically fire only one brief (approx. 10 ms) burst of high frequency (approx. 400 Hz) action potentials during an entire polysyllabic motif, with different HVCRA neurons firing at different times [239,247], possibly tiling the entire motif [245]. While some debate remains as to the uniformity of this neural tiling of song [249], the most recent evidence involving tour-de-force in vivo multiphoton imaging in singing finches indicates that the temporal coverage of the motif provided by large populations of HVCRA neurons is both even and complete [250]. One interpretation of such even tiling of the song is that HVCRA neurons are serially connected in a feed-forward excitatory architecture characteristic of a synfire chain [243,245].

    In contrast to HVCRA neurons, RA neurons in the zebra finch fire multiple (8–10) action potential bursts per motif [241,245], presumably reflecting a highly convergent input they receive from HVCRA neurons that fire at different times in the motif [245]. Nonetheless, the singing-related activity patterns of RA neurons are also remarkably reproducible across motifs and locked with sub-millisecond precision to acoustic features of individual syllables [251]. A noteworthy feature of this transformation in how song is encoded by HVCRA and RA neurons is its efficiency for song learning [252], as error signals arising from a ‘mistake’ made at one time in the song only affect HVC>RA synapses at that single time point, and do not propagate inadvertently to other times elsewhere in the song where the performance is ‘correct’.

    The stereotyped structure and sequence of syllables sung by adult male zebra finches is a boon to perturbation experiments in singing birds. Thus, electrical microstimulation in HVC can interrupt and sometimes entirely reset a bird's motif [237,240], consistent with a role for HVC in pattern generation. In a pioneering study using this approach, similar stimulation applied in RA could interrupt an ongoing note, but not reset the motif, supporting a model in which pattern generators are hierarchically organized, with HVC specifying global patterning of the motif and RA neurons encoding the structure of individual notes or syllables [237]. However, subsequent experiments have elicited motif resetting not only by focal stimulation applied in RA, as well as HVC, but also by stimulating in regions of the rostroventral medulla that contain nucleus parambiguus [240]. Along with the finding that bilateral lesions placed in Uva severely and permanently disrupt song, these more recent microstimulation studies lend support to a recurrent model of song production.

    (iii) Focal cooling of HVC slows down song

    Yet another highly inventive approach has been to focally cool HVC in singing zebra finches with a miniature Peltier device [218] (figure 5). A quite extraordinary set of behavioural observations show that cooling HVC bilaterally over a moderate temperature range (less than 10°C) slows song timing without affecting the spectral or amplitude features of the song [218]. The stretching effect is close to uniform across timescales ranging from milliseconds to seconds, with single syllables and hence the entire motif stretching approximately 3% per degree of cooling [218,244]. Whereas cooling stretches syllable gaps less than syllables, the remarkably selective effects of temperature on timing indicate that a major function of HVC is to encode—quite selectively—the temporal aspects of birdsong. As qualitatively similar effects on human speech result from focal cooling of Broca's region [217] (figure 5), this regionalized control of vocal timing in humans and songbirds may reflect a common strategy for encoding temporal features of learned vocalizations.

    While the effects of HVC cooling on song timing have been interpreted as experimental confirmation that HVCRA neurons are linked together to form feedforward excitatory (synfire) chains, some important caveats remain. First, the effects of temperature on song timing are quite modest, in that the neural machinery comprising a synfire chain (i.e. action potentials and chemical synapses) shows a much steeper temperature dependence (10% per degree of cooling) [244]. In other words, if synapses and action potentials in HVC constitute a synfire chain for song timing, cooling HVC by 10°C should double the length of a motif, rather than stretching it by slightly less than a third, as actually observed. Second, cooling RA [218] or Uva [244] with implanted Peltier probes also stretches the bird's song. Whereas the effects of RA cooling on song timing were largely attributed to indirect cooling of HVC by the implanted probe, song still stretched when Uva was cooled and the radiative cooling effect on HVC from the implanted probes was compensated for by simultaneously warming HVC with a surface Peltier [218]. Finally, while modest cooling stretches song more or less uniformly, more extreme cooling of HVC applied in canaries causes individual syllables to break into smaller fragments [244]. Notably, breaks form when a single expiratory pulse of air underlying the original syllable is interrupted by a short inhalation [253], a process that presumably reflects the interaction of respiratory premotor networks in the brainstem with HVC.
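    The mismatch described in the first caveat can be made concrete with a short calculation. The sketch below is illustrative only: it assumes that stretch grows linearly with the amount of cooling, and it uses a hypothetical baseline motif duration of 600 ms. The function name and the baseline value are not taken from the cited studies; only the approximate 3% and 10% per degree figures and the 10°C example come from the text above.

```python
def stretched_duration(baseline_ms, pct_per_degree, degrees_cooled):
    """Duration after cooling, assuming a linear stretch per degree (an assumption)."""
    return baseline_ms * (1.0 + (pct_per_degree / 100.0) * degrees_cooled)


# Hypothetical baseline motif duration, chosen only for illustration.
BASELINE_MOTIF_MS = 600.0
DEGREES_COOLED = 10.0

# Approximate behavioural stretch reported for HVC cooling (about 3% per degree).
observed_ms = stretched_duration(BASELINE_MOTIF_MS, 3.0, DEGREES_COOLED)

# Stretch expected if timing followed the steeper temperature dependence
# quoted for action potentials and synapses (about 10% per degree).
biophysical_ms = stretched_duration(BASELINE_MOTIF_MS, 10.0, DEGREES_COOLED)

observed_factor = observed_ms / BASELINE_MOTIF_MS
biophysical_factor = biophysical_ms / BASELINE_MOTIF_MS

print(f"Observed-rate stretch factor:    {observed_factor:.2f}")
print(f"Biophysical-rate stretch factor: {biophysical_factor:.2f}")
```

    Under these assumptions the behavioural rate lengthens the motif by roughly 30%, slightly less than a third, whereas the steeper biophysical rate would double it, which is the discrepancy the caveat points to.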

    (d) Duetting

    Duetting is a special form of antiphonal singing between two individuals, most typically mated partners. Duetting has been described for many species of birds and several species of gibbons [254,255]. Duets comprise a vocalization initiated by one individual that is answered by its partner such that their vocalizations tightly overlap or alternate. Duetting therefore requires a special coordination of vocal timing and song type between partners [256]. Unlike in most other primates, including the great apes, duetting occurs in most species of gibbons, where mated pairs characteristically combine their songs in a relatively rigid pattern to produce coordinated duet songs [255].

    The production of such highly coordinated antiphonal singing involves fine-scale adjustments of the vocal elements that the partners produce through the course of the duet. Since these adjustments require integration of vocal production and perception on a scale of milliseconds, highly developed pathways linking the auditory and vocal pathways can be expected. While the neuronal correlates of duet singing in gibbons are unknown, recent studies provide a first glimpse into brain mechanisms for antiphonal calling in songbirds. Perhaps not surprisingly, given the precise interplay between hearing and vocalization, the song production system is directly involved in antiphonal calling. In a particularly clever experiment, male and female zebra finches learned to precisely time their calls to avoid the ‘jamming’ calls of a vocal ‘robot’, but were no longer able to avoid jamming after RA lesions [235]. This indicates that the descending song-production pathway functions as a general-purpose sensorimotor communication system for both antiphonal calls and the acquisition and production of learned songs. Interestingly, calls in males and females are innate, raising the possibility that the song system initially evolved to facilitate antiphonal calling, an auditory–vocal capacity that may have led eventually to a more flexible and profound capacity for vocal imitation.

    An arguably more sophisticated form of duetting involves the elaborate song duets produced by tropical wrens. A pioneering neurophysiological recording study performed in anesthetized plain-tailed wrens (Pheugopedius euophrys), a songbird species in which male and female breeding pairs sing well-coordinated duets, found that HVC neurons were not only responsive to playback of the bird's own part of the duet but also to the partner's vocalizations [257]. Most importantly, neurons in the wren's HVC responded most strongly to presentations of the complete duet sequence, suggesting that auditory responses in HVC to the partner's half of the duet might be important for the precise coordination of vocal premotor activity during duetting [257]. That HVC in duetting songbirds might rapidly switch between premotor and auditory roles was a surprising idea, because HVC activity in awake (nonduetting) songbirds is mainly premotor, and responses of HVC neurons to auditory stimulation are largely or entirely suppressed during singing [242,248,258,259]. Neural recordings in the HVC of actively duetting birds were needed to test this ‘multiplex’ model.

    Indeed, recent multi-unit recordings in freely moving, duetting white-browed sparrow-weavers (Plocepasser mahali), another duetting songbird species, challenge this multiplex model [260]. In actively duetting sparrow-weavers, the neural activity in HVC was exclusively premotor during singing, just as in non-duetting birds. However, the auditory information generated by the duet partner somehow alters the temporal parameters of HVC activity in the duet-initiating bird, thus enabling the birds to alternate their vocalizations [260]. Such studies underscore that simultaneous measurements of brain activity in pairs or groups of individuals during natural vocal communication are indispensable for understanding the neural mechanisms that underlie duetting and other forms of antiphonal vocalizations, including call and response singing and conversational speech in humans.

    6. Concluding remarks

    A comparative approach using primates, rodents, other mammals and birds is highly advantageous for understanding the central and peripheral control of vocalization. Each group brings distinct advantages to such an understanding, ranging from the close homology to human vocal circuits provided by vocal pathways in non-human primates, to genetic tractability of the mouse and the close parallels to human speech learning afforded by studies of birdsong learning. The increasing ease of applying genetic, physiological and optical methods across different animal models should increase the relevance and power of these comparative approaches.

    Human speech is a tremendously complex behaviour but has clear antecedents in the vocal behaviours of other vertebrates. The conservation of the vertebrate brainstem and indeed much of the subcortical forebrain means that human speech is built on a general platform for vocalization that solved the fundamental problem of how to integrate vocal motor and respiratory activity. Speech and birdsong both require cooperation and coordination between motor cortical elements specialized for learned vocal control and these more ancestral brainstem vocal–respiratory networks. While much work remains, studies in songbirds are beginning to provide the clearest insights into how such forebrain–brainstem coordination is achieved.

    An important goal will be to extend dynamical manipulations, such as focal cooling of cortical regions, to studies of vocalizations in non-human primates and rodents. Such perturbation experiments are necessary to move beyond lesion- and inactivation-based ‘necessity’ experiments. The ability to produce normal vocal patterns in the absence of a cortex does not rule out cortical contributions to vocal modulation, contextual engagement of vocalization and associative processes in which animals may learn to vocalize in response to conditioning.

    The rudimentary capacities that gave rise to our highly flexible vocal abilities may be found in context-dependent vocalizations that can be produced by songbirds, non-human primates and certain rodents. That is, the transition from a purely affective vocal gating mechanism to the production of antiphonal vocalizations triggered by the sound of another conspecific's vocalizations may represent a bridge to more complex auditory–motor interactions that ultimately form the foundation for vocal learning.

    In a similar vein, the ability to learn through conditioning to vocalize in response to an arbitrary sensory stimulus, as has been achieved in non-human primates, may provide a stepping stone for understanding how more complex associations between context and vocal behaviour are achieved. Therefore, a useful goal will be to systematically explore how widespread the capacity for conditioned vocalizations may be in other mammals and birds.

    Data accessibility

    This article has no additional data.

    Authors' contributions

    Both A.N. and R.M. wrote this review.

    Competing interests

    We declare we have no competing interests.


    Funding

    R.M. was funded by grants NIH R01 DC013826, MH117778, NS099288 and NSF 1354962. A.N. was funded by grants DFG NI 618/4.1 and NI 618/6.1.


    One contribution of 15 to a theme issue ‘What can animal communication teach us about human language?’.

    Published by the Royal Society. All rights reserved.