Philosophical Transactions of the Royal Society B: Biological Sciences
Review articles

Multilevel rhythms in multimodal communication

Wim Pouw

Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands

Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands

Shannon Proksch

Cognitive and Information Sciences, University of California, Merced, CA, USA

Linda Drijvers

Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands

Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands

Marco Gamba

Department of Life Sciences and Systems Biology, University of Turin, Turin, Italy

Judith Holler

Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands

Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands

Christopher Kello

Cognitive and Information Sciences, University of California, Merced, CA, USA

Rebecca S. Schaefer

Health, Medical and Neuropsychology unit, Institute for Psychology, Leiden University, Leiden, The Netherlands

Academy for Creative and Performing Arts, Leiden University, Leiden, The Netherlands

Geraint A. Wiggins

Vrije Universiteit Brussel, Brussels, Belgium and Queen Mary University of London, UK

Queen Mary University, London, UK

    Abstract

    It is now widely accepted that the bulk of animal communication is conducted via several modalities, e.g. acoustic and visual, either simultaneously or sequentially. This is a laudable multimodal turn relative to traditional accounts of temporal aspects of animal communication which have focused on a single modality at a time. However, the fields that are currently contributing to the study of multimodal communication are highly varied, and still largely disconnected given their sole focus on a particular level of description or their particular concern with human or non-human animals. Here, we provide an integrative overview of converging findings that show how multimodal processes occurring at neural, bodily, as well as social interactional levels each contribute uniquely to the complex rhythms that characterize communication in human and non-human animals. Though we address findings for each of these levels independently, we conclude that the most important challenge in this field is to identify how processes at these different levels connect.

    This article is part of the theme issue ‘Synchrony and rhythm interaction: from the brain to behavioural ecology’.

    1. Introduction

    The rhythms animals can sustain in communicative perception and action characterize in great part their social-ecological niche. It is only recently that disparate research fields have focused on the study of temporal aspects of communication as a truly multimodal process [1–3]. Lessons about the different scales or levels at which multimodal processes happen are, however, still scattered over different fields, such as psycholinguistics [3], neuroscience [4] and evolutionary biology [5]. The goal of this paper is to align some of the important findings of these fields concerning the different ways in which the brain, body and social interaction each contribute uniquely to the temporal structure of multimodal communication (see figure 1 for an overview). Although we overview findings at each level (neural, body, social) independently, we hope to stimulate investigation into potential interactions between levels. We provide some broad terminology for the phenomenon of multilevel rhythm of multimodal communication (§2), and then overview rhythmic multimodal processes on the neural-cognitive (§3), the peripheral body (§4) and the social interactional level (§5).

    Figure 1. Multilevel rhythm in multimodal communication. Graphical overview of how each level contributes uniquely to the rhythms sustained in multimodal communication. Figures are adapted from [6,7] and inspired by Gilbert Gottlieb's (1929–2006) view on epigenesis.

    2. Concepts and terminology

    Multimodal processes interest researchers from largely disparate fields and consequently terminology varies [1,5,8], where related meanings potentially get lost in translation. In box 1, we have marked the terms and their senses that occur throughout our overview. This glossary also aims to capture a very general meaning of specialist terms offered to address a particular process in perception or production, or at a neural, structural or behavioural level. The definitions are as general as possible, for instance, so as to underline a continuity of the perception and production of multimodal signals or so as to include phenomena not traditionally treated as multimodal in nature. For example, in sign languages, both the hands as well as facial and labial expressions are combined in complex utterances [9]. Such complex signs are designed to be received through one sensory channel and are thus unimodal by common definitions (but see [10]). In our view, signed languages are an example of a multimodal production in virtue of combining several otherwise independent signalling modes/signal features. Similarly, neural processes can be multimodal in our view, in virtue of coupling neural ensembles that independently would be tuned to differently structured information in the environment. Note that we cannot address all the rich and varied (temporal) functions of complex multimodal signalling [5,8,10]. But in our review, the common thread resonates with a recent overview by Halfwerk et al. [10] who suggest that multimodal signalling functions are not exhausted by simply (i) providing redundant backup information or (ii) combining multiple independent messages. Instead, what is central to the temporal functioning of multimodal systems is that the resulting perception or production of a signal ‘is qualitatively different from the sum of the properties of its components' [10, p. 2], i.e. has emergent properties [8].

    Box 1. Definitions.

    | general definition of phenomenon | term | context of term | example |
    |---|---|---|---|
    | a distinct measurable aspect of a system, which can be measured independently of other aspects | component | ethology | frequency or duration of a signal; an intellectual instance determining behaviour |
    | | component | mathematics; electronic engineering (EE) | partial at frequency x; regions of energy concentration |
    | | feature | EE; computer science (CS) | spectral centroid; signal onset/offset; duration of a signal |
    | | feature | music; current paper | pitch; fundamental frequency; rhythm; harmony |
    | unitary communication event X which is informative about state of affairs Y to a receiver (1) and/or producer (2) | cue (1) | ethology | size of an animal, not intentionally communicated |
    | | natural signs (1) | (Peircian) semiotics | footsteps in the sand, not intentionally communicated |
    | | sign (1 and 2) | (Peircian) semiotics; current paper | word or gesture, intentionally communicated; understood in a three place relation of sign, referential target, and the user of the sign |
    | sensory and/or effector communication channel conventionally treated as functionally separable from others | modality | neuroscience; current paper | specific neural ensembles associated with processing of a specific sensory channel or structure |
    | | modality | psycholinguistics; psychomusicology; ethology; current paper | audition; vision; touch (usually ascribed to senses of the receiver; the receiver processes light signals via the sense of vision) |
    | | mode | movement science; current paper | whispering, phonating; in-phase, anti-phase synchrony; resonance; punching, kicking |
    | a measurable aspect of a producing system, changing in time, which is used by a receiver system | signal | mathematics; EE; CS; current paper | frequency, voltage, amplitude |
    | | signal | ethology | a (sequence of) vocalization(s), or movement(s), etc. intentionally produced for a receiver, e.g. a specific mating call |
    | informational, temporal, and/or mechanical coupling between two or more measurable aspects, the coupling of which benefits communicative purposes; the benefit can be for the recipient (1) and/or the producer (2) | multimodal cue (1) | ethology; current paper | information about body movement or size from vocal patterning; indexical cues |
    | | multimodal signal (1 and 2) | ethology; psycholinguistics; current paper | sonic communication with facial and/or manual gesture |
    | | multi-component signal (1 and 2) | ethology | combined vocal and visual signalling |
    | | coordination of modes (1 and/or 2) | movement science; current paper | entrainment of neural ensembles for sensory integration; coordination of respiratory, jaw, and articulatory modes for speaking; gesture (person 1) and speech (person 2) interactions |

    3. Neural level: multimodal neural-cognitive processes

    Here, we present an overview of how temporal coupling in the production and perception of multimodal signals can be constrained by neural ensembles that are independently tuned towards specifically structured information in the environment. In their multimodal arrangement, they yield unique stabilities for tuning to the rhythms of multimodal communication. Furthermore, some neural ensembles are uniquely specialized to attune to multisensory information.

    When integrating a cascade of sensory signals to form a unified, structured percept of the environment, the brain faces two challenges. First, integrating different sensory signals into a unified percept relies on solving the ‘binding problem’: whether signals need to be integrated or segregated. Second, these sensory signals require integration with prior and contextual knowledge to weigh their uncertainty.

    The neural integration of multiple sensory signals is describable at several neural levels and measurable using wide-ranging methods (e.g. single-unit recordings, optogenetics, EEG, MEG, fMRI, combined with psychophysical experiments [11–13]). Although the potential multisensory integration mechanisms are debated, the integration likelihood of two signals seems highly dependent on the degree of spatio-temporal coherence between those signals: unisensory signals that are closer in time and space have a greater likelihood of being integrated (cf. [3] and §5). Both human and non-human animal research demonstrates that multisensory neurons in the superior colliculus respond more robustly to spatio-temporally congruent audio-visual cues than to individual sensory cues [14–16]. For example, in macaques (Macaca mulatta), single-unit activity measurements in one specific area in the superior temporal sulcus (anterior fundus) show unique sensitivity to facial displays when temporally aligned with vocal information, while other areas (anterior medial) are sensitive to facial displays alone [17]. Behavioural evidence of multisensory integration is shown in the territorial behaviour of dart-poison frogs (Epipedobates femoralis), which show more aggression towards conspecifics when auditory and visual cues are sufficiently spatio-temporally aligned [18]. Note, though, that multimodal temporal alignment need not entail synchronization but can specifically involve structured sequencing (i.e. alignment at a lag). This is evidenced by research on a taxon of flycatchers (Monarcha castaneiventris) that is uniquely responsive to long-range-emitted song followed by seeing plumage colour of potential territorial rivals, as opposed to their reversely ordered, synchronized or unimodal presentation [19]. Integration by temporally aligned presentation can be a developmentally acquired disposition, as research in cats shows that the development of multisensory integration in the superior colliculus is dependent on exposure to spatio-temporally coherent visual and auditory stimuli early in life [12].

    Although the lower level and higher level multimodal integration mechanisms are not well understood, both feedback and feed-forward interactions between early and higher level cortices might be relevant for integration. Specifically, it has been hypothesized that synchronized neural oscillations provide a mechanism for multisensory binding and selecting information that matches across sensory signals [20]. Here, coherent oscillatory signals are thought to allow for functional connectivity between spatially distributed neuronal populations, where low-frequency neural oscillations provide temporal windows for cross-modal influences [21]. This synchronization can occur through neural entrainment and/or phase resetting, which might be relevant for phase reorganization of ongoing oscillatory activity, so that high-excitability phases align to the timing of relevant events [21]. New methods, such as rapid invisible frequency tagging [22–24], might clarify how multisensory signals are neurally integrated, and what the role of low-frequency oscillations is in this process over time. Moreover, novel approaches focusing on moment-to-moment fluctuations in oscillatory activity combined with methods with an increased spatial resolution (e.g. ECoG/depth-electrode recordings) could significantly advance our knowledge of the role of oscillatory activity in routing and integrating multisensory information across different neural networks [21]. This will be especially relevant in more complex, higher level multimodal binding scenarios, such as (human) communication.
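
    The idea of phase resetting described above can be illustrated with a toy simulation. The following is a minimal sketch, not an analysis method taken from the cited studies: it assumes that an ongoing oscillation has an arbitrary phase on each trial before an event, and that the event resets the phase to a common value with some jitter. The function names, the number of trials and the jitter value are illustrative assumptions. Phase alignment is quantified as the length of the mean unit phase vector across trials (often called inter-trial coherence).

```python
import cmath
import math
import random


def inter_trial_coherence(phases):
    """Length of the mean unit phase vector: near 0 for random phases, 1 for identical phases."""
    mean_vector = sum(cmath.exp(1j * p) for p in phases) / len(phases)
    return abs(mean_vector)


def simulate_phase_reset(n_trials=200, jitter=0.2, seed=1):
    """Return (coherence before the event, coherence after the event) across simulated trials."""
    rng = random.Random(seed)
    # Before the event: the ongoing oscillation has an arbitrary phase on every trial.
    before = [rng.uniform(-math.pi, math.pi) for _ in range(n_trials)]
    # After the event: phase is reset to a common value (0 rad) plus Gaussian jitter (assumed).
    after = [rng.gauss(0.0, jitter) for _ in range(n_trials)]
    return inter_trial_coherence(before), inter_trial_coherence(after)


itc_before, itc_after = simulate_phase_reset()
print(f"coherence before reset: {itc_before:.2f}, after reset: {itc_after:.2f}")
```

    In this sketch, coherence is low before the event and close to 1 after it, which is the signature one would expect if high-excitability phases become aligned to the timing of relevant events.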

    Communicative signals in naturalistic settings arguably include multiple features that work together to maximize their effectiveness. Different sensory modalities may operate at different timescales, with specific well-matched combinations of features across modalities, leading to common cross-modal mappings that are intuitively associated (e.g. visual size and auditory loudness, cf. [25–27]). Prominent well-matched cross-modal mappings (see §§4 and 5) are sensorimotor mappings: signals transmitted to, from, or within visuomotor and auditory-motor systems. Given the high sensitivity of the auditory system for periodic signals aligned with motor periodicities [28,29], auditory signals often entrain movement, with examples seen in various kinds of joint action (e.g. marching or other timed actions). Less commonly, visual signals serve this purpose, as seen in musical conductors. Moreover, perception of both auditory and visual rhythms shares neural substrates with the motor system in terms of timing mechanisms [30]. While the auditory modality seems better suited than the visual modality to guide movement [31], it appears that within different sensory modalities, different features may be better suited to cue movement [32]. For example, movement is most easily cued by discrete events in the auditory domain (e.g. beeps), followed by continuously moving objects in the visual domain (e.g. moving bars) [32]. For discrete visual stimuli (e.g. flashes), or continuous auditory stimuli (e.g. a siren), sensorimotor synchronization is less stable (see for similar results in audio-visual speech: [33]). By contrast with humans, rhesus macaques (Macaca mulatta) more easily synchronize to discrete visual cues [34], perhaps due to weaker audiomotor connections in the macaque brain [35]. These findings indicate that multimodal perception is not simply a matter of adding more modalities, but rather of the combination of temporal structure and signal content, which affects behavioural performance and neural activations [36,37]. Moreover, compelling arguments based on multimodal mating signals in a range of species, as reviewed by Halfwerk et al. [10], suggest that exactly this integration of signals, leading to a multimodal percept rather than a main and a secondary modality, is what makes them informative.

    Behavioural and neural studies show that temporal structures in one sensory domain can affect processing in another. Examples are auditory [38] or even multisensory rhythmic cues such as music or a metronome [28] not only regularizing movement (i.e. changing motion trajectories as compared to uncued movements), but also entraining visual attention [37,39], by increasing visual sensitivity at time points predicted to be salient by an auditory stimulus. The neural underpinnings of such interactions are largely unclear. Music-cued versus non-cued movement leads to additional neural activation in motor areas, specifically cerebellum [40,41], suggesting that the neural activations related to multimodal processing are synergetic. This may explain findings of enhanced learning with multimodal cues, for instance when auditory feedback of movement (or sonification) is provided [42,43]. Even when multimodal embedding of motor learning does not show clear behavioural increases, differences in learning-related neural plasticity were reported for novices learning a new motor sequence to music as compared to without [44], suggesting that the learning process is implemented qualitatively differently [45].

    Taken together, different sensory modalities, and the features embedded in these signals, have different sensitivities for specific timescales, making some features especially suitable for cross-modal combinations. When investigating features that naturally combine, behavioural and neural responses emerge which amount to more than a simple addition of multiple processes.

    4. Body level: multimodal signalling and peripheral bodily constraints

    Understanding rhythmic multimodal communication also requires a still underdeveloped understanding of peripheral bodily constraints (henceforth biomechanics) in the production of multimodal signals. Here, we overview findings which show how multimodal signalling sometimes exploits physical properties of the body in the construction of temporally complex signals.

    Speech is putatively a superordinate mode of coordination between what were originally stable independent vocal and mandibular action routines [46]. In chimpanzees (Pan troglodytes), non-vocal lip smacking occurs in the theta range (approx. 3–8 Hz) typical of the speech envelope and labial kinematics of human speech [47]. Marmosets (Callithrix jacchus) occupy bistable modes of vocal-articulatory coordination, where mandibular oscillation is only synchronized at the characteristic theta range with vocal modulations at the final but not starting segments of the call [48]. Similarly, in the zebra finch (Taeniopygia guttata), respiratory pulses are timed with syrinx activity and rapid beak movements, the coordination of which is held to sustain the highly varied vocalization repertoire of this bird species [49]. Human speech is characterized by even more hierarchically nested levels of such coordinated periodicities of effectors and is in this sense multimodal [50].

    Human communicative hand gestures have acceleration peaks co-occurrent with emphatic stress in speech, which are tightly and dynamically coupled under adverse conditions, though with more temporal variability for more complex symbolizing gestures [51]. This coupling of gestures' acceleration-induced forces and speech can arise biomechanically from upper limb–respiratory coupling, e.g. by soliciting anticipatory muscle adjustments to stabilize posture during gesture [52], which also include respiratory-controlling muscles supporting speech-vocalization [53]. Comparable biomechanical interactions and synergies have been found in other animals long before such associations were raised to explain aspects of human multimodal prosody. In brown-headed cowbirds (Molothrus ater), vocalizations are produced with specific respiratory-related abdominal muscle activity. Such modulations are reduced during vocalizing while moving the wings for visual displaying, even though air sac pressure is maintained. This suggests that visual displays in cowbirds biomechanically interact with respiratory dynamics supporting vocalization [54]. During their more vigorous wing displays, these birds are vocally silent, likely so as to avoid biomechanical instability of singing and moving vigorously at the same time. Such biomechanical interactions are consistent with findings of the wing beats of flying bats (e.g. Pteronotus parnellii), which are synchronized with echo vocalizations due to locomotion–respiratory biomechanical synergies [55]. The echo vocalizations during a flight are often isochronously structured (at 6–12 Hz), and this rhythmic ability is attributed to locomotion–respiratory couplings as they share a temporal structure. However, isochrony (at 12–24 Hz) has also been observed in stationary bats when producing social vocalizations [56]. In this way, biomechanical stabilities from one domain may have scaffolded the rhythmic vocal capabilities that are sustained in social vocal domains [57].

    Rhesus macaques assume different facial postures with particular vocalizations. Lips usually protrude when emitting coos or grunts (e.g. during mother–infant contact or group progression). During the emission of screams (e.g. copulation or threats), lips retract [58]. In macaques, facial gestures are associated with peculiar vocal tract shapes, which influence acoustic signals during phonation [59] and can be discriminated by conspecific listeners [60]. Relatedly, in humans, perceiving lip postures allows the perceiver to derive a /ba/ or /pa/ from an auditory signal. It is the auditory–visual–motor co-regularity that makes the visual or haptic perception of articulatory gestures possible in this classic McGurk effect [61]. Recently, a ‘manual gesture McGurk-effect’ has been discovered [62]. When asked to detect a particular lexical stress in a uniformly stressed speech sequence, participants who see a hand gesture's beat timed with a particular speech segment tend to hear a lexical stress for that segment [62]. We think it is possible that the gesture–speech–respiratory link as reviewed previously is actually important for understanding the manual McGurk effect as listeners attune to features of the visual–acoustic signal that are informative about such coordinated modes of production [63]. Similarly, communicative gestures can also influence the heard duration of musical notes. For example, the absolute duration of a percussive tone sounds longer to an audience member when seeing a long- versus short-percussion gesture [64,65].

    Furthermore, spontaneous movements are naturally elicited by music. Whether this spontaneous movement stems from generalizable cross-modal associations is debated, but they might be identified when properly related to biomechanics. For instance, hierarchical bodily representations of metre can be elicited in spontaneous music-induced movement, with different aspects of the metre embodied in hand, torso or full-arm movements [66]. Additionally, specific coordination patterns emerge between different body parts of interacting musicians during musical improvisation [67]. Thus, what one hears in music might be constrained to what body part can be optimally temporally aligned with a feature in the music.

    To detect multimodal cues in this way may be very closely related to indexical signals, such as hearing the potential strength of a conspecific from vocal qualities [60,68]. Indexical signals are often the result of perceptual and/or morphological specialization to detect/convey features from multimodal couplings. For example, frogs (Physalaemus pustulosus) and frog-eating bats (Trachops cirrhosus) have learned to attune to frog calls in relation to the water ripples produced by the calling frog's vocal sac deformations [69]. Similarly, crested pigeons (Ocyphaps lophotes) are alarmed by the sounds of high-velocity wing beats of conspecifics, where the feathers turn out to have morphologically evolved to produce the aeroelastic flutter needed to sustain these unique alarm calls during fleeing locomotion [70]. In broad-tailed hummingbirds (Selasphorus platycercus), the characteristic high-speed courtship dives seem to be driven to attain exactly the right speeds to elicit sonification from aeroelastic flutter, which is synchronized with attaining the correct angle transition relative to the to-be-impressed perceiver so that the gorget dramatically changes colour during sound production [71]. In sum, multimodal communication sometimes involves a specialized exploitation or attunement of physics that constrains particular (combined) modes of acting (with the environment).

    Note that the multimodal information embedded in communicative acoustic signals can have impacts on complex communication in humans too. Speakers who cannot see but only hear each other tend to align patterns of postural sway suggesting that vocal features are used to coordinate a wider embodied context [72]. These emergent coordinations are found to increase social affiliation and can align bodily processes [73]. For example, synchronized drumming in groups synchronizes physiology, aligning participants' heartbeats [74]. Further, visual observation of interpersonal synchronous movement between others may lead observers to rate higher levels of rapport (liking) between the interacting individuals [75] and increase an audience's affective and aesthetic enjoyment of a group dance performance [76].

    To conclude, we have overviewed examples of peripheral bodily constraints which influence the perception and production of multimodal signals across species. Specifically, these biomechanical processes mediate the temporal structuring of multimodal communicative signals.

    5. Social level: complex rhythms in interactive multimodal communication

    In this section, we overview how social interaction complexifies the rhythms which are sustained in communication relative to the rhythms that would arise out of more simple sending or receiving of signals.

    Temporal structure is often rhythmic, but studies have also found quasi-rhythmic structure in sounds of speech, music and animal communication [77], and likewise for movements produced while talking, singing or performing music [78]. The multiscale character of these sounds and movements is readily illustrated in speech—phonemes of varying durations combine to create longer syllabic units with more variability in duration, which combine to form phrases with even more variability in length, and so on, thus creating quasi-rhythmicity at each timescale.

    The durations of linguistic units like phonemes and syllables are difficult to measure in the acoustic speech signal, but they generally correspond to modulations in a specific feature of the acoustic signal, called the amplitude envelope. Within the amplitude envelope, units are expressed in terms of bursts and lulls of energy, and their temporal patterning can be distilled in the timing of bursts via peak amplitudes. Speech analysis [79] shows that smaller bursts cluster to form larger bursts, where larger bursts cluster to form even larger bursts across timescales that roughly correspond with (phonemic, syllabic, phrasal) units of language. Musical recordings also exhibit degrees of multiscale structure whose specifics depend on the genre of music or type of speech performance [77]. Even recordings of animal vocalizations have been found to exhibit multiscale structure using those same analysis methods. While we do not have access to the underlying units, recordings of communicative vocalizations produced by killer whales were found to have a quasi-rhythmic structure across timescales surprisingly similar to human speech interactions [77].
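
    The notion of an amplitude envelope with bursts and lulls of energy can be made concrete with a small example. The following is a minimal sketch under stated assumptions, and it is not the analysis used in [77,79]: it estimates the envelope by rectifying a signal and smoothing it with a moving average, and it marks bursts as local maxima above a threshold. The toy signal (a 200 Hz carrier pulsing four times per second), the 50 ms window, the threshold and all function names are illustrative choices.

```python
import math


def amplitude_envelope(signal, window):
    """Rectify the signal, then smooth it with a moving average of `window` samples."""
    rectified = [abs(x) for x in signal]
    half = window // 2
    envelope = []
    for i in range(len(rectified)):
        # The window is clipped at the edges of the recording.
        lo, hi = max(0, i - half), min(len(rectified), i + half)
        envelope.append(sum(rectified[lo:hi]) / (hi - lo))
    return envelope


def burst_peaks(envelope, threshold):
    """Indices of local maxima of the envelope that exceed the threshold."""
    return [
        i
        for i in range(1, len(envelope) - 1)
        if envelope[i] > threshold
        and envelope[i] > envelope[i - 1]
        and envelope[i] >= envelope[i + 1]
    ]


# Toy signal (assumed values): a 200 Hz carrier whose loudness pulses 4 times per second.
SAMPLE_RATE = 2000  # samples per second
signal = [
    0.5 * (1.0 - math.cos(2.0 * math.pi * 4.0 * n / SAMPLE_RATE))
    * math.sin(2.0 * math.pi * 200.0 * n / SAMPLE_RATE + 0.3)
    for n in range(SAMPLE_RATE)
]
envelope = amplitude_envelope(signal, window=100)  # 50 ms smoothing window
peaks = burst_peaks(envelope, threshold=0.3)
peak_times = [i / SAMPLE_RATE for i in peaks]
print(peak_times)
```

    For this toy signal the detected bursts fall roughly 0.25 s apart, mirroring the pulsing rate. With real speech, the timing of such peaks is what would be examined for clustering across timescales.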

    Multiscale structure in speech and music is also multimodal. Analyses of sounds and movements in video recordings have found coordinated multiscale structures in the amplitudes of co-speech face, head and body movements [78], and the degree of coordination in speech sounds and movements depends on the communicative context. Studies of the rhythmic structure have also found that visual communicative signals are tightly coordinated with the acoustic signals of speech [3]. However, while gestures with a beating quality coincide closely with pitch peaks, on the semantic level object- or action-depicting gestures frequently precede corresponding lexical items by several hundred milliseconds [80]. Facial signals, too, can precede the speech they relate to [81]. Variable timing is most obvious if we consider multimodal utterances in their entirety, where speech is embedded in a rich infrastructure of visual signals coming from the hands, head, face, torso, etc. [3]. These different signals are typically not aligned in time but distributed over the entire length of utterances and beyond, with varying onsets and offsets.

    Typically, multimodal utterances in human social interaction are produced within a scaffold of speaking turns. Sacks et al. [82] propose that interlocutors abide by a clear set of rules which, combined with linguistic information (semantics, pragmatics, prosody and syntax), afford precise timing of turns, yielding minimal gaps and overlaps. Indeed, quantitative cross-language analyses support this tight temporal coupling [83], in line with a putative ‘interaction engine’ providing cognitive-interactional predispositions for this human ability [84], though gestural turn exchanges in bonobos point towards an evolutionary precursor [85].

    The rhythmical structure may further facilitate the temporal coupling of turns. Wilson & Wilson [86] specify a mechanism by which interlocutors' endogenous oscillators are anti-phase coupled, allowing next speakers to launch their turn ‘on time’, while decreasing the chance of overlap. This may be enhanced through temporal projections derived from linguistic information [87], but the rhythmical abilities grounding this mechanism are evolutionarily basic [88]. Wild non-human primates, like indris, gibbons and chimpanzees, show coordination during joint vocal output, suggesting the ability to coordinate to auditory rhythms [89,90]. The captive chimpanzee Ai was able to synchronize her keyboard tapping with an acoustic stimulus [91], and captive macaques can flexibly adjust their tapping in anticipation of the beat of a visual metronome [34]. Moreover, cotton-top tamarins and marmosets have been observed to avoid initiating calls, and to adjust the duration and onset of their calls, so as to avoid interfering noise [92].
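
    Anti-phase coupling of two oscillators can be illustrated with a generic toy model. The following is a minimal sketch using a pair of Kuramoto-style phase oscillators with repulsive coupling; it is not the model specified in [86], and the oscillator frequency, coupling strength, step size and function name are illustrative assumptions. With this sign of coupling, a phase difference of 0 (both "speaking" at once) is unstable and a phase difference of pi (strict alternation) is stable.

```python
import math


def simulate_antiphase_pair(freq_hz=4.0, coupling=1.0, dt=0.001, steps=20000,
                            initial_offset=0.3):
    """Euler-integrate two phase oscillators whose coupling repels in-phase alignment.

    Returns the final phase difference wrapped to [0, 2*pi).
    """
    omega = 2.0 * math.pi * freq_hz  # shared natural frequency (assumed value)
    theta_a, theta_b = 0.0, initial_offset
    for _ in range(steps):
        # The minus sign makes a phase difference of 0 unstable and pi (anti-phase) stable.
        d_a = omega - coupling * math.sin(theta_b - theta_a)
        d_b = omega - coupling * math.sin(theta_a - theta_b)
        theta_a += d_a * dt
        theta_b += d_b * dt
    return (theta_b - theta_a) % (2.0 * math.pi)


final_difference = simulate_antiphase_pair()
print(f"final phase difference: {final_difference:.3f} rad (pi = {math.pi:.3f})")
```

    Starting from a nearly in-phase state, the two oscillators drift apart until they are half a cycle out of step, so that the readiness peaks of one fall in the troughs of the other.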

    However, conversational turn-taking is also characterized by temporal variation, including periods of overlap and gaps ranging up to hundreds of milliseconds [83,93]. The full breadth of factors influencing turn transition times remains opaque, but turn duration, syntactic complexity, word frequency and social action are some of them [94]. A coupled-oscillator turn-taking mechanism can accommodate this large variation in turn timing, since entrained interlocutors could begin speaking at any new anti-phased periodic cycle [86,88]. A recent study based on telephone interactions shows a quasi-rhythmic structure regulated by turn-by-turn negative autocorrelations [95]. The coupled-oscillator mechanism that may form the basis for dealing with quasi-rhythmicity at the interactional level may also govern communication in non-human species, such as the interactional synchronization of non-isochronous call patterns in the katydid species Mecopoda [96].
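
    A turn-by-turn negative autocorrelation means that a longer-than-average transition tends to be followed by a shorter-than-average one, and vice versa. The following is a minimal sketch of how a lag-1 autocorrelation could be computed; it is not the procedure of [95], and the transition times below are hypothetical values invented purely for illustration.

```python
def lag1_autocorrelation(values):
    """Lag-1 autocorrelation of a sequence, normalised by the total sum of squares."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    sum_of_squares = sum(d * d for d in deviations)
    covariance = sum(deviations[i] * deviations[i + 1] for i in range(n - 1))
    return covariance / sum_of_squares


# Hypothetical turn-transition times in seconds (negative = overlap); not real data.
transition_times = [0.30, -0.05, 0.25, 0.00, 0.35, -0.10, 0.20, 0.05]
r1 = lag1_autocorrelation(transition_times)
print(f"lag-1 autocorrelation: {r1:.2f}")
```

    For this alternating toy sequence the coefficient is negative, whereas a steadily drifting sequence would give a positive value.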

    To conclude, the temporal organization of intentional communication is an intricate matter, characterized, on one hand, by synchrony serving the amplification of signals or specific features/components thereof, as well as semantic enhancement and smooth coordination between interlocutors. On the other hand, the temporal organization is characterized by quasi-rhythmic, multiscale structure within and across modalities, serving complex communication and coordination patterns that are widespread in communicative animal vocalizations, human speech and even music.

    6. Conclusion

    We have argued that to understand communicative rhythms which characterize animal communication, a multimodal perspective is necessary and multiple levels need to be examined. The current overview takes the first step towards a multilevel multimodal approach, showing how each level (neural, bodily and interactive) uniquely contributes to the communicative rhythms of animals. We think that when processes on these levels are understood we can productively investigate why the rhythms of, for example, human conversation are so complexly varied. Though we have addressed the unique contributions at each level independently, the biggest challenge is understanding how levels intersect.

    A historic lesson in this regard comes from early theories about human vocalization. Early theories held that phonation was actively neurally driven, such that active muscle contractions would be needed to complete each vocal fold cycle [97]. This hypothesis was soon refuted in favour of a biomechanical theory [98], which correctly posited that vocal fold oscillation arises out of more neurally passive dynamics. Namely, vocal fold oscillations arise due to air pressure flux around a tensed elastic material (i.e. vocal folds). Similarly, neurally passive dynamics have been discovered in subsonic phonations in elephant (Loxodonta africana) trunks [99]. But interestingly, it turns out that for several cat species low-frequency purring is actively neuro-muscularly driven to complete a cycle [100]. The lesson is that the neural-cognitive mechanisms that are invoked in our explanations of rhythmic communication will crucially depend on our knowledge of biomechanics, and any redundancies present biomechanically can completely reshape the type of neural-cognitive control mechanisms that need be invoked. In the same way, understanding the unique neural constraints can lead to the discovery that neural-cognitive mechanisms need to be in place to exploit certain bodily capacities [101]. A recent integrative approach has been proposed in the understanding of beat perception and motor synchronization, where it is suggested that a network of biological oscillators is at play when moving to a rhythm, which involves more neurally passive dynamic biomechanics and neural processes [28]. Finally, social interactions allow for new rhythmic stabilities that are simply absent from, or qualitatively different in nature from those in, non-interactive set-ups [102]. Indeed, there are increasingly loud calls for understanding neural processes as sometimes softly assembling into a wider distributed multi-person system in social interactions [103–106]. The current contribution further underlines a call for such a multiscale investigation of temporal rhythms of multimodal communication, where neural processes are properly embedded in bodily processes unfolding in social interaction.

    Data accessibility

    This article has no additional data.

    Authors' contributions

    W.P. led the writing of the manuscript. S.P. designed the figure. W.P., S.P., L.D., M.G., J.H., C.K., R.S. and G.A.W. wrote the manuscript.

    Competing interests

    We declare we have no competing interests.

    Funding

    W.P. is supported by a Donders Fellowship and is financially supported by the Language in Interaction consortium project ‘Communicative Alignment in Brain & Behaviour’ (CABB). L.D. is supported by a Minerva Fast Track Fellowship from the Max Planck Society. L.D. and J.H. are supported by the European Research Council (CoG grant no. 773079, awarded to J.H.).

    Acknowledgements

    We would like to thank the organizers of the Lorentz workshop ‘Synchrony and rhythm interaction’ for their leadership in the field.

    Footnotes

    One contribution of 17 to a theme issue ‘Synchrony and rhythm interaction: from the brain to behavioural ecology’.

    These authors contributed equally to this review.

    Published by the Royal Society. All rights reserved.