The processing of audio-visual speech: empirical and neural bases
Abstract
In this selective review, I outline a number of ways in which seeing the talker affects auditory perception of speech, including, but not confined to, the McGurk effect. To date, studies suggest that all linguistic levels are susceptible to visual influence, and that two main modes of processing can be described: a complementary mode, whereby vision provides information more efficiently than hearing for some under-specified parts of the speech stream, and a correlated mode, whereby vision partially duplicates information about dynamic articulatory patterning.
Cortical correlates of seen speech suggest that at the neurological as well as the perceptual level, auditory processing of speech is affected by vision, so that ‘auditory speech regions’ are activated by seen speech. The processing of natural speech, whether it is heard, seen, or heard and seen, activates the perisylvian language regions (left>right). It is highly probable that activation occurs in a specific order: first superior temporal, then inferior parietal and finally inferior frontal regions (left>right). There is some differentiation of the visual input stream to the core perisylvian language system, suggesting that complementary seen speech information makes special use of the visual ventral processing stream, while for correlated visual speech, the dorsal processing stream, which is sensitive to visual movement, may be relatively more involved.


