Audio-visual integration in noise: Influence of auditory and visual stimulus degradation on eye movements and perception of the McGurk effect

2020, Vol. 82(7), pp. 3544-3557. Authors: Jemaine E. Stacey, Christina J. Howard, Suvobrata Mitra, Paula C. Stacey

Abstract Seeing a talker’s face can aid audiovisual (AV) integration when speech is presented in noise. However, few studies have simultaneously manipulated auditory and visual degradation. We aimed to establish how degrading the auditory and visual signals affected AV integration. Where people look on the face in this context is also of interest; Buchan, Paré and Munhall (Brain Research, 1242, 162–171, 2008) found that fixations on the mouth increased in the presence of auditory noise, whilst Wilson, Alsius, Paré and Munhall (Journal of Speech, Language, and Hearing Research, 59(4), 601–615, 2016) found that mouth fixations decreased with decreasing visual resolution. In Condition 1, participants listened to clear speech, and in Condition 2, participants listened to vocoded speech designed to simulate the information provided by a cochlear implant. Speech was presented in three levels of auditory noise and three levels of visual blurring. Adding noise to the auditory signal increased McGurk responses, while blurring the visual signal decreased McGurk responses. Participants fixated the mouth more on trials when the McGurk effect was perceived. Adding auditory noise led to people fixating the mouth more, while visual degradation led to people fixating the mouth less. Combined, the results suggest that modality preference and where people look during AV integration of incongruent syllables vary according to the quality of the information available.
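The abstract does not give implementation details for the two degradations, but a minimal sketch of how a video frame might be blurred and noise mixed into a waveform at a target SNR is shown below; the Gaussian blur, the white-noise carrier, and the specific level values are illustrative assumptions, not parameters from the study.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_frame(frame, sigma):
    """Blur a greyscale video frame; larger sigma = stronger visual degradation."""
    return gaussian_filter(frame.astype(float), sigma=sigma)

def add_noise(speech, snr_db, seed=0):
    """Mix white noise into a speech waveform at a target signal-to-noise ratio (dB)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    scale = np.sqrt(np.mean(speech ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Three levels of each degradation, mirroring the 3 x 3 design described above
# (the numbers themselves are placeholders, not the study's values)
blur_levels = [0, 2, 4]    # Gaussian sigma in pixels
snr_levels = [12, 0, -6]   # dB
```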

2012, Vol. 25(0), p. 112. Authors: Lukasz Piwek, Karin Petrini, Frank E. Pollick

Multimodal perception of emotions has typically been examined using displays of a solitary character (e.g., the face–voice and/or body–sound of one actor). We extend this investigation to more complex, dyadic point-light displays combined with speech. A motion and voice capture system was used to record twenty actors interacting in couples with happy, angry and neutral emotional expressions. The resulting stimuli were validated in a pilot study and used in the present study to investigate multimodal perception of emotional social interactions. Participants were required to categorize happy and angry expressions displayed visually, auditorily, or using emotionally congruent and incongruent bimodal displays. In a series of cross-validation experiments we found that sound dominated the visual signal in the perception of emotional social interaction. Although participants’ judgments were faster in the bimodal condition, the accuracy of judgments was similar for the bimodal and auditory-only conditions. When participants watched emotionally mismatched bimodal displays, they predominantly oriented their judgments towards the auditory rather than the visual signal. This auditory dominance persisted even when the reliability of the auditory signal was decreased with noise, although visual information had some effect on judgments of emotions when it was combined with a noisy auditory signal. Our results suggest that when judging emotions from an observed social interaction, we rely primarily on vocal cues from the conversation rather than on visual cues from the actors’ body movements.


2018, Vol. 31(1-2), pp. 39-56. Authors: Julia Irwin, Trey Avery, Lawrence Brancazio, Jacqueline Turcios, Kayleigh Ryherd, et al.

Visual information on a talker’s face can influence what a listener hears. Commonly used approaches to study this include mismatched audiovisual stimuli (e.g., McGurk-type stimuli) or visual speech in auditory noise. In this paper we discuss potential limitations of these approaches and introduce a novel visual phonemic restoration method. This method always presents the same visual stimulus (e.g., /ba/) dubbed with either a matched auditory stimulus (/ba/) or one that has weakened consonantal information and sounds more /a/-like. When this reduced auditory stimulus (or /a/) is dubbed with the visual /ba/, a visual influence will result in effectively ‘restoring’ the weakened auditory cues so that the stimulus is perceived as a /ba/. We used an oddball design in which participants were asked to detect the /a/ among a stream of more frequently occurring /ba/s while viewing either a speaking face or a face with no visual speech. In addition, the same paradigm was presented for a second contrast in which participants detected /pa/ among /ba/s, a contrast that should be unaltered by the presence of visual speech. Behavioral and some ERP findings reflect the expected phonemic restoration for the /ba/ vs. /a/ contrast; specifically, we observed reduced accuracy and P300 response in the presence of visual speech. Further, we report an unexpected finding of reduced accuracy and P300 response for both speech contrasts in the presence of visual speech, suggesting overall modulation of the auditory signal in the presence of visual speech. Consistent with this, we observed a mismatch negativity (MMN) effect for the /ba/ vs. /pa/ contrast only, which was larger in the absence of visual speech. We discuss the potential utility of this paradigm for listeners who cannot respond actively, such as infants and individuals with developmental disabilities.
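As an illustration of the oddball structure described above, the sketch below generates a pseudorandom stream of frequent standards and rare deviants; the 20% deviant rate and the no-repeated-deviants constraint are generic oddball conventions assumed here, not parameters reported in the paper.

```python
import random

def oddball_sequence(n_trials, standard="ba", deviant="a", p_deviant=0.2, seed=1):
    """Pseudorandom oddball stream: mostly standards, occasional deviants,
    never two deviants in a row."""
    rng = random.Random(seed)
    seq = [standard]
    while len(seq) < n_trials:
        if seq[-1] != deviant and rng.random() < p_deviant:
            seq.append(deviant)
        else:
            seq.append(standard)
    return seq

print(oddball_sequence(20))                 # e.g., the /ba/ vs. /a/ contrast
print(oddball_sequence(20, deviant="pa"))   # the control /ba/ vs. /pa/ contrast
```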


2015, Vol. 112(47), pp. 14717-14722. Authors: Clark Fisher, Winrich A. Freiwald

The primate brain contains a set of face-selective areas, which are thought to extract the rich social information that faces provide, such as emotional state and personal identity. The nature of this information raises a fundamental question about these face-selective areas: Do they respond to a face purely because of its visual attributes, or because the face embodies a larger social agent? Here, we used functional magnetic resonance imaging to determine whether the macaque face patch system exhibits a whole-agent response above and beyond its responses to individually presented faces and bodies. We found a systematic development of whole-agent preference through the face patches, from subadditive integration of face and body responses in posterior face patches to superadditive integration in anterior face patches. Superadditivity was not observed for faces atop nonbody objects, implying categorical specificity of face–body interaction. Furthermore, superadditivity was robust to visual degradation of facial detail, suggesting whole-agent selectivity does not require prior face recognition. In contrast, even the body patches immediately adjacent to anterior face areas did not exhibit superadditivity. This asymmetry between face- and body-processing systems may explain why observers attribute bodies’ social signals to faces, and not vice versa. The development of whole-agent selectivity from posterior to anterior face patches, in concert with the recently described development of natural motion selectivity from ventral to dorsal face patches, identifies a single face patch, AF (anterior fundus), as a likely link between the analysis of facial shape and semantic inferences about other agents.
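The sub- versus superadditive contrast above amounts to comparing a patch’s response to the whole agent against the sum of its responses to the face and the body presented alone. A minimal sketch of one such comparison follows; the index formula and the response values are illustrative assumptions, not quantities taken from the paper.

```python
def additivity_index(r_whole, r_face, r_body):
    """Positive: superadditive (whole-agent response exceeds face + body);
    negative: subadditive; zero: strictly additive."""
    return (r_whole - (r_face + r_body)) / (r_whole + r_face + r_body)

# Illustrative responses (arbitrary units) for a posterior and an anterior face patch
print(additivity_index(r_whole=1.4, r_face=1.0, r_body=0.8))  # < 0, subadditive
print(additivity_index(r_whole=2.2, r_face=1.0, r_body=0.8))  # > 0, superadditive
```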


2015, Vol. 282(1802), p. 20142284. Authors: William L. Allen, James P. Higham

Careful investigation of the form of animal signals can offer novel insights into their function. Here, we deconstruct the face patterns of a tribe of primates, the guenons (Cercopithecini), and examine the information that is potentially available in the perceptual dimensions of their multicomponent displays. Using standardized colour-calibrated images of guenon faces, we measure variation in appearance both within and between species. Overall face pattern was quantified using the computer vision ‘eigenface’ technique, and eyebrow and nose-spot focal traits were described using computational image segmentation and shape analysis. Discriminant function analyses established whether these perceptual dimensions could be used to reliably classify species identity, individual identity, age and sex, and, if so, identify the dimensions that carry this information. Across the 12 species studied, we found that both overall face pattern and focal trait differences could be used to categorize species and individuals reliably, whereas correct classification of age category and sex was not possible. This pattern makes sense, as guenons often form mixed-species groups in which familiar conspecifics develop complex differentiated social relationships but where the presence of heterospecifics creates hybridization risk. Our approach should be broadly applicable to the investigation of visual signal function across the animal kingdom.
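A minimal sketch of the eigenface-plus-discriminant-analysis pipeline described above, using scikit-learn and random placeholder images; the component count, image size, and cross-validation scheme are assumptions for illustration, not the authors’ settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Placeholder data: 120 aligned, flattened face images for 12 "species"
rng = np.random.default_rng(0)
faces = rng.random((120, 64 * 64))
labels = np.repeat(np.arange(12), 10)

# 'Eigenface' step: PCA on the flattened images
pca = PCA(n_components=40).fit(faces)
scores = pca.transform(faces)   # each face as a point in eigenface space

# Discriminant function analysis step: can eigenface scores classify species?
lda = LinearDiscriminantAnalysis()
accuracy = cross_val_score(lda, scores, labels, cv=5).mean()
print(f"Cross-validated species classification accuracy: {accuracy:.2f}")
```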


2018, Vol. 31(7), pp. 675-688. Authors: Stefania S. Moro, Jennifer K. E. Steeves

Abstract Observing motion in one modality can influence the perceived direction of motion in a second modality (dynamic capture). For example, observing a square moving in depth can lead a sound to be perceived as increasing in loudness. The current study investigates whether people who have lost one eye are susceptible to audiovisual dynamic capture in the depth plane, similar to binocular and eye-patched viewing control participants. Partial deprivation of the visual system from the loss of one eye early in life results in changes in the remaining intact senses, such as hearing. Linearly expanding or contracting discs were paired with increasing or decreasing tones, and participants were asked to indicate the direction of the auditory stimulus. The magnitude of dynamic visual capture was measured in people with one eye and compared with eye-patched and binocular viewing controls. People with one eye showed the same susceptibility to dynamic visual capture as controls: they perceived the direction of the auditory signal to be moving in the direction of the incongruent visual signal, despite previously showing a lack of visual dominance for audiovisual cues. This behaviour may be the result of directing attention to the visual modality, their partially deficient sense, in order to gain important information about approaching and receding stimuli, which in the former case could be life-threatening. These results contribute to the growing body of research showing that people with one eye display unique accommodations with respect to audiovisual processing that are likely adaptive in each unique sensory situation.


2009, Vol. 21(4), pp. 625-641. Authors: Jürgen M. Kaufmann, Stefan R. Schweinberger, A. Mike Burton

We used ERPs to investigate neural correlates of face learning. At learning, participants viewed video clips of unfamiliar people, which were presented either with or without voices providing semantic information. In a subsequent face-recognition task (four trial blocks), learned faces were repeated once per block and presented interspersed with novel faces. To disentangle face from image learning, we used different images for face repetitions. Block effects demonstrated that engaging in the face-recognition task modulated ERPs between 170 and 900 msec poststimulus onset for learned and novel faces. In addition, multiple repetitions of different exemplars of learned faces elicited an increased bilateral N250. Source localizations of this N250 for learned faces suggested activity in fusiform gyrus, similar to that found previously for N250r in repetition priming paradigms [Schweinberger, S. R., Pickering, E. C., Jentzsch, I., Burton, A. M., & Kaufmann, J. M. Event-related brain potential evidence for a response of inferior temporal cortex to familiar face repetitions. Cognitive Brain Research, 14, 398–409, 2002]. Multiple repetitions of learned faces also elicited increased central–parietal positivity between 400 and 600 msec and caused a bilateral increase of inferior–temporal negativity (>300 msec) compared with novel faces. Semantic information at learning enhanced recognition rates. Faces that had been learned with semantic information elicited somewhat less negative amplitudes between 700 and 900 msec over left inferior–temporal sites. Overall, the findings demonstrate a role of the temporal N250 ERP in the acquisition of new face representations across different images. They also suggest that, compared with visual presentation alone, additional semantic information at learning facilitates postperceptual processing in recognition but does not facilitate perceptual analysis of learned faces.


2019. Authors: Violet Aurora Brown, Julia Feld Strand

The McGurk effect is a multisensory phenomenon in which discrepant auditory and visual speech signals typically result in an illusory percept (McGurk & MacDonald, 1976). McGurk stimuli are often used in studies assessing the attentional requirements of audiovisual integration (e.g., Alsius et al., 2005), but no study has directly compared the costs associated with integrating congruent versus incongruent audiovisual speech. Some evidence suggests that the McGurk effect may not be representative of naturalistic audiovisual speech processing—susceptibility to the McGurk effect is not associated with the ability to derive benefit from the addition of the visual signal (Van Engen et al., 2017), and distinct cortical regions are recruited when processing congruent versus incongruent speech (Erickson et al., 2014). In two experiments, one using response times to identify congruent and incongruent syllables and one using a dual-task paradigm, we assessed whether congruent and incongruent audiovisual speech incur different attentional costs. We demonstrated that response times to both the speech task (Experiment 1) and a secondary vibrotactile task (Experiment 2) were indistinguishable for congruent compared to incongruent syllables, but McGurk fusions were responded to more quickly than McGurk non-fusions. These results suggest that despite documented differences in how congruent and incongruent stimuli are processed (Erickson et al., 2014; Van Engen, Xie, & Chandrasekaran, 2017), they do not appear to differ in terms of processing time or effort. However, responses that result in McGurk fusions are processed more quickly than those that result in non-fusions, though attentional cost is comparable for the two response types.


2013, Vol. 25(8), pp. 1383-1395. Authors: Antje Strauß, Sonja A. Kotz, Jonas Obleser

Under adverse listening conditions, speech comprehension profits from the expectancies that listeners derive from the semantic context. However, the neurocognitive mechanisms of this semantic benefit are unclear: How are expectancies formed from context and adjusted as a sentence unfolds over time under various degrees of acoustic degradation? In an EEG study, we modified auditory signal degradation by applying noise-vocoding (severely degraded: four-band, moderately degraded: eight-band, and clear speech). Orthogonal to that, we manipulated the extent of expectancy: strong or weak semantic context (±con) and context-based typicality of the sentence-last word (high or low: ±typ). This allowed calculation of two distinct effects of expectancy on the N400 component of the evoked potential. The sentence-final N400 effect was taken as an index of the neural effort of automatic word-into-context integration; it varied in peak amplitude and latency with signal degradation and was not reliably observed in response to severely degraded speech. Under clear speech conditions in a strong context, typical and untypical sentence completions seemed to fulfill the neural prediction, as indicated by N400 reductions. In response to moderately degraded signal quality, however, the formed expectancies appeared more specific: Only typical (+con +typ), but not the less typical (+con −typ) context–word combinations led to a decrease in the N400 amplitude. The results show that adverse listening “narrows,” rather than broadens, the expectancies about the perceived speech signal: limiting the perceptual evidence forces the neural system to rely on signal-driven expectancies, rather than more abstract expectancies, while a sentence unfolds over time.
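Noise-vocoding, as used above, divides the speech signal into frequency bands, extracts each band’s amplitude envelope, and uses the envelopes to modulate band-limited noise; fewer bands means heavier degradation. The sketch below is a generic implementation under assumed parameters (filter order, frequency range, Hilbert envelopes), not the authors’ stimulus code.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(speech, fs, n_bands, f_lo=100.0, f_hi=8000.0, seed=0):
    """Noise-vocode a waveform with n_bands logarithmically spaced channels
    (e.g., 4-band = severe, 8-band = moderate degradation)."""
    rng = np.random.default_rng(seed)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    out = np.zeros(len(speech))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, speech)
        envelope = np.abs(hilbert(band))                        # band amplitude envelope
        carrier = sosfiltfilt(sos, rng.standard_normal(len(speech)))
        out += envelope * carrier                               # envelope-modulated noise
    return out
```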


2003, Vol. 26(1), pp. 31-32. Authors: Stephen Handel, Molly L. Erickson

Abstract There are 2,000 hair cells in the cochlea, but only three types of cone in the retina. This disparity can be understood in terms of the differences between the physical characteristics of the auditory signal (discrete excitations and resonances requiring many narrowly tuned receptors) and those of the visual signal (smooth daylight excitations and reflectances requiring only a few broadly tuned receptors). We argue that this match supports the physicalism of color and timbre.


2021. Authors: Corrina Maguinness, Sonja Schall, Katharina von Kriegstein

Perception of human communication signals is often more robust when there is concurrent input from the auditory and visual sensory modalities. For instance, seeing the dynamic articulatory movements of a speaker, in addition to hearing their voice, can help with understanding what is said. This is particularly evident in noisy listening conditions. Even in the absence of concurrent visual input, visual mechanisms continue to be recruited to optimise auditory processing: auditory-only speech and voice-identity recognition is superior for speakers who have previously been learned with their corresponding face, in comparison to an audio-visual control condition, an effect termed the “face-benefit”. Whether the face-benefit can assist in maintaining robust perception in noisy listening conditions, in a similar manner to concurrent visual input, is currently unknown. Here, in two behavioural experiments, we explicitly examined this hypothesis. In each experiment, participants learned a series of speakers’ voices together with their corresponding dynamic face, or with a visual control image depicting the speaker’s occupation. Following learning, participants listened to auditory-only sentences spoken by the same speakers and were asked to recognise the content of the sentences (i.e., speech recognition, Experiment 1) or the identity of the speaker (i.e., voice-identity recognition, Experiment 2) at different levels of increasing auditory noise (SNR +4 dB to -8 dB). For both speech and voice-identity recognition, we observed that for participants who showed a face-benefit, the benefit increased with the degree of noise in the auditory signal (Experiments 1 and 2). Taken together, these results support an audio-visual model of human auditory communication and suggest that the brain has developed a flexible system to deal with auditory uncertainty: learned visual mechanisms are recruited to enhance the recognition of the auditory signal.
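The key quantity in these experiments is the face-benefit: the recognition advantage for voices learned with a face over voices learned with the occupation-image control, tracked across noise levels. Below is a minimal sketch of that computation; the data layout, variable names, and toy numbers are assumptions for illustration, not the authors’ analysis.

```python
import numpy as np

def face_benefit(acc_face, acc_control):
    """Per-participant face-benefit at each SNR level: accuracy for speakers
    learned with a face minus accuracy for speakers learned with an
    occupation image (rows = participants, columns = SNR levels)."""
    return np.asarray(acc_face) - np.asarray(acc_control)

def benefit_noise_slope(benefit, snr_db):
    """Slope of the face-benefit across SNR (dB) for each participant;
    a negative slope means the benefit grows as the signal gets noisier."""
    return np.polyfit(snr_db, np.asarray(benefit).T, deg=1)[0]

# Toy example with made-up numbers, only to show the shapes involved
snr_db = np.array([4.0, 0.0, -4.0, -8.0])
acc_face = np.array([[0.90, 0.80, 0.65, 0.50],
                     [0.85, 0.75, 0.60, 0.45]])
acc_ctrl = np.array([[0.88, 0.75, 0.55, 0.38],
                     [0.84, 0.72, 0.52, 0.40]])
print(benefit_noise_slope(face_benefit(acc_face, acc_ctrl), snr_db))
```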

