visual speech
Recently Published Documents

TOTAL DOCUMENTS: 1185 (five years: 242)
H-INDEX: 49 (five years: 5)

2021, pp. 154596832110642
Author(s): Lisa Johnson, Grigori Yourganov, Alexandra Basilakos, Roger David Newman-Norlund, Helga Thors, ...

Background: Speech entrainment (SE), the online mimicking of an audio-visual speech model, has been shown to increase speech fluency in individuals with non-fluent aphasia. One theory that may explain why SE improves speech output is that it synchronizes functional connectivity between anterior and posterior language regions to be more similar to that of neurotypical speakers. Objectives: The present study tested this by measuring functional connectivity between two regions shown to be necessary for speech production, and their right-hemisphere homologues, in 24 persons with aphasia compared to 20 controls during both free (spontaneous) speech and SE. Methods: Regional functional connectivity in participants with aphasia was normalized to the control data. Two analyses were then carried out: (1) normalized functional connectivity was compared between persons with aphasia and controls during free speech and SE, and (2) for each language score, stepwise linear models with leave-one-out cross-validation were created, with normed functional connectivity during both tasks and the proportion of damage to the left hemisphere as independent variables. Results: Left anterior–posterior functional connectivity and left posterior to right anterior functional connectivity were significantly more similar to connectivity of the control group during SE than during free speech. Additionally, connectivity during free speech was more strongly associated with language measures than connectivity during SE. Conclusions: Overall, these results suggest that SE promotes normalization of functional connectivity (i.e., a return to the patterns observed in neurotypical controls), which may explain why individuals with non-fluent aphasia produce more fluent speech during SE than during spontaneous speech.
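The leave-one-out cross-validation scheme described above can be sketched in a few lines. This is a hypothetical illustration with simulated data, not the authors' code: the predictor matrix, its size, and the simulated effect sizes are all invented.

```python
import numpy as np

def loocv_r2(X, y):
    """Leave-one-out cross-validated R^2 for an ordinary least-squares model.

    X : (n_subjects, n_predictors) design matrix, e.g. normalized functional
        connectivity during each task plus proportion of left-hemisphere damage.
    y : (n_subjects,) language score.
    """
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.delete(np.arange(n), i)
        # Add an intercept column and fit OLS on the held-in subjects.
        A = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        preds[i] = np.concatenate([[1.0], X[i]]) @ beta
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical example: 24 participants, 3 predictors with a known relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 3))
y = X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.2, size=24)
r2 = loocv_r2(X, y)
```

A stepwise procedure would wrap this in a loop that adds or drops predictors based on the cross-validated fit; only the core LOOCV step is shown here.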


2021, pp. 1-21
Author(s): Lynne E. Bernstein, Edward T. Auer, Silvio P. Eberhardt

Purpose: This study investigated the effects of external feedback on perceptual learning of visual speech during lipreading training with sentence stimuli. The goal was to improve visual-only (VO) speech recognition and increase accuracy of audiovisual (AV) speech recognition in noise. The rationale was that spoken word recognition depends on the accuracy of sublexical (phonemic/phonetic) speech perception; effective feedback during training must therefore support sublexical perceptual learning. Method: Normal-hearing (NH) adults were assigned to one of three types of feedback. Sentence feedback was the entire sentence, printed after the response to each stimulus. Word feedback was the correct response words along with perceptually near but incorrect response words. Consonant feedback was the correct response words and the consonants in incorrect but perceptually near response words. Six training sessions were given. Pre- and posttraining testing included an untrained control group. Test stimuli were disyllable nonsense words for forced-choice consonant identification, and isolated words and sentences for open-set identification. Words and sentences were VO, AV, and audio-only (AO), with the audio in speech-shaped noise. Results: Lipreading accuracy increased during training. Pre- and posttraining tests of consonant identification showed no improvement beyond the test–retest increases obtained by untrained controls. Isolated word recognition with a talker not seen during training showed that the control group improved more than the sentence group. Tests of untrained sentences showed that the consonant group significantly improved in all of the stimulus conditions (VO, AO, and AV). Its mean words-correct scores increased by 9.2 percentage points for VO, 3.4 percentage points for AO, and 9.8 percentage points for AV stimuli. Conclusions: Consonant feedback during training with sentence stimuli significantly increased perceptual learning. The training generalized to untrained VO, AO, and AV sentence stimuli. Lipreading training has the potential to significantly improve adults' face-to-face communication in noisy settings in which the talker can be seen.


Sensors, 2021, Vol 22 (1), pp. 72
Author(s): Sanghun Jeon, Ahmed Elsharkawy, Mun Sang Kim

In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.
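The connectionist temporal classification (CTC) training mentioned above pairs with a decoding step that collapses the network's frame-wise outputs into a character string. A minimal greedy-decoding sketch follows; the character inventory and the fake frame scores are invented for illustration and are not part of the paper's model.

```python
import numpy as np

BLANK = 0  # index reserved for the CTC blank symbol
# Hypothetical character inventory; index 0 is the blank.
CHARS = ["-", "b", "i", "n"]

def ctc_greedy_decode(logits):
    """Collapse per-frame argmax labels: merge repeated labels, then drop blanks."""
    best = np.argmax(logits, axis=1)          # most likely label per video frame
    collapsed = []
    prev = None
    for label in best:
        if label != prev and label != BLANK:  # keep first of each run, skip blanks
            collapsed.append(label)
        prev = label
    return "".join(CHARS[i] for i in collapsed)

# Fake frame-wise scores spelling "bin": frames emit b, b, blank, i, i, n.
frames = np.eye(4)[[1, 1, 0, 2, 2, 3]]
decoded = ctc_greedy_decode(frames)  # → "bin"
```

This collapsing rule is what lets CTC align a short word with a variable number of video frames without frame-level labels; beam-search decoding replaces the argmax with a search over label sequences.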


2021
Author(s): Mate Aller, Heidi Solberg Okland, Lucy J. MacGregor, Helen Blank, Matthew H. Davis

Speech perception in noisy environments is enhanced by seeing the facial movements of communication partners. However, the neural mechanisms by which audio and visual speech are combined are not fully understood. We explored phase locking to auditory and visual signals in MEG recordings from 14 human participants (6 female) who reported words from single spoken sentences. We manipulated the acoustic clarity and the visual speech signal such that critical speech information was present in the auditory modality, the visual modality, or both. MEG coherence analysis revealed that both auditory and visual speech envelopes (auditory amplitude modulations and lip-aperture changes) were phase-locked to 2-6 Hz brain responses in auditory and visual cortex, consistent with entrainment to syllable-rate components. Partial coherence analysis was used to separate neural responses to correlated audio-visual signals, and showed non-zero phase locking to the auditory envelope in occipital cortex during audio-visual (AV) speech. Furthermore, phase locking to auditory signals in visual cortex was enhanced for AV speech compared to audio-only (AO) speech that was matched for intelligibility. Conversely, auditory regions of the superior temporal gyrus (STG) did not show above-chance partial coherence with visual speech signals during AV conditions, but did show partial coherence in visual-only (VO) conditions. Hence, visual speech enabled stronger phase locking to auditory signals in visual areas, whereas phase locking of visual speech in auditory regions occurred only during silent lip-reading. Differences in these cross-modal interactions between auditory and visual speech signals are interpreted in line with cross-modal predictive mechanisms during speech perception.
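As a rough illustration of the coherence and partial-coherence measures described above, here is a toy numpy sketch using simulated signals. It is not the authors' MEG pipeline; it just applies the standard cross-spectral formulas, in which partial coherence measures the coupling between two signals after removing what a third signal linearly explains.

```python
import numpy as np

def cross_spectra(a, b):
    """Trial-averaged cross-spectral density between two (trials x time) signals."""
    A, B = np.fft.rfft(a, axis=1), np.fft.rfft(b, axis=1)
    return np.mean(A * np.conj(B), axis=0)

def coherence(x, y):
    """Magnitude-squared coherence per frequency bin."""
    sxy = cross_spectra(x, y)
    return np.abs(sxy) ** 2 / (cross_spectra(x, x).real * cross_spectra(y, y).real)

def partial_coherence(x, y, z):
    """Coherence between x and y after partialling out the signal z."""
    sxy, sxz, szy = cross_spectra(x, y), cross_spectra(x, z), cross_spectra(z, y)
    sxx, syy, szz = (cross_spectra(s, s).real for s in (x, y, z))
    num = np.abs(sxy - sxz * szy / szz) ** 2
    return num / ((sxx - np.abs(sxz) ** 2 / szz) * (syy - np.abs(szy) ** 2 / szz))

# Simulated data: 50 trials, 2 s at 100 Hz; a 4 Hz "speech envelope" drives
# both the stimulus signal and the simulated brain response.
rng = np.random.default_rng(1)
t = np.arange(200) / 100.0
env = np.sin(2 * np.pi * 4 * t)
stim = env + 0.5 * rng.normal(size=(50, t.size))    # noisy per-trial stimulus envelope
brain = 0.8 * stim + 0.5 * rng.normal(size=stim.shape)
shared = env + 0.1 * rng.normal(size=stim.shape)    # the common driver, noisily observed

coh = coherence(brain, stim)                   # high at bin 8 (= 4 Hz; df = 0.5 Hz)
pcoh = partial_coherence(brain, stim, shared)  # lower once the shared envelope is removed
```

In the study's logic, a non-zero `pcoh` analogue in visual cortex for the auditory envelope (with the correlated visual signal partialled out) is what licenses the claim of genuine cross-modal phase locking.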


2021, pp. 1-25
Author(s): Tania S. ZAMUNER, Theresa RABIDEAU, Margarethe MCDONALD, H. Henny YEUNG

Abstract: This study investigates how children aged two to eight years (N = 129) and adults (N = 29) use auditory and visual speech for word recognition. The goal was to bridge the gap between the apparent success of visual speech processing in young children in visual-looking tasks and the apparent difficulty of visual speech processing in older children on explicit behavioural measures. Participants were presented with familiar words in audio-visual (AV), audio-only (A-only) or visual-only (V-only) speech modalities, were then shown target and distractor images, and looking to targets was measured. Adults showed high accuracy, with slightly less target-image looking in the V-only modality. Developmentally, looking was above chance for both AV and A-only modalities, but not in the V-only modality until 6 years of age (earlier for /k/-initial words). Flexible use of visual cues for lexical access thus develops throughout childhood.


2021
Author(s): Daniel Jenkins

Multisensory integration describes the cognitive processes by which information from various perceptual domains is combined to create coherent percepts. For consciously aware perception, multisensory integration can be inferred when information in one perceptual domain influences subjective experience in another. Yet the relationship between integration and awareness is not well understood. One current question is whether multisensory integration can occur in the absence of perceptual awareness. Because there is no subjective experience of unconscious perception, researchers have had to develop novel tasks to infer integration indirectly. For instance, Palmer and Ramsey (2012) presented auditory recordings of spoken syllables alongside videos of faces speaking either the same or different syllables, while masking the videos to prevent visual awareness. The conjunction of matching voices and faces predicted the location of a subsequent Gabor grating (target) on each trial. Participants indicated the location/orientation of the target more accurately when it appeared in the cued location (80% chance), so the authors inferred that auditory and visual speech events were integrated in the absence of visual awareness. In this thesis, I investigated whether these findings generalise to the integration of auditory and visual expressions of emotion. In Experiment 1, I presented spatially informative cues in which congruent facial and vocal emotional expressions predicted the target location, with and without visual masking. I found no evidence of spatial cueing in either awareness condition. To investigate the lack of spatial cueing, in Experiment 2, I repeated the task with aware participants only and had half of those participants explicitly report the emotional prosody. A significant spatial-cueing effect was found only when participants reported the emotional prosody, suggesting that audiovisual congruence can cue spatial attention during aware perception. It remains unclear whether audiovisual congruence can cue spatial attention without awareness, and whether such effects genuinely imply multisensory integration.


Author(s): Basil Wahn, Laura Schmitz, Alan Kingstone, Anne Böckler-Raettig

Abstract: Eye contact is a dynamic social signal that captures attention and plays a critical role in human communication. In particular, direct gaze often accompanies communicative acts in an ostensive function: a speaker directs her gaze towards the addressee to highlight the fact that this message is being intentionally communicated to her. The addressee, in turn, integrates the speaker’s auditory and visual speech signals (i.e., her vocal sounds and lip movements) into a unitary percept. It is an open question whether the speaker’s gaze affects how the addressee integrates the speaker’s multisensory speech signals. We investigated this question using the classic McGurk illusion, an illusory percept created by presenting mismatching auditory (vocal sounds) and visual information (speaker’s lip movements). Specifically, we manipulated whether the speaker (a) moved his eyelids up/down (i.e., opened/closed his eyes) prior to speaking or did not show any eye motion, and (b) spoke with open or closed eyes. When the speaker’s eyes moved (i.e., opened or closed) before an utterance, and when the speaker spoke with closed eyes, the McGurk illusion was weakened (i.e., addressees reported significantly fewer illusory percepts). In line with previous research, this suggests that motion (opening or closing), as well as the closed state of the speaker’s eyes, captured addressees’ attention, thereby reducing the influence of the speaker’s lip movements on the addressees’ audiovisual integration process. Our findings reaffirm the power of speaker gaze to guide attention, showing that its dynamics can modulate low-level processes such as the integration of multisensory speech signals.

