The Impact of Inaccurate Phonetic Annotations on Speech Recognition Performance

Author(s):  
Radek Safarik ◽  
Lukas Mateju
2021 ◽  
Vol 32 (08) ◽  
pp. 528-536
Author(s):  
Jessica H. Lewis ◽  
Irina Castellanos ◽  
Aaron C. Moberly

Background: Recent models theorize that neurocognitive resources are deployed differently during speech recognition depending on task demands, such as the severity of degradation of the signal or the modality of presentation (auditory vs. audiovisual [AV]). This concept is particularly relevant to the adult cochlear implant (CI) population, considering the large variability among CI users in their spectro-temporal processing abilities. However, disentangling the effects of individual differences in spectro-temporal processing and neurocognitive skills on speech recognition in clinical populations of adult CI users is challenging. Thus, this study investigated the relationship between neurocognitive functions and recognition of spectrally degraded speech in a group of young adult normal-hearing (NH) listeners. Purpose: The aim of this study was to manipulate the degree of spectral degradation and the modality of speech presented to young adult NH listeners to determine whether the deployment of neurocognitive skills would be affected. Research Design: Correlational study design. Study Sample: Twenty-one NH college students. Data Collection and Analysis: Participants listened to sentences in three spectral-degradation conditions: no degradation (clear sentences), moderate degradation (8-channel noise-vocoded), and high degradation (4-channel noise-vocoded). Thirty sentences were presented in an auditory-only (A-only) modality and in an AV modality. Visual assessments from the National Institutes of Health Toolbox Cognitive Battery were completed to evaluate working memory, inhibition-concentration, cognitive flexibility, and processing speed. Analyses of variance compared speech recognition performance across spectral-degradation conditions and modalities. Bivariate correlation analyses were performed between speech recognition performance and the neurocognitive measures in the various test conditions.
Results: Main effects on sentence recognition were found for degree of degradation (p < 0.001) and modality (p < 0.001). Inhibition-concentration skills correlated moderately (r = 0.45, p = 0.02) with recognition scores for moderately degraded sentences in the A-only condition. No correlations were found between neurocognitive scores and AV speech recognition scores. Conclusions: Inhibition-concentration skills are deployed differentially during sentence recognition, depending on the level of signal degradation. Additional studies will be required to examine these relations in actual clinical populations such as adult CI users.
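The bivariate correlations reported above (e.g., r = 0.45 between inhibition-concentration and A-only recognition scores) are Pearson product-moment coefficients. A minimal pure-Python sketch of that computation follows; the study's actual statistical software is not specified, so this is illustrative only:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Covariance numerator and the two standard-deviation terms
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)
```

A value near +1 indicates that listeners with higher neurocognitive scores also tended to have higher recognition scores; the p-value testing r against zero would come from a separate t-distribution step not shown here.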


Energies ◽  
2021 ◽  
Vol 14 (11) ◽  
pp. 3267
Author(s):  
Ramon C. F. Araújo ◽  
Rodrigo M. S. de Oliveira ◽  
Fernando S. Brasil ◽  
Fabrício J. B. Barros

In this paper, a novel image denoising algorithm and novel input features are proposed. The algorithm is applied to phase-resolved partial discharge (PRPD) diagrams with a single dominant partial discharge (PD) source, preparing them for automatic artificial-intelligence-based classification. It was designed to mitigate several sources of distortion often observed in PRPDs obtained from fully operational hydroelectric generators. The capabilities of the denoising algorithm are the automatic removal of sparse noise and the suppression of non-dominant discharges, including those due to crosstalk. The input features are functions of PD distributions along amplitude and phase, calculated in a novel way to mitigate random effects inherent to PD measurements. The impact of the proposed contributions was statistically evaluated and compared with the classification performance of previously published approaches. Higher recognition rates and reduced variances were obtained using the proposed methods, statistically outperforming autonomous classification techniques described in earlier works. The values of the algorithm's internal parameters are also validated by comparing the recognition performance obtained with different parameter combinations. All typical PD sources described in hydro-generator PD standards are considered and can be automatically detected.
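The "distributions along amplitude and phase" underlying such features can be pictured as marginals of a phase-resolved 2D pulse histogram. A simplified sketch follows; the bin counts, amplitude scaling, and feature definitions here are illustrative assumptions, not the authors' parameters:

```python
def prpd_histogram(pulses, phase_bins=8, amp_bins=4, amp_max=100.0):
    """Accumulate PD pulses (phase in degrees, amplitude) into a
    phase-resolved count matrix: rows = amplitude bins, cols = phase bins."""
    hist = [[0] * phase_bins for _ in range(amp_bins)]
    for phase, amp in pulses:
        p = int(phase % 360.0 / (360.0 / phase_bins))          # phase bin
        a = min(int(amp / amp_max * amp_bins), amp_bins - 1)   # clamped amplitude bin
        hist[a][p] += 1
    return hist

def phase_marginal(hist):
    """PD count distribution along phase (summed over amplitude bins)."""
    return [sum(col) for col in zip(*hist)]
```

Classifiers then consume such marginals (or functions of them) rather than the raw diagram, which is one way random pulse-to-pulse variation can be averaged out.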


2020 ◽  
Vol 24 ◽  
pp. 233121652098029
Author(s):  
Allison Trine ◽  
Brian B. Monson

Several studies have demonstrated that extended high frequencies (EHFs; >8 kHz) in speech are not only audible but also have some utility for speech recognition, including for speech-in-speech recognition when maskers are facing away from the listener. However, the contribution of EHF spectral versus temporal information to speech recognition is unknown. Here, we show that access to EHF temporal information improved speech-in-speech recognition relative to speech bandlimited at 8 kHz but that additional access to EHF spectral detail provided an additional small but significant benefit. Results suggest that both EHF spectral structure and the temporal envelope contribute to the observed EHF benefit. Speech recognition performance was quite sensitive to masker head orientation, with a rotation of only 15° providing a highly significant benefit. An exploratory analysis indicated that pure-tone thresholds at EHFs are better predictors of speech recognition performance than low-frequency pure-tone thresholds.
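The EHF temporal envelope contrasted above with spectral detail is, roughly, the slowly varying amplitude contour of the >8 kHz band with the fine structure discarded. A toy sketch of envelope extraction follows (rectification plus a moving average standing in for a proper low-pass filter; window length is an illustrative assumption):

```python
def temporal_envelope(x, win=8):
    """Crude temporal-envelope estimate: full-wave rectification followed
    by a moving-average smoother (a stand-in for low-pass filtering)."""
    rect = [abs(v) for v in x]
    half = win // 2
    env = []
    for i in range(len(rect)):
        seg = rect[max(0, i - half): i + half + 1]
        env.append(sum(seg) / len(seg))
    return env
```

Because rectification discards sign, two signals with identical amplitude contours but different fine structure yield the same envelope, which is exactly the information separation the study exploits.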


2012 ◽  
Vol 23 (08) ◽  
pp. 577-589 ◽  
Author(s):  
Mary Rudner ◽  
Thomas Lunner ◽  
Thomas Behrens ◽  
Elisabet Sundewall Thorén ◽  
Jerker Rönnberg

Background: Recently there has been interest in using subjective ratings as a measure of perceived effort during speech recognition in noise. Perceived effort may be an indicator of cognitive load. Thus, subjective effort ratings during speech recognition in noise may covary both with signal-to-noise ratio (SNR) and with individual cognitive capacity. Purpose: The present study investigated the relation between subjective ratings of the effort involved in listening to speech in noise, speech recognition performance, and individual working memory (WM) capacity in hearing-impaired hearing aid users. Research Design: In two experiments, participants with hearing loss rated perceived effort during aided speech perception in noise. Noise type and SNR were manipulated in both experiments, and in the second experiment hearing aid compression release settings were also manipulated. Speech recognition performance was measured along with WM capacity. Study Sample: There were 46 participants in all, with bilateral mild to moderate sloping hearing loss. In Experiment 1 there were 16 native Danish speakers (eight women and eight men) with a mean age of 63.5 yr (SD = 12.1) and an average pure tone (PT) threshold of 47.6 dB (SD = 9.8). In Experiment 2 there were 30 native Swedish speakers (19 women and 11 men) with a mean age of 70 yr (SD = 7.8) and an average PT threshold of 45.8 dB (SD = 6.6). Data Collection and Analysis: A visual analog scale (VAS) was used for effort rating in both experiments. In Experiment 1, effort was rated at individually adapted SNRs, while in Experiment 2 it was rated at fixed SNRs. Speech recognition in noise performance was measured using adaptive procedures in both experiments, with Dantale II sentences in Experiment 1 and Hagerman sentences in Experiment 2. WM capacity was measured using a letter-monitoring task in Experiment 1 and the reading span task in Experiment 2.
Results: In both experiments, there was a strong and significant relation between rated effort and SNR that was independent of individual WM capacity, whereas the relation between rated effort and noise type seemed to be influenced by individual WM capacity. Experiment 2 showed that hearing aid compression setting influenced rated effort. Conclusions: Subjective ratings of the effort involved in speech recognition in noise reflect SNRs, and individual cognitive capacity seems to influence relative rating of noise type.
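The adaptive procedures mentioned above vary the SNR trial by trial based on the listener's responses. A minimal 1-up/1-down staircase sketch follows (this rule converges toward roughly 50% correct; the actual Dantale II and Hagerman procedures use their own scoring rules and step sizes, so the values here are illustrative):

```python
def adaptive_snr_track(responses, start_snr=0.0, step=2.0):
    """Simple 1-up/1-down staircase: lower the SNR after a correct
    response, raise it after an incorrect one. Returns the SNR track."""
    snr = start_snr
    track = [snr]
    for correct in responses:
        snr += -step if correct else step
        track.append(snr)
    return track
```

The speech reception threshold is then typically estimated by averaging the SNRs visited after the track stabilizes.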


1981 ◽  
Vol 10 (4) ◽  
pp. 239-246 ◽  
Author(s):  
D. D. Dirks ◽  
C. A. Kamm ◽  
J. R. Dubno ◽  
T. M. Velde

Author(s):  
Jiahao Chen ◽  
Ryota Nishimura ◽  
Norihide Kitaoka

Many end-to-end, large-vocabulary, continuous speech recognition systems are now able to achieve better speech recognition performance than conventional systems. However, most of these approaches are based on bidirectional networks and sequence-to-sequence modeling, so automatic speech recognition (ASR) systems using such techniques must wait for an entire segment of voice input before they can begin processing the data, resulting in a lengthy time lag that can be a serious drawback in some applications. An obvious solution to this problem is to develop a speech recognition algorithm capable of processing streaming data. Therefore, in this paper we explore the possibility of a streaming, online ASR system for Japanese using a model based on unidirectional LSTMs trained with connectionist temporal classification (CTC) criteria and local attention. Such an approach has not been well investigated for Japanese, as most Japanese-language ASR systems employ bidirectional networks. The best result for our proposed system during experimental evaluation was a character error rate of 9.87%.
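CTC maps per-frame network outputs to a character sequence by merging consecutive repeats and then deleting the blank symbol, which is what makes frame-synchronous streaming decoding possible with a unidirectional model. A minimal greedy-decoding sketch of that collapse step (the paper's local-attention component is not shown):

```python
def ctc_collapse(frame_labels, blank=0):
    """CTC greedy decoding: given per-frame argmax label indices,
    merge consecutive repeats, then drop the blank symbol."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

Note that a blank between two identical labels keeps them distinct, which is how CTC can emit repeated characters.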


2008 ◽  
Vol 19 (02) ◽  
pp. 120-134 ◽  
Author(s):  
Kate Gfeller ◽  
Jacob Oleson ◽  
John F. Knutson ◽  
Patrick Breheny ◽  
Virginia Driscoll ◽  
...  

The research examined whether performance by adult cochlear implant recipients on a variety of recognition and appraisal tests derived from real-world music could be predicted from technological, demographic, and life experience variables, as well as speech recognition scores. A representative sample of 209 adults implanted between 1985 and 2006 participated. Using multiple linear regression models and generalized linear mixed models, sets of optimal predictor variables were selected that effectively predicted performance on a test battery that assessed different aspects of music listening. These analyses established the importance of distinguishing between the accuracy of music perception and the appraisal of musical stimuli when using music listening as an index of implant success. Importantly, neither device type nor processing strategy predicted music perception or music appraisal. Speech recognition performance was not a strong predictor of music perception, and primarily predicted music perception when the test stimuli included lyrics. Additionally, limitations in the utility of speech perception in predicting musical perception and appraisal underscore the utility of music perception as an alternative outcome measure for evaluating implant outcomes. Music listening background, residual hearing (i.e., hearing aid use), cognitive factors, and some demographic factors predicted several indices of perceptual accuracy or appraisal of music.
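The multiple linear regression models referred to above fit an outcome (e.g., a music perception score) as a weighted sum of predictors. The published analyses were of course run with proper statistical tooling; the following pure-Python sketch only illustrates ordinary least squares via the normal equations for a small design matrix with a leading intercept column:

```python
def ols_fit(X, y):
    """Multiple linear regression: solve the normal equations (X'X)b = X'y
    with Gaussian elimination. Each row of X starts with 1 (intercept)."""
    n, k = len(X), len(X[0])
    # Build the augmented normal-equation system [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] +
         [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for col in range(k):                      # forward elimination with pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= f * A[col][c]
    b = [0.0] * k                             # back substitution
    for i in range(k - 1, -1, -1):
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b
```

Predictor selection then amounts to comparing such fits over candidate variable subsets, which is what the "sets of optimal predictor variables" describes.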


2005 ◽  
Vol 36 (3) ◽  
pp. 219-229 ◽  
Author(s):  
Peggy Nelson ◽  
Kathryn Kohnert ◽  
Sabina Sabur ◽  
Daniel Shaw

Purpose: Two studies were conducted to investigate the effects of classroom noise on attention and speech perception in native Spanish-speaking second graders learning English as their second language (L2) as compared to English-only-speaking (EO) peers. Method: Study 1 measured children's on-task behavior during instructional activities with and without soundfield amplification. Study 2 measured the effects of noise (+10 dB signal-to-noise ratio) using an experimental English word recognition task. Results: Findings from Study 1 revealed no significant condition (pre/postamplification) or group differences in observed on-task performance. Main findings from Study 2 were that word recognition performance declined significantly for both the L2 and EO groups in the noise condition; however, the impact was disproportionately greater for the L2 group. Clinical Implications: Children learning in their L2 appear to be at a distinct disadvantage when listening in rooms with typical levels of noise and reverberation. Speech-language pathologists and audiologists should collaborate to inform teachers, help reduce classroom noise, increase signal levels, and improve access to spoken language for L2 learners.
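A +10 dB SNR condition means the masking noise is scaled so its power sits 10 dB below the speech power before the two are summed. A minimal sketch of that mixing step:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db,
    then add it to the speech sample-by-sample."""
    p_s = sum(v * v for v in speech) / len(speech)   # mean speech power
    p_n = sum(v * v for v in noise) / len(noise)     # mean noise power
    gain = math.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Lower (or negative) snr_db values produce progressively more adverse listening conditions than the +10 dB used in Study 2.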

