The Impact of Inaccurate Phonetic Annotations on Speech Recognition Performance

Author(s):  
Radek Safarik ◽  
Lukas Mateju
2021 ◽  
Vol 32 (08) ◽  
pp. 528-536
Author(s):  
Jessica H. Lewis ◽  
Irina Castellanos ◽  
Aaron C. Moberly

Background: Recent models theorize that neurocognitive resources are deployed differently during speech recognition depending on task demands, such as the severity of degradation of the signal or the modality of presentation (auditory vs. audiovisual [AV]). This concept is particularly relevant to the adult cochlear implant (CI) population, considering the large variability among CI users in their spectro-temporal processing abilities. However, disentangling the effects of individual differences in spectro-temporal processing and neurocognitive skills on speech recognition in clinical populations of adult CI users is challenging. Thus, this study investigated the relationship between neurocognitive functions and recognition of spectrally degraded speech in a group of young adult normal-hearing (NH) listeners. Purpose: The aim of this study was to manipulate the degree of spectral degradation and the modality of speech presented to young adult NH listeners to determine whether the deployment of neurocognitive skills would be affected. Research Design: Correlational study design. Study Sample: Twenty-one NH college students. Data Collection and Analysis: Participants listened to sentences in three spectral-degradation conditions: no degradation (clear sentences), moderate degradation (8-channel noise-vocoded), and high degradation (4-channel noise-vocoded). Thirty sentences were presented in an auditory-only (A-only) modality and in an AV modality. Visual assessments from the National Institutes of Health Toolbox Cognitive Battery were completed to evaluate working memory, inhibition-concentration, cognitive flexibility, and processing speed. Analyses of variance compared speech recognition performance across spectral-degradation conditions and modalities. Bivariate correlation analyses were performed between speech recognition performance and the neurocognitive measures in the various test conditions.
Results: Main effects on sentence recognition were found for degree of degradation (p < 0.001) and modality (p < 0.001). Inhibition-concentration skills correlated moderately (r = 0.45, p = 0.02) with recognition scores for moderately degraded sentences in the A-only condition. No correlations were found between neurocognitive scores and AV speech recognition scores. Conclusions: Inhibition-concentration skills are deployed differentially during sentence recognition, depending on the level of signal degradation. Additional studies will be required to examine these relations in actual clinical populations such as adult CI users.
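The bivariate correlations reported above (e.g., r = 0.45 between inhibition-concentration and A-only recognition scores) are Pearson product-moment coefficients. A minimal pure-Python sketch of that computation follows; the study's actual statistical software is not specified, so this is illustrative only:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Covariance numerator and the two standard-deviation terms
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)
```

A value near +1 indicates that listeners with higher neurocognitive scores also tended to have higher recognition scores; the p-value testing r against zero would come from a separate t-distribution step not shown here.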


Energies ◽  
2021 ◽  
Vol 14 (11) ◽  
pp. 3267
Author(s):  
Ramon C. F. Araújo ◽  
Rodrigo M. S. de Oliveira ◽  
Fernando S. Brasil ◽  
Fabrício J. B. Barros

In this paper, a novel image denoising algorithm and novel input features are proposed. The algorithm is applied to phase-resolved partial discharge (PRPD) diagrams with a single dominant partial discharge (PD) source, preparing them for automatic artificial-intelligence-based classification. It was designed to mitigate several sources of distortion often observed in PRPDs obtained from fully operational hydroelectric generators. The capabilities of the denoising algorithm are the automatic removal of sparse noise and the suppression of non-dominant discharges, including those due to crosstalk. The input features are functions of PD distributions along amplitude and phase, calculated in a novel way to mitigate random effects inherent to PD measurements. The impact of the proposed contributions was statistically evaluated and compared with the classification performance of previously published approaches. Higher recognition rates and reduced variances were obtained using the proposed methods, statistically outperforming autonomous classification techniques described in earlier works. The values of the algorithm's internal parameters are also validated by comparing the recognition performance obtained with different parameter combinations. All typical PD sources described in hydro-generator PD standards are considered and can be automatically detected.
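The "distributions along amplitude and phase" underlying such features can be pictured as marginals of a phase-resolved 2D pulse histogram. A simplified sketch follows; the bin counts, amplitude scaling, and feature definitions here are illustrative assumptions, not the authors' parameters:

```python
def prpd_histogram(pulses, phase_bins=8, amp_bins=4, amp_max=100.0):
    """Accumulate PD pulses (phase in degrees, amplitude) into a
    phase-resolved count matrix: rows = amplitude bins, cols = phase bins."""
    hist = [[0] * phase_bins for _ in range(amp_bins)]
    for phase, amp in pulses:
        p = int(phase % 360.0 / (360.0 / phase_bins))          # phase bin
        a = min(int(amp / amp_max * amp_bins), amp_bins - 1)   # clamped amplitude bin
        hist[a][p] += 1
    return hist

def phase_marginal(hist):
    """PD count distribution along phase (summed over amplitude bins)."""
    return [sum(col) for col in zip(*hist)]
```

Classifiers then consume such marginals (or functions of them) rather than the raw diagram, which is one way random pulse-to-pulse variation can be averaged out.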


2020 ◽  
Vol 24 ◽  
pp. 233121652098029
Author(s):  
Allison Trine ◽  
Brian B. Monson

Several studies have demonstrated that extended high frequencies (EHFs; >8 kHz) in speech are not only audible but also have some utility for speech recognition, including for speech-in-speech recognition when maskers are facing away from the listener. However, the contribution of EHF spectral versus temporal information to speech recognition is unknown. Here, we show that access to EHF temporal information improved speech-in-speech recognition relative to speech bandlimited at 8 kHz but that additional access to EHF spectral detail provided an additional small but significant benefit. Results suggest that both EHF spectral structure and the temporal envelope contribute to the observed EHF benefit. Speech recognition performance was quite sensitive to masker head orientation, with a rotation of only 15° providing a highly significant benefit. An exploratory analysis indicated that pure-tone thresholds at EHFs are better predictors of speech recognition performance than low-frequency pure-tone thresholds.
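The EHF temporal envelope contrasted above with spectral detail is, roughly, the slowly varying amplitude contour of the >8 kHz band with the fine structure discarded. A toy sketch of envelope extraction follows (rectification plus a moving average standing in for a proper low-pass filter; window length is an illustrative assumption):

```python
def temporal_envelope(x, win=8):
    """Crude temporal-envelope estimate: full-wave rectification followed
    by a moving-average smoother (a stand-in for low-pass filtering)."""
    rect = [abs(v) for v in x]
    half = win // 2
    env = []
    for i in range(len(rect)):
        seg = rect[max(0, i - half): i + half + 1]
        env.append(sum(seg) / len(seg))
    return env
```

Because rectification discards sign, two signals with identical amplitude contours but different fine structure yield the same envelope, which is exactly the information separation the study exploits.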


2012 ◽  
Vol 23 (08) ◽  
pp. 577-589 ◽  
Author(s):  
Mary Rudner ◽  
Thomas Lunner ◽  
Thomas Behrens ◽  
Elisabet Sundewall Thorén ◽  
Jerker Rönnberg

Background: Recently there has been interest in using subjective ratings as a measure of perceived effort during speech recognition in noise. Perceived effort may be an indicator of cognitive load. Thus, subjective effort ratings during speech recognition in noise may covary both with signal-to-noise ratio (SNR) and with individual cognitive capacity. Purpose: The present study investigated the relation between subjective ratings of the effort involved in listening to speech in noise, speech recognition performance, and individual working memory (WM) capacity in hearing-impaired hearing aid users. Research Design: In two experiments, participants with hearing loss rated perceived effort during aided speech perception in noise. Noise type and SNR were manipulated in both experiments, and in the second experiment hearing aid compression release settings were also manipulated. Speech recognition performance was measured along with WM capacity. Study Sample: There were 46 participants in all, with bilateral mild to moderate sloping hearing loss. In Experiment 1 there were 16 native Danish speakers (eight women and eight men) with a mean age of 63.5 yr (SD = 12.1) and an average pure tone (PT) threshold of 47.6 dB (SD = 9.8). In Experiment 2 there were 30 native Swedish speakers (19 women and 11 men) with a mean age of 70 yr (SD = 7.8) and an average PT threshold of 45.8 dB (SD = 6.6). Data Collection and Analysis: A visual analog scale (VAS) was used for effort rating in both experiments. In Experiment 1, effort was rated at individually adapted SNRs, while in Experiment 2 it was rated at fixed SNRs. Speech recognition in noise performance was measured using adaptive procedures in both experiments, with Dantale II sentences in Experiment 1 and Hagerman sentences in Experiment 2. WM capacity was measured using a letter-monitoring task in Experiment 1 and the reading span task in Experiment 2.
Results: In both experiments, there was a strong and significant relation between rated effort and SNR that was independent of individual WM capacity, whereas the relation between rated effort and noise type seemed to be influenced by individual WM capacity. Experiment 2 showed that hearing aid compression setting influenced rated effort. Conclusions: Subjective ratings of the effort involved in speech recognition in noise reflect SNRs, and individual cognitive capacity seems to influence relative rating of noise type.
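The adaptive procedures mentioned above vary the SNR trial by trial based on the listener's responses. A minimal 1-up/1-down staircase sketch follows (this rule converges toward roughly 50% correct; the actual Dantale II and Hagerman procedures use their own scoring rules and step sizes, so the values here are illustrative):

```python
def adaptive_snr_track(responses, start_snr=0.0, step=2.0):
    """Simple 1-up/1-down staircase: lower the SNR after a correct
    response, raise it after an incorrect one. Returns the SNR track."""
    snr = start_snr
    track = [snr]
    for correct in responses:
        snr += -step if correct else step
        track.append(snr)
    return track
```

The speech reception threshold is then typically estimated by averaging the SNRs visited after the track stabilizes.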


1981 ◽  
Vol 10 (4) ◽  
pp. 239-246 ◽  
Author(s):  
D. D. Dirks ◽  
C. A. Kamm ◽  
J. R. Dubno ◽  
T. M. Velde

Author(s):  
Jiahao Chen ◽  
Ryota Nishimura ◽  
Norihide Kitaoka

Many end-to-end, large-vocabulary, continuous speech recognition systems are now able to achieve better speech recognition performance than conventional systems. However, most of these approaches are based on bidirectional networks and sequence-to-sequence modeling, so automatic speech recognition (ASR) systems using such techniques must wait for an entire segment of voice input before they can begin processing the data, resulting in a lengthy time lag that can be a serious drawback in some applications. An obvious solution to this problem is to develop a speech recognition algorithm capable of processing streaming data. Therefore, in this paper we explore the possibility of a streaming, online ASR system for Japanese using a model based on unidirectional LSTMs trained with connectionist temporal classification (CTC) criteria and local attention. Such an approach has not been well investigated for Japanese, as most Japanese-language ASR systems employ bidirectional networks. The best result for our proposed system during experimental evaluation was a character error rate of 9.87%.
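CTC maps per-frame network outputs to a character sequence by merging consecutive repeats and then deleting the blank symbol, which is what makes frame-synchronous streaming decoding possible with a unidirectional model. A minimal greedy-decoding sketch of that collapse step (the paper's local-attention component is not shown):

```python
def ctc_collapse(frame_labels, blank=0):
    """CTC greedy decoding: given per-frame argmax label indices,
    merge consecutive repeats, then drop the blank symbol."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

Note that a blank between two identical labels keeps them distinct, which is how CTC can emit repeated characters.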


2008 ◽  
Vol 19 (02) ◽  
pp. 120-134 ◽  
Author(s):  
Kate Gfeller ◽  
Jacob Oleson ◽  
John F. Knutson ◽  
Patrick Breheny ◽  
Virginia Driscoll ◽  
...  

The research examined whether performance by adult cochlear implant recipients on a variety of recognition and appraisal tests derived from real-world music could be predicted from technological, demographic, and life experience variables, as well as speech recognition scores. A representative sample of 209 adults implanted between 1985 and 2006 participated. Using multiple linear regression models and generalized linear mixed models, sets of optimal predictor variables were selected that effectively predicted performance on a test battery that assessed different aspects of music listening. These analyses established the importance of distinguishing between the accuracy of music perception and the appraisal of musical stimuli when using music listening as an index of implant success. Importantly, neither device type nor processing strategy predicted music perception or music appraisal. Speech recognition performance was not a strong predictor of music perception, and primarily predicted music perception when the test stimuli included lyrics. Additionally, limitations in the utility of speech perception in predicting musical perception and appraisal underscore the utility of music perception as an alternative outcome measure for evaluating implant outcomes. Music listening background, residual hearing (i.e., hearing aid use), cognitive factors, and some demographic factors predicted several indices of perceptual accuracy or appraisal of music.
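The multiple linear regression models referred to above fit an outcome (e.g., a music perception score) as a weighted sum of predictors. The published analyses were of course run with proper statistical tooling; the following pure-Python sketch only illustrates ordinary least squares via the normal equations for a small design matrix with a leading intercept column:

```python
def ols_fit(X, y):
    """Multiple linear regression: solve the normal equations (X'X)b = X'y
    with Gaussian elimination. Each row of X starts with 1 (intercept)."""
    n, k = len(X), len(X[0])
    # Build the augmented normal-equation system [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] +
         [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for col in range(k):                      # forward elimination with pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= f * A[col][c]
    b = [0.0] * k                             # back substitution
    for i in range(k - 1, -1, -1):
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b
```

Predictor selection then amounts to comparing such fits over candidate variable subsets, which is what the "sets of optimal predictor variables" describes.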


2005 ◽  
Vol 36 (3) ◽  
pp. 219-229 ◽  
Author(s):  
Peggy Nelson ◽  
Kathryn Kohnert ◽  
Sabina Sabur ◽  
Daniel Shaw

Purpose: Two studies were conducted to investigate the effects of classroom noise on attention and speech perception in native Spanish-speaking second graders learning English as their second language (L2) as compared to English-only-speaking (EO) peers. Method: Study 1 measured children's on-task behavior during instructional activities with and without soundfield amplification. Study 2 measured the effects of noise (+10 dB signal-to-noise ratio) using an experimental English word recognition task. Results: Findings from Study 1 revealed no significant condition (pre/postamplification) or group differences in observed on-task performance. Main findings from Study 2 were that word recognition performance declined significantly for both the L2 and EO groups in the noise condition; however, the impact was disproportionately greater for the L2 group. Clinical Implications: Children learning in their L2 appear to be at a distinct disadvantage when listening in rooms with typical levels of noise and reverberation. Speech-language pathologists and audiologists should collaborate to inform teachers, help reduce classroom noise, increase signal levels, and improve access to spoken language for L2 learners.
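A +10 dB SNR condition means the masking noise is scaled so its power sits 10 dB below the speech power before the two are summed. A minimal sketch of that mixing step:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db,
    then add it to the speech sample-by-sample."""
    p_s = sum(v * v for v in speech) / len(speech)   # mean speech power
    p_n = sum(v * v for v in noise) / len(noise)     # mean noise power
    gain = math.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Lower (or negative) snr_db values produce progressively more adverse listening conditions than the +10 dB used in Study 2.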

