Performance vs. hardware requirements in state-of-the-art automatic speech recognition

Alexandru-Lucian Georgescu; Alessandro Pappalardo; Horia Cucu; Michaela Blott

doi:10.1186/s13636-021-00217-4

Performance vs. hardware requirements in state-of-the-art automatic speech recognition

EURASIP Journal on Audio Speech and Music Processing ◽

10.1186/s13636-021-00217-4 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Alexandru-Lucian Georgescu ◽

Alessandro Pappalardo ◽

Horia Cucu ◽

Michaela Blott

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

State Of The Art ◽

Decision Makers ◽

Computing Power ◽

Trade Off ◽

Speech Features ◽

Commercial Applications ◽

Guided Tour ◽

Embedded Applications

AbstractThe last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems, that modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems, that translate the raw waveform directly into words using one deep neural network (DNN). The transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. However, few of the existing ASR technologies are suitable for integration in embedded applications, due to their hard constrains related to computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to further serve decision makers in choosing the system which fits best their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.

Download Full-text

An Experimental Analysis of Speech Features for Tone Speech Recognition

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b7748.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 4355-4360

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Recognition System ◽

Arunachal Pradesh ◽

Automatic Speech Recognition System ◽

North East India ◽

Related Information ◽

North East ◽

Speech Features ◽

Commercial Applications

Recently Automatic Speech Recognition (ASR) has been successfully integrated in many commercial applications. These applications are performing significantly well in relatively controlled acoustical environments. However, the performance of an Automatic Speech Recognition system developed for non-tonal languages degrades considerably when tested for tonal languages. One of the main reason for this performance degradation is the non-consideration of tone related information in the feature set of the ASR systems developed for non-tonal languages. In this paper we have investigated the performance of commonly used feature for tonal speech recognition. A model has been proposed for extracting features for tonal speech recognition. A statistical analysis has been done to evaluate the performance of proposed feature set with reference to the Apatani language of Arunachal Pradesh of North-East India, which is a tonal language of Tibeto-Burman group of languages.

Download Full-text

Evaluation of the effectiveness and efficiency of state-of-the-art features and models for automatic speech recognition error detection

Journal Of Big Data ◽

10.1186/s40537-020-00391-w ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Asmaa El Hannani ◽

Rahhal Errattahi ◽

Fatima Zahra Salmam ◽

Thomas Hain ◽

Hassan Ouahmane

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Error Detection ◽

State Of The Art ◽

Rapid Development ◽

Unified Framework ◽

Human Machine Interaction ◽

Detection Analysis ◽

Extensive Evaluation ◽

Effectiveness And Efficiency

AbstractSpeech based human-machine interaction and natural language understanding applications have seen a rapid development and wide adoption over the last few decades. This has led to a proliferation of studies that investigate Error detection and classification in Automatic Speech Recognition (ASR) systems. However, different data sets and evaluation protocols are used, making direct comparisons of the proposed approaches (e.g. features and models) difficult. In this paper we perform an extensive evaluation of the effectiveness and efficiency of state-of-the-art approaches in a unified framework for both errors detection and errors type classification. We make three primary contributions throughout this paper: (1) we have compared our Variant Recurrent Neural Network (V-RNN) model with three other state-of-the-art neural based models, and have shown that the V-RNN model is the most effective classifier for ASR error detection in term of accuracy and speed, (2) we have compared four features’ settings, corresponding to different categories of predictor features and have shown that the generic features are particularly suitable for real-time ASR error detection applications, and (3) we have looked at the post generalization ability of our error detection framework and performed a detailed post detection analysis in order to perceive the recognition errors that are difficult to detect.

Download Full-text

Noise-robust algorithm of speech features extraction for automatic speech recognition system

2016 XIX IEEE International Conference on Soft Computing and Measurements (SCM) ◽

10.1109/scm.2016.7519729 ◽

2016 ◽

Author(s):

A. N. Yakhnev ◽

A. S. Pisarev

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Recognition System ◽

Features Extraction ◽

Speech Recognition System ◽

Automatic Speech Recognition System ◽

Robust Algorithm ◽

Speech Features ◽

Noise Robust

Download Full-text

Audio-visual automatic speech recognition and related bimodal speech technologies: A review of the state-of-the-art and open problems

2009 IEEE Workshop on Automatic Speech Recognition & Understanding ◽

10.1109/asru.2009.5373530 ◽

2009 ◽

Cited By ~ 2

Author(s):

Gerasimos Potamianos

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

State Of The Art ◽

The State ◽

Open Problems

Download Full-text

Generating Robust Audio Adversarial Examples with Temporal Dependency

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/438 ◽

2020 ◽

Author(s):

Hongting Zhang ◽

Pan Zhou ◽

Qiben Yan ◽

Xiao-Yang Liu

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Defense Mechanisms ◽

User Study ◽

State Of The Art ◽

Temporal Structure ◽

Human Perception ◽

Experimental Results ◽

Low Intensity ◽

Adversarial Examples

Audio adversarial examples, imperceptible to humans, have been constructed to attack automatic speech recognition (ASR) systems. However, the adversarial examples generated by existing approaches usually incorporate noticeable noises, especially during the periods of silences and pauses. Moreover, the added noises often break temporal dependency property of the original audio, which can be easily detected by state-of-the-art defense mechanisms. In this paper, we propose a new Iterative Proportional Clipping (IPC) algorithm that preserves temporal dependency in audios for generating more robust adversarial examples. We are motivated by an observation that the temporal dependency in audios imposes a significant effect on human perception. Following our observation, we leverage a proportional clipping strategy to reduce noise during the low-intensity periods. Experimental results and user study both suggest that the generated adversarial examples can significantly reduce human-perceptible noises and resist the defenses based on the temporal structure.

Download Full-text

Speech Feature Evaluation for Bangla Automatic Speech Recognition

Technical Challenges and Design Issues in Bangla Language Processing ◽

10.4018/978-1-4666-3970-6.ch009 ◽

2013 ◽

pp. 169-208 ◽

Cited By ~ 1

Author(s):

Mohammed Rokibul Alam Kotwal ◽

Foyzul Hassan ◽

Mohammad Nurul Huda

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Recognition Performance ◽

Dynamic Parameters ◽

Mel Frequency Cepstral Coefficients ◽

Phoneme Recognition ◽

Hybrid Features ◽

Feature Evaluation ◽

Speech Feature ◽

Speech Features

This chapter presents Bangla (widely known as Bengali) Automatic Speech Recognition (ASR) techniques by evaluating the different speech features, such as Mel Frequency Cepstral Coefficients (MFCCs), Local Features (LFs), phoneme probabilities extracted by time delay artificial neural networks of different architectures. Moreover, canonicalization of speech features is also performed for Gender-Independent (GI) ASR. In the canonicalization process, the authors have designed three classifiers by male, female, and GI speakers, and extracted the output probabilities from these classifiers for measuring the maximum. The maximization of output probabilities for each speech file provides higher correctness and accuracies for GI speech recognition. Besides, dynamic parameters (velocity and acceleration coefficients) are also used in the experiments for obtaining higher accuracy in phoneme recognition. From the experiments, it is also shown that dynamic parameters with hybrid features also increase the phoneme recognition performance in a certain extent. These parameters not only increase the accuracy of the ASR system, but also reduce the computation complexity of Hidden Markov Model (HMM)-based classifiers with fewer mixture components.

Download Full-text

Automatic Speech Recognition System for Tonal Languages: State-of-the-Art Survey

Archives of Computational Methods in Engineering ◽

10.1007/s11831-020-09414-4 ◽

2020 ◽

Author(s):

Jaspreet Kaur ◽

Amitoj Singh ◽

Virender Kadyan

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

State Of The Art ◽

Recognition System ◽

Speech Recognition System ◽

Automatic Speech Recognition System

Download Full-text

Using Privacy-Transformed Speech in the Automatic Speech Recognition Acoustic Model Training

Frontiers in Artificial Intelligence and Applications - Human Language Technologies – The Baltic Perspective ◽

10.3233/faia200601 ◽

2020 ◽

Author(s):

Askars Salimbajevs

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

State Of The Art ◽

Speaker Verification ◽

Voice Conversion ◽

Acoustic Model ◽

Acoustic Models ◽

Speech Data ◽

Model Training ◽

The Voice

Automatic Speech Recognition (ASR) requires huge amounts of real user speech data to reach state-of-the-art performance. However, speech data conveys sensitive speaker attributes like identity that can be inferred and exploited for malicious purposes. Therefore, there is an interest in the collection of anonymized speech data that is processed by some voice conversion method. In this paper, we evaluate one of the voice conversion methods on Latvian speech data and also investigate if privacy-transformed data can be used to improve ASR acoustic models. Results show the effectiveness of voice conversion against state-of-the-art speaker verification models on Latvian speech and the effectiveness of using privacy-transformed data in ASR training.

Download Full-text

Automatic Speech Recognition System: A Survey Report

Science & Technology Journal ◽

10.22232/stj.2016.04.02.10 ◽

2016 ◽

Vol 4 (2) ◽

pp. 152-155

Author(s):

Moirangthem Tiken Singh ◽

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Recognition System ◽

Speech Recognition System ◽

Mel Frequency Cepstral Coefficients ◽

Automatic Speech Recognition System ◽

Indian Language ◽

System A ◽

Survey Report ◽

Speech Features

This paper presents a report on an Automatic Speech Recognition System (ASR) for different Indian language under different accent. The paper is a comparative study of the performance of system developed which uses Hidden Markov Model (HMM) as the classifier and Mel-Frequency Cepstral Coefficients (MFCC) as speech features.

Download Full-text

Computer-Assisted Interpreting Tools (CAI) and options for automation with Automatic Speech Recognition

Tradterm ◽

10.11606/issn.2317-9511.v32i0p9-31 ◽

2018 ◽

Vol 32 ◽

pp. 9-31

Author(s):

Luis Eduardo Schild Ortiz ◽

Patrizia Cavallo

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

New Technologies ◽

State Of The Art ◽

The State ◽

Current Level ◽

Computer Assisted ◽

Future Perspectives ◽

Potential Benefits

In recent years, several studies have indicated interpreters resist adopting new technologies. Yet, such technologies have enabled the development of several tools to help those professionals. In this paper, using bibliographical and documental research, we briefly analyse the tools cited by several authors to identify which ones remain up to date and available on the market. Following that, we present concepts about automation, and observe the usage of automatic speech recognition (ASR), while analysing its potential benefits and the current level of maturity of such an approach, especially regarding Computer-Assisted Interpreting (CAI) tools. The goal of this paper is to present the community of interpreters and researchers with a view of the state of the art in technology for interpreting as well as some future perspectives for this area.

Download Full-text