Speech Emotion Feature Extraction Using FFT Spectrum Analysis

2015 ◽  
Vol 781 ◽  
pp. 551-554 ◽  
Author(s):  
Chaidiaw Thiangtham ◽  
Jakkree Srinonchat

Speech emotion recognition has been widely researched and applied in areas such as human–robot communication, e-learning systems, and emergency-call handling. Speech emotion feature extraction is a key step in achieving speech emotion recognition, which can also support personal identification. Speech emotion features are typically extracted as coefficient sets such as Linear Predictive Coefficients (LPC), Line Spectral Frequencies (LSF), Zero-Crossing (ZC), and Mel-Frequency Cepstrum Coefficients (MFCC) [1-6]. Several studies have addressed speech emotion recognition. A study of zero-crossing with peak amplitudes for speech emotion classification is introduced in [4]; the results show that it extracts emotion features in the time domain but still suffers from amplitude shifting. Emotion recognition from speech is described in [5], which uses a Gaussian Mixture Model (GMM) as the speech feature extractor. The GMM gives good results in reducing background noise, but random noise in the GMM recognition model still requires attention. Speech emotion recognition using a hidden Markov model and a support vector machine is explained in [6]; the reported average recognition performance shows that the extracted speech emotion features still carry erroneous information. Thus, the recognition performance reported in [1-6] indicates that speech features still require further attention.
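
A minimal NumPy sketch of two of the features named in this overview, the zero-crossing rate and an FFT magnitude spectrum, computed per frame; this is an illustration of the general technique, not the exact pipelines of [1-6], and the frame length and hop size are illustrative choices.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D speech signal into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of sign changes per frame (a time-domain emotion cue)."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def fft_spectrum(frames):
    """Magnitude spectrum of each Hamming-windowed frame."""
    windowed = frames * np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(windowed, axis=1))

# Synthetic signal standing in for 1 s of 16 kHz speech.
speech = np.random.randn(16000)
frames = frame_signal(speech)
zcr = zero_crossing_rate(frames)   # shape: (n_frames,)
spectrum = fft_spectrum(frames)    # shape: (n_frames, 201)
```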

2016 ◽  
Vol 7 (1) ◽  
pp. 58-68 ◽  
Author(s):  
Imen Trabelsi ◽  
Med Salim Bouhlel

Automatic Speech Emotion Recognition (SER) is a current research topic in the field of Human–Computer Interaction (HCI) with a wide range of applications. The purpose of a speech emotion recognition system is to automatically classify a speaker's utterances into different emotional states such as disgust, boredom, sadness, neutral, and happiness. The speech samples in this paper are from the Berlin emotional database. Mel-Frequency Cepstrum Coefficients (MFCC), Linear Prediction Coefficients (LPC), Linear Prediction Cepstrum Coefficients (LPCC), Perceptual Linear Prediction (PLP), and Relative Spectral Perceptual Linear Prediction (RASTA-PLP) features are used to characterize the emotional utterances using a combination of Gaussian mixture models (GMM) and Support Vector Machines (SVM) based on the Kullback–Leibler divergence kernel. In this study, the effect of feature type and its dimension is comparatively investigated. The best results are obtained with 12-coefficient MFCC. Utilizing the proposed features, a recognition rate of 84% is achieved, which is close to the performance of humans on this database.
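
A minimal sketch, assuming librosa and scikit-learn are available, of the 12-coefficient MFCC front end with a per-utterance Gaussian mixture model; the EmoDB file path is a placeholder and the component count is an illustrative choice, not the paper's configuration.

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def utterance_gmm(wav_path, n_mfcc=12, n_components=8):
    """Fit a diagonal-covariance GMM to the MFCC frames of one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, 12)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(mfcc)
    return gmm

# gmm = utterance_gmm("emodb/wav/03a01Fa.wav")  # placeholder EmoDB file name
```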


2020 ◽  
pp. 1-15
Author(s):  
Wang Wei ◽  
Xinyi Cao ◽  
He Li ◽  
Lingjie Shen ◽  
Yaqin Feng ◽  
...  

To improve speech emotion recognition, a U-AWED features model based on an acoustic words emotion dictionary (AWED) is proposed. The method models emotional information at the acoustic-word level within each emotion class. The top-listed words in each emotion are selected to generate the AWED vector. Then, the U-AWED model is constructed by combining utterance-level acoustic features with the AWED features. A support vector machine and a convolutional neural network are employed as the classifiers in our experiment. The results show that the proposed method provides a significant improvement in unweighted average recall in all four emotion-classification tasks.
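
A hedged sketch of the dictionary idea described above: rank words by frequency within each emotion class of the training transcripts, keep the top-listed words, and represent an utterance by its counts over that dictionary. The selection criterion, `top_k`, and the toy data are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def build_awed(transcripts, labels, top_k=2):
    """Select the top-k most frequent words per emotion class."""
    per_emotion = {}
    for text, emo in zip(transcripts, labels):
        per_emotion.setdefault(emo, Counter()).update(text.lower().split())
    dictionary = []
    for emo, counts in per_emotion.items():
        dictionary += [w for w, _ in counts.most_common(top_k)]
    return sorted(set(dictionary))

def awed_vector(text, dictionary):
    """Count-based AWED feature vector for one utterance."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in dictionary]

train = ["i am so happy today", "this is terrible and sad", "so so happy"]
emos = ["happiness", "sadness", "happiness"]
awed = build_awed(train, emos)
print(awed_vector("i feel happy but a little sad", awed))
```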


Electronics ◽  
2021 ◽  
Vol 10 (23) ◽  
pp. 2891
Author(s):  
Shihan Huang ◽  
Hua Dang ◽  
Rongkun Jiang ◽  
Yue Hao ◽  
Chengbo Xue ◽  
...  

Speech Emotion Recognition (SER) plays a significant role in the field of Human–Computer Interaction (HCI) with a wide range of applications. However, there are still some issues in practical application. One is the difference in emotional expression among individuals, and another is that some hard-to-distinguish emotions may reduce the stability of the SER system. In this paper, we propose a multi-layer hybrid fuzzy support vector machine (MLHF-SVM) model, which includes three layers: a feature extraction layer, a pre-classification layer, and a classification layer. The MLHF-SVM model addresses the above issues with fuzzy c-means (FCM) clustering based on speaker identification information and with multi-layer SVM classifiers, respectively. In addition, to overcome the weakness that FCM tends to fall into local minima, an improved natural exponential inertia weight particle swarm optimization (IEPSO) algorithm is proposed and integrated with fuzzy c-means for optimization. Moreover, in the feature extraction layer, non-personalized and personalized features are combined to improve accuracy. To verify the effectiveness of the proposed model, all emotions in three popular datasets are used for simulation. The results show that this model can effectively improve the classification success rate, and the maximum single-emotion recognition rate is 97.67% on the EmoDB dataset.
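
A minimal NumPy sketch of standard fuzzy c-means, the pre-classification component named above; the IEPSO initialization is not reproduced here, only the membership and centroid updates. The fuzzifier m=2, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)                 # fuzzy memberships
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted centroids
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)      # membership update
    return centers, U

X = np.random.randn(200, 4)                           # stand-in feature vectors
centers, memberships = fuzzy_c_means(X, n_clusters=3)
```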


2020 ◽  
pp. 283-293
Author(s):  
Imen Trabelsi ◽  
Med Salim Bouhlel

Automatic Speech Emotion Recognition (SER) is a current research topic in the field of Human–Computer Interaction (HCI) with a wide range of applications. The purpose of a speech emotion recognition system is to automatically classify a speaker's utterances into different emotional states such as disgust, boredom, sadness, neutral, and happiness. The speech samples in this paper are from the Berlin emotional database. Mel-Frequency Cepstrum Coefficients (MFCC), Linear Prediction Coefficients (LPC), Linear Prediction Cepstrum Coefficients (LPCC), Perceptual Linear Prediction (PLP), and Relative Spectral Perceptual Linear Prediction (RASTA-PLP) features are used to characterize the emotional utterances using a combination of Gaussian mixture models (GMM) and Support Vector Machines (SVM) based on the Kullback–Leibler divergence kernel. In this study, the effect of feature type and its dimension is comparatively investigated. The best results are obtained with 12-coefficient MFCC. Utilizing the proposed features, a recognition rate of 84% is achieved, which is close to the performance of humans on this database.
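
Complementing the MFCC/GMM front-end sketch shown earlier, here is a hedged sketch of the kernel-SVM back end this entry describes, simplified to a single diagonal Gaussian per utterance instead of a full GMM: a symmetric KL divergence between utterance models is turned into a precomputed SVM kernel via exp(-KL). The gamma value, helper names, and toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def gauss_stats(frames):
    """Per-utterance mean and variance of the MFCC frames."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def sym_kl(stat_a, stat_b):
    """Symmetric KL divergence between two diagonal Gaussians."""
    (m1, v1), (m2, v2) = stat_a, stat_b
    kl_pq = 0.5 * np.sum(v1 / v2 + (m2 - m1) ** 2 / v2 - 1 + np.log(v2 / v1))
    kl_qp = 0.5 * np.sum(v2 / v1 + (m1 - m2) ** 2 / v1 - 1 + np.log(v1 / v2))
    return kl_pq + kl_qp

def kl_kernel(stats_a, stats_b, gamma=0.01):
    return np.array([[np.exp(-gamma * sym_kl(a, b)) for b in stats_b]
                     for a in stats_a])

# Toy data: 20 "utterances" of random 12-dimensional MFCC frames, 2 emotions.
utts = [np.random.randn(50, 12) for _ in range(20)]
labels = [i % 2 for i in range(20)]
stats = [gauss_stats(u) for u in utts]
svm = SVC(kernel="precomputed").fit(kl_kernel(stats, stats), labels)
```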


2015 ◽  
Vol 6 (2) ◽  
pp. 57-68 ◽  
Author(s):  
Imen Trabelsi ◽  
Med Salim Bouhlel

Speech emotion recognition is an indispensable requirement for efficient human–machine interaction. Most modern automatic speech emotion recognition systems use Gaussian mixture models (GMM) and Support Vector Machines (SVM). GMMs are known for their performance and scalability in spectral modeling, while SVMs are known for their discriminative power. A GMM supervector characterizes an emotional style by the GMM parameters (mean vectors, covariance matrices, and mixture weights). GMM-supervector SVM benefits from both the GMM and SVM frameworks. In this paper, the GMM-UBM mean interval (GUMI) kernel based on the Bhattacharyya distance is successfully used. CfsSubsetEval combined with best-first and greedy stepwise search was also applied to the supervector space in order to select the most important features. This framework is illustrated using Mel-frequency cepstral coefficients (MFCC) and Perceptual Linear Prediction (PLP) features on two different emotional databases, namely the Surrey Audio-Visual Expressed Emotion database and the Berlin Emotional Speech Database.
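
A minimal sketch of the GMM-supervector construction this entry builds on: a universal background model (UBM) is fitted with scikit-learn, its means are MAP-adapted to one utterance, and the adapted means are stacked into a single vector. The Bhattacharyya-based GUMI kernel and the CfsSubsetEval selection step are not reproduced; the relevance factor, feature dimension, and random data are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the UBM means to one utterance and stack them."""
    post = ubm.predict_proba(frames)                  # (frames, components)
    n_k = post.sum(axis=0)                            # soft counts per component
    ex_k = post.T @ frames                            # component-wise sums
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted = alpha * (ex_k / np.maximum(n_k, 1e-8)[:, None]) \
              + (1 - alpha) * ubm.means_
    return adapted.ravel()                            # supervector

ubm = GaussianMixture(n_components=16, covariance_type="diag",
                      random_state=0).fit(np.random.randn(4000, 13))
sv = gmm_supervector(ubm, np.random.randn(300, 13))   # length 16 * 13 = 208
```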


Sensors ◽  
2019 ◽  
Vol 19 (12) ◽  
pp. 2730 ◽  
Author(s):  
Wei Jiang ◽  
Zheng Wang ◽  
Jesse S. Jin ◽  
Xianfeng Han ◽  
Chunguang Li

Automatic speech emotion recognition is a challenging task due to the gap between acoustic features and human emotions, and it relies strongly on the discriminative acoustic features extracted for a given recognition task. In this work, we propose a novel deep neural architecture to extract informative feature representations from heterogeneous acoustic feature groups, which may contain redundant and unrelated information that lowers emotion recognition performance. After obtaining the informative features, a fusion network is trained to jointly learn the discriminative acoustic feature representation, and a Support Vector Machine (SVM) is used as the final classifier for the recognition task. Experimental results on the IEMOCAP dataset demonstrate that the proposed architecture improves recognition performance, achieving an accuracy of 64% compared to existing state-of-the-art approaches.
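
A hedged PyTorch sketch of the fusion idea described above: each heterogeneous acoustic feature group gets its own small encoder, the encodings are fused by concatenation, and the fused representation would later be handed to an SVM. The class name, layer sizes, and group dimensions are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, group_dims, hidden=64, n_emotions=4):
        super().__init__()
        # One small encoder per acoustic feature group.
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in group_dims])
        self.head = nn.Linear(hidden * len(group_dims), n_emotions)

    def forward(self, groups):
        fused = torch.cat([enc(g) for enc, g in zip(self.encoders, groups)], dim=1)
        return self.head(fused), fused   # logits for training, embedding for SVM

net = FusionNet(group_dims=[384, 88, 39])
groups = [torch.randn(8, d) for d in (384, 88, 39)]    # a batch of 8 utterances
logits, embedding = net(groups)                        # embedding: (8, 192)
```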


2015 ◽  
Vol 2015 ◽  
pp. 1-10
Author(s):  
Xinran Zhang ◽  
Huawei Tao ◽  
Cheng Zha ◽  
Xinzhou Xu ◽  
Li Zhao

The speech emotion classification method proposed in this paper is based on a Student's t-mixture model with an infinite component number (iSMM) and can directly perform effective recognition for various kinds of speech emotion samples. Compared with the traditional GMM (Gaussian mixture model), a speech emotion model based on a Student's t-mixture can effectively handle speech sample outliers that exist in the emotion feature space. Moreover, the t-mixture model remains robust to atypical emotion test data. To address the high data complexity caused by the high-dimensional space and the problem of insufficient training samples, a global latent space is incorporated into the emotion model. This approach allows the number of components to be unbounded and forms the iSMM emotion model, which can automatically determine the best number of components with lower complexity to classify various kinds of emotion-characteristic data. Evaluated on one spontaneous (FAU Aibo Emotion Corpus) and two acted (DES and EMO-DB) universal speech emotion databases, which have high-dimensional feature samples and diverse data distributions, the iSMM maintains better recognition performance than the compared methods. Its effectiveness and its generalization to high-dimensional data and outliers are thereby verified, establishing the iSMM emotion model as a robust method.
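
A short SciPy sketch of the robustness property this model exploits: the log-likelihood of an outlying feature vector collapses far faster under a Gaussian component than under a heavy-tailed Student's t component. The dimension, degrees of freedom, and test points are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal, multivariate_t

dim = 4
mean, cov, df = np.zeros(dim), np.eye(dim), 3.0
inlier = 0.5 * np.ones(dim)
outlier = 6.0 * np.ones(dim)

for name, x in [("inlier", inlier), ("outlier", outlier)]:
    lg = multivariate_normal.logpdf(x, mean=mean, cov=cov)
    lt = multivariate_t.logpdf(x, loc=mean, shape=cov, df=df)
    print(f"{name:7s}  Gaussian: {lg:8.2f}   Student t: {lt:8.2f}")
# The heavy-tailed t keeps outlier log-likelihoods bounded, so the mixture's
# responsibilities (and hence the fitted emotion model) are less distorted.
```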


2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Hariharan Muthusamy ◽  
Kemal Polat ◽  
Sazali Yaacob

Recently, researchers have paid escalating attention to studying the emotional state of an individual from his/her speech signals, as the speech signal is the fastest and most natural method of communication between individuals. In this work, a new feature enhancement using a Gaussian mixture model (GMM) was proposed to enhance the discriminatory power of the features extracted from speech and glottal signals. Three different emotional speech databases were utilized to evaluate the proposed methods. Extreme learning machine (ELM) and k-nearest neighbor (kNN) classifiers were employed to classify the different types of emotions. Several experiments were conducted, and the results show that the proposed methods significantly improved speech emotion recognition performance compared to research works published in the literature.
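
A minimal NumPy sketch of an extreme learning machine of the kind used above as a classifier: a fixed random hidden layer followed by a least-squares output layer. The GMM-based feature enhancement step is not reproduced, and the data below are synthetic stand-ins for the enhanced feature vectors.

```python
import numpy as np

class ELM:
    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)          # random hidden activations
        T = np.eye(y.max() + 1)[y]                # one-hot targets
        self.beta = np.linalg.pinv(H) @ T         # least-squares output weights
        return self

    def predict(self, X):
        return np.argmax(np.tanh(X @ self.W + self.b) @ self.beta, axis=1)

X = np.random.randn(300, 20)
y = np.random.randint(0, 3, size=300)
print((ELM().fit(X, y).predict(X) == y).mean())   # training accuracy
```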


Sensors ◽  
2021 ◽  
Vol 21 (15) ◽  
pp. 5135
Author(s):  
Ngoc-Dau Mai ◽  
Boon-Giin Lee ◽  
Wan-Young Chung

In this research, we develop an affective computing method based on machine learning for emotion recognition using a wireless protocol and a wearable electroencephalography (EEG) custom-designed device. The system collects EEG signals using an eight-electrode placement on the scalp: two of these electrodes are placed over the frontal lobe, and the other six over the temporal lobe. We performed experiments on eight subjects while they watched emotive videos. Six entropy measures were employed to extract suitable features from the EEG signals. Next, we evaluated our proposed models using three popular classifiers: a support vector machine (SVM), a multi-layer perceptron (MLP), and a one-dimensional convolutional neural network (1D-CNN); both subject-dependent and subject-independent strategies were used. Our experimental results showed that the highest average accuracies achieved in the subject-dependent and subject-independent cases were 85.81% and 78.52%, respectively; these accuracies were achieved using a combination of the sample entropy measure and the 1D-CNN. Moreover, through electrode selection, our study identifies the T8 position (above the right ear) over the temporal lobe as the most critical channel among the proposed measurement positions for emotion classification. Our results demonstrate the feasibility and efficiency of our proposed EEG-based affective computing method for emotion recognition in real-world applications.
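
A hedged NumPy sketch of one of the entropy measures named above, sample entropy, computed for a single EEG channel. The template length m and tolerance factor are conventional choices, not necessarily those used in the paper, and the input is a synthetic stand-in for one EEG epoch.

```python
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    x = np.asarray(x, dtype=float)
    r = r_factor * x.std()                             # tolerance

    def count_matches(length):
        templates = np.array([x[i:i + length] for i in range(len(x) - length)])
        # Chebyshev distance between every pair of templates.
        d = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        return (d <= r).sum() - len(templates)          # exclude self-matches

    B, A = count_matches(m), count_matches(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

eeg_channel = np.random.randn(500)                      # stand-in for one EEG epoch
print(sample_entropy(eeg_channel))
```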

