Automatic detection of prosodic boundaries in spontaneous speech

PLoS ONE ◽  
2021 ◽  
Vol 16 (5) ◽  
pp. e0250969
Author(s):  
Tirza Biron ◽  
Daniel Baum ◽  
Dominik Freche ◽  
Nadav Matalon ◽  
Netanel Ehrmann ◽  
...  

Automatic speech recognition (ASR) and natural language processing (NLP) are expected to benefit from an effective, simple, and reliable method to automatically parse conversational speech. The ability to parse conversational speech depends crucially on the ability to identify boundaries between prosodic phrases. This is done naturally by the human ear, yet has proved surprisingly difficult to achieve reliably and simply in an automatic manner. Efforts to date have focused on detecting phrase boundaries using a variety of linguistic and acoustic cues. We propose a method which does not require model training and utilizes two prosodic cues that are based on ASR output. Boundaries are identified using discontinuities in speech rate (pre-boundary lengthening and phrase-initial acceleration) and silent pauses. The resulting phrases preserve syntactic validity, exhibit pitch reset, and compare well with manual tagging of prosodic boundaries. Collectively, our findings support the notion of prosodic phrases that represent coherent patterns across textual and acoustic parameters.
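The two cues described above (speech-rate discontinuities and silent pauses over ASR output) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Word` layout, thresholds, and the per-character approximation of speech rate are all assumptions.

```python
# Sketch: hypothesize prosodic boundaries from ASR word timings using
# (a) silent pauses and (b) pre-boundary lengthening followed by
# phrase-initial acceleration. Thresholds are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

def boundary_candidates(words: List[Word],
                        min_pause: float = 0.2,
                        slowdown_ratio: float = 1.5) -> List[int]:
    """Return indices i such that a boundary is hypothesized after words[i].

    A boundary is posited when the inter-word silence exceeds `min_pause`,
    or when word i is markedly slower (longer per-character duration) than
    word i+1, approximating pre-boundary lengthening plus phrase-initial
    acceleration.
    """
    idx = []
    for i in range(len(words) - 1):
        pause = words[i + 1].start - words[i].end
        rate_i = (words[i].end - words[i].start) / max(len(words[i].text), 1)
        rate_next = (words[i + 1].end - words[i + 1].start) / max(len(words[i + 1].text), 1)
        if pause > min_pause or rate_i > slowdown_ratio * rate_next:
            idx.append(i)
    return idx

words = [Word("okay", 0.0, 0.5), Word("so", 0.75, 0.85),
         Word("we", 0.90, 1.00), Word("went", 1.02, 1.30),
         Word("home", 1.32, 1.44)]
print(boundary_candidates(words))  # → [0, 3]
```

Index 0 is flagged by the 0.25 s pause after "okay"; index 3 is flagged because "went" is stretched relative to the fast-starting "home".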

2018 ◽  
Vol 26 (4) ◽  
pp. 1455 ◽  
Author(s):  
Bárbara Helohá Falcão Teixeira ◽  
Maryualê Malvessi Mittmann

Abstract: This work presents the results of an analysis of multiple acoustic parameters for building a model for the automatic segmentation of speech into tone units. Based on a literature review, we defined sets of acoustic parameters related to the signalling of terminal and non-terminal boundaries. For each parameter we extracted a series of measurements: 6 for speech rate and rhythm, 34 for duration, 65 for fundamental frequency, 4 for intensity and 2 related to pauses. These parameters were extracted from spontaneous speech fragments that had previously been segmented into tone units manually by 14 human annotators. We used two statistical classification methods, Random Forest (RF) and Linear Discriminant Analysis (LDA), to generate models for the identification of prosodic boundaries. After several rounds of training and testing, both methods were relatively successful in identifying terminal and non-terminal boundaries. Since the LDA method predicted terminal and non-terminal boundaries more accurately than the RF method, the LDA model was further refined. As a result, the terminal-boundary model is based on 20 acoustic measurements and agrees with 80% of the boundaries identified by annotators in the speech sample. For non-terminal boundaries, we arrived at three models that, combined, agree with 98% of the boundaries identified by annotators in the sample.
Keywords: speech segmentation; prosodic boundaries; spontaneous speech.
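The RF-versus-LDA comparison described above can be sketched with scikit-learn. The features and labels below are synthetic stand-ins for the study's 111 acoustic measurements; only the model comparison pattern is illustrated.

```python
# Sketch: compare Random Forest and LDA for prosodic-boundary classification
# via cross-validation. Data are synthetic, not the study's measurements.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 20))  # stand-ins for duration, f0, intensity measures
# Synthetic rule: boundaries tend to follow long pre-boundary durations (col 0)
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)  # 1 = boundary

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("RF", RandomForestClassifier(n_estimators=100, random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy {acc:.2f}")
```

Cross-validated accuracy is the analogue of the paper's "convergence" with human annotators; with real data, the winning model would then be refined on a reduced feature set.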


2020 ◽  
Vol 9 ◽  
pp. 105-128
Author(s):  
Tommaso Raso ◽  
Bárbara Teixeira ◽  
Plínio Barbosa

Speech is segmented into intonational units marked by prosodic boundaries. This segmentation is claimed to have important consequences for syntax, information structure and cognition. This work aims both to investigate the phonetic-acoustic parameters that guide the production and perception of prosodic boundaries, and to develop models for the automatic detection of prosodic boundaries in male monological spontaneous speech in Brazilian Portuguese. Two samples were segmented into intonational units by two groups of trained annotators, and the boundaries perceived by the annotators were tagged as either terminal or non-terminal. A script was used to extract 111 phonetic-acoustic parameters from the speech signal, in right and left windows around the boundary of each phonological word. The extracted parameters comprise measures of (1) speech rate and rhythm; (2) standardized segment duration; (3) fundamental frequency; (4) intensity; and (5) silent pause. The script treats as a prosodic boundary any position at which at least 50% of the annotators indicated a boundary of the same type. Models composed of the parameters extracted by the script were trained and then improved heuristically. The models were developed from each of the two samples and from the whole data set, using both unbalanced and balanced data, with the Linear Discriminant Analysis algorithm. The models for terminal boundaries show much higher performance than those for non-terminal ones. In this paper we (i) present the methodological procedures, (ii) analyze the different models, and (iii) discuss some strategies that could improve our results.
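The 50%-agreement labelling rule described above can be sketched directly. The annotator data here are illustrative; the tag names `'T'`/`'NT'` are assumptions standing in for terminal and non-terminal boundary marks.

```python
# Sketch of the labelling rule: a position counts as a prosodic boundary
# only if at least 50% of annotators marked a boundary of the same type.
from collections import Counter

def majority_boundary(tags, threshold=0.5):
    """tags: one label per annotator at a given phonological-word boundary,
    each 'T' (terminal), 'NT' (non-terminal) or None (no boundary heard).
    Returns the agreed label, or None if no type reaches the threshold."""
    counts = Counter(t for t in tags if t is not None)
    for label, c in counts.items():
        if c / len(tags) >= threshold:
            return label
    return None

# Example with 4 annotators per position
positions = [
    ['T', 'T', 'T', None],     # 75% terminal        -> 'T'
    ['NT', 'NT', None, None],  # 50% non-terminal    -> 'NT'
    ['T', 'NT', None, None],   # no type reaches 50% -> None
]
print([majority_boundary(p) for p in positions])  # ['T', 'NT', None]
```

Positions surviving this filter become the positive examples on which the LDA models are trained.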


Phonology ◽  
2018 ◽  
Vol 35 (1) ◽  
pp. 79-114 ◽  
Author(s):  
Alessandro Vietti ◽  
Birgit Alber ◽  
Barbara Vogt

In the Southern Bavarian variety of Tyrolean, laryngeal contrasts undergo a typologically interesting process of neutralisation in word-initial position. We undertake an acoustic analysis of Tyrolean stops in word-initial, word-medial intersonorant and word-final contexts, as well as in obstruent clusters, investigating the role of the acoustic parameters VOT, prevoicing, closure duration and F0 and H1–H2* on following vowels in implementing contrast, if any. Results show that stops contrast word-medially via [voice] (supported by the acoustic cues of closure duration and F0), and are neutralised completely in word-final position and in obstruent clusters. Word-initially, neutralisation is subject to inter- and intraspeaker variability, and is sensitive to place of articulation. Aspiration plays no role in implementing laryngeal contrasts in Tyrolean.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e10736
Author(s):  
Kaja Wierucka ◽  
Michelle D. Henley ◽  
Hannah S. Mumby

The ability to recognize conspecifics plays a pivotal role in animal communication systems. It is especially important for establishing and maintaining associations among individuals of social, long-lived species, such as elephants. While research on female elephant sociality and communication is prevalent, until recently male elephants have been considered far less social than females. This resulted in a dearth of information about their communication and recognition abilities. With new knowledge about the intricacies of the male elephant social structure come questions regarding the communication basis that allows for social bonds to be established and maintained. By analyzing the acoustic parameters of social rumbles recorded over 1.5 years from wild, mature, male African savanna elephants (Loxodonta africana) we expand current knowledge about the information encoded within these vocalizations and their potential to facilitate individual recognition. We showed that social rumbles are individually distinct and stable over time and therefore provide an acoustic basis for individual recognition. Furthermore, our results revealed that different frequency parameters contribute to individual differences of these vocalizations.
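A common way to test whether calls are individually distinct, as claimed above, is to ask whether a cross-validated classifier can assign calls to the correct individual above chance from their acoustic parameters. The sketch below uses synthetic data and LDA; the paper's actual feature set and statistics may differ.

```python
# Sketch: test individual distinctiveness of rumbles by classifying synthetic
# calls to individuals from frequency parameters. Illustrative only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_elephants, calls_each = 5, 30
# Each individual gets its own mean for 6 frequency-related parameters
# (e.g. f0 mean, f0 range, spectral measures) -- an assumption for this demo
means = rng.normal(0, 2, size=(n_elephants, 6))
X = np.vstack([rng.normal(means[i], 1.0, size=(calls_each, 6))
               for i in range(n_elephants)])
y = np.repeat(np.arange(n_elephants), calls_each)

acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f} (chance = {1/n_elephants:.2f})")
```

Accuracy well above chance indicates that the parameters carry an individual signature; repeating the test on calls from different recording periods addresses stability over time.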


2020 ◽  
Vol 10 (14) ◽  
pp. 4711 ◽  
Author(s):  
Zongmin Li ◽  
Qi Zhang ◽  
Yuhong Wang ◽  
Shihang Wang

One prominent dark side of online information behavior is the spreading of rumors. Feature analysis and crowd identification of social media rumor refuters based on machine learning methods can shed light on the rumor refutation process. This paper analyzed the association between user features and rumor-refuting behavior in five main rumor categories: economics, society, disaster, politics, and military. Natural language processing (NLP) techniques were applied to quantify each user's sentiment tendency and recent interests. These results were then combined with other personalized features to train an XGBoost classification model with which potential refuters can be identified. Information from 58,807 Sina Weibo users (including their 646,877 microblogs) in the five anti-rumor microblog categories was collected for model training and feature analysis. The results revealed significant differences between rumor stiflers and refuters, as well as between refuters for different categories. Refuters tended to be more active on social media, and a large proportion of them were located in more developed regions. Tweeting history was a vital reference as well: refuters showed higher interest in topics related to the rumor-refuting message. Features such as gender, age, user labels and sentiment tendency also varied between refuters across categories.
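The classification step can be sketched as follows. scikit-learn's `GradientBoostingClassifier` stands in for XGBoost here, and the features and synthetic labels are illustrative assumptions, not the study's data.

```python
# Sketch: flag likely rumor refuters from user features with a
# gradient-boosted tree model (stand-in for XGBoost). Data are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 600
activity = rng.poisson(5, n)              # posts per week
developed_region = rng.integers(0, 2, n)  # 1 if user is in a developed region
topic_interest = rng.random(n)            # similarity to the rumor topic
sentiment = rng.normal(0, 1, n)           # NLP-derived sentiment tendency
X = np.column_stack([activity, developed_region, topic_interest, sentiment])

# Synthetic label echoing the paper's findings: refuters are more active,
# more topical, and concentrated in developed regions
logit = 0.3 * activity + 1.0 * developed_region + 2.0 * topic_interest - 4.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)  # 1 = refuter

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```

With real data, `clf.feature_importances_` would support the kind of feature analysis the paper reports, e.g. the weight of activity versus topical interest.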


1996 ◽  
Vol 39 (2) ◽  
pp. 278-297 ◽  
Author(s):  
Susan Nittrouer

Studies of children’s speech perception have shown that young children process speech signals differently than adults. Specifically, the relative contributions made by various acoustic parameters to some linguistic decisions seem to differ for children and adults. Such findings have led to the hypothesis that there is a developmental shift in the perceptual weighting of acoustic parameters that results from experience with a native language (i.e., the Developmental Weighting Shift). This developmental shift eventually leads the child to adopt the optimal perceptual weighting strategy for the native language being learned (i.e., one that allows the listener to make accurate decisions about the phonemic structure of his or her native language). Although this proposal has intuitive appeal, there is at least one serious challenge that can be leveled against it: Perhaps age-related differences in speech perception can more appropriately be explained by age-related differences in basic auditory-processing abilities. That is, perhaps children are not as sensitive as adults to subtle differences in acoustic structure and so make linguistic decisions based on the acoustic information that is most perceptually salient. The present study tested this hypothesis for the acoustic cues relevant to fricative identity in fricative-vowel syllables. Results indicated that 3-year-olds were not as sensitive to changes in these acoustic cues as adults are, but that these age-related differences in auditory sensitivity could not entirely account for age-related differences in perceptual weighting strategies.


2002 ◽  
Vol 11 (3) ◽  
pp. 236-242 ◽  
Author(s):  
Barbara W. Hodson ◽  
Julie A. Scherz ◽  
Kathy H. Strattman

Procedures to examine the communication abilities of a highly unintelligible 4-year-old during a 90-minute evaluation session are explained in this article. Phonology, metaphonology, speech rate, stimulability, and receptive language are evaluated formally and informally. A conversational speech sample is used to provide information for assessing intelligibility/understandability, fluency, voice quality, prosody, and mean length of response. Methods for determining treatment goals are discussed in the final section.

