Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis

Applied Sciences ◽

10.3390/app112110475 ◽

2021 ◽

Vol 11 (21) ◽

pp. 10475

Author(s):

Xiao Zhou ◽

Zhenhua Ling ◽

Yajun Hu ◽

Lirong Dai

Keyword(s):

Speech Synthesis ◽

Attention Mechanism ◽

Experimental Results ◽

Acoustic Modeling ◽

Synthetic Speech ◽

Acoustic Feature ◽

Popular Method ◽

Hidden States ◽

G2P Conversion

An encoder–decoder with attention has become a popular method to achieve sequence-to-sequence (Seq2Seq) acoustic modeling for speech synthesis. To improve the robustness of the attention mechanism, methods utilizing the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at the phrase boundaries in utterances, which challenges the assumption of strictly stepwise alignment in SMA. Therefore, this paper proposes to insert hidden states into phone sequences to handle cases in which pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states. In this method, the inserted hidden states absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options: moving forward to the next phone, staying at the same phone, or jumping to a hidden state. Experimental results show that SSMA can achieve better naturalness of synthetic speech than SMA when phrase boundaries are not available. Moreover, the pause positions derived from the alignment paths of SSMA match the manually labeled phrase boundaries quite well.
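The three options described in the abstract can be illustrated with a toy alignment update. This is only a sketch under stated assumptions, not the formulation from the paper: the function name `ssma_step`, the single hidden (pause) state placed between consecutive phones, the fixed transition probabilities, and the rule for leaving a hidden state are all hypothetical. In a trained model these probabilities would presumably be predicted per frame by the network.

```python
def ssma_step(phone, hidden, p_stay, p_next, p_hid, q_stay):
    """One decoding frame of a toy semi-stepwise alignment update.

    phone[i]  : probability that attention sits on phone i
    hidden[i] : probability that attention sits on the hypothetical
                hidden (pause) state between phone i and phone i + 1
    p_stay, p_next, p_hid : assumed probabilities of staying on a phone,
                moving to the next phone, or jumping to the hidden state
                (they must sum to 1)
    q_stay    : assumed probability of remaining in a hidden state;
                otherwise attention moves on to the next phone
    """
    n = len(phone)
    new_phone = [0.0] * n
    new_hidden = [0.0] * (n - 1)
    for i in range(n):
        if i == n - 1:
            # Nothing follows the final phone, so its mass stays in place.
            new_phone[i] += phone[i]
            continue
        new_phone[i] += p_stay * phone[i]        # option 1: stay on the same phone
        new_phone[i + 1] += p_next * phone[i]    # option 2: move forward to the next phone
        new_hidden[i] += p_hid * phone[i]        # option 3: jump to the hidden state
        new_hidden[i] += q_stay * hidden[i]      # remain in the pause
        new_phone[i + 1] += (1.0 - q_stay) * hidden[i]  # leave the pause
    return new_phone, new_hidden


# Usage: start on the first of three phones and decode a few frames.
phone, hidden = [1.0, 0.0, 0.0], [0.0, 0.0]
for frame in range(4):
    phone, hidden = ssma_step(phone, hidden, 0.5, 0.3, 0.2, 0.6)
    print(frame, [round(p, 3) for p in phone], [round(h, 3) for h in hidden])
```

Because every transition moves probability mass either in place or forward, the alignment in this sketch is monotonic by construction, and the total mass over phones and hidden states stays at 1.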

Neural Speech Synthesis with Transformer Network

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016706 ◽

2019 ◽

Vol 33 ◽

pp. 6706-6713 ◽

Cited By ~ 20

Author(s):

Naihan Li ◽

Shujie Liu ◽

Yanqing Liu ◽

Sheng Zhao ◽

Ming Liu

Keyword(s):

Speech Synthesis ◽

Attention Mechanism ◽

Neural Machine Translation ◽

Proposed Model ◽

Speed Up ◽

Low Efficiency ◽

Human Quality ◽

And Performance ◽

Hidden States ◽

Training Efficiency

Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures and the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different times are connected directly by a self-attention mechanism, which solves the long-range dependency problem effectively. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results. Experiments are conducted to test the efficiency and performance of our new network. In terms of efficiency, our Transformer TTS network trains about 4.25 times faster than Tacotron2. In terms of performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming Tacotron2 by a gap of 0.048) and is very close to human quality (4.39 vs. 4.44 in MOS).
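The claim that self-attention connects any two positions directly can be seen in a minimal sketch of scaled dot-product attention. The code below is an illustration only, not the network from the paper: it uses a single head, plain Python lists, and no learned projections, and the function names are chosen here for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Each argument is a list of vectors (lists of floats). Every query is
    scored against every key, so any two positions interact in one step
    regardless of how far apart they are in the sequence.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Usage: self-attention over a three-step toy sequence (Q = K = V).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(scaled_dot_product_attention(sequence, sequence, sequence))
```

The loop over queries has no dependence between iterations, which is the property that lets hidden states for all positions be computed in parallel rather than one time step after another as in an RNN.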

A Global-Local Blur Disentangling Network for Dynamic Scene Deblurring

Applied Sciences ◽

10.3390/app11052174 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2174

Author(s):

Xiaoguang Li ◽

Feifan Yang ◽

Jianglu Huang ◽

Li Zhuo

Keyword(s):

Local Features ◽

Attention Mechanism ◽

Experimental Results ◽

Dynamic Scene ◽

Feature Maps ◽

Training Scheme ◽

Real Scene ◽

Global And Local

Images captured in a real scene usually suffer from complex non-uniform degradation, which includes both global and local blurs. It is difficult to handle such complex blur variations with a single unified processing model. We propose a global-local blur disentangling network, which can effectively extract global and local blur features via two branches. A phased training scheme is designed to disentangle the global and local blur features; that is, each branch is trained on its own task-specific dataset. A branch attention mechanism is introduced to dynamically fuse global and local features. Complex blurry images are used to train the attention module and the reconstruction module. The visualized feature maps of the different branches indicate that our dual-branch network can decouple the global and local blur features efficiently. Experimental results show that the proposed dual-branch blur disentangling network can improve both the subjective and objective deblurring effects for real captured images.

Differentiated Attentive Representation Learning for Sentence Classification

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/644 ◽

2018 ◽

Cited By ~ 5

Author(s):

Qianrong Zhou ◽

Xiaojie Wang ◽

Xuan Dong

Keyword(s):

Representation Learning ◽

Learning Model ◽

Attention Mechanism ◽

Experimental Results ◽

Sentence Classification ◽

Synthetic Datasets

Attention-based models have been shown to be effective in learning representations for sentence classification. They are typically equipped with a multi-hop attention mechanism. However, existing multi-hop models still tend to pay too much attention to the most frequently noticed words, which might not be important for classifying the current sentence. There is also no explicit, effective way to shift the attention away from a wrong part of the sentence. In this paper, we alleviate this problem by proposing a differentiated attentive learning model. It is composed of two branches of attention subnets and an example discriminator. An explicit signal carrying the loss information of the first attention subnet is passed on to the second one to drive them to learn different attentive preferences. The example discriminator then selects the suitable attention subnet for sentence classification. Experimental results on real and synthetic datasets demonstrate the effectiveness of our model.

Person Reidentification Model Based on Multiattention Modules and Multiscale Residuals

Complexity ◽

10.1155/2021/6673461 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Yongyi Li ◽

Shiqi Wang ◽

Shuang Dong ◽

Xueling Lv ◽

Changzhi Lv ◽

...

Keyword(s):

Local Features ◽

Attention Mechanism ◽

Experimental Results ◽

Original Network ◽

Fine Grained ◽

Backbone Network ◽

Model Based ◽

Local Branch ◽

Feature Expression ◽

Global And Local

At present, person reidentification based on attention mechanisms has attracted the interest of many scholars. Although an attention module can improve the representation ability and reidentification accuracy of a Re-ID model to a certain extent, this depends on the coupling between the attention module and the original network. In this paper, a person reidentification model that combines multiple attentions and multiscale residuals is proposed. The model introduces a combined attention fusion module and a multiscale residual fusion module into the backbone network ResNet 50 to enhance the feature flow between residual blocks and better fuse multiscale features. Furthermore, a global branch and a local branch are designed and applied to enhance the channel aggregation and position perception ability of the network by utilizing the dual ensemble attention module, while the fine-grained feature expression is obtained by using multiproportion block and reorganization. Thus, the global and local features are enhanced. The experimental results on the Market-1501 and DukeMTMC-reID datasets show that the metrics of the presented model, especially Rank-1 accuracy, reach 96.20% and 89.59%, respectively, which can be considered progress in Re-ID.

A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2018.8461452 ◽

2018 ◽

Cited By ~ 19

Author(s):

Xin Wang ◽

Jaime Lorenzo-Trueba ◽

Shinji Takaki ◽

Lauri Juvela ◽

Junichi Yamagishi

Keyword(s):

Neural Network ◽

Speech Synthesis ◽

Acoustic Modeling ◽

Modeling Methods ◽

Waveform Generation

Expanding a Large Inclusive Study of Human Listening Rates

ACM Transactions on Accessible Computing ◽

10.1145/3461700 ◽

2021 ◽

Vol 14 (3) ◽

pp. 1-26

Author(s):

Danielle Bragg ◽

Katharina Reinecke ◽

Richard E. Ladner

Keyword(s):

Speech Synthesis ◽

Large Scale ◽

Speech Rate ◽

Synthetic Speech ◽

Conversational Agents ◽

Screen Reader ◽

Audio Cues ◽

Agent Interaction ◽

Personal Devices ◽

Answering Questions

As conversational agents and digital assistants become increasingly pervasive, understanding their synthetic speech becomes increasingly important. Simultaneously, speech synthesis is becoming more sophisticated and manipulable, providing the opportunity to optimize speech rate to save users time. However, little is known about people’s abilities to understand fast speech. In this work, we provide an extension of the first large-scale study on human listening rates, enlarging the prior study run with 453 participants to 1,409 participants and adding new analyses on this larger group. Run on LabintheWild, it used volunteer participants, was screen reader accessible, and measured listening rate by accuracy at answering questions spoken by a screen reader at various rates. Our results show that people who are visually impaired, who often rely on audio cues and access text aurally, generally have higher listening rates than sighted people. The findings also suggest a need to expand the range of rates available on personal devices. These results demonstrate the potential for users to learn to listen to faster rates, expanding the possibilities for human-conversational agent interaction.

FastNMF

Advances in Face Image Analysis ◽

10.4018/978-1-61520-991-0.ch008 ◽

2010 ◽

pp. 137-163

Author(s):

Le Li ◽

Yu-Jin Zhang

Keyword(s):

Feature Extraction ◽

Fixed Point ◽

Dimensionality Reduction ◽

Matrix Factorization ◽

Ease Of Use ◽

Experimental Results ◽

Fixed Point Algorithm ◽

Popular Method ◽

Face Images ◽

Multiplicative Update

Non-negative matrix factorization (NMF) is an increasingly popular method for non-negative dimensionality reduction and feature extraction of non-negative data, especially face images. Currently, no NMF algorithm offers both satisfactory efficiency for dimensionality reduction and feature extraction of face images and high ease of use. To improve the applicability of NMF, this chapter proposes a new monotonic, fixed-point algorithm called FastNMF, which implements least squares error-based non-negative factorization according to the basic properties of parabola functions. The minimization problem corresponding to an operation in FastNMF can be solved analytically by that operation alone, which is far beyond the power of existing NMF algorithms; FastNMF therefore achieves much higher efficiency, as validated by a set of experimental results. Owing to the simplicity of its design philosophy, FastNMF is also among the NMF algorithms that are easiest to use and understand. In addition, theoretical analysis and experimental results show that FastNMF tends to extract facial features with better representation ability than popular multiplicative update-based algorithms.
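For context, the multiplicative update baseline that the abstract compares against can be sketched as follows. This is the classic Lee-Seung update rule for the squared-error objective, not FastNMF itself, whose details are not given in the abstract. The helper names, the random initialization, and the small `eps` added to denominators are choices made for this example.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf_multiplicative(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) as W (n x rank) times H (rank x m).

    Starting from positive random factors, each iteration rescales H and
    then W element-wise by a ratio of non-negative terms, so both factors
    stay non-negative throughout.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# Usage: factor a small rank-1 matrix and report the reconstruction error.
v = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
w, h = nmf_multiplicative(v, rank=1)
print(frobenius_error(v, w, h))
```

Each update here only rescales the current factors, which is why this family of algorithms typically needs many iterations. The abstract contrasts this with FastNMF, where each operation is said to solve its minimization subproblem analytically.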

Speech Synthesis of Emotions Using Vowel Features

International Journal of Software Innovation ◽

10.4018/ijsi.2013010105 ◽

2013 ◽

Vol 1 (1) ◽

pp. 54-67

Author(s):

Kanu Boku ◽

Taro Asada ◽

Yasunari Yoshitomi ◽

Masayoshi Tabuse

Keyword(s):

Fundamental Frequency ◽

Speech Synthesis ◽

Male Subject ◽

Maximum Amplitude ◽

Synthetic Speech ◽

Emotional Speech ◽

Prosodic Features ◽

Initial Investigation ◽

Synthesis Research ◽

Case Based

Recently, methods for adding emotion to synthetic speech have received considerable attention in the field of speech synthesis research. For generating emotional synthetic speech, it is necessary to control the prosodic features of the utterances. The authors propose a case-based method for generating emotional synthetic speech by exploiting the characteristics of the maximum amplitude and the utterance time of vowels, and the fundamental frequency of emotional speech. As an initial investigation, they adopted the utterance of Japanese names, which are semantically neutral. With the proposed method, emotional synthetic speech generated from the emotional speech of one male subject was discriminable with a mean accuracy of 70% when ten subjects listened to synthetic utterances of the Japanese name “Taro” expressing “angry,” “happy,” “neutral,” “sad,” or “surprised.”

A Text Normalization Method for Speech Synthesis Based on Local Attention Mechanism

IEEE Access ◽

10.1109/access.2020.2974674 ◽

2020 ◽

Vol 8 ◽

pp. 36202-36209

Author(s):

Lan Huang ◽

Shunan Zhuang ◽

Kangping Wang

Keyword(s):

Speech Synthesis ◽

Attention Mechanism ◽

Normalization Method ◽

Text Normalization

Text-to-Speech Synthesis

Encyclopedia of Multimedia Technology and Networking ◽

10.4018/978-1-59140-561-0.ch135 ◽

2011 ◽

pp. 957-963

Author(s):

Mahbubur R. Syed ◽

Shuvro Chakrobartty ◽

Robert J. Bignall

Keyword(s):

Speech Production ◽

Speech Synthesis ◽

Synthetic Speech ◽

Practical Application ◽

Text To Speech ◽

Synthesis System ◽

System A ◽

Vocal System ◽

Text To Speech Synthesis ◽

Computer Based

Speech synthesis is the process of producing natural-sounding, highly intelligible synthetic speech simulated by a machine in such a way that it sounds as if it was produced by a human vocal system. A text-to-speech (TTS) synthesis system is a computer-based system where the input is text and the output is a simulated vocalization of that text. Before the 1970s, most speech synthesis was achieved with hardware, but this was costly and it proved impossible to properly simulate natural speech production. Since the 1970s, the use of computers has made the practical application of speech synthesis more feasible.
