HIERARCHICAL TRANSFER LEARNING FOR MULTILINGUAL, MULTI-SPEAKER, AND STYLE TRANSFER DNN-BASED TTS ON LOW-RESOURCE LANGUAGES

Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages

Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages

Blog Article

This work applies vince camuto fiori gift set a hierarchical transfer learning to implement deep neural network (DNN)-based multilingual text-to-speech (TTS) for low-resource languages.DNN-based system typically requires a large amount of training data.In recent years, while DNN-based TTS has made remarkable results for high-resource languages, it still suffers from a data scarcity problem for low-resource languages.

In this article, we propose a multi-stage transfer learning strategy to train our TTS model for low-resource languages.We make use of a high-resource language and a joint multilingual dataset of low-resource languages.A pre-trained monolingual TTS on the high-resource language is fine-tuned on the low-resource language using the same model architecture.

Then, we apply partial network-based transfer learning from the pre-trained monolingual TTS to a multilingual TTS and finally from the pre-trained multilingual TTS to a multilingual with style transfer TTS.Our experiment on Indonesian, Javanese, and Sundanese languages show adequate quality of synthesized speech.The evaluation of our multilingual TTS reaches a mean opinion score (MOS) of 4.

35 for Indonesian (ground truth = 4.36).Whereas for Javanese and Sundanese it reaches a MOS of 4.

20 (ground truth = 4.38) and 4.28 (ground truth = 4.

20), respectively.For parallel style transfer evaluation, our TTS model reaches an F0 frame error (FFE) of 9.08%, 10.

13%, and 8.43% for Indonesian, Javanese, and Sundanese, respectively.The results indicate that the proposed strategy can be effectively applied to the low-resource languages target domain.

With a small amount of training data, our models are able to learn step by step from a smaller TTS network to larger networks, produce intelligible speech approaching the real human groovy mama ring voice, and successfully transfer speaking style from a reference audio.

Report this page