Direct Speech-to-Speech Translation without Textual Annotations using Bottleneck Features
Junhui Zhang, Junjie Pan, Xiang Yin, Zejun Ma
Abstract: Speech-to-speech translation directly translates a speech utterance in one language into a speech utterance in another, and has great potential in tasks such as simultaneous interpretation. State-of-the-art models usually contain an auxiliary module for phoneme sequence prediction, which requires textual annotations of the training dataset. We propose a direct speech-to-speech translation model that can be trained without any textual annotation or content information. Instead of introducing an auxiliary phoneme prediction task, we use bottleneck features as intermediate training objectives to ensure the translation performance of the system. Experiments on Mandarin-Cantonese speech translation demonstrate the feasibility of the proposed approach, and its performance matches that of a cascaded system in terms of translation and synthesis quality.
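The core idea in the abstract, replacing the auxiliary phoneme-prediction loss with a loss on intermediate bottleneck features, can be sketched as a combined training objective. This is a minimal illustrative sketch, not the paper's actual implementation: the function name, loss choices (L2 on bottleneck features, L1 on mel-spectrograms), and the weighting factor `alpha` are all assumptions for illustration.

```python
import numpy as np

def s2st_loss(pred_bnf, target_bnf, pred_mel, target_mel, alpha=1.0):
    """Illustrative combined objective: an L2 loss on intermediate
    bottleneck features (BNF) stands in for the usual auxiliary
    phoneme-prediction loss, added to an L1 spectrogram loss."""
    bnf_loss = np.mean((pred_bnf - target_bnf) ** 2)   # intermediate objective
    mel_loss = np.mean(np.abs(pred_mel - target_mel))  # synthesis objective
    return mel_loss + alpha * bnf_loss

# Toy shapes: 50 frames of 256-dim bottleneck features and 80-bin mel frames.
rng = np.random.default_rng(0)
pred_bnf, target_bnf = rng.normal(size=(50, 256)), rng.normal(size=(50, 256))
pred_mel, target_mel = rng.normal(size=(50, 80)), rng.normal(size=(50, 80))
loss = s2st_loss(pred_bnf, target_bnf, pred_mel, target_mel)
```

Because bottleneck features are extracted from a pretrained speech encoder rather than from text, this objective needs no textual annotation of the training data, which is the point of the proposed approach.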
Speech-to-Speech Translation Results
Source speech utterances are from random speakers. Translated Speech (Baseline) denotes the outputs of the cascaded pipeline; Translated Speech (Proposed) denotes the outputs of the proposed S2ST model. Transcripts of the source speech and of the proposed model's translated speech were produced by linguistic experts; the translated text of the baseline is the output of the NMT model.
Each example gives the transcribed source text (Mandarin), the baseline's translated text (Cantonese, NMT output), and the transcript of the proposed model's translated speech (Cantonese). Audio for the source and translated speech accompanies each example on the demo page.

Example 1
- Source transcript: 我自己做就自己做咯,我做给你们看。
- Baseline translated text: 我自己做就自己啦。我做俾你哋睇。
- Proposed transcript: 我自己做我自己做啦,我做畀你哋睇。

Example 2
- Source transcript: 这件事我很愿意的。
- Baseline translated text: 呢件事我好愿意嘅。
- Proposed transcript: 呢件事我好愿意嘅。

Example 3
- Source transcript: 你记不记得啊,上一次那么小的事情,都有那么多人来报导啊。
- Baseline translated text: 记唔记得上一次咁细嘅事情,都有咁多人嚟报道?
- Proposed transcript: 你记不记得,上一次噉小嘅事情,都有咁多人嚟报道呀?

Example 4
- Source transcript: 放心吧,我跟你一样只是担心他,我没心情怪他炒他鱿鱼。
- Baseline translated text: 放心!我同你一样,只系担心佢。我冇心情怪佢炒佢鱿鱼。
- Proposed transcript: 放心呀,我同你一样只系担心佢,我冇心情怪佢炒佢鱿鱼。

Example 5
- Source transcript: 闫姐啊,你真的好有面子啊。
- Baseline translated text: 严姐,你真系系好有面。
- Proposed transcript: 燕姐,你真系好有面子。

Example 6
- Source transcript: 一看他就是没晒太阳,推他出去晒晒太阳。
- Baseline translated text: 一睇佢就系冇晒太阳,推佢晒晒太阳。
- Proposed transcript: 一睇佢就系冇晒太阳,推佢出去晒晒太阳。