Google AI introduces Translatotron 2 for robust direct speech-to-speech translation




The field of Natural Language Processing (NLP) is growing rapidly in many areas, including search engines, machine translation, chatbots, and home assistants. One of these applications, speech-to-speech translation (S2ST), breaks down language barriers by allowing speakers of different languages to communicate. It is therefore extremely valuable for science and intercultural exchange.

Automatic S2ST systems are generally built as a cascade of subsystems for speech recognition, machine translation, and speech synthesis. However, such cascade systems can suffer from longer latency, loss of information (especially paralinguistic and non-linguistic information), and errors that compound between subsystems.

In 2019, Google AI introduced Translatotron, the first model that directly translates speech between two languages. This direct S2ST model could be trained efficiently end-to-end and had the unique ability to retain the source speaker’s voice (which is non-linguistic information) in the translated speech. Despite its ability to produce natural-sounding, high-fidelity translated speech, it still underperformed a strong baseline cascade S2ST system.

The recent Google study presents Translatotron 2, an improved version of Translatotron with significantly better performance. Translatotron 2 uses a new method to transfer the source speaker’s voice to the translated speech. The updated voice-transfer technique remains effective even when the input speech contains several speakers taking turns, while reducing the potential for misuse and conforming better to Google’s AI Principles.

Translatotron 2 architecture

The main components of this new model are:

  • A speech encoder.
  • A target phoneme decoder.
  • A target speech synthesizer.
  • An attention module that connects them all.

Taken together, the encoder, attention module, and decoder resemble a typical direct speech-to-text translation (ST) model.


The main changes made in Translatotron 2 are listed below:

  1. The output of the target phoneme decoder is one of the inputs of the spectrogram synthesizer in Translatotron 2. This strong conditioning makes the model easier to train and improves its performance.
  2. The spectrogram synthesizer used in Translatotron 2 is duration-based, which markedly improves the robustness of the synthesized speech.
  3. The attention-based connection in Translatotron 2 is driven by the phoneme decoder instead of the spectrogram synthesizer. This aligns the acoustic information that the spectrogram synthesizer sees with the translated content it synthesizes, which helps preserve each speaker’s voice across speaker turns.
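
The three changes above can be illustrated with a toy NumPy sketch. It is only an illustration under stated assumptions, not the model's actual implementation: the single-head dot-product attention, the plain repetition used for duration-based upsampling, and all names, shapes, and values (`attend`, `upsample_by_duration`, `durations`) are hypothetical simplifications chosen for readability.

```python
import numpy as np

def attend(queries, encoder_out):
    """Dot-product attention: one context vector per decoder step."""
    scores = queries @ encoder_out.T               # (num_phonemes, num_source_frames)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ encoder_out                   # (num_phonemes, dim)

def upsample_by_duration(features, durations):
    """Repeat each per-phoneme vector for its number of output frames."""
    return np.repeat(features, durations, axis=0)

rng = np.random.default_rng(0)
encoder_out = rng.normal(size=(50, 8))  # 50 encoded source-speech frames (toy data)
decoder_out = rng.normal(size=(4, 8))   # hidden states for 4 target phonemes (toy data)
durations = np.array([3, 1, 5, 2])      # frames per phoneme (hypothetical values)

# Change 3: the phoneme decoder's states are the attention queries, so each
# target phoneme gets its own view of the source acoustics.
context = attend(decoder_out, encoder_out)

# Change 1: the synthesizer input includes the phoneme decoder's output.
synth_in = np.concatenate([decoder_out, context], axis=1)

# Change 2: duration-based expansion from phoneme steps to spectrogram frames.
frames = upsample_by_duration(synth_in, durations)
print(frames.shape)
```

Because every output frame is an expanded copy of one phoneme step, the number of frames is fixed by the durations rather than by an attention mechanism deciding at each step whether to advance or stop.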

Strong voice retention

By conditioning its decoder on a speaker embedding generated by a separately trained speaker encoder, the original Translatotron preserved the source speaker’s voice in the translated speech. However, if a clip of a different speaker’s recording was provided as reference audio to the speaker encoder, or if that speaker’s embedding was directly available, this approach also allowed the model to generate the translated speech in that other speaker’s voice. This capability could be misused to spoof audio with arbitrary content.

With this in mind, Translatotron 2 is built with a single speech encoder that handles both linguistic understanding and voice capture. This prevents trained models from being directed to reproduce non-source voices.

To synthesize training targets, the researchers used a modified version of PnG NAT, a TTS model capable of cross-lingual voice transfer. The modified PnG NAT model adds a separately trained speaker encoder, allowing zero-shot voice transfer.

Additionally, they propose ConcatAug, a simple concatenation-based data augmentation technique. It allows S2ST models to retain each speaker’s voice in the translated speech when the input speech contains several speakers taking turns. The method augments the training data on the fly by randomly choosing pairs of training examples and concatenating their source speech, target speech, and target phoneme sequences into new training examples. Because the resulting samples contain two speakers’ voices in both the source speech and the target speech, the model can learn from examples with speaker turns.
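
The concatenation step described above can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the researchers' code: the dictionary keys, the `prob` parameter, and the choice to sample the partner from the same batch are all hypothetical, and the toy arrays stand in for real audio features.

```python
import random
import numpy as np

def concat_pair(a, b):
    """Join two training examples end to end, field by field."""
    return {
        "source_speech": np.concatenate([a["source_speech"], b["source_speech"]]),
        "target_speech": np.concatenate([a["target_speech"], b["target_speech"]]),
        "target_phonemes": a["target_phonemes"] + b["target_phonemes"],
    }

def concat_aug(batch, rng, prob=0.5):
    """With probability `prob` (an assumed knob), pair an example with a random partner."""
    augmented = []
    for example in batch:
        if rng.random() < prob:
            # The partner is drawn from the same batch and may occasionally be
            # the example itself; a stricter version could exclude that case.
            partner = rng.choice(batch)
            augmented.append(concat_pair(example, partner))
        else:
            augmented.append(example)
    return augmented

# Toy batch: 1-D arrays stand in for speech, short lists for phoneme sequences.
batch = [
    {"source_speech": np.zeros(5), "target_speech": np.zeros(4), "target_phonemes": ["HH", "AY"]},
    {"source_speech": np.ones(7), "target_speech": np.ones(6), "target_phonemes": ["B", "AY"]},
]
merged = concat_pair(batch[0], batch[1])
augmented = concat_aug(batch, random.Random(0), prob=1.0)
print(len(merged["source_speech"]), merged["target_phonemes"])
```

Applying the augmentation per batch at load time, rather than writing new files, matches the "on the fly" description: each epoch can see different pairings without growing the stored dataset.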

Audio samples (not reproduced here): input (Spanish); TTS-synthesized reference (English); Translatotron 2 prediction without ConcatAug (English); Translatotron 2 prediction with ConcatAug (English).


Translatotron 2 consistently outperforms the original Translatotron in translation quality, speech naturalness, and speech robustness when tested on three different corpora. It performed particularly well on the challenging Fisher corpus.


The researchers also evaluated the model in a multilingual setup, in which it translated speech from four different languages into English. The language of the input speech is not provided, so the model must detect it on its own. Translatotron 2 greatly surpasses the original Translatotron on this task. The results suggest that the translation quality of Translatotron 2 is comparable to that of a baseline speech-to-text translation model, and that Translatotron 2 is highly effective for multilingual S2ST.




