
Text to Speech Synthesis (TTS)

As the name suggests, TTS converts textual information into speech. It can be seen as a sub-component of Emotional Speech Synthesis (ESS) and the more general Affective Speech Synthesis (ASS), which extend TTS with an emotional or affective dimension.

History

[[Articulatory Synthesis]]

Early models focused on simulating the very organs that generate speech. Examples are Wolfgang von Kempelen's Speaking Machine and the Voder.

[[Formant Synthesis]]

Formant synthesis is based on the source-filter model: an excitation at the fundamental frequency (source) is shaped by a rule-based system of resonances (filter).
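To illustrate the source-filter idea, here is a toy sketch (not an actual rule-based formant synthesizer): an impulse train at f0 is passed through a cascade of second-order resonators whose center frequencies act as formants. The formant frequencies and bandwidths below are assumed values for a rough /a/-like vowel.

```python
# Toy source-filter synthesis: impulse train (source) -> resonator cascade (filter).
import numpy as np
from scipy.signal import lfilter

def formant_vowel(f0=120, formants=(730, 1090, 2440), bandwidths=(90, 110, 170),
                  duration=0.5, sr=16000):
    n = int(duration * sr)
    # Glottal source: impulse train at f0 (a very crude stand-in for glottal pulses).
    source = np.zeros(n)
    period = int(sr / f0)
    source[::period] = 1.0
    signal = source
    # Cascade of second-order IIR resonators imposing the formants (the "filter").
    for f, bw in zip(formants, bandwidths):
        r = np.exp(-np.pi * bw / sr)        # pole radius from bandwidth
        theta = 2 * np.pi * f / sr          # pole angle from formant frequency
        a = [1.0, -2 * r * np.cos(theta), r ** 2]
        b = [1 - r]                         # rough gain normalization
        signal = lfilter(b, a, signal)
    return signal / np.max(np.abs(signal))

vowel_a = formant_vowel()  # roughly /a/-like; play back at sr=16000 to listen
```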

[[Concatenative Speech Synthesis]]

In this model, language is split into building blocks such as words, syllables, half-syllables, phonemes, diphones or triphones. These building blocks are pre-recorded and then concatenated. The problem with this approach is discontinuity: natural speech has continuous prosody and transitions, which can only be replicated with vast amounts of pre-recorded data. In the end, it did not prove flexible enough to deal with real-world language utterances.
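As a toy illustration of the concatenation step (the crossfade length and the assumption that `units` holds pre-recorded diphone waveforms are mine, not from any specific system), units could be joined with a short overlap-add crossfade to soften the boundary discontinuities:

```python
# Join pre-recorded units with a short crossfade at each boundary.
import numpy as np

def concatenate_units(units, sr=16000, crossfade_ms=10):
    xf = int(sr * crossfade_ms / 1000)       # crossfade length in samples
    fade_in = np.linspace(0.0, 1.0, xf)
    fade_out = 1.0 - fade_in
    out = units[0].astype(float)             # assumes every unit is longer than xf
    for unit in units[1:]:
        unit = unit.astype(float)
        # Blend the tail of the output so far with the head of the next unit.
        out[-xf:] = out[-xf:] * fade_out + unit[:xf] * fade_in
        out = np.concatenate([out, unit[xf:]])
    return out
```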

[[Statistical Parametric Speech Synthesis (SPSS)]]

In SPSS, speech is generated in three stages:

  1. Text Analysis: Analyzing the text (normalization, grapheme-to-phoneme conversion, …) and extracting phonemes, durations or part-of-speech tags.
  2. Acoustic Model: Transforming those linguistic features into acoustic features (fundamental frequency, spectrum, cepstrum, …).
  3. Vocoder: Transforming the acoustic features into an output waveform through models like [[WORLD]] and [[STRAIGHT]] (see the sketch after this list).
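To make the vocoder stage concrete, the sketch below does an analysis/resynthesis round trip with the pyworld wrapper around [[WORLD]] (assuming pyworld and soundfile are installed and `speech.wav` is a hypothetical mono recording). In a real SPSS system the acoustic model would predict these features from linguistic input instead of extracting them from audio.

```python
# Round trip through the WORLD vocoder via pyworld (assumed installed).
import numpy as np
import soundfile as sf
import pyworld

x, fs = sf.read("speech.wav")           # hypothetical mono recording
x = x.astype(np.float64)

f0, t = pyworld.dio(x, fs)              # coarse fundamental frequency contour
f0 = pyworld.stonemask(x, f0, t, fs)    # f0 refinement
sp = pyworld.cheaptrick(x, f0, t, fs)   # spectral envelope
ap = pyworld.d4c(x, f0, t, fs)          # aperiodicity

y = pyworld.synthesize(f0, sp, ap, fs)  # acoustic features -> waveform
sf.write("resynthesized.wav", y, fs)
```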

Deep Learning

(Figure from Affective Speech Synthesis (ASS)#^1.)

The models presented here can be distinguished along several dimensions:

  • Autoregressive vs non-autoregressive
  • Network Structure (CNNs, RNNs, Attention-based, …)
  • Generative model (VAE, GAN, …)
  • Degree of E2E behavior

The first deep-learning model, WaveNet, is based on CNNs that process the input features and an autoregressive structure that produces the waveform. It combines the acoustic model and vocoder stages of Affective Speech Synthesis (ASS)#Statistical Parametric Speech Synthesis (SPSS). It was quickly iterated upon by extending it to non-autoregressive synthesis and conditioning it on spectrograms.
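A heavily simplified sketch of the autoregressive causal-CNN idea (gated activations, skip connections and conditioning on linguistic features or spectrograms are omitted, and the layer sizes and dilation schedule are arbitrary, not the published architecture):

```python
# Tiny WaveNet-style model: dilated causal convolutions predict a categorical
# distribution over the next mu-law quantized sample; generation is sample by sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """1-D convolution that only looks at past samples (left padding)."""
    def forward(self, x):
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(F.pad(x, (pad, 0)))

class TinyWaveNet(nn.Module):
    def __init__(self, n_classes=256, channels=64, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        self.embed = nn.Embedding(n_classes, channels)
        self.layers = nn.ModuleList(
            CausalConv1d(channels, channels, kernel_size=2, dilation=d) for d in dilations
        )
        self.out = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, samples):                  # samples: (batch, time) int64 codes
        x = self.embed(samples).transpose(1, 2)  # -> (batch, channels, time)
        for layer in self.layers:
            x = x + torch.relu(layer(x))         # residual dilated causal conv
        return self.out(x)                       # logits for the next sample at each step

@torch.no_grad()
def generate(model, n_samples=160, n_classes=256):
    samples = torch.full((1, 1), n_classes // 2, dtype=torch.long)  # start near "silence"
    for _ in range(n_samples):
        logits = model(samples)[:, :, -1]        # distribution over the next sample
        nxt = torch.distributions.Categorical(logits=logits).sample()
        samples = torch.cat([samples, nxt.unsqueeze(1)], dim=1)
    return samples

codes = generate(TinyWaveNet())  # mu-law codes; an untrained model just produces noise
```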

Another approach is the Tacotron model, which is based on an encoder-attention-decoder framework using RNNs. It combines the text analysis and acoustic model of SPSS. It was iterated upon by moving to a transformer-based approach that encodes a text sequence and decodes it into acoustic features (FastSpeech). The parallelism of this model allows for very fast text-to-acoustic-feature conversion. Its successor, FastSpeech 2, predicts both the pitch and the energy of the target speech, which allows the injection of emotional information, useful for ESS.
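A simplified sketch of that pitch/energy idea (a stripped-down "variance adaptor"; the real FastSpeech 2 also predicts durations and upsamples to frame level, and the bin ranges assume normalized pitch/energy values):

```python
# Predict pitch and energy from encoder states and add their embeddings back;
# passing explicit pitch/energy values allows prosody control, e.g. for ESS.
import torch
import torch.nn as nn

class VarianceAdaptor(nn.Module):
    def __init__(self, dim=256, n_bins=256):
        super().__init__()
        self.pitch_predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.energy_predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.pitch_embed = nn.Embedding(n_bins, dim)
        self.energy_embed = nn.Embedding(n_bins, dim)
        self.bins = torch.linspace(-3, 3, n_bins - 1)   # assumes normalized values

    def forward(self, h, pitch=None, energy=None):
        # h: (batch, time, dim) encoder states; use predictions unless values are given.
        if pitch is None:
            pitch = self.pitch_predictor(h).squeeze(-1)
        if energy is None:
            energy = self.energy_predictor(h).squeeze(-1)
        pitch_ids = torch.bucketize(pitch, self.bins)
        energy_ids = torch.bucketize(energy, self.bins)
        return h + self.pitch_embed(pitch_ids) + self.energy_embed(energy_ids)

h = torch.randn(1, 20, 256)                                      # hypothetical encoder output
adapted = VarianceAdaptor()(h)                                   # use predicted pitch/energy
excited = VarianceAdaptor()(h, pitch=torch.full((1, 20), 2.0))   # override pitch for an "excited" style
```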

Another approach is based on generative models: WaveGAN and MelGAN. They use noise and acoustic features as input and transform them into waveforms; thus they represent the vocoder stage of SPSS. They do not perform that well on TTS tasks, but could be interesting when it comes to freely speaking agents.
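A sketch of the generator side of such a GAN vocoder (MelGAN-style upsampling from mel frames to a waveform; channel counts and upsample factors are illustrative, not the published configurations, and the adversarial training loop is omitted):

```python
# Transposed-convolution generator: mel-spectrogram frames -> raw waveform.
import torch
import torch.nn as nn

class TinyGANVocoder(nn.Module):
    def __init__(self, n_mels=80, channels=256, upsample_factors=(8, 8, 4)):  # 8*8*4 = 256x upsampling
        super().__init__()
        layers = [nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)]
        for f in upsample_factors:
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(channels, channels // 2, kernel_size=2 * f, stride=f, padding=f // 2),
            ]
            channels //= 2
        layers += [nn.LeakyReLU(0.2), nn.Conv1d(channels, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):          # mel: (batch, n_mels, frames)
        return self.net(mel)         # -> (batch, 1, frames * 256), waveform in [-1, 1]

mel = torch.randn(1, 80, 50)         # hypothetical acoustic features (e.g. from an acoustic model)
audio = TinyGANVocoder()(mel)        # meaningful audio only after adversarial training
```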