Duration Prediction
Duration prediction is used to make synthesized speech sound more natural in Text-to-Speech Synthesis (TTS). It solves a problem that arises when mapping linguistic features (e.g. phonemes) directly onto acoustic features (e.g. Mel-frequency Cepstrum Coefficients (MFCC)): real speech doesn't have a one-to-one correspondence between a frame of linguistic features and a single frame of acoustic features. Instead, a single phoneme might span multiple frames of acoustic features (due to emphasis, dialect, etc., see Emotion#Prosody).
Autoregressive Vs Parallel Models¶
In autoregressive models, we can solve this problem in a Markovian way: the model learns the transition probability of going from one phoneme to the next given the acoustic features. During inference, phonemes that are emphasized then have a higher transition probability to themselves and a lower transition probability to the next phoneme. This way, such a model can map a single phoneme to one or more frames of acoustic features, increasing the duration as needed.
Parallel models, such as Generative Adversarial Networks (GAN) and Diffusion Models, produce all acoustic features in a single parallel pass, so the autoregressive solution doesn't work here. The correct durations of the output features must already be known at the start of the generative process.
Any duration perturbation of the linguistic feature embeddings is constrained to be monotonic. This way, words aren't accidentally repeated or skipped.
Alignment Strategy¶
Given an array of phonemes and an array of frames of an utterance, we want to find out which frames represent which phoneme. Often, a phoneme is represented by multiple consecutive frames. An alignment strategy finds this mapping and gives a duration vector, which describes how many consecutive frames belong to each phoneme. For example, aligning four phonemes to nine frames might yield the duration vector \((2, 3, 2, 2)\).
Attention-based alignment uses the cross-attention layers of a Transformer to figure out which phonemes are most relevant for each frame. Monotonic alignment considers each frame to be drawn from a phoneme-dependent Gaussian and finds the sequence of phonemes which makes the observed frames most likely.
Attention-based Alignment¶
Based on @renFastSpeechFastRobust2019.

The first parallel model that employed duration prediction was [[FastSpeech]], a Text-to-Speech Synthesis (TTS) [[Transformer Model]]. As parallel models have a fixed input-to-output length ratio, a Length Regulator is used to expand the phoneme representation \(\mathcal{H}_{\text{pho}}\) using the duration vector \(\mathcal{D}\) so that the actual model input \(\mathcal{H}_{\text{mel}}\) has the correct duration per phoneme. The Transformer can then use this to generate the actual output mel coefficients \(Y\) from the duration-adjusted input.
Length Regulation
The length regulation is a simple process, but it requires the prediction of the duration vector \(\mathcal{D}\). In the case of the FastSpeech model, this is done using a separate duration predictor model, a simple convolutional network that learns to map phoneme embeddings to the appropriate duration vector.
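A minimal sketch of the length regulation step (assuming PyTorch tensors; the function name `length_regulate` is illustrative, not from the paper):

```python
import torch

def length_regulate(h_pho: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand phoneme-level hidden states to frame level by repeating each
    phoneme vector according to its duration.

    h_pho:     (T_text, hidden_dim) phoneme representations H_pho
    durations: (T_text,) integer durations D, one entry per phoneme
    returns:   (sum(durations), hidden_dim) frame-level representation H_mel
    """
    # repeat_interleave repeats row i of h_pho exactly durations[i] times
    return torch.repeat_interleave(h_pho, durations, dim=0)

# Example: 4 phonemes with durations [2, 3, 1, 2] expand to 8 frames
h = torch.randn(4, 256)
d = torch.tensor([2, 3, 1, 2])
assert length_regulate(h, d).shape == (8, 256)
```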
Duration Predictor
The interesting part is how to extract the ground-truth duration vector from a given phoneme-spectrogram pair in the first place. This is done by using an encoder-attention-decoder based Transformer TTS model. The model has multiple attention heads, the activations of which show how much attention is placed on a given input phoneme to predict the next mel-coefficients as output.
If the input frames \(x_{i}\) and output frames \(y_{j}\) were perfectly aligned, then the attention matrix would be a diagonal matrix, as the attention \(a_{i,j}\) would be maximal for \(i=j\). But in most cases, the frames in the output will lag behind the frames of the input, because some phonemes produce multiple acoustic frames. Therefore, the attention matrices of the attention heads will have some concave curvature. The attention head whose attention matrix is closest to diagonal is chosen, measured using the focus rate:
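$$ F = \frac{1}{S} \sum_{j=1}^{S} \max_{1 \le i \le T} a_{i,j} $$

where \(T\) is the number of phonemes, \(S\) the number of mel-spectrogram frames, and \(a_{i,j}\) the attention weight of phoneme \(i\) for output frame \(j\) (notation adapted from @renFastSpeechFastRobust2019).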
The focus rate measures how concentrated the attention over the input phonemes is for each output frame. A high focus rate indicates that the attention head models the alignment between phonemes and acoustic features well. A lower focus rate indicates more abstract patterns, not directly related to the frame alignment.
Finally, the duration vector is computed as
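$$ d_{i} = \sum_{j=1}^{S} \mathbb{1}\!\left[\arg\max_{k} a_{k,j} = i\right] $$

where \(\mathbb{1}[\cdot]\) is the indicator function and the attention weights are those of the head selected by the focus rate (following @renFastSpeechFastRobust2019).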
So the duration of input phoneme \(i\) is the number of output frames for which the attention stays focused on that phoneme.
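A minimal sketch of this extraction from one head's attention matrix (PyTorch; the function names are illustrative, not from the paper):

```python
import torch

def focus_rate(attn: torch.Tensor) -> torch.Tensor:
    """attn: (T_text, T_mel) attention matrix of one head.
    Mean over output frames of the maximal attention weight per frame."""
    return attn.max(dim=0).values.mean()

def extract_durations(attn: torch.Tensor) -> torch.Tensor:
    """Duration of phoneme i = number of mel frames whose attention is
    maximal on phoneme i (applied to the head with the highest focus rate)."""
    most_attended = attn.argmax(dim=0)  # (T_mel,) phoneme index per frame
    return torch.bincount(most_attended, minlength=attn.shape[0])

# Example: 3 phonemes, 6 frames; the extracted durations always sum to T_mel
attn = torch.softmax(torch.randn(3, 6), dim=0)
assert extract_durations(attn).sum().item() == 6
```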
During training, the alignment extracted by the Transformer TTS model is used directly. The simple predictor model is trained simultaneously and employed during inference, when there is no target mel-spectrogram.
Monotonic Alignment¶
Based on @kimGlowTTSGenerativeFlow2020.

An alternative approach for duration prediction is Monotonic Alignment Search (MAS). The general procedure is very similar to #Attention-based Alignment, but instead of extracting the ground-truth durations from a teacher model, they are estimated using MAS. Otherwise the procedure stays the same: phonemes are embedded (as prior statistics \(\mu\) and \(\sigma\)) and then expanded by the factors in the duration vector. When training the encoder-decoder and duration predictor, this vector is computed using MAS; during inference, it is predicted by the duration predictor directly.
Because the encoder-decoder network depends on the output of the MAS, and the output of the MAS depends on the encoding by the encoder-decoder, the model is trained in a Viterbi style: the best alignment \(A^{*}\) for the given parameters \(\theta\) is computed, then the parameters \(\theta\) are updated for the given alignment \(A^{*}\). This procedure maximizes the log-likelihood of the most likely hidden alignment.
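Written out (following @kimGlowTTSGenerativeFlow2020, with \(c\) the text, \(x\) the mel-spectrogram and \(P_{X}\) the flow-based likelihood), one iteration first solves

$$ A^{*} = \arg\max_{A} \log P_{X}(x \mid c;\, \theta, A) $$

and then takes a gradient step on \(\theta\) to increase \(\log P_{X}(x \mid c;\, \theta, A^{*})\).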
Monotonic Alignment Search

MAS is a dynamic programming approach to finding the most likely alignment between the latent variables \(z\) and the prior statistics \(\mu\) and \(\sigma\). The priors are estimated by a variational encoder; the latents are assumed to be sampled from the priors. In the TTS scenario, we have \(\mu_{i},\sigma_{i};\space i\in[1,T_{text}]\) and \(z_{j}; \space j\in[1,T_{mel}]\), where \(T_{mel}\ge T_{text}\).
Now, for each latent variable \(z_{j}\), we want to compute the log-likelihood \(Q_{i,j}\) of it having been produced by the current or any of the previous statistics \(\mu_{1\le i \le j}, \sigma_{1 \le i \le j}\). For example, the likelihood of the first three latent variables \(z_{k}\) having been produced only by the first frame of the statistics is just the sum of log-likelihoods of each latent variable given the priors \(\mu_{1}\) and \(\sigma_{1}\):
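$$ Q_{1,3} = \sum_{k=1}^{3} \log \mathcal{N}(z_{k};\, \mu_{1}, \sigma_{1}) $$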
Consequently, we can compute all the log-likelihoods that the first \(j\) latents have been produced by the first linguistic frame (first row of the alignment matrix) as:
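$$ Q_{1,j} = \sum_{k=1}^{j} \log \mathcal{N}(z_{k};\, \mu_{1}, \sigma_{1}) $$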
Because we assume the alignment to be monotonic, we can use the log-likelihoods of the previous frame of latents to compute the log-likelihood of the current output frame given the prior statistics:
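$$ Q_{i,j} = \max\left(Q_{i-1,j-1},\; Q_{i,j-1}\right) + \log \mathcal{N}(z_{j};\, \mu_{i}, \sigma_{i}) $$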
This way, we can compute the "upper right triangle" of the alignment matrix and then backtrack the most probable alignment by starting at the end of the latent variables \(T_{mel}\) and choosing the most likely input-frame for each output-frame (either staying at the same input frame or moving one back). This operation is formalized as:
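$$ A^{*}(T_{mel}) = T_{text}, \qquad A^{*}(j) = \underset{i \,\in\, \{A^{*}(j+1)-1,\; A^{*}(j+1)\}}{\arg\max} Q_{i,j} \quad \text{for } j = T_{mel}-1, \dots, 1 $$

A compact sketch of the full search (numpy; the function name and the per-dimension diagonal-Gaussian log-density are assumptions, not taken from the paper):

```python
import numpy as np

def monotonic_alignment_search(z, mu, sigma):
    """z: (T_mel, D) latents; mu, sigma: (T_text, D) diagonal Gaussian priors.
    Returns a (T_mel,) array mapping each latent frame j to a text index i,
    monotonically non-decreasing in j."""
    T_mel, T_text = z.shape[0], mu.shape[0]

    # log N(z_j; mu_i, sigma_i), summed over dimensions -> (T_text, T_mel)
    log_p = -0.5 * (((z[None] - mu[:, None]) / sigma[:, None]) ** 2
                    + 2 * np.log(sigma[:, None]) + np.log(2 * np.pi)).sum(-1)

    # Dynamic programming table Q[i, j] ("upper right triangle": i <= j)
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        Q[0, j] = Q[0, j - 1] + log_p[0, j]          # stay on the first phoneme
        for i in range(1, min(j + 1, T_text)):
            Q[i, j] = max(Q[i - 1, j - 1], Q[i, j - 1]) + log_p[i, j]

    # Backtracking: start at the last phoneme, then either stay or move back
    A = np.zeros(T_mel, dtype=int)
    A[-1] = T_text - 1
    for j in range(T_mel - 2, -1, -1):
        i = A[j + 1]
        A[j] = i if i == 0 or Q[i, j] >= Q[i - 1, j] else i - 1
    return A

# Durations per phoneme are the counts of each text index in the alignment:
# durations = np.bincount(A, minlength=mu.shape[0])
```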
Voice Conversion¶
The previous two methods were both Text-to-Speech Synthesis (TTS) models, meaning they relied on aligning frames of linguistic features with frames of acoustic features. Typically, in Voice Conversion (VC) scenarios, duration prediction is not used because the input and output features are assumed to be aligned. In Emotional Voice Conversion (EVC) however, the frames of the input and output acoustic features might be misaligned (different emphasis for different emotions). This leads to an unnatural flow of speech, where parts of the utterance seem rushed when arousal is increased.
Speech Unit Alignment¶
Based on @ohDurFlexEVCDurationFlexibleEmotional2024.
Another approach to duration regulation in VC is to discretize the input waveform into abstract speech units. Similar to FastSpeech (#Attention-based Alignment), cross-attention is used to extract the duration of each speech unit. However, in FastSpeech, the cross-attention is computed between the linguistic embeddings and the frames of the mel-spectrogram.
Unit Level Pooling
In this case, there is no embedding of linguistic features. Instead, a learnable unit embedding \(e_{unit}\) is created, which serves as keys and values of the Transformer architecture, while the acoustic input features serve as queries. Each entry of the embedding serves as an abstract class to which an acoustic frame might belong. The model thus learns to map acoustic frames onto these classes (or speech units) and, using cross-attention, can extract the duration vector for the units. This is done by taking the unit for which the attention is maximal per acoustic frame and then extracting the unique consecutive units and their counts as the duration vector (deduplication). The \(e_{unit}\) embedding is trained to represent a [[HuBERT]] embedding. The unit level pooling acts as a kind of information bottleneck, removing redundant information introduced by long-duration phonemes.
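A minimal sketch of the pooling and deduplication step (PyTorch; the attention computation itself is omitted and the function name is illustrative):

```python
import torch

def unit_level_pool(attn: torch.Tensor):
    """attn: (T_mel, N_units) cross-attention of the acoustic frames (queries)
    over the learnable unit embeddings (keys).
    Returns the deduplicated unit sequence and the duration of each unit."""
    frame_units = attn.argmax(dim=-1)  # (T_mel,) unit index per acoustic frame
    units, durations = torch.unique_consecutive(frame_units, return_counts=True)
    return units, durations

# Example: frames assigned to units [5, 5, 5, 2, 2, 9] -> units [5, 2, 9], durations [3, 2, 1]
u, d = unit_level_pool(torch.eye(10)[[5, 5, 5, 2, 2, 9]])
assert u.tolist() == [5, 2, 9] and d.tolist() == [3, 2, 1]
```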
Frame Level Scaling
During training, the units can be expanded again using the duration vector of the target mel-spectrogram. A stochastic duration predictor, trained in parallel, is used during inference to predict the unknown duration vector of the target style.
Information Bottleneck¶
Based on @leeDurationControllableVoice2022.
One approach to duration prediction in VC is to discretize the input waveform into phonemes, using a phoneme predictor. The phoneme predictor is trained from ground-truth phoneme labels as targets (text converted to phonemes using G2P) and mel-spectrograms generated using [[Tacotron2]] as the input. The actual model then uses the phoneme predictor to generate phoneme labels for each frame of the input mel-spectrogram and uses a similar deduplication process as in #Speech Unit Alignment to downsample the input to phoneme-level.
A duration predictor is trained from durations extracted as in #Attention-based Alignment and conditioned on the speaker embedding.