tags: ['AI', 'ComputerScience', 'MasterThesis', 'Speech']
paper: [[@popovGradTTSDiffusionProbabilistic2021]]
domain: speech synthesis
type: seq2seq diffusion
input: text sequence (characters/phonemes)
output: mel-spectrogram
Grad-TTS
Grad-TTS is a Text-to-Speech (TTS) diffusion model. Its approach is based on Score-Based Diffusion Models: it converts a text sequence of characters or phonemes into a sequence of mel-spectrogram frames.
Model Architecture

Inference
The model encodes the sequence of text inputs \(x_{1:L}\) into a latent representation \(\tilde{\mu}_{1:L}\). A duration predictor produces an alignment mapping \(A\) from the input positions \([1:L]\) to the acoustic frames \([1:F]\); that is, each input feature can be represented by one or more acoustic frames. The duration-adapted latent features are then passed through the diffusion model, which produces the mel-spectrogram frame for each acoustic frame.
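A minimal sketch (not the authors' code) of how the predicted durations expand the latent sequence from \(L\) input positions to \(F\) frames, assuming the alignment simply repeats each feature for its predicted duration; the function name is hypothetical:

```python
import torch

def expand_by_durations(mu_tilde: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Upsample encoder outputs to the acoustic frame rate.

    mu_tilde:  (L, C) latent features, one per input symbol.
    durations: (L,)   integer frame count predicted for each symbol.
    Returns:   (F, C) with F = durations.sum(); the alignment A maps
               each of the d_i copies of feature i to its own frame.
    """
    return torch.repeat_interleave(mu_tilde, durations, dim=0)
```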
The reverse diffusion evolves according to an [[Differential Equation#Ordinary Differential Equation (ODE)]]:
\[
dX_t = \frac{1}{2}\big(\mu - X_t - s_\theta(X_t, \mu, t)\big)\,\beta_t\,dt,
\]
where \(s_\theta(X_t,\mu,t)\) is a U-Net (see [[Diffusion Model#Model for Reverse Process]]).
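At inference time this ODE can be integrated backwards from \(t=1\) to \(t=0\) with a simple fixed-step Euler solver. The sketch below assumes a linear noise schedule \(\beta_t = \beta_0 + t(\beta_1 - \beta_0)\); `score_net` stands in for the trained U-Net, and the default values are illustrative, not the authors' exact settings:

```python
import torch

@torch.no_grad()
def reverse_ode_euler(score_net, mu, n_steps=100, beta_0=0.05, beta_1=20.0):
    """Integrate the reverse-diffusion ODE backwards in time.

    score_net(x, mu, t) -> s_theta, the estimated score.
    mu: duration-adapted encoder outputs, shape (F, 80).
    """
    x = mu + torch.randn_like(mu)      # terminal condition X_T ~ N(mu, I)
    h = 1.0 / n_steps                  # step size in t
    for i in range(n_steps):
        t = 1.0 - (i + 0.5) * h        # midpoint of the current step
        beta_t = beta_0 + t * (beta_1 - beta_0)
        dx = 0.5 * (mu - x - score_net(x, mu, t)) * beta_t
        x = x - dx * h                 # Euler step backwards in time
    return x
```

Fewer steps means faster synthesis at the cost of quality, which is the trade-off the authors evaluate below.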
Training
For training, each part of the model has its own loss:
Input Encoder
For the encoder, the target frames are assumed to be normally distributed around the aligned encoder outputs \(\tilde{\mu}\), so the loss is given by the negative log-likelihood (\(\varphi\) is the Gaussian probability density function).
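Reconstructed from the paper's definitions, with target frames \(y_{1:F}\) and alignment \(A\), the encoder loss reads:
\[
\mathcal{L}_{\text{enc}} = -\sum_{j=1}^{F} \log \varphi\big(y_j;\, \tilde{\mu}_{A(j)}, I\big)
\]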
Duration Predictor
The duration predictor is a neural network trained with MSE. The \(sg[\cdot]\) denotes stopping the gradients on the inputs \(\tilde{\mu}\) to avoid affecting the input encoder. The indicator function \(\mathbb{I}\) is 1 for every frame \(j \in [1:F]\) that the optimal alignment maps to input position \(i\), so the target duration counts the frames assigned to each input. The duration predictor is trained to minimize the MSE between its predictions and the durations derived from the optimal alignment.
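Written out (reconstructed from the paper's definitions, with \(DP\) the duration predictor and \(A^*\) the optimal alignment), the target durations are the log-counts of frames mapped to each input position:
\[
d_i = \log \sum_{j=1}^{F} \mathbb{I}_{\{A^*(j) = i\}}, \qquad
\mathcal{L}_{dp} = \mathrm{MSE}\big(DP(sg[\tilde{\mu}]),\, d\big)
\]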
Diffusion Model
The loss for the diffusion model is the expectation of the weighted losses of the estimated gradients (scores) for noisy samples at different time steps \(t \in [0, T]\). Here \(\lambda_t\) is a weighting factor derived from the noise schedule, and \(X_t\) is a sample from the forward diffusion process at time \(t\).
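A sketch of this objective in the paper's notation, where \(X_t\) is obtained from \(X_0\) with Gaussian noise \(\xi_t \sim \mathcal{N}(0, I)\):
\[
\mathcal{L}_{\text{diff}} = \mathbb{E}_{X_0, t}\Big[\lambda_t\, \mathbb{E}_{\xi_t}\big\| s_\theta(X_t, \mu, t) + \lambda_t^{-1}\,\xi_t \big\|_2^2\Big], \qquad
\lambda_t = 1 - e^{-\int_0^t \beta_s\, ds}
\]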
For training, the parameters of encoder, duration predictor and decoder are first fixed, and the alignment loss is minimized using [[Monotonic Alignment Search (MAS)]]. Then the alignment is fixed and the losses for input encoder, duration predictor and diffusion model are minimized. These two steps are repeated until convergence.
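A minimal sketch of one such alternating iteration; all helper names (`monotonic_alignment_search`, the three loss functions) are hypothetical stand-ins, not the authors' API:

```python
import torch

def train_step(batch, model, optimizer):
    # Step 1: parameters fixed, search the optimal alignment A* with MAS
    # (hypothetical helper; no gradients flow through the search).
    with torch.no_grad():
        A_star = monotonic_alignment_search(model.encode(batch.text), batch.mel)

    # Step 2: alignment fixed, minimize the sum of the three losses.
    loss = (encoder_loss(model, batch, A_star)
            + duration_loss(model, batch, A_star)
            + diffusion_loss(model, batch, A_star))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```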
Results
The authors trained the model on the LJSpeech dataset. The input text was phonemized ([[Phonetic Transcription]]), the output consists of 80-dimensional mel-spectrogram frames. The number of steps \(N\) used to solve the [[Differential Equation#Ordinary Differential Equation (ODE)]] at inference time can be chosen freely, trading synthesis quality against speed.
The model was evaluated using crowd-sourcing via [[Amazon Mechanical Turk]]. While 10 steps already produce results comparable to [[FastSpeech]] and [[Tacotron2]], 1000 steps produce almost natural speech. While it does not beat those models in the Real-Time Factor (RTF), the number of seconds it takes to generate one second of audio, it can produce high-quality speech in real time with fewer parameters than the other models (Grad-TTS-100).
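For reference, the metric can be written as
\[
\mathrm{RTF} = \frac{t_{\text{synthesis}}}{t_{\text{audio}}},
\]
so \(\mathrm{RTF} < 1\) means faster-than-real-time generation.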

Discussion
The authors also try an End2End model, which did not perform well enough for further evaluation. Yet they see a promising future for End2End diffusion models.
Also, they mention that the forward and reverse process can be augmented by [[Lipschitz Constraints]] as done in [[Master Wiki/Models/General Models/Generative Adversarial Network (GAN)#Wasserstein GANs]].
Lastly, they mention possible improvements to the noise schedule that go beyond the linear schedules mostly used right now.