
tags: ['AI', 'ComputerScience', 'Emotion', 'Speech', 'MasterThesis']

paper: [[@guoEmoDiffIntensityControllable2023]]

domain: emotional speech synthesis

type: seq2seq diffusion

input: text sequence (characters/phonemes)

output: mel-spectrogram


# EmoDiff

EmoDiff is a text-to-speech (TTS) diffusion model that produces speech with controllable emotional intensity. It is contrasted with prior intensity-control approaches based on relative attributes, such as Relative Attribute Ranks (RAR), and on emotion embedding vectors. The model is based on Grad-TTS.

## Model Architecture

![[Pasted image 20240527155340.png]]

The basic architecture of EmoDiff is the same as in Grad-TTS (a). The model is conditioned via [[Diffusion Model#Classifier Guidance]] using a trained classifier (b). During inference, the gradient of the classifier is added to the trajectory of \(x_t\) (c).

## Soft-Label Guidance

The authors want not only the emotion itself but also its intensity to be controllable. They therefore extend classifier guidance to soft labels, defining the guidance target as a mix of the neutral class (\(e_0\)) and an emotion class (\(e_i\)): \(d=\alpha e_{i} + (1-\alpha) e_0\). The parameter \(\alpha\) controls the intensity of emotion \(e_i\): if it is low, the trajectory is pulled toward neutral speech; if it is high, toward highly emotional speech. Note that the neutral class \(e_0\) is important, as without it the trajectory would move randomly in the emotion space instead of being pulled toward neutral speech.

The gradient for the combined emotion label follows directly as a weighted sum of the classifier gradients for the two classes:

\[ \nabla_x \log p(d \mid x) = \alpha \nabla_x \log p(e_i \mid x) + (1 - \alpha) \nabla_x \log p(e_0 \mid x) \]

and can be added to the score function \(\nabla_x \log p(x)\) to guide the trajectory during denoising:

\[ \nabla_x \log p(x \mid d) = \alpha \nabla_x \log p(e_i \mid x) + (1 - \alpha) \nabla_x \log p(e_0 \mid x) + \nabla_x \log p(x) \]
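A minimal PyTorch sketch of how such a guided score could be evaluated; `score_model` and `classifier` are hypothetical stand-ins for the trained Grad-TTS score network and the emotion classifier, not the paper's actual code, and \(e_0\) is assumed to be class index 0:

```python
import torch

def guided_score(x_t, t, alpha, e_i, score_model, classifier):
    """Soft-label-guided score at noise level t (neutral/emotion mix).

    score_model and classifier are hypothetical placeholders; class
    index 0 is assumed to be the neutral emotion e_0.
    """
    x_t = x_t.detach().requires_grad_(True)
    log_probs = classifier(x_t, t).log_softmax(dim=-1)   # log p(e | x_t)
    # Soft label: alpha * log p(e_i | x_t) + (1 - alpha) * log p(e_0 | x_t)
    soft = alpha * log_probs[:, e_i] + (1 - alpha) * log_probs[:, 0]
    guidance = torch.autograd.grad(soft.sum(), x_t)[0]   # grad_x log p(d | x_t)
    with torch.no_grad():
        score = score_model(x_t, t)                      # approx. grad_x log p(x_t)
    return score + guidance                              # grad_x log p(x_t | d)
```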

## Mixed Labels

The authors extend the soft-label guidance to work with a mix of emotions:

\[ \nabla_x \log p(d \mid x) \approx \sum_{i=0}^{m-1} w_i \nabla_x \log p(e_i \mid x) \]

where the weights sum to one (\(\sum_i w_i=1\)). Each \(\nabla_x \log p(e_i \mid x)\) is the gradient toward one emotion, so the sum is the gradient toward a weighted mix of all emotions.
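The binary soft-label case above generalizes directly; a sketch, again with a hypothetical `classifier` placeholder:

```python
import torch

def mixed_label_guidance(x_t, t, weights, classifier):
    """Guidance gradient toward a categorical mix of emotions.

    weights: tensor of shape (num_emotions,) summing to 1, i.e. the
    target distribution p_e. classifier is a hypothetical stand-in
    for the trained emotion classifier.
    """
    x_t = x_t.detach().requires_grad_(True)
    log_probs = classifier(x_t, t).log_softmax(dim=-1)   # log p(e_i | x_t)
    mixed = (weights * log_probs).sum(dim=-1)            # sum_i w_i log p(e_i | x_t)
    return torch.autograd.grad(mixed.sum(), x_t)[0]
```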

Since the target emotion is now a categorical distribution \(p_e\) over all emotions, one can show that the soft-label guidance actually decreases the cross-entropy (CE) between the classifier output \(p(\cdot \mid x)\) and the target emotion distribution \(p_e\):

\[ \begin{align} \nabla_x \log p(d \mid x) &\approx \mathbb{E}_{e \sim p_e} \nabla_x \log p(e \mid x)\\ &= -\nabla_x \operatorname{CE} \left[ p_e(\cdot), p(\cdot \mid x) \right] \end{align} \]
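This identity is easy to verify numerically: the weighted sum of per-class log-probability gradients matches the negative cross-entropy gradient. A toy check with a random linear classifier (not the paper's model):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8, requires_grad=True)      # toy "x_t" batch
W = torch.randn(8, 5)                          # toy linear classifier, 5 emotions
p_e = torch.softmax(torch.randn(5), dim=0)     # target emotion distribution

log_probs = F.log_softmax(x @ W, dim=-1)       # log p(e | x)

# Left-hand side: sum_i w_i * grad_x log p(e_i | x)
lhs = torch.autograd.grad((p_e * log_probs).sum(), x, retain_graph=True)[0]

# Right-hand side: -grad_x CE[p_e, p(.|x)], with CE = -sum_i p_e(i) log p(e_i | x)
ce = -(p_e * log_probs).sum()
rhs = -torch.autograd.grad(ce, x)[0]

print(torch.allclose(lhs, rhs))                # True
```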

Soft-label guidance can thus be used to model an arbitrarily complex mix of basic emotions.

## Results

Using the [[Datasets for Emotional Speech#Emotional Speech Database]] (ESD), the authors compare their model against Grad-TTS with emotion labels, the autoregressive [[MixedEmotion]] model, and ground-truth recordings. Vocoding was performed with HiFi-GAN.

On the [[Mean Opinion Score]], EmoDiff outperforms the MixedEmotion model by far. The authors also use a separately trained emotion classifier to classify samples generated at different emotion intensities. EmoDiff shows an almost linear relationship between intensity and classification probability, consistently outperforming MixedEmotion. At full intensity (\(\alpha=1\)), they compare the model to the ground truth and Grad-TTS, finding that both Grad-TTS and EmoDiff come very close to the ground truth in terms of classification probability.