This model is designed for Emotional Speech Synthesis (ESS) with mixed base-emotions. The model uses high-dimensional embeddings of emotions, extracted by a Speech Emotion Recognition (SER) model, instead of a simple one-hot label encoding. It is based on Grad-TTS.
Demo: https://is23emomix.github.io/
Model Architecture¶

EmoMix models emotions as a complex embedding instead of simple labels. To do so, a reference audio is used to extract an emotion embedding \(e\) using a Speech Emotion Recognition (SER) model. Unlike in EmoDiff, the duration predictor also takes the emotion embedding into account.
The denoising model \(\epsilon_{\theta}\) takes as input…
- the emotion embedding \(e\)
- the speaker embedding \(s\)
- the content embedding \(\mu\)
- the current timestep \(t\)
- and the current latent representation \(x_{t}\) and denoises step-wise.
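The conditioning interface above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `epsilon_theta` is a placeholder for the real denoising network, and the Euler update is a simplified Grad-TTS-style reverse step with hypothetical parameter names.

```python
import numpy as np

def epsilon_theta(x_t, mu, s, e, t):
    """Stand-in for the denoising network: receives the current latent x_t,
    content embedding mu, speaker embedding s, emotion embedding e, and
    timestep t, and predicts the score with the same shape as x_t."""
    # Placeholder only; the real model is a conditioned U-Net.
    return np.zeros_like(x_t)

def reverse_step(x_t, mu, s, e, t, beta_t, dt):
    """One simplified Euler step of the reverse diffusion (Grad-TTS style):
    drift toward the content prior mu plus the predicted score."""
    score = epsilon_theta(x_t, mu, s, e, t)
    return x_t + beta_t * (0.5 * (mu - x_t) + score) * dt
```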
Emotions are mixed by changing the embedding used for classifier guidance (see EmoDiff#Soft-Label Guidance) at certain points during the reverse diffusion process.
For the steps…
- \(T\) to \(K_{max}\), the embedding of the primary emotion \(e_{1}\) is conditioned on
- \(K_{max}\) to \(K_{min}\), the weighted average of all emotions is conditioned on
- \(K_{min}\) to 0, the embedding of the mixed-in emotion \(e_{2}\) is conditioned on

The intermediate conditioning on the weighted average should prevent \(e_{2}\) from completely overwriting the conditioning of \(e_{1}\). The authors specifically use weights that mix the neutral emotion with the primary emotion \(e_{1}\) to control the intensity of that primary emotion.
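The three-phase schedule above can be sketched as a simple selector over the countdown timestep \(t\). This is an illustrative sketch under assumed names (`emotion_condition`, `mix_embeddings` are hypothetical, not from the paper's code):

```python
import numpy as np

def mix_embeddings(embeddings, weights):
    """Weighted average of emotion embeddings; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.stack(embeddings), axes=1)

def emotion_condition(t, K_max, K_min, e1, e2, e_mix):
    """Pick the conditioning embedding for reverse step t (t counts down from T to 0):
    e1 for t in (K_max, T], the weighted average for t in (K_min, K_max],
    and e2 for t in [0, K_min]."""
    if t > K_max:
        return e1
    elif t > K_min:
        return e_mix
    else:
        return e2
```

Controlling the intensity of \(e_{1}\) then corresponds to calling `mix_embeddings([e_neutral, e1], [1 - alpha, alpha])` for some intensity weight `alpha`.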
Loss¶
The loss consists of four components:
The contribution of the style loss is controlled by hyperparameter \(\gamma\).
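The combination of the four components can be written as a one-liner; the unit weights on the first three terms are an assumption (only \(\gamma\) on the style loss is stated):

```python
def total_loss(l_dur, l_diff, l_prior, l_style, gamma):
    """Total training loss: duration + diffusion + prior + gamma * style.
    Assumes unit weights on the non-style terms."""
    return l_dur + l_diff + l_prior + gamma * l_style
```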
Duration Loss \(\mathcal{L}_{dur}\)¶
This is the same as in Grad-TTS#Duration Predictor, but also taking the emotion embedding \(e\) into account.
unclear: How does the model take multiple emotions into account?¶
Diffusion Loss \(\mathcal{L}_{diff}\)¶
The loss of the score-based reverse diffusion (Score-Based Diffusion Model). The denoising model (see Diffusion Model#Model for Reverse Process) takes into account the embeddings for content, speaker, and emotion.
Prior Loss \(\mathcal{L}_{prior}\)¶
The prior loss is adopted from Grad-TTS.
Style Loss \(\mathcal{L}_{style}\)¶
The style loss compares the emotional style of the input \(m\) and the synthesized output \(\hat{m}\) using the distance between the [[Gram Matrix]] of each layer within the CNN of the SER model.
The distance norm used is the Frobenius norm, \(\|A\|_{F} = \sqrt{\operatorname{tr}(A^{\top}A)}\), i.e. the square root of the sum of squared entries.
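A minimal numpy sketch of the Gram-matrix style comparison, assuming feature maps of shape `(channels, height, width)` per SER-CNN layer; the summation over layers and the absence of a normalization factor are assumptions, not taken from the paper:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix F F^T of a feature map flattened to (channels, height*width)."""
    f = features.reshape(features.shape[0], -1)
    return f @ f.T

def style_loss(feats_ref, feats_syn):
    """Sum over layers of the Frobenius distance between the Gram matrices
    of the reference (input m) and synthesized (output m_hat) features."""
    loss = 0.0
    for fr, fs in zip(feats_ref, feats_syn):
        loss += np.linalg.norm(gram_matrix(fr) - gram_matrix(fs), ord="fro")
    return loss
```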