EmoConv Diff

Paper by Navin, based on

This model is an Emotional Voice Conversion (EVC) model. Very similar to In-the-wild SEC, it is designed to be trained on non-parallel data (Parallel and Non-parallel Training Data) and is based on Russell's Circumplex Model. It aims to improve conversion towards the more extreme arousal values.

The model outperforms In-the-wild SEC for high-arousal states, but not quite as well for low arousal.

Idea

![[Pasted image 20240915153030.png]]

Disentanglement is achieved, like in In-the-wild SEC, using three encoders and a diffusion decoder.

Encoder

Phoneme Encoder Extracts the speaker- and emotion-independent mel-features (average voice) using a pretrained transformer, adopted from @wuUncoveringDisentanglementCapability2023.

Speaker Encoder Uses a pretrained speaker verification model, also adopted from @wuUncoveringDisentanglementCapability2023, that outputs a speaker-specific embedding called d-vector.

Emotion Encoder The emotion is encoded using a pretrained [[Self-supervised Learning (SSL)]] Speech Emotion Recognition (SER) model, based on @wagnerDawnTransformerEra2023.
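
For concreteness, a minimal sketch of how the three conditioning signals could be assembled. The encoder stubs, dimensions, and names below are assumptions for illustration only, not the pretrained models used in the paper:

```python
import torch

# Stand-in encoders for illustration only; the paper uses pretrained models
# (a transformer phoneme encoder, a speaker-verification d-vector model, and
# an SSL-based SER model). Shapes and dimensions here are assumptions.
def phoneme_encoder(mel):    # (B, T, n_mels) -> average-voice features Y
    return torch.randn_like(mel)              # placeholder

def speaker_encoder(wav):    # (B, samples) -> d-vector S(X_0)
    return torch.randn(wav.shape[0], 256)     # placeholder

def emotion_encoder(wav):    # (B, samples) -> emotion embedding E(X_0)
    return torch.randn(wav.shape[0], 1024)    # placeholder

def encode(mel_source, wav_source):
    """Produce the three disentangled conditioning signals for the decoder."""
    Y = phoneme_encoder(mel_source)   # speaker-/emotion-independent "average voice"
    s = speaker_encoder(wav_source)   # speaker identity
    e = emotion_encoder(wav_source)   # emotion (arousal) information
    return Y, s, e
```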

Decoder

The decoder is then a diffusion model as in Grad-TTS. In contrast to the usual approach of using a Gaussian prior distribution, EmoConv diffuses forward towards the average-voice distribution as encoded by the Phoneme Encoder, and then learns the reverse trajectory (Diffusion Model#Reverse Trajectory) from the average voice back towards the source distribution. This way, the model learns to add speaker and emotion information back into the utterance.
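
A minimal sketch of this forward process, assuming a Grad-TTS-style linear noise schedule; the schedule values and the closed-form mean/variance are taken from Grad-TTS as an assumption, not necessarily the paper's exact settings. The sample is driven from \(X_0\) towards the average voice \(Y\) as \(t\) grows:

```python
import torch

def forward_diffusion(x0, y, t, beta_min=0.05, beta_max=20.0):
    """Diffuse X_0 towards the average voice Y (Grad-TTS-style forward SDE).
    x0, y: (B, T, n_mels) mel-spectrograms; t: (B,) time-steps in [0, 1].
    Schedule values are illustrative assumptions."""
    # cumulative noise int_0^t beta(s) ds for a linear schedule beta(s)
    cum = (beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2).view(-1, 1, 1)
    mean = y + (x0 - y) * torch.exp(-0.5 * cum)   # mu(t): interpolates X_0 -> Y
    sigma = torch.sqrt(1.0 - torch.exp(-cum))     # sigma(t)
    eps = torch.randn_like(x0)
    x_t = mean + sigma * eps                      # reparameterization trick
    return x_t, mean, sigma, eps
```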

The reverse-time diffusion model is thus defined as

\[ s_{\theta}(X_{t},Y,S(X_{0}),E(X_{0}),t) \]

where

  • \(X_{t}\) is the noise sample for time-step \(t\)
  • \(Y\) is the average voice sample from the prior distribution
  • \(S(X_{0})\) is the speaker embedding for the source sample
  • \(E(X_{0})\) is the emotion embedding for the source sample
  • \(t\) is the current time-step

The model outputs the gradient of the log-probability density \(\nabla_{X_{t}}\log p_{t}(X_{t}|Y)\) (the score function of a Score-Based Diffusion Model) and uses the U-Net architecture as in Grad-TTS.
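
As an interface sketch only (the actual model is a U-Net as in Grad-TTS; layer sizes and the way conditioning is injected here are assumptions), the score network takes exactly the inputs listed above:

```python
import torch
import torch.nn as nn

class ScoreModel(nn.Module):
    """Interface sketch of s_theta(X_t, Y, S(X_0), E(X_0), t); not the U-Net
    used in the paper. Layer sizes and conditioning scheme are placeholders."""
    def __init__(self, n_mels=80, spk_dim=256, emo_dim=1024, hidden=256):
        super().__init__()
        self.cond = nn.Linear(spk_dim + emo_dim, hidden)
        self.net = nn.Sequential(
            nn.Linear(2 * n_mels + hidden + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, x_t, y, spk, emo, t):
        B, T, _ = x_t.shape
        # broadcast the speaker/emotion conditioning and the time-step over frames
        c = self.cond(torch.cat([spk, emo], dim=-1)).unsqueeze(1).expand(B, T, -1)
        t_feat = t.view(B, 1, 1).expand(B, T, 1)
        h = torch.cat([x_t, y, c, t_feat], dim=-1)
        return self.net(h)   # estimate of grad_{X_t} log p_t(X_t | Y)
```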

During training, the model is conditioned on the emotion embedding from the Emotion Encoder. During inference, it instead uses the averaged embedding over a set of reference utterances that belong to a certain emotion category \(\bar{e}\) (a scalar between 1 and 7). The reference set is taken to be \(20\%\) of the samples belonging to the target arousal value.

\[ E(\bar{e})=\frac{1}{|A_{p}(\bar{e})|}\sum\limits_{X_{0}\in A_{p}(\bar{e})}E(X_{0}) \]
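
A direct translation of this averaging into code could look as follows, assuming the reference set \(A_{p}(\bar{e})\) has already been selected:

```python
import torch

def averaged_emotion_embedding(reference_utterances, emotion_encoder):
    """E(e_bar): mean emotion embedding over the reference set A_p(e_bar).
    `reference_utterances` is assumed to already be the 20% of samples with
    the target arousal; `emotion_encoder` is the pretrained SER model."""
    embeddings = torch.stack([emotion_encoder(x) for x in reference_utterances])
    return embeddings.mean(dim=0)
```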

Loss

The loss consists of two parts, a score matching loss and a reconstruction loss.

Score matching loss

\[ \mathcal{L}_{s}(X_{t})=\mathbb{E}_{\epsilon_{t}}[||s_{\theta}(X_{t},t)+\sigma(t)^{-1}\epsilon_{t}||^{2}_{2}] \]

The score matching loss is based on @songImprovedTechniquesTraining2020. The noisy sample \(X_{t}\) is not sampled directly from the prior distribution, but computed using the Variational Autoencoder#Reparameterization Trick: \(X_{t}=\mu(t)+\sigma(t)\epsilon_{t}\).
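
Putting the pieces together, a sketch of this objective, reusing the illustrative `forward_diffusion` and `ScoreModel` from above (an interpretation of the loss, not the reference implementation):

```python
import torch

def score_matching_loss(score_model, x0, y, spk, emo, t):
    """L_s: denoising score matching. X_t is built with the reparameterization
    trick; the target score of the Gaussian perturbation kernel is -eps/sigma,
    so the residual below equals s_theta(X_t, t) + sigma(t)^{-1} * eps_t."""
    x_t, _, sigma, eps = forward_diffusion(x0, y, t)
    score = score_model(x_t, y, spk, emo, t)
    residual = score + eps / sigma
    return (residual ** 2).sum(dim=(1, 2)).mean()
```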

Reconstruction loss

The reconstruction loss is a measure of how similar the Mel-frequency Cepstrum Coefficients (MFCC) are:

\[ \mathcal{L}_{m}(\hat{X}_{0})=\sum\limits_{x}||X_{0}-\hat{X}_{0}||_{1} \]

Computing the generated mel-spectrogram \(\hat{X}_{0}\) is expensive, as it requires solving the full reverse SDE from \(X_{t}\). Instead, the authors approximate \(\hat{X}_{0}\) using Tweedie's formula:

\[ \hat{X}_{0}=\frac{\hat{\mu}(t)-(1-\alpha_{t})Y}{\alpha_{t}} \]

where \(\hat{\mu}(t)=X_{t}-\sigma(t)^{2}\,s_{\theta}(X_{t},t)\).
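
Under the assumption that \(\alpha_{t}=e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\,ds}\), so that \(\mu(t)=\alpha_{t}X_{0}+(1-\alpha_{t})Y\) matches the forward process sketched above, the single-step estimate and the L1 reconstruction loss can be sketched as:

```python
import torch

def reconstruction_loss(score_model, x0, y, spk, emo, t,
                        beta_min=0.05, beta_max=20.0):
    """L_m: L1 loss between X_0 and the single-step Tweedie estimate X_hat_0,
    avoiding a full reverse-SDE solve. alpha_t below is an interpretation
    consistent with the forward mean, not necessarily the paper's notation."""
    cum = (beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2).view(-1, 1, 1)
    alpha_t = torch.exp(-0.5 * cum)
    x_t, _, sigma, _ = forward_diffusion(x0, y, t, beta_min, beta_max)
    score = score_model(x_t, y, spk, emo, t)
    mu_hat = x_t - score * sigma ** 2                   # hat{mu}(t)
    x0_hat = (mu_hat - (1.0 - alpha_t) * y) / alpha_t   # Tweedie's formula
    return (x0 - x0_hat).abs().sum(dim=(1, 2)).mean()
```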