EmoConv-Diff
Paper by Navin Raj Prabhu et al., based on In-the-wild SEC.
This model performs Emotional Voice Conversion (EVC). Very similar to In-the-wild SEC, it is designed to be trained on non-parallel data (Parallel and Non-parallel Training Data) and is based on Russell's Circumplex Model. It tries to improve conversion at the more extreme arousal values.
The model outperforms In-the-wild SEC for high-arousal states, but does not do quite as well for low arousal.
Idea

Disentanglement is achieved, like in In-the-wild SEC, using three encoders and a diffusion decoder.
Encoder
**Phoneme Encoder** Extracts the speaker- and emotion-independent mel features (the average voice) using a pretrained transformer, adopted from @wuUncoveringDisentanglementCapability2023.
**Speaker Encoder** Uses a pretrained speaker verification model, also adopted from @wuUncoveringDisentanglementCapability2023, that outputs a speaker-specific embedding called a d-vector.
**Emotion Encoder** The emotion is encoded using a pretrained [[Self-supervised Learning (SSL)]] Speech Emotion Recognition (SER) model, based on @wagnerDawnTransformerEra2023.
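A minimal sketch of how the three frozen encoders could be wired together; the class, dimensions, and pooling below are illustrative stand-ins, not the actual pretrained models:

```python
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Illustrative stand-in for a pretrained, frozen encoder."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        for p in self.parameters():
            p.requires_grad = False  # the pretrained encoders are not fine-tuned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# Dimensions are placeholders, not the paper's values.
phoneme_enc = FrozenEncoder(80, 80)    # mel frames -> average-voice features Y
speaker_enc = FrozenEncoder(80, 256)   # pooled mel -> d-vector S(X_0)
emotion_enc = FrozenEncoder(80, 1)     # pooled mel -> emotion embedding E(X_0)

x0 = torch.randn(1, 100, 80)           # source mel-spectrogram (batch, frames, mels)
Y = phoneme_enc(x0)                    # frame-level "average voice"
s = speaker_enc(x0.mean(dim=1))        # utterance-level speaker embedding
e = emotion_enc(x0.mean(dim=1))        # utterance-level emotion (arousal) embedding
```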
Decoder
The decoder is a diffusion model, as in Grad-TTS. In contrast to the usual approach of using a Gaussian prior distribution, EmoConv-Diff diffuses forward towards the average voice distribution as encoded by the Phoneme Encoder, and then learns the [[Diffusion Model#Reverse Trajectory]] from the average voice back towards the source distribution. This way, the model learns to add emotion and speaker information back into the utterance.
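For concreteness, a sketch of the corresponding forward SDE in the Grad-TTS formulation (the schedule notation is carried over from Grad-TTS as an assumption):

```latex
% Forward SDE: diffuse the source mel X_0 towards the average voice Y
% (beta_t: noise schedule, W_t: Wiener process)
\mathrm{d}X_{t} = \tfrac{1}{2}\beta_{t}\,(Y - X_{t})\,\mathrm{d}t
  + \sqrt{\beta_{t}}\,\mathrm{d}W_{t}
```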
The reverse-time diffusion process is thus defined (following the Grad-TTS formulation) as
\[
\mathrm{d}X_{t}=\left[\tfrac{1}{2}(Y-X_{t})-s_{\theta}\big(X_{t},Y,S(X_{0}),E(X_{0}),t\big)\right]\beta_{t}\,\mathrm{d}t+\sqrt{\beta_{t}}\,\mathrm{d}\bar{W}_{t}
\]
where
- \(X_{t}\) is the noise sample for time-step \(t\)
- \(Y\) is the average voice sample from the prior distribution
- \(S(X_{0})\) is the speaker embedding for the source sample
- \(E(X_{0})\) is the emotion embedding for the source sample
- \(t\) is the current time-step
- \(\beta_{t}\) is the noise schedule and \(\bar{W}_{t}\) a reverse-time Wiener process

The model \(s_{\theta}\) outputs the gradient of the log-probability density \(\nabla_{X_{t}}\log p_{t}(X_{t}|Y)\) (the score function of a Score-Based Diffusion Model) and uses the U-Net architecture as in Grad-TTS.
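A minimal Euler–Maruyama sampling loop for the reverse SDE above; the score network, noise schedule, and tensor shapes are placeholder assumptions, only the update rule mirrors the formula:

```python
import math
import torch

def reverse_step(x_t, y, s_emb, e_emb, t, dt, score_model, beta):
    """One Euler-Maruyama step of the reverse SDE, integrating backwards
    from time t to t - dt."""
    beta_t = beta(t)
    score = score_model(x_t, y, s_emb, e_emb, t)   # approximates the score of p_t(X_t | Y)
    drift = (0.5 * (y - x_t) - score) * beta_t
    noise = math.sqrt(beta_t * dt) * torch.randn_like(x_t)
    return x_t - drift * dt + noise

# Placeholder score network and linear noise schedule, for illustration only.
score_model = lambda x, y, s, e, t: torch.zeros_like(x)
beta = lambda t: 0.05 + (20.0 - 0.05) * t

y = torch.zeros(1, 80, 100)            # average-voice mean from the Phoneme Encoder
x = y + torch.randn_like(y)            # start from the average-voice prior
s_emb, e_emb = torch.zeros(1, 256), torch.zeros(1, 1)
n_steps = 50
for i in range(n_steps, 0, -1):
    x = reverse_step(x, y, s_emb, e_emb, i / n_steps, 1 / n_steps,
                     score_model, beta)
```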
During training, the model is conditioned on the emotion embedding from the Emotion Encoder. During inference, it instead uses the averaged embedding of a set of reference utterances belonging to the target emotion category \(\bar{e}\) (a scalar between 1 and 7). The reference set is chosen as \(20\%\) of the samples belonging to the target arousal, as sketched below.
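A sketch of this inference-time conditioning; the stub encoder and reference selection are simplified stand-ins for the pretrained SER model and the 20% reference split:

```python
import torch

def target_emotion_embedding(reference_mels, emotion_enc):
    """Average the emotion embeddings of reference utterances that share the
    target arousal category; used in place of E(X_0) at inference time."""
    embs = [emotion_enc(mel.mean(dim=0)) for mel in reference_mels]
    return torch.stack(embs).mean(dim=0)

# Illustrative usage with a stub encoder and three reference utterances.
stub_enc = lambda pooled: pooled[:1]            # stand-in for the pretrained SER model
refs = [torch.randn(100, 80) for _ in range(3)]
e_bar = target_emotion_embedding(refs, stub_enc)
```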
Loss
The loss consists of two parts: a score matching loss and a reconstruction loss.
**Score matching loss**
The score matching loss is based on @songImprovedTechniquesTraining2020. The noisy sample \(X_{t}\) is not sampled directly from the prior distribution, but computed using the [[Variational Autoencoder#Reparameterization Trick]]: \(X_{t}=\mu(t)+\sigma(t)\epsilon_{t}\).
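Given this parameterization with \(\epsilon_{t}\sim\mathcal{N}(0,I)\), the conditional score has the closed form \(-\epsilon_{t}/\sigma(t)\), so a denoising score matching objective of the standard form is presumably used (the exact weighting is an assumption):

```latex
% Denoising score matching loss (standard form, weighting assumed)
\mathcal{L}_{\mathrm{sm}} = \mathbb{E}_{t,\,X_{0},\,\epsilon_{t}}
\left[\,\left\lVert s_{\theta}\big(X_{t},Y,S(X_{0}),E(X_{0}),t\big)
 + \frac{\epsilon_{t}}{\sigma(t)} \right\rVert_{2}^{2}\,\right]
```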
**Reconstruction loss** The reconstruction loss measures how similar the Mel-frequency Cepstrum Coefficients (MFCC) of the source and the generated sample are:
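A plausible form of this loss, assuming an \(L_{1}\) distance between the MFCCs of source and estimate (the exact norm and weighting are assumptions):

```latex
% Reconstruction loss on MFCCs (L1 distance assumed)
\mathcal{L}_{\mathrm{rec}} =
\left\lVert \mathrm{MFCC}(X_{0}) - \mathrm{MFCC}(\hat{X}_{0}) \right\rVert_{1}
```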
Computing the generated mel-spectrogram \(\hat{X}_{0}\) is expensive, as it requires solving the full reverse SDE from \(X_{t}\). Instead, the authors approximate \(\hat{X}_{0}\) using Tweedie's formula:
where \(\hat{\mu}(t)=X_{t}-\sigma(t)^{2}\,s_{\theta}(X_{t},t)\).
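Assuming the Grad-TTS mean parameterization \(\mu(t)=\gamma_{t}X_{0}+(1-\gamma_{t})Y\) (carried over here as an assumption), the mel estimate then follows by inverting the mean estimate:

```latex
% Inverting Tweedie's mean estimate under the assumed Grad-TTS
% parameterization mu(t) = gamma_t X_0 + (1 - gamma_t) Y
\hat{X}_{0} = \frac{\hat{\mu}(t) - (1-\gamma_{t})\,Y}{\gamma_{t}},
\qquad \gamma_{t} = e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}\,\mathrm{d}s}
```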