Paper by Navin, based on @prabhuWildSpeechEmotion2023

This model is an Emotional Voice Conversion (EVC) model does not require a parallel dataset (Parallel and Non-parallel Training Data). This means, it can be trained on any data that provides a scalar emotion label (But would also generalize to discrete labels I guess). Emotions are modeled in accordance to the Russell's Circumplex Model, meaning emotions are distributed continuously along arousal and valence dimensions, instead of using discrete primary dimensions.

The model performs well for moderate levels of arousal and sounds more natural for an arousal reduction than for an arousal increase.

Idea¶

The model uses disentangled latent representations of the speech instead of a mel spectrogram for vocoding (called resynthesis). The internal representation is learned through [[Self-supervised Learning (SSL)]] and disentangles speaker and linguistic latent features. During resynthesis, the disentangled representation is augmented with an emotion embedding.

( #MasterThesis Maybe the attribute vector from MixedEmotions might make more sense?)

Pasted image 20240915132103.png

Disentanglement is done using three separate encoders, a lexical encoder \(E_{l}\) and speaker encoder \(E_{s}\) embed the waveform input, an emotion encoder \(E_{e}\) encodes the scalar emotion label. This gives the latent representation \(z_{T^{\prime}}=(z_{l}, z_{s}, z_{e})\). The lexical embedding \(z_{l}\) is a vector of the length of input frames, the speaker and emotion embeddings are global and time-invariant, so they are concatenated to each frame in \(z_{l}\). The latent representation is then the input to the resynthsizer, which vocodes the latent representation back into a waveform.

Lexical Encoder \(E_{l}\) is a pretrained SSL-based [[HuBERT]] model ([[Self-supervised Learning (SSL)]]) applied to each input frame, where the continuous internal representations are discretized into a integer representation using k-means.

Speaker Encoder \(E_{s}\) is a pretrained [[WavLM]] speaker verification model, that outputs a continuous d-vector (other word for speaker embedding). Its encoding all input frames at once.

Emotion Encoder \(E_{e}\) is simply a stack of trainable linear layers. Like the speaker encoder, its applied to the whole utterance at once.

Resynthesizer The latent representation is resynthesized using a modified HiFiGAN. HiFiGAN uses two discriminators, one that works on multiple periods and one that works on multiple scales.

Loss¶

As the encoders are pretrained, the relevant loss is in regards to the HiFiGAN.

Generator Loss¶

\[ L_{G}(D,G)=\sum\limits_{j=1}^{J}[L_{adv}(D_{j},G)+\gamma_{fm}L_{fm}(D_{j},G)]+\gamma _{recon}L_{recon}(G)+\gamma_{ser}L_{ser} \]

The generator loss consists of components:

\(L_{adv}\)¶

The adversarial loss is the classic generative loss of a GAN. It measures how well it fools the discriminator:

\[ L_{adv}(D_{j},G)=\sum\limits_{x}||1-D_{j}(G(z_{T^\prime}))||^{2}_{2} \]

If the generator is perfect, the discriminator will put out \(1\) as to label the generated sample as real. So the adversarial loss for the generator is the squared distance to \(1\), summed over all sub-discriminators of the HiFiGAN.

\(L_{fm}\)¶

The activation loss is measuring the difference in activations for all the layers of the discriminator when given a generated sample compared to a real sample.

\[ L_{fm}(D_{j}, G)=\sum\limits_{x}\sum\limits_{i=1}^{R}\frac{1}{M_{i}}||\psi_{i}(x)-\psi_i(G(z_{T^\prime}))|| \]

\(R\) is the number of discriminator layers and \(M_{i}\) is the number of features in the \(i\)-th layer. The activation \(\psi\) should optimally be the same for the real and the generated sample.

\(L_{recon}\)¶

The reconstruction loss measures the difference in Mel-coefficients.

\[ L_{recon}(G)=\sum\limits_{x}||\phi(x)-\phi(G(z_{T^{\prime}}))||_{1} \]

The function \(\phi(x)\) computes the Mel-frequency Cepstrum Coefficients (MFCC) for the input \(x\).

\(L_{ser}\)¶

The emotion-recognition loss is measures, how well the model output represents the target emotion using Speech Emotion Recognition (SER).

\[ L_{SER}=\sum\limits_{x}[1-L_{ccc}(e,E_{SER}(G(z_{T^{\prime}})))] \]

Perfect reconstruction means a [[Concordance Correlation Coefficient]] of \(1\) between the target emotion embedding \(e\) and the SER output.

Discriminator Loss¶

The discriminator loss is simply the sum of sub-discriminator losses, where each loss is calculated as:

\[ L_{D}(D_{j},G)=\sum\limits_{x}[||1-D_{j}(x)||^{2}_{2}+||D_{j}(G(z_{T^{\prime}}))||^{2}_{2}] \]

unclear For the second term, the paper uses \(\hat{x}\) instead of \(\hat{y}=G(z_{T^{\prime}})\), which isn't mentioned anywhere else. Probably an error?¶