MixedEmotions
Most Emotional Speech Synthesis (ESS) models either allow no [[Emotional Speech Synthesis (ESS)#Control#Mixed Emotions|mixed emotions]] at all or, if they do, do not capture the interplay between emotions. For example, the emotion surprise might have a different effect on happy speech than on angry speech. While models like EmoMix perform a kind of linear interpolation between two emotions, the MixedEmotions model tries to capture the actual probability density topology between the emotional base dimensions.
Model Architecture
MixedEmotions is based on the emotion wheel theory, according to which emotions occur as mixes of eight primary emotions. Since most Datasets for Emotional Speech contain speech with only a single emotional label, the authors use Relative Attribute Rank to create an Emotion Attribute Vector for each input sample, which represents the relative amount of each primary emotion contained in the sample. This way, the samples are projected fully into the latent emotion space (in other models, each sample would lie on one of the base axes).
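A minimal sketch of how such an attribute vector might be assembled, assuming eight Plutchik-style primary emotions and hypothetical per-emotion scores coming from pre-trained relative-attribute rankers; the function, the emotion names, and the min-max normalization are illustrative, not taken from the paper:

```python
# Sketch only: turn hypothetical relative-attribute ranking scores into an
# emotion attribute vector (one entry per primary emotion).
import numpy as np

PRIMARY_EMOTIONS = [
    "joy", "trust", "fear", "surprise",
    "sadness", "disgust", "anger", "anticipation",
]

def emotion_attribute_vector(ranking_scores: dict[str, float]) -> np.ndarray:
    """Normalize raw ranking scores into a vector of relative emotion amounts."""
    scores = np.array([ranking_scores[e] for e in PRIMARY_EMOTIONS], dtype=float)
    # Min-max normalization: the sample ends up inside the latent emotion space
    # instead of sitting exactly on one base axis.
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-8)

# Example: a sample labeled "happy" that also carries some surprise.
vec = emotion_attribute_vector(
    {"joy": 2.1, "trust": 0.3, "fear": -1.0, "surprise": 1.2,
     "sadness": -1.5, "disgust": -1.3, "anger": -0.9, "anticipation": 0.6}
)
print(dict(zip(PRIMARY_EMOTIONS, vec.round(2))))
```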
Training

Emotion Control
[[Emotional Speech Synthesis (ESS)#Control|Emotion control]] is done using a reference speech sample from which the emotion is extracted. The speech is encoded using an emotion encoder ([[Bidirectional LSTM]]). The emotion embedding is concatenated with the Emotion Attribute Vector described above. Together, they provide the style and the point in the emotion space that is to be reconstructed.
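A rough PyTorch sketch of this conditioning path, assuming mel-spectrogram input; the `EmotionEncoder` module name, the hidden sizes, and the mean pooling over time are assumptions, not the authors' exact setup:

```python
# Sketch: BiLSTM emotion encoder producing a style embedding that is
# concatenated with the 8-dimensional emotion attribute vector.
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 128, emb_dim: int = 64):
        super().__init__()
        self.blstm = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        out, _ = self.blstm(mel)          # [B, T, 2*hidden]
        pooled = out.mean(dim=1)          # average over time -> utterance level
        return self.proj(pooled)          # [B, emb_dim]

encoder = EmotionEncoder()
mel = torch.randn(4, 200, 80)            # batch of reference mel-spectrograms
attr = torch.rand(4, 8)                  # emotion attribute vectors
style = torch.cat([encoder(mel), attr], dim=-1)  # conditioning for the decoder
print(style.shape)                       # torch.Size([4, 72])
```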
Content Control
The linguistic content is encoded from either the text or the input speech sample in an alternating manner (even epochs: text, odd epochs: speech). Using a [[Contrastive Loss]], the two embeddings are pushed to be similar (based on @zhangNonParallelSequenceSequenceVoice2020). Additionally, the linguistic encoder is forced to learn an emotion-agnostic embedding by means of an adversarial emotion classifier.
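A sketch of what these two content-control objectives could look like; gradient reversal is one common way to realize an adversarial classifier, and all names, shapes, and the squared-distance contrastive term here are assumptions rather than the paper's exact formulation:

```python
# Sketch: epoch-alternating content source, contrastive text/speech alignment,
# and an adversarial emotion classifier via gradient reversal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, flips gradients in the backward pass,
    pushing the content encoder toward an emotion-agnostic embedding."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def content_losses(epoch, text_emb, speech_emb, emotion_logits_fn, emotion_labels):
    # Even epoch: content comes from text; odd epoch: from speech.
    content = text_emb if epoch % 2 == 0 else speech_emb

    # Contrastive term: text and speech embeddings of the same utterance
    # should be close (after @zhangNonParallelSequenceSequenceVoice2020).
    l_contrastive = F.pairwise_distance(text_emb, speech_emb).pow(2).mean()

    # Adversarial term: classify emotion from the gradient-reversed content.
    logits = emotion_logits_fn(GradReverse.apply(content))
    l_adv = F.cross_entropy(logits, emotion_labels)
    return content, l_contrastive + l_adv

# Example usage with placeholder embeddings and a linear emotion classifier.
clf = nn.Linear(128, 8)                           # 8 primary-emotion classes
t, s = torch.randn(4, 128), torch.randn(4, 128)
labels = torch.randint(0, 8, (4,))
content, loss = content_losses(0, t, s, clf, labels)
```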
The decoder takes the emotion embedding and the content embedding as input and tries to reconstruct the original speech.
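A minimal sketch of this reconstruction step, assuming a single BiLSTM decoder over frame-level content with the utterance-level style broadcast to every frame; the dimensions and the L1 reconstruction loss are illustrative assumptions:

```python
# Sketch: decode content + style back into a mel-spectrogram and compare
# against the original speech.
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, in_dim: int, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_mels)

    def forward(self, content_seq, style):
        # Broadcast the utterance-level style vector to every content frame.
        style_seq = style.unsqueeze(1).expand(-1, content_seq.size(1), -1)
        x = torch.cat([content_seq, style_seq], dim=-1)
        out, _ = self.blstm(x)
        return self.out(out)

decoder = Decoder(in_dim=128 + 72)                # content dim + style dim
content_seq = torch.randn(4, 200, 128)
style = torch.randn(4, 72)
mel_hat = decoder(content_seq, style)
loss = nn.functional.l1_loss(mel_hat, torch.randn(4, 200, 80))
```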
Inference

At inference time, the reference sample is only used to create an emotion style embedding, while the attribute vector is controlled manually. The linguistic embedding is now extracted from the text alone. All encoders and the decoder are [[Bidirectional LSTM]]s.
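A sketch of the inference path under the same assumptions as the sketches above; `synthesize` and its arguments are hypothetical names, and the example attribute vector is just an illustration of a manually controlled mix:

```python
# Sketch: text -> content, reference speech -> style, manual attribute vector.
import torch

@torch.no_grad()
def synthesize(text_encoder, emotion_encoder, decoder, text_ids, reference_mel, attr):
    content_seq = text_encoder(text_ids)        # linguistic embedding from text only
    style = emotion_encoder(reference_mel)      # emotion style from the reference
    style = torch.cat([style, attr], dim=-1)    # manually chosen point in emotion space
    return decoder(content_seq, style)          # predicted mel-spectrogram

# Example of a manual mix over the eight primary emotions, e.g. 0.6 joy + 0.4 surprise.
attr = torch.tensor([[0.6, 0.0, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0]])
```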