Emotional Voice Conversion (EVC)

Emotional Voice Conversion is concerned with injecting emotional depth into speech. It mainly acts on the prosody of speech. See Emotion#^dc3104.

Owing to major advances in Text-to-Speech Synthesis (TTS), EVC is often the go-to method for Emotional Speech Synthesis (ESS), as opposed to [[Text-to-Emotional-Feature Synthesis (TTEF)]]: speech that has already been synthesized is converted into emotional speech.

Overview

Typically EVC involves two steps:

Feature Extraction: This step extracts global as well as temporal features such as the fundamental frequency (F0, i.e. pitch), the energy envelope and duration. A common method for this is [[Continuous Wavelet Transform (CWT)]], which decomposes a contour into components at several time scales. When using [[Deep Learning]] approaches, a fundamental issue is Entanglement of Latent Features. Possible solutions are disentanglement methods, see Emotional Speech Synthesis (ESS)#^700bce.
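To make the multi-scale idea behind CWT more concrete, here is a minimal sketch in plain Python. It assumes a Mexican hat mother wavelet, dyadic scales measured in frames, and a synthetic stand-in for a normalized log-F0 contour; none of these choices come from this note, and the helper names (`mexican_hat`, `cwt`, `scale_energies`) are hypothetical. A real pipeline would more likely rely on a dedicated wavelet library and an interpolated F0 track.

```python
import math

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet evaluated at t."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - t * t) * math.exp(-0.5 * t * t)

def cwt(signal, scales):
    """Direct-sum CWT: one row of coefficients per scale (scale in frames)."""
    n = len(signal)
    rows = []
    for s in scales:
        row = []
        for b in range(n):  # b is the frame the wavelet is centred on
            acc = sum(signal[k] * mexican_hat((k - b) / s) for k in range(n))
            row.append(acc / math.sqrt(s))
        rows.append(row)
    return rows

def scale_energies(coeffs):
    """Sum of squared coefficients per scale."""
    return [sum(c * c for c in row) for row in coeffs]

# Hypothetical normalized log-F0 contour: an oscillation with a 16-frame period
n_frames = 128
contour = [math.sin(2.0 * math.pi * k / 16.0) for k in range(n_frames)]

scales = [1, 2, 4, 8, 16]  # assumed dyadic scales, finest to coarsest
coeffs = cwt(contour, scales)
energies = scale_energies(coeffs)
dominant_scale = scales[energies.index(max(energies))]
```

Each row of `coeffs` describes the contour at one temporal resolution, so fine scales pick up fast, local pitch movements while coarse scales follow slower, phrase-level trends. In this toy case the energy concentrates in the scale whose wavelet width best matches the 16-frame period.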

Feature Mapping: The model needs to capture the relationship between source and target features to be able to convert one into the other. Early approaches used hierarchical clustering followed by Gaussian Mixture Models. Contemporary solutions employ neural networks, using Deep Belief Networks, Deep Bidirectional LSTMs and attention-based seq2seq models. The attention mechanism learns the feature mapping and alignment during training and can utilize them for prediction at runtime. EVC models are typically trained on parallel data (see Parallel and Non-parallel Training Data), but Autoencoders (AE) and Generative Adversarial Network (GAN) variants such as CycleGAN and StarGAN have enabled training on non-parallel data. An example from @zhouEmotionalVoiceConversion2022:

Pasted image 20240423114854.png
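The alignment role of attention in the feature mapping step can be illustrated with a small sketch. It assumes scaled dot-product attention for a single decoder step, which is only one of several attention variants and not necessarily the one used in the cited work. The function names and the toy keys, values and query are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one decoder step."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # soft alignment over the source frames
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Hypothetical encoded source frames (keys) and their feature values
source_keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
source_values = [[100.0], [150.0], [200.0]]

# Hypothetical decoder state that resembles the second source frame
decoder_query = [0.0, 4.0]
weights, context = attend(decoder_query, source_keys, source_values)
```

The `weights` vector acts as a soft alignment: the decoder step draws almost entirely on the source frame whose key matches the query, and `context` is the correspondingly weighted mix of source features. In a trained seq2seq model the keys, values and queries would be learned representations rather than hand-picked vectors.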

Just like Emotional Speech Synthesis (ESS), EVC can profit from advances in the related fields of Text-to-Speech Synthesis (TTS) and [[Automatic Speech Recognition (ASR)]]. Speaker-independent ASR systems can be used to produce a [[Phonetic PosteriorGram (PPG)]], which, combined with the feature extraction mentioned above, can be used to improve the EVC's performance. An example from @zhouEmotionalVoiceConversion2022:

Pasted image 20240423115601.png
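As a rough illustration of how a PPG could be combined with the extracted features, the sketch below treats a PPG as a frames-by-classes matrix of phonetic posteriors and concatenates it frame by frame with prosodic features. The ASR logits, the feature values and the helper names (`posteriorgram`, `combine`) are all hypothetical, and plain concatenation is an assumption made for illustration, not a method taken from the cited work.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def posteriorgram(frame_logits):
    """Turn per-frame ASR logits into per-frame phonetic posteriors."""
    return [softmax(frame) for frame in frame_logits]

def combine(ppg, prosody):
    """Concatenate PPG rows with prosodic feature rows, frame by frame."""
    if len(ppg) != len(prosody):
        raise ValueError("PPG and prosodic features must have the same number of frames")
    return [p + f for p, f in zip(ppg, prosody)]

# Hypothetical ASR logits: 3 frames, 4 phonetic classes
asr_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 2.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 2.0],
]
# Hypothetical prosodic features per frame: [log-F0, energy]
prosody = [[5.0, 0.6], [5.1, 0.7], [4.9, 0.5]]

ppg = posteriorgram(asr_logits)
model_input = combine(ppg, prosody)  # 3 frames, 4 + 2 = 6 values each
```

Because the PPG describes what is being said largely independently of who says it, a conversion model fed with `model_input` can keep the linguistic content fixed while it changes the prosodic part.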

Training Data

This section overlaps with the database requirements of Emotional Speech Synthesis (ESS):

Datasets for Emotional Speech#Datasets