
#ComputerScience #Speech #Emotion

This note is sourced mainly from @triantafyllopoulosOverviewAffectiveSpeech2023.

Emotional Speech Synthesis (ESS) is the process of generating utterances that are both comprehensible and carry emotional content. Text-to-Speech Synthesis (TTS) can be regarded as a sub-problem of ESS: TTS is the foundation that allows us to generate comprehensible speech, while ESS extends it with emotional depth. Due to the fast progress in TTS, ESS is often built on top of TTS rather than designed jointly. Commonly, TTS utterances are augmented with emotional content via Emotional Voice Conversion (EVC). ^8180db

Emotion

Emotion#^05471c

In the context of ESS, emotion is typically defined in one of two ways:

  • Discrete theories: Emotions are treated as discrete categories. Typical example: Ekman's Big Six.
  • Dimensional theories: Emotions are seen as having several continuous dimensions, like arousal, valence and dominance. Typical example: Russell's Circumplex Model.

Discrete theories have been predominant in DL-based ESS, but struggle to capture the subtleties of human emotion; the sketch below contrasts the two representations. @zhouEmotionalVoiceConversion2022
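As a minimal illustration (values are made up, not taken from the source), the same target emotion can be represented either as a one-hot category or as a point in arousal/valence/dominance space:

```python
import numpy as np

# Discrete view: one-hot over Ekman's Big Six (hypothetical ordering).
BIG_SIX = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
one_hot = np.eye(len(BIG_SIX))[BIG_SIX.index("happiness")]  # -> [0 0 0 1 0 0]

# Dimensional view: arousal/valence/dominance in [-1, 1] (illustrative values).
vad = np.array([0.6, 0.8, 0.3])  # moderately aroused, very positive, mildly dominant

print(one_hot, vad)
```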

History

ESS went through a history similar to Text-to-Speech Synthesis (TTS)#History. Early approaches used rule-based methods ([[Formant Synthesis]]) to modulate the acoustic correlates of emotion, primarily the pitch and timing of speech. The parameters were initially hand-tuned by experts. Later, more data-driven approaches emerged, where parameter ranges were derived from source data.

Those early approaches were later replaced by [[Concatenative Speech Synthesis]], where pre-recorded utterances are concatenated into the target speech. Compared to standard concatenative TTS, the pre-recorded units in this case also contain emotional variations. And just like standard concatenative TTS, the approach suffered from limited data availability and audible discontinuities.

Finally, ESS moved to [[Statistical Parametric Speech Synthesis (SPSS)]] and its extension into the domain of deep learning.

Deep Learning Approaches

Researchers are adapting DL approaches from TTS for ESS purposes. The main challenge is to inject emotion without Entanglement of Latent Features. Advances in seq2seq models in particular have simplified dealing with target speech whose duration differs from that of the source signal. For ESS, those models mostly consist of multiple encoders, so that latent features like speaker, emotion and content are not entangled (see #Deep Learning Approaches#Disentanglement methods). They often use a Variational Autoencoder style, where the latent features are Gaussian distributed.
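A minimal sketch (PyTorch; module and parameter names are hypothetical) of the VAE-style bottleneck such encoders typically use to keep the latent Gaussian distributed:

```python
import torch
import torch.nn as nn

class GaussianLatent(nn.Module):
    """VAE-style bottleneck: an encoder output is mapped to a Gaussian posterior
    and sampled with the reparameterization trick, so the latent stays smooth
    and can later be traversed or mixed."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.to_mu = nn.Linear(in_dim, latent_dim)
        self.to_logvar = nn.Linear(in_dim, latent_dim)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # KL term pushing the posterior towards a standard Gaussian prior
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl
```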

Another general trend is the increasing end-to-endness of ESS models, moving away from [[Statistical Parametric Speech Synthesis (SPSS)]] pipelines that are separated into a text analyzer, an acoustic model and a vocoder.

Overview of Existing Works

@triantafyllopoulosOverviewAffectiveSpeech2023

Deep learning approaches in ESS can be categorized by how they work:

  • Degree of End-to-End Behavior: [[Text-to-Emotional-Feature Synthesis (TTEF)]] vs Emotional Voice Conversion (EVC). Are acoustic features with emotional content synthesized directly (TTEF), or is the emotional content injected into acoustic features produced by standard TTS (EVC)?
  • Parallel and Non-parallel Training Data: Is parallel or non-parallel data used in training? For non-parallel data, the problem of Entanglement of Latent Features arises.
  • Direct transformation vs decomposition: Is the emotional content decomposed as a separate component or directly incorporated into the model's latent representation?
  • Reference-based vs reference-free: Is the target emotion provided as a reference sample or as a non-auditory representation like a one-hot encoding?
  • Features: Which target features are manipulated? These could be spectral features, prosody, F0, …
  • Granularity: Is the emotional content controlled on an utterance- or a frame/word-level?

Existing Works

| Approach | Control | Intensity | Granularity | Non-parallel data | Conversion | Model | Features | End-to-end |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ming et al. | Fixed | N/A | Utterance | x | Transformation | bLSTM | STRAIGHT | EVC |
| Lee et al. | One-hot | N/A | Utterance | x | N/A | Seq2Seq | Spectra | TTEF |
| Lorenzo-Trueba et al. | Annotator agreement | Annotator agreement | Utterance | x | N/A | RNN | WORLD | TTEF |
| Choi et al. | Reference | N/A | Utterance | x | Disentanglement | CNN | Spectra | TTEF |
| Kwon et al. | Reference | N/A | Utterance | x | Disentanglement | Seq2Seq | Spectra | TTEF |
| Shankar et al. | Fixed | N/A | Utterance | x | Transformation | Highway | F0/intensity | EVC |
| Bao et al. | Fixed | N/A | Utterance | | Transformation | CycleGAN | openSMILE | EVC |
| Luo et al. | Fixed | N/A | Utterance | x | Transformation | GAN | F0 | EVC |
| Robinson et al. | Fixed | N/A | Frame | x | Transformation | Seq2Seq | F0 | EVC |
| Gao et al. | Reference | N/A | Utterance | | Disentanglement | GAN | F0/Spectra | EVC |
| Kim et al. | Reference | N/A | Utterance | x | Disentanglement | Seq2Seq | Spectra | EVC |
| Rizos et al. | One-hot | N/A | Utterance | | Transformation | StarGAN | Cepstra | EVC |
| Cao et al. | Fixed | N/A | Utterance | | Transformation | VAE-GAN | Cepstra | EVC |
| Schnell and Garner | Reference | Saliency maps | Frame | x | Transformation | RNN | WORLD | EVC |
| Liu et al. | Reference | N/A | Utterance | | Transformation | Seq2Seq | Spectra | TTEF |
| Du et al. | Reference | N/A | Utterance | | Transformation | StarGAN | Cepstra | EVC |
| Choi and Hah | Reference | Manual | Utterance | x | Disentanglement | Seq2Seq | Spectra | EVC |
| Cai et al. | Reference | N/A | Utterance | | Disentanglement | Seq2Seq | Spectra | TTEF |
| Wu et al. | Reference | N/A | Frame | x | Disentanglement | Seq2Seq | Spectra | TTEF |
| Kreuk et al. | Fixed | N/A | Frame | x | Transformation | Seq2Seq | Spectra/F0/T | EVC |
| Zhou et al. | Reference | Ranking | Utterance | | Disentanglement | Seq2Seq | Spectra | EVC |
| Zhang et al. | Reference | Posterior | Utterance | | Disentanglement | Seq2Seq | Spectra/F0 | EVC |
| Li et al. | Reference | Manual | Utterance | | Disentanglement | Seq2Seq | Spectra | TTEF |
| Liu et al. | Reference | N/A | Frame | | Disentanglement | Seq2Seq | Spectra | TTEF |
| Lei et al. | Reference | Ranking | Frame | | Disentanglement | Seq2Seq | Spectra | TTEF |

Parallel and Non-parallel Data

Data can be categorized into Parallel and Non-parallel Training Data.

Parallel Data

Parallel data is straightforward to work with: the utterances in the mapped pairs stay constant and only the emotion varies. For emotional processing, however, parallel data has significant downsides:

  1. Scaling is very limited, as large databases specifically tailored to emotional processing are needed.
  2. Controllability is also limited. It is often only possible to map from one emotion to another, and multiple models are needed for multiple emotions.
  3. Naturalness suffers, since acted recordings differ from utterances in real-world environments.

Non-parallel Data

Compared to parallel data, non-parallel data can be sourced from real-world environments and is thus much more naturalistic and generalizable. But it leads to the problem of Entanglement of Latent Features: emotional content can become entangled with the speaker's identity and gender. Thus, the input utterance has to be decomposed into speaker, emotion and content components.

While this more complicated approach swayed a lot of research toward Emotional Voice Conversion (EVC), recent works focus again more on [[Text-to-Emotional-Feature Synthesis (TTEF)]] with non-parallel training data.

Generative Models

Generative models like Generative Adversarial Network (GAN)#CycleGAN & StarGAN have proven effective for ESS. CycleGANs were first adopted for Voice Conversion (VC) in 2018, with adaptations such as gated CNNs, an identity loss, and additional discriminator networks for the cyclically reconstructed source samples. A core issue with CycleGANs, however, is that they only support one-to-one distributional mappings, so a separate CycleGAN is needed for each emotional mapping. A solution that is capable of one-to-many mappings is the StarGAN model.
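A schematic sketch of the loss terms involved (PyTorch; generator/discriminator names and loss weights are placeholders, not taken from a specific paper):

```python
import torch
import torch.nn.functional as F

def cyclegan_vc_losses(G_n2e, G_e2n, D_e, x_neutral, x_emotional):
    """Schematic loss terms for a neutral->emotional CycleGAN mapping.
    G_n2e/G_e2n: generators between the two domains, D_e: emotional-domain
    discriminator, x_neutral/x_emotional: batches of acoustic features."""
    fake_e = G_n2e(x_neutral)
    # Adversarial loss: the converted sample should fool the discriminator.
    adv = F.binary_cross_entropy_with_logits(D_e(fake_e), torch.ones_like(D_e(fake_e)))
    # Cycle-consistency loss: mapping back should reconstruct the source.
    cyc = F.l1_loss(G_e2n(fake_e), x_neutral)
    # Identity loss: a sample already in the target domain should pass unchanged.
    idt = F.l1_loss(G_n2e(x_emotional), x_emotional)
    return adv + 10.0 * cyc + 5.0 * idt  # weights are illustrative
```

The one-to-one limitation is visible here: `G_n2e` is tied to a single source-target emotion pair, whereas StarGAN replaces the generator pair with a single generator conditioned on a target-emotion code.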

Disentanglement Methods

![[Pasted image 20240422140034.png]]

An approach to deal with Entanglement of Latent Features is to decompose a source sample into separate emotional, speaker and content information. Multiple encoders can be used: one extracts the emotion \(e\) (using Speech Emotion Recognition (SER) to compute a loss), one encodes the speaker \(s\) (often just given by an ID), and one encodes the content \(c\) (which might be text directly or encoded from the source sample). These encoded features can then be concatenated into a single latent feature \(z\) whose dimensions remain disentangled. ^700bce
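A minimal sketch of this multi-encoder setup (PyTorch; all names and shapes are hypothetical, and the utterance-level embeddings are treated as fixed-size vectors for simplicity):

```python
import torch
import torch.nn as nn

class DisentangledESS(nn.Module):
    """Sketch of the multi-encoder idea: emotion, speaker and content are
    encoded separately and concatenated into one latent z."""
    def __init__(self, emo_enc, content_enc, decoder, ser, n_speakers, spk_dim=64):
        super().__init__()
        self.emo_enc, self.content_enc = emo_enc, content_enc
        self.decoder, self.ser = decoder, ser
        self.spk_table = nn.Embedding(n_speakers, spk_dim)  # speaker often given by ID

    def forward(self, ref_audio, text_or_audio, speaker_id, emotion_label):
        e = self.emo_enc(ref_audio)          # emotion embedding from a reference
        c = self.content_enc(text_or_audio)  # linguistic content
        s = self.spk_table(speaker_id)       # speaker identity
        # In practice e and s are broadcast along the time axis of c;
        # here all three are flat vectors for brevity.
        z = torch.cat([e, c, s], dim=-1)     # disentangled latent
        y = self.decoder(z)                  # acoustic features / mel-spectrogram
        # Auxiliary SER loss keeps the emotion branch carrying the emotion.
        ser_loss = nn.functional.cross_entropy(self.ser(e), emotion_label)
        return y, ser_loss
```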

MasterThesis Potentially, Variational Autoencoder#Disentanglement methods could be used in Score-Based Diffusion Model#Latent Score-Based Generative Model to leverage the advantages of latent diffusion and VAE disentanglement at the same time.

Control

Emotional Type

Early methods trained multiple models for different emotional mappings, of which the appropriate pair was selected. Superseding those models, StarGAN allowed for one-to-many mappings, but was still constrained by the one-hot context vector provided.

Later models utilized style transfer, where a prosodic encoding is learned. This is done by jointly training a TTS system and a prosody encoder in a VAE style; new text inputs can later be conditioned on an encoding produced by the prosody encoder. However, these models are prone to Entanglement of Latent Features, so that changing the gender of a voice from female to male might sound like a low-pitched female voice.

Building on top of that, newer models use Global Style Tokens (GSTs) in attention-based systems, which better disentangle the prosodic features and sound more natural.
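A rough sketch of a GST layer (PyTorch; token count, dimensions and head count are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GlobalStyleTokens(nn.Module):
    """Sketch of a GST layer: a reference/prosody embedding attends over a small
    bank of learnable style tokens; the weighted sum becomes the style embedding."""
    def __init__(self, n_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.3)
        self.attn = nn.MultiheadAttention(embed_dim=token_dim, num_heads=4,
                                          batch_first=True)
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):                             # (batch, ref_dim)
        q = self.query_proj(ref_embedding).unsqueeze(1)           # (batch, 1, token_dim)
        kv = self.tokens.unsqueeze(0).expand(ref_embedding.size(0), -1, -1)
        style, weights = self.attn(q, kv, kv)                     # weighted sum of tokens
        return style.squeeze(1), weights                          # style embedding + token weights
```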

Mixed Emotions

ESS often tries to generate speech with a certain base emotion (Ekman's Big Six). However, emotions do not occur in isolation from each other; rather, multiple emotions are experienced to varying degrees at once.

Many models focus on controlling the intensity of a single emotion, e.g. from neutral to very happy, using interpolation, scaling or Relative Attribute Ranking techniques. Mixing of emotions is then done by (see the sketch after this list):

  • weighted classifier guidance (EmoDiff): "make speech 70% happy and 30% surprised"
  • two-step conditioning (EmoMix): "denoise towards happy speech first, then towards surprise"
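A schematic sketch of the weighted-guidance idea (not the actual EmoDiff implementation; the classifier is assumed to return per-class log-probabilities for a noisy sample):

```python
import torch

def mixed_emotion_guidance(score_model, classifier, x_t, t, weights):
    """Schematic weighted classifier guidance for mixed emotions.
    `weights` maps emotion class indices to mixing weights, e.g. {3: 0.7, 5: 0.3}."""
    x_t = x_t.detach().requires_grad_(True)
    log_probs = classifier(x_t, t)                                # (batch, n_emotions)
    weighted = sum(w * log_probs[:, idx].sum() for idx, w in weights.items())
    grad = torch.autograd.grad(weighted, x_t)[0]                  # gradient of weighted log-probs
    return score_model(x_t, t) + grad                             # guided score for the sampler
```

With `weights = {happy_idx: 0.7, surprised_idx: 0.3}`, the sampler is pulled 70% towards happy and 30% towards surprised speech at every denoising step.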

One limitation of this approach is that the relative intensity of an emotion is computed against the neutral baseline; it does not consider the potential interplay of emotions. Adding 30% surprise might have a different perturbing effect on happy speech than it has on neutral speech.

Emotional Intensity

Another issue that receives less research attention is controlling the intensity of the emitted emotion. The key challenge is to learn emotions not as discrete one-hot encodings, but as continuous vectors.

Early approaches tried to manually annotate the difference between the expected and perceived emotion in generated samples. To reduce the manual effort of such a task, later models used saliency maps or posterior probabilities of Speech Emotion Recognition (SER) models to quantify emotional intensity. However, high saliency or high likelihood are just indicators of how closely the observed emotion fits the training data of the SER model.

Thus, another solution is to assume that all neutral samples have an intensity of zero and then train a weighting matrix \(W\) on neutral-neutral, emotional-neutral and emotional-emotional anchor pairs using max-margin optimization (essentially an SVM separating emotional from neutral samples). The learned weighting matrix can later be used to manually control the emotional intensity.
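In the spirit of relative-attribute ranking (notation mine, not taken verbatim from the cited works), with ordered pairs \(\mathcal{O}\) (the first sample should sound more emotional than the second, e.g. emotional-neutral) and similar pairs \(\mathcal{S}\) (e.g. neutral-neutral), the objective is roughly:

$$
\min_{W,\,\xi,\,\gamma}\ \tfrac{1}{2}\lVert W\rVert^2 + C\Big(\sum_{(i,j)\in\mathcal{O}}\xi_{ij} + \sum_{(i,j)\in\mathcal{S}}\gamma_{ij}\Big)
\quad \text{s.t.}\quad
W^\top(x_i - x_j) \ge 1 - \xi_{ij},\ \ \lvert W^\top(x_i - x_j)\rvert \le \gamma_{ij},\ \ \xi_{ij},\gamma_{ij} \ge 0
$$

The learned score \(W^\top x\) then acts as a continuous intensity value that can be scaled at synthesis time.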

Granularity

While most works focus on controlling the emotion and its intensity for a complete utterance, more recent works start working on frame-level control. They do so through attention-based SER control, capsule networks and frame-level losses.

Datasets & Evaluation

Datasets for Emotional Speech#Datasets

Datasets for Emotional Speech#Evaluation

Limitations

MasterThesis

Lack of Benchmarks

Compared to other fields in AI, ESS lacks automatic benchmarking techniques. Therefore, it also lacks the performance leaderboards that give guidance on which features or trends are promising. Right now, researchers have to fall back on expensive human evaluations.

Entanglement

The problem of Entanglement of Latent Features is a serious challenge in the field of ESS. Solving it remains the 'holy grail' of emotional speech synthesis. This holds especially when moving to larger datasets with more lexical, speaker and emotional variety.

Dominant Languages & Culture

Datasets are very limited in language and culture, with only a few dominant ones being represented. While ESS models can seemingly be fine-tuned to new languages quickly, one cannot speak of a holistic cultural representation, which raises ethical concerns.

Simulating vs. Having Emotions

Work on ESS tries to have AI actors simulate emotions in a "fake it till you make it" manner. But in humans, differences between simulated and actually felt emotions are noticeable. So in the long run, machines might only produce realistic emotions, especially in real interactions, when they actually experience those emotions as well.

Ethics

MasterThesis

While ESS shows great potential especially in the realm of giving people back a voice in a medical context, it also has great dangers attached to it.

Deep Fakes

For one, [[Deep Fakes]] are fabricated videos in which some person (often of public interest) is portrayed as saying often destructive things. ESS has the potential to make deep fakes even more realistic. Furthermore, style transfer allows for subtle changes in prosody, making the speaker sound slightly more sarcastic, derogatory, submissive or aggressive by changing just the non-verbal cues of speech, which makes detecting deep fakes even more difficult. ESS could therefore substantially contribute to the already growing threat of misinformation campaigns.

Voice Assistance

Furthermore, voice assistants have been reported to be designed to speak submissively. Combined with the fact that most voice assistants have female voices, this poses the danger of reinforcing outdated notions of gender and servility.

Influence on Humans

Another concern is the influence that an agent capable of simulating emotions could have on humans. Such agents could be optimized to be the most trustworthy and friendly salesperson in existence, raising concerns about manipulative AI. Already now, it is often difficult or practically impossible to differentiate an AI agent from an actual person, especially when communicating by voice only. ESS would only exacerbate that issue.

Data

There are three big issues with the datasets for ESS:

  1. Generalisability: There is a lack of representation of other cultures and languages in ESS datasets.
  2. Privacy: For realistic, non-acted emotional data, the privacy of people would quickly be breached.
  3. Correctness: The evaluation of ESS models is often done by human annotation, where the annotators are individuals from particular cultural backgrounds.

Future

  • Endowing AI agents with actual affect (emotion, personality) and speaker states like sincerity, cognitive load, …
  • Adapting an emotional style that suits speaker and listener.
    • By extension, reacting in an emotionally appropriate way to a human interlocutor.
  • With ESS entering the real world and real interactions, Speech Emotion Recognition (SER) could be used to build natural reward signals by interpreting the emotion of the human interlocutor (e.g. annoyed = bad, engaged = good)