Affective Speech Synthesis (ASS)

Overview based on @triantafyllopoulosOverviewAffectiveSpeech2023

Definition

Definition of affective speech:

a broader term, encompassing all kinds of manifestations of personality such as mood, interpersonal stances, or attitudes.

- @schullerComputationalParalinguisticsEmotion2013

Affective speech synthesis is one kind of voice transformation and subsumes Emotional Speech Synthesis (ESS). Beyond the emotions covered by ESS, it also includes notions such as mood, personality, and social status.

ASS is the inverse task of Speech Emotion Recognition (SER): SER infers affect from speech, whereas ASS generates speech that conveys it. Since [[Deep Learning]] entered Text-to-Speech Synthesis (TTS) around 2016, TTS has progressed rapidly, and these advances are carrying over into ASS as well.

The "generation" of affective speech encompasses three steps:

1. Selecting the appropriate emotion
2. Selecting the appropriate text
3. Synthesizing the waveform from both
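
The three steps can be read as a small pipeline. The following is a minimal sketch of that structure only, not a method from the cited overview: the names `AgentInput`, `select_emotion`, `select_text`, and `synthesize`, the keyword rule, the response table, and the sample rate are all assumptions made for illustration. In particular, `synthesize` emits a sine tone as a placeholder where a real TTS model would produce speech.

```python
import math
from dataclasses import dataclass
from typing import List

SAMPLE_RATE = 16_000  # Hz; assumed value for this sketch


@dataclass
class AgentInput:
    """Hypothetical container for what the agent observes."""
    environment: str
    interlocutor_utterance: str


def select_emotion(observation: AgentInput) -> str:
    """Step 1: choose an emotion label (toy keyword rule, not a real model)."""
    cues = ("sad", "lost", "tired")
    utterance = observation.interlocutor_utterance.lower()
    return "sad" if any(cue in utterance for cue in cues) else "neutral"


def select_text(observation: AgentInput, emotion: str) -> str:
    """Step 2: choose what to say (toy lookup table)."""
    responses = {
        "sad": "I am sorry to hear that.",
        "neutral": "Thanks for letting me know.",
    }
    return responses[emotion]


def synthesize(text: str, emotion: str) -> List[float]:
    """Step 3: placeholder for a TTS model conditioned on text and emotion.

    Returns a sine tone, NOT speech: the emotion only changes the pitch and
    the text only changes the duration (50 ms per character).
    """
    pitch_hz = {"sad": 110.0, "neutral": 150.0}[emotion]
    n_samples = len(text) * SAMPLE_RATE // 20
    return [
        math.sin(2.0 * math.pi * pitch_hz * t / SAMPLE_RATE)
        for t in range(n_samples)
    ]


def generate_affective_speech(observation: AgentInput) -> List[float]:
    """Run the three steps in order and return the waveform samples."""
    emotion = select_emotion(observation)
    text = select_text(observation, emotion)
    return synthesize(text, emotion)
```

Calling `generate_affective_speech(AgentInput("living room", "I feel sad today"))` runs the three steps in order and returns a list of samples. In a real system each stub would be replaced by a learned component, with the last one being the actual synthesizer.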

ASS is the final step of an Affective Agent Model, which receives inputs from the environment it is embedded in and from an interlocutor.

History

ASS has developed through several phases, each characterized by a predominant model for voice generation.

Text-to-Speech Synthesis (TTS)

The history of ASS is closely linked to progress in TTS approaches (see [[Text-to-Speech Synthesis (TTS)#History]]). Just as TTS has seen major breakthroughs through Deep Learning approaches (see [[Text-to-Speech Synthesis (TTS)#Deep Learning]]), ASS benefits as those advances carry over into this field.

Voice Conversion (VC)

Voice Conversion (VC)#^d7c656

Emotional Speech Synthesis (ESS)

Emotional Speech Synthesis (ESS)#^8180db