Datasets for Emotional Speech
This section is mainly sourced from @zhouEmotionalVoiceConversion2022.
Datasets
A dataset for training Emotional Speech Synthesis (ESS) models should satisfy several criteria to be as generalizable as possible:
High Lexical Variability
The dataset should contain a wide range of utterances for each emotion. Existing datasets often contain only a few sentences per emotion, spoken by a very limited number of speakers. Especially when building a general Emotional Voice Conversion (EVC) or ESS model that is independent of speaker and utterance, the lexical variability should be as high as possible.
High Language Variability
Databases are mostly limited to a few dominant languages, primarily English but also German, French, Danish, Italian, Japanese and Chinese. For universally usable emotional speech models, more multi-lingual databases are needed, especially ones containing underrepresented languages.
Speaker Variability
Speakers in the datasets have to be of different genders and cultural backgrounds to properly train a model without neglecting certain groups in society.
Controlling Confounders
Some datasets contain confounding factors such as accents from non-native speakers, dialects, and non-lexical vocalizations such as laughter and sighing. These can affect the conversion and synthesis performance of emotional speech.
Recording Environment
Recordings of emotional speech should be as clean as possible, without environmental effects such as background noise or poor recording equipment. This makes data gathered from movies and TV shows unsuitable (although [@triantafyllopoulosOverviewAffectiveSpeech2023] argue that, in the long term, ESS models should be able to deal with such factors).
Emotional Speech Database (ESD)
The Emotional Speech Database (ESD) emerged from the previously mentioned criteria. It consists of 350 parallel utterances (997 unique words), split into 300 for training, 20 for evaluation and 30 for testing. The dataset covers 5 different emotions and is multi-lingual (English and Chinese), with 10 speakers for each language.
The authors observe some covariance in utterance duration between the two languages, indicating that emotions affect duration similarly in English and Chinese.
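As an illustration of how such a split might be consumed in practice, the sketch below walks a hypothetical directory layout of the form `<root>/<speaker>/<emotion>/<split>/*.wav` and groups the files by split. The folder names, emotion labels, and the `collect_utterances` helper are assumptions made for illustration, not the official ESD release structure.

```python
from pathlib import Path

# Hypothetical layout: <root>/<speaker>/<emotion>/<split>/<utterance>.wav
# (the actual ESD release may organize files differently).
EMOTIONS = ["Neutral", "Happy", "Angry", "Sad", "Surprise"]
SPLITS = ["train", "evaluation", "test"]

def collect_utterances(root: str) -> dict[str, list[tuple[str, str, Path]]]:
    """Group wav files by split, keeping (speaker, emotion, path) tuples."""
    data: dict[str, list[tuple[str, str, Path]]] = {s: [] for s in SPLITS}
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue
        for emotion in EMOTIONS:
            for split in SPLITS:
                for wav in sorted((speaker_dir / emotion / split).glob("*.wav")):
                    data[split].append((speaker_dir.name, emotion, wav))
    return data

# Example usage: data = collect_utterances("ESD"); len(data["train"])
```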
Evaluation
For evaluating synthesized emotional speech, human annotators are the gold standard, rating how natural the speech sounds and how close it is to a target emotion. Automatic evaluation can be performed with Speech Emotion Recognition (SER) models or with distance metrics between the model output and a target recording.
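As one concrete example of such a distance metric, the following sketch computes a mel-cepstral-distortion-style score between a synthesized utterance and a reference recording using DTW-aligned MFCCs. The use of librosa, the sample rate, and the `mel_cepstral_distortion` helper are assumptions for illustration, not an evaluation protocol prescribed by the source.

```python
import numpy as np
import librosa

def mel_cepstral_distortion(ref_path: str, syn_path: str,
                            sr: int = 16000, n_mfcc: int = 13) -> float:
    """Rough objective distance between a synthesized utterance and a
    reference recording, based on DTW-aligned MFCCs (an MCD-style score)."""
    ref, _ = librosa.load(ref_path, sr=sr)
    syn, _ = librosa.load(syn_path, sr=sr)

    # Drop the 0th coefficient (overall energy), as is common for MCD.
    ref_mfcc = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)[1:]
    syn_mfcc = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)[1:]

    # Align the two sequences with DTW, since durations usually differ.
    _, wp = librosa.sequence.dtw(X=ref_mfcc, Y=syn_mfcc, metric="euclidean")

    diffs = ref_mfcc[:, wp[:, 0]] - syn_mfcc[:, wp[:, 1]]
    # Standard MCD scaling constant (10 / ln 10) * sqrt(2), averaged over frames.
    return float((10.0 / np.log(10)) * np.sqrt(2.0) *
                 np.mean(np.sqrt(np.sum(diffs ** 2, axis=0))))
```

A lower score indicates that the synthesized spectrum is closer to the reference; in practice this kind of metric is usually reported alongside SER accuracy and human ratings rather than on its own.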
Ethical Concerns
Generally, ESS datasets have several ethical concerns: