Spectral Envelope

The spectral envelope is the transfer function of the vocal tract when speech is modeled as a Source-Filter Model (Text-to-Speech Synthesis (TTS)#Formant Synthesis). Each Formant shapes this transfer function with a resonance peak, and the envelope is sampled by the sounds (harmonics) that the source produces in the vocal tract. The benefit of using the spectral representation of speech is that it is (a) much more consistent across different speakers and (b) much easier to perform certain modifications on (e.g. noise reduction, source separation, …).
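To make the source-filter view concrete, here is a minimal Python sketch, not a definitive implementation. It assumes one common way of modeling a formant: a two-pole digital resonator, with several resonators cascaded (multiplied) to form the vocal-tract "filter". The source is idealized as flat-amplitude harmonics at integer multiples of F0. The formant values, the 16 kHz sampling rate and all function names (`resonator_response`, `envelope`, `harmonic_spectrum`) are illustrative assumptions.

```python
import cmath
import math

# Illustrative (assumed) formants as (centre Hz, bandwidth Hz), roughly /a/-like.
FORMANTS = [(700.0, 80.0), (1220.0, 90.0), (2600.0, 120.0)]
FS = 16000.0  # assumed sampling rate in Hz


def resonator_response(f, centre, bandwidth, fs=FS):
    """|H(f)| of a two-pole resonator H(z) = 1 / (1 - 2r cos(theta) z^-1 + r^2 z^-2)."""
    r = math.exp(-math.pi * bandwidth / fs)        # pole radius set by the bandwidth
    theta = 2 * math.pi * centre / fs              # pole angle set by the centre frequency
    z_inv = cmath.exp(-2j * math.pi * f / fs)      # z^-1 evaluated on the unit circle
    denom = 1 - 2 * r * math.cos(theta) * z_inv + r * r * z_inv * z_inv
    return 1.0 / abs(denom)


def envelope(f, formants=FORMANTS, fs=FS):
    """Vocal-tract 'filter': product of the formant resonators' magnitudes at f."""
    mag = 1.0
    for centre, bw in formants:
        mag *= resonator_response(f, centre, bw, fs)
    return mag


def harmonic_spectrum(f0, formants=FORMANTS, fs=FS, fmax=4000.0):
    """Idealized flat source harmonics at k * f0 passed through the envelope.

    Returns (frequency, magnitude) pairs: each harmonic samples the envelope once.
    """
    spectrum = []
    k = 1
    while k * f0 <= fmax:
        spectrum.append((k * f0, envelope(k * f0, formants, fs)))
        k += 1
    return spectrum
```

With F0 = 100 Hz, `harmonic_spectrum(100.0)` returns 40 harmonics up to 4 kHz, and the strongest one is the harmonic at 700 Hz, i.e. the one sitting on the first formant peak.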

A higher voice has a higher Fundamental Frequency (F0) and therefore more widely spaced harmonics (they lie at integer multiples of F0), so it samples the spectral envelope of a formant less densely. This explains why very high voices are often harder to understand than lower ones: they actually contain less of the information across frequencies that is usually carried by the spectral envelope of common formants (and, by extension, phonemes). Great explanation here: https://youtu.be/N0CVIoVQkmc?si=8LjIs-vPW707-UwB&t=2026.
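The sampling argument can be checked by counting how many harmonics (k · F0) land inside a formant region. This is a minimal Python sketch; the helper name `harmonics_in_band` and the 500–900 Hz band standing in for a first formant are hypothetical, illustrative choices.

```python
import math


def harmonics_in_band(f0, low, high):
    """Return the harmonic frequencies k * f0 (k >= 1) that fall inside [low, high] Hz.

    Each returned frequency is one point at which the source samples the
    spectral envelope inside that band.
    """
    first = max(1, math.ceil(low / f0))
    last = math.floor(high / f0)
    return [k * f0 for k in range(first, last + 1)]


if __name__ == "__main__":
    # Hypothetical band around a first formant (assumed, illustrative range).
    band = (500, 900)
    for f0 in (100, 200, 300, 1000):
        hits = harmonics_in_band(f0, *band)
        print(f"F0 = {f0:4d} Hz -> {len(hits)} harmonic(s) in {band}: {hits}")
```

For this band, a 100 Hz voice provides five samples (500, 600, 700, 800 and 900 Hz), a 300 Hz voice only two (600 and 900 Hz), and a 1000 Hz voice none at all, so the shape of that formant is simply not observed.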

To analyze the spectral envelope of an utterance, a suitable length and overlap must be chosen for the Analysis Window used in the Fourier Transformation#Short Time Fourier Transform (STFT).
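The sketch below shows how window length (in ms) and overlap (as a fraction) translate into samples, hop size and frequency resolution (fs / window length). It is a self-contained Python illustration under stated assumptions: a periodic Hann window and a naive O(N²) DFT instead of an FFT. The function names `frame_params`, `hann` and `stft_magnitude` are hypothetical; in practice a library routine such as `scipy.signal.stft` (parameters `nperseg`, `noverlap`) would be used. As a common rule of thumb, a window spanning several pitch periods (e.g. 20–30 ms) resolves individual harmonics (narrowband), while a window shorter than one pitch period smears them together so the envelope and formants stand out (wideband).

```python
import cmath
import math


def frame_params(fs, win_ms, overlap):
    """Convert a window length in ms and an overlap fraction (0 <= overlap < 1) to samples."""
    win_len = int(round(fs * win_ms / 1000.0))
    hop = max(1, int(round(win_len * (1.0 - overlap))))
    return win_len, hop


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(x, fs, win_ms=25.0, overlap=0.5):
    """Magnitude STFT via a naive DFT (slow; for illustration only).

    Returns (frames, bin_hz): frames[t][k] is |X_t(k)| for k = 0..win_len // 2,
    and bin_hz = fs / win_len is the spacing between frequency bins.
    """
    win_len, hop = frame_params(fs, win_ms, overlap)
    w = hann(win_len)
    frames = []
    # Only full windows are analyzed: 1 + (len(x) - win_len) // hop frames.
    for start in range(0, len(x) - win_len + 1, hop):
        seg = [x[start + n] * w[n] for n in range(win_len)]
        spectrum = []
        for k in range(win_len // 2 + 1):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                      for n in range(win_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames, fs / win_len
```

For example, at fs = 8000 Hz a 25 ms window with 50 % overlap gives 200-sample frames, a hop of 100 samples and 40 Hz bins; a 1000-sample 400 Hz sine then yields 9 frames whose peak is in bin 10.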

[Figure: Pasted image 20240513202034.png]