Mel-Frequency Cepstral Coefficients (MFCC)

The mel-frequency cepstrum (MFC) is an alternative to the ordinary cepstrum that is based on mel frequencies. Unlike in the ordinary cepstrum, its frequency bands are equally spaced on the mel scale. This makes the MFC a closer match to human hearing than the ordinary cepstrum, so it is well suited to audio compression and speech processing tasks.

The MFC is calculated as follows:

1. Take the Fourier transform of the time-domain signal.
2. Map the spectrum onto the mel scale (using overlapping triangular or cosine-shaped analysis windows).
3. Take the logarithm of the mel spectrum.
4. Take the Discrete Cosine Transform of the log-scaled mel spectrum. The resulting values are the mel-frequency cepstral coefficients (MFCCs), which lie in the cepstral (quefrency) domain.

Mel Scale

The mel scale is based on the observation that the pitch humans perceive does not increase linearly with frequency. It was derived empirically from listeners who judged whether different pitches were equally distant from each other. It is commonly defined as:

\[ m(f)=2595\cdot\log_{10}(1+\frac{f}{700}) \]

Hz	40	161	200	404	693	867	1000	2022	3000	3393	4109	5526	6500	7743	12000
mel	43	257	300	514	771	928	1000	1542	2000	2142	2314	2600	2771	2914	3228

Table: mel scale data from Stevens and Volkmann (1940), as given by Umesh et al. (1999).
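The formula above can be sketched in a few lines of Python. The function names `hz_to_mel` and `mel_to_hz` are illustrative choices, not taken from any particular library, and the inverse is obtained simply by solving the formula for \(f\).

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mel: m(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping, solved from the formula: f(m) = 700 * (10**(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# The formula is anchored so that 1000 Hz maps to roughly 1000 mel.
print(hz_to_mel([100.0, 1000.0, 4000.0]))
```

Because the formula is a smooth fit, its values need not coincide exactly with the empirical listener data in the table.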

Computation

The MFCCs are typically computed in 5 steps:

1. STFT\((\cdot)\): Produce a windowed spectrum for each frame with the Short-Time Fourier Transform (STFT).
2. \(|\cdot|^{2}\): Compute the periodogram, an estimate of the Power Spectral Density (PSD), by squaring the magnitude of each spectrum.
3. Mel-filters: Multiply the spectrum of each frame with the Mel-filter bank (see the Mel-filter Bank section below).
4. log\((\cdot)\): Apply the logarithm to each frame.
5. IDCT\((\cdot)\): Apply the inverse Discrete Cosine Transform to each frame.

The inverse Discrete Cosine Transform is used instead of a regular IDFT because the log mel spectrum is real-valued and the resulting coefficients have to be real-valued as well, so an IDCT is more efficient.
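The five steps can be sketched with NumPy alone. This is a minimal illustration under several assumptions that are not in the text above: the function name `mfcc`, the frame length, hop size, Hann window and number of coefficients are arbitrary choices, the filter bank is passed in as a matrix `mel_fb` of shape `(n_mels, frame_len // 2 + 1)`, and the last step uses an unnormalised type-II DCT, which is a common choice in MFCC implementations but only one possible reading of the transform named in step 5.

```python
import numpy as np

def mfcc(signal, mel_fb, frame_len=512, hop=256, n_coeffs=13):
    """Sketch of the five steps; assumes len(signal) >= frame_len."""
    signal = np.asarray(signal, dtype=float)

    # 1. STFT: cut overlapping frames, apply a Hann window, take the real FFT.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    spectrum = np.fft.rfft(frames, axis=1)      # (n_frames, frame_len // 2 + 1)

    # 2. Periodogram: squared magnitude of each frame's spectrum.
    power = np.abs(spectrum) ** 2 / frame_len

    # 3. Mel filters: weight the power spectrum with each filter and sum.
    mel_energy = power @ mel_fb.T               # (n_frames, n_mels)

    # 4. Logarithm (small offset avoids log(0)).
    log_mel = np.log(mel_energy + 1e-10)

    # 5. Cosine transform along the mel axis (unnormalised DCT-II basis).
    n_mels = mel_fb.shape[0]
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_mels)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return log_mel @ dct_basis.T                # (n_frames, n_coeffs)

# Usage with a 1 s, 440 Hz test tone and a placeholder (random, non-negative)
# filter matrix; a real mel-filter bank would be used in practice.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
toy_fb = np.random.default_rng(0).random((26, 257))
coeffs = mfcc(tone, toy_fb)
print(coeffs.shape)  # (61, 13)
```

Keeping the filter bank as an argument separates the transform chain from the choice of filters, so the same function works with any non-negative filter matrix of matching width.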

Mel-filter Bank

[Figure: Mel-filter bank]

As can be seen in the figure, the Mel-filter bank applies overlapping triangular windows to different frequency ranges. This maps the periodogram onto the mel scale. Over roughly the first 1000 Hz the filters have a nearly constant width. Above that, because human hearing is less discriminating but more sensitive at higher frequencies, increasingly broad filters that taper in amplitude are applied.
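A filter bank of this shape can be sketched as follows. The function name `mel_filter_bank`, the defaults (26 filters, 512-point FFT, 16 kHz sample rate) and the area normalisation that makes wider filters lower in amplitude are all assumptions made for illustration; other implementations place and scale the triangles differently.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels=26, n_fft=512, sample_rate=16000):
    """Return an (n_mels, n_fft // 2 + 1) matrix of triangular filters."""
    # n_mels + 2 edge points, equally spaced on the mel scale from 0 Hz to Nyquist.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)

    # Centre frequency of every FFT bin.
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)

    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right.
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
        # Area normalisation (assumed): broader filters get a lower peak.
        fb[i] *= 2.0 / (right - left)
    return fb

fb = mel_filter_bank()
print(fb.shape)  # (26, 257)
```

Each triangle starts at the centre of its left neighbour and ends at the centre of its right neighbour, which produces the overlap described above.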