Mel-Frequency Cepstral Coefficients (MFCC)
The mel-frequency cepstrum (MFC) is an alternative to the normal cepstrum based on mel frequencies. In contrast to the normal cepstrum, its frequency bands are equally spaced on the mel scale. The MFC therefore represents sound more closely to how human hearing perceives it than the cepstrum does, making it well suited for audio compression and speech processing tasks.
The MFC is calculated as follows:
- Take the Fourier Transformation of the time domain signal.
- Map the spectrum onto the mel scale (using triangular or cosine analysis windows).
- Take the logarithm of the mel spectrum.
- Take the Discrete Cosine Transformation of the log-scaled mel spectrum. This process results in mel-frequency cepstral coefficients (MFCCs) in the time-like cepstral (quefrency) domain.
Mel Scale
The mel scale is based on the observation that the pitch perceived by humans does not increase linearly with frequency. It was derived empirically from listeners who judged whether different pitches were equally distant from each other. It is defined by the following data:
| Hz | 40 | 161 | 200 | 404 | 693 | 867 | 1000 | 2022 | 3000 | 3393 | 4109 | 5526 | 6500 | 7743 | 12000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mel | 43 | 257 | 300 | 514 | 771 | 928 | 1000 | 1542 | 2000 | 2142 | 2314 | 2600 | 2771 | 2914 | 3228 |
Umesh et al. 1999 mel scale data from Stevens and Volkmann 1940
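The table can be compared against a closed-form approximation. The sketch below assumes the widely used formula \(m = 2595 \log_{10}(1 + f/700)\), which is not part of the data above; the function names `hz_to_mel` and `mel_to_hz` are illustrative.

```python
import math

def hz_to_mel(f_hz):
    """Assumed analytic approximation of the mel scale (not taken from the table)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# A few (Hz, mel) pairs copied from the table above.
table = [(200, 300), (1000, 1000), (3000, 2000), (6500, 2771)]

for f_hz, m_table in table:
    m_formula = hz_to_mel(f_hz)
    print(f"{f_hz:>5} Hz   table: {m_table:>4} mel   formula: {m_formula:7.1f} mel")
```

The formula is anchored so that 1000 Hz maps to roughly 1000 mel; away from that point it only approximates the empirical values.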
Computation
The MFCCs are typically computed in 5 steps:
- STFT\((\cdot)\) Produce a windowed spectrum through the Short Time Fourier Transform (STFT).
- \(|\cdot|^{2}\) Calculate the periodogram (an estimate of the Power Spectral Density, PSD) by squaring the magnitude of the spectra.
- Mel-filters Multiply the spectrum of each frame with the Mel-filter bank (see below).
- log\((\cdot)\) Apply the logarithm to each frame.
- IDCT\((\cdot)\) Apply the inverse Discrete Cosine Transformation.
The log-scaled mel spectrum is real-valued and the resulting coefficients have to be real-valued as well, so an inverse Discrete Cosine Transformation is used instead of a normal IDFT because it is more efficient.
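The five steps can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: the frame length, hop size, Hann window, filter count and the analytic mel formula are all assumptions, and the last step uses an unnormalized DCT-II, a common practical choice for the cosine-transform step.

```python
import numpy as np

def hz_to_mel(f):
    # Assumed analytic mel approximation
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sr):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mfcc(signal, sr, n_fft=512, hop=256, n_filters=26, n_coeffs=13):
    # 1. STFT: cut into overlapping Hann-windowed frames and transform each one
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    # 2. |.|^2: periodogram of each frame (up to a constant scale factor)
    power = np.abs(spectrum) ** 2
    # 3. Mel-filters: weight and sum the power in each mel band
    mel_energy = power @ mel_filter_bank(n_filters, n_fft, sr).T
    # 4. log: small offset avoids log(0) for empty bands
    log_mel = np.log(mel_energy + 1e-10)
    # 5. Cosine transform (unnormalized DCT-II basis, assumed variant)
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_mel @ dct_basis.T

# Usage: one second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
coeffs = mfcc(np.sin(2 * np.pi * 440.0 * t), sr)
print(coeffs.shape)  # (number of frames, number of coefficients)
```

Because the log-mel values are real, the cosine basis yields real coefficients directly, with no complex arithmetic in the final step.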
Mel-filter Bank

As can be seen in the figure, the Mel-filter bank applies overlapping triangular windows to different frequency ranges. This maps the periodogram onto the Mel Scale, where the filters covering the first 1000 Hz have relatively constant width. Above that, as human hearing is less differentiating but more sensitive at higher frequencies, increasingly broader filters are applied that taper in amplitude.
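The widening and tapering of the filters can be reproduced numerically. The sketch below is illustrative only: it assumes the analytic mel formula \(m = 2595 \log_{10}(1 + f/700)\) and scales every triangle by \(2 / (f_{hi} - f_{lo})\) so that all filters have roughly equal area, which is one way to obtain the tapering amplitudes; the function name `normalized_mel_filters` and its default parameters are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Assumed analytic mel approximation
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def normalized_mel_filters(n_filters=10, n_fft=1024, sr=16000):
    """Overlapping triangles, equally spaced in mel, scaled to equal area."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    edges_hz = mel_to_hz(mel_edges)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        triangle = np.minimum((freqs - lo) / (mid - lo),
                              (hi - freqs) / (hi - mid))
        # Area normalization: wider filters get proportionally lower peaks
        bank[i] = np.maximum(triangle, 0.0) * 2.0 / (hi - lo)
    return edges_hz, bank

edges_hz, bank = normalized_mel_filters()
widths = edges_hz[2:] - edges_hz[:-2]   # bandwidth of each filter in Hz
peaks = bank.max(axis=1)                # largest weight of each filter

for w, p in zip(widths, peaks):
    print(f"bandwidth {w:8.1f} Hz   peak weight {p:.5f}")
```

Each filter starts at the centre of its left neighbour, so adjacent triangles overlap by half; the printed bandwidths grow with frequency while the peak weights shrink.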