Mel-Frequency Cepstral Coefficients (MFCC)

The mel-frequency cepstrum (MFC) is an alternative to the ordinary cepstrum that is based on mel frequencies. Unlike the ordinary cepstrum, whose frequency bands are linearly spaced, the MFC uses frequency bands that are equally spaced on the mel scale. This approximates human hearing more closely than the ordinary cepstrum does, which makes the MFC well suited to audio compression and speech processing tasks.

The MFC is calculated as follows:

  1. Take the Fourier transform of the time-domain signal.
  2. Map the spectrum onto the mel scale (using triangular or cosine analysis windows).
  3. Take the logarithm of the mel spectrum.
  4. Take the discrete cosine transform of the log-scaled mel spectrum. The resulting amplitudes are the mel-frequency cepstral coefficients (MFCCs), which lie in the cepstral (quefrency) domain.
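
As a sketch of step 4, assume \(M\) mel bands with band energies \(S_m\) and the common unnormalized type-II DCT convention. The symbols \(M\), \(S_m\), \(c_n\) and \(N\) are introduced here for illustration only, and other normalizations exist. Under these assumptions the coefficients can be written as:

\[ c_n=\sum_{m=0}^{M-1}\log(S_m)\cdot\cos\left(\frac{\pi n}{M}\left(m+\frac{1}{2}\right)\right),\qquad n=0,\dots,N-1 \]

Here \(N\le M\) is the number of coefficients kept. Often only the first few are retained, since they describe the coarse shape of the log mel spectrum.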

Mel Scale

The mel scale is based on the observation that the perceived pitch of a sound does not increase linearly with its frequency. It was derived empirically from listeners who judged whether different pitches were equally distant from each other. A commonly used definition is:

\[ m(f)=2595\cdot\log_{10}\left(1+\frac{f}{700}\right) \]

Hz    40  161  200  404  693  867  1000  2022  3000  3393  4109  5526  6500  7743  12000
mel   43  257  300  514  771  928  1000  1542  2000  2142  2314  2600  2771  2914   3228

Table: mel scale data from Stevens and Volkmann (1940), as given by Umesh et al. (1999).
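
The formula above and its inverse can be sketched in a few lines of Python. The function names `hz_to_mel` and `mel_to_hz` are chosen here for illustration. The tabulated values are empirical measurements, so the formula only approximates them: it gives about 1000 mel at 1000 Hz, but it deviates from the table at other frequencies.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mel using m(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Invert the formula above: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


if __name__ == "__main__":
    # Print the formula's value for a few of the tabulated frequencies.
    for f in (40, 1000, 3000, 12000):
        print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

The inverse is mainly useful for placing filter edges: points are spaced evenly in mel and then converted back to Hz.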

Computation

The MFCCs are typically computed in 5 steps:

  1. STFT\((\cdot)\): Produce a windowed spectrum with the short-time Fourier transform (STFT).
  2. \(|\cdot|^{2}\): Calculate the periodogram, an estimate of the power spectral density (PSD), by squaring the magnitude of each spectrum.
  3. Mel-filters: Multiply the spectrum of each frame with the mel-filter bank (see below).
  4. log\((\cdot)\): Apply the logarithm to each frame.
  5. IDCT\((\cdot)\): Apply the inverse discrete cosine transform (IDCT).

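A cosine transform is used here instead of a general inverse DFT because the log mel spectrum is real-valued and the resulting coefficients should be real-valued as well. A cosine transform delivers this more efficiently than a complex inverse DFT. In practice, many implementations apply a type-II DCT at this step.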
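
The five steps can be sketched in plain Python as follows. This is a minimal illustration under several assumptions, not a reference implementation: it uses a naive O(N^2) DFT with a Hann window, an unnormalized type-II DCT for the cosine-transform step, and a small floor before the logarithm. The function names and the `filterbank` argument (one list of weights per mel band, one weight per frequency bin) are hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """Steps 1-2 for one frame: Hann window, naive DFT, squared magnitude (bins 0..N/2)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, x in enumerate(frame)]
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed))
        power.append(abs(acc) ** 2)
    return power


def dct2(values):
    """Unnormalized type-II DCT (assumed convention for the cosine-transform step)."""
    m = len(values)
    return [
        sum(v * math.cos(math.pi * c * (j + 0.5) / m) for j, v in enumerate(values))
        for c in range(m)
    ]


def mfcc(signal, frame_len, hop, filterbank, n_coeffs, floor=1e-10):
    """Return one list of n_coeffs coefficients per frame of the signal."""
    coeffs = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        power = power_spectrum(signal[start:start + frame_len])  # steps 1-2
        # Step 3: weight the periodogram with each mel filter and sum.
        mel_energies = [sum(w * p for w, p in zip(row, power)) for row in filterbank]
        # Step 4: logarithm, with a floor to avoid log(0).
        log_mel = [math.log(max(e, floor)) for e in mel_energies]
        # Step 5: cosine transform, keeping the first n_coeffs values.
        coeffs.append(dct2(log_mel)[:n_coeffs])
    return coeffs
```

For real signals, an FFT-based library routine would replace the naive DFT. The structure of the loop stays the same.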

Mel-filter Bank

[Figure: mel-filter bank]

As can be seen in the figure, the mel-filter bank applies overlapping triangular windows to different frequency ranges. This maps the periodogram onto the mel scale. Below roughly 1000 Hz the filters have nearly constant width, because the mel scale is close to linear there. Above that, human hearing distinguishes neighbouring frequencies less finely, so increasingly broad filters are applied, and their peak amplitude tapers off.
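
One way to build such a filter bank is sketched below. It assumes filter edges spaced evenly in mel (using the formula from the Mel Scale section) and area normalization, so that broader filters have lower peaks. The function name `mel_filterbank` and its parameters are hypothetical, and other normalizations are also in use.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate, f_min=0.0, f_max=None):
    """Return n_filters rows of triangular weights over the n_fft // 2 + 1 frequency bins."""
    if f_max is None:
        f_max = sample_rate / 2.0
    mel_lo, mel_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges_hz = [
        mel_to_hz(mel_lo + (mel_hi - mel_lo) * i / (n_filters + 1))
        for i in range(n_filters + 2)
    ]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for i in range(n_filters):
        left, center, right = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        row = []
        for f in bin_hz:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            # Triangle: zero outside [left, right], peak of 1 at the center.
            row.append(max(0.0, min(rising, falling)))
        # Area normalization (assumed): scale so each triangle has unit area in Hz.
        scale = 2.0 / (right - left)
        bank.append([w * scale for w in row])
    return bank
```

Each filter starts at the center of its predecessor, which produces the overlap, and the widening spacing of the edges in Hz produces the broader, flatter filters at high frequencies.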