Speech Recognition
Automatic Speech Recognition (ASR) is an important topic in several domains: human-computer interfaces (e.g. in smartphones), call centers, car communication, big data, and more. It typically involves two major steps: first, features are extracted from the voice signal (#Feature Extraction); second, those features are matched against words from training datasets (#Classification). ASR should work regardless of speaker attributes or prosody.
Major challenges include…
- ambiguous sentences that are phonetically the same but semantically different.
- lack of separation between words in fluent speech.
- Extrinsic variabilities: external and environmental effects like reverberation or additive noise.
- Intrinsic variabilities
    - inter-speaker variabilities like different pitch, vocal tract shapes or dialects.
    - intra-speaker variabilities like prosody (emotion, timing, irony).
A simple example would be using Source-Filter Model#Linear Prediction and/or Quantization#Vector Quantization to embed voice signal frames into a vector space and then match the vectors with training data. However, this is too simplistic to work with anything more complex than, for example, the numbers 1 to 9.
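A hedged sketch of such a simple recognizer, using one vector-quantization codebook per digit and the average quantization distortion as the match score (the feature matrices, codebook size and digit labels are assumptions for illustration):

```python
from scipy.cluster.vq import kmeans, vq

def train_codebooks(frames_per_digit, codebook_size=16):
    """Train one VQ codebook per digit.

    frames_per_digit: dict mapping digit label -> array of shape (num_frames, feature_dim)
                      holding the feature vectors of all training utterances of that digit.
    """
    codebooks = {}
    for digit, frames in frames_per_digit.items():
        # k-means clustering of the training frames yields the code vectors
        codebook, _ = kmeans(frames.astype(float), codebook_size)
        codebooks[digit] = codebook
    return codebooks

def classify(frames, codebooks):
    """Return the digit whose codebook quantizes the utterance with the lowest average distortion."""
    distortions = {}
    for digit, codebook in codebooks.items():
        # distance of each frame to its nearest code vector
        _, frame_distances = vq(frames.astype(float), codebook)
        distortions[digit] = frame_distances.mean()
    return min(distortions, key=distortions.get)
```

Since every frame is matched independently, all temporal structure is ignored, which is exactly why this approach breaks down beyond such small vocabularies.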
## Feature Extraction
Phonemes can be distinguished well by the first and second Formant. Using Source-Filter Model#Linear Prediction therefore sounds reasonable, but it is not robust enough with regard to noise.
### Mel-frequency Cepstrum Coefficients
A better solution is to use Mel-frequency Cepstrum Coefficients (MFCCs):
For speech recognition, the STFT typically yields 257 coefficients per FFT (i.e. a 512-point FFT), and the periodogram bins are averaged by a mel filter bank so that 24 coefficients remain in the mel-cepstrum. The resulting mel-cepstrum is then often reduced to \(\approx13\) coefficients using liftering (Cepstrum#Cepstral Smoothing) to further reduce the sensitivity to noise and to speaker attributes like pitch. The MFCCs are also level independent (except for the zeroth coefficient, which is the mean of the log mel-spectrum and thus carries the level).
The resulting MFCCs now capture the formants really well, while discarding many of the intrinsic and extrinsic variables (pitch, speaker, level, noise).
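As a hedged sketch of this pipeline (the file name, sample rate, frame shift and lifter setting are assumptions for illustration; librosa is only one possible tool):

```python
import librosa

# Load a speech file at 16 kHz mono; the path is a placeholder.
signal, sr = librosa.load("utterance.wav", sr=16000)

# 512-point FFT -> 257 spectral coefficients per frame,
# 24 mel filter bands, keep 13 cepstral coefficients after liftering.
mfccs = librosa.feature.mfcc(
    y=signal,
    sr=sr,
    n_fft=512,
    hop_length=160,   # 10 ms frame shift at 16 kHz
    n_mels=24,
    n_mfcc=13,
    lifter=22,        # sinusoidal liftering (cepstral smoothing)
)
# mfccs has shape (13, num_frames)
```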
### Speech Enhancement
We can reduce the noise of the voice signal using Speech Enhancement to simplify speech recognition tasks.
### Multi-condition Training
Include different noise/reverberation levels or dialects in the training data to train a statistical model that learns to handle those variabilities.
### Cepstral Mean and Variance Normalization
![[Pasted image 20240625120452.png]]
Any channel introduces distortions into the voice signal that are equivalent to applying a channel-specific filter to the time-domain signal. Since convolution in the time domain becomes multiplication in the spectrum and addition in the log-spectrum, such a filter shows up in the Cepstrum simply as a constant offset added to each coefficient (the same holds for the mel-cepstrum). If we know the channel-specific offset, we can subtract it to further reduce the extrinsic variability.
The left-most figure shows the time evolution of a single cepstral coefficient for clean speech, noisy speech and speech passed through a linear channel.
**Mean subtraction**: After subtracting the global empirical mean over all frames from the coefficients, the channel effects are almost completely gone. The dynamic range of the noisy speech is also much improved.
**Variance normalization**: After additionally normalizing the coefficients to unit variance (dividing by the global empirical standard deviation), the influence of the noise is reduced even further. The cepstral coefficient now looks almost the same for clean speech, noisy speech and speech passed through a linear channel filter.
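A minimal numpy sketch of this normalization, assuming an MFCC matrix of shape (num_coefficients, num_frames) such as the one produced above:

```python
import numpy as np

def cmvn(mfccs, eps=1e-8):
    """Cepstral mean and variance normalization per coefficient.

    mfccs: array of shape (num_coefficients, num_frames)
    Returns coefficients with zero mean and unit variance across all frames of the utterance.
    """
    mean = mfccs.mean(axis=1, keepdims=True)   # global empirical mean per coefficient
    std = mfccs.std(axis=1, keepdims=True)     # global empirical standard deviation per coefficient
    return (mfccs - mean) / (std + eps)        # mean subtraction + variance normalization
```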
## Classification
To attach meaning to the extracted features, we need to match the vector representation against some data set. We could do this frame by frame, using the best local match between the acoustic features of each frame and some related phoneme representation. But due to the intrinsic and extrinsic variabilities in speech, the relationship between acoustic features and phonemes is often not linear (due to prosody, i.e. the length of utterances and the pauses in between). Therefore, we are mostly interested in a global match, where the overall distance of all acoustic features to the phoneme representation is minimized while the prosodic variability is taken into account. Two common methods for doing so are dynamic time warping and hidden Markov models.
### Dynamic Time Warping (DTW)
Dynamic time warping is a distance measure. The idea is the following: like our source utterance, each reference utterance is segmented, but due to prosody the number of frames in source and reference utterance can differ. For example, a phoneme in the source utterance could span several frames but only one frame in the reference. We therefore compute not only the cumulative distance for a linear progression through both source and reference, but also the cumulative costs for the cases where the source utterance progresses ahead of the reference utterance (or the other way around).
We then choose the reference utterance with the lowest DTW distance as our classification match. This is the utterance that has the lowest distance to the source when we allow different time spans to be spent on each phoneme.
#### Computation
We can represent the data as
- source utterance: \(x_n(l)\), the \(n\)-th MFCC of the \(l\)-th segment (\(l\in[0,N]\))
- reference utterance: \(y_n(\lambda)\), the \(n\)-th MFCC of the \(\lambda\)-th segment (\(\lambda\in[0,M]\)).
We furthermore accept that segments of the source and reference utterances can differ in length. That means we have three options for how the indices progress:
- source and reference segments progress linearly in time (\(l+1, \lambda+1\))
- source segment stays the same while reference progresses in time (\(l, \lambda+1\))
- reference segment stays the same while source progresses in time (\(l+1, \lambda\))

We then compute a distance matrix between the MFCCs of the source utterance and those of each reference utterance, where each entry \(d(l,\lambda)\) is the local distance between source segment \(l\) and reference segment \(\lambda\), e.g. the Euclidean distance \(d(l,\lambda)=\sqrt{\sum_n \left(x_n(l)-y_n(\lambda)\right)^2}\).
The matrix for two different reference utterances is visualized in the figure below; low costs between two segments are dark pixels, high costs are white pixels:
We now try to find the path through the matrix from indices \(l=0, \lambda=0\) to \(l=N,\lambda=M\). In the figure, this means finding the lowest cost path from bottom-left to top-right. A diagonal step represents option 1, the source and reference progress linearly. A step up represents option 2, the reference utterance progresses ahead of the source (because this part of the reference utterance is longer than in the source utterance). A step to the right represents option 3, the source utterance moves ahead of the reference utterance (this part of the source utterance takes longer than in the reference).
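The local distance matrix that we search through can be sketched in numpy as follows (assuming the Euclidean distance between MFCC vectors as the local cost):

```python
import numpy as np

def local_distance_matrix(x, y):
    """Pairwise Euclidean distances between MFCC frames.

    x: source utterance, shape (N, num_mfcc)
    y: reference utterance, shape (M, num_mfcc)
    Returns d with d[l, lam] = distance between source segment l and reference segment lam.
    """
    diff = x[:, None, :] - y[None, :, :]        # shape (N, M, num_mfcc)
    return np.sqrt(np.sum(diff ** 2, axis=-1))  # shape (N, M)
```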
#### Dynamic Programming
To avoid brute-forcing all possible routes, we can use dynamic programming to find the lowest-cost path. We do so by building a matrix with the minimum cumulative cost of reaching each cell:
- We start at the bottom left and calculate the cost \(c(0,0)\).
- We then progress to the neighboring cells and calculate the candidate costs by adding the local distance of the current cell to the cumulative costs of the possible previous cells.
- We assign the minimum of those costs to the current cell: \(c(l,\lambda) = d(l,\lambda) + \min\{c(l-1,\lambda-1),\, c(l,\lambda-1),\, c(l-1,\lambda)\}\).
- Repeat until the maximum index \(c(N,M)\) is reached.
As seen above, there are at most three possible previous cells and thus at most three costs to calculate for each cell. This process gives us a cost matrix \(C\). From it, we can get…
- the global costs between the source and reference utterance at \(c(N,M)\).
- the shortest path (and optimal warping) by backtracking, choosing the lowest cost neighbor each time.
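A hedged sketch of this dynamic-programming step and the backtracking, built on top of a local distance matrix such as the one above (boundary handling and step weights are kept as simple as possible):

```python
import numpy as np

def dtw(d):
    """Cumulative cost matrix, global cost and warping path for a local distance matrix d."""
    N, M = d.shape
    c = np.full((N, M), np.inf)
    c[0, 0] = d[0, 0]
    for l in range(N):
        for lam in range(M):
            if l == 0 and lam == 0:
                continue  # start cell already initialized
            best_prev = min(
                c[l - 1, lam - 1] if l > 0 and lam > 0 else np.inf,  # both progress
                c[l, lam - 1] if lam > 0 else np.inf,                # only the reference progresses
                c[l - 1, lam] if l > 0 else np.inf,                  # only the source progresses
            )
            c[l, lam] = d[l, lam] + best_prev

    # Backtracking: starting at the top-right cell, always step to the cheapest predecessor.
    l, lam = N - 1, M - 1
    path = [(l, lam)]
    while (l, lam) != (0, 0):
        candidates = []
        if l > 0 and lam > 0:
            candidates.append((c[l - 1, lam - 1], (l - 1, lam - 1)))
        if lam > 0:
            candidates.append((c[l, lam - 1], (l, lam - 1)))
        if l > 0:
            candidates.append((c[l - 1, lam], (l - 1, lam)))
        _, (l, lam) = min(candidates, key=lambda pair: pair[0])
        path.append((l, lam))
    return c[N - 1, M - 1], c, path[::-1]
```

For classification we would call this once per reference utterance, e.g. `cost, _, _ = dtw(local_distance_matrix(x, y))`, and pick the reference with the smallest global cost. Since every cell depends only on its three possible predecessors, the complexity is \(O(N\cdot M)\) instead of exponential in the number of paths.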
### Hidden Markov Model (HMM)

In the HMM, we represent the phonemes as the hidden states behind our observations. We now try to find the most likely hidden state for each of our observations. From labeled training data, we can learn
- the emission probabilities
    - observing a feature given a hidden state
- the transition probabilities
    - staying in the same hidden state when progressing a segment, or moving from one hidden state to another when progressing a segment

We can now compute the likelihood of a set of observations given the HMM model. We basically ask "How likely are these transitions and emissions to produce this set of features?".
The HMM assumes that, knowing the last state, we don't need to know any earlier information to understand the process at hand (Markov assumption). If there is a dependency between the current state and earlier states, then they need to be incorporated into the model so the assumption is satisfied.
For each reference utterance, we have a separate recognizer model that gives us a probability of the observations being emitted from that model. We now choose the utterance with the highest probability of being the source of the observations.
The figure above shows the example of "haben". For each hidden state, there is a probability that the next frame still represents the same phoneme or that it progresses to the next phoneme. Each phoneme has a certain probability of emitting our observation. The model has two advantages over DTW:
- Phonemes can be skipped (/b/ directly to /n/, "habn")
- Different observations can be associated with the same phoneme (/n/ -> [n],[m], "habm")

In DTW, we'd need multiple reference utterances to represent that variability.
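As a minimal sketch of how such a likelihood can be computed for one word model (the forward algorithm, here in the log domain; the array shapes and the use of log-probabilities are assumptions for illustration, the parameters themselves would come from training):

```python
from scipy.special import logsumexp

def log_likelihood(log_pi, log_A, log_B):
    """Forward algorithm in the log domain.

    log_pi: initial state log-probabilities, shape (S,)
    log_A:  transition log-probabilities, log_A[i, j] = log P(next state j | current state i), shape (S, S)
    log_B:  per-frame emission log-probabilities, log_B[t, s] = log P(frame t | state s), shape (T, S)
    Returns log P(all observed frames | this word model).
    """
    T, S = log_B.shape
    alpha = log_pi + log_B[0]  # initialization with the first frame
    for t in range(1, T):
        # sum over all predecessor states, computed in the log domain for numerical stability
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return logsumexp(alpha)    # total likelihood over all possible end states
```

Each reference word gets its own set of parameters, and the word model with the highest log-likelihood is chosen, exactly as described above.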