Speech Signal Processing Overview

Speech is a unique ability of humans to convey information and impart knowledge about our world. Advances in speech signal processing have led to an exponential growth in communication devices, fundamentally changing the world we live in.

First speech production approaches date back to 1939, where Homer Dudley used the Source-Filter Model of speech to simulate lungs and vocal chords, the Voder. The same model is still used today to transmit voice in heavily compressed formats, using Source-Filter Model#Linear Prediction. Furthermore, speech processing can be used to make speech more comprehensible for people suffering hearing loss. This is done through Speech Enhancement techniques.

Speech Production¶

Speech starts as voiced or unvoiced speech sounds, the so called excitation signal. This signal passes through the vocal tract (involving tongue, lips, jaw, …), which acts as a filter (see Source-Filter Model#Tube Model of the Vocal Tract).

Pasted image 20240919120925.png

The excitation signal is created by when airflow passes through the larynx, which contains the glottis and the vocal folds (glottis is the overall sound-producing organ that contains the vocal folds). This excitation signal now passes through the vocal tract, which imparts fast changing resonances perceived as different phonemes. The vocal tract is made up of two sections, the nasal and the oral cavities. A voiced excitation signals produce vowel sounds. Unvoiced excitation signals produce fricatives (like [sh]) or, when the vocal tract is constricted and then suddenly opened, plosives ([k], [p], [t], …). There are also mixtures between vowels and fricatives, like [v] or [z].

The distance between peaks of the time-domain signal produced by the glottis is called the Fundamental Period (Inversion of Fundamental Frequency (F0)). In unvoiced speech, the spectral components are more evenly distributed across all frequencies, so no fundamental frequency or resonances are observable. This also means that there are a lot more high-frequency components in unvoiced signals, leading to a higher amount of zero-crossings in the time domain. This can be leveraged for Speech Recognition.

Pasted image 20240919122415.png

When breaking down speech into its smallest components, we start at the phone. A phone is the smallest speech segment that has some kind of separable physical or perceptual features. Examples are the [c] in "cat" or [p] in "pat" or the [p] in "pin" or the [p] in "spin". While the first is easily perceptually distinguishable, the latter is perceptually almost equivalent, but has physical (spectral) differences. Since [p] and [p] are similar enough to be interchangeable, they constitute the same phoneme. A single phoneme may be constituted of several, physically distinguishable allophones, but all of them can be interchanged without changing the meaning of a word. The differences are due to co-articulations which are caused when producing audibly smooth transitions between phones.

Another important aspect of speech is prosody, which encompasses the rhythm, stress and intonation of speech. Prosody conveys information about emphasis, intent and emotion (Emotion#Prosody).

Source-Filter Model¶

The Source-Filter Model describes speech production as the result of two independent processes. The source is the excitation signal produced by the glottis and the filter is effect of the vocal tract has on that excitation signal.

Source¶

Pasted image 20240919123241.png

When the glottis is closed, air pressure from the lungs builds up. When the glottis opens, the air pressure is release. Due to increased airflow, the pressure now decreases again (Bernoulli effect). With repeated closing and opening of the glottis, a train of high-pressure airwaves is produced at some fundamental frequency $T_{0}$.

Any periodic signal can be represented using the Fourier Transformation#Fourier Series#Sinusoidal Representation.

Pasted image 20240514162025.png

Mathematically, we can represent this as a [[Dirac Comb]] with frequency $T_{0}$. The unvoiced signal contains not a single frequency, but is a random process with random distances between the peaks. This means we can best approximate unvoiced sounds as white noise. In our theoretical model, we now switch between the voiced or unvoiced excitation signal as needed for the phoneme to be produced. This would require some kind of detector for the original signal, that keeps track of whether voiced or unvoiced sounds are needed at some point in time. This whole model is a simplification that doesn't allow for mixed sounds like [v] or [z]. However, instead of a "$0$ or $1$" switch, we could employ a weighted summation with a continuous switch in a more advanced model.

Filter¶

Pasted image 20240516170549.png

Pasted image 20240516170451.png

The vocal tract is modeled very simplified as a straight tube. We can differentiate between oral and nasal cavities. The model follows the principle of a pipe organ. This tube model has the property of concentrating the energy of the excitation signal into certain frequency ranges, so called Formants.

![[Pasted image 20240625104446.png]]

Formants are characteristic peaks within the frequency spectrum and are more or less unique for each phoneme. For example, the German vowel [a] has a major formant F1 at a frequency of 1000Hz and a secondary formant F2 at the frequency 1400Hz. These formants are independent of the fundamental frequency, so no matter if the voice of a person is high or low, the energy of the frequencies for an [a] will be distributed similarly towards the formants.

Each phoneme can thus be described by a unique filter, that concentrates the energy of a signal at the formants unique for that phonemes. This filter is referred to as the vocal tract transfer function. By continuously changing this transfer function over time, we can produce speech. In the time-domain, this corresponds to a convolution between the excitation signal and the vocal tract filter. In the spectral domain, we can easily represent this process as a multiplication:

Pasted image 20240919130938.png

Note that in this figure, the output signal does not contain the peak of the first formant. This is because the excitation signal basically samples ([[Sampling]]) the filter at the fundamental frequency. If a formant is not at a multiple of the fundamental frequency, it wont get sampled. This explains, why it is more difficult to understand the phonemes in high pitched voice: The fundamental frequency is higher and its resonances (at multiples of the fundamental frequency) are further apart. This means, that the vocal tract filter is sampled at a lower rate than for low pitched voice.

Hearing¶

![[Pasted image 20240919144542.png]]

The sound perception is organically divided into three parts:

The outer ear, where sounds arrive at the pinnea and is focused through the ear canal.
The middle ear, where the focused airwaves hit the ear drum. The ear drum is connected to the oval window of the cochlea, via the malleus (Hammer), incus (Ambos) and stapes (Steigbügel).
The inner ear, where the liquid in the cochlea is excited by the stapes. Pressure waves traverse the basilar membrane within the cochlea. This membrane has a varying stiffness across its length, so different parts of it resonate with different frequencies, allowing a deconstruction of the sound signal into its frequency components.

Pasted image 20240919145236.png

Humans perceive sounds from 20Hz to ca. 20.000Hz. As can be seen in the figure above, there are different amplitude thresholds as to when a frequency is perceived and when it starts hurting. The frequency of formants fall right within the most sensitive area of sound perception, implying speech optimized perception or perception optimized speech.

For sensory hearing loss, where hair cells on the Basilar membrane die, we need to only amplified quiet sounds to above the listening threshold, but not already loud sounds beyond the pain threshold (using compression and noise filtering). For conductive hearing loss, where the middle ear is not properly conducting sound waves to the cochlea, we may amplify all frequencies.

Pitch¶

The pitch of the voice carries important information about the speaker and the prosody. It is different to the Fundamental Frequency (F0), as it relates to the perceptual quality of the sound. Experiments show, that increasing the loudness of a sound also increases the perceived pitch in some cases, although the F0 stays constant. Typically, voice ranges from ~100Hz for males, over ~200Hz for females and up to ~600Hz for children.

When the F0 is removed from a signal but its resonances are still present ([[Low-pass filtering]]), we still perceive the fundamental frequency, although its not there. The brain seems to reconstruct it from the resonances alone. This is called the residual effect. This is used in communication protocols like ISDN or GSM, where frequencies below 300Hz and above 3400Hz are omitted. The F0 is mostly missing, but the pitch remains the same, producing intelligible speech.

For speech reconstruction, it is important to reliably detect voiced and unvoiced segments, so we can use noise reduction on the voiced segments and enhance F0 and its resonances. No noise reduction should be applied to unvoiced segments however, as they are Gaussian noise by nature.

Estimating F0¶

See Fundamental Frequency (F0)#Estimation.

The fundamental frequency can be estimated using different methods. A very basic method is using the time domain signal directly and counting zero-crossings or number of peaks within a time interval. This method is however not very reliable because of variance in the voice itself and because of noise from the environment.

A more reliable way of estimating F0 is using the autocorrelation function:

\[ \varphi_{XX}(\lambda)=\mathbb{E}[x(n)x^{*}(n+\lambda )]=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}uvp_{x(n),x^{*}(n+\lambda)}(u,v)dudv \]

The autocorrelation of a signal at lag $\lambda$ is the expected value of a signal multiplied by a $\lambda$-shifted version of it self. However, the joint probability of a signal with a time shifted version of itself is generally not available, so we have to empirically estimate the autocorrelation:

\[ \hat{\varphi}_{XX}(\lambda)=\frac{1}{N-|\lambda|}\sum\limits_{n=0}^{N-|\lambda|-1}x(n)x^{*}(n+\lambda) \]

Note: For $N$ samples, we can only calculate the autocorrelation over $N-\lambda$ samples.

Pasted image 20240423132832.png

The energy of autocorrelation function is highest at $\lambda=0$, so where the signal is just squared. The energy then decreases as $\lambda$ increases until approaches the fundamental frequency $\lambda=T_{0}$. After one period, the signal coincides with itself again and the energy again is high. The signal will have repeated peaks at each period length $\hat{\varphi}_{XX}(kT_{0})$ We can then thus use the second peak of the autocorrelation function as an estimate for our fundamental period $T_{0}$. We can also apply the Fourier Transformation to the autocorrelation function, giving us the Power Spectral Density (PSD). The PSD will have an initial peak the the fundamental frequency F0 and the repeated peaks for each harmonic.

Pasted image 20240919153729.png

The PSD also is helpful for identifying unvoiced and voiced segments. If a segment is unvoiced, it only contains Gaussian noise. If we calculate the autocorrelation of white noise, it will have a peak at $\lambda=0$ and be zero everywhere else. This is because each successive sample will be independent and signal is thus completely uncorrelated to itself. The PSD of such a signal will have no peaks, because there are no periodicities. So we can identify unvoiced segments, if the PSD of the segment has no peaks (i.e. never goes beyond a certain threshold).

Pasted image 20240919154214.png

The accuracy of this method is heavily dependent on the length of Analysis Window utilized. Because speech is not a stationary process and the F0 changes over time, the window length may not be too large. But we also can't make the signal too short, otherwise we don't observe the periods of the signal. A optimal trade-off can be found at a ~$30$ms window.

There are also more noise-robust methods, like the Power Spectral Density (PSD)#Yin-Algorithm. Its basic idea: Instead of multiplying the signal with a time-shifted version of itself, we take the difference squared:

\[ d_{\lambda}=\frac{1}{N-\lambda}\sum\limits_{n=0}^{N-|\lambda|-1|}(x(n)-x(n+\lambda ))^{2} \]

Without any noise, the difference at $\lambda=T_{0}$ will be zero. This MSE approach is more robust to noise and is extended in the YIN-algorithm.

Spectral Transformations¶

Spectral Transformations allow us to transform a time-domain signal into its frequency-based representation. The previously mentioned Power Spectral Density (PSD) also relies on such spectral transformations. The basis for all the transformations is the Fourier Transformation. Fourier transformations are based on the idea, that any signal can be broken down into a weighted sum of sine and cosine functions.

Linear Time Invariant Systems¶

A system is a mapping from one signal to another:

$x(n)*h(n)=y(n)$
$X(\omega)\cdot H(\omega)=Y(\omega)$

where the transformation is described by the transfer function $H$. Because the convolution is simplified to a multiplication in the frequency domain, the computational complexity is reduced drastically, if we can cheaply transform a signal into its spectral representation.

A system that is…

linear: $T\{ax_{1}(n)+bx_{2}(n)\}=aT\{x_{1}\}+bT\{x_{2}\}$
time-invariant: $y(n-n_{0})=T\{x(n-n_{0})\}$

is called LTI System. The transfer function of such system is given by its output for a [[Dirac]] (unit response):

\[ T\{\delta(n)\}=h(n) \]

and fully describes the LTI system. This is due to the LTI properties: We represent the signal as a series of delta impulses, scaled to the amplitude of the input signal at each point. The output signal of the system will now be sum of outputs signals, when the system is applied to each impulse. This is the convolutional integral between the input signal and the transformed delta impulse, the Transfer Function. For more details, see LTI System#Impulse Response.

Fourier Transformation¶

Fourier Series¶

The basic idea starts as follows: Any even function can be represented as a weighted sum of cosines:

\[ x_{e}(t)=\sum\limits_{n=0}^{\infty}a_{n}\cos(n\omega_{0}t) \]

Because $\cos(0)=1$, we can move $a_{0}$ out of the sum:

\[ x_{e}(t)=a_{0}+\sum\limits_{n=1}^{\infty}a_{n}\cos(n\omega_{0}t) \]

We can reformulate this term find the coefficients:

$a_{n}=\frac{2}{T}\int_{T}x_{e}(t)\cos(n\omega_{0}t)dt$
$a_{0}=\frac{1}{T}\int_{T}x_{e}(t)dt=\text{average}(x_{e}(t))$

The same formulation holds for odd signals, just using sines instead of cosines:

\[ x_{o}(t)=\sum\limits_{n=1}^{\infty}b_{n}\sin(n\omega_{0}t) \]

As $\sin(0)=0$, we can omit the first coefficient and reformulate to get:

$b_{n}=\frac{2}{T}\int_{T}x_{o}(t)\sin(n\omega_{0}t)dt$

Because any signal can be represented as the sum of its even and odd part, we can now merge the formulation to:

\[ x_T (t)=\frac{1}{2}a_{0}+\sum\limits_{n=1}^{\infty}(a_{n}\cos(n\omega_{0}t)+b_{n}\sin(n\omega_{0}t)) \]

where each coefficient is computed as described above.

Using Eulers Formula, we can transform the series into its exponential form:

\[ x_{T}(t)=\sum\limits_{n=-\infty}^{\infty}c_{n}e^{jn\omega_{0}t} \]

where $c_{n}=\frac{a_{n}}{2}-j\frac{b_{n}}{2}$.

Continuous Fourier Transformation¶

The Fourier Series has a significant draw back: It uses the sum of harmonics and thus only works for periodic functions. So if we have a non-periodic, infinite signal, we need to generalize the concept to all frequencies:

\[ S(f)=\int\limits_{-\infty}^{\infty}s(t)e^{-j2{\pi}ft}dt \]

The inverse transformation is then given by:

\[ s(t)=\int\limits_{-\infty}^{\infty}S(f)e^{j2{\pi}ft}df \]

This is the Fourier Transformation#Continuous-Time Fourier Transform and allows for breaking down any signal into its frequency constituents.

Discrete Time Fourier Transform¶

Since we computationally only can deal with time-discrete signals, we can't use the continuous form of the FT, but have to restrict the transformation from an integral to a sum of our time-domain samples:

\[ S(f)=\sum\limits\limits_{-\infty}^{\infty}s[n]e^{-j2{\pi}fn} \]

Just as in the forward transformation of the #Fourier Series, the inverse DTFT can be performed by integrating only over one period length of the spectrum:

\[ s[n]=\frac{1}{2\pi}\int\limits\limits_{2\pi}S(f)e^{j2{\pi}fn}df \]

Because any time discrete signal has a repeated spectrum at half the sampling frequency, our signal may only contain up frequencies up to half the sampling frequency (Nyquist Theorem). This is the reason we only integrate over a period-length for synthesis, afterwards the spectrum repeats.

Discrete Fourier Transform¶

Pasted image 20240920112223.png

Not only are our time signals discrete, they are also finite observations (and often this finite observation is broken down even further using Analysis Windows). As we can see above, if we have a discrete and finite signal, we can convolve the signal with a Dirac comb where the period is equal to the length of the signal $T_{s}=NT$. Doing so completely retains the time-domain signal (only adding duplicates before and after) and discretizes the spectrum at the same time, as we multiply with a Dirac comb with period $\frac{1}{NT}$. This is computationally very useful, as we don't have to deal with continuous frequencies anymore. This gives us the Discrete Fourier Transform analysis:

\[ S(k)=\sum\limits_{n=0}^{N-1}s[n]e^{-jk\frac{\omega_{0}}{N}n} \]

and the according synthesis:

\[ s[n]=\frac{1}{N}\sum\limits_{k=0}^{N-1}S(k)e^{jk\frac{\omega_{0}}{N}n} \]

The frequency bins of the DFT are $f_{k}=\frac{f_{s}}{N}k$.

z-Transform¶

The z-transformation is an extension of the Fourier transform. It is defined as:

\[ X(z):=\mathcal{Z}[x(n)]=\sum\limits_{-\infty}^{\infty}x(n)z^{-n} \]

where $z$ is some complex variable. If $z=e^{j2\pi f}=e^{j\Omega}$, then the z-transform is equivalent to the Fourier transform. But we can also choose certain coefficients $z=re^{j\Omega}$. The scaling factor $r$ allows us to deal with signals that are exponentially increasing or decreasing. If we "wrap" an exponentially increasing function around an exponentially decreasing complex circle, the area of the complex function actually stays constant and can be computed. In the FT we would get infinite values. Inversely, we can wrap exponentially decaying signals around an exponentially increasing complex circle, giving us real valued areas where the FT would go to zero.

The z-Transform is useful to model digital filters. Importantly, a time delay of one unit corresponds to $x(n-1) =z^{-1}X(z)$. We can thus easily model systems as linear combinations of past values:

FIR and IIR Filter¶

Pasted image 20240926115936.png

A Finite Impulse Response filter (FIR filter) can be described as a polynomial, where the n-th order component describes the contribution of the sample from n timesteps ago:

\[ H_{FIR}(z)=\sum\limits_{n=0}^{N-1}b_{n}z^{-n} \]

Pasted image 20240926120201.png

An Infinite Impulse Response filter (IIR filter) describes the effect of a system with internal feedback:

\[ H_{IIR}(z)=\frac{\sum\limits_{n=0}^{N-1}b_{n}z^{-n}}{1+\sum\limits_{p=1}^{P}a_{n}z^{-n}} \]

In the section #Linear Prediction#Root factorization we will transform a autoregressive system into the IIR filter form to derive a pole filter, that can describe the effect of the vocal tract. Because the vocal tract is full of internal feedback effects (reflections travelling forth and back), the filter has an infinite impulse response, but can be approximated using a finite amount of $a$-coefficients.

Finite Signals & Windowing¶

Short-Time Fourier Transform¶

Pasted image 20240920122004.png

The DFT is mostly unsuited for non-stationary signals, because the frequencies change over a time frame and the DFT would just give the mean of frequencies over that time frame. Continuous variations of the Fourier Transform would capture such differences, but they would be subtle differences in the continuous spectrum. To retain temporal information, we can use the STFT. It applies the DFT on small overlapping window segments of the time domain signal. As seen in the figure above, this allows us to differentiate sounds with similar frequencies over time, but different temporal behaviors.

The STFT is defined as

\[ X[k,l]=\sum\limits_{n=0}^{N-1}w[n]x[n+IL]e^{-jk\frac{\omega_0}{N}n} \]

where

$I$ is the frame index
$L$ is the frame shift
and $N$ is the window length

The corresponding frequency for frequency bin $k$ is dependent on the sampling rate: $f_{k}=\frac{k}{N}f_{s}$.

The STFT results in a three dimensional signal representation: Amplitude over frequency over time. This data is typically represented in a spectrogram, where the x-axis is time, the y-axis is frequency and the amplitude is color-coded.

Pasted image 20240513192352.png

The wider window length corresponds to narrower frequency bands, a so called narrowband spectrogram. Using shorter window lengths leads to wider frequency bands and narrower time resolution, so called wideband spectrogram. For speech processing, the optimal tradeoff can be found at around 30ms, although some speech sounds like plosives would benefit from longer window lengths like 40-60ms.

Overlap is typically chosen to be around 50-75% of the window. Furthermore, because natural sounds often follow a 6db attenuation, meaning that every frequency is 6db more quiet than the one before. Therefore, we might want to boost high frequencies using a high-pass pre-emphasize filter.

Cyclic Convolution¶

Pasted image 20240920113914.png

If we want to convolve two finite signals by multiplying their DFTs, we have a problem: The discretized frequency domain represents the time-domain signal periodically extended, so the product of the spectra is equal to convolving the periodically extended signals, resulting in a cyclic convolution that yields aliasing effects. We can avoid this issue by using zero padding with the length of the signal. If we use Analysis Windows and have multiple overlapping frames to which we want to apply a convolutional filter, we can reduce the amount of zero padding.

Overlap-add Convolution¶

The overlap of at the end of each frame is set to zero and the filter is extended to length $k=\text{length}(x_{i})+\text{length}(h)-1$. We might want to round up to the nearest power of 2 to facilitate the Fast Fourier Transform. Then we perform the convolution and add the overlapping parts of each output frame.

Overlap-save Convolution¶

We don't change the overlap of each frame, and just extend the filter as above. Then we perform the convolution and discard the first $\text{len}(\text{overlap})$ samples of the output frames.

Spectral Leakage¶

The effect of Spectral Leakage can be seen when #Sampling, but also when applying an Analysis Window to the time-domain signal. When we apply an rectangular analysis window, we basically cut the time-domain signal off at the length of that window and when we then apply the DFT, the signal is periodically extended from that cut-off point. If that cut-off point doesn't correspond exactly to a period-length of the time-domain signal, there will be a discontinuity in the time-domain signal.

Discontinuities generally lead to infinite spectral components to over the complete spectrum, which smears the spectrum of the signal. We can use other tapered windowing functions to avoid such extreme discontinuities. The problem with tapered window function (see Analysis Window#Window Types) is that they increase the width of the main lobe in the spectrum. So we have a trade-off between spectral leakage and spectral resolution. We can also increase the length of the window, trading off time resolution against frequency resolution.

Spectral Envelope¶

Pasted image 20240513202034.png

As mentioned earlier, each phoneme has characteristic formants frequencies, where signal energy is concentrated at by the vocal-tract filter. Averaging the frequency amplitude gives us an envelope, which has peaks at the formants. Using a wideband spectrogram, we can visualize the effect of the vocal tract very well: The underlying, detailed voice properties (F0 and its harmonies) are smeared out and we are left with the broader structure of the spectrum, caused by the vocal tract.

Sampling¶

Drawing 2024-09-20 10.34.36.excalidraw#^group=LvPuRJA2y9WUFlY4-Ep0o

Sampling a signal is done by multiplying the time domain signal with a [[Dirac Comb]] that has a period of $T_{s}$. This is equivalent to convolving a Dirac Comb of period $\frac{1}{T_{s}}$ in the frequency domain. That means that the spectrum of the signal is repeated at $f_{s}=\frac{1}{T_{s}}$. So the larger the sampling rate, the more often the spectrum is repeated, leading to aliasing effects.

Drawing 2024-09-20 10.34.36.excalidraw#^group=yad2xcUdoioi27a3fTYfP

We can avoid this by choosing a sampling rate that is small enough so that the repeated spectra don't overlap. If the signal has a maximum frequency of $f_{m}$, its spectrum will have a width of $2f_{m}$ (positive and negative part). So if we choose a sampling frequency of $f_{s}\le2f_{m}$, we will get some overlap of the repeated spectra (left example, $f_{s}=1.5f_{m}$). But if we choose a sampling frequency that is more than twice the maximum frequency of the signal (right example, $f_{s}>2f_{m}$), the spectra don't overlap anymore and we can perfectly reconstruct the input signal by low-pass filtering the spectrum (convolving time-domain signal with $\text{sinc}$ function).

Because human perception ranges up to $f_{m}=20kHz$ (see #Hearing), we'd want to choose a sampling frequency of $f_{s}>40kHz$ to perfectly reconstruct a signal with no perceptible loss. This is why CDs are sampled at around $44kHz$. Traditional telephone speech is only sampled at $8kHz$, meaning that we can only reconstruct frequencies up to $4kHz$.

Tube Model & Linear Prediction¶

As mentioned in the previous sections, the vocal tract can be approximated by a tube, which produces resonances and thus concentrates energy at certain frequencies. This can be leveraged to produce formants, using a fundamental frequency and its harmonies as input and, by extension, to derive a mathematical model to compute parameters that estimate the instantaneous shape of the vocal tract for each frame of recording.

Tube Model¶

Simple Tube¶

Pasted image 20240923094734.png

For the mathematical formulation, we need to look at the waveform signal as the pressure wave across air particles, propagating along the tube. Instead of the tube being fully opened or fully closed, we will consider it partially opened, propagating part of the wave into a next tube segment and reflecting part of it back.

To do so, we'll define some fundamental properties of wave propagation as functions of space and time:

Acoustic pressure: $p(x,t)$
Speed of particle oscillation: $u(x,t)$
Velocity of particles over infinitesimal tube cross-section of area $A$: $v(x,t)=u\cdot A$

[!note] We will see that we won't need to define any specific functions, but just by relating pressure, particle velocity and volume velocity, we can derive a model of how waves will behave within our tube model. Also, we only consider a single spatial dimension $x$. This is because we are only interested in the length-wise propagation of the air wave.

By dividing the acoustic pressure by the particle velocity, we get the acoustic impedance, which describes how much resistance the tube provides to the propagation of the pressure wave:

\[ Z(x,t)=\frac{p(x,t)}{v(x,t)} \]

Crucially, we can use the impedance to find out, how much energy is reflected when the cross-sectional area of the tube (and thus the acoustic impedance) changes:

\[ r=\frac{Z_{2}-Z_{1}}{Z_{2}+Z_{1}} \]

At the closed end of the tube, the particles can be considered to have no velocity but maximal pressure. So the lower denominator goes to zero and the impedance becomes infinite ($Z_{2}=\infty$). Thus, at the closed end, the reflection co-efficient goes to $r=1$ and all the energy is reflected. At the open end of the tube, the opposite happens: The particles can freely move and the velocity is maximal. However, the pressure fully disperses and the numerator becomes zero. So the acoustic impedance becomes zero ($Z_{2}=0$) and the reflection co-efficient is $r=-1$, so no energy is reflected and all the energy propagates forward.

Pasted image 20240923102700.png

In other words, the closed end reflects the wave with no phase shift, the open ends also reflects it, but the reflected wave is phase shifted $180^{\circ}$. This is the reason why tubes have resonances, a resonance is created when the forward and backward travelling part of the wave interfere constructively and produce a standing wave. For a given wavelength $\lambda$, the first resonance appears at $l=\frac{\lambda}{4}$, so a tube of length $l$ will produce a resonance of period length $\lambda=4l$. The figure shows why. For a pressure wave with $\lambda=4l$, the boundary conditions mentioned above are satisfied:

At the closed end, the pressure is maximum and the velocity is zero (which is true for all wave lengths)
At the open end, the pressure is minimum and the velocity is at its maximum (perfect constructive interference)

Thus, the air exits at the highest velocity possible, producing the loudest sound at that frequency. For wavelengths between the resonances, more energy is trapped in the built-up pressure. The maximum velocity happens somewhere inside the tube, leading to more friction and thus the energy is dissipated as heat over time.

For the vocal tract, a $17cm$ tube is a good approximation. Because $l=\frac{\lambda}{4}$, we know the first resonance will be at a period length of $\lambda_{0}=4l=68cm$ or a frequency of $f_{0}=\frac{c}{\lambda_{0}}=\frac{340\frac{m}{s}}{68cm}=500Hz$. Other resonances are at period lengths $\lambda=[4l,\frac{4}{3}l, \frac{4}{5}l, \frac{4}{7}l, \ldots]$, giving us resonance frequencies at $f_{res}=[500Hz, 1500Hz, 2500Hz, 3500Hz, \ldots]$, so roughly one resonance every 1000Hz.

Concatenated Tube¶

If we wanted to use this simple model for speech production, we got an issue: We can only have one resonance at a time. But for formants, we need to model multiple resonances at specific frequencies and different strengths. This is why we need to model the vocal tract as a sequence of $n$ tubes with varying diameter. Each segment will be of length $\Delta x=\frac{L}{N}$.

Wave Equation¶

In order to know, how a wave will propagate through the concatenated model system, we'd like to model a change in pressure over space as a function of change in pressure over time (the so called Wave Equation).

Assuming a constant diameter per tube and no friction along the tube walls, we can use the equations for plane wave propagation to relate pressure $p$ and volume velocity $v$ across a tube segment:

\[ -\frac{\partial p}{\partial x}=\rho\frac{\partial u}{\partial t} \quad\quad\quad -\frac{\partial u}{\partial x}=\frac{1}{\rho c^{2}}\frac{\partial p}{\partial t} \]

According to these equations…

a change in pressure over space is proportional to a change in particle velocity over time
and a change in particle velocity of space is proportional to a change in pressure over time.

Since $v(x,t)=A\cdot u(x,t)\iff \partial{v}=A_{n}\partial{u}\iff \partial{u}=\frac{\partial{v}}{A}$, we can substitute:

\[ -\frac{\partial p}{\partial x}=\frac{\rho}{A}\frac{\partial v}{\partial t} \quad\quad\quad -\frac{\partial v}{\partial x}=\frac{A}{\rho c^{2}}\frac{\partial p}{\partial t} \]

And by further differentiating those equations by space and time, we can find a relation between particle pressure/velocity over space as a function pressure/velocity over time:

\[ -\frac{\partial^{2}p}{\partial x^{2}}=\frac{\rho}{A}\frac{\partial^{2}v}{\partial t \partial x} \quad\quad -\frac{\partial^{2}v}{\partial x \partial t}=\frac{A}{\rho c^{2}}\frac{\partial^{2}p}{\partial t^{2}} \quad\implies\quad \frac{\partial^{2}p}{\partial x ^{2}}=\frac{1}{c^{2}}\frac{\partial^{2}p}{\partial t^{2}} \]

\[ -\frac{\partial^{2}v}{\partial x^{2}}=\frac{A}{\rho c^{2}}\frac{\partial^{2}p}{\partial x \partial t} \quad\implies\quad \frac{\partial^{2}v}{\partial x ^{2}}=\frac{1}{c^{2}}\frac{\partial^{2}v}{\partial t^{2}} \]

These wave equations relate the acceleration of pressure over space as an acceleration of pressure over time and can be solved using harmonic solutions.

Forward and Backward Wave¶

Pasted image 20240923131226.png

In the concatenated tube, we do not only have a closed and an open end, but also partial openings at the transitions between segments. There, a part of the forward traveling wave will be reflected back and part of the backward traveling wave will be reflected forward. This can be described by reflection coefficients between 1 (full forward propagation) and -1 (full backward reflection). To relate the change in pressure/velocity with the reflection coefficients, we need to represent pressure/velocity as functions of the forward and backward traveling waves.

Superposition¶

Both waves are functions of time and space, but since a change in space is equal to a shift in time ($t=\frac{x}{c}$), we can represent the forward traveling wave as $f(t-\frac{x}{c})$ and the backward traveling wave as $b(t-\frac{x}{c})$. Lets now consider the superposition of those waves:

First off, the velocities of the forward traveling wave and backward traveling wave are opposite to each other, so they subtract:

\[ v(x,t)=f(t-\frac{x}{c})-b(t+\frac{x}{c}) \]

For the pressure, the particles density has no direction, so the waves add up:

\[ p(x,t)=Z \cdot(f(t-\frac{x}{c})+b(t+\frac{x}{c})) \]

Boundary Constraints¶

Now, we furthermore introduce the constraint, that the pressure at the end of one segment and at the beginning of the next segment is equal (same for velocity).

$$ \begin{align}

v_{i}(x=\Delta x, t)&=v_{i-1}(x=0,t)\

p_{i}(x=\Delta x, t)&=p_{i-1}(x=0,t)

\end{align} $$

We can generalize this by using the segment length $\tau=\frac{\Delta x}{c}$ in combination with the wave equations from the previous section:

$$ \begin{align}

f_{i}(t-\tau)-b_{i}(t+\tau)&=f_{i-1}(t)-b_{i-1}(t)\

Z(f_{i}(t-\tau)+b_{i}(t+\tau))&=Z(f_{i-1}(t)+b_{i-1}(t))

\end{align} $$

Partial Reflections and Kelly-Lochbaum Structure¶

Pasted image 20240923135901.png

We can now relate

the forward traveling wave in segment $i-1$ as a function of the forward traveling wave of the previous segments $i$ and the backward traveling wave in this segment.
the backward traveling wave in segment $i$ as a function of the forward traveling wave in this segment $i$ and the backward traveling wave from the next segment $i-1$.

This can be done using the previous equations and the definition of the reflection coefficient $r_{i}=\frac{Z_{i}-Z_{i-1}}{Z_{i}+Z_{i-1}}$, giving us:

$$ \begin{align}

f_{i-1}(t)&=(1+r_{i})f_{i}(t-\tau)+r_{i}b_{i-1}(t)\

b_{i}(t)&=-r_{i}f_{i}(t-2\tau)+(1-r_{i})b_{i-1}(t-\tau)

\end{align} $$

This differential equation can be represented as a system, the so called Kelly-Lochbaum structure.

The interpretation goes as follows:

The energy of the forward wave at segment $i-1$ is the sum of energies

of the forward wave from the previous segment
plus the forward reflected part of the forward wave of the previous segment
plus the forward reflected part of the backward wave of the current segment

The energy of the backward wave at segment $i-1$ is the sum of energies

of the backward wave from the next segment
plus the backward reflected part of the forward wave of the current segment
plus the backward reflected part of the backward wave of the next segment

Not that the backward reflected part of the forward wave takes double the time ($2\tau$) to contribute, because it has to travel forward once, be reflected and travel backward the segment, before it contributes.

Pasted image 20240923141619.png

We can now concatenate this system for our $n$ tube segments. Inputting a [[Dirac]], we expect the Impulse Response after $n\tau$ seconds, as the signal traverses $n$ segments, each segment delaying the signal by $\tau$ seconds. We then expect the reflections to introduce decaying impulses every $2\tau$ seconds (Signal travels back for one tick and forward again on the second tick, each time energy is "lost" at the output). We can now use this system to create a digital filter with infinite but decaying impulse response.

Linear Prediction¶

Pasted image 20240926102530.png

Using the Loch-Kellybaum structure from the last, we now determine the weights for a causal filter (future values can't effect the current ones), that produces synthesized speech from an excitation signal. We can define such a filter as the convolution of the filter with the excitation signal:

\[ s(n)=\sum\limits_{m=0}^{\infty}h(m)e(n-m) \]

Pole-Zero Filter¶

Since the impulse response is infinite, we can't compute the filter output in its present form. This will require a work around using the Autoregressive Moving Average (ARMA). The ARMA model is used for weekly stationary processes like speech and represents the filter as sum of two processes:

An autoregressive process, that models the current realization of the signal through its past realizations
A moving average process, that controls for the models error

ARMA¶

Combined, both give the ARMA model:

\[ s(n)=\sum\limits_{m=0}^{q}b_{m}e(n-m)-\sum\limits_{\nu=1}^{p}a_{\nu}s(n-\nu) \]

Usually, in the ARMA model, the autoregressive part models everything that is predictable about the system as a linear combination of the past $p$ samples, whereas the moving average captures the unpredictable, evolving error of the signal. The only unpredictable part about our system is the input, the varying excitation signal. This is why we use the weighted average of the past $q$ samples of our excitation signal to model the error.

z-Transformation¶

We can transform this ARMA model into the spectral domain using the [[z-Transform]]:

$$ \begin{align}

&&S(z)&=E(z)\sum\limits_{m=0}^{q}b_{m}z}-S(z)\sum\limits_{\nu=1^{p}a_{\nu}z\

\iff&&S(z)(1+\sum\limits_{\nu=1}^{p}a_{\nu}z})&=E(z)\sum\limits_{m=0^{p}b_{m}z

\end{align} $$

Since the signal is just the product of the transfer function and the excitation signal in the spectral domain, we can set $a_{0}=1$ get:

\[ H(z)=\frac{S(z)}{E(z)}=\frac{\sum\limits_{m=0}^{q}b_{m}z^{-m}}{\sum\limits_{\nu=1}^{p}a_{\nu}z^{-\nu}} \]

We now represent the filter as a quotient of two polynomials in the $z$-domain.

Root Factorization¶

We can apply the fundamental theorem of algebra to this formula: A polynomial of order $n$ has $n$ roots. We can make use of this fact. First we look at what happens, if we divide a polynomial by its root in the z domain:

\[ \frac{p(z)}{(z-a)}=q(z)+\frac{R}{z-a} \]

Dividing $p$ by its root gives us the quotient $q$ plus a remainder $R$. We can multiply by $(z-a)$ on both sides to get

\[ p(z)=(z-a)q(z)+R \]

We can see, that $q$ is one order lower than $p$, as it now has $z$ as another factor. Furthermore, for a root $z_{0}=a$, we see that $p(z_{0})=0=(z_{0}-z_{0})q(z)+R\implies R=0$, the remainder is zero. We can use this to factorize the polynomial filter:

$$ \begin{align}

H(z) &= \frac{S(z)}{E(z)} = \frac{\sum_{m=0}^{q} b_m z^{{-m}}{\sum_{\nu=0}}} a_{\nu} z^{-\nu}

& (1)

\

&= \frac{b_0 z^{-q}}{z&(2)\}} \frac{z^{q}+\frac{b_1}{b_0} z^{-q+1} + \dots + \frac{b_q}{b_0}}{z^{p}+a_1 z^{-p+1} + \dots + a_p

&= z^{p-q} b_0 \frac{\prod_{m=1}^{q} (z - z_{0m})}{\prod_{\nu=1}^{p} (z - z_{\infty \nu})}&(3)

\end{align} $$

We take the filter (1) and transform it into a form, where we represent the sum splitting each summand into the sum of products with the constant factor $\frac{b_{0}z^{-q}}{z^{-p}}$ (2). In this form, the polynomials now have positive exponents and the first coefficient is weighted by one. In each of those polynomials, we can now factorize out all roots $z_{0m},z_{\infty\nu}$ as described above, giving us our final root-pole representation (3).

We see now that the ARMA model can be represented as a [[Pole-Zero filter]] in the $z$-domain. Furthermore, if the underlying process is random, it is perfectly described by the moving average and thus by the numerator of the filter and the denominator will go to 1. So it will be an all-zero filter. Similarly, if the process is completely predictable, it will be perfectly described by the autoregressive part and the numerator will go to 1, giving us an all-pole filter.

In the source-filter model, we assume all changes to the excitation signal to be caused by variations in the vocal tract. Thus, we assume no other processes that are not part of our model to effect the output signal. We thus neglect the moving average and represent the filter only using the autoregressive part:

\[ H(z)\approx b_{0}z^{p}\frac{1}{\prod_{\nu=1}^{p}(z-z_{\infty\nu})}=\frac{b_{0}}{\sum\limits_{\nu=0}^{p}a_{\nu}z^{-\nu}} \]

The recursive structure of the ARMA model is resembles the recursive structure of the Loch-Kellybaum model and $p$ is the number of recursive steps we take. This indicates a better approximation of the signals amplitude with increasing order $p$ and a perfect reconstruction for infinite filter order.

AR-Coefficients¶

We have derived a parameterized filter that should be able to emulate the formants in real speech production. Now we need to find the actual parameter coefficients. To do so, we go back to the time-domain ARMA model, reduced to the autoregressive part:

\[ s(n)=b_{0}e(n)-\sum\limits_{\nu=1}^{p}a_{\nu}s(n-\nu) \]

We see that the signal is described by an initial innovation $e(n)$. We want the autoregressive prediction $\hat{s}(n)=-\sum\limits_{\nu=1}^{p}\hat{a}_{\nu}s(n-\nu)$. The autoregressive part is unable to account for changes in the innovation, as it uncorrelated to past samples. The innovation is the difference between the predicted signal and the actual signal: $s(n)-\hat{s}(n)=d(n)=b_{0}e(n)$. So to compute the coefficients, we need to minimize the innovation part, so that the predicted signal gets as close as possible to the real one.

MMSE¶

We do so by using the minimum mean squared error (MMSE):

\[ \hat{a}_{\nu}=\arg\min_{a_\nu}\mathbb{E}[(s(n)-\hat{s}(n))^{2}] \]

First Derivative¶

We solve by taking the derivative with regards to $a_{\nu}$ an setting to zero to find the minimum:

\[ 0\stackrel{!}{=}\frac{\partial \mathbb{E}[d(n)^{2}]}{\partial\hat{a}_{\nu}}\stackrel{\text{chain-rule}}{=}\mathbb{E}\left[2d(n)\cdot\frac{\partial}{\partial \hat{a}_{\nu}}(s(n)-\sum\limits_{\nu=1}^{p}\hat{a}_{\nu}s(n-\nu)\right] \]

For the inner derivative, we can drop $s(n)$ as it is not a function of $\hat{a}_{\nu}$. For the sum, since we are only taking the derivative with respect to the the $\nu^{th}$ component, we can drop all other terms, leaving us with $\frac{\partial}{\partial \hat{a}_{\nu}}(s(n)-\sum\limits_{\nu=1}^{p}\hat{a}_{\nu}s(n-\nu)=s(n-\nu)$. The overall minimization term thus is

$$ \begin{align}

&=\mathbb{E}[2d(n)s(n-\nu)]\

&=2\mathbb{E}\left[(s(n)+\sum\limits_{\nu=1}^{p}\hat{a}_{\nu}s(n-\nu))s(n-\nu)\right]&|\cdot\frac{1}{2}\

&=\mathbb{E}[s(n)s(n-\nu)]+\mathbb{E}\left[\sum\limits_{\mu=1}^{p}\hat{a}_{\mu}s(n-\mu)s(n-\nu)\right]

\end{align} $$

We can now substitute the definition of the autocorrelation function $\phi_{s}(\nu)=\mathbb{E}[s(n)s(n-\nu)]$ twice:

\[ 0=\phi_{s}(\nu)+\sum\limits_{\mu}^{p}\hat{a}_{\mu}\phi_{s}(\nu-\mu) \]

Second Derivative¶

To show that this point is indeed a minimum, we need to show that the second derivative is positive. If we take another derivative from our previous term, only the $v^{th}$ of the second autocorrelation remains. Since the time-lag for $\mu=\nu$ is zero, we have the signal power, which always is positive.

$$ \frac{\partial^{{2}\mathbb{E}[d(n)} =}]}{\partial \hat{a}_{\nu}

\frac{\partial}{\partial \hat{a}{\nu}}\left[\phi}(\nu)+\sum\limits_{\mu}^{p}\hat{a{\mu}\phi(\nu-\mu)\right]

=\mathbb{E}[2\phi_{s}(0)]

\ge0 $$

Wiener-Hopf Equation¶

We can rewrite this equation as $\phi_{s}(\nu)=-\sum\limits_{\mu}^{p}\hat{a}_{\mu}\phi_{s}(\nu-\mu)$, which we can write in matrix notation:

\[ \phi_{s}=-\mathbf{R_{s}\hat{a}} \]

where $\mathbf{R_{s}}$ is the autocorrelation matrix. This matrix relates a vector of $p$ autocorrelations to their previous $p$ realizations. The solution to his problem gives us the optimal coefficients:

\[ \mathbf{\hat{a}_{opt}=-R_{s}^{-1}\phi_{s}} \]

We can solve this using Levnison-Durbin Recursion, if we know the autocorrelation of our signal. Given an empirical observation, we can of course only estimate its value. We do so only for our short window segments, since the signal overall is only weekly stationary:

\[ \tilde{\phi}_{s}(\nu)=\frac{1}{n_{2}-n_{1}}\sum\limits_{n=n_{1}}^{n_{t}}\tilde{s}(n)\tilde{s}(n-\nu) \]

The matrix is Topelitz (all diagonal entries are the same) and symmetrical, making it very efficient to solve.

Transmitting Speech Using LPC¶

We have now found a way to produce the coefficients for our pole-filter from the previous section. A remaining question is: How many coefficients do we need? We saw that for perfect signal reconstruction, we'd want infinitely many coefficients. So we want to compute enough coefficients to perceptually reconstruct the signal, but no so many as to overfit and start predicting noise.

To model a single formant, we need two LPCs. Since human speech contains ca. one formant per 1kHz, at a sampling rate of 8kHz we'd need 8 LPCs to reconstruct a pole filter with poles at the formant frequencies. We furthermore use anti-aliasing filter that attenuate high frequencies, so we need to add 2 more LPCs. This gives us roughly $\text{filter order}=\text{sampling frequency in kHz}+2$. We use pre-emphasize on high frequencies to counteract the natural $6db$ attenuation of sounds.

Over a low-bandwidth channel, we'd transmit about:

unvoiced/voiced decision (1 bit)
fundamental period (if voiced, ~7 bits)
gain (~5bits)
10 LPCs (10*10bits)

113 bits per frame frame. For 30ms frames and overlap of 50%, we get ca. 7.5kbps. For further compression, we can use Quantization as will be described in the upcoming sections.

Overview¶

We modeled the vocal tract as a simple $17cm$ tube. We found, that for a tube that is closed on one end and open on the other, it will produce resonances at for waves with $4$ times the wavelength. This is because at the tubes opening, the forward traveling wave and the reflected wave interfere constructively so that the pressure is minimal and the air particle velocity is maximal when exiting the tube. For our $17cm$tube, this produces resonances at $500Hz$ and then every $1000Hz$. We then derive wave equations that describe how pressure and air velocity changes as a differential equations.

To produce formants using the tube model, we want to produce multiple resonances at once. To do so, we need multiple concatenated tubes segments with different diameters. We use differential wave equations to derive a differential system, the Kelly-Lochbaum structure. It models how an input wave is partially reflected forward and partially reflected backward by the different sized tube segments. This is because of a change in acoustic impedance at the boundaries of each segment.

This system allows us to derive a filter that, given an impulse train as input, generates an according output impulse train with the formants frequencies emphasized. Unfortunately, the filter has an infinite, albeit decaying impulse response, making us unable to compute its effect directly.

We can circumvent this issue using the ARMA model, which models the effect of the filter as a deterministic process. The moving average part is given by the unknown excitation signal, the output signal itself is modeled by the autoregressive part. We apply the z-transform to this model and can reformulate it into a filter, by dividing the excitation signal by the output signal, giving us the transfer function as a quotient of two polynomials. We transform the polynomials into a form, in which we can factorize its roots (first dividing out the first summand, making all exponents positive and the first coefficient equal to 1, then applying the fundamental theorem of algebra). This results in a formulation where the moving average is characterized as the zeroes in the filter and the autoregressive part as the poles. Since we consider the vocal tract to be a deterministic filter on the excitation signal, we ignore the moving average and only consider the autoregression, giving us a formulation for a parameterized all-pole filter, where the parameters are the coefficients of the root-factorized polynomial in the denominator.

\[ H(z)=\frac{b_{0}}{\sum\limits_{\nu=1}^{p}a_{\nu}z^{-\nu}} \]

Pasted image 20240926102530.png

This filters parameters are unknown however. To compute them, we first represent the signal as the difference of the unknown excitation, called innovation, $e(n)$ and the autoregressive sum:

\[ s(n)=b_{0}e(n)-\sum\limits_{\nu=1}^{p}a_{\nu}s(n-\nu) \]

We now see how the filter would be applied in an actual system. We choose between noise and F0 depending on our unvoiced/voiced decision, apply a gain to this excitation signal and then apply the autoregressive coefficients.

To derive the optimal parameters for our prediction $\hat{s}(n)=\sum\limits_{\nu=1}^{p}a_{\nu}s(n-\nu)$ (where the innovation is unknown), we use the minimum mean square error MMSE:

\[ \arg\min_{a_{\nu}}\mathbb{E}[(s(n)-\hat{s}(n))] \]

We solve by setting the first derivative to 0 and reducing the formulation to

$$ 0=\phi_{s}(\nu)-\sum\limits_{\mu=1}^{p}\hat{a}{\mu}\phi(\mu-\nu)

\iff\phi_{s}(\nu)=-\sum\limits_{\mu=1}^{p}\hat{a}{\mu}\phi(\mu-\nu) $$

We can solve solve this very efficiently by bringing it into the matrix notation $\phi_{s}=\mathbf{R_{s}\hat{a}_{\nu}}$. Because the autocorrelation matrix $\mathbf{R}$ is Topelitz (same elements diagonal) and symmetrical, it can be solved very efficiently using the Levnison-Durbin Recursion.

We now use these coefficients, called LPCs, as the parameters for our all-pole filter. We now only need to transmit the F0, the unvoiced/voiced decision, the gain ($b_{0}$) and the LPCs to faithfully reconstruct the speech signal. We observe ca. one formant per 1kHz in natural speech. Since we typically sample at ca. 8kHz and thus can reconstruct a 4kHz signal, we need to transmit 4 formants. Each formant requires 2 LPCs, thus we need to transmit ca 8 LPCs to reconstruct the signal. Since we apply further anti-aliasing filters for high frequencies, we need another 2 LPCs, giving us a total of 10. A rough guideline is to use 2 x sampling rate + 2 LPCs.

Cepstrum¶

The cepstrum is a inverse Fourier transformation of the log spectrum, capturing the periodic patterns within the spectrum. This is helpful when dealing with speech, as the harmonic patterns of the voice and the formant frequencies are periodic patterns within the frequency domain.

\[ c[n]=\frac{1}{2\pi}\int_{-\infty}^{\infty}\log|X[e^{jn\Omega}]|e^{jn\Omega}d\Omega \]

The cepstrum has equivalent terms to the spectrum:

spectrum - cepstrum
frequency - quefrency
filtering - liftering
harmonics - rhamonics
phase - saphe

Pasted image 20240926112958.png

Since the harmonics of the voice are fast changing patterns in the spectrum (short periods) and formants are slow changing patterns (long periods), they can be differentiated really well in the cepstrum. The low quefrencies correspond to the formants and the filtering effect of the vocal tract. The higher quefrencies correspond to the harmonics of the voice, with a peak at the F0.

There is another nice property: The log turns multiplications into sums, so the a convolution in the time domain corresponds to a linear superposition in the quefrency domain:

\[ y(n)=x(n)*h(n)\iff Y(e^{j\omega})=X(e^{j\Omega})H(e^{j\Omega})\iff \tilde{y}(n)=\tilde{x}(n)+\tilde{h}(n) \]

Pasted image 20240926113555.png

The figure above shows the effect of liftering on the spectrum: We convert to the cepstrum, apply a low-pass filter, thus only keeping the quefrencies corresponding the vocal tract filter, and transform back into the frequency domain. This results in an averaging of the log-frequency spectrum. We can see how the cepstral smoothing gives a more detailed representation of the vocal tract than the LPCs, but with lower magnitude. This is a bias introduced by the logarithm and can be compensated for. The LPC also has much sharper peaks, as it is pole-filter.

Quantization¶

To handle signals computationally, we discretize them using sampling. But not only do we need to discretize signals in time but also their amplitude. We do so using quantization. When we sample a signal at the Nyquist rate or higher ($f_{s}>2f_{m}$), we can perfectly reconstruct the signal. Thats because the for a discretized signal, the spectrum is not altered, just repeated around multiples of the sampling frequency $f_{s}$. If the maximum frequency of the signal is more than half the sampling frequency, these repeated spectra will overlap and create aliasing effects. If, however, the maximum frequency is below half the sampling frequency, the spectra doesnt overlap an we can reconstruct the original signal by applying a low pass filter to only retain the first spectrum around zero. This is equivalent to a convolution with the sinc function in the time domain. For quantization, this lossless reconstruction is not possible unfortunately. Quantization extends to every transmitted information, like the LPCs, as well, so we are interested in finding a representation that has minimal loss. The loss is typically measured using the signal to noise ratio (SNR).

Uniform Quantization¶

Pasted image 20240926140031.png

This is the most basic form of quantization: The signals amplitude is subdivided into quantization levels, where the levels are one step size $\Delta x$ apart. Assigning each value of the signal to its closest quantization level gives us the quantization characteristic curve seen above. For a fixed number of bits $\omega$, we have $2^{\omega}$ quantization levels available. The maximum amplitude is defined as $x_{max}$, so we have a total range $[-x_{max}, x_{max}]$. For speech, we typically use 12 to 16 bits, for audio in general 16 to 32 bits. Using non-uniform quantization, we can lower the amount of bits to ~8 bits (ISDN).

Signal to Noise Ratio (SNR)¶

Pasted image 20240926144418.png

The quantization error is the difference between the source signal and the quantized waveform. For simplicity, we assume that the quantization error is noise that is uncorrelated to the source signal. This is a strong assumption if we look at the figure above, where the error is clearly correlated with the signal. But with enough bits and more complex source signals, this assumption holds true.

We quantify the quality of the signal by how much of the signals power is made up of noise. This is give by ration of signal to noise, the SNR:

\[ SNR=\frac{P_{x}}{P_{n}} \]

Signal Power of Uniform Distribution¶

The power is defined as the expectation for the squared signal $P_{x}=\mathbb{E}[x^{2}]=\int_{-\infty}^{\infty}x^{2}p_{x}(x)dx$. We first assume a uniform signal between $-x_{max}$ and $x_{max}$, which would be optimal for uniform quantization. The PDF would be $p_{\text{uniform}}=\frac{1}{2x_{max}}$ so that the area would be one.

The signals power is:

$$ \begin{align}

P_{x} &= \int_{-x_{max}}^{x_{max}}x\}\frac{1}{2x_{max}}dx & |\space x_{max}= \frac{k\Delta x}{2

&= \int_{-\frac{k\Delta x}{2}}^{\frac{k\Delta x}{2}}x^{2}\frac{1}{2(\frac{k\Delta x}{2})}dx \

&=\frac{2}{k\Delta x}\int_{0}^{\frac{k\Delta x}{2}}x^{2}dx\

&=\frac{2}{k\Delta x}\frac{1}{3}\frac{k^{3}\Delta x^{3}}{2\}

&=\frac{k^{2}\Delta x^{2}}{12}

\end{align} $$

Noise Power of Quantized Signal¶

Then noise power is:

\[ P_{e}=\int_{-\infty}^{\infty}e(x)^2p_{x}(x)dx \]

where $e(x)=\hat{x}-x$. If we consider the characteristic curve $\hat{x}$, it is equal to $x$ at $\frac{\Delta x}{2}$ and the for every step size $\Delta x$. In between, the error increases linearly. So we can restrict the integration limit to one step size and just multiply it times the number of steps $k$:

$$ \begin{align}

P_{n} &= k\int_{0}^{\Delta x}(x-\frac{\Delta x}{2})^{2}\frac{1}{2 x_{max}}dx & |\space x_{max}=\frac{k\Delta x}{1}\

&= \frac{1}{\Delta x}\left[\frac{1}{3}(x-\frac{\Delta x}{2})^{{3}\right]_{0}}\

&= \frac{1}{\Delta x}\frac{1}{3}[(\Delta x-\frac{\Delta x}{2})^{3}+(\frac{\Delta x^{3}}{8})]\

&= \frac{\Delta x^{2}}{12}

\end{align} $$

SNR of Uniform Quantization¶

This gives us the following signal to noise ratio:

\[ SNR=\frac{P_{x}}{P_{n}}=k^2=2^{2\omega} \]

So the SNR is only dependent on the step-size, which in turn only depends on the number of bits we spent. It exponentially reduces with increasing bits. We can directly convert the SNR into dB:

\[ SNR|_{dB}=10\log_{10}2^{2\omega}=6dB\cdot\omega \]

So with every bit spent, we increase the signal to noise ration by 6 decibels.

Pasted image 202409261551The harmonics come from the periodic impulse train used as the excitation source.25.png

This is only true for the most optimal case however, where the quantized distribution is uniform and fits perfectly between $-x_{max}$ and $x_{max}$. If we have signal that goes beyond $x_{max}$, the SNR will increase drastically. Also, if we have a non-uniform signal where most samples are densely packed in some region (e.g. for speech most samples are around zero), the SNR will be a lot higher. The graphic above shoes the difference of employing one more bit on the SNR.

Non-uniform Quantization¶

Pasted image 20240926155648.png

Pasted image 20240926160034.png

To better account for density variation in non-uniform distributions, we might be tempted to have a variable step-size by designing an adaptive characteristic curve. But we can simplify the problem by using a two-step approach named companding (compress, quantize, expand). We see in the first figure, that when we compress the signal, it looks more similar to a uniform distribution, with the probability density being spread across the whole interval. We can then just apply uniform quantization, transmit the signal and expand it again.

Logarithmic Compression¶

One way to compress signals with a lot of values close to zero is to use logarithmic compression $C(x)=c_{1}\log x +c_{2}$. We have to choose the constants in a way that keeps our output between $-x_{max}$ and $x_{max}$, meaning that $C(x_{max})=x_{max}$. We can constrain the upper bound using the formulation

\[ C(x)=c_{1}\log \frac{x}{x_{max}}+x_{max} \]

Drawing 2024-09-26 16.36.30.excalidraw#^group=BDV5Dw6rq9x_TYE3tKRfi

But the log approaches infinity for small $x$ and is undefined for negative $x$. We can solve this issue using two tricks:

We use a second function for small values, a line that smoothly transitions from the log slope and goes through the origin
For negative values, we just invert the positive part with the sign function.

Because the line goes through zero and smoothly transitions into the log slope, the function will have smooth transitions at every point. Also it will fulfill $C(-x_{max})=-x_{max}$ and $C(x_{max})=C(x_{max})$.

Pasted image 20240926162155.png

One such function is the A-Law used as the ISDN compression standard in Europe. The A-Law assumes a normalized signal and is defined as

$$ C(x)=\begin{cases}

\text{sign}(x)\frac{1+\log(A|x|)}{1+\log(A)}) &\frac{1}{A}\le|x|\le1 \

\text{sign}(x)\frac{A|x|}{1+\log{A}} & |x|<\frac{1}{A}

\end{cases} $$

So for absolute values greater than $\frac{1}{A}$, we use the logarithm, for values below we use a linear function. Both are normalized by the same factor $\frac{1}{1+\log A}$. For $A=1$, there is no compression at all. The approximate standard value for ISDN is $A=87.6$. Utilizing this compression allows for 24dB gain in SNR compared to the uniform quantization without companding. Typically, 12 to 16 bits are used for quantization. There is also a $\mu$-Law used in the US and Japan. At a sampling rate of 8kHz (ISDN), we have a bit rate of ~64bps, which is a 50% decrease from not using compounding. This only works well for speech however, for general audio compression, we have to use uniform quantization.

Adaptive Quantization¶

We can make use of the fact that we apply quantization on short window segment and choose a step-size according to the dynamic of the sound in our window segment. We found that the quantization noise power is $\frac{\Delta x^{2}}{12}$, so a choosing a $\Delta x$ as small as possible for our current window segment will yield the smallest amount of noise. The noise will be louder for for louder sounds, but also be masked by those loud sounds. We'd choose a step size proportional to the empirical standard deviation:

\[ \Delta x(n)=c\hat\sigma_{x}(x) \]

Now, we somehow need to transmit the step index as before, but also the step size, as it is not standardized for both sender and receiver anymore. We can do so using one of two methods.

Adaptive Quantization Forward (AQF)¶

Pasted image 20240926180813.png

This quantization method uses a sliding window of $N-1$ samples and buffers $N$ samples. Those buffered samples are used to compute the empirical variation

\[ \hat\sigma^{2}_{x}(n)=\frac{1}{N}\sum\limits_{m=0}^{N-1}x^{2}(n+m) \]

and then quantizes the signal based on this measure. The variation is only calculated once the buffer is filled up, so the calculated variance is used for blocks of $N$ samples. Because the receiver has no way of knowing the step size beforehand, we have to transmit not only the quantization index but also the variable step-size. The receiver then dequantizes the signal by using the quantization index and the transmitted stepsize.

Adaptive Quantization Backward (ABF)¶

Pasted image 20240926181540.png

The backward variant of this algorithm takes advantage of the fact, that the sender has access to the past quantized samples. The signals variation is computed using a weighted average of the previous predicted variations and power of the previous sample:

\[ \hat\sigma^{2}_{\hat x}(n)=\alpha\hat\sigma^{2}_{x}(n-1)+(1-\alpha)\hat x^{2}(n-1) \]

where $0<\alpha<1$. For values close to 1, the variation will be smoothed heavily and mostly the previous prediction will be used as an estimate for variation. For zero, the mostly the the power of the previous sample will be used. Compared to the forward variant, the AQB estimates the variance per sample. Furthermore, we don't have to transmit the stepsize anymore, as it is calculated from past samples anyways.

Comparisons of Quantization Methods¶

Pasted image 20240926182938.png

As can be seen in the figure above, the AQB shows a much more accurate dequantized signal power. This is because the AQF estimates the variance once per block of $N$ samples, which fill the buffer. AQB on the other hand estimates it every sample. Each estimate might be less accurate, but is accurate enough to provide a better SNR per sample.

Pasted image 20240926183624.png

As we can see in the above figure, uniform quantization works only well where the signal power is high. Once it dips, for example in breaks between phonemes or words, the SNR shoots up. A-Law and $\mu$-Law improve on that using the logarithmic compounding scheme, especially for moderately high signal power. But also for them, if the signal power is very low, the linear interpolation does not give the best step-size and the SNR is still high. Lastly, the AQF scheme always has a high SNR, and only short and minor dips, when the signal power changes and the variance estimate needs to adapt. AQF would be expected to have an even better SNR.

Vector Quantization¶

Pasted image 20240926192703.png

Lots of signals are multidimensional and we have to transmit vectors for each timestep (LPCs, Cepstral Coefficients, videos, …). While we can often store and process large amounts of data, transmission is often a bottleneck. So one way to trade-off transmissions size against storage size is using vector code books. The code books contain indexed centroids, where each centroid is like a multidimensional equivalent of a step size apart. The transmitter calculates the closest centroid in the code book to the vector they want to transmit. Then they send the index to the receiver, who looks up the centroid in the code book. The distance of the source vector to the centroid is the quantization error.

Like with the step size, we can store $k=2^{\omega}$ centroids per bit $\omega$. The efficiency also depends on the length of the vector: $\hat\omega=\frac{1}{L}\log_{2}k\space\frac{\text{bits}}{\text{vector element}}$. So the larger the vector, the more bits are needed.

Pasted image 20240926193416.png

We can leverage the fact, that successive samples are often correlated. The figure above shows the joint distribution of successive samples for speech. As speech is a zero mean process, most points are around the origin. But we see that there is a 45° regression fit, indicating correlated samples. Using K-means (or more advanced, the Linde-Buzo-Gray algorithm), we can find produce $k$ centroids that are distributed optimally for the given distribution. K-means normally is used to cluster groups, but if for each iteration

we assign each samples to its closest centroid
move the centroid to the center of gravity

we will actually produce cells where each cell contains roughly the same amount of samples. Those cells are called Voronoi cells.

Speech Coding¶

The before-mentioned quantization methods are applicable to all kinds of signals and don't incorporate a-priori knowledge about the signal (like that is is speech) into their schema. Codecs on the other hand try to do exactly that. Using a-priori knowledge about how speech is produced (tube model -> LPCs) or about how speech is perceived (mp3), we can further reduce the bitrates needed for transmission or increase the sound quality for the same bitrate.

[!Typical bit rates]

Text: 50bps

LPC: 2.7kbps

CD: 705.6 kbps per channel

Studio audio: 1152kbps per channel

Pasted image 20240926195527.png

The simplest way to reduce bandwidth is by applying analysis filters and quantizing the signal (a). We are however limited in the compression achievable. Parameterized codecs (b) like LPCs can reduce the bandwidth drastically, however at the largest cost of sound quality. They distort the original sound a lot, as all of the sound must be resynthesized (e.g. requiring an estimated excitation signal). Hybrids system (c) combine both, transmitting some parts of the sound as a parameter and some parts quantized signal (e.g. using LPCs, but based on actual excitation pattern).

Pasted image 20240926200046.png

The different quality compared against the [[Mean Opinion Score]] can be seen here, as can be seen, the gain in sound quality compared to the bitrate required is different for each kind of codec.

Waveform Codecs¶

Waveform coding is the most straight forward way of compressing audio data. Waveform codecs use special filters and quantization to reduce the dynamic range of the time-varying signal, but still transmit a residual of that signal. The default waveform codec is Pulse Code Modulation, short PCM. PCM consists of

sampling (making signal time discrete)
quantizing (making signal value discrete)
encoding (e.g qauntization index to binary)

Pasted image 20240926203037.png

To reduce the dynamic range when transmitting, we can use Differential Pulse Code Modulation, DPCM. This method subtracts weighted successive samples from each other: $x(k)-a_{1}x(k-1)-\ldots-a_{n}x(k-n)$. This is just an order-n filter with coefficients $a_{i}$. Often, just one previous sample is used. DPCM decorrelates successive samples, whitens the signal and reduces dynamic range, making quantization more efficient. This leads to a reduction of ca. 50% in required bitrate as compared to PCM. As with the adaptive quantization methods, there is an open- and a closed-loop variant of DPCM.

The open loop system computes the filter coefficients from the source signal, applies the filtering to the source signal and then quantizes the differential signal. In the closed loop system, the filtering is applied first, then the quantization is applied and only the the coefficients for the next time step are computed. In the open system, the transmitter has to send the coefficients as well, as they need to be computed from the source signal. In the closed loop system, the receiver can compute them just as the sender did, using the already quantized signal. In both cases, the receiver uses the coefficients applied to the previous signal samples and adds it to the incoming signal.

Noise Masking¶

While the closed-loop DPCM is more bit efficient, the open loop system has an interesting advantage: The quantization noise is also masked by the synthesis filter, shaping the noise according to the speech and making it less noticeable. The reason can be derived from the z-domain representation:

The transmitted signal is the sum of the differential signal and quantization noise, the differential signal in turn is the sum of the source signal and source signal with the differential filter applied:

$$ \begin{align}

\tilde D(z) &= \Delta(z)+D(z)\

&=\Delta(z)+X(z)-A(z)X(z) \

&=\Delta(z)+X(z)(1-A(z))

\end{align} $$

And for the receiver side:

$$ \begin{align}

&&Y(z) &= \tilde D(z)+A(z)Y(z)\

\iff && Y(z)-A(z)Y(z) &= \tilde D(z)\

\iff && Y(z)(1-A(z)) &= \tilde D(z)\

\iff && Y(z) &= \frac{\tilde D(z)}{(1-A(z))} &|\space\text{substitute }\tilde D\

&&& = \frac{\Delta(z)+X(z)(1-A(z))}{1-A(z)}\

&&& = \frac{\Delta(z)}{1-A(z)}+X(z)

\end{align} $$

We can see that the reconstructed signal is equal to the original plus some quantization noise. The quantization noise however is filtered by the synthesis filter as well, meaning that it takes on the shape of the noise.

For the backward system, we transmit:

\[ \tilde D(z)=X(z)(1-A(z))\Delta(z)(1-A(z)) \]

and if we perform the same derivation but substituting this formulation for $\tilde D$, we get:

\[ Y(z)=\Delta(z) + X(z) \]

So in the closed system, the quantization noise is not shaped to the speech. However, due to the low bitrate requirements for telephony, the Digitial European Cordless Telephone DECT standard still uses the closed system and saves on transmitting the filter coefficients.

Parametric Codecs¶

For parametric coding schemes, the voice is completely re-synthesized from a parameterized model of the voice. This is done using the previously mentioned Linear Prediction Coefficients. The standard LPC-10 model transmits 10 coefficients, together with the fundamental frequency, gain and a synchronization bit for the unvoiced/voiced decision. Alternatively, the MELP (Mixed Excitation Linear Prediction) model uses a weighted average of noise and impulse train, so also fricative-vowel mixed can be produced (like [v]).

LPC-10 uses the empirically optimized values window length 22.5ms, resulting in 180 samples at 8kHz. One issue with transmitting the LPCs directly is that they have a large and unbounded dynamic range, making normalization and quantization difficult. Therefore, they are typically converted into one of two forms:

Reflection Coefficients (RCs)¶

The reflections coefficients, known from the tube model, are computed by the Levinson-Durbin Recursion as a kind of byproduct. They directly translate to LPCs, but are bounded between 0 and 1.

\[ r_{i}=\frac{A_{i+1}-A_{i}}{A_{i+1}-A_{i}} \]

While being bounded, they are not very uniformly distributed, making the next representation more useful practice.

Log Area Ratios (LARs)¶

The log of the ratio of areas for successive tube segments can easily be computed from the RCs.

\[ L_{i}=\log\frac{A_{i+1}}{A_{i}}=\log\frac{1+r_{i}}{1-r_{i}} \]

Not only are LARs bounded to between 0 and 1, they also apply the companding scheme from quantization, which yields better perceptual quality due to improved accuracy for quantizing large RCs.

As can be seen, parametric codecs not only have to parameterize the signal, but do it in a way that is efficient for quantization. Just as the LPCs, the gain also profits from being log companded. In LPC-10, typically the LARs are transmitted, where the first two coefficients are quantized non-uniformly using 5bits and the other use uniform quantization with 2 to 5 bits. In total, LPC-10 transmits:

Parameter	bits, $\omega$
LPCs	41
F0	7
gain	5
sync	1
total	54

At 22.5 frames per second, this gives: $\frac{54 \text{bit/frame}}{22.5 \text{frames/ms}}=2.4kbps$.

Hybrid Codecs¶

Hybrid coding schemes combine the LPC schema with excitation signal residuals.

Residual Excited Linear Prediction (RELP)¶

Pasted image 20240927124703.png

The RELP algorithm creates a residual excitation signal by inversely applying the LPC filter, heavily compresses it and transmits it together with the LPCs.

1. Windowing The RELP scheme first computes the LPCs over a window of $N$ samples.

2. Creating the residual Then, using the LPCs, a differential signal is created, which is subtracted from the input signal to only keep the residual excitation signal. Basically, the inverse LP filter is applied to the input to only keep the unknown part of the signal.

3. Low-pass Filtering Pasted image 20240927131008.png The residual signal is low-pass filtered to $f_{m}=\frac{fs}{2r}$.

4. Subsampling $\downarrow r$ Pasted image 20240927131034.png The residual signal is then sub sampled using, keeping only every $r$-th component. This is done by multiplying with a [[Dirac Comb]] $\text{Ш}_{\frac{1}{f_{s}/r}}$ with pulse width $\frac{r}{f_{s}}$. This corresponds to a convolution of a dirac comb with pulse width $\frac{f_{s}}{r}$. The spectrum is repeated, but without aliasing effects due to the low pass filtering in step 3.

5. Quantization & Multiplexing The residual signal is quantized and multiplexed together with the LP coefficients for transmission. The LPCs are vector quantized.

Reconstruction The receiver now applies the same steps in reverse. When upsampling the signal again by inserting zeroes in between samples, the spectrum is limited again to $f_{m}=\frac{f_{s}}{2}$. However, the spectrum is still distorted from the sub sampling: Pasted image 20240927131629.png This means, more energy is present in the higher frequencies, leading to a metallic sound. Due to the natural 6dB attenuation however, this effect is less pronounced.

Codebook Excited Linear Prediction (CELP)¶

Pasted image 20240927131817.png

As the successor to RELP, CELP doesn't transmit the residual signal directly, but uses vector quantization for both the LPCs and the residual. That means the receiver has a codebook with many possible excitation signals in it. The sender chooses the index of the best fitting codebook entry. This is done by Analysis-by-Synthesis:

The sender first computes the LPCs. They then synthesize multiple output signals from several codebook entries and compute the perceptual distance to the input signal. This way, the excitation signal which produces the best reconstruction after synthesis is chosen.

Perceptual Codecs¶

Perceptual coding uses a-priori knowledge about the human sound perception. They mainly exploit the fact that when humans perceive a frequency, other frequencies close to it will be masked.

Pasted image 20240927132547.png

Basically, the perception threshold for frequencies around an already perceived frequency is raised. By quantizing sound in a way so that the quantization noise falls within masked areas, the perceived noise will be a lot lower than the actual SNR. This is used in MP3 and AAC compression. Also, the noise shaping of the open-loop DPCM will have a similar effect.

Speech Enhancement¶

Speech enhancement can be sub-divided into two categories: Multi-channel enhancement like beam-former and spatial filters and single-channel enhancement like the Wiener filter.

Single-channel Enhancement¶

Single-channel enhancement must rely on the statistical properties of the input signal. If we have some a-priori knowledge about the structure of the background noise (or other undesired interference signals), we can compute a gain function to optimally attenuate the contribution of the undesired signal and only keeps the true signal.

Wiener Filter¶

The Wiener filter is the MMSE-optimal linear filter that reduces the error between the estimated clean speech and the true clean speech. It assumes the input signal to be a superposition of the true signal and the interference signal $y=s+n$.

The linearity means the filter works for stationary processes and linear systems. Counter examples would be non-stationary processes like long speech segments, where successive samples are not linear dependent or non-linear systems, like neuron excitation or chemical reactions. Furthermore, the filter will assume knowledge about the noise signal.

Derivation in time Domain¶

The derivation of the filter in the time domain is basically the same as for the LPCs. For a given input $y$, we can model the estimated signal $\hat s$ as the convolution of the input with the impulse response $g(n)$ of the Wiener filter:

\[ \hat s(n)=\sum\limits_{\nu=-\infty}^{\infty}g(\nu)y(n-\nu) \]

We can now define the MMSE objective of finding the optimal parameters:

\[ g(v)=\arg\min_{g}\mathbb{E}[(s(n)-\hat s(n))] \]

As we did in the LPC example, we can solve by setting the derivative with respect to g to 0:

Drawing 2024-09-27 15.15.54.excalidraw#^group=y4mPaiDOEmCfsbTWU8_AV

We can now apply the Fourier Transformation to get the filter in the spectral domain. The convolution turns into a multiplication:

$$ \begin{align}

&& \Phi_{YS} &= G\Phi_{YY} \

\iff && G &= \frac{\Phi_{YS}}{\Phi_{YY}}=\frac{\Phi_{SS}}{\Phi_{SS}+\Phi_{NN}}

\end{align} $$

The PSD of the cross correlation between $y$ and $s$ is the same as the autocorrelation of $s$, as the noise in $y$ is completely uncorrelated to $s$ and thus will not change the PSD.

Derivation in Spectral Domain¶

The derivation in the spectral domain is more involved. Not only do we need to find the optimal amplitude of $G$, but also the optimal phase $\Phi_{G_{opt}}$. This is done in two steps. First we simplify the MMSE objective

\[ G_{opt}=\arg\min_{G}\mathbb{E}\left[|S_{k}-G_{k}Y_{k}|^{2}\right] \]

For the resulting expression, we see that the only term that involves the phase of $G$ is $-2\mathcal{R}\{G^{*}\mathbb{E}(S_{k}Y^{*}_{k})\}$. The overall expression is minimized if we find the phase that maximizes this term (as it is subtracted). The phase $\varphi=0$ maximizes the real part, as $\cos{\varphi}=1$, so when $G$ has the same phase as $\mathbb{E}(S_{k}Y^{*}_{k})$, then multiplying with the complex conjugate $G^{*}$ will cancel out the phase and give the maximum real part. So we know

\[ \phi_{G_{opt}}=\angle\space\mathbb{E}(S_{k}Y_{k}) \]

Now we can just try to minimize the magnitude of the overall expression. We do so by taking the derivative and solving for $0$. This yields the expression:

$$ \begin{align}

|G_{opt}| &= \frac{|\mathbb{E}(S_{k}Y_{k})|}{\mathbb{E}(|Y_{k}|^{2})} \

G_{opt}=|G_{opt}|e^{{j\phi_{opt}}&=\frac{\mathbb{E}(S_{k}Y_{k})}{\mathbb{E}(|Y_{k}|}})

\end{align} $$

Now we can substitute the definition of the observed signal, $Y=S+N$:

$$ \begin{align}

G_{opt} &= \frac{|\mathbb{E}(S_{k}Y_{k})|}{\mathbb{E}(|Y_{k}|^{2})}\

&=\frac{|\mathbb{E}(S_{k}(S_{k} + N_{k})^{{2})|}{\mathbb{E}((S_{k}+N_{k})(S_{k}+N_{k})}(N)=0\}} &|\space\mathbb{E

&= \frac{\mathbb{E}(|S_{k}|^{{2})}{\mathbb{E}(|S_{k}|} \})+\mathbb{E}(|N_{k}|^{2})

&= \frac{\sigma_{s,k}^{{2}}{\sigma_{s,k}}}+\sigma_{n,k}^{2}

\end{align} $$

The result is analogous to the time domain derivation, only that PSDs are switched with variances.

[!Full derivation]-

Effect of the Filter¶

Pasted image 20240928175349.png

The figure above shows the effect of the filter on the PSD of a single frame. As can be seen, the peaks, created by the harmonics of the speech, are unchanged by the filter. In the original signal, noise fills the gaps between the spectral peaks. The Wiener filter attenuates the energy of those frequencies and thus reduces the noise.

Variants¶

Non-Linear¶

We derived the Wiener filter in a linear form. That means, that we expected our observed signal $Y$ to be a linear, so that we could find a filter that would work independent of the input signal. But we can also derive an MMSE optimal filter where we drop this linearity constraint, so we try to find a filter that is optimal given the input signal:

\[ \hat S = \arg\min_{\hat S}\mathbb{E}[(S-\hat S)^{2}|Y]=\mathbb{E}(S|Y) \]

If we assume speech and noise to be Gaussian, we can find a closed form solution to this problem. We know

the prior $p(S)=\mathcal{N}(\mu_{S},\sigma_{S}^{2})$,
the likelihood $p(Y|S)=\mathcal{N}(\mu_{Y}-\mu_{S},\sigma_{N}^{2})$ (given the signal, there is only noise left)
the evidence $p(Y)=\mathcal{N}(\mu_{Y},\sigma_{S}^{2}+\sigma_{N}^{2})$

We can then use Bayes formula to get $p(S|Y)=\frac{p(Y|S)p(Y)}{p(S)}$ and take the expectation to get the non-linear MMSE optimal solution. This, in fact, again gives the Wiener filter, making it also the MMSE optimal non-linear filter, if speech and noise a Gaussian.

Spectral Subtraction¶

Another method to obtain the spectral coefficients of the true clean speech is to subtract the noise PSD from the observed signals PSD. However, this results in an error term caused by the cross-term of the squared sum of signal and noise:

$$ \begin{align}

|\hat S|^{2} &= |Y|^{2}-|N| \

&= (S+N)(S+N)^{_}-|N| \

&= |S|^{2}+|N|{SN}+2\mathcal{R^{_}}-|N|\

&= |S|^{{2}+2\mathcal{R}{SN}}

\end{align} $$

But when applying the expectation, since the noise is zero mean and uncorrelated with the signal, the cross-term will go to zero and we get $\mathbb{E}(|\hat S|^{2})=\mathbb{E}(|S|^{2})=\sigma^{2}_{s}$. We can also obtain a filter for spectral subtraction, using a trick of factoring:

$$ \begin{align}

\mathbb{E}(|\hat S|^{2}) &= \mathbb{E}(|Y|^{{2})-\mathbb{E}(|N|}) \

&= \mathbb{E}(|Y|^{{2})\left(1-\frac{\mathbb{E}(|N}\right) \}|)}{\mathbb{E}(|Y|^{2})

&= \mathbb{E}(|Y|^{{2})\left(\frac{\mathbb{E}(|Y|}(|N|})-\mathbb{E^{{2})}{\mathbb{E}(|Y|}\right)\})

&= \mathbb{E}(|Y|^{{2})\left(\frac{\mathbb{E}(|S|}\right)\})}{\mathbb{E}(|Y|^{2})

\end{align} $$

So we see that our estimated signal is the input signal multiplied with the speech variance divided by the observed variance:

\[ G_{opt}=\frac{\sigma_{s}^{2}}{\sigma_{y}^{2}}=\frac{\sigma_{s}^{2}}{\sigma_{s}^{2}+\sigma_{n}^{2}} \]

This can also be reformulated using the smoothed estimates of the speech and the noise periodograms:

\[ \hat{|S|}^{2}=|Y|^{2}\left(1-\frac{\overline{|N|^{2}}}{\overline{|Y|^{2}}}\right) \]

Estimation of Noise PSD¶

The Wiener filter and spectral subtraction both require the periodograms or variances of speech and noise to be known. However, we only observe the noisy signal and somehow need extract the information about the noise periodogram from that signal. To do so, there are multiple approaches, all of which estimate the noise periodogram. The speech periodogram can be estimated when the noise periodogram is known.

Voice Activity Detection¶

The most straight-forward approach is to use VAD. If we have a reliable way to determine if a sample contains speech or not, we can use the empirical variance over the subset of samples where speech is inactive:

\[ \hat \sigma_{n}^{2}=\frac{1}{\mathcal{N}}\sum\limits_{n\in\mathcal{N}}|y(n)|^{2} \]

This can also be done in the spectral domain over each frequency bin to retrieve the PSD estimation.

Issues

The estimate can only be updated when no speech is present. If the noise changes during speech, it will bleed in.
Over/underestimating speech activity will lead to noise or distortions.

Minimum Statistics¶

Pasted image 20240929173555.png

Empirical observations show that the minimum of the signal periodogram over time are related to the true noise PSD. The idea is simple, if no speech is present, the signals power will be lower and the remaining energy will be from the noise. We don't take the minimum directly from the signal periodogram however, but from a recursively smoothed version:

\[ \overline{|Y_{k}(l)|^{2}}=\alpha|Y_{k}(l-1)|^{2}+(1-\alpha)(|Y_{k}(l)|^{2}) \]

Then, the minimum is taken over a window-length of ~1.5 seconds.

Issues

Noise change within the window length can not be accounted for, bleeding into the sound.
Minimum underestimates noise, as, due to statistical flucuations, the minimum will be below the mean of the noise. Requires compensation.
If there is no pause in speech during a frame, the minimum will be estimate of speech periodogram, leading to distortions.

SPP - Speech Presence Probability¶

This method is similar to the VAD approach, but instead of the yes or no approach of speech presence, SPP uses the posterior probability of speech presence given the current observations to compute a weighted average the current signal power and previous estimates.

We use Bayes Theorem to compute the posterior speech probability:

$p(H)$ - prior probability of hypothesis
$p(Y)$ - evidence
$p(H|Y)$ - a-posteriori probability of hypothesis given evidence.
$p(Y|H)$ - likelihood of evidence give hypothesis

\[ \begin{align} p(H|Y) &= \frac{p(H,Y)}{p(Y)} \\ &=\frac{p(Y|H)p(H)}{P(Y)}\\ &=\frac{p(Y|H)p(H)}{\int_{H}p(Y,H)} \end{align} \]

We define the hypothesis $H_{1}$ as speech being present, $H_{0}$ being our null-hypothesis. We then can define the posterior probability:

\[ p(H_{1}|Y)=\frac{p(Y|H_{1})p(H_{1})}{\int_{H}p(Y,H)}=\frac{p(Y|H_{1})p(H_{1})}{p(Y|H_{1})p(H_{1})+p(Y|H_{0})p(H_{0})} \]

If we now can set up a few assumptions:

$p(H_{0})=p(H_{1})=0.5$,

$p(Y|H_{0})=\mathcal{N}(0,\sigma_{n})$

$p(Y|H_{1})=\mathcal{N}(0,\sigma_{s}+\sigma_{n})=\mathcal{N}(0,\sigma_{n}(1+\xi_{H_{1}}))$ where $\xi_{H_{1}}=15dB$

We can then plug in the Gaussian formulas and obtain a closed form solution for the speech probability given our priors and the noise periodogram:

Drawing 2024-09-29 18.15.37.excalidraw#^group=5E7Wsf5ArJCw-rqMV5ztn

Now we can update the speech posterior and noise periodogram in tandem, using one to estimate the other. We estimate the current noise periodogram by using the signal periodogram weighted by probability that no speech is present (the less likely we have speech present, the more likely the signal contains noise) and the previous noise PSD estimate weighted by the probability, that speech is present.

For the noise PSD we then take a weighted average of the current noise periodogram estimate and the previous noise PSD estimate.

Start with some base noise PSD estimate.
Compute posterior speech probability as above.
Compute estimated noise periodogram: $|\hat N|^{2}=P(H_{0}|Y)|Y|^{2}+P(H_{1}|Y)\hat \sigma_{N}^{2}$
Compute estimated noise PSD as weighted average of current estimated periodogram and previous estimate of noise PSD: $\sigma_{N}^{2}=\alpha\space\hat\sigma_{N}^{2}(l-1)^{2}+(1-\alpha)|\hat N(l)|^{2}$
For next frame, go back to (2)

Pasted image 20240929184447.png

The figure shows the noise estimate from minimum statistics and SPP for sine noise. The minimum statistics fails to deal with the changing noise as it uses the minimum of the last 1.5 seconds. It only just starts tracking a rise in noise after almost a complete period of the noise and instantly goes down again. SPP on the other hand does constantly update the noise estimate for each frame, without any look-back, making it much more accurate for non-stationary noise.

Estimation of Speech PSD¶

We can estimate the PSD of the true clean speech using Maximum Likelihood estimation. We choose the PSD so that the likelihood of our observations is maximized.

\[ \hat\sigma_{S}^{2}=\arg\max_{\sigma_{S}^{2}}\prod_{n=0}^{L-1}p(Y(l-n)|\sigma_{S}^{2}) \]

Using our definition of $p(Y|\sigma_{S}^{2})=\mathcal{N}(0,\sigma_{S}^{2}+\sigma_{N}^{2})$, we can set the derivative to zero to get:

$$ \begin{align}

0 &\overset{!}{=}\frac{\partial}{\partial\sigma_{S}^{{2}}\prod_{n=0}})\}p(Y(l-n)|\sigma_{S}^{2

&=\frac{1}{L}\sum\limits_{n=0}^{{L-1}|Y(l-n)|}(l)} -\sigma_{S}^{N

\end{align} $$

So we take the average signal periodogram and subtract the noise PSD. For $L=1$, this is the same as spectral subtraction. To make this more computationally efficient and more robust to outliers, we can use a decision directed approach using weighted average with our posterior speech estimate:

\[ \hat\sigma_{S}^{2}=\alpha_{dd}|\hat S(l-1)|^{2}+(1-\alpha_{dd})\max(0, |Y(l)|^{2}-\sigma_{N}^{2}) \]

where $\alpha_{dd}=p(H_{0}|Y)$. So when the speech probability is low, we mainly use the last estimate of the speech PSD. If the probability of speech is high, we use spectral subtraction to get the signal PSD. The maximum operation is to avoid negative values from computational errors.

Temporal Cepstral Smoothing¶

Another denoising strategy is smoothing cepstral coefficients over time. This is done by computing the cepstral coefficients of the signal and then computing the moving average. In the cepstral domain, harmonic and vocal tract quefrencies are more clearly separated from noise and thus the temporal smoothing keeps the speech information while reducing the overall noise.

Multi-channel Enhancement¶

This category of enhancement is based on the idea to exploit the directionality of sound. For a source that is somewhat to the left or right of a speaker pair, we can use the time-delay between the microphones of when a sound arrives. We can easily find the delay using the maximum of the cross-correlation:

\[ \tau_{0}=\arg\max_{\tau}\gamma_{y_{2}y_{1}}(\tau)=\arg\max_{\tau}\mathbb{E}(y_{2}(t)(t-\tau)) \]

Delay-and-sum Beamformer¶

One way to leverage the time delay is by using adding up the signals of both microphones, where we delay one signal by $\tau$:

\[ y(t)=\frac{1}{2}(y_{1}(t-\tau_{0})+y_{2}(t)) \]

The SNR will be improved as long as…

the sound source is distanced far enough, so that waves are parallel
and the noise is not spatially correlated

because the signals will add up constructively, the noise will interfere destructively.

Minimum Variance Distortionless Response (MVDR)¶

The MVDR beamformer is an extension to an array of $M$ microphones:

\[ \mathbf{y}=[Y_{1}, \ldots,Y_{M}]^{T}=\mathbf{a}S+n \]

The signal vector $\mathbf{S}=\mathbf{a}S$ is the source signal multiplied with a steering vector $a$. The steering vector applies a gain to adjust phase and amplitude of the true signal as perceived by a microphone in a certain spatial position. The MVDR filter is then found by finding a filter that minimizes the noise power without changing the power of the true signal:

\[ \mathbf{h}_{mvdr}=\arg\min_{\mathbf{h}}\mathbb{E}(|\mathbf{h}^{H}\mathbf{n}|^{2})\quad\text{s.t. }\mathbf{h}^{H}\mathbf{s}=S \]

This can be reduced to the filter

\[ \hat s_{mvdr}=\frac{\mathbf{a}^{H}\mathbf{\Phi_{NN}^{-1}}}{\mathbf{a}^{H}\mathbf{\Phi}_{NN}^{-1}\mathbf{a}}\mathbf{y} \]

where $\mathbf{a}$ is the steering vector and $\Phi_{NN}$ is the noise covariance matrix of the form $M\times M$.

Pasted image 20240929221254.png

The steering vector steers the lobe of beam toward the desired sound source while steering nulls (zero-filter) towards on the interferer. This way, $M-1$ interferers can be canceled.

Multi-channel Wiener Filter¶

We can also extend the Wiener filter to a multi-microphone array. We do so by extending our definition of the Wiener filter as a convolutional kernel:

Normal: $\hat s(n)=\sum\limits_{\nu=-\infty }^{\infty}h(n)y(n-\nu)$
Extended: $\hat s(n)=\sum\limits_{m=1}^{M}\sum\limits_{-\nu}^{\infty}h_{m}(n)y_{m}(n-\nu)$

So instead of just searching for the MMSE optimal filter that minimizes the distance between estimated signal and true signal, we estimate the signal as the sum of filtered signal, where the Wiener filter is a vector of length $M$. The MMSE solution is given as…

\[ \sum\limits_{m=1}^{M}\sum\limits_{\nu=-\infty}^{\infty}h_{m}\gamma_{y_{m^{\prime}}y_{m}}(i-\nu)=\gamma_{y_{m^{\prime}}s}(i) \]

Note the similarity to the single-channel solution $\gamma_{YS}(n)=\sum\limits_{\nu=-\infty}^{\infty}g(\nu)\gamma_{YY}(n-\nu)$. This time however, the autocorrelation of the observed signal is switched with the cross-correlation of the microphone $m^{\prime}$ with all other observed signals. Transforming into the spectral domain gives us:

\[ \sum\limits_{m=1}^{M}H_{m}(z)\Phi_{Y_{m}Y_{m^{\prime}}}=\Phi_{Y_{m^{\prime}}}\iff\mathbf{\Phi_{YY}h}=\mathbf{\Phi}_{\mathbf{Y}S}\iff \mathbf{h}=\Phi_{YY}^{-1}\Phi_{YS} \]

The multichannel Wiener filter thus is the inverted covariance matrix of the observed signal times the cross-covariance between observed signal and true signal.

[!Note] We can apply the definition $Y=S+N$ and transform the multi-channel Wiener filter into a two filters:

The MVDR beamformer

The single channel Wiener filter with steered noise

The multi-channel Wiener filter thus is the same as first applying the MVDR beamformer, followed by the normal Wiener filter.

Machine Learning Approaches¶

Speech Separation¶

Deep Clustering¶

We can separate speakers by using a deep clustering approach: We first apply the STFT transform to obtain $N$ time-frequency bins (TF bins) and assume that we can separate the TF bins into sets, where each TF bin in a set is dominated by one speaker. We then create speaker specific spectrograms by keeping only the TF bins from their set and setting all other coefficients to zero. We then apply the ISTFT to get the speaker specific signals.

The separation is done using an indicator vector $y_{i,c}$ , which indicates whether TF bin $i$ belongs to speaker/cluster $c$. We can then form the Affinity matrix $A=YY^{T}$, where the entry $A_{i,j}$ indicates that elements $i$ and $j$ belong to the the same speaker.

For training, we have indicator vectors available. Each time-frequency bin is projected into a $V\in\mathbb{R}^{N\times D}$-dimensional embedding, in which speakers are disentangled. Basically, for each time-frequency bin, we have a $D$ dimensional vector, that points in different directions for different speakers and in a similar direction for the same speaker. This is done by training the embeddings to estimate the affinity matrix $\hat A=VV^{T}$ such that its difference to the real affinity matrix $A=YY^{T}$ is as small as possible. This is the case if similar speakers are represented by similar embedding vectors and dissimilar speakers have very different embedding vectors.

The model performs well for speakers that have a different pitch. But it doesn't seem to pick on vocal tract specific differences, thus the source separation is limited for similar speaker pairs.

Conv-TasNet¶

The Conv-TasNet model is an encoder-decoder architecture. In the latent space, a Separation Network is trained (based on dilated convolutions with skip-connections) to create a speaker-mask that attenuates all frequencies not related to a specific speaker. This model works outstandingly well for speaker separation. It however has the problem, that its performance is very data dependent and might overfit to certain datasets.

Building on top of that ,a later approach tries to make the encoder-decoder structure more deterministic by utilizing a Gammatone Filterbank that performs similarly well to the encoder-decoder learned in the original paper, but is based on signal processing principles. The filter bank is analytically derived and performs faster than the original model, without sacrificing much performance.

Speech Recognition¶

Speech recognition means translating a speech audio signal into a textual representation. This is typically a two-step process, first we must extract features, then we need to classify the features. However, this includes some major challenges:

ambiguous sentences (sound phonetically the same, but have different meanings)
lack of separation between words
extrinsic variables like noise and reverberations
intrinsic variables like
- pitch, vocal tract and dialects (inter-speaker)
- prosody, like emotion, timing and irony (intra-speaker)

Feature Extraction¶

Formants¶

A very basic approach could be using the formants related to different phonemes. This however is not very robust:

Higher pitched voices sample the vocal envelope more sparsely and might even miss out certain formants
The spectral envelope differs slightly between speakers
Noise and reverberations also distort formants

Mel-frequency Cepstral Coefficients (MFSCs)¶

MFCCs are computed similarly to the Power Cepstrum. However, they apply extra Mel-Filters, which that are based on the human perception:

Pasted image 20240625113559.png

As can be seen in the figure, the Mel-filter bank applies overlapping triangular windows to different frequency ranges. This transforms the periodogram to the #Mel Scale, where the first 1000Hz are relatively constant. Higher frequencies are perceived more sensitively but less differentiated. The mel filter smooth out higher frequencies and attenuates them to produce a spectrum that is more similar to how humans perceive sounds.

The MFCCs are typically computed in 5 steps:

STFT$(\cdot)$ Produce windowed spectrum through Fourier Transformation#Short Time Fourier Transform (STFT).
$|\cdot|^{2}$ Calculate Power Spectral Density (PSD)#Periodogram by squaring the spectra .
Mel-filters Multiply spectrum of each frame with #Mel-filter bank (see above)
log$(\cdot)$ Apply logarithm to each frame.
IDCT$(\cdot)$ Apply inverse Fourier Transformation#Discrete Cosine Transformation

The inverse Fourier Transformation#Discrete Cosine Transformation instead of a normal IDFT is used as the coefficients already have to be real-valued, so a IDCT is more efficient.

Mean Subtraction & Variance Normalization¶

In the cepstral domain, any channel-specific distortions are not time-varying and therefore just a constant offset that is added to each coefficient. By subtracting the global empirical mean from the coefficients we can pretty much completely remove channel specific distortions. We can the reduce the effects of distortions by dividing by the global empirical variance.

Classification¶

To classify features, we need to match an extracted feature vector to possibly matching vectors from a data set. But to so, we have to consider the variabilities in speech utterances. Due to the intrinsic and extrinsic variabilities, the relationship between acoustic feature and phoneme features is often not linear, but time-distorted. To find the best global match to a feature that takes that variability into account, we have two common methods.

Dynamic Time Warping (DTW)¶

If we assumed a source and a reference utterance to both progress synchronously, we could just compute the euclidean distance between the MFCC vector for each frame and sum up the cumultative distance for all frames. But in DTW, we assume that the source could progress faster than the reference utterance (or the other way around).

We then compute a distance matrix between the source and reference utterance at different time frames $l$ and $\lambda$ as

\[ d(l,\lambda)=||\hat{x(l)}-\hat{y(\lambda)}||=\sqrt{\sum\limits_n(x_{n}(l)-y_{n}(\lambda))^2} \]

We furthermore accept, that segments of source and reference utterances can differ in length. That means, we have three options:

source and reference segments progress linearly in time ($l+1, \lambda+1$)
source segment stays the same while reference progresses in time ($l, \lambda+1$)
reference segment stays the same while source progresses in time ($l+1, \lambda$)

Pasted image 20240702112931.png

The matrix of a single MFCC for two different reference utterances is visualized in the figure below, low costs between two segments are dark pixel, high costs are white pixels:

Drawing 2024-07-02 11.03.36.excalidraw#^group=28jOML60kdUCpbeKOHL0r

We now try to find the path through the matrix from indices $l=0, \lambda=0$ to $l=N,\lambda=M$. In the figure, this means finding the lowest cost path from bottom-left to top-right. A diagonal step represents option 1, the source and reference progress linearly. A step up represents option 2, the reference utterance progresses ahead of the source (because this part of the reference utterance is longer than in the source utterance). A step to the right represents option 3, the source utterance moves ahead of the reference utterance (this part of the source utterance takes longer than in the reference).

We can utilize dynamic programming to optimize the DTW calculation.

Hidden Markov Models¶

Pasted image 20240702115939.png

A Hidden Markov Model assumes that at each frame, we are in a certain hidden state corresponding a phoneme. For each frame, we have an emission probability of observing the current MFCCs given that phoneme. From one frame to the next, we have a certain transition probability of the current hidden phoneme state changing to another phoneme state. We can define one such model for each reference utterance (trained on a large dataset) and then use the models to compute the probability of an reference utterance given the observation. We choose the most probable reference utterance as our match.

Moodle Questions¶

Introduction¶

What Are the Most Important Classes of Speech Sounds?¶

We generally differentiate between vowels, plosives and frikatives. Vowels are produced by periodic oscillations. Frikatives are produced from turbulent airflow caused mainly in the oral cavity and are well estimated using white noise. Plosives are produces by constricting and the suddenly releasing the airflow.

What is the Difference between a Phone and a Phoneme? What is an Allophone?¶

A phone is the smallest, physically differentiable building block in language production. Compared to phonemes, phones are acoustically very similar and often even perceptibly undifferentiated. Phonemes are the smallest unit that can change the meaning of an utterance and often consists of multiple, similar sounding phones, called allophones.

How Can Phones Be Categorized into an International Phonetic Alphabet?¶

Yes, International Phoneme Alphabet IPA provides a universal set of symbols to represent every possible phone.

What Comprises prosody?¶

Prosody is the content-independent information in an utterance. It is comprised of rhythm, stress and intonation patterns that convey the emotional state and emphasis and intent of the speech.

How Does Human Speech Production Work?¶

The sounds starts when airflow from the lungs passes the Glottis, which periodically opens and closes, producing a periodic impulse train. The airwaves now travel through the vocal tract, the shape of which dictates the phones that are produced. For frikatives, the glottis stays open and turbulent airflow is produced in the oral cavity.

How Can Human Speech Production Be Modeled in a Simplified Framework?¶

It can be modeled using the source-filter model, which considers speech production as a two step process. First an excitation signal is created (source) using either an impulse train at some fundamental frequency for voiced sounds or white noise for unvoiced sounds. A switch decides on which excitation signal to use. The excitation signal is then passed through a filter, which simulates the effects of the vocal tract.

What is the Difference between Formant Frequency and Fundamental Frequency¶

The fundamental frequency is the frequency of the periodic airwaves produced by the glottis. The formants are produced by the shape of the vocal tract. They are peaks within the harmonic spectrum and characteristic for specific phonemes.

What is a Formant Map?¶

Each phoneme has characteristic peaks in the spectrum. The formant map typically has two dimensions, one for the first formant (first peak) and one for the second formant (second peak). The map will contain phoneme-specific regions, because the formants are characteristic for each phoneme.

What Are the Three Essential Parts of the Human Ear?¶

The human ear is separated into the outer ear, middle ear and inner ear. The outer ear consists of the "Hörmuschel", which funnels the sound into the "Pinea". In the middle ear, the airwaves hit the "Trommelfell", from where it is transmitted via the "Hammer", "Ambos" and "Steigbügel" to the inner ear. The inner ear starts at the "Ovalfenster" to which the "Steigbügel" is connected. The membrane of the Ovalfenster transmits the airwaves into the liquid within the "Cochlea". Along the length of the winded Cochlear is the "Basilarmembran", which has varying stiffness along its length and starts resonating for different frequencies of the incoming waves at different points along its length.

Why is a Spectral Representation of Audio Signal so Easy to Interpret for Humans?¶

Because the Basilarmembran is basically performing a Fourier transformation, translating the waveform into frequency-specific neuron activations. Thus, humans perceive a spectral representation of the sound.

Fundamental Frequency Estimation¶

What is a Typical Range for the Fundamental Frequency of Humans¶

Male: ~100Hz, Female: ~200Hz, Children: ~600Hz

How is it Possible to Distinguish Female from Male Speakers in Narrowband Telephony?¶

In narrowband telephony, frequencies are capped to a range between 300hz to 3400hz. While the fundamental frequency will likely be dropped this way, the human perception does reconstruct it from the harmonies still contained in the signal. This is called the residual effect.

How Can the Fundamental Frequency of Phones Be Measured?¶

We can do it in the time domain by counting the number of peaks or zero-crossing over a time-interval. More robustly, it can be measured using the autocorrelation function (using the second peak after zero) or the PSD (using the the first peak).

How Would You Choose the segment/window Length when Estimating the Varying Fundamental Frequency of Speech? What is the Trade-off?¶

When we apply an analysis window to the time domain signal, we are interested in making it small enough so that the waveform is quasi stationary, so that we can track the changes in frequencies over time. However, if we make it too short, we might not observe enough periods to accurately estimate the spectral coefficients. A good trade-off is found at 8ms to 32ms.

Spectral Analysis of Audio Signals¶

What Are Complex Numbers and how Can They Be Represented?¶

Complex numbers are a two dimensional extension of the real number line. Numbers now describe a position on the complex plane. One dimension of the complex line are the real numbers, the other are the imaginary numbers. They can be represented in two ways:

In the cartesian form, we describe each number in far they extend into the real and imaginary dimension: $z=a+jb$, where j is the imaginary unit. In the polar form, we describe the number as a vector from the origin using an angle from the origin and a magnitude: $z=re^{j\varphi}$.

We can convert from cartesian form into polar form:

$r=\sqrt{(a^{2}+b^{2})}$
$\varphi=\arctan{(\frac{b}{a})}$

What is Euler's Relation?¶

It relates an exponential to a point on the complex plane, using the sum of a cosine and a sine:

$e^{j\varphi}=\cos(\varphi)+j\sin(\varphi)$

For what Kind of Signals Would You Use a Fourier Series Analysis, and for Which a Fourier Transform to Analyze Its Spectral Content?¶

One would use a Fourier series analysis only for periodic signals, as their spectrum is discrete and can be described by a series. The Fourier transform can be applied to non-periodic signals as well, but requires the integration over the infinite signal and yields infinitely many frequencie coefficients.

Fourier Transform Pairs: What is the Fourier Transform of¶

an Impulse?
- A constant
a rectangular function?
- The sinc function
a sinusoid?
- A positive dirac at the positive frequency of the sine and a negative dirac at the negative frequency.
a delta comb?
- Also a delta comb, but with inverse period length
a periodic signal like a sawtooth signal (qualitatively)?
- A discrete set of harmonics

What is a Linear Time-invariant (LTI) System? Give Examples¶

An LTI system has a linear and time-invariant effect on some input signal:

linear means that scaling the input also scales the output the same
and that the systems effect on the sum of inputs is the same as the summing the outputs to each individual input
time invariant means the system has the same effect on the output regardless on how we shift it in time.

Examples

Short-term effect of vocal tract on glottis excitation's
Image filters/audio filters
Simple circuits

How Can the relation between the Input and the Output of an LTI System Be Mathematically Described in time and Frequency Domain, Respectively?¶

It can be described in the time domain using the response of the system to a dirac impulse, the so called impulse response. Due to the linearity and time invariance, we can just convolve the impulse response with the input signal to yield the systems output. The fourier transform of the impulse response is the transfer function. Multiplying the input signals spectrum with the transfer function gives the spectrum of the systems output.

How Does a Discretization of the time Domain Signal Affect Its Spectrum?¶

A discrete time domain signal has a periodic spectrum. The spectrum will be repeated at the sampling frequency.

How Does a Discretization of the Spectrum of a Signal Affect Its Time-domain Representation?¶

Discretizing the spectrum is periodically extending the time domain signal, with copies at the sampling frequency in the spectral domain.

Explain the Sampling Theorem¶

Because discretizing the time domain signal leads to repeated spectra at the sampling frequency, we need to cap the maximal frequency of the time-domain signal to half the sampling frequency, otherwise the repeated spectra would overlap and create aliasing effects. This is called the Nyquist-rate: $f_{m}<\frac{f_{s}}{2}$ or $f_{s}>=2f_{m}$

What Are Typical Sampling Rates for Speech and Audio Signals, Respectively? Why?¶

Humans can perceive up to ~20kHz, so we want to sample in a way so that we are able to reconstruct frequencies up to this number. According to the Nyquist theorem, we then need to sample at at least ~40kHz, but typically loss-less audio is sample at ~44-48kHz.

Speech in particular does not require that high frequencies to be understandable. 8kHz are typically used in narrowband telephony, for HD voice 16kHz are used.

What is Cyclic Convolution, and how Can it Be Avoided?¶

Cyclic convolution occurs when want to convolve finite signals by multiplying the discretized spectra. The finite time domain signal yields a discrete spectrum. The discrete spectrum however yields a periodically repeated time domain signal and the convolution would be cyclic. So we have to add zeroes to our finite signal and the filter, so that the convolution would actually be same as when applying it to the finite time domain signal.

What Are the Pros and Cons for Tapered Spectral Analysis Windows, like a Hann Windows, when Compared to a Rectangular Window?¶

When we apply a rectangular window, we have a jump at the end of the signal. Representing such a jump in the spectral domain requires frequencies over the whole spectrum, the spectral energy thus "leaks" over the whole spectrum (spectral leakage). To avoid this, we can use tapered window like the Hann window, which yields a smoother transition and avoids jumps. But in turn it slightly distorts the whole signal, and frequency peaks will be less defined.

What is the Difference between a Wideband and a Narrowband Spectrogram Wrt the Visible Properties of Speech Signals?¶

A wideband spectrogram uses narrow windows to more accurately track frequency changes in time. The finer time resolution leads to a lower frequency resolution, frequencies are spread out across the spectrum, leading to wider frequency bands. The narrowband spectrogram does the opposite, utilizing a wider window so that the frequency bands are more narrow, but with lower time resolution.

How is a time Delay by One Sample Represented in the z-domain?¶

A time delay in the z-domain is represented as a multiplication with $z^{-1}$.

Vocal Tract Model and Linear Prediction¶

How Many Formants Do We Expect in Speech Signals per kHz? How Many in a Speech Signal Sampled at 16kHz?¶

We expect about 1 formant per 1kHz. So for signal sampled at 16kHz, which has a maximum frequency of 8kHz, we expect ca. 8 formants.

The Kelly-Lochbaum is a recursive system derived from the concatenated tube model using differential wave equations. It describes how at each tube transition, part of the wave propagates forward and part of is reflected backward. For a given excitation signal, the Kelly-Lochbaum structure gives the same output as the concatenated tube model and thus approximates the filter effect of the vocal tract.

The Kelly-Lochbaum structure can be brought into a lattice structure, showing that it models a lattice filter, which can be described a digital filter.

How Can a System with an Infinite Impulse Response Be Described with a Finite Amount of Parameters?¶

We can describe the IIR using the Autoregressive Moving Average (ARMA) or an IRR filter with finitely many zeroes and poles. It can be shown that any ARMA system can translated into an IRR filter.

The moving average models the output as the weighted sum of previous inputs. It does not utilize any output-feedback. When translating the MA system into the z-domain, we can reformulate the weights of the moving average to all be in the numerator of a transfer function, thus defining zero points of a digital filter.

An autoregressive system models the output as the weighted sum of previous outputs. It assumes that the observation from the system is completely dependent on the previous observations. When translating such a system into the z-domain, the autoregressive weights are all in the denominator of the transfer function, thus describing an all-pole filter.

What Are Linear Predictive Coefficients?¶

Why are they called "linear predictive"? What is predicted? The are called "linear predictive", because they are the coefficients of an autoregressive system, predicting the next output of the vocal tract filter from previous observations.
How are they derived? They are derived by describing the vocal tract as a autoregressive system with constant excitation: $s(n)=b_{0}e(n)-\sum\limits_{\nu=1}^{p}a_{\nu}s(n-\nu)$ We don't know the constant excitation term, so we can only model $\hat{s}(n)=-\sum\limits_{\nu=1}^{p}a_{\nu}s(n-\nu)$. We know find the optimal coefficients by defining the MMSE objective:

$$

\hat{a}{\nu}=\arg\min]$$}_{\nu}}\mathbb{E}[(s(n)-\hat{s}(n))^{2

This can be solved by taking the derivative with respect to $a_{\nu}$ and setting it to zero.

How are LPC coefficients computed from a speech signal? The derivation above yields the formula: $\varphi_{s}(n)=-\sum\limits_{\nu=1}^{p}\hat{a}_{\nu}\varphi_{s}(n-\nu)$. We can bring this into the matrix notation $\mathbf{\Phi}_{s}=-\mathbf{R}\mathbf{a}_{opt}$ where $\mathbf{R}$ is the autocorrelation matrix of the signal. This formulation is called a Wiener-Hopf equation and can be solved very efficiently using the Levinson-Durbin recursion.

How Many LPC Coefficients Do I Need to Model a Speech Signal? What Does it Depend On?¶

Two model a single formant, 2 LPCs are required. So the number of LPCs depend on the number of formants we want to model. This in turn depends on the maximum frequency of our signal. A signal that is samples at 8kHz has a maximum frequency of 4kHz. We thus need to model 4 formants, requiring 8 LPCs. Because we typically apply extra anti-aliasing filters for high frequencies, typically 2 extra LPCs are transmitted, giving us 10 LPCs for the typically telephony sampling rate of 8kHz.

What is Pre-Emphasis? Why is it Important?¶

The pre-emphasis is based on the empirical observation that there is a natural 6dB attenuation per octave of increasing frequency. Our perception however is very sensitive to those high frequencies, thus we pre-emphasize them before processing to informational loss due to noise introduced when filtering or quantizing the signal.

Sampling, Quantization and Speech Coding¶

What Steps Are Necessary to Digitize an Analog Signal?¶

To digitize an analog signal, we have to make it discrete in time (using sampling) and amplitude (using quantization).

What is a Midrise Characteristic?¶

For the midrise characteristics, the characteristic curve of the quantization is rising through the zero point. That means that the smallest quantization steps are half a step size from zero. The alternative is Midtread, where the smallest quantization value is exactly zero ("treading" on the zero line) and then rises full stepsizes from there.

Why Does the SNR Suddenly Drop if the Signal Power $P_{S}$ is Large?¶

If the signal power is large, the amplitude of the signal will start going over the maximum quantization value $x_{max}$. All values will be capped to $x_{max}$, leading to a quickly rising error.

What is a Companding Scheme, and why is it Used?¶

The uniform quantization only works well for approximately uniformly distributed signals. If signal values are distributed differently, e.g. Gaussian, many values will be close to zero and only few values close to $x_{max}$. Thus, we would want a finer resolution close to zero. Instead of adjusting the stepsize non-linearly, we can use the companding scheme: We compress the signal (e.g. using a logarithmic compression like A-law or $\mu$-law) so that it is more uniformly distributed. We then quantize and transmit the signal. The receiver can now expand the signal again. This way, we retain a better resolution for values close to zero.

What is Adaptive Quantization?¶

Adaptive quantization schema try to account for the variation of the signal variance from frame to frame or sample to sample. This can be done in one of two ways:

Adaptive Quantization Forward (AQF) In AQF, we buffer the last $N$ samples. We then flush the buffer into a stepsize adapter, which computes the empirical variance of those $N$ samples and uses it to adjust the stepsize (number of units from one quantization level to the next). This method has two drawbacks:

It only updates the stepsize in blocks of $N$ samples and might not capture really fast changes in signal amplitude.
Because the variance is computed from the unquantized input signal, the sender also has to transmit the stepsize to the receiver.

Adaptive Quantization Backward (AQB) In ABF, the stepsize is updated using the smoothed average of the already quantized signal. This has two advantages:

The receiver can compute the stepsize from the signal they receive, so there are no extra parameters to be transmitted.
The update can happen on a sample-by-sample basis, making AQB adapt to changes in the signal amplitude very quickly.

What is Vector Quantization?¶

Vector quantization is used to quantized multidimensional signals, like videos, LPCs or MFCCs. The quantization can by distributing centroids in the sample space and assigning each input sample to the closes centroid. The centroids are typically index in a codebook, available to both sender and receiver. The sender thus only needs to transmit the codebook index of the centroid to the receiver. Naively, we just distribute the centroids in a uniform lattice. But we can use algorithms similar to K-means to better position the centroids, so that the cells they define (Voronoi cells) contain about an equal number of samples on average. This helps again for non-uniformly distributed signals, like speech, where many of the samples are close to 0. We can also combine pairs of vectors from successive samples and exploit their linear relationships.

Name Three Fundamental Speech Coding Schemes, along with Their Benefits and Drawbacks¶

Waveform coding Waveform coding tries to encode the waveform directly. Its most basic form is Pulse Code Modulation, where we just sample, quantize and translate the signal into bitcode. We can also use Differential Pulse Code Modulation, where we take the current sample and subtract weighted previous samples. This decorrelates the successive samples, whitening the signal and making it more uniformely distributed over a smaller range (good for quantization). DPCM can be done using an all-pole filter.

There are two DPCM standards:

In the open-loop, the sender computes the linear weights for filter from the actual input signal, then subtracts and quantizes signal. They then have to transmit the quantized signal and the filter parameters.

In the closed-loop system, the sender computes the linear weights from the already quantized signal. This way, they only have to transmit the quantized signal and the receiver can compute the filter parameters themselve.

While requiring more bits to be transmitted, the open-loop system has an interesting advantage: The filter parameters are computed before the quantization noise is added. So when the receiver adds the filter to the signal, not only is the difference signal reshaped to the original speech signal, but also the noise is shaped like the speech signal. This improves the SNR.

Parametric Coding Parametric codings try to completely remove the need to transmit the signal by deconstructing it into a set of parameters, from which the original signal can be re-synthesized. The most common example is Linear Prediction as described previously. However, regarding transmission, the LPCs are not very well suited for quantization, because their range is unbounded. But when using the Levinson-Durbin recursion, we get the Reflection Coefficients from the tube model as a byproduct. Those are bounded between -1 and 1 and much better suited for quantization. Even better suited are the log area ratios, so the log of the ratio between the areas of successive tube segments in the tube model. They can easily be computed as $LAR=\log\frac{1+r_{i}}{1-r_{i}}$. It has been observed that large reflection coefficients are especially important for the perceptual quality and the LARs maps large coefficients close to zero, which in combination with the log, yield a high quantization resolution for those important values.

Hybrid Coding We can also combine waveform coding and parametric coding. There are two common methods to combine LPCs with a residual excitation signal that is transmitted. The residual signal will be a better basis for the source-filter model than just the fundamental period and voiced/unvoiced descision.

Residual excited linear prediction (RELP) In this model, the LPCs are computed from the input signal. A residual signal is created by applying the LPCs filter inversely to the input signal. It now contains all the information that is not predicted by the LPC filter. The residual signal is then heavily compressed by low-pass filtering to $f_{m}=\frac{f_{s}}{2r}$, subsampling at rate $r$, quantizing and multiplexing with the LPCs for transmission. The receiver then upsamples the signal again just by adding zeroes between successive samples. This spectrum of the upsampled signal will contain copies of the low-pass samples signal, leading to a somewhat metallic sound.

Codebook Excited Linear Prediction (CELP) In CELP, the residual signal is not transmitted directly. Instead we choose the optimal entry from a codebook with many excitation signals and transmit that together with the LPCs. The optimal entry is found using analysis-by-synthesis: The LPC filter is applied to different codebook entries and the excitation signal that produces the perceptually closest match to the input signal will be transmitted.

What Coding Scheme Has Been Used in ISDN Telephony and DECT Telephony, and what Are the Datarates?¶

DECT uses the closed-loop DPCM at 32kbps. ISDN uses PCM at 64kbps. For quantization, 8bits adaptive quantization using $A$-law or $\mu$-law. Sampling frequency 8kHz.

Cepstral Analysis¶

In time Domain, We Have the Signal Model s(n)=h(n)∗e(n) with h(n) the Impulse Response of the Vocal Tract and e(n) the Excitation Signal. How Does This Signal Model Look in the Cepstral Domain¶

Due to the logarithm when computing the cepstrum, the signal now is the superposition of the excitation and impulse response: $\hat{s}(n)=\hat{h}(n)+\hat{e}(n)$.

Is the Complex Cepstrum Complex-valued?¶

Yes, the complex cepstrum retains the phase information of the time domain signal, making a complete inversion from cepstral coefficients to time domain signal possible.

How Can We Estimate the Spectral Envelope Caused by the Vocal Tract from the Cepstral Representation?¶

The spectral envelope approximately resembles a low frequency wave in the spectral domain, whereas the harmonics are high frequency periodicities in the spectrum. Low quefrencies will correspond to the low frequencies of the spectrum and thus correspond the effects of the vocal tract. A peak in the high quefrencies will be related to the harmonics. So we can low-pass filter the cepstrum to obtain an estimate of the spectral envelope, eliminating the effects of the glottal excitation.

How Can We Estimate the Speech Fundamental Frequency in the Cepstral Domain?¶

As mentioned before, a peak in the higher quefrencies will be related to the harmonics in the spectrum. The peak will be at the frequency of the harmonics. Since harmonics are multiples of the fundamental frequency, their period length will be the fundamental frequency. So we just need to invert the quefrency where there is the peak.

Explain the Meaning of the Terms: Cepstrum, Quefrency, Liftering, Rahmonic¶

Cepstrum: Inverse Fourier transform of the log spectrum
Quefrency: Equivalent of the frequency in the cepstral domain
Liftering: Equivalent to filtering, translates to a linear superposition of the cepstral coefficients with the cepstral filter.
Rahmonics: Harmonic structures in the Cepstrum.

Speech Enhancement¶

How is the Wiener Filter Defined in the STFT Domain?¶

In the STFT domain, the Wiener filter estimates the signal as the product of the filter spectrum with the input spectrum $\hat{S_{k}(l)}=G_{k}(l)Y_{k}(l)$. The derived MMSE optimal filter is $G_{k}(l)=\frac{\sigma_{s,k}^{2}(l)}{\sigma_{s,k}^{2}(l)+\sigma_{n,k}^{2}(l)}$.

Explain how the Wiener Filter Works in the STFT?¶

The Wiener filter is applied to each time-frequency bin of the STFT signal. It attenuates those bins that have a low SNR, while leaving the bins which have a high SNR uneffected.

Sketch the Derivation of the Wiener Filter__¶

The true signal is estimated using a convolutional filter $\hat{s}(n)=\sum\limits_{\nu=-\infty}^{\infty}g(\nu)y(n-\nu)$. We define the MMSE objective as $\hat{g}(v)=\arg\min_{\hat{g}(v)}\mathbb{E}[(s(n)-\hat{s}(n))^{2}]$. We now just need to set the derivative to zero and solve for $\hat{g}(\nu)$:

$$ \begin{align}

0 &\overset{!}{=} \frac{\partial\mathbb{E}[d(n)^{2}]}{\partial \hat{g}(\nu)} = \mathbb{E}\left(2d(n)\frac{\partial}{\partial \hat g(\nu)}\left(y(n)-\sum\limits_{\nu=-\infty}^{\infty}\hat{g}(\nu)y(n-\nu)\right)\right)&|\space d(n)=y(n)-\hat{s}(n) \

&=2\mathbb{E}(d(n)y(n-\nu))\

&=\mathbb{E}[(s(n)-\sum\limits_{\nu=-\infty}^{\infty}\hat{g}(\nu)y(n-\nu))(y(n-\nu))]\

&=\mathbb{E}[s(n)y(n-\nu)]-\mathbb{E}[\sum\limits_{\mu=-\infty }^{\infty}\hat{g}(\nu)y(n-\mu)y(n-\nu)]\

&=\varphi_{ys}(\nu)-\sum\limits_{\mu=-\infty}^{\infty}\hat{g}(\nu)\varphi_{yy}(\nu-\mu)

\end{align} $$

We can bring this formulation into the spectral domian:

\[ G(z)=\frac{\Phi_{YS}(z)}{\Phi_{YY}(z)} \]

How Are Posterior, Likelihood and prior Defined in Bayesian Estimation?¶

The posterior is the probability of a hypothesis after observing the evidence $p(H|Y)$. The likelihood is the probability of the evidence given the hypothesis $p(Y|H)$. The prior is the probability of the hypothesis in general $p(H)$.

How Can I Find a Bayesian MMSE Estimator of Clean Speech?¶

We search for the MMSE optimal signal estimate given the observation:

\[ \hat{S}=\arg\min_{\hat S}\mathbb{E}[(S -\hat S)^{2}|Y]=\mathbb{E}[S|Y] \]

We know that $\mathbb{E}[S|Y]=\int_{-\infty}^{\infty}S\space p(S|Y)dS$. Assuming Gaussian speech and noise, we have:

\[ p(S|Y)=\frac{p(Y|S)p(S)}{p(Y)} \]

with

$p(Y|S)=\mathcal{N}(0, \sigma_{n}^{2})$
$p(S)=\mathcal{N}(0,\sigma_{s}^{2})$
$p(Y)=\mathcal{N}(0,\sigma_{S}^{2}+\sigma_{N}^{2})$

we can now solve to obtain

\[ \hat S=\mathbb{E}(S|Y)=\frac{\sigma_{s}^{2}}{\sigma_{s}^{2}+\sigma_{n}^{2}}Y \]

So the Wiener filter is the also the Baysian MMSE optimal non-constraint filter.

Explain Different Methods to Estimate the Noise Variances¶

VAD The most basic method to estimate speech variance is to use Voice Activity Detection (VAD). We assign each sample to the set $N$ if there is no voice activity. Then we take the empirical variance of that set. This however has drawbacks. The estimate is updated over time windows where there is no voice. So if the background noise changes while someone speaks, it will start bleeding into the sound. Also, even with perfect noise detection, this wont lead to optimal results.

Minimum Statistics The minimum statistics estimate uses the minimum of the signal periodogram over the last ~1.5 seconds (time depends on implementation) as its estimate for the noise PSD. The assumption is that the speech is much louder than the noise, so that when there is only noise present, the periodogram will dip and we can take that minimum as our estimate. This has several problems:

The minimum underestimates the noise mean. We can add a certain bias to counteract that.
Increasing noise levels are only detected at a delay of 1.5 seconds
If there is no noise in the time interval, then the noise PSD estimate will be the speech PSD, leading to heavy distortions.

SPP We can make use the Bayesian posterior probability for speech in each sample. We then take a weighted average of the current signal power and the previous noise estimates, weighting the signal power more if we find speech to be unlikely. The posterior is computed as follows: We use Bayes theorem to represent the posterior as:

\[ p(H_{1}|Y)=\frac{p(Y|H_{1})p(H_{1})}{p(Y|H_{1})p(H_{1})+p(Y|H_{0})p(H_{0})} \]

where

$p(Y|H_{1})=\mathcal{N}(0,\sigma_{s}^{2}+\sigma_{n}^{2})=\mathcal{N}(0,\sigma_{n}^{2}(1+\xi_{H_{1}}))$
$p(Y|H_{0})=\mathcal{N}(0,\sigma_{n}^{2})$

We set the expected speech power to $\xi_{H_{1}}=15dB$ and can use the formula above to derive:

\[ p(H_{1}|Y)=\left(1+\frac{p(H_{0})}{p(H_{1})}(1-\xi)\exp\left(-\frac{|Y|^{2}}{\sigma_{N}^{2}}\frac{\xi}{1+\xi}\right)\right)^{-1} \]

We can now use this to estimate the noise periodogram as a weighted average:

\[ |\hat N|^{2}=p(H_{1}|Y)\sigma_{N}^{2}+p(H_{0}|Y)|Y|^{2} \]

and from that estimate the noise PSD as:

\[ \hat\sigma_{N}^{2}(l)=\alpha\sigma_{N}^{2}(l-1)+(1-\alpha)|\hat N(l)|^{2} \]

Explain Different Methods to Estimate the Speech Variances¶

MLE We can use the maximum likelihood approach to estimate the speech PSD that makes the last $L$ observations the most likely:

\[ \sigma_{s}^{2,ML}(l)=\arg\max_{\sigma_{s}^{2}}\prod_{n=0}^{L-1}p(Y(l-n)|\sigma_{s}^{2})=\frac{1}{L}\sum\limits_{n=0}^{L}|Y(l-n)|^{2}-\sigma_{N}^{2}(l) \]

This is the same as spectral subtraction for $L=1$

Decision directed We can use the posterior speech probability from the SPP approach for the noise PSD estimation to compute the speech PSD as a weighted average from the power of the previous speech signal estimate (weighted higher if speech is unlikely) and from the spectral subtraction of the current signal (weighted higher if speech is likely).

\[ \hat\sigma_{s}^{2}(l)=\alpha|\hat S(l-1)|+(1-\alpha)\max(0, |Y|^{2}(l)-\sigma_{n}^{2}(l)) \]

The noise PSD is estimated using techniques like SPP and the next frame of the true speech signal is estimate using the Wiener filter with speech and noise PSD estimations.

Temporal Cepstral Smoothing Smoothing out the cepstral coefficients over time using a smoothed temporal average retains speech related information very well. The noise however goes to zero because even across short time intervals, noise will be uncorrelated for successive samples and thus the average will go to zero.

What is the Key Advantage when Multiple Microphones Are Present?¶

It allows us to exploit the spatial properties of sound. Sound from different directions hit different microphones at different time delays. This can be used in beamformers.

Name and Explain Two Different Beamformers¶

Delay-and-sum The Delay-and-sum beamformer find the optimal delay between two microphones using the maximum of the cross-correlation. This works because the noise will interfere with itself destructively in all delays, but the true signal will interfere perfectly constructive for some delay $\tau$. Then we just sum up the signals

\[ y(n)=\frac{1}{2}(y_{1}(n-\tau)+y_{2}(n)) \]

Minimum Variation Distortionless Response The MVDR beamformer uses an array of $M$ microphones which produce a signal $\mathbf{y}$ and defines a filter that, when applied to $\mathbf{y}$ gives the minimum variance estimate for the true signal:

\[ \hat{S}=\mathbf{h}^{H}\mathbf{y} \]

The objective is defined as the filter, that minimizes the noise PSD without changing the true signal:

\[ \mathbf{h}_{mvdr}=\arg\min_{\mathbf{h}}\mathbb{E}[|\mathbf{h}^{H}\mathbf{n}|^{2}]\quad\text{s.t.}\space\mathbf{h}^{H}\mathbf{s}=S \]

Deriving that formula gives the minimum variance estimate:

\[ \mathbf{h}_{mvdr}=\frac{\mathbf{a}\Phi_{NN}^{-1}}{\mathbf{a}^{T}\Phi_{NN}^{-1}\mathbf{a}} \]

where $\mathbf{a}$ is the so called steering vector.

What is the Key Advantage of an MVDR Beamformer over a Delay-and-sum Beamformer?¶

The MVDR beamformer defines a filter that actively attenuates noise sources while leaving the desired source direction unchanged. The delay-and-sum beamformer just relies on the noise to cancel out destrutively, without active attenuation for spatial noise sources.

Also, it can be shown that the MVDR beamformer followed by a single-channel Wiener filtering on the output signal is the same as the mutli-channel Wiener filter, making the combination of both MVDR and single-channel Wiener filtering very efficient.