
https://youtu.be/N0CVIoVQkmc?si=87s8fohLX7EIeUpG&t=1404


Source Filter Model

Pasted image 20240514162025.png

In the Source-Filter model, speech is produced in two steps:

  • Source: A sound is produced through the glottis.
  • Filter: The sound passes through the vocal tract, which acts as a filter and produces the formants.

The glottis can produce voiced sounds (periodic opening and closing) or unvoiced sounds (glottis stays open). The glottal source can be characterized by estimating the fundamental frequency (see Fundamental Frequency (F0)#Estimation), and the effect of the vocal tract can be estimated by calculating the Spectral Envelope.

Tube Model of the Vocal Tract

Pasted image 20240516170549.png

Pasted image 20240516170451.png

The tube model treats the vocal tract as a tube through which oscillations in air pressure travel. The pressure waves are excited at one end (glottis), and the tube is open at the other end (lips). The model can be used in its simple form, covering only the path from glottis to lips, or in a more complex form that also accounts for nasal sounds. The tube model follows the same principle as organ pipes.

Physics

The acoustics are caused by the waves of air pressure traveling through the tube.

Pasted image 20240516170906.png

They are modeled by acoustic impedance:

  • \(p(x,t)\) = acoustic pressure
  • \(u(x,t)\) = sound particle velocity
  • \(v(x,t)=u(x,t)\cdot A\) = volume velocity, with \(A\) the cross-sectional area

The acoustic impedance is then \(Z(x,t)=\frac{p(x,t)}{v(x,t)}\). This is analogous to Ohm's law, where resistance = voltage / current. At the closed end of the tube, the volume velocity drops to zero (\(v(x,t)=0\)), so the acoustic impedance is infinite. At the open end, the pressure drops to zero (\(p(x,t)=0\)), so the acoustic impedance is zero as well. Sudden changes in the impedance (e.g. where the diameter changes) cause part of the wave to be reflected. The reflection factor is given by \(r=\frac{Z_2-Z_1}{Z_2+Z_1}\).
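To make the reflection factor concrete, here is a minimal Python sketch. It assumes the characteristic impedance of a lossless tube section is \(Z=\rho c/A\) (a standard acoustics result that is not stated in these notes), an air density of 1.2 kg/m³, and made-up cross-sections of 4 cm² and 1 cm². Which side of the discontinuity is labelled 1 and which 2 is also an assumption of this example.

```python
RHO_AIR = 1.2     # assumed air density in kg/m^3 (not given in the notes)
C_SOUND = 340.0   # speed of sound in m/s, the value used in these notes


def tube_impedance(area_m2):
    """Characteristic impedance of a lossless tube section (assumption: Z = rho * c / A)."""
    return RHO_AIR * C_SOUND / area_m2


def reflection_factor(z1, z2):
    """Reflection factor r = (Z2 - Z1) / (Z2 + Z1) at an impedance discontinuity."""
    return (z2 - z1) / (z2 + z1)


# Hypothetical discontinuity: a 4 cm^2 section (side 1) meets a 1 cm^2 section (side 2).
z_wide = tube_impedance(4e-4)
z_narrow = tube_impedance(1e-4)
print(reflection_factor(z_wide, z_narrow))

# Limiting cases from the text: a closed end (very large Z2) and an open end (Z2 = 0).
print(reflection_factor(1.0, 1e12))
print(reflection_factor(1.0, 0.0))
```

With equal impedances on both sides the factor is zero, so nothing is reflected. The two limiting cases approach \(r=1\) (closed end) and give \(r=-1\) (open end).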

The forward and backward traveling waves interfere. For specific wavelengths, they always interfere constructively, leading to standing waves.

Those wavelengths are given by \(\lambda_n=\frac{4l}{2n+1}\), where \(l\) is the length of the tube and \(n=0,1,2,\ldots\) indexes the resonances. For a vocal tract of length \(l=17\,\text{cm}\) and a speed of sound of \(c=340\,\text{m/s}\), we get resonances at \(f_n=\frac{c}{\lambda_n}=(2n+1)\frac{c}{4l}=(2n+1)\cdot 500\,\text{Hz}=\{500\,\text{Hz}, 1500\,\text{Hz}, 2500\,\text{Hz}, 3500\,\text{Hz}, \ldots \}\), so one resonance per \(\text{kHz}\).
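The resonance formula can be checked numerically. The following short Python sketch evaluates \(f_n=(2n+1)\frac{c}{4l}\) for the values used above; the function name `tube_resonances` is made up for this example.

```python
def tube_resonances(length_m, c=340.0, count=4):
    """Resonance frequencies f_n = (2n + 1) * c / (4 * l), n = 0, 1, 2, ...

    Models a uniform tube that is closed at one end (glottis) and open at the other (lips).
    """
    return [(2 * n + 1) * c / (4.0 * length_m) for n in range(count)]


# Vocal tract of 17 cm, as in the text.
for f in tube_resonances(0.17):
    print(f"{f:.0f} Hz")
```

A shorter tube shifts all resonances upward by the same factor, which is one reason why formant frequencies depend on the length of the speaker's vocal tract.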

Kelly-Lochbaum Structure

While the simple tube model can only produce a fixed set of resonance frequencies, formants require multiple resonances at varying frequencies. This can be achieved by dividing the tube into segments, where each segment has a different cross-sectional area \(A_n\).

Pasted image 20240516175924.png

As described in #Tube Model of the Vocal Tract#Physics, the transitions between segments, and with them the changes in cross-sectional area, lead to reflections within the tube. Assuming perfect cross-sectional areas and no friction, we can model such a segmented tube as an LTI system that describes the velocity of the forward traveling wave \(f_i(t)\) and the backward traveling wave \(b_i(t)\) for each segment \(i\):

\[ \begin{align} f_{i-1}(t) &= (1+r_i)f_i(t-\tau)+r_ib_{i-1}(t) \\ b_{i}(t) &= -r_if_i(t-2\tau)+(1-r_i)b_{i-1}(t-\tau) \end{align} \]

in which \(\tau=\frac{\Delta x}{c}\) is the time it takes the wave to travel along one segment of length \(\Delta x\) (for a detailed derivation of these equations, refer to the script, section 4.1). The first equation shows that the forward traveling wave is the sum of

  • \((1+r_i)f_i(t-\tau)\): The energy that was not reflected backward when entering the current segment.
  • \(r_ib_{i-1}(t)\): The energy that was reflected forward from the backward traveling wave when exiting the current segment.

The same holds, vice versa, for the backward traveling wave. This can be drawn as the Kelly-Lochbaum structure:

Pasted image 20240516183815.png
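As a rough illustration of a single junction, the following Python sketch applies the two scattering equations with the delays \(\tau\) left out, so it only shows how the two incoming waves are mixed at one instant. The function name `kl_junction` and the example values are made up for this sketch.

```python
def kl_junction(r, f_in, b_in):
    """Mix the two waves arriving at one junction with reflection factor r.

    f_in: forward wave arriving from segment i        (f_i in the equations)
    b_in: backward wave arriving from segment i-1     (b_{i-1} in the equations)
    Returns (f_out, b_out): the forward wave entering segment i-1 and the
    backward wave sent back into segment i. Delays are ignored in this sketch.
    """
    f_out = (1 + r) * f_in + r * b_in
    b_out = -r * f_in + (1 - r) * b_in
    return f_out, b_out


# No impedance change: both waves pass through unchanged.
print(kl_junction(0.0, 1.0, 0.5))

# Hypothetical junction with r = 0.3 and only a forward wave arriving.
print(kl_junction(0.3, 1.0, 0.0))
```

It follows directly from the two equations that \(f_{i-1}-b_{i-1}=f_i-b_i\), i.e. the difference of forward and backward wave is the same on both sides of the junction.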

This model can be unfolded into a complete LTI system:

Pasted image 20240516184309.png

A filter can be derived from the impulse response of this system in continuous time. The first output is to be expected after \(n\tau\) seconds and is excited by a sequence of impulses at rate \(\tau\). The impulse response will be discretized in steps of \(2\tau\), as the first impulse travels through the delay twice (forward and back; still unclear, refer to script). To model the filter perfectly in discrete time, we need to sample at twice the output frequency of the system, so \(f_s=\frac{1}{2\tau}\). #unclear
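To get a feeling for the numbers, the relation \(f_s=\frac{1}{2\tau}\) with \(\tau=\frac{\Delta x}{c}\) can be evaluated for a concrete tube. The sketch below assumes that the tract is split into segments of equal length \(\Delta x = l/N\); the function name and the choice of \(N=8\) are illustrative only.

```python
def kl_sampling_rate(tract_length_m, num_segments, c=340.0):
    """Sampling rate f_s = 1 / (2 * tau), with tau = dx / c.

    Assumes equal-length segments, dx = tract_length_m / num_segments.
    """
    dx = tract_length_m / num_segments
    tau = dx / c
    return 1.0 / (2.0 * tau)


# 17 cm vocal tract split into 8 segments.
print(kl_sampling_rate(0.17, 8))
```

Under these assumptions, 8 segments correspond to a sampling rate of about 8 kHz, and doubling the number of segments doubles the sampling rate.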

The model can be further improved by optimizing the connections for the reflections:

Pasted image 20240516185956.png

Linear Prediction

Linear Prediction Coefficients (LPCs) are a way to parameterize a speech signal by decomposing it into the reflection coefficients of a #Kelly-Lochbaum Structure in an autoregressive manner. Speech is then reconstructed using Vocoding. LPCs are an extremely efficient representation of speech, especially compared to Speech Coding#Waveform Coding. LPC-10 requires a bit rate of only \(2.4\,\text{kbps}\) for transmission (Speech Coding#Linear Prediction Coefficients#Bitrate).

Derivation

The #Kelly-Lochbaum Structure gives a mathematical representation of the vocal tract. However, the reflection coefficients still need to be computed. As a filter must be causal, we can use linear prediction to solve for the coefficients. While the system is BIBO-stable, its impulse response may still be infinitely long. Thus the Autoregressive Moving Average (ARMA) model for a digital filter is used:

\[ s(n)=\sum\limits_{m=0}^{q}b_{m}e(n-m)-\sum\limits_{\nu=1}^{p}a_{\nu}s(n-\nu) \]

where the first sum is the moving average part and the second sum is the autoregressive part.
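The difference equation can be run directly as a recursion. The following Python sketch is a plain, unoptimized implementation; the function name `arma_filter` and the example coefficients are chosen for illustration, and samples before \(n=0\) are assumed to be zero.

```python
def arma_filter(b, a, e):
    """Compute s(n) = sum_{m=0}^{q} b[m] * e(n-m) - sum_{nu=1}^{p} a_nu * s(n-nu).

    b: moving average coefficients b_0 .. b_q
    a: autoregressive coefficients a_1 .. a_p (a_0 = 1 is implicit and not stored)
    e: excitation samples; everything before n = 0 is treated as zero
    """
    s = []
    for n in range(len(e)):
        # Moving average part: weighted sum of current and past excitation samples.
        ma = sum(b[m] * e[n - m] for m in range(len(b)) if n - m >= 0)
        # Autoregressive part: weighted sum of past output samples.
        ar = sum(a[nu - 1] * s[n - nu] for nu in range(1, len(a) + 1) if n - nu >= 0)
        s.append(ma - ar)
    return s


# Impulse response of a first-order AR filter with a_1 = -0.5.
impulse = [1.0] + [0.0] * 5
print(arma_filter([1.0], [-0.5], impulse))
# prints [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
```

The autoregressive part feeds the output back into itself, which is why a single input impulse produces a response that keeps decaying without ever reaching exactly zero.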

ARMA with Z-transform

We can now apply the [[z-Transform]] to the signal and move the operator into the sum:

\[ \begin{align} && S(z) &= E(z)\sum\limits_{m=0}^{q}b_{m}z^{-m}-S(z)\sum\limits_{\nu=1}^{p}a_{\nu}z^{-\nu} \\ & \iff & S(z)\left(1+\sum\limits_{\nu=1}^{p}a_{\nu}z^{-\nu}\right) &= E(z)\sum\limits_{m=0}^{q}b_{m}z^{-m} \end{align} \]

where, on the unit circle, \(z=e^{j\Omega}\) with the normalized angular frequency \(\Omega\), so that \(z^{-x}=e^{-j\Omega x}\). \(E(z)\) is the z-transform of the excitation signal and \(S(z)\) is the z-transform of the signal produced by the #Kelly-Lochbaum Structure. To find the transfer function \(H(z)\) of the filter, we divide \(S(z)\) by \(E(z)\):

\[ H(z)=\frac{S(z)}{E(z)} \overset{a_0=1}{=} \frac{\sum\limits_{m=0}^{q}b_{m}z^{-m}}{\sum\limits_{\nu=0}^{p}a_{\nu}z^{-\nu}} = z^{p-q}b_{0}\frac{\prod\limits_{m=1}^{q}(z-z_{0m})}{\prod\limits_{\nu=1}^{p}(z-z_{\infty\nu})} \]

with

  • \(z_{0m}\) being the roots of the numerator polynomial and thus the zeros of \(H(z)\).
  • \(z_{\infty\nu}\) being the roots of the denominator polynomial and thus the poles of \(H(z)\).

The last equality follows from the fact that a polynomial can be described by its roots. This shows that an ARMA model is the same as a [[Pole-Zero filter]] when z-transformed.
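The equality of the rational and the factored form can be checked numerically for a small case. The Python sketch below uses \(q=1\) and \(p=2\) with made-up coefficients, finds the zero and the poles in closed form, and compares both expressions at an arbitrary point on the unit circle. The leading power of \(z\) used here is \(z^{p-q}\), which is what results from pulling \(z^{-q}\) out of the numerator and \(z^{-p}\) out of the denominator.

```python
import cmath


def h_rational(b0, b1, a1, a2, z):
    """H(z) as a ratio of polynomials in z^-1 (q = 1, p = 2, a_0 = 1)."""
    return (b0 + b1 / z) / (1 + a1 / z + a2 / z ** 2)


def h_factored(b0, b1, a1, a2, z):
    """The same H(z), written in terms of its zero and its two poles."""
    zero = -b1 / b0                        # root of b0 * z + b1
    disc = cmath.sqrt(a1 * a1 - 4 * a2)    # discriminant of z^2 + a1 * z + a2
    pole_1 = (-a1 + disc) / 2
    pole_2 = (-a1 - disc) / 2
    p, q = 2, 1
    return z ** (p - q) * b0 * (z - zero) / ((z - pole_1) * (z - pole_2))


# Hypothetical coefficients; the poles form a complex-conjugate pair with radius 0.9.
coeffs = (1.0, 0.5, -1.2, 0.81)
z = cmath.exp(1j * 0.3)  # arbitrary test point on the unit circle
difference = abs(h_rational(*coeffs, z) - h_factored(*coeffs, z))
print(difference)
```

The printed difference should be at the level of floating-point rounding error.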

If we assume the signal to be perfectly represented by either the autoregressive or the moving average part, we can neglect the other. The z-transform of the autoregressive part alone gives us an all-pole filter. Conversely, the z-transform of the moving average part alone gives us an all-zero filter. As the vocal tract is best approximated by the autoregressive part #unclear , we can represent the #Tube Model of the Vocal Tract as an all-pole filter:^allPoleFilter

\[ H(z)\approx\frac{b_0}{1+\sum\limits_{\nu=1}^{p}a_{\nu}z^{-\nu}} \]

So from the spectral transformation of the ARMA representation of the signal, we can find a filter for the vocal tract. Increasing the order \(p\) increases the accuracy of the approximation. ^f4c6cb
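To see how an all-pole filter produces a resonance, the following Python sketch evaluates \(|H(z)|\) on the unit circle for a second-order example (\(p=2\)). The coefficients are constructed from a hypothetical complex-conjugate pole pair placed at 500 Hz with radius 0.97, for an assumed sampling rate of 8 kHz; none of these values come from the lecture.

```python
import cmath
import math


def all_pole_response(a, b0, freq_hz, fs_hz):
    """|H(z)| for H(z) = b0 / (1 + sum_{nu=1}^{p} a_nu * z^-nu) at z = exp(j*2*pi*f/fs).

    a: list of a_1 .. a_p (a_0 = 1 is implicit).
    """
    z = cmath.exp(1j * 2 * math.pi * freq_hz / fs_hz)
    denom = 1 + sum(a_nu * z ** (-nu) for nu, a_nu in enumerate(a, start=1))
    return abs(b0 / denom)


def resonator_coeffs(formant_hz, pole_radius, fs_hz):
    """a_1, a_2 for a pole pair at pole_radius * exp(+-j*theta), theta = 2*pi*f/fs."""
    theta = 2 * math.pi * formant_hz / fs_hz
    return [-2 * pole_radius * math.cos(theta), pole_radius ** 2]


fs = 8000.0                              # assumed sampling rate
a = resonator_coeffs(500.0, 0.97, fs)    # hypothetical single formant
freqs = range(0, 4001, 10)               # 0 Hz .. fs/2 in 10 Hz steps
peak = max(freqs, key=lambda f: all_pole_response(a, 1.0, f, fs))
print(peak, all_pole_response(a, 1.0, peak, fs))
```

The magnitude response peaks close to the frequency of the pole pair, and moving the poles closer to the unit circle makes the peak narrower and higher. A filter of order \(p\) can place up to \(p/2\) such pole pairs, which is one way to read the statement that a higher order improves the approximation.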

Computing the AR Coefficients