Speech Coding
Speech coding refers to methods that minimize the amount of data needed to transmit speech. Some speech coders act directly on the waveform (e.g. Quantization), retaining high quality but offering only low compression rates. Other coders parameterize the speech completely (e.g. Source-Filter Model#Linear Prediction), with lower reconstruction quality but very high data efficiency. Hybrid coders combine both approaches to trade off quality against transmission size. Quality is often measured using the [[Mean Opinion Score]] (MOS).

Waveform Coding¶
Quantization¶
Waveform coders are typically built around Quantization, transmitting a quantized version of the waveform itself.
DPCM¶
Another method is Differential Pulse Code Modulation (DPCM). In DPCM, we subtract a weighted previous sample from the current one, \(d(k) = x(k) - a\,x(k-1)\). This reduces the dynamic range of the signal. It is similar to an LPC analysis with only one coefficient \(a\), and it can be extended to multiple coefficients. Each additional coefficient further decorrelates successive samples and whitens the signal. Compared to LPC, where the excitation signal is parameterized, in DPCM the residual of the time-varying prediction filter is quantized and transmitted.

The receiver then uses the residual signal and the prediction parameters to reconstruct the original signal. This can be done in open loop fashion, where the coefficients are transmitted to the receiver, or in closed loop fashion, where the coefficients are computed from the quantized signal.
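
As a toy illustration, here is a minimal one-tap open loop DPCM coder in Python. The coefficient \(a\), the step size, and the test signal are illustrative choices, not taken from any standard:

```python
import numpy as np

def dpcm_encode(x, a=0.9, step=0.05):
    """Open loop DPCM: quantize the residual d(k) = x(k) - a*x(k-1)."""
    d = x - a * np.concatenate(([0.0], x[:-1]))  # one-tap prediction residual
    return np.round(d / step).astype(int)        # uniform quantizer indices

def dpcm_decode(idx, a=0.9, step=0.05):
    """Reconstruct via x~(k) = d~(k) + a*x~(k-1)."""
    d_q = idx * step
    x_hat = np.empty_like(d_q)
    prev = 0.0
    for k, dk in enumerate(d_q):
        prev = dk + a * prev
        x_hat[k] = prev
    return x_hat

# toy usage: a slowly varying signal has a small residual dynamic range
x = np.sin(2 * np.pi * 200 * np.arange(800) / 8000)
print("max error:", np.abs(x - dpcm_decode(dpcm_encode(x))).max())
```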

Noise Shaping¶
Using a closed loop DPCM coder might seem like the better choice, as we don't have to additionally transmit the coefficients. But there is a key difference in the decoded signal:
For the open loop system, the \(A\) filter is applied to the difference signal only, not to the quantization noise. When the receiver applies the \(A\) filter to the received signal, it reconstructs the original difference that needs to be added to the incoming signal. At the same time, the quantization noise, being unfiltered in the first place, is now shaped to the Spectral Envelope of the input signal, effectively being masked by it.
In the closed loop system, the \(A\) filter is applied to both the difference signal and the quantization noise. When the receiver filters, it again reconstructs the original difference that needs to be added to the incoming signal. But this time, with the quantization noise also having been filtered, the receiver merely reverts the already shaped noise back to plain quantization noise, without any masking.
Open Loop¶
**Transmitter**
If we look at the quantized signal \(\tilde{d}(k)\), we see it is comprised of the difference signal \(d(k)\) and the noise added by the quantizer \(\Delta(k)\). In the \(z\)-domain ([[z-Transform]]), this means we are summing up their spectral representations (the transform is linear):
\[\tilde{D}(z) = D(z) + \Delta(z)\]
We can represent the difference signal \(D(z)\) by the filtered input signal \(X(z)\), where we take \(X(z)\) and subtract its weighted self:
\[D(z) = X(z) - A(z)X(z) = \left(1 - A(z)\right)X(z)\]
**Receiver**
The receiver's signal is the sum of the current quantized difference and the weighted previous output:
\[\tilde{X}(z) = \tilde{D}(z) + A(z)\tilde{X}(z) \quad\Rightarrow\quad \tilde{X}(z) = \frac{\tilde{D}(z)}{1 - A(z)}\]
If we now substitute the expression for \(\tilde{D}(z)\) from above, we get:
\[\tilde{X}(z) = \frac{\left(1 - A(z)\right)X(z) + \Delta(z)}{1 - A(z)} = X(z) + \frac{\Delta(z)}{1 - A(z)}\]
From this expression, we see that the noise is filtered by \(\frac{1}{1-A(z)}\). Since the coefficients \(a\) are computed directly from the speech signal, this synthesis filter has the shape of the Spectral Envelope. This means the quantization noise takes the shape of the speech signal and is thus masked by the actual signal.
Closed Loop¶
**Transmitter**
If we look at the closed loop system, the \(A(z)\) filtering is applied after quantization. The prediction is computed from the quantized reconstruction, so the quantization noise enters the difference signal as well:
\[\tilde{D}(z) = X(z) - A(z)\tilde{X}(z) + \Delta(z)\]
**Receiver**
Again, the receiver sees the same structure:
\[\tilde{X}(z) = \tilde{D}(z) + A(z)\tilde{X}(z)\]
But when we substitute \(\tilde{D}(z)\) this time, the filter cancels for both the input signal and the noise:
\[\tilde{X}(z) = X(z) - A(z)\tilde{X}(z) + \Delta(z) + A(z)\tilde{X}(z) = X(z) + \Delta(z)\]
Therefore, we don't get the noise masking property of the open loop system, making the noise more noticeable.
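
A small numerical sketch of this difference, using a synthetic correlated signal and an illustrative one-tap predictor (\(a = 0.9\), step \(0.1\)): the reconstruction noise is concentrated at low frequencies (where the signal energy is) in the open loop case, and stays white in the closed loop case:

```python
import numpy as np

rng = np.random.default_rng(0)
a, step = 0.9, 0.1
x = np.cumsum(rng.standard_normal(4096)) * 0.01   # correlated, low-frequency-heavy toy signal
quant = lambda v: step * np.round(v / step)

# open loop: predict from the original signal; the decoder shapes the noise by 1/(1-A(z))
d = x - a * np.concatenate(([0.0], x[:-1]))
x_ol = np.zeros_like(x)
for k in range(len(x)):
    x_ol[k] = quant(d[k]) + a * (x_ol[k - 1] if k else 0.0)

# closed loop: predict from the quantized reconstruction; the noise stays white
x_cl = np.zeros_like(x)
for k in range(len(x)):
    prev = x_cl[k - 1] if k else 0.0
    x_cl[k] = quant(x[k] - a * prev) + a * prev

for name, n in [("open", x_ol - x), ("closed", x_cl - x)]:
    s = np.abs(np.fft.rfft(n)) ** 2
    print(name, "low/high noise energy:", s[: len(s) // 8].mean() / s[-len(s) // 8:].mean())
```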
Parametric Coding¶
In parametric coding, we don't transmit the waveform, but transform it into a simplified representation and transmit its parameters.
Linear Prediction Coefficients¶
The LPC model is described in Source-Filter Model#Linear Prediction. One issue with the LPCs is that they have a high and a priori unknown dynamic range, thus requiring many bits to transmit. One way to circumvent this issue is to transmit the reflection coefficients, which all fall into the range \([-1, 1]\).
A further transformation gives us Log Area Ratios (LARs), e.g. \(\mathrm{LAR}_i = \log\frac{1+k_i}{1-k_i}\), which retain more resolution for large reflection coefficients (those close to \(\pm 1\)), yielding better perceptual quality.
Using logarithmic compression is also useful for the gain: as the amplitudes of the signal are mostly around zero, linear quantization produces odd on-off artifacts in low-energy stretches of the speech. Using logarithmic compression on the gain, we retain that low-energy information better. We also lose some detail in the high-energy parts of the signal, but that is less audible.
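
A minimal sketch of the reflection-coefficient-to-LAR mapping and its inverse, assuming the \(\log\frac{1+k}{1-k}\) definition above (the quantizer step is illustrative):

```python
import numpy as np

def refl_to_lar(k):
    """Log Area Ratios from reflection coefficients k in (-1, 1).
    The log expands the sensitive regions near |k| = 1."""
    return np.log((1 + k) / (1 - k))

def lar_to_refl(lar):
    # inverse: k = (e^L - 1) / (e^L + 1) = tanh(L / 2)
    return np.tanh(lar / 2)

k = np.array([0.95, -0.5, 0.1])
lar = refl_to_lar(k)
# uniform quantization in the LAR domain spends resolution where k is near +-1
lar_q = np.round(lar * 4) / 4
print(lar_to_refl(lar_q))
```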
Bitrate¶
For LPC-10 coding, we need to store:
| Parameter | Bits \(w\) |
|---|---|
| LP coefficients | 41 |
| Fundamental Frequency (F0) | 7 |
| Gain | 5 |
| Synchronization (unvoiced/voiced) | 1 |
| Total | 54 |
This amounts to 54 bits per frame. Given that LPC-10 uses 22.5 ms frames, we get a bit rate of:
\[\frac{w}{T} = \frac{54\,\text{bit}}{22.5\,\text{ms}} = 2400\,\frac{\text{bit}}{\text{s}}\]
which is less than a tenth of the bitrate of waveform coding schemes like the one used in DECT, but also comes with a drop in quality.
Hybrid Coding¶
As the name suggests, hybrid coding combines parametric coding and waveform coding. Hybrid coders transmit both parameters and time domain residuals to enhance the quality compared to parameter-only coding.
Residual Excited Linear Prediction¶

The RELP coding scheme uses two methods for optimization:
Firstly, RELP uses Quantization#Vector Quantization for the LPCs. This greatly reduces the number of bits transmitted for the coefficients, which are now only an index into a codebook. The vectors have the additional benefit of capturing the correlations between coefficients, making the compression even more efficient.
Secondly, the excitation signal is not parameterized by Fundamental Frequency (F0), gain and voiced/unvoiced decisions, but transmitted as a highly compressed form of the original signal. The compression works by [[Low-pass filtering]] the signal to the frequency \(f_s\).

It is then subsampled at a rate of \(\frac{f_s}{r}\).

For the decompression, the signal is upsampled by simply inserting zeros between the samples, resulting in a spectrum with replicas of the original low-passed spectrum.

This is of course not the original spectrum: the replicas put too much energy into the upper frequencies, creating a metallic sound. But since voiced sounds have a natural spectral decay of \(\frac{1}{f}\), the LPC vocal tract filter will dampen most of the error in the upper frequencies, producing better quality than the fully parameterized LPC coding.
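
A rough sketch of this residual compression step in Python, assuming an FIR low-pass and illustrative parameters \(r = 4\) and 63 taps; a real coder would also quantize the kept samples and rescale after zero insertion:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def relp_residual_codec(residual, r=4, numtaps=63):
    """Low-pass, subsample by r, then zero-stuff back to the original rate."""
    lp = firwin(numtaps, 1.0 / r)      # cutoff at the new Nyquist frequency
    low = lfilter(lp, 1.0, residual)   # band-limit before subsampling
    kept = low[::r]                    # transmit only every r-th sample
    up = np.zeros(len(residual))
    up[::r] = kept                     # zero insertion -> spectral replicas
    return up

res = np.random.default_rng(1).standard_normal(320)  # stand-in excitation frame
rec = relp_residual_codec(res)
```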
Codebook Excited Linear Prediction¶

CELP uses Quantization#Vector Quantization for both the LPCs and the time domain residuals. To do so, the input signal is first windowed into frames, on which LPC analysis is performed. Then, the excitation signals in the codebook are all filtered by the LPC synthesis filter, and the entry that is perceptually closest to the original frame is chosen. Finally, the indices for both the LPCs and the excitation signal are transmitted.
Variants of CELP, such as ACELP, are employed in GSM codecs.
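
A minimal analysis-by-synthesis sketch of the codebook search, with a random toy codebook and an illustrative 2nd-order \(A(z)\); a real CELP coder would use a perceptually weighted error instead of the plain squared error:

```python
import numpy as np
from scipy.signal import lfilter

def celp_search(target, codebook, a):
    """Run every candidate excitation through the LPC synthesis filter 1/A(z)
    and keep the codebook index with the smallest squared error."""
    best_i, best_err = 0, np.inf
    for i, exc in enumerate(codebook):
        synth = lfilter([1.0], a, exc)                   # synthesis filter 1/A(z)
        g = synth @ target / max(synth @ synth, 1e-12)   # least-squares gain
        err = np.sum((target - g * synth) ** 2)
        if err < best_err:
            best_i, best_err = i, err
    return best_i

rng = np.random.default_rng(2)
codebook = rng.standard_normal((64, 40))  # 64 stochastic excitation vectors
a = np.array([1.0, -1.2, 0.8])            # toy A(z) with poles inside the unit circle
target = lfilter([1.0], a, rng.standard_normal(40))  # stand-in speech frame
print("chosen index:", celp_search(target, codebook, a))
```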
Perceptual Coding¶

In human perception, when a frequency excites our auditory nerves, close frequencies are also excited. This has the effect that noise close to the excited frequency is barely audible or not audible at all. Quantization#Adaptive Quantization schemes can make use of this by adapting the step size so that quantization noise accumulates close to high-energy frequencies of the original signal. The noise is then masked by those frequencies. The same idea is employed in #Noise Shaping. The MP3 and AAC codecs make use of this property, which revolutionized the way music could be transmitted, stored and played.
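
A toy illustration of the idea, standing in for a real psychoacoustic model: the per-bin quantizer step grows with a smoothed spectral envelope, so the quantization noise piles up where the signal is strong and stays tiny elsewhere:

```python
import numpy as np

# strong tone at 1 kHz (bin 128); everything else is near-silence
x = np.sin(2 * np.pi * 1000 * np.arange(1024) / 8000)
X = np.fft.rfft(x)

# crude stand-in for a masking model: allow larger quantization steps
# (i.e. more noise) near high-energy frequencies
envelope = np.convolve(np.abs(X), np.ones(9) / 9, mode="same")
step = 1e-3 + 0.05 * envelope
X_q = step * np.round(X / step)   # per-bin adaptive quantization

noise = np.abs(X - X_q)
print("noise near the tone:    ", noise[120:136].max())  # large, but masked
print("noise far from the tone:", noise[400:416].max())  # tiny
```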