
Quantization

A digital signal is represented in binary and must therefore be discrete, not only in time (see [[Sampling]]) but also in amplitude. Unlike sampling, the process of quantization is not lossless.

Quantization is not restricted to the direct time-domain representation of the signal; it is also applied when we deal with [[Source-Filter Model#Linear Prediction]] coefficients. In any case, we are interested in minimizing the number of bits spent (the word length) on the encoding.

Since quantization introduces loss, we try to minimize the error between the original and the quantized signal. The quality measure is the signal-to-noise ratio (SNR), which compares the signal power to the noise power; the higher the SNR, the smaller the quantization error.

An alternative to quantization is [[Speech Coding]], where we encode speech signals in parametric form (e.g. [[Source-Filter Model#Linear Prediction]]). Quantizers, on the other hand, are waveform coders and can encode any kind of signal, although they can be optimized for speech signals. With a high bit rate, they can achieve high-quality sound with audibly imperceptible loss.

Uniform Quantization

Pasted image 20240617105420.png

For the uniform quantizer, all values are mapped onto uniformly spaced quantization levels. The function in the figure above is the quantization characteristic curve. The height of each step is the step size \(\Delta x\). The range of the quantizer is limited to between \(-x_{max}\) and \(x_{max}\). For a word length of \(w\) bits, we have \(k=2^w\) steps, which means that \(x_{max}=\frac{k\Delta x}{2}\).

Typical values range from 12 to 16 bits for speech and 16 to 32 bits for general audio.
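
A minimal sketch of such a mid-rise uniform quantizer in Python/numpy; the function name, parameters, and clipping detail are illustrative assumptions, not from any standard library:

```python
import numpy as np

def uniform_quantize(x, w, x_max):
    """Mid-rise uniform quantizer with k = 2**w levels over [-x_max, x_max]."""
    k = 2 ** w                                           # number of quantization levels
    delta = 2 * x_max / k                                # step size Delta x
    idx = np.floor(np.asarray(x, dtype=float) / delta)   # interval index of each sample
    idx = np.clip(idx, -k // 2, k // 2 - 1)              # limit to the quantizer range
    return (idx + 0.5) * delta                           # reconstruct as the interval midpoint

# Example: quantize a sine wave with w = 8 bits
t = np.linspace(0, 1, 1000)
x = 0.8 * np.sin(2 * np.pi * 5 * t)
x_hat = uniform_quantize(x, w=8, x_max=1.0)
```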

Signal to Noise Ratio (SNR)

To understand the impact of quantization, we derive the SNR. This requires us to assume that the error \(e(n)=\hat{x}(n)-x(n)\) is uncorrelated with the input signal, which holds in the limit of an increasing number of bits \(w\). For large \(w\), the error \(e\) behaves like white noise. As both an audio signal and its quantization error are zero-mean, we can say \(E(xe)=0\).

The SNR is defined as

\[ SNR=\frac{P_x}{P_n} \]

i.e. the ratio of the signal power to the noise power in the reconstructed signal.

Signal Power \(P_x\)

We can compute the power of the signal as the expectation of its square:

\[ P_x=E(x^2)=\int\limits_{-\infty}^{\infty}x^2p_x(x)\,dx \]

We now assume the signal to be uniformly distributed, which means it ranges from \(-x_{max}\) to \(x_{max}\) and its probability density is constant at \(\frac{1}{2x_{max}}\). This is the optimal case for the uniform quantizer. We can now compute the power for a given number of quantization levels \(k=2^w\) and \(x_{max}=\frac{k\Delta x}{2}\):

$$ \begin{align}
P_{x} &= \int\limits_{-x_{max}}^{x_{max}}x^{2}\frac{1}{2x_{max}}dx\\
&= 2\int\limits_{0}^{\frac{k\Delta x}{2}}x^{2}\frac{1}{k\Delta x}dx\\
&= 2\cdot\frac{1}{k\Delta x}\cdot\frac{1}{3}\cdot\frac{k^{3}\Delta x^{3}}{8}\\
&= \frac{k^{2}\Delta x^{2}}{12}
\end{align} $$

Noise Power \(P_n\)

As above, we apply the definition of the power:

\[ P_n=E(e^{2})=\int e(x)^2p_x(x)dx \]

Because the input is uniformly distributed and each interval of width \(\Delta x\) is quantized to its midpoint, the error is zero at \(\frac{\Delta x}{2}\) within every interval and grows linearly towards the interval edges. The error pattern is thus identical in every interval, so we can integrate the squared error over a single interval and multiply by the number of intervals \(k\):

$$ \begin{align}
P_n &= k\int\limits_{0}^{\Delta x}\left(x-\frac{\Delta x}{2}\right)^{2}\frac{1}{k\Delta x}dx\\
&= \frac{1}{3}\frac{1}{\Delta x}\left[\left(x-\frac{\Delta x}{2}\right)^{3}\right]_{0}^{\Delta x}\\
&= \frac{1}{3}\frac{1}{\Delta x}\left[\left(\frac{\Delta x}{2}\right)^{3}+\left(\frac{\Delta x}{2}\right)^{3}\right]\\
&= \frac{\Delta x^{2}}{12}
\end{align} $$

For the SNR, the \(\frac{\Delta x^2}{12}\) cancels and we are left only with \(k^2\):

\[ SNR = \frac{P_x}{P_{n}} = k^{2} = 2^{2w} \]

The noise power relative to the signal power thus decreases (and the SNR grows) exponentially with the number of bits spent.

Decibel Range

We can express the SNR in decibels to better show the audible difference gained for every bit spent:

\[ SNR\Bigg|_{dB}=10\log_{10}\left(2^{2w}\right)\,\text{dB}\approx 6\,\text{dB}\cdot w \]

So audibly, in the reconstructed signal, every bit we spend raises the true part of the signal by about 6 dB relative to the noise. For an audio CD coded at 16 bits, this amounts to an SNR of 96 dB.
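
A quick numerical sanity check of the 6 dB-per-bit rule, assuming a uniformly distributed input that fills the full quantizer range; `uniform_quantize` is the hypothetical helper sketched above, repeated here so the snippet runs on its own:

```python
import numpy as np

def uniform_quantize(x, w, x_max):
    k = 2 ** w
    delta = 2 * x_max / k
    idx = np.clip(np.floor(x / delta), -k // 2, k // 2 - 1)
    return (idx + 0.5) * delta

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100_000)              # uniformly distributed test signal

for w in (8, 12, 16):
    e = uniform_quantize(x, w, x_max=1.0) - x    # quantization error
    snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean(e ** 2))
    print(w, round(snr_db, 1))                   # roughly 6 dB * w: ~48, ~72, ~96 dB
```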

Non-uniform Quantization

The last section made two strong assumptions:

  1. The input signal was assumed to be uniformly distributed.
  2. The input signal was assumed to exactly cover the range from \(-x_{max}\) to \(x_{max}\). However, speech signals are not uniformly distributed and may occupy a range well within or beyond these limits.

We could define a quantizer with a non-uniform quantization characteristic, but it is simpler to first compress the signal towards a uniform distribution within the range and then use our uniform quantizer:

Pasted image 20240617121503.png

This of course requires the receiver to decompress the signal, using an Expander.

A linear compression already yields a distribution that is closer to uniform, but we still have a lot more detail around 0 and less farther away from 0.

Pasted image 20240617121442.png

A-Law

Pasted image 20240617123728.png

A logarithmic compression is better suited for audio signals, and we can constrain it to the \(x_{max}\) range:

\[ C(x)=c_1\log\frac{x}{x_{max}}+x_{max} \]

However, the logarithm goes to negative infinity as the signal approaches zero. We can use the A-law compander to circumvent this problem. This function has a linear definition for values close to zero and a logarithmic definition farther away from zero, with a smooth transition between the two.

$$ C(x)=\begin{cases}
\text{sign}(x)\dfrac{1+\log(A|x|)}{1+\log(A)} & \frac{1}{A}\le|x|\le1\\
\text{sign}(x)\dfrac{A|x|}{1+\log(A)} & |x|<\frac{1}{A}
\end{cases} $$

The sign operator mirrors the encoding anti-symmetrically.

The European ISDN standard for telephony uses \(A=87.6\).
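
A possible sketch of the A-law compressor and its matching expander in Python/numpy, assuming signals normalized to \(|x|\le1\) and the ISDN value \(A=87.6\); the function names are illustrative:

```python
import numpy as np

A = 87.6  # European ISDN standard

def a_law_compress(x, A=A):
    """A-law compressor for signals normalized to |x| <= 1."""
    ax = np.abs(x)
    # np.where evaluates both branches, so guard the log against ax == 0.
    y = np.where(ax < 1 / A,
                 A * ax,                                 # linear region near zero
                 1 + np.log(np.maximum(A * ax, 1e-12)))  # logarithmic region
    return np.sign(x) * y / (1 + np.log(A))

def a_law_expand(y, A=A):
    """Inverse of a_law_compress (the expander on the receiver side)."""
    ay = np.abs(y)
    x = np.where(ay < 1 / (1 + np.log(A)),
                 ay * (1 + np.log(A)) / A,               # inverse of the linear region
                 np.exp(ay * (1 + np.log(A)) - 1) / A)   # inverse of the logarithmic region
    return np.sign(y) * x
```

The sender compresses and then quantizes uniformly; the receiver dequantizes and expands.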

Pasted image 20240617123756.png

The A-law compander with 8-bit quantization gives us performance similar to 12-bit quantization without the compander, at least for values within Range II. Beyond that, the SNR plateaus.

Adaptive Quantization

With adaptive quantization, we choose \(x_{max}\) (and thus the step size) differently for each frame, while keeping the number of quantization levels constant. This way, we get less noise where the power of the input signal is lower. It requires keeping track of the step size for each frame, given by the estimated standard deviation and a chosen scaling constant:

\[ \Delta x(n)=c\hat{\sigma_x}(n) \]

Our signal can then be reconstructed from the quantized signal by:

\[ \hat{x}(n)=\text{sign}(x(n))\,Z(n)\,\frac{\Delta x(n)}{2} \]

where \(Z \in \{1,3,5,\ldots\}\) (giving us the step-wise curve). The receiver must know the estimated standard deviation; this can be obtained in two ways.
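
A minimal sketch of this adaptive quantizer in Python/numpy, assuming a word length of \(w\) bits and folding the sign into a signed, odd-valued level \(Z\); the names and the clipping detail are illustrative assumptions:

```python
import numpy as np

def adaptive_quantize(frame, sigma_hat, w, c=1.0):
    """Quantize one frame with a step size tied to the estimated std. deviation."""
    k = 2 ** w                                             # number of quantization levels
    delta = max(c * sigma_hat, 1e-12)                      # Delta x(n) = c * sigma_hat(n), kept > 0
    z = 2 * np.floor(np.asarray(frame, dtype=float) / delta) + 1
    z = np.clip(z, -(k - 1), k - 1)                        # odd levels ..., -3, -1, 1, 3, ... (sign folded in)
    return z * delta / 2                                   # x_hat(n) = Z(n) * Delta x(n) / 2
```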

Adaptive Quantization Forward (AQF)

Pasted image 20240617151408.png

In AQF, we estimate the variance over a sliding window of length \(N\) by buffering \(N\) samples and computing the signal power and standard deviation. We then adapt the step size according to the standard deviation and transmit both the step size and the quantized values.

\[ \hat{\sigma^2_x}(n)=\frac{1}{N}\sum\limits_{m=0}^{N-1}x^2(n+m) \]

The disadvantage of this process is that we have to transmit extra bits so the receiver knows what step size to apply when decoding.
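
A sketch of the forward adaptation, simplified to non-overlapping frames of length \(N\) and reusing the hypothetical `adaptive_quantize` helper from above; in AQF, the step sizes collected in `side_info` would have to be transmitted alongside the quantized values:

```python
import numpy as np

def aqf_encode(x, N, w, c=1.0):
    """Forward-adaptive quantization: estimate sigma per frame from the input itself."""
    x = np.asarray(x, dtype=float)
    x_hat = np.zeros_like(x)
    side_info = []                                   # one step size per frame to transmit
    for start in range(0, len(x) - N + 1, N):
        frame = x[start:start + N]
        sigma_hat = np.sqrt(np.mean(frame ** 2))     # power estimate over the buffered frame
        side_info.append(c * sigma_hat)              # Delta x for this frame (side information)
        x_hat[start:start + N] = adaptive_quantize(frame, sigma_hat, w, c)
    return x_hat, side_info
```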

Adaptive Quantization Backward (AQB)

Pasted image 20240617151845.png

To reduce the number of bits transmitted, we can use AQB. Here, we don't calculate the standard deviation from the input signal directly, but recursively from the previously quantized samples.

\[ \hat{\sigma}^2_x(n)=\alpha\hat{\sigma}^2_x(n-1)+(1-\alpha)\hat{x}^2(n-1),\quad0<\alpha<1 \]

This has the advantage that no extra bits for the step size need to be transmitted. We only use past quantized samples, which are also accessible to the receiver.
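
A corresponding sketch of the backward adaptation, again reusing the hypothetical `adaptive_quantize` helper; the variance is updated recursively from the already quantized samples, so the receiver can mirror the exact same recursion:

```python
import numpy as np

def aqb_encode(x, w, alpha=0.9, c=1.0):
    """Backward-adaptive quantization: estimate sigma from past quantized samples."""
    x = np.asarray(x, dtype=float)
    x_hat = np.zeros_like(x)
    var_hat = 1.0                                    # initial estimate, fixed by convention
    for n in range(len(x)):
        sigma_hat = np.sqrt(var_hat)
        x_hat[n] = adaptive_quantize(x[n:n + 1], sigma_hat, w, c)[0]
        # Update from the *quantized* sample only, so the receiver can run
        # the identical recursion without any transmitted side information.
        var_hat = alpha * var_hat + (1 - alpha) * x_hat[n] ** 2
    return x_hat
```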

Comparison

Pasted image 20240617152601.png

The first plot shows the signal power over time.

From the second plot, we see that for the #Uniform Quantization, the SNR degrades in direct proportion to the signal power. The #A-Law improves on this, due to its higher resolution in the low-energy parts of the signal.

In the third plot, we see the #Adaptive Quantization Forward (AQF), which shows no significant dips in SNR thanks to its step-size adaptation for low-energy parts of the signal.

Vector Quantization

In general, we often deal with multidimensional signals, where we have multiple data points per frame/sample (e.g. LPCs, cepstral coefficients, images), represented as a vector. Each vector could represent a block of time-domain samples, a block of LPCs, or similar.

Quantization in this sample space works similarly to the one-dimensional case. We subdivide the space into discrete volumes, so-called Voronoi cells. Each cell has a centroid, and a vector is quantized by assigning it to the nearest centroid. By indexing the centroids, we only need to send the index of the centroid to the receiver instead of the whole vector. This represents a trade-off between transmission size and memory usage, since both sender and receiver must store the codebook of centroids.

Pasted image 20240617171825.png

The subdivision is given by how the centroids are spread (a). If they are placed in a uniform pattern, we get a lattice of cells, but centroids can also be distributed non-uniformly if that better fits the topology of the sample space (b). For \(k\) centroids, we require \(w=\log_{2}k\) bits. The index of the best-fitting cell is found via \(i_{opt}= \mathop{\arg \min\limits_{i}}\space d(x,\hat{x}_i)\), where \(d\) is some distance metric such as the Euclidean distance and \(\hat{x}_i\) are the centroids.
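
A minimal sketch of the nearest-centroid search and the corresponding reconstruction in Python/numpy, assuming a given codebook of centroids; the function names are illustrative:

```python
import numpy as np

def vq_encode(vectors, codebook):
    """Map each input vector to the index of its nearest centroid (Euclidean distance)."""
    # vectors: (num_vectors, dim), codebook: (k, dim)
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)              # i_opt per vector; needs log2(k) bits each

def vq_decode(indices, codebook):
    """Reconstruct each vector as the centroid it was assigned to."""
    return codebook[indices]
```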

Pasted image 20240617172725.png

Two successive samples of a speech signal can be modeled as a two-dimensional variable. When we find centroids through k-means clustering, we observe two things:

  1. Most samples are around the origin, which is expected as speech signals are zero mean.
  2. There is a linear relation between successive samples. This means that quantizing successive samples jointly retains more information than quantizing each sample on its own, so we can achieve better quality with the same number of bits (see the sketch after this list).
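
A small illustration of this idea, with a plain (hypothetical) k-means helper and Gaussian noise as a stand-in for real speech; with an actual speech signal, the resulting centroids would concentrate near the origin and along the diagonal, reflecting the correlation between successive samples:

```python
import numpy as np

def kmeans(data, k, iters=50, seed=0):
    """Plain k-means: returns k centroids for the given data points."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid ...
        idx = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)
        # ... then move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = data[idx == j].mean(axis=0)
    return centroids

# Pair up successive samples (Gaussian noise as a stand-in for a speech signal).
x = np.random.default_rng(1).standard_normal(10_000)
pairs = np.stack([x[:-1], x[1:]], axis=1)        # shape (N-1, 2): pairs (x(n), x(n+1))
codebook = kmeans(pairs, k=16)                   # 16 centroids -> 4 bits per pair
```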