Speech Enhancement

In communication applications (telecommunication, hearing aids, …), we are often interested in amplifying the speech in a signal while reducing the noise. One challenge is that simply boosting soft sounds also amplifies the noise, which can decrease the SNR. Therefore, we want to reduce the noise and isolate the speech as well as possible.

Spatial Processing

With multiple microphones, a sound reaches each microphone at a different time, depending on the relative positions of source and microphones. We can exploit the delay with which a sound reaches one microphone compared to another, much like humans do with their two ears. The MMSE-optimal enhancement results from combining spatial processing with single-channel postfilters.

Delay and Sum Beamformer

Pasted image 20240618132558.png

In the delay-and-sum beamformer, we shift the signal of one of the microphones by \(\tau\) and add it to the signal of the other microphone. Signals that reach the second microphone with a delay of \(\tau\) add up constructively; signals arriving with other delays add up partly destructively and are attenuated. This way, we can effectively point our attention in a direction just by choosing \(\tau\).
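The steering effect can be illustrated with a minimal two-microphone sketch. It assumes integer sample delays and a sinusoidal test source with a period of 16 samples; the helper names (`delay_and_sum`, `two_mic_recording`, `rms`) and all numbers are hypothetical choices for illustration, not part of the original notes.

```python
import math

def delay_and_sum(mic1, mic2, delay):
    """Shift mic1 by `delay` samples and average it with mic2.

    A source that reaches mic2 `delay` samples later than mic1 is
    time-aligned by the shift and therefore adds constructively.
    """
    out = []
    for n in range(len(mic2)):
        # Samples shifted in from before the recording start are treated as 0.
        shifted = mic1[n - delay] if 0 <= n - delay < len(mic1) else 0.0
        out.append(0.5 * (shifted + mic2[n]))
    return out

def rms(x):
    """Root-mean-square level of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))

# Assumed test source: sinusoid with a period of 16 samples.
PERIOD = 16
N_SAMPLES = 320
PAD = 16  # extra samples so that mic2 can lag behind mic1
source = [math.sin(2 * math.pi * n / PERIOD) for n in range(N_SAMPLES + PAD)]

def two_mic_recording(arrival_delay):
    """Return (mic1, mic2) where mic2 hears the source `arrival_delay` samples later."""
    mic1 = source[PAD:PAD + N_SAMPLES]
    mic2 = source[PAD - arrival_delay:PAD - arrival_delay + N_SAMPLES]
    return mic1, mic2

tau = 2  # steering delay in samples

# Source from the steered direction: arrival delay equals tau.
target_out = delay_and_sum(*two_mic_recording(2), tau)
# Source from another direction: arrival delay differs from tau by half a period.
interferer_out = delay_and_sum(*two_mic_recording(10), tau)

print("target level:    ", round(rms(target_out), 3))
print("interferer level:", round(rms(interferer_out), 3))
```

In this setup the interferer is cancelled almost completely only because its delay differs from \(\tau\) by exactly half a period; for other delays or frequencies the attenuation is partial.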

Minimum Variance Distortionless Response

For the MVDR beamformer, we additionally make use of prior knowledge about the direction of the noise source. Sounds from the desired direction are passed with a high (undistorted) gain, while sounds from the direction of the noise source are attenuated. If the noise is very diffuse, we can use a superdirective beamformer, which applies a gain only to a very limited range of \(\tau\).
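As a sketch of how this can look for a single frequency bin and two microphones, the standard MVDR weights are \(w = R_n^{-1}d / (d^{H}R_n^{-1}d)\), where \(d\) is the steering vector of the desired direction and \(R_n\) the noise covariance matrix. The notes do not give this formula or any numbers; the 1 kHz frequency, the delays, the noise model (one directional interferer plus weak sensor noise), and all function names below are assumptions for illustration.

```python
import cmath
import math

def steering_vector(omega, tau):
    """Two-microphone steering vector for an inter-microphone delay tau (seconds)."""
    return [1.0 + 0j, cmath.exp(-1j * omega * tau)]

def inv2(m):
    """Inverse of a 2x2 complex matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Product of a 2x2 matrix with a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def inner(x, y):
    """Hermitian inner product x^H y."""
    return sum(xi.conjugate() * yi for xi, yi in zip(x, y))

def mvdr_weights(noise_cov, d):
    """w = R^{-1} d / (d^H R^{-1} d): unit gain towards d, minimum noise power."""
    rinv_d = matvec(inv2(noise_cov), d)
    denom = inner(d, rinv_d)
    return [x / denom for x in rinv_d]

# Assumed scenario: 1 kHz bin, target at zero delay, interferer at 0.25 ms delay.
omega = 2 * math.pi * 1000.0
d_target = steering_vector(omega, 0.0)
d_noise = steering_vector(omega, 0.25e-3)

# Assumed noise covariance: directional interferer (power 1) plus sensor noise (power 0.01).
sensor_noise = 0.01
noise_cov = [[d_noise[i] * d_noise[k].conjugate() + (sensor_noise if i == k else 0.0)
              for k in range(2)] for i in range(2)]

w_mvdr = mvdr_weights(noise_cov, d_target)
w_das = [0.5 * x for x in d_target]  # delay-and-sum weights for comparison

print("MVDR gain, target:     ", abs(inner(w_mvdr, d_target)))
print("MVDR gain, interferer: ", abs(inner(w_mvdr, d_noise)))
print("DAS gain, interferer:  ", abs(inner(w_das, d_noise)))
```

Under these assumptions the MVDR weights keep the gain towards the target at one while attenuating the interferer far more strongly than the delay-and-sum weights do.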

Single Channel Speech Enhancement

When no spatial information is available, or when we want to improve the signal further anyway, we can employ single-channel speech enhancement. These techniques exploit the statistical properties of speech signals to improve the sound quality.

Pasted image 20240618182058.png

Speech enhancement is typically performed in the spectral domain on the time-windowed input signal, i.e. on its Short Time Fourier Transform (STFT), which is computed with the Fast Fourier Transform (FFT). Tapered analysis windows are typically used to reduce spectral leakage. The enhancement is performed by applying a filter to the spectral representation of each time frame, transforming the result back to the time domain with the inverse FFT, and recombining the time frames using overlap-add.
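The processing chain described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it uses a naive DFT instead of an FFT to stay dependency-free, and it assumes a square-root Hann window for both analysis and synthesis with 50% overlap, so that the overlapping windows sum to one. The function names (`dft`, `idft`, `enhance`) and the `gain_fn` callback are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse DFT; returns the real part of the time-domain frame."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def enhance(signal, gain_fn, frame_len=32):
    """Window, transform, filter, inverse transform, and overlap-add.

    gain_fn maps the spectrum of one frame to a list of per-bin gains.
    """
    hop = frame_len // 2
    # Square-root periodic Hann window, applied at analysis and synthesis.
    window = [math.sqrt(0.5 * (1 - math.cos(2 * math.pi * n / frame_len)))
              for n in range(frame_len)]
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = dft(frame)
        gains = gain_fn(spectrum)
        filtered = idft([g * y for g, y in zip(gains, spectrum)])
        for n in range(frame_len):
            out[start + n] += filtered[n] * window[n]  # overlap-add
    return out

# With unit gain in every bin, the chain reproduces the input
# (except at the edges, where only one frame contributes).
signal = [math.sin(2 * math.pi * 3 * n / 128) for n in range(128)]
reconstructed = enhance(signal, lambda spec: [1.0] * len(spec))
max_error = max(abs(a - b) for a, b in zip(signal[16:112], reconstructed[16:112]))
print("max reconstruction error:", max_error)
```

A real enhancement filter would replace the unit-gain callback with per-bin gains derived from noise and signal estimates.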

In the following sections, two filters are presented. The Wiener filter requires us to know the characteristics of the noise and the true signal very well and then gives us the optimal estimate of the true signal from our observations. The other filter, based on spectral subtraction, does not require any knowledge about the true signal. It only requires the observation and the noise we want to filter out, and then gives us a tuneable filter.

Both filters require estimates of the noise and true signal PSDs; their estimation is presented in the note Power Spectral Density (PSD), section Estimation for Speech Processing.

Wiener Filter

See the Wiener Filter note (block ^50b851).

Spectral Subtraction

In the last step of the overview for the Wiener filter, we subtract the PSDs of observation and noise to get an estimate of the true signal PSD. But since we don't know the true PSDs and can only estimate them from finite signal observations, we use the difference of the periodograms (the squared magnitudes of the spectra) as an estimate of the true signal PSD. This technique is called spectral subtraction.

\[ |\hat{S}|^2=|Y|^2-|N|^2 \]

There is one problem with this estimate. Expanding the right-hand side of the equation with \(Y=S+N\) gives:

\[ \begin{align} |\hat{S}|^{2} &= |Y|^{2}-|N|^{2} \\ &= (S+N)(S+N)^{*}-|N|^{2} \\ &= |S|^{2}+|N|^{2}+2\Re(SN^{*})-|N|^{2} \\ &= |S|^{2}+2\Re(SN^{*}) \end{align} \]

So the estimate leaves us with an error term \(2\Re(SN^{*})\), which by itself cannot simply be ignored. But under the assumption that noise and signal are zero-mean and uncorrelated, we can take the expected value on both sides, and the error term vanishes:

\[ \begin{align} \mathbb{E}(|\hat{S}|^{2}) &=\mathbb{E}(|Y|^{2})-\mathbb{E}(|N|^{2}) \\ &=\mathbb{E}(|S|^{2})+\mathbb{E}(2\Re(SN^{*})) \\ &=\mathbb{E}(|S|^{2}) \\ &=\sigma_{S}^{2}\end{align} \]
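A quick numerical check can make this concrete. The sketch below assumes that \(S\) and \(N\) are independent zero-mean complex Gaussian coefficients with variances 4 and 1; the distribution, the variances, the number of trials, and the helper `complex_gaussian` are illustrative assumptions, not taken from the notes.

```python
import random

random.seed(0)

def complex_gaussian(variance):
    """Zero-mean circular complex Gaussian sample with the given variance."""
    std = (variance / 2) ** 0.5
    return complex(random.gauss(0, std), random.gauss(0, std))

sigma_s2 = 4.0   # assumed true signal variance
sigma_n2 = 1.0   # assumed noise variance
trials = 20000

sum_estimate = 0.0
sum_cross = 0.0
largest_cross = 0.0
for _ in range(trials):
    s = complex_gaussian(sigma_s2)
    n = complex_gaussian(sigma_n2)
    y = s + n
    cross = 2 * (s * n.conjugate()).real     # error term 2 Re(S N*)
    sum_estimate += abs(y) ** 2 - abs(n) ** 2  # |Y|^2 - |N|^2
    sum_cross += cross
    largest_cross = max(largest_cross, abs(cross))

mean_estimate = sum_estimate / trials
mean_cross = sum_cross / trials

print("largest single error term:", largest_cross)
print("mean of error term:       ", mean_cross)
print("mean of |Y|^2 - |N|^2:    ", mean_estimate, "(true variance:", sigma_s2, ")")
```

In a single frame the error term can be as large as the signal power itself, which is why the estimate is only reliable after averaging.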

Deriving a Filter

We now have an estimate for the true signal variance using the difference between the observation and noise periodograms. Next, we want to create a filter that gives us this estimate. We can do so by simply rearranging the terms:

\[ \begin{align} \mathbb{E}(|\hat{S}|^{2}) &= \mathbb{E}(|Y|^{2})-\mathbb{E}(|N|^{2}) \\ &= \mathbb{E}(|Y|^{2})\left(1-\frac{\mathbb{E}(|N|^{2})}{\mathbb{E}(|Y|^{2})}\right) && (2)\\ &= \mathbb{E}(|Y|^{2})\left(\frac{\mathbb{E}(|Y|^{2})-\mathbb{E}(|N|^{2})}{\mathbb{E}(|Y|^{2})}\right) \\ &= \mathbb{E}(|Y|^{2})\left(\frac{\mathbb{E}(|S|^{2})}{\mathbb{E}(|Y|^{2})}\right) && (4) \end{align} \]

If we write the estimate as the product of the observation power \(\mathbb{E}(|Y|^{2})\) and a gain factor, we can read off the spectral subtraction filter from step (4):

\[ G=\frac{\sigma_{S}^{2}}{\sigma_{Y}^{2}}=\frac{\sigma_{S}^{2}}{\sigma_{S}^{2}+\sigma_{N}^{2}} \]

The gain looks identical to the Wiener filter, but it is applied to the periodogram of the observation rather than to its spectrum. It is a rather heuristic approach compared to the statistical optimization behind the Wiener filter.

Application

As with the Wiener filter, we have the issue that we don't know the true signal. But looking at step (2) of our filter derivation, we see that we can formulate the filter in terms of the noise and observation periodograms. Therefore, we can formulate the filter in a different way, using the smoothed estimates of noise and observation:

\[ |\hat{S}|^{2}=|Y|^{2}\left(1-\left(\frac{\overline{|N|^{2}}}{\overline{|Y|^{2}}}\right)^{\mu}\right)^{\nu} \]

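where \(\mu\) and \(\nu\) are heuristically tuned parameters. For \(\mu=\nu=1\), the expression reduces to step (2) of the derivation. There is no general rule for finding the optimal parameters; they are tuned empirically.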
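The parametric filter can be sketched as a small gain function applied per frequency bin. The clamping of negative values to zero and the optional `floor` parameter are assumptions added here, because the smoothed noise estimate can exceed the smoothed observation estimate in practice; the function names are hypothetical as well.

```python
def spectral_subtraction_gain(noise_psd, obs_psd, mu=1.0, nu=1.0, floor=0.0):
    """Power gain (1 - (noise/obs)^mu)^nu for one frequency bin.

    Assumption (not in the notes): the bracket is clamped at 0 and the
    result at `floor`, so an overestimated noise PSD cannot produce a
    negative or complex gain.
    """
    ratio = (noise_psd / obs_psd) ** mu
    base = max(1.0 - ratio, 0.0)
    return max(base ** nu, floor)

def enhance_periodogram(obs_periodogram, noise_psd, obs_psd, **params):
    """Apply the gain bin by bin to the observation periodogram |Y|^2."""
    return [p * spectral_subtraction_gain(n, y, **params)
            for p, n, y in zip(obs_periodogram, noise_psd, obs_psd)]

# Example with assumed values: signal variance 4, noise variance 1,
# so the smoothed observation PSD is 5 in every bin.
noise_psd = [1.0, 1.0, 1.0]
obs_psd = [5.0, 5.0, 5.0]
periodogram = [6.0, 4.5, 5.5]   # noisy single-frame periodogram values

plain = enhance_periodogram(periodogram, noise_psd, obs_psd)
aggressive = enhance_periodogram(periodogram, noise_psd, obs_psd, nu=2.0)

print("mu=1, nu=1:", plain)
print("mu=1, nu=2:", aggressive)
```

With \(\mu=\nu=1\) the gain equals \(\sigma_{S}^{2}/(\sigma_{S}^{2}+\sigma_{N}^{2})\); a larger \(\nu\) suppresses more noise at the price of attenuating the signal more strongly.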

Compared to the Wiener filter, spectral subtraction is a more general filter. The Wiener filter requires us to know the characteristics of the noise and the true signal very well and then gives us the optimal estimate. Spectral subtraction does not require any knowledge about the true signal. It only requires the observation and the noise we want to filter out, and then gives us a tuneable filter.