Power Spectral Density (PSD)

The PSD is the spectrum of a signal with squared magnitudes. It can be seen as the Fourier transform of the signal's autocorrelation function. The autocorrelation shifts a signal against itself, continuously measuring its self-similarity; it has higher energy when the shift matches a period length contained in the signal. A subsequent Fourier transform then reflects the energy of the signal at the corresponding frequencies. Compared to a direct Fourier transform, this method is more robust against noise, where the direct transform would run into infinite integration problems. ^32856
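This relation (the Wiener-Khinchin theorem) can be checked numerically. A minimal numpy sketch, where the test signal and all variable names are our own illustration:

```python
import numpy as np

# For a discrete signal, the PSD obtained as the DFT of the circular
# autocorrelation equals the squared magnitude spectrum (Wiener-Khinchin).
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 5 * np.arange(256) / 256) + 0.1 * rng.standard_normal(256)

# Direct route: squared magnitude of the DFT
psd_direct = np.abs(np.fft.fft(x)) ** 2

# Autocorrelation route: circular autocorrelation via the inverse DFT,
# then the forward DFT of that autocorrelation
acf = np.fft.ifft(np.abs(np.fft.fft(x)) ** 2).real
psd_from_acf = np.fft.fft(acf).real

print(np.allclose(psd_direct, psd_from_acf))  # True: both routes agree
```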

Yin-Algorithm

Instead of the autocorrelation function, a difference approach can also be used:

Pasted image 20240423133525.png

This approach yields the same results as the autocorrelation function, yet is less sensitive to changes in signal power. @decheveigneYINFundamentalFrequency2002
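The difference function \(d(\tau)\) can be sketched as follows (our own minimal illustration of the idea, not the paper's implementation; the full Yin algorithm additionally uses a cumulative mean normalization and a threshold):

```python
import numpy as np

# d(tau): sum of squared differences between the signal and a tau-shifted
# copy; minima of d(tau) indicate candidate period lengths.
def difference_function(x, max_lag):
    w = len(x) - max_lag                  # fixed comparison window length
    return np.array([np.sum((x[:w] - x[tau:tau + w]) ** 2)
                     for tau in range(max_lag)])

fs = 8000
x = np.sin(2 * np.pi * 200 * np.arange(1024) / fs)  # 200 Hz tone
d = difference_function(x, max_lag=100)
# the first non-trivial lag where d is (numerically) zero is the period
period = 1 + int(np.flatnonzero(d[1:] < 1e-6)[0])
print(period)  # 40 samples = fs / 200
```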

Periodogram

The periodogram is an estimate of the PSD of a signal. The PSD is defined for the continuous-time representation of a signal, but for processing we need a discrete estimate. It can be calculated using Bartlett's method.

Bartlett's Method

A simple method to calculate the periodogram of a discrete-time signal. As it averages over segments, it has a lower variance than directly computing the periodogram as the squared spectrum.

The method involves four steps:

  1. Segment signal into \(k\) non-overlapping frames of length \(M\).
  2. Apply Fourier Transformation#Discrete Fourier Transform.
  3. Square each frame and divide by \(M\).
  4. Average periodograms over all \(k\) frames.
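The four steps above can be sketched in numpy (a minimal illustration; the function name, test signal, and parameters are our own):

```python
import numpy as np

# Bartlett's method: non-overlapping frames, DFT, squared magnitude / M,
# then averaging over the k frames.
def bartlett_psd(x, m):
    k = len(x) // m                          # number of non-overlapping frames
    frames = x[:k * m].reshape(k, m)         # step 1: segmentation
    spectra = np.fft.fft(frames, axis=1)     # step 2: DFT per frame
    periodograms = np.abs(spectra) ** 2 / m  # step 3: square, divide by M
    return periodograms.mean(axis=0)         # step 4: average over k frames

rng = np.random.default_rng(1)
noise = rng.standard_normal(4096)  # unit-variance white noise: flat PSD
psd = bartlett_psd(noise, m=256)
print(psd.mean())  # close to 1, the variance of the white noise
```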

Welch's Method

Welch's method builds on top of #Bartlett's method. It reduces the noise of the periodogram in exchange for a lower frequency resolution.

  1. Segment signal into \(L\) overlapping segments of length \(M\) and with overlap \(D\).
  2. Apply Analysis Window to each frame.
  3. Apply Fourier Transformation#Discrete Fourier Transform.
  4. Square each frame and divide by \(M\).
  5. Average periodograms over all \(L\) frames. Compared to Bartlett's method, the overlap and the analysis window help with noise reduction.
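A matching sketch of Welch's method, using a Hann analysis window as one common choice (the window and the power normalization by \(\sum w^{2}\) are our additions so that the white-noise level comes out comparable to the Bartlett case):

```python
import numpy as np

# Welch's method: overlapping frames, analysis window, DFT, normalization,
# averaging. Normalizing by the window power keeps the PSD level unbiased.
def welch_psd(x, m, d):
    hop = m - d                         # frame advance given overlap d
    window = np.hanning(m)              # step 2: analysis window (Hann)
    starts = range(0, len(x) - m + 1, hop)
    frames = np.array([x[s:s + m] * window for s in starts])   # step 1 + 2
    spectra = np.fft.fft(frames, axis=1)                       # step 3
    periodograms = np.abs(spectra) ** 2 / np.sum(window ** 2)  # step 4
    return periodograms.mean(axis=0)                           # step 5

rng = np.random.default_rng(2)
noise = rng.standard_normal(4096)
psd = welch_psd(noise, m=256, d=128)  # 50% overlap
print(psd.mean())  # again close to 1 for unit-variance white noise
```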

Estimation for Speech Processing

In the context of Speech Enhancement#Single Channel Speech Enhancement, we are required to estimate the PSDs of speech and noise based on our noisy observations.

Noise PSD from Voice Activity Detection

One straightforward approach to estimate the noise PSD of a noisy speech signal is to use voice activity detection. Based on the detector's output, we assign each sample of our observation to a speech-absent or a speech-present group. We can then use the variance of the samples in the speech-absent group as an estimate of the noise variance.

\[ \hat{\sigma}_{n}^{2}=\frac{1}{|\mathcal{N}|}\sum\limits_{n\in\mathcal{N}}|y(n)|^{2} \]

where \(\mathcal{N}\) is the set of samples in the speech-absent group. This kind of estimation is simple, but has several issues:

  • Cannot deal with non-stationary noise (e.g. a car driving by)
  • Relies on a good speech-presence threshold
    • Underestimating the threshold leads to bad noise reduction
    • Overestimating the threshold leads to speech leaking into the noise PSD, thus distorting speech in our estimate.
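The estimate above can be sketched with an oracle voice activity detector (the labels, signal, and all parameters are synthetic, purely for illustration):

```python
import numpy as np

# VAD-based noise variance estimate: average the squared magnitudes of all
# samples the detector marks as speech-absent (the set N in the formula).
rng = np.random.default_rng(3)
n_samples = 10000
noise = rng.normal(0, 0.5, n_samples)            # true noise variance 0.25
speech_present = np.zeros(n_samples, dtype=bool)
speech_present[4000:6000] = True                 # oracle VAD labels
speech = np.where(speech_present, rng.normal(0, 2.0, n_samples), 0.0)
y = speech + noise                               # noisy observation

speech_absent = ~speech_present                  # the set N
sigma_n2_hat = np.mean(np.abs(y[speech_absent]) ** 2)
print(sigma_n2_hat)  # close to the true noise variance of 0.25
```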

Noise PSD from Minimum Statistics

Taking the minimum of the observed periodogram over a window of ca. 1.5 seconds gives a good estimate of the noise PSD. The assumption is that frequencies with a higher power density belong to the speech signal, while the frequencies with the lowest power density come from the background noise. The estimation involves two steps:

  1. Smoothing the noisy periodogram recursively:
\[ \overline{|Y_{k}(l)|^{2}}=\alpha\,\overline{|Y_{k}(l-1)|^{2}}+(1-\alpha)|Y_{k}(l)|^{2} \]
  2. Perform bias compensation.

The advantage of this method is that it doesn't require a speech detector, so there is no threshold to under- or overestimate. Yet it still has some issues:

  • Cannot deal well with non-stationary noise
  • Reacts to noise changes only with a delay of up to 1.5 seconds
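The two steps can be sketched as follows. The frame data is synthetic, and bias compensation (step 2) is deliberately omitted, so this only illustrates the smoothing plus sliding minimum:

```python
import numpy as np

# Minimum statistics: recursively smooth the noisy periodogram, then take
# the minimum over a sliding window of frames (~1.5 s worth of frames).
def min_stats_noise_psd(periodograms, alpha=0.9, window=100):
    smoothed = np.empty_like(periodograms)
    smoothed[0] = periodograms[0]
    for l in range(1, len(periodograms)):  # step 1: recursive smoothing
        smoothed[l] = alpha * smoothed[l - 1] + (1 - alpha) * periodograms[l]
    # running minimum over the last `window` frames, per frequency bin
    return np.array([smoothed[max(0, l - window + 1):l + 1].min(axis=0)
                     for l in range(len(smoothed))])

# stationary noise floor of 1 with a loud speech burst in frames 100-150
periodograms = np.ones((300, 8))
periodograms[100:150] += 10.0
noise_psd = min_stats_noise_psd(periodograms)
print(noise_psd[149].max())  # stays at the noise floor: the burst is
                             # treated as speech, not as noise
```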

Speech Presence Probability (SPP)

The SPP estimation uses a probabilistic model rather than a binary threshold for speech presence. We define the probability of speech presence and absence using Bayes' theorem:

\[ \begin{align} p(\mathcal{H}_{1}|Y) &= \frac{p(\mathcal{H}_{1},Y)}{p(Y)} \\ &= \frac{p(\mathcal{H}_{1})p(Y|\mathcal{H}_{1})}{p(Y)} \\ &= \frac{p(\mathcal{H}_{1})p(Y|\mathcal{H}_{1})}{p(\mathcal{H}_{0})p(Y|\mathcal{H}_{0})+p(\mathcal{H}_{1})p(Y|\mathcal{H}_{1})} \end{align} \]

If we now make three assumptions:

  1. The prior probabilities are equal: \(p(\mathcal{H}_{0})=p(\mathcal{H}_{1})=\frac{1}{2}\).
  2. Noise and speech are zero-mean Gaussians: \(p(Y|\mathcal{H}_{0})=\mathcal{N}(0;\sigma_{N}^{2})\) and \(p(Y|\mathcal{H}_{1})=\mathcal{N}(0;\sigma_{Y}^{2})=\mathcal{N}(0;\sigma_{S}^{2}+\sigma_{N}^{2})\).
  3. When speech is present, we have a fixed SNR \(\xi_{\mathcal{H}_{1}}\approx15\,\text{dB}\).

then we can plug these in and represent the probability as a function of the observed and noise powers:

$$

P(\mathcal{H}_{1}|Y)=\left(1+\frac{P(\mathcal{H}_{0})}{P(\mathcal{H}_{1})}(1+\xi_{\mathcal{H}_{1}})\exp\left(-\frac{|Y|^{2}}{\sigma_{N}^{2}}\frac{\xi_{\mathcal{H}_{1}}}{1+\xi_{\mathcal{H}_{1}}}\right)\right)^{-1}

$$ We can now apply this probability to each frame and each frequency bin to obtain the noise periodogram:

$$

\hat{|N|^{2}} = P(\mathcal{H}_{0}|Y)|Y|^{2} + P(\mathcal{H}_{1}|Y)\hat{\sigma_{N}^{2}}

$$ Since we are working with consecutive frames, we can estimate the noise periodogram for each frame as a weighted average to make it more robust:

$$

\hat{\sigma_{N}^{2}}(l)=\alpha\,\hat{\sigma_{N}^{2}}(l-1)+(1-\alpha)\hat{|N(l)|^{2}}

$$ where, in practice, \(\alpha\approx0.8\) is chosen.
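The SPP equations can be combined into a small per-bin noise tracker. This is a sketch under the assumptions above; the initialization from the first frame and the synthetic test data are our own choices:

```python
import numpy as np

def spp_noise_tracker(periodogram_frames, alpha=0.8):
    """Track the noise PSD per frequency bin with the SPP equations."""
    xi = 10 ** (15 / 10)                     # fixed SNR under H1 (15 dB)
    sigma_n2 = periodogram_frames[0].copy()  # crude init from first frame
    track = []
    for y2 in periodogram_frames:            # y2 = |Y(l)|^2 per bin
        # posterior speech presence probability P(H1 | Y)
        p_h1 = 1.0 / (1.0 + (1 + xi) * np.exp(-y2 / sigma_n2 * xi / (1 + xi)))
        # noise periodogram estimate: mix observation and old estimate
        n2_hat = (1 - p_h1) * y2 + p_h1 * sigma_n2
        # recursive smoothing with alpha ~ 0.8
        sigma_n2 = alpha * sigma_n2 + (1 - alpha) * n2_hat
        track.append(sigma_n2.copy())
    return np.array(track)

rng = np.random.default_rng(4)
frames = rng.exponential(2.0, size=(500, 8))  # noise-only frames, mean 2
track = spp_noise_tracker(frames)
print(track[-1].mean())  # tracks the noise level of 2 (slightly biased low,
                         # since loud noise outliers are taken for speech)
```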

Pasted image 20240624151433.png

Compared to the other methods, SPP tracks non-stationary noise a lot better, especially within frames, and produces fewer artifacts than the #Noise PSD from Minimum Statistics method. It is also computationally simple and efficient to implement.

Speech PSD Estimation

Until now, we only used the noise PSD to estimate the true signal. We can also estimate the PSD of the clean speech, using Estimator#Maximum Likelihood Estimation (MLE). This is done by finding the speech variance \(\sigma_{S}^{2,ML}\) that maximizes the probability of our observations \(Y\) across all \(L\) frames: $$

\sigma_{S}^{2,ML}=\arg\max_{\sigma_{S}^{2}}\prod\limits_{n=0}^{L-1}p(Y(l-n)|\sigma_{S}^{2})

$$ By assuming zero-mean Gaussians for noise and speech, we can solve this problem by setting the derivative to zero: $$

\sigma_{S}^{2,ML}=\frac{1}{L}\sum\limits_{n=0}^{L-1}|Y(l-n)|^{2}-\sigma_{N}^{2}(l)

$$ If we only had a single frame, this would be the same as Speech Enhancement#Spectral Subtraction. However, we now use the average across multiple frames. The problem now is that we only have a single estimate for the clean speech PSD and are very limited to stationary signals.

Again, we can use smoothing across multiple frames to improve our estimate. This approach is called _decision directed_: $$

\hat{\sigma_{S}^{2}}(l)=\alpha_{dd}|\hat{S}(l-1)|^{2}+(1-\alpha_{dd})\max\left(0,|Y(l)|^{2}-\sigma_{N}^{2}(l)\right)

$$
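A sketch of the decision-directed recursion. As a simplification, we reuse the previous PSD estimate in place of the enhanced-speech power \(|\hat{S}(l-1)|^{2}\); the value of \(\alpha_{dd}\) and the synthetic data are likewise our own choices:

```python
import numpy as np

def decision_directed_psd(y2_frames, sigma_n2, alpha_dd=0.98):
    """Decision-directed speech PSD estimate, one value per frame and bin."""
    # the spectral-subtraction term max(0, |Y|^2 - sigma_N^2) is blended
    # with the previous frame's speech power estimate
    sigma_s2 = np.maximum(y2_frames[0] - sigma_n2, 0.0)  # init, our choice
    track = []
    for y2 in y2_frames:
        sigma_s2 = (alpha_dd * sigma_s2
                    + (1 - alpha_dd) * np.maximum(y2 - sigma_n2, 0.0))
        track.append(sigma_s2.copy())
    return np.array(track)

rng = np.random.default_rng(5)
# observation periodograms: speech power 4 plus noise power 1 per bin
y2 = rng.exponential(5.0, size=(500, 8))
track = decision_directed_psd(y2, sigma_n2=1.0)
print(track[-1].mean())  # settles near the speech power of 4
```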