
Yingram


A Yingram is a frequency-domain representation of audio signals, similar to a spectrogram but based on the YIN algorithm. It captures pitch-related information over time, making it particularly effective for representing the Fundamental Frequency (F0), tracking pitch contours, and analyzing intonation in speech signals, disentangled from the linguistic content of the signal.

Compared to tracking only the F0, the Yingram also captures jitter and subharmonics in regions where the F0 is not well defined. The autocorrelation of the signal, which can be computed via the Power Spectral Density (PSD), is evaluated at different time lags to produce multiple candidate F0 values. The Yingram then converts the time-lag axis into a MIDI-scale axis for better controllability.

Computation

The Yingram makes use of the YIN algorithm, which is built on the difference function. This function takes low values at time lags \(\tau\) that match a period of the signal (\(r\) is the auto-correlation function):

\[ d_{t}(\tau)=\sum\limits_{j=1}^{W}(x_{j}-x_{j+\tau})^{2}=r_{t}(0)+r_{t+\tau}(0)-2r_{t}(\tau) \]
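
A minimal NumPy sketch of this step is shown below; the `frame`/`max_lag` arguments and the direct loop are illustrative choices rather than a reference implementation (in practice, the autocorrelation identity above allows an FFT-based computation).

```python
import numpy as np

def difference_function(frame: np.ndarray, max_lag: int) -> np.ndarray:
    """Difference function d_t(tau) for a single analysis frame.

    Computes sum_{j=1}^{W} (x_j - x_{j+tau})^2 for tau = 0 .. max_lag,
    with a fixed window W = len(frame) - max_lag so every lag is valid.
    """
    w = len(frame) - max_lag
    d = np.zeros(max_lag + 1)
    for tau in range(1, max_lag + 1):
        diff = frame[:w] - frame[tau:tau + w]
        d[tau] = np.dot(diff, diff)   # sum of squared differences
    return d
```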

The difference function is then used to compute the Cumulative Mean Normalized Difference (CMND) \(d^{\prime}\):

\[ d^{\prime}_{t}(\tau) = \begin{cases} 1, & \tau = 0,\\ d_{t}(\tau)\Big/\left[\dfrac{1}{\tau}\sum\limits_{j=1}^{\tau}d_{t}(j)\right], & \text{otherwise.} \end{cases} \]
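
Assuming the `difference_function` sketch above, the CMND can be computed with a cumulative sum; the small epsilon guarding against division by zero is an implementation detail, not part of the definition.

```python
def cmnd(d: np.ndarray) -> np.ndarray:
    """Cumulative mean normalized difference d'_t(tau).

    d'(0) = 1; for tau > 0, d(tau) is divided by the running mean of d
    over lags 1..tau, i.e. d'(tau) = d(tau) * tau / sum_{j=1}^{tau} d(j).
    """
    d_prime = np.ones_like(d, dtype=float)
    running_sum = np.cumsum(d[1:])        # sum_{j=1}^{tau} d(j)
    taus = np.arange(1, len(d))
    d_prime[1:] = d[1:] * taus / np.maximum(running_sum, 1e-12)
    return d_prime
```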

After some post-processing steps, multiple candidate F0s are produced, from which the YIN algorithm selects a single F0 as its output. However, the CMND itself contains rich information about pitch and harmonics, which makes it a useful feature for deep learning tasks.
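
For contrast, a simplified sketch of the classic YIN decision step is shown below; the full algorithm also searches for the local minimum following the threshold crossing and refines it with parabolic interpolation, and the threshold value here is an illustrative assumption.

```python
def yin_f0(d_prime: np.ndarray, sample_rate: float, threshold: float = 0.1) -> float:
    """Pick a single F0 from the CMND: take the first lag that dips below the
    threshold, falling back to the global minimum if no lag qualifies."""
    below = np.where(d_prime[1:] < threshold)[0]
    tau = int(below[0]) + 1 if below.size else int(np.argmin(d_prime[1:])) + 1
    return sample_rate / tau
```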

To improve controllability, the Yingram is defined by converting the time-lag axis of the CMND into a MIDI-scale axis:

\[ Y_{t}(m)=\frac{d^{\prime}_{t}(\lceil c(m)\rceil)-d^{\prime}_{t}(\lfloor c(m)\rfloor)}{\lceil c(m)\rceil-\lfloor c(m)\rfloor}\cdot\left(c(m)-\lfloor c(m)\rfloor\right)+d^{\prime}_{t}(\lfloor c(m)\rfloor), \]

where

\[ c(m)=\frac{\text{sampling rate}}{440\cdot2^{\frac{m-69}{12}} }\]

is the MIDI-to-lag conversion function.
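
Putting the pieces together, a sketch of the MIDI-to-lag conversion and the linear interpolation above follows; the frame length, MIDI range, and bin spacing in the usage example are illustrative assumptions, not fixed choices.

```python
def midi_to_lag(m, sample_rate: float):
    """c(m): convert MIDI note numbers to (fractional) time lags in samples."""
    return sample_rate / (440.0 * 2.0 ** ((m - 69.0) / 12.0))

def yingram_frame(d_prime: np.ndarray, midi_bins: np.ndarray, sample_rate: float) -> np.ndarray:
    """Evaluate Y_t(m) by linearly interpolating the CMND at the fractional lags c(m)."""
    c = midi_to_lag(midi_bins, sample_rate)
    low = np.floor(c).astype(int)
    high = np.ceil(c).astype(int)
    denom = np.where(high > low, high - low, 1)   # avoid 0/0 when c(m) is an integer
    return (d_prime[high] - d_prime[low]) / denom * (c - low) + d_prime[low]

# Example usage (illustrative parameters): a 60 ms frame of a 220 Hz tone at
# 16 kHz, with MIDI bins from note 20 to note 85 at four bins per semitone.
sample_rate = 16000
t = np.arange(int(0.06 * sample_rate)) / sample_rate
frame = np.sin(2 * np.pi * 220.0 * t)
midi_bins = np.arange(20.0, 85.0, 0.25)
max_lag = int(np.ceil(midi_to_lag(midi_bins.min(), sample_rate))) + 1
d_prime = cmnd(difference_function(frame, max_lag))
yingram_t = yingram_frame(d_prime, midi_bins, sample_rate)  # one Yingram column
```

Applying `yingram_frame` to the CMND of every analysis frame and stacking the results over time yields the full Yingram, with one MIDI-scaled pitch axis and one time axis, analogous to a spectrogram.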