
Architecture

![[Pasted image 20241118110020.png]]

The waveform is encoded into self-supervised representations using a content-, speaker- and pitch-encoder (following the triple information bottleneck principle in [[SpeechFlow]] with features as in [[Speech Resynthesis from Discrete Disentangled Self-Supervised Representations]]). These latent representations serve as input to a filter- and source-encoder, which transform them into mel spectrograms: the source-encoder produces mel coefficients related to the glottal excitation, and the filter-encoder produces mel coefficients related to the vocal-tract filter. The source- and filter-representations are then each denoised by one of two decoupled denoising networks. Each network predicts the noise in its input, and at each time-step the sum of the predicted noise is subtracted from the noisy sample of the previous time-step.

The model is basically a mix of an autoencoder and a diffusion model. It uses an encoder-decoder architecture to reconstruct the source-filter priors from the latent content, style and pitch representations; the sum of the source and filter mel coefficients should already be close to the target. The decoupled denoisers then add back the information lost by disentangling the waveform, using a priori knowledge about the distribution of speech mel coefficients.
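
A minimal data-flow sketch of this pipeline (all module names are placeholders, not the actual implementation; the precise SDE sampler for the denoising loop is given under Source-Filter Decoder below):

```python
import torch

def reconstruct(wav, content_enc, pitch_enc, speaker_enc,
                src_enc, ftr_enc, denoiser_src, denoiser_ftr, n_steps=6):
    """Sketch of the overall data flow: disentangle, encode priors, denoise, sum."""
    content = content_enc(wav)                  # linguistic content (XLS-R middle layer)
    pitch = pitch_enc(wav)                      # normalized, vector-quantized F0
    s = speaker_enc(wav)                        # global speaker/style embedding

    z_src = src_enc(pitch, s)                   # mel prior of the glottal excitation
    z_ftr = ftr_enc(content, s)                 # mel prior of the vocal-tract filter

    # start from the noisy priors and iteratively subtract the summed predicted noise
    x_src = z_src + torch.randn_like(z_src)
    x_ftr = z_ftr + torch.randn_like(z_ftr)
    for t in reversed(range(n_steps)):
        noise_hat = denoiser_src(x_src, z_src, s, t) + denoiser_ftr(x_ftr, z_ftr, s, t)
        x_src = x_src - noise_hat
        x_ftr = x_ftr - noise_hat
    return x_src + x_ftr                        # reconstructed mel spectrogram
```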

Source-Filter Encoder

Disentangled Speech Representation

The self-supervised speech representations are extracted using three encoders:

  • Content Encoder: A continuous representation of the linguistic content, taken from the middle layer of Wav2Vec 2.0#XLS-R. The waveform is perturbed so as to remove content-independent information, as in [[Reconstructing speech from self-supervised representations]].
  • Pitch Representation: The fundamental frequency (F0) is extracted using the [[YAPPT Algorithm]] and normalized. The normalized F0 carries the speaker-independent information about intonation (speaking relatively high- or low-pitched). Using a [[VQ-VAE]], the pitch is vector-quantized (Quantization#Vector Quantization), probably because successive pitch values within a single frame are highly correlated.
  • Speaker Representation: The speaker representation is extracted using [[Meta-StyleSpeech]]. The encoder gives a speaker representation for each Mel-spectrogram frame, so the frame-level vectors are averaged per sentence into a global speaker representation (see the sketch after this list).
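
The two post-processing steps above (per-utterance F0 normalization and averaging of the frame-level style vectors) could look roughly like this; a sketch only, not the actual pipeline in front of the F0 [[VQ-VAE]]:

```python
import torch

def normalize_f0(f0):
    """Per-utterance normalization of the voiced log-F0 values (speaker-independent
    intonation). Unvoiced frames (F0 == 0) are left at zero."""
    voiced = f0 > 0
    log_f0 = torch.where(voiced, f0.clamp(min=1e-5).log(), torch.zeros_like(f0))
    mean = log_f0[voiced].mean()
    std = log_f0[voiced].std().clamp(min=1e-5)
    return torch.where(voiced, (log_f0 - mean) / std, torch.zeros_like(f0))

def global_speaker_embedding(frame_styles):
    """Average the per-frame style vectors of shape (frames, dim) into a single
    sentence-level speaker representation."""
    return frame_styles.mean(dim=0)
```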

Source-Filter Representation

The disentangled speech encodings are used to create the source- and filter-representations:

  • Source Encoder \(Z_{src}=E_{src}(pitch,s)\): Takes the pitch and speaker representations and transforms them into the Mel-spectrogram of the excitation signal.
  • Filter Encoder \(Z_{ftr}=E_{ftr}(content,s)\): Takes the content and speaker representations and transforms them into the Mel-spectrogram of the filter response.

The reconstruction loss is the L1 difference between the actual Mel spectrogram and the reconstruction, which is obtained by simply summing the source and filter mel coefficients (Cepstrum):

\[ \mathcal{L}_{rec}=||X_{mel}-(Z_{src}+Z_{ftr})||_{1} \]
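
As a sketch (assuming the mel spectrogram and the two priors are tensors of the same shape):

```python
import torch.nn.functional as F

def reconstruction_loss(x_mel, z_src, z_ftr):
    # L_rec = || X_mel - (Z_src + Z_ftr) ||_1
    return F.l1_loss(z_src + z_ftr, x_mel)
```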

Source-Filter Decoder

For decoding, the decoupled denoisers come into play. The same noise is added to the priors \(Z_{src}\) and \(Z_{ftr}\). Each prior is then given to its own denoiser, which removes the noise with respect to the source or filter signal, respectively.

Forward Process

The forward process follows the one described in #Disentangled Denoising#Forward Process:

\[ \begin{align} dX_{src,t} &= \frac{1}{2}\beta_{t}(Z_{src}-X_{src,t})dt+\sqrt{\beta_{t}}dW_{t} \\ dX_{ftr,t} &= \frac{1}{2}\beta_{t}(Z_{ftr}-X_{ftr,t})dt+\sqrt{\beta_{t}}dW_{t} \end{align} \]

At \(t=0\), the two samples are identical (\(X_{src,0}=X_{ftr,0}=X_{0}\)), but with increasing time-steps the noisy samples diverge on different trajectories, each towards its own prior, with added stochastic noise.
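
A simple Euler-Maruyama simulation of one of these forward SDEs; the noise schedule below is an assumption for illustration, not necessarily the one used in the paper:

```python
import torch

def forward_diffuse(x0, z_prior, betas):
    """Simulate dX_t = 1/2 * beta_t * (Z - X_t) dt + sqrt(beta_t) dW_t from t=0 to t=1.
    Both the source and the filter trajectory run this with their own prior Z."""
    dt = 1.0 / len(betas)
    x = x0
    for beta_t in betas:
        drift = 0.5 * beta_t * (z_prior - x)
        x = x + drift * dt + (beta_t * dt) ** 0.5 * torch.randn_like(x)
    return x

# assumed linear schedule, e.g. beta_t rising from 0.05 to 20.0 (Grad-TTS-style):
# betas = torch.linspace(0.05, 20.0, steps=100)
```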

Reverse Process

For the reverse process, the scores of the two speaker-conditioned denoisers are summed and removed from the noisy sample of each trajectory, as described in #Disentangled Denoising#Reverse Process:

\[ \begin{align} d\hat{X}_{src,t} &= \left( \frac{1}{2}(Z_{src}-\hat{X}_{src,t}) -s_{\theta_{src}}(\hat{X}_{src,t},Z_{src},s,t) -s_{\theta_{ftr}}(\hat{X}_{ftr,t},Z_{ftr},s,t) \right)\beta_{t}dt+\sqrt{\beta_{t}}d\bar{W}_{t} \\ d\hat{X}_{ftr,t} &= \left( \frac{1}{2}(Z_{ftr}-\hat{X}_{ftr,t}) -s_{\theta_{src}}(\hat{X}_{src,t},Z_{src},s,t) -s_{\theta_{ftr}}(\hat{X}_{ftr,t},Z_{ftr},s,t) \right)\beta_{t}dt+\sqrt{\beta_{t}}d\bar{W}_{t} \end{align} \]
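
One discretized reverse step for both trajectories (same assumptions as in the forward sketch above; `score_src` and `score_ftr` are placeholders for the two speaker-conditioned score networks). Note that each trajectory subtracts the sum of both predicted scores:

```python
import torch

def reverse_step(x_src, x_ftr, z_src, z_ftr, s, t, beta_t, h,
                 score_src, score_ftr, add_noise=True):
    """One Euler-Maruyama step of the coupled reverse SDEs, going from time t to t - h."""
    # both updates use the sum of the two speaker-conditioned scores
    total_score = score_src(x_src, z_src, s, t) + score_ftr(x_ftr, z_ftr, s, t)
    drift_src = (0.5 * (z_src - x_src) - total_score) * beta_t
    drift_ftr = (0.5 * (z_ftr - x_ftr) - total_score) * beta_t
    sigma = (beta_t * h) ** 0.5
    x_src = x_src - drift_src * h + (sigma * torch.randn_like(x_src) if add_noise else 0.0)
    x_ftr = x_ftr - drift_ftr * h + (sigma * torch.randn_like(x_ftr) if add_noise else 0.0)
    return x_src, x_ftr
```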

Prior Mixup

Prior Mixup is an approach to train the model for Voice Conversion (VC) rather than just reconstructing the sample from its latent representations. The idea is:

  • Switch the speaker style representation \(s\) of the input sample with a random style \(s_{r}\) during Source-Filter Encoding.
  • During denoising, use the actual speaker representation \(s\).

The source-filter representation now carries the random style \(s_{r}\), and the denoiser learns to convert it to the actual speaker style \(s\). This is why both the source-filter encoder and the denoiser need to be conditioned on the speaker.

This approach enables learning voice conversion even from non-parallel datasets (Parallel and Non-parallel Training Data).
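
In a mini-batch, the random style \(s_{r}\) can simply be obtained by shuffling the speaker embeddings within the batch; a sketch with placeholder encoder handles:

```python
import torch

def prior_mixup(content, pitch, s, src_enc, ftr_enc):
    """Encode the priors with a shuffled speaker embedding s_r, while the denoisers
    are conditioned on the true s later on (placeholder modules)."""
    s_r = s[torch.randperm(s.size(0))]     # random style from another utterance
    z_src_r = src_enc(pitch, s_r)          # priors now carry the "wrong" voice ...
    z_ftr_r = ftr_enc(content, s_r)
    return z_src_r, z_ftr_r, s             # ... and the denoisers must convert back to s
```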

Training and Loss

The style-, source- and filter-encoders are all trained jointly end-to-end using the #Disentangled Denoising#Training Objective and the loss described in #Source-Filter Representation. Pre-trained models are used for Wav2Vec 2.0#XLS-R and the F0-[[VQ-VAE]].

The objective for each attribute is:

\[ \begin{align} \theta^{*}_{ftr} &= \arg\min_{\theta_{ftr}}\int_{0}^{1}\lambda_{t}\mathbb{E}_{X_{0},X_{ftr,t}}\left[\left|\left|\left( s_{\theta_{src}}(X_{src,t},Z_{src,r},s,t)+s_{\theta_{ftr}}(X_{ftr,t},Z_{ftr,r},s,t) \right)-\nabla\log{p_{t|0}(X_{ftr,t}|X_{0})}\right|\right|^{2}_{2}\right]dt \\ \theta^{*}_{src} &= \arg\min_{\theta_{src}}\int_{0}^{1}\lambda_{t}\mathbb{E}_{X_{0},X_{src,t}}\left[\left|\left|\left( s_{\theta_{src}}(X_{src,t},Z_{src,r},s,t)+s_{\theta_{ftr}}(X_{ftr,t},Z_{ftr,r},s,t) \right)-\nabla\log{p_{t|0}(X_{src,t}|X_{0})}\right|\right|^{2}_{2}\right]dt \end{align} \]

and the diffusion loss is

\[ \mathcal{L}_{diff} = \mathbb{E}_{X_{0},X_{t}}\left[\lambda_{t}\left|\left|\left( s_{\theta_{src}}(X_{src,t},Z_{src,r},s,t)+s_{\theta_{ftr}}(X_{ftr,t},Z_{ftr,r},s,t) \right)-\nabla\log{p_{t|0}(X_{t}|X_{0})}\right|\right|^{2}_{2}\right] \]
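
For a linear SDE of the form above, the transition density \(p_{t|0}\) is Gaussian with a closed-form mean and variance (a standard result for this Ornstein-Uhlenbeck-type process, as used in Grad-TTS-style models), so the score target can be computed directly:

\[ p_{t|0}(X_{n,t}|X_{0}) = \mathcal{N}\left(X_{n,t};\, e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}ds}X_{0}+\left(1-e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}ds}\right)Z_{n},\,\left(1-e^{-\int_{0}^{t}\beta_{s}ds}\right)I\right) \]

\[ \nabla\log{p_{t|0}(X_{n,t}|X_{0})} = -\frac{X_{n,t}-\mathbb{E}[X_{n,t}|X_{0}]}{1-e^{-\int_{0}^{t}\beta_{s}ds}} \]

with \(n\in\{src,ftr\}\).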

The overall loss is the weighted sum of the diffusion loss and the source-filter encoder reconstruction loss:

\[ \mathcal{L}_{total}=\mathcal{L}_{diff}+\lambda_{rec}\mathcal{L}_{rec} \]

The authors simply set \(\lambda_{rec}=1\).