DiffVC
Based on @popovDiffusionBasedVoiceConversion2022
This model applies the average-phoneme prior idea from Grad-TTS to the problem of Voice Conversion (VC). It also introduces a fast sampling scheme to mitigate the usually rather slow inference of diffusion models.
Model Architecture

As in Grad-TTS, a data-driven prior is used as the terminal distribution of the forward process of the diffusion model. Instead of diffusing towards a standard Gaussian, the data diffuses towards the average mel coefficients of each phoneme in the input utterance. In Grad-TTS, the average-phoneme representation is encoded from text input; in DiffVC it is encoded from a source mel-spectrogram.
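In symbols (using the notation defined in the Decoder section below), the terminal distribution of the forward process is approximately a unit-variance Gaussian centred on the average-phoneme representation \(\bar{X}\) rather than on zero:

\[
X_{1} \sim \mathcal{N}(\bar{X}, I) \quad \text{instead of} \quad X_{1} \sim \mathcal{N}(0, I).
\]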
Encoder
The encoder is a transformer-based architecture that is trained to generate the average-phoneme representation for a given input utterance. The target representation is constructed, and the encoder trained, in three steps:
- Align speech-frames with phonemes ([[Montreal Forced Aligner]] applied to [[LibriTTS Dataset]]).
- Aggregate the mel features for each phoneme across the dataset and take averages (sketched below).
- Train the encoder to minimize the mean squared error (MSE) between its output for the source sample and the average-phoneme representation.
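A minimal sketch of the aggregation step, assuming frame-level phoneme alignments are already available; all names here are illustrative, not from the paper's code:

```python
import numpy as np
from collections import defaultdict

def build_average_voice(mels, alignments):
    """Average mel frames per phoneme over the whole dataset.

    mels:       list of [T_i, n_mels] float arrays, one per utterance
    alignments: list of length-T_i phoneme-ID sequences (e.g. from the
                Montreal Forced Aligner), frame-aligned with `mels`
    """
    sums = defaultdict(float)   # phoneme -> running sum of mel frames
    counts = defaultdict(int)   # phoneme -> number of frames seen
    for mel, align in zip(mels, alignments):
        for frame, phoneme in zip(mel, align):
            sums[phoneme] = sums[phoneme] + frame
            counts[phoneme] += 1
    return {p: sums[p] / counts[p] for p in sums}

def to_prior(alignment, avg):
    """Replace every frame with its phoneme's dataset-wide average mel."""
    return np.stack([avg[p] for p in alignment])
```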
Decoder
The decoder is a score-based reverse-diffusion denoiser, the same as in Grad-TTS. With \(\bar{X}\) denoting the average-phoneme representation and \(\beta_{t}\) the noise schedule, the forward diffusion is given by

\[
dX_{t} = \frac{1}{2}\beta_{t}\left(\bar{X} - X_{t}\right)dt + \sqrt{\beta_{t}}\,dW_{t}, \qquad t \in [0, 1],
\]

and has the reverse-time counterpart

\[
dX_{t} = \left(\frac{1}{2}\left(\bar{X} - X_{t}\right) - \nabla \log p_{t}(X_{t})\right)\beta_{t}\,dt + \sqrt{\beta_{t}}\,d\bar{W}_{t}.
\]
The forward Wiener process \(W\) and the reverse-time Wiener process \(\bar{W}\) are both standard, mutually independent Wiener processes in \(\mathbb{R}^{n}\), where \(n\) is the data dimensionality.
The forward SDE has an explicit solution, which is a Normal distribution with data-dependent parameters:

\[
X_{t} \mid X_{0} \sim \mathcal{N}\!\left(e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}\,ds}\,X_{0} + \left(1 - e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}\,ds}\right)\bar{X},\; \left(1 - e^{-\int_{0}^{t}\beta_{s}\,ds}\right)I\right).
\]
The [[Score-Based Diffusion Model#Score Network]] \(s_{\theta}(X_{t}, \bar{X}, t)\) is now trained by minimizing the MSE to the true score function:

\[
\theta^{*} = \arg\min_{\theta}\, \mathbb{E}_{t,\,X_{0},\,X_{t}}\left[\lambda_{t}\,\bigl\lVert s_{\theta}(X_{t}, \bar{X}, t) - \nabla \log p_{t}(X_{t} \mid X_{0})\bigr\rVert_{2}^{2}\right],
\]

with a time-dependent weighting \(\lambda_{t}\).
Because the solution of the forward SDE is a Gaussian of known form, we can derive this score function in closed form:

\[
\nabla \log p_{t}(X_{t} \mid X_{0}) = -\frac{X_{t} - \mu_{t}(X_{0})}{\sigma_{t}^{2}}, \qquad \sigma_{t}^{2} = 1 - e^{-\int_{0}^{t}\beta_{s}\,ds},
\]

where \(\mu_{t}(X_{0})\) is the mean of the Gaussian solution above.
An advantage of having the closed-form Gaussian solution for the SDE is that we can sample the noisy input \(X_{t}\) without simulating the intermediate values \(\{X_{s}\}_{0<s<t}\), which makes training much more efficient. With a well-trained score network, simulating the reverse-time SDE recovers a data sample from the prior.
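A minimal training-side sketch of this shortcut, assuming the linear noise schedule \(\beta_{t} = \beta_{0} + (\beta_{1} - \beta_{0})t\); the schedule form and the constants are assumptions for illustration:

```python
import numpy as np

def forward_sample(x0, xbar, t, beta0=0.05, beta1=20.0):
    """Sample X_t | X_0 from the closed-form Gaussian solution and return
    the true score, i.e. the regression target for the score network.

    Assumes the linear schedule beta_t = beta0 + (beta1 - beta0) * t, so
    int_0^t beta_s ds = beta0 * t + 0.5 * (beta1 - beta0) * t**2.
    """
    integral = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2
    decay = np.exp(-0.5 * integral)
    mean = decay * x0 + (1.0 - decay) * xbar          # mu_t(X_0)
    var = 1.0 - np.exp(-integral)                     # sigma_t^2
    xt = mean + np.sqrt(var) * np.random.randn(*x0.shape)
    score = -(xt - mean) / var        # grad log p_t(X_t | X_0)
    return xt, score
```

A score network \(s_{\theta}(X_{t}, \bar{X}, t)\) would then be regressed onto the returned score with a (weighted) squared-error loss.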
Conditioning
These formulations allow the model to learn to reconstruct a sample from the average-phoneme representation. To enable Voice Conversion, we need to condition the denoiser on an utterance of the target speaker. To do so, we introduce a trainable function \(g_{t}(Y)\), where \(Y\) is (part of) the forward trajectory of the target mel-spectrogram.
The authors test different setups to determine which part of the target trajectory is relevant for high-quality VC:
- d-only: The function always returns a speaker-embedding generated from \(Y_{0}\)
- wodyn: + the sample \(Y_{t}\)
- whole: + the whole trajectory \(Y\) discretized into 15 samples
The authors show the best option to be wodyn, i.e. conditioning the model on the speaker embedding plus the noisy mel-spectrogram \(Y_{t}\) of the target speaker.
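A schematic of the winning wodyn setup, reusing `forward_sample` from the sketch above; `speaker_encoder` and the argument names are hypothetical stand-ins for the paper's speaker conditioning network:

```python
def condition_wodyn(y0, ybar, t, speaker_encoder):
    """'wodyn' conditioning: a static speaker embedding from the clean
    target mel Y_0 plus the dynamic noisy target mel Y_t at diffusion
    time t. The denoiser then receives (X_t, Xbar, t, d, Y_t).
    `ybar` is the average-phoneme representation of the target utterance.
    """
    d = speaker_encoder(y0)              # time-independent speaker code
    yt, _ = forward_sample(y0, ybar, t)  # noisy target mel from the forward process
    return d, yt
```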
Maximum Likelihood SDE Solver
The paper also presents a fast sampling scheme based on maximizing likelihood. The resulting solver matches the [[Euler-Maruyama]] solver except when the number of steps \(N\) is rather small or \(t\) is of the same order as the step size \(h\) (the final steps of inference).
> [!warning] Side-note
> The proof is very long and complex. It should only be worked through if actually needed.
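For orientation, a plain Euler-Maruyama discretization of the reverse-time SDE from the Decoder section (constants and step count are illustrative); the maximum likelihood solver keeps this loop structure but replaces the per-step drift and noise coefficients with the likelihood-maximizing ones derived in the paper:

```python
import numpy as np

def reverse_sample(xbar, score_fn, n_steps=30, beta0=0.05, beta1=20.0):
    """Integrate the reverse-time SDE from t = 1 down to t = 0, starting
    at the average-phoneme prior X_1 ~ N(xbar, I)."""
    h = 1.0 / n_steps
    x = xbar + np.random.randn(*xbar.shape)
    for i in range(n_steps, 0, -1):
        t = i * h
        beta = beta0 + (beta1 - beta0) * t
        drift = (0.5 * (xbar - x) - score_fn(x, xbar, t)) * beta
        x = x - h * drift + np.sqrt(beta * h) * np.random.randn(*x.shape)
    return x
```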
Experimental Evaluation
The evaluation was done on the [[VCTK Dataset]] and the [[LibriTTS Dataset]], using HiFiGAN for vocoding and the [[Mean Opinion Score]] (MOS) for assessment. The model outperforms the other evaluated VC models on both naturalness and similarity MOS.