Overview
> [!tip] Master Thesis
> Base model used in the Master Thesis.
This section provides a general overview of the DDDM-VC model. More details in:
- Disentangled Denoising: Decoupled denoising process used in this model.
- Architecture: Components and implementation of the model.
- Potential Extensions: Ideas and thoughts on potential extensions of the model.
- Code Base: Mapping of model components to repository code.
Background

Diffusion models like [[eDiff-I]] use multiple sequential denoising networks, each specialized in a different stage of the diffusion process, to improve synthesis quality and speed. However, if the input space can be separated into different attributes, we can instead use multiple parallel denoising networks that each specialize in denoising a single attribute.
The authors apply this idea to Voice Conversion (VC), using the Source-Filter Model to separate speech into two attributes (the same principle is applied in classical speech processing using autoregressive filters; see Speech Signal Processing - Overview#Linear Prediction):
- The speech source: an excitation signal that is strongly related to the speaker characteristics but independent of the linguistic content (Speech Signal Processing - Overview#Source)
- The speech filter: a phoneme-specific filter applied to the source signal. It is mostly related to the linguistic content but independent of the speaker (Speech Signal Processing - Overview#Filter)
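The decomposition can be made concrete with classical linear prediction. Below is a minimal, illustrative sketch (my own toy example, not code from the paper): a vowel-like frame is synthesized from an impulse-train source and a fixed all-pole filter, and both parts are then recovered via LPC.

```python
# Minimal source-filter illustration via linear prediction (illustrative
# sketch, not DDDM-VC code). The all-pole filter models the vocal tract;
# inverse filtering the speech recovers the excitation (source) signal.
import numpy as np
import librosa
from scipy.signal import lfilter

excitation = np.zeros(400)                   # one 25 ms frame at 16 kHz
excitation[::100] = 1.0                      # impulse train: 160 Hz "glottal" source
a_true = np.array([1.0, -1.2, 0.8])          # toy stable all-pole vocal-tract filter
speech = lfilter([1.0], a_true, excitation)  # source-filter synthesis

a_est = librosa.lpc(speech, order=2)         # estimate the filter from speech alone
residual = lfilter(a_est, [1.0], speech)     # inverse filtering recovers the source

print(a_est)         # ≈ [1.0, -1.2, 0.8]: the filter (speaker-related) part
print(residual[:5])  # ≈ impulse train: the excitation (source) part
```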
Related Models
Disentanglement is a central problem in speech processing, so several other approaches exist.
Information Bottleneck
An information bottleneck is introduced into the network to force the model to learn lower-dimensional latent features of the input space (see the sketch after the examples).
Examples
- [[AutoVC]]: Uses dimensionality reduction to disentangle content and timbre.
- [[F0-AutoVC]]: Additionally conditions the decoder on the Fundamental Frequency (F0)
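A minimal PyTorch sketch of the bottleneck principle (my illustration, not AutoVC's actual architecture, which uses recurrent encoders and temporal downsampling): content is forced through a narrow dimension, and speaker identity is re-injected at the decoder.

```python
# Toy information-bottleneck autoencoder (illustration of the principle only).
# Content must pass through a narrow bottleneck; the speaker embedding is
# re-injected at the decoder, so the bottleneck need not carry timbre.
import torch
import torch.nn as nn

class BottleneckVC(nn.Module):
    def __init__(self, n_mels=80, bottleneck=4, spk_dim=256, hidden=512):
        super().__init__()
        self.content_enc = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck),   # the heuristically chosen bottleneck size
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, mel, spk_emb):
        # mel: (batch, frames, n_mels), spk_emb: (batch, spk_dim)
        content = self.content_enc(mel)      # squeezes out speaker information
        spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.decoder(torch.cat([content, spk], dim=-1))

model = BottleneckVC()
mel = torch.randn(2, 100, 80)
out = model(mel, torch.randn(2, 256))  # training would minimize MSE(out, mel)
```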
Drawbacks
An information bottleneck requires a heuristic choice of the bottleneck size, which may differ between datasets and reduces generalizability.
Furthermore, neurons in the bottleneck tend to be polysemantic. That means there is no one-to-one correspondence between neurons and latent features; a neuron may activate for multiple features. A possible explanation is the superposition hypothesis (@bereskaMechanisticInterpretabilityAI2024), according to which an \(n\)-dimensional space can encode only \(n\) orthogonal directions (and features), but \(\propto \exp(n)\) almost orthogonal ones. Networks can therefore learn much more compressed representations by relaxing the orthogonality constraint on features. Many approaches like beta-VAE, Factor VAE, Supervised Guided VAE and Unsupervised Guided VAE have been suggested to enforce orthogonality and might improve latent diffusion VC (Latent Score-Based Generative Model).
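A quick toy check of the almost-orthogonality claim (my own numpy experiment, not from the cited work): random directions in \(n\) dimensions have pairwise cosine similarities concentrated around zero with spread \(\propto 1/\sqrt{n}\), so far more than \(n\) features fit with little interference.

```python
# Toy check: many random directions in n dimensions are nearly orthogonal.
import numpy as np

rng = np.random.default_rng(0)
n, m = 512, 2048                  # 2048 "features" in a 512-dimensional space
v = rng.standard_normal((m, n))
v /= np.linalg.norm(v, axis=1, keepdims=True)

cos = v @ v.T                     # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)
print(np.abs(cos).max())          # ~0.25: far from collinear, "almost orthogonal"
```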
Information Perturbation
The input speech is perturbed or encoded into disentangled representations. Most models employing this approach are autoencoders.
Examples
Perturbing Waveform
- [[SpeechFlow]]
- NANSY
Average Phoneme
- [[DiffSVC]]
- DiffVC
- EmoConv-Diff
Self-Supervised Representation
- [[Reconstructing speech from self-supervised representations]]
- [[Speech Resynthesis from Discrete Disentangled Self-Supervised Representations]]
- [[S3PRL-VC]]
Drawbacks
The disentanglement introduces information loss, which reduces the synthesis quality of autoencoder models. Diffusion models are well suited to restoring the lost information, but the current models are not able to denoise in a disentangled manner.
Experimental Evaluation
Datasets:
- Train: [[LibriTTS Dataset]]
- Eval: [[LibriTTS Dataset]] and [[VCTK Dataset]]
- Eval for zero-shot cross-lingual VC: [[CSS10 Dataset]]
Preprocessing:
- Waveform downsampled from 24 kHz to 16 kHz
- Waveform is input to the Wav2Vec 2.0#XLS-R content encoder
- Log mel-spectrogram with 80 bins is input to [[Meta-StyleSpeech]]
- Mel-spectrogram with hop size 320, window size 1280 and a 1280-point Fourier transform as reconstruction target (see the sketch below)
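The mel-spectrogram target can be reproduced with torchaudio from the parameters above (a sketch; the exact filterbank settings and normalization of the reference implementation are assumptions here):

```python
# Mel-spectrogram target with the stated STFT parameters (sketch; mel scale
# and normalization choices of the actual pipeline are assumptions).
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,   # after downsampling from 24 kHz
    n_fft=1280,          # 1280-point Fourier transform
    win_length=1280,     # window size 1280
    hop_length=320,      # hop size 320
    n_mels=80,           # 80 mel bins
)

wav = torch.randn(1, 16000)                      # stand-in for a 1 s utterance
mel = mel_transform(wav)                         # shape: (1, 80, frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))  # log-compressed variant for the style encoder
```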
Training:
- Optimizer#AdamW with \(\beta_{1}=0.8\), \(\beta_{2}=0.99\) and weight decay \(\lambda=0.01\)
- Learning rate decay of \(0.999^{\frac{1}{8}}\) and initial learning rate of \(5\cdot10^{-5}\)
- Batch size 64 for 200 epochs
- Prior Mixup using randomly shuffled speaker representations within the same batch (see the sketch after this list)
- One-shot speaker adaptation
    - fine-tuning using a single sentence of a novel speaker
    - 500 steps with optimizer initialization and an initial learning rate of \(2\cdot10^{-5}\)
- Vocoding
    - HiFiGAN#V1 as the generator
    - EnCodec#Multi-Scale STFT-based Discriminators
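A condensed sketch of the optimizer setup and the prior-mixup shuffle (my reading of the recipe above; the decay interval and all function names are assumptions, not the reference implementation):

```python
# Sketch of the training recipe above (names and the exact placement of the
# mixup are my assumptions, not the reference implementation).
import torch
import torch.nn as nn

model = nn.Linear(80, 80)  # stand-in for the DDDM-VC networks
optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-5, betas=(0.8, 0.99), weight_decay=0.01
)
# learning-rate decay of 0.999^(1/8); applying it per epoch is an assumption
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999 ** (1 / 8))

def prior_mixup(style_emb: torch.Tensor) -> torch.Tensor:
    """Prior mixup: shuffle speaker/style embeddings within the batch so the
    diffusion prior is conditioned on a different speaker than the target."""
    perm = torch.randperm(style_emb.size(0), device=style_emb.device)
    return style_emb[perm]

style = torch.randn(64, 256)   # batch of 64 speaker embeddings (batch size 64)
mixed = prior_mixup(style)     # used as the prior's speaker condition
scheduler.step()               # decay once per epoch (assumed)
```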
Metrics:
- Subjective:
    - [[Mean Opinion Score]] on naturalness (nMOS) and speaker similarity (sMOS)
- Objective:
    - Character Error Rate (CER)
    - Word Error Rate (WER)
    - Equal Error Rate (EER) of Speaker Recognition#VoxCeleb2
    - Speaker Encoder Cosine Similarity (SECS)
    - Mel-Cepstral Distortion (MCD) with respect to utterance pairs in [[VCTK Dataset]]
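The objective speaker metrics reduce to standard computations over speaker-embedding scores. A generic sketch (embedding extraction with the VoxCeleb2-trained verification model is omitted):

```python
# Generic EER and SECS computation from verification scores / speaker
# embeddings (illustrative; the embedding model itself is not shown).
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the error rate where false-accept and false-reject rates meet.
    labels: 1 = same speaker, 0 = different speaker."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # point where FAR == FRR
    return float((fpr[idx] + fnr[idx]) / 2)

def secs(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Speaker Encoder Cosine Similarity between two speaker embeddings."""
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

scores = np.array([0.9, 0.8, 0.3, 0.4])  # toy verification scores
labels = np.array([1, 1, 0, 0])
print(equal_error_rate(scores, labels))  # 0.0: the toy scores separate perfectly
```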
Results
**Many-to-Many VC** DDDM-VC outperforms the DiffVC model in MOS scores and especially in the objective scores for speaker EER.
**Zero-shot VC** DDDM-VC outperforms the other models even more clearly, especially with regard to synthesizing speaker styles. Increasing the diffusion iterations (6 → 30) slightly decreases the objective scores but increases the MOS scores because of the increased diversity of the converted speech.
**One-shot Speaker Adaptation** DDDM-VC is fine-tuned on a single speaker sample (<10 seconds) for 500 steps. A larger number of adaptation steps leads to overfitting; fewer steps work better.
**Zero-shot Cross-lingual VC** In cross-lingual scenarios, the speaker EER is similar to that of zero-shot VC within the same language. The CER is larger than for the ground truth, but still reasonable.
**Ablation Study** Components are removed from DDDM-VC individually to evaluate their impact on and relevance for output quality and performance.
- Prior Mixup: Improves speaker adaptation (EER and SECS) but decreases naturalness slightly. The authors argue this could be because the input length is fixed, so the model does not learn rhythm conversion properly.
- Disentangled Denoisers: Improve performance across all metrics.
- Normalized F0: Improves performance across all metrics. Removing the pitch contour makes it more difficult for the encoder to disentangle content information effectively. The authors argue that extracting more pitch information might further improve the stability of the model (see the sketch after this list).
- Data-driven Prior: Improves speaker adaptation but slightly decreases naturalness. Possible improvements through the use of normalizing flows.
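A minimal sketch of the normalized-F0 conditioning idea (my illustration; how F0 is extracted and whether normalization is per utterance or per speaker are assumptions): log-F0 over voiced frames is standardized so the contour carries intonation but not the speaker-level pitch range.

```python
# Per-utterance normalization of an F0 contour over voiced frames
# (illustrative sketch; the extraction method itself is left open).
import numpy as np

def normalize_f0(f0: np.ndarray) -> np.ndarray:
    voiced = f0 > 0                      # unvoiced frames conventionally hold 0
    out = np.zeros_like(f0)
    log_f0 = np.log(f0[voiced])
    out[voiced] = (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)
    return out                           # speaker-level mean/variance removed

f0 = np.array([0.0, 180.0, 200.0, 0.0, 220.0])  # toy F0 contour in Hz
print(normalize_f0(f0))
```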