Disentangled Denoising
This section outlines the general idea of decoupled denoising. The explicit implementation for Voice Conversion (VC) is described in #Model Architecture.
Forward Process
In Score-Based Diffusion Model#Forward Process, the forward process is defined by a deterministic drift \(f(X_{t},t)\), a noise-schedule \(g(t)\) and the forward-time Brownian motion \(W_{t}\):

$$dX_{t} = f(X_{t},t)\,dt + g(t)\,dW_{t}$$
For multiple attributes, the forward trajectory is defined separately for each of the \(N\) attributes. Each trajectory moves from the input sample toward a data-driven, attribute-specific prior. (DiffVC also uses a data-driven prior, but only for a single attribute.)
We can plug in the data-driven and attribute-specific deterministic drift \(f(X_{t},t)=\frac{1}{2}\beta_{t}(Z_{n}-X_{n,t})\) and the noise-schedule \(g(t)=\sqrt{\beta_{t}}\):

$$dX_{n,t} = \frac{1}{2}\beta_{t}(Z_{n}-X_{n,t})\,dt + \sqrt{\beta_{t}}\,dW_{t}, \qquad n=1,\dots,N$$
This process yields \(N\) noisy samples that are Gaussian distributed with data-driven parameters. Note that every trajectory starts at the same training sample \(X_{0}\), independent of the attribute \(n\). As the time-step \(t\) increases, the \(X_{n,t}\) diverge toward their attribute-specific priors \(Z_{n}\).
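Because the transition kernel of this forward SDE is Gaussian in closed form, \(X_{n,t}\) can be sampled directly without simulating the SDE step by step. A minimal NumPy sketch; the function name and the linear schedule \(\beta_{t}=\beta_{0}+(\beta_{1}-\beta_{0})t\) are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def forward_sample(x0, z, t, beta0=0.05, beta1=20.0, rng=None):
    """Sample X_{n,t} from the closed-form Gaussian transition p_{t|0}.

    x0 : clean training sample (the same for every attribute n)
    z  : data-driven, attribute-specific prior Z_n
    t  : diffusion time in [0, 1]
    Assumed linear schedule beta_t = beta0 + (beta1 - beta0) * t, so
    int_0^t beta_s ds = beta0*t + 0.5*(beta1 - beta0)*t**2.
    """
    rng = np.random.default_rng() if rng is None else rng
    integral = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2
    mean = z + (x0 - z) * np.exp(-0.5 * integral)   # interpolates x0 -> Z_n
    var = 1.0 - np.exp(-integral)                   # grows from 0 toward 1
    return mean + np.sqrt(var) * rng.standard_normal(x0.shape)
```

At \(t=0\) the variance vanishes and the sample equals \(X_{0}\); for large \(\int_{0}^{t}\beta_{s}ds\) the mean approaches \(Z_{n}\) and the variance approaches 1, i.e. the attribute-specific prior.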
Reverse Process
The denoising process is given by the reverse trajectory, which exists for every such forward SDE. In Score-Based Diffusion Model#Reverse Process, the reverse SDE is defined as

$$d\hat{X}_{t} = \left[f(\hat{X}_{t},t) - g(t)^{2}\,s_{\theta}(\hat{X}_{t},t)\right]dt + g(t)\,d\bar{W}_{t}$$
where \(s_{\theta}(\hat{X}_{t},t)\) is the score model with parameters \(\theta\), predicting the score function \(\nabla_{\hat{X}_{t}}\log{p_{t}(\hat{X}_{t})}\) at noise level \(t\), and \(\bar{W}_{t}\) is the reverse-time Brownian motion.
Again, we extend this formula using the data-driven and attribute-specific deterministic drift \(f(X_{t},t)=\frac{1}{2}\beta_{t}(Z_{n}-X_{n,t})\) and the noise-schedule \(g(t)=\sqrt{\beta_{t}}\):

$$d\hat{X}_{n,t} = \left[\frac{1}{2}(Z_{n}-\hat{X}_{n,t}) - \sum_{n=1}^{N} s_{\theta_{n}}(\hat{X}_{n,t}, Z_{n}, t)\right]\beta_{t}\,dt + \sqrt{\beta_{t}}\,d\bar{W}_{t}$$
Compared to standard score-based diffusion, each score model carries attribute-specific parameters \(\theta_{n}\) and is conditioned on the data-driven prior \(Z_{n}\). The predicted gradients are summed to obtain the overall score.
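A single reverse-time Euler-Maruyama step under this SDE could be sketched as follows. The linear \(\beta\)-schedule and the `scores_sum` argument (standing in for the summed outputs of the trained score models) are assumptions for illustration:

```python
import numpy as np

def reverse_step(x, z, scores_sum, t, dt, beta0=0.05, beta1=20.0, rng=None):
    """One Euler-Maruyama step of the reverse SDE for attribute n.

    x          : current noisy state at time t
    z          : data-driven prior Z_n
    scores_sum : sum over attributes of the predicted scores
                 s_{theta_n}(x, Z_n, t) (placeholder; in practice the
                 outputs of the trained score models)
    Integrates backward in time by dt > 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    beta_t = beta0 + (beta1 - beta0) * t            # assumed linear schedule
    drift = (0.5 * (z - x) - scores_sum) * beta_t   # drift of the reverse SDE
    noise = np.sqrt(beta_t) * rng.standard_normal(x.shape)
    return x - drift * dt + noise * np.sqrt(dt)     # backward Euler-Maruyama
```

Iterating this step from \(t=1\) down to \(t=0\), starting from a draw of the attribute-specific prior, yields the denoised sample.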
Training Objective
The training objective is the Score-Based Diffusion Model#Score Matching loss, i.e. the integral of score-matching losses over time. The score-matching loss at time \(t\) is weighted by \(\lambda_{t}=1-e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}ds}\) so that high noise levels do not dominate the overall loss.
The true score is tractable because the transition kernel \(p_{t|0}(X_{n,t}|X_{0})\) is Gaussian:

$$p_{t|0}(X_{n,t}|X_{0}) = \mathcal{N}\!\left(X_{n,t};\; Z_{n} + (X_{0}-Z_{n})\,e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}ds},\; \left(1-e^{-\int_{0}^{t}\beta_{s}ds}\right)\mathrm{I}\right)$$
and the score is the gradient of the log-density:

$$\nabla_{X_{n,t}}\log p_{t|0}(X_{n,t}|X_{0}) = -\,\frac{X_{n,t} - Z_{n} - (X_{0}-Z_{n})\,e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}ds}}{1-e^{-\int_{0}^{t}\beta_{s}ds}}$$
> [!info] The exponential \(e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}ds}\) is a function of \(t\) only; \(s\) is the integration variable.
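Putting the closed-form mean, variance, weight \(\lambda_{t}\), and true score together, the per-attribute loss can be sketched as below; the linear \(\beta\)-schedule and the function signature are assumptions:

```python
import numpy as np

def score_matching_loss(score_pred, x_t, x0, z, t, beta0=0.05, beta1=20.0):
    """Time-weighted score-matching loss for one attribute (a sketch).

    score_pred : predicted score s_{theta_n}(X_{n,t}, Z_n, t)
    x_t        : noisy sample drawn from p_{t|0}
    x0, z      : clean sample and attribute-specific prior Z_n
    The true score of the Gaussian p_{t|0} is computed in closed form and
    the squared error is weighted by lambda_t = 1 - exp(-0.5 * int beta).
    """
    integral = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2
    mean = z + (x0 - z) * np.exp(-0.5 * integral)
    var = 1.0 - np.exp(-integral)
    true_score = -(x_t - mean) / var
    lambda_t = 1.0 - np.exp(-0.5 * integral)
    return lambda_t * np.mean((score_pred - true_score) ** 2)
```

During training, \(t\) is sampled uniformly and the losses of all \(N\) attributes are accumulated, approximating the integral over time by Monte-Carlo sampling.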
For fast inference, the authors employ the DiffVC#Sampling scheme.