Disentangled Denoising
This section outlines the general idea of decoupled denoising. The explicit implementation for Voice Conversion (VC) is described in #Model Architecture.
Forward Process
In Score-Based Diffusion Model#Forward Process, the forward process is defined by a deterministic drift \(f(X_{t},t)\), a noise-schedule \(g(t)\) and the forward-time Brownian motion \(W_{t}\):

$$dX_{t} = f(X_{t},t)\,dt + g(t)\,dW_{t}$$
For multiple attributes, the forward trajectory is defined separately for each of the \(N\) attributes. Each trajectory moves from the input sample toward a data-driven, attribute-specific prior. (DiffVC also uses a data-driven prior, but only for a single attribute.)
We can plug in the data-driven and attribute-specific deterministic drift \(f(X_{t},t)=\frac{1}{2}\beta_{t}(Z_{n}-X_{n,t})\) and the noise-schedule \(g(t)=\sqrt{\beta_{t}}\):

$$dX_{n,t} = \frac{1}{2}\beta_{t}(Z_{n}-X_{n,t})\,dt + \sqrt{\beta_{t}}\,dW_{t}, \qquad n=1,\dots,N$$
This process yields \(N\) noisy samples that are Gaussian distributed with data-driven parameters. Note that every trajectory starts at the same training sample \(X_{0}\), independent of the attribute \(n\). As the time-step \(t\) increases, the \(X_{n,t}\) diverge toward their attribute-specific priors \(Z_{n}\).
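Because the transition kernel of this forward SDE is Gaussian in closed form, \(X_{n,t}\) can be sampled directly without simulating the SDE step by step. A minimal NumPy sketch; the function name and the linear schedule \(\beta_{t}=\beta_{0}+(\beta_{1}-\beta_{0})t\) are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def forward_sample(x0, z, t, beta0=0.05, beta1=20.0, rng=None):
    """Sample X_{n,t} from the closed-form Gaussian transition p_{t|0}.

    x0 : clean training sample (the same for every attribute n)
    z  : data-driven, attribute-specific prior Z_n
    t  : diffusion time in [0, 1]
    Assumed linear schedule beta_t = beta0 + (beta1 - beta0) * t, so
    int_0^t beta_s ds = beta0*t + 0.5*(beta1 - beta0)*t**2.
    """
    rng = np.random.default_rng() if rng is None else rng
    integral = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2
    mean = z + (x0 - z) * np.exp(-0.5 * integral)   # interpolates x0 -> Z_n
    var = 1.0 - np.exp(-integral)                   # grows from 0 toward 1
    return mean + np.sqrt(var) * rng.standard_normal(x0.shape)
```

At \(t=0\) the variance vanishes and the sample equals \(X_{0}\); for large \(\int_{0}^{t}\beta_{s}ds\) the mean approaches \(Z_{n}\) and the variance approaches 1, i.e. the attribute-specific prior.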
Reverse Process
The denoising process is given by the reverse trajectory, which exists for every such forward SDE. In Score-Based Diffusion Model#Reverse Process, the reverse SDE is defined as

$$d\hat{X}_{t} = \left[f(\hat{X}_{t},t) - g(t)^{2}\,s_{\theta}(\hat{X}_{t},t)\right]dt + g(t)\,d\bar{W}_{t}$$
where \(s_{\theta}(\hat{X}_{t},t)\) is the score model with parameters \(\theta\), predicting the score function \(\nabla_{\hat{X}_{t}}\log{p_{t}(\hat{X}_{t})}\) at noise level \(t\), and \(\bar{W}_{t}\) is the reverse-time Brownian motion.
Again, we extend this formula using the data-driven and attribute-specific deterministic drift \(f(X_{t},t)=\frac{1}{2}\beta_{t}(Z_{n}-X_{n,t})\) and the noise-schedule \(g(t)=\sqrt{\beta_{t}}\):

$$d\hat{X}_{n,t} = \left[\frac{1}{2}(Z_{n}-\hat{X}_{n,t}) - \sum_{n=1}^{N} s_{\theta_{n}}(\hat{X}_{n,t}, Z_{n}, t)\right]\beta_{t}\,dt + \sqrt{\beta_{t}}\,d\bar{W}_{t}$$
Compared to standard score-based diffusion, each score model carries attribute-specific parameters \(\theta_{n}\) and is conditioned on the data-driven prior \(Z_{n}\). The predicted gradients are summed to obtain the overall score.
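A single reverse-time Euler-Maruyama step under this SDE could be sketched as follows. The linear \(\beta\)-schedule and the `scores_sum` argument (standing in for the summed outputs of the trained score models) are assumptions for illustration:

```python
import numpy as np

def reverse_step(x, z, scores_sum, t, dt, beta0=0.05, beta1=20.0, rng=None):
    """One Euler-Maruyama step of the reverse SDE for attribute n.

    x          : current noisy state at time t
    z          : data-driven prior Z_n
    scores_sum : sum over attributes of the predicted scores
                 s_{theta_n}(x, Z_n, t) (placeholder; in practice the
                 outputs of the trained score models)
    Integrates backward in time by dt > 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    beta_t = beta0 + (beta1 - beta0) * t            # assumed linear schedule
    drift = (0.5 * (z - x) - scores_sum) * beta_t   # drift of the reverse SDE
    noise = np.sqrt(beta_t) * rng.standard_normal(x.shape)
    return x - drift * dt + noise * np.sqrt(dt)     # backward Euler-Maruyama
```

Iterating this step from \(t=1\) down to \(t=0\), starting from a draw of the attribute-specific prior, yields the denoised sample.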
Training Objective
The training objective is the Score-Based Diffusion Model#Score Matching loss, i.e. the integral of score-matching losses over time. The score-matching loss at time \(t\) is weighted by \(\lambda_{t}=1-e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}ds}\) so that high noise levels do not dominate the overall loss.
The true score is tractable because the transition kernel \(p_{t|0}(X_{n,t}|X_{0})\) is Gaussian:

$$p_{t|0}(X_{n,t}|X_{0}) = \mathcal{N}\!\left(X_{n,t};\; Z_{n} + (X_{0}-Z_{n})\,e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}ds},\; \left(1-e^{-\int_{0}^{t}\beta_{s}ds}\right)\mathrm{I}\right)$$
and the score is the gradient of the log-density:

$$\nabla_{X_{n,t}}\log p_{t|0}(X_{n,t}|X_{0}) = -\,\frac{X_{n,t} - Z_{n} - (X_{0}-Z_{n})\,e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}ds}}{1-e^{-\int_{0}^{t}\beta_{s}ds}}$$
> [!info] The exponential \(e^{-\frac{1}{2}\int_{0}^{t}\beta_{s}ds}\) is a function of \(t\) only; \(s\) is the integration variable.
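Putting the closed-form mean, variance, weight \(\lambda_{t}\), and true score together, the per-attribute loss can be sketched as below; the linear \(\beta\)-schedule and the function signature are assumptions:

```python
import numpy as np

def score_matching_loss(score_pred, x_t, x0, z, t, beta0=0.05, beta1=20.0):
    """Time-weighted score-matching loss for one attribute (a sketch).

    score_pred : predicted score s_{theta_n}(X_{n,t}, Z_n, t)
    x_t        : noisy sample drawn from p_{t|0}
    x0, z      : clean sample and attribute-specific prior Z_n
    The true score of the Gaussian p_{t|0} is computed in closed form and
    the squared error is weighted by lambda_t = 1 - exp(-0.5 * int beta).
    """
    integral = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2
    mean = z + (x0 - z) * np.exp(-0.5 * integral)
    var = 1.0 - np.exp(-integral)
    true_score = -(x_t - mean) / var
    lambda_t = 1.0 - np.exp(-0.5 * integral)
    return lambda_t * np.mean((score_pred - true_score) ** 2)
```

During training, \(t\) is sampled uniformly and the losses of all \(N\) attributes are accumulated, approximating the integral over time by Monte-Carlo sampling.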
For fast inference, the authors employ the DiffVC#Sampling scheme.