# Potential Extensions
## Choice of Priors
In Overview, the priors are chosen to form a source-filter representation. DiffVC uses an average-phoneme prior, essentially disentangling the content information from the utterance by hand, similar to the filter representation. This raises the question of why not to use an average-speaker or average-emotion representation as a terminal distribution in Overview as well. The problem with such an approach is that the phoneme and speaker representations are naturally entangled: different speakers have their own pronunciation, dialect, etc. The average-speaker and average-phoneme representations would therefore carry redundant information, and the denoisers would no longer be truly decoupled. The source-filter representation, however, is truly disentangled, so denoisers trained on each component will learn non-overlapping trajectories, at least at large noise levels.
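To illustrate why the source-filter representation is naturally disentangled, here is a minimal sketch of a source-filter split via cepstral liftering. All names and the `cutoff` value are hypothetical illustration choices, not the actual pipeline used in Overview:

```python
import numpy as np

def source_filter_split(frame, cutoff=30):
    """Split one audio frame into filter (spectral envelope) and
    source (excitation) components via cepstral liftering."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-8)
    cepstrum = np.fft.irfft(log_mag)
    lifter = np.zeros_like(cepstrum)
    lifter[:cutoff] = 1.0          # low quefrencies -> vocal-tract filter
    lifter[-cutoff + 1:] = 1.0     # symmetric part, since the spectrum is real
    filter_ceps = cepstrum * lifter
    source_ceps = cepstrum * (1.0 - lifter)
    # in the cepstral domain the two parts add back up to the original
    return source_ceps, filter_ceps

frame = np.random.randn(512)
src, ftr = source_filter_split(frame)
```

The split is additive in the cepstral domain, which is also what makes recombining the two denoised components a simple sum later on.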
## Convergence of Trajectories
> [!note] Converging Trajectories
> Sketch of the ensemble approach. At high noise levels, the trajectories towards the source and filter distributions are very distinct. At low noise levels, the trajectories have to be very similar, as they converge towards a single sample in the data distribution. Employing a single denoiser for low noise levels would reduce computational complexity and allow for specialization on increasing audible fidelity.
By definition, the trajectories of both denoisers converge towards the same data distribution.
For high noise levels, the trajectories will be very distinct, but for low noise levels, the trajectories can be expected to be very similar.
## Expert Ensemble
At later stages of the sampling process, using two denoisers might not be efficient anymore.
This idea is illustrated in the above figure and is well supported by other papers:
In @balajiEDiffTextImageDiffusionModels2023, different models are used for the early stages of sampling (semantic inception) and the late stages of sampling (visual fidelity).
@kwonDiffusionModelsAlready2023 (Disentanglement in Diffusion Models#h-layer Disentanglement) provides a method to edit images by perturbing the latent space within the denoising U-Nets.
They observe that semantic editing works best when only applying the perturbation within the first third of the diffusion space.
Consequently, it could be very interesting to employ decoupled *expert* denoisers only up to timestep \(t^{\prime}\) and to use a single denoiser afterwards.
The expert denoisers would generate the source- and filter-information specific to the speaker, and once the trajectories converge, a single denoiser is used to increase audible fidelity.
The implementation could look like this:
- Begin denoising with two decoupled denoisers until timestep \(t^{\prime}\).
- At \(t^{\prime}\), take the sum of \(X_{src,t^{\prime}}\) and \(X_{ftr,t^{\prime}}\) (applying the filter in the cepstral domain).
- Proceed denoising with single diffusion model until timestep \(t=0\).
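A minimal sketch of this sampling procedure, assuming hypothetical score-model callables and a simplified Euler update rather than the actual reverse-SDE solver:

```python
import numpy as np

def ensemble_sample(score_src, score_ftr, score_joint, x_src, x_ftr,
                    t_prime, n_steps=100):
    """Reverse-diffusion sketch: two decoupled denoisers until t_prime,
    then a single denoiser on their cepstral-domain sum down to t = 0.
    score_* are assumed callables (x, t) -> score estimate."""
    dt = 1.0 / n_steps
    t = 1.0
    # phase 1: decoupled expert denoisers run in parallel
    while t > t_prime:
        x_src = x_src + dt * score_src(x_src, t)   # simplified Euler update
        x_ftr = x_ftr + dt * score_ftr(x_ftr, t)
        t -= dt
    # phase 2: apply the filter in the cepstral domain (sum) and continue
    x = x_src + x_ftr
    while t > 0:
        x = x + dt * score_joint(x, t)
        t -= dt
    return x
```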
Determining \(t^\prime\) can be done in multiple ways.
- hyperparameter: Setting \(t^\prime\) as a hyperparameter is a simple solution but lacks connection to the theoretical idea of converging trajectories.
- sample similarity: Since \(X_{src,t}\) and \(X_{ftr,t}\) are combined at \(t^\prime\), we could define a threshold \(\epsilon_{x,t}^\prime\) via the \(L2\)-loss between them.
$$
\epsilon_{x,t}=\frac{1}{n}\sum_{i=1}^{n}\left[\lVert X_{ftr,t}-X_{src,t}\rVert_2^2\right]
$$
- trajectory similarity: Holding closer to the idea of converging trajectories, \(t^\prime\) can be chosen by observing the increasingly similar scores of the source and filter models. As in the score-matching objective, we can calculate the Fisher divergence between the source and filter scores and use it to define a threshold \(\epsilon_{s,t}^\prime\) at which to switch to the single denoiser.
$$
\epsilon_{s,t}=\frac{1}{2n}\sum_{i=1}^{n}\left[\lVert s_{\theta_{ftr}}(X_{ftr,t},Z_{ftr,r},s,t)-s_{\theta_{src}}(X_{src,t},Z_{src,r},s,t)\rVert_2^2\right]
$$
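Both criteria could be evaluated side by side to pick \(t^\prime\). The following sketch assumes per-timestep batches of partial samples and score estimates are available; all names are hypothetical:

```python
import numpy as np

def switch_time(xs_src, xs_ftr, scores_src, scores_ftr, eps_x, eps_s):
    """Pick t' as the first timestep (iterating from high to low noise)
    at which either the sample-similarity or the score-similarity
    criterion falls below its threshold. Inputs are lists indexed by
    timestep, each entry an (n, d) batch; purely illustrative."""
    for t in range(len(xs_src)):
        # mean squared L2 distance between the partial samples
        e_x = np.mean(np.sum((xs_ftr[t] - xs_src[t]) ** 2, axis=-1))
        # Fisher-divergence-style distance between the two score estimates
        e_s = 0.5 * np.mean(np.sum((scores_ftr[t] - scores_src[t]) ** 2, axis=-1))
        if e_x < eps_x or e_s < eps_s:
            return t
    return len(xs_src) - 1
```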
## Disentanglement Measure: Rate of Convergence
Another interesting idea is to analyse the rate of convergence between the trajectories. For each denoising process, we can track the difference between the scores over time/noise level. Fitting an exponential to this curve, via linear regression on its logarithm, then yields a measure of the rate of convergence. This can be used to compare the average rate of convergence for
- same speaker - different emotion and different speaker - same emotion
- or same speaker - different content and different speaker - same content
to potentially give some insight into how well different features are disentangled by the decoupled denoisers. Intuitively, if the denoisers converge more slowly, there is less informational overlap and the feature in question is disentangled well by the source-filter model. If they converge quickly, the source and filter scores would seem too entangled to enable denoising along separate trajectories. This, however, would need a more formal proof.
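A sketch of the exponential fit, assuming the per-timestep score divergences have already been collected; the names and the synthetic data are hypothetical:

```python
import numpy as np

def convergence_rate(ts, divergences):
    """Fit divergence(t) ~ a * exp(b * t) by linear regression on the
    log-divergence; the slope b serves as the rate-of-convergence measure.
    Per the text, a smaller |b| (slower convergence) would suggest
    better disentanglement."""
    log_d = np.log(np.asarray(divergences) + 1e-12)  # guard against log(0)
    b, log_a = np.polyfit(ts, log_d, 1)              # slope and intercept
    return b, np.exp(log_a)

# synthetic divergences that decay as t -> 0 (t = 1 is high noise)
ts = np.linspace(1.0, 0.01, 50)
div = 2.0 * np.exp(3.0 * ts)
b, a = convergence_rate(ts, div)
```

Comparing the fitted slopes across the speaker/emotion/content pairings listed above would give the proposed disentanglement measure.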