Voice Conversion (VC)
Voice conversion transforms the speech of a source speaker so that it sounds like a target speaker. It is less concerned with differences in the two speakers' vocabulary and more with differences in prosody, spectrum and formants. It differs from Emotional Voice Conversion (EVC) in that it does not try to maintain the emotional content of the source speaker. ^d7c656
Just like TTS, VC has made tremendous progress thanks to Deep Learning approaches. In training, there is a crucial distinction between Parallel and Non-parallel Training Data. While it is much easier to train on parallel data, it is hard to produce parallel data in sufficient quantity. One way to deal with non-parallel data is CycleGAN, a generative adversarial network (GAN) that contains two generators: one produces the target speech from the source, and the other reconstructs the source speech from the generated target. Penalizing the difference between the original and the reconstructed speech enforces consistency between source and target utterances.
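The consistency idea behind CycleGAN can be sketched as a cycle-consistency loss: convert source to target and back, then measure how far the result is from the original (and likewise in the other direction). The snippet below is a minimal illustration in plain Python, not a training recipe. The names `l1_distance`, `cycle_consistency_loss`, `shift_up`, `shift_down` and `identity` are hypothetical, the "generators" are toy functions rather than neural networks, and the adversarial part of the GAN is omitted entirely.

```python
from typing import Callable, List

Frame = List[float]            # one feature vector, e.g. a spectral frame
Utterance = List[Frame]        # a sequence of frames
Generator = Callable[[Utterance], Utterance]


def l1_distance(a: Utterance, b: Utterance) -> float:
    """Mean absolute difference between two equally shaped utterances."""
    total = 0.0
    count = 0
    for frame_a, frame_b in zip(a, b):
        for value_a, value_b in zip(frame_a, frame_b):
            total += abs(value_a - value_b)
            count += 1
    if count == 0:
        raise ValueError("utterances must not be empty")
    return total / count


def cycle_consistency_loss(
    x: Utterance, y: Utterance, g_xy: Generator, g_yx: Generator
) -> float:
    """Source -> target -> source should give x back, and vice versa for y."""
    forward_cycle = l1_distance(g_yx(g_xy(x)), x)
    backward_cycle = l1_distance(g_xy(g_yx(y)), y)
    return forward_cycle + backward_cycle


# Toy stand-ins for the two generators (assumption: real ones are networks).
def shift_up(u: Utterance) -> Utterance:
    return [[v + 1.0 for v in frame] for frame in u]


def shift_down(u: Utterance) -> Utterance:
    return [[v - 1.0 for v in frame] for frame in u]


def identity(u: Utterance) -> Utterance:
    return [list(frame) for frame in u]


source = [[0.0, 1.0], [2.0, 3.0]]
target = [[1.0, 2.0], [3.0, 4.0]]

# shift_down exactly undoes shift_up, so both cycles reconstruct their input.
consistent = cycle_consistency_loss(source, target, shift_up, shift_down)

# identity does not undo shift_up, so both cycles are off by 1 per value.
inconsistent = cycle_consistency_loss(source, target, shift_up, identity)
```

Because the loss only compares an utterance with its own reconstruction, it never needs a source and a target recording of the same sentence, which is why it suits non-parallel data.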
Another approach is to use seq2seq models, which employ encoder-decoder architectures. In contrast to frame-by-frame mappings such as those used in GANs, they allow the output speech to differ in length from the source utterance, which is essential because prosody differs between speakers.
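The length difference can be illustrated with a toy contrast: a frame-wise mapping always returns as many frames as it receives, while an encoder-decoder emits frames until a stop condition fires. This is a rough sketch under strong assumptions. Features are single floats, `encode` merely averages the input, and the `should_stop` rule is a hypothetical hand-written stand-in for the stop prediction a trained decoder would learn.

```python
from typing import Callable, List


def framewise_convert(source: List[float], scale: float) -> List[float]:
    """Map every input frame to exactly one output frame."""
    return [value * scale for value in source]


def encode(source: List[float]) -> float:
    """Toy encoder: summarise the whole utterance as its mean value."""
    return sum(source) / len(source)


def decode(
    context: float,
    should_stop: Callable[[int, float], bool],
    max_steps: int = 100,
) -> List[float]:
    """Toy decoder: emit frames step by step until the stop rule fires."""
    output: List[float] = []
    for step in range(max_steps):
        if should_stop(step, context):
            break
        output.append(context)
    return output


source_frames = [1.0, 2.0, 3.0]

# Frame-wise mapping: output length is tied to input length.
framewise_output = framewise_convert(source_frames, scale=2.0)

# Encoder-decoder: the (hypothetical) stop rule decides the output length.
decoded_output = decode(
    encode(source_frames),
    should_stop=lambda step, context: step >= 5,
)
```

Here the frame-wise output has three frames like the input, whereas the decoded output has five, so the converted utterance can be slower or faster than the source.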