Parallel and Non-Parallel Training Data
In language and speech processing, there is a crucial distinction in the type of training data available: parallel vs. non-parallel. Parallel data is generally more directly useful for applications involving direct conversion, such as machine translation and Voice Conversion (VC), whereas non-parallel data is useful for broader language understanding and generation tasks where direct correspondences are not necessary.
Parallel Training Data
This type of data consists of sets of sentences that are translations of each other across two or more languages: each sentence in one language is paired with its equivalent in another. Parallel data is essential for training machine translation systems, such as those based on neural networks, where the model learns to predict the target-language equivalent of a source-language input. For example, a parallel corpus might pair a sentence in English with its corresponding translation in German. Compared to non-parallel training data, parallel data is scarcer and more expensive to collect, since it requires aligned translations.
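To make the format concrete, here is a minimal Python sketch of a parallel corpus represented as aligned English–German sentence pairs and turned into (source, target) training examples. The sentences and the `make_translation_examples` helper are illustrative assumptions, not taken from any particular dataset.

```python
# A minimal sketch of a parallel corpus: each entry pairs a source
# sentence with its translation. The sentences below are illustrative.
parallel_corpus = [
    ("The weather is nice today.", "Das Wetter ist heute schön."),
    ("Where is the train station?", "Wo ist der Bahnhof?"),
    ("I would like a coffee.", "Ich hätte gern einen Kaffee."),
]

def make_translation_examples(pairs):
    """Yield (source, target) examples for a seq2seq translation model."""
    for english, german in pairs:
        yield {"src": english, "tgt": german}

for example in make_translation_examples(parallel_corpus):
    print(example["src"], "->", example["tgt"])
```

The key property is the one-to-one alignment: every source sentence has a known target, so the model can be trained with direct supervision.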
Non-Parallel Training Data
This refers to data that is not aligned at the sentence or phrase level across languages. It consists of monolingual data in each language, with no direct translations linked to it. Non-parallel data is often used for language modeling and for unsupervised or semi-supervised learning. For instance, when parallel data is scarce, machine translation systems can still be trained on large amounts of text from each language independently, learning each language's structure and vocabulary. Compared to parallel training data, non-parallel data is far easier to collect.
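Below is a minimal Python sketch contrasting this with the parallel case: two independent monolingual corpora, with no alignment between them, each usable on its own to build next-token prediction examples for a language model. The corpora and the `next_token_examples` helper are illustrative assumptions.

```python
# A minimal sketch of non-parallel data: two independent monolingual
# corpora with no sentence-level alignment between them. The lines in
# one list do not correspond to the lines in the other.
english_corpus = [
    "The cat sat on the mat.",
    "Rain is expected tomorrow.",
]
german_corpus = [
    "Der Zug kommt um acht Uhr an.",
    "Ich lese gern Bücher.",
]

def next_token_examples(sentences):
    """Yield (context, next_word) pairs for language-model training."""
    for sentence in sentences:
        words = sentence.split()
        for i in range(1, len(words)):
            yield (words[:i], words[i])

# Each language is modeled independently; no alignment is required.
for context, target in next_token_examples(english_corpus):
    print(context, "->", target)
```

Because no alignment is required, data like this can be gathered at scale from ordinary text in each language, which is why it underpins unsupervised and semi-supervised approaches.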