Wav2Vec 2.0
Based on @baevskiWav2vec20Framework2020.
Wav2Vec 2.0 extracts speech units from waveform speech.

The model
- extracts latent features \(z_{1},\ldots,z_{T}\) for \(T\) timesteps using a multi-layer convolutional encoder.
- feeds the latent features to a [[Transformer Model]] to build the contextual representation \(c_{1},\ldots,c_{T}\).
- quantizes the latent features into speech units \(q_{1},\ldots,q_{T}\).
The stride of the temporal convolution determines the number of input timesteps for the transformer, as well as the granularity of the quantization.
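A minimal PyTorch sketch of this pipeline, not the reference implementation: the convolution channels, strides, and kernel widths follow the paper's feature encoder (512 channels, strides (5,2,2,2,2,2,2), kernel widths (10,3,3,3,3,2,2)), while the transformer depth and the nearest-codeword quantizer are simplified stand-ins for the real context network and Gumbel-softmax codebooks.

```python
import torch
import torch.nn as nn

# Feature encoder config from the paper: 7 temporal convolutions, total stride 320 (~20 ms at 16 kHz).
CONV_LAYERS = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2  # (channels, kernel, stride)

class TinyWav2Vec2(nn.Module):
    def __init__(self, dim=512, codebook_size=320):
        super().__init__()
        convs, in_ch = [], 1
        for out_ch, kernel, stride in CONV_LAYERS:
            convs += [nn.Conv1d(in_ch, out_ch, kernel, stride), nn.GELU()]
            in_ch = out_ch
        self.feature_encoder = nn.Sequential(*convs)                        # waveform -> z_1..z_T
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.context_network = nn.TransformerEncoder(layer, num_layers=4)   # z -> c_1..c_T
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))       # stand-in quantizer

    def forward(self, wav):                                                 # wav: (batch, samples)
        z = self.feature_encoder(wav.unsqueeze(1)).transpose(1, 2)          # (batch, T, dim)
        c = self.context_network(z)                                         # contextual representations
        # Nearest-codeword "quantization"; the paper uses G Gumbel-softmax codebooks instead.
        idx = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1)).argmin(-1)
        q = self.codebook[idx]                                              # speech units q_1..q_T
        return z, c, q

model = TinyWav2Vec2()
wav = torch.randn(2, 16000)        # two 1-second clips at 16 kHz
z, c, q = model(wav)
print(c.shape)                     # torch.Size([2, 49, 512]): the stride yields 49 frames per second
```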
Training
For training, a random subset of the latent features \(z_{t}\) is masked before being fed to the transformer. For each masked timestep, \(K\) distractor quantization vectors \(\tilde{q}\) are sampled from other masked timesteps of the same utterance. The transformer then has to identify the true quantization vector \(q_{t}\) among the \(K+1\) candidates; the model is trained with a contrastive loss.
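With \(\mathrm{sim}\) denoting cosine similarity, \(Q_{t}\) the candidate set (the true \(q_{t}\) plus the \(K\) distractors), and \(\kappa\) a temperature, the contrastive loss from the paper reads

\[
\mathcal{L}_{m} = -\log \frac{\exp\left(\mathrm{sim}(c_{t}, q_{t})/\kappa\right)}{\sum_{\tilde{q} \in Q_{t}} \exp\left(\mathrm{sim}(c_{t}, \tilde{q})/\kappa\right)}
\]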
The contrastive loss is minimized by
- maximising the cosine similarity between the contextual representation \(c_{t}\) and the true quantization \(q_{t}\)
- and minimising the similarity for the distractor vectors \(\tilde{q}\)
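A minimal PyTorch sketch of this objective, assuming the representations at masked timesteps have already been gathered into tensors; the function name and shapes are placeholders, and \(\kappa=0.1\) with \(K=100\) distractors as in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c, q, distractors, kappa=0.1):
    """c: (B, D) contextual reps at masked timesteps; q: (B, D) true quantizations;
    distractors: (B, K, D) quantizations sampled from other masked timesteps."""
    candidates = torch.cat([q.unsqueeze(1), distractors], dim=1)     # (B, K+1, D), true vector at index 0
    sims = F.cosine_similarity(c.unsqueeze(1), candidates, dim=-1)   # (B, K+1)
    # -log softmax of the true candidate == (K+1)-way cross-entropy with target index 0
    targets = torch.zeros(c.size(0), dtype=torch.long)
    return F.cross_entropy(sims / kappa, targets)

loss_m = contrastive_loss(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 100, 256))
```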
Additionally, a diversity loss encourages equal usage of the speech units. Speech units are vectors drawn from one of \(G\) codebooks with \(V\) entries each. The diversity loss maximizes the entropy of the average softmax distribution over the entries of each codebook.
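Writing \(\bar{p}_{g,v}\) for the softmax probability of entry \(v\) in codebook \(g\), averaged over a batch of utterances, the paper's diversity loss is

\[
\mathcal{L}_{d} = \frac{1}{GV} \sum_{g=1}^{G} -H(\bar{p}_{g}) = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \bar{p}_{g,v} \log \bar{p}_{g,v}
\]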
The total loss amounts to \(\mathcal{L}=\mathcal{L}_{m} + \alpha \mathcal{L}_{d}\), where \(\alpha\) weights the diversity term.
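A corresponding sketch of the diversity term, assuming the quantizer exposes per-codebook logits of shape (batch, timesteps, G, V); the paper uses \(G=2\) codebooks with \(V=320\) entries each.

```python
import torch
import torch.nn.functional as F

def diversity_loss(codebook_logits):
    """codebook_logits: (B, T, G, V) scores over the V entries of each of the G codebooks."""
    probs = F.softmax(codebook_logits, dim=-1)          # (B, T, G, V)
    avg_probs = probs.mean(dim=(0, 1))                  # (G, V): average usage per codebook
    G, V = avg_probs.shape
    entropy = -(avg_probs * torch.log(avg_probs + 1e-7)).sum(dim=-1)  # H(p̄_g) per codebook
    return -entropy.sum() / (G * V)                     # minimizing this maximizes codebook entropy

loss_d = diversity_loss(torch.randn(4, 49, 2, 320))
```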
Inference & Performance
The temporal convolution results in an output rate of 49 Hz (a stride of about 20 ms from sample to sample). Wav2Vec2 performs comparably to, but slightly worse than, [[HuBERT]].

The latent speech units specialize in specific phonetic sounds, as seen in the figure below. The phoneme bcl represents silence and is modeled by the most distinct latents.
