Wav2Vec 2.0
Based on @baevskiWav2vec20Framework2020.
Wav2Vec 2.0 extracts speech units from waveform speech.

The model
- extracts latent features \(z_{1},\ldots,z_{T}\) for \(T\) timesteps using a multi-layer convolutional encoder.
- feeds the latent features to a [[Transformer Model]] to build the contextual representation \(c_{1},\ldots,c_{T}\).
- quantizes the latent features into speech units \(q_{1},\ldots,q_{T}\).
The stride of the temporal convolution determines the number of input timesteps for the transformer, as well as the granularity of the quantization.
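A minimal PyTorch sketch of this pipeline, not the reference implementation: the convolution channels, strides, and kernel widths follow the paper's feature encoder (512 channels, strides (5,2,2,2,2,2,2), kernel widths (10,3,3,3,3,2,2)), while the transformer depth and the nearest-codeword quantizer are simplified stand-ins for the real context network and Gumbel-softmax codebooks.

```python
import torch
import torch.nn as nn

# Feature encoder config from the paper: 7 temporal convolutions, total stride 320 (~20 ms at 16 kHz).
CONV_LAYERS = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2  # (channels, kernel, stride)

class TinyWav2Vec2(nn.Module):
    def __init__(self, dim=512, codebook_size=320):
        super().__init__()
        convs, in_ch = [], 1
        for out_ch, kernel, stride in CONV_LAYERS:
            convs += [nn.Conv1d(in_ch, out_ch, kernel, stride), nn.GELU()]
            in_ch = out_ch
        self.feature_encoder = nn.Sequential(*convs)                        # waveform -> z_1..z_T
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.context_network = nn.TransformerEncoder(layer, num_layers=4)   # z -> c_1..c_T
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))       # stand-in quantizer

    def forward(self, wav):                                                 # wav: (batch, samples)
        z = self.feature_encoder(wav.unsqueeze(1)).transpose(1, 2)          # (batch, T, dim)
        c = self.context_network(z)                                         # contextual representations
        # Nearest-codeword "quantization"; the paper uses G Gumbel-softmax codebooks instead.
        idx = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1)).argmin(-1)
        q = self.codebook[idx]                                              # speech units q_1..q_T
        return z, c, q

model = TinyWav2Vec2()
wav = torch.randn(2, 16000)        # two 1-second clips at 16 kHz
z, c, q = model(wav)
print(c.shape)                     # torch.Size([2, 49, 512]): the stride yields 49 frames per second
```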
Training
For training, a random subset of the latent features \(z_{t}\) is masked before being fed to the transformer. For each masked timestep, \(K\) distractor quantization vectors \(\tilde{q}\) are sampled from other masked timesteps of the same utterance. The transformer then has to identify the true quantization vector \(q_{t}\) among the \(K+1\) candidates; the model is trained with a contrastive loss.
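With \(\mathrm{sim}\) denoting cosine similarity, \(Q_{t}\) the candidate set (the true \(q_{t}\) plus the \(K\) distractors), and \(\kappa\) a temperature, the contrastive loss from the paper reads

\[
\mathcal{L}_{m} = -\log \frac{\exp\left(\mathrm{sim}(c_{t}, q_{t})/\kappa\right)}{\sum_{\tilde{q} \in Q_{t}} \exp\left(\mathrm{sim}(c_{t}, \tilde{q})/\kappa\right)}
\]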
The contrastive loss is minimized by
- maximising the cosine similarity between the contextual representation \(c_{t}\) and the true quantization \(q_{t}\)
- and minimising the similarity for the distractor vectors \(\tilde{q}\)
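A minimal PyTorch sketch of this objective, assuming the representations at masked timesteps have already been gathered into tensors; the function name and shapes are placeholders, and \(\kappa=0.1\) with \(K=100\) distractors as in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c, q, distractors, kappa=0.1):
    """c: (B, D) contextual reps at masked timesteps; q: (B, D) true quantizations;
    distractors: (B, K, D) quantizations sampled from other masked timesteps."""
    candidates = torch.cat([q.unsqueeze(1), distractors], dim=1)     # (B, K+1, D), true vector at index 0
    sims = F.cosine_similarity(c.unsqueeze(1), candidates, dim=-1)   # (B, K+1)
    # -log softmax of the true candidate == (K+1)-way cross-entropy with target index 0
    targets = torch.zeros(c.size(0), dtype=torch.long)
    return F.cross_entropy(sims / kappa, targets)

loss_m = contrastive_loss(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 100, 256))
```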
Additionally, a diversity loss encourages equal usage of the speech units. Speech units are vectors drawn from one of \(G\) codebooks with \(V\) entries each. The diversity loss maximizes the entropy of the average softmax distribution over the entries of each codebook.
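Writing \(\bar{p}_{g,v}\) for the softmax probability of entry \(v\) in codebook \(g\), averaged over a batch of utterances, the paper's diversity loss is

\[
\mathcal{L}_{d} = \frac{1}{GV} \sum_{g=1}^{G} -H(\bar{p}_{g}) = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \bar{p}_{g,v} \log \bar{p}_{g,v}
\]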
The total loss amounts to \(\mathcal{L}=\mathcal{L}_{m} + \alpha \mathcal{L}_{d}\), where \(\alpha\) weights the diversity term.
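A corresponding sketch of the diversity term, assuming the quantizer exposes per-codebook logits of shape (batch, timesteps, G, V); the paper uses \(G=2\) codebooks with \(V=320\) entries each.

```python
import torch
import torch.nn.functional as F

def diversity_loss(codebook_logits):
    """codebook_logits: (B, T, G, V) scores over the V entries of each of the G codebooks."""
    probs = F.softmax(codebook_logits, dim=-1)          # (B, T, G, V)
    avg_probs = probs.mean(dim=(0, 1))                  # (G, V): average usage per codebook
    G, V = avg_probs.shape
    entropy = -(avg_probs * torch.log(avg_probs + 1e-7)).sum(dim=-1)  # H(p̄_g) per codebook
    return -entropy.sum() / (G * V)                     # minimizing this maximizes codebook entropy

loss_d = diversity_loss(torch.randn(4, 49, 2, 320))
```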
Inference & Performance
The temporal convolution results in an output rate of 49 Hz (a stride of about 20 ms from sample to sample). Wav2Vec2 performs comparably to, but slightly worse than, [[HuBERT]].

The latent speech units specialize in specific phonetic sounds, as seen in the figure below. The phoneme bcl represents silence and is modeled by the most distinct latents.
