Link to paper: https://arxiv.org/abs/1904.05862

Creating speech recognition models is hard. The data are dense and noisy, and labels are expensive to generate at scale. Schneider et al. provide a scalable approach to overcoming these problems by encoding a large amount of audio data into an information-rich subspace. Once their encoder is trained, they can use it to represent transcribed audio as rich feature vectors and build speech recognition models that are really just decoders. The results are state of the art, reducing word error rate by more than 20% compared to baseline models.

A major difficulty in creating vector representations of audio data is that the data are dense: your network needs to accurately model a continuous distribution. The authors sidestep this by training two networks:

  • The encoder network, which performs a lossy compression of the input audio data. As someone who has lost the better part of a year of his life to building compression pipelines for dense data, this might be the most fascinating part of the paper to me. They train a convolutional neural network in which each layer's kernels have strides proportional to their size (wider convolutions take wider steps across the input data), so the resulting vector contains a feature representation of the data at a lower temporal frequency than the input. Essentially, they're combining time steps of audio data in an information-dependent way.

  • Each encoded audio sample is fed into the context network, which mixes the latent feature representations together into a single wav2vec vector. This means that a vector generated from time t in the audio file carries contextual information from roughly t ± 90 ms around it. A sketch of both networks follows this list.
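To make the two-network design concrete, here is a minimal PyTorch sketch. It assumes the hyperparameters reported in the paper (a five-layer encoder with kernel sizes (10, 8, 4, 4, 4) and strides (5, 4, 2, 2, 2), followed by a nine-layer context network with kernel size 3 and stride 1), drops normalization and other training details, and the class name is mine rather than anything from the authors' code:

```python
import torch
import torch.nn as nn

class Wav2VecSketch(nn.Module):
    """Encoder + context network, heavily simplified (no normalization)."""

    def __init__(self, dim=512):
        super().__init__()
        # Encoder f: raw waveform -> latents z, downsampling ~160x in time.
        # Stride scales with kernel size: wider filters take wider steps.
        enc, in_ch = [], 1
        for k, s in [(10, 5), (8, 4), (4, 2), (4, 2), (4, 2)]:
            enc += [nn.Conv1d(in_ch, dim, kernel_size=k, stride=s), nn.ReLU()]
            in_ch = dim
        self.encoder = nn.Sequential(*enc)
        # Context network g: mixes neighboring latents into context vectors c.
        ctx = []
        for _ in range(9):
            ctx += [nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU()]
        self.context = nn.Sequential(*ctx)

    def forward(self, audio):       # audio: (batch, 1, samples) raw waveform
        z = self.encoder(audio)     # (batch, dim, T), one latent per ~10 ms
        c = self.context(z)         # (batch, dim, T), each c_t sees ~ t ± 90 ms
        return z, c
```

After unsupervised pretraining, the context vectors c are what you would hand to a supervised recognizer in place of spectrogram features.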

To train this layered network, they employ a contrastive loss function: the network must distinguish the encoder's output k steps in the future from encoded distractor samples, where each distractor is drawn at random from elsewhere in the same audio recording.
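Here is a rough sketch of that objective for a single step offset k, reusing the z and c tensors from the model sketch above. The linear projection step_proj stands in for the paper's step-specific transform h_k; the function name, the uniform negative sampling, and the unweighted negative term are my simplifications:

```python
import torch
import torch.nn.functional as F

def contrastive_loss_step_k(z, c, step_proj, k, num_negatives=10):
    """Contrastive loss for one future offset k (simplified)."""
    batch, dim, T = z.shape
    zt = z.transpose(1, 2)                             # (batch, T, dim)
    pred = step_proj(c.transpose(1, 2))[:, : T - k]    # h_k(c_i), i = 0..T-k-1
    target = zt[:, k:]                                 # true future latents z_{i+k}

    # Positive term: the true future latent should score highly.
    pos = F.logsigmoid((pred * target).sum(-1))        # (batch, T-k)

    # Negative term: distractors drawn uniformly from the same sequence.
    idx = torch.randint(0, T, (batch, T - k, num_negatives))
    negs = torch.gather(
        zt.unsqueeze(1).expand(batch, T - k, T, dim), 2,
        idx.unsqueeze(-1).expand(-1, -1, -1, dim),
    )                                                  # (batch, T-k, n, dim)
    neg = F.logsigmoid(-(pred.unsqueeze(2) * negs).sum(-1)).sum(-1)

    return -(pos + neg).mean()

# Example, reusing the model sketch above:
# model = Wav2VecSketch()
# z, c = model(torch.randn(4, 1, 16000))
# h_1 = torch.nn.Linear(512, 512)
# loss = contrastive_loss_step_k(z, c, h_1, k=1)
```

In the paper this loss is summed over offsets k = 1, ..., K, each with its own projection, and the negative term carries a weight λ.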

As mentioned above, encoding speech data in this way creates information-rich input for speech recognition models, and those models really knock it out of the park. Using 1,000x less labeled training data than the best known speech recognizer in the literature, they demonstrate word error rates several percentage points better on standard test data. And because their network is fully convolutional, it parallelizes well and the method can be run at scale. Very exciting work!