Link to paper: https://arxiv.org/abs/1903.00216

Speech recognition is a messy problem. Audio data can suffer from all sorts of complications: multiple speakers to differentiate between, background noise, and compression artifacts all come to mind. On top of these, training a machine learning model to perform the “translation” between audio and text typically requires word- or phoneme-level alignments between the two. Producing those alignments at high quality, and at scale, is hard, and it’s exactly what this paper is about. If you’d prefer to skip ahead, the authors make their code publicly available.

The largest high-quality speech dataset available to researchers (that is, outside of industry, which currently prizes speech data above almost anything else) is about 2,000 hours long, which isn’t enough to train extremely high-quality models in an NLP context.

To fill this gap, Lakomkin et al. have created a pipeline to scrape, filter and clean YouTube videos and their closed captions into a dataset appropriate for speech recognition.

To begin with, the authors had to reach as many videos as possible to construct an initial sample set. They accomplished this by searching for stop words and then iterating through the 600 most recent videos returned by the YouTube Search API. They grabbed every video that appeared, its closed-caption transcript, and the contents of its associated channel.
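As a rough sketch of what that scraping loop could look like (the stop-word list, API key handling and pagination limits below are my assumptions; the authors’ actual implementation is in their released code), here is a minimal example against the YouTube Data API v3:

```python
# Sketch only: enumerate the most recent videos matching common stop-word
# queries via the YouTube Data API v3 (google-api-python-client).
from googleapiclient.discovery import build

STOP_WORDS = ["the", "and", "you", "that"]  # placeholder query terms

def search_recent_videos(api_key, query, max_results=600):
    """Return up to `max_results` of the most recent video IDs for `query`."""
    youtube = build("youtube", "v3", developerKey=api_key)
    video_ids, page_token = [], None
    while len(video_ids) < max_results:
        params = dict(q=query, part="id", type="video",
                      order="date",      # most recent first
                      maxResults=50)     # API maximum per page
        if page_token:
            params["pageToken"] = page_token
        response = youtube.search().list(**params).execute()
        video_ids += [item["id"]["videoId"] for item in response["items"]]
        page_token = response.get("nextPageToken")
        if page_token is None:
            break
    return video_ids[:max_results]

# all_ids = {vid for word in STOP_WORDS
#            for vid in search_recent_videos(API_KEY, word)}
# Closed captions and channel contents would be fetched in separate passes.
```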

The most interesting part of this paper is the general heuristics and filtering steps they employ to clean the dataset. YouTube captions are community or creator-provided, and can be low-quality. There’s also a lot of musical content on YouTube, which isn’t appropriate for a speech recognition dataset (yet), and a lot of advertisements that can interfere with or overlap the video content.

Filtering happened in several stages. The first step was to throw out segments with overlapping captions. This could be due to people speaking over one another in the video, or simply to incorrect caption syncing.
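A minimal version of that check might look like the following, assuming each caption segment carries start and end timestamps in seconds (the field names are mine, not the authors’):

```python
# Drop caption segments whose time ranges overlap a neighbouring segment,
# whether from cross-talk or from badly synced captions.
def drop_overlapping(segments):
    """segments: list of dicts with 'start', 'end' and 'text' keys."""
    ordered = sorted(segments, key=lambda s: s["start"])
    kept = []
    for i, seg in enumerate(ordered):
        overlaps_prev = i > 0 and ordered[i - 1]["end"] > seg["start"]
        overlaps_next = i + 1 < len(ordered) and seg["end"] > ordered[i + 1]["start"]
        if not (overlaps_prev or overlaps_next):
            kept.append(seg)
    return kept
```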

The next step was to get rid of musical audio. YouTube actually makes this quite simple by including “♫” symbols around song lyrics, so segments of videos including that symbol were discarded. Similarly, they removed non-speech fragments such as “Speaker 1”, “[laughs]”, etc.
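In code, that filter can be a couple of regular expressions. The exact annotation patterns in YouTube captions vary, so the ones below are illustrative rather than exhaustive:

```python
import re

MUSIC_MARKER = re.compile(r"[♫♪]")  # lyrics are wrapped in musical-note symbols
NON_SPEECH = re.compile(r"\[[^\]]*\]|\([^)]*\)|^speaker \d+:?\s*", re.IGNORECASE)

def is_music(text):
    return bool(MUSIC_MARKER.search(text))

def strip_non_speech(text):
    """Remove bracketed annotations like "[laughs]" and speaker labels."""
    return NON_SPEECH.sub("", text).strip()
```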

They take the step of spelling out numbers (31 becomes “thirty one”), and strip all characters except letters, spaces and apostrophes. I’m not sure I agree with these steps, since numerical tokens are useful to read in a transcript, and punctuation can correspond to useful acoustic signals (consider “Stop.”, “Stop!” and “Stop?” — you read those with different pitches in mind).
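The normalization itself is simple to sketch. Assuming a package like num2words for spelling out digits (the hyphen handling is my guess at how the authors format numbers):

```python
import re
from num2words import num2words  # pip install num2words

def normalize(text):
    # Spell out digit sequences: "31" -> "thirty one".
    text = re.sub(r"\d+",
                  lambda m: num2words(int(m.group())).replace("-", " "),
                  text)
    # Keep only letters, spaces and apostrophes.
    text = re.sub(r"[^a-z' ]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Stop! That costs $31."))  # -> "stop that costs thirty one"
```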

After filtering out especially long (> 10 seconds) and short (< 1 second) speech segments, Lakomkin et al. perform their most expensive and important filtering step. They select three random subsegments within each video and call the Google speech recognition API to generate a “gold standard” transcription of these short pulses of audio. If the Levenshtein similarity between the YouTube and Google transcripts is less than 70%, that video’s transcript is discarded. This must have clobbered the size of the initial dataset.
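The similarity check is a normalized Levenshtein ratio; a self-contained version (packages such as python-Levenshtein provide an equivalent) might look like this. Whether the 70% bar is applied per subsegment or in aggregate isn’t something I’m certain of, so the final comment is an assumption:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(youtube_text, asr_text):
    """1.0 for identical strings, 0.0 for entirely different ones."""
    longest = max(len(youtube_text), len(asr_text)) or 1
    return 1.0 - levenshtein(youtube_text, asr_text) / longest

# Assumed application: keep a video's transcript only if every sampled
# subsegment clears the bar.
# keep = all(similarity(yt, asr) >= 0.7 for yt, asr in sampled_pairs)
```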

As a post-processing step, they fine-tuned the audio/transcript alignment by matching the positions of the first and last words of each segment to the audio using Kaldi, which let them align segments to better than 500 ms.
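Conceptually, that refinement looks something like the sketch below: given word-level timestamps from a forced aligner (Kaldi in the paper), snap the segment boundaries to the aligned positions of the first and last caption words. The (word, start, end) tuple format and field names are assumptions of mine, not Kaldi’s actual output interface:

```python
# Sketch: tighten a caption segment's boundaries using word-level timestamps
# from a forced aligner. Tuple format and field names are assumed.
def refine_boundaries(segment, aligned_words):
    """segment: dict with 'text', 'start', 'end' (seconds).
    aligned_words: list of (word, start_sec, end_sec) covering the segment."""
    caption_words = segment["text"].split()
    if not caption_words or not aligned_words:
        return segment
    first_word, last_word = caption_words[0], caption_words[-1]
    starts = [s for w, s, _ in aligned_words if w == first_word]
    ends = [e for w, _, e in aligned_words if w == last_word]
    if starts and ends:
        segment = dict(segment, start=min(starts), end=max(ends))
    return segment
```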

The paper goes on to demonstrate that their data is useful by building a relatively simple deep speech recognition model, but that’s not what interests me. I want to run their pipeline at scale! It would be really interesting to build up speech recognition datasets for different “domains” of conversation, such as speech about finance, politics or technology, and deploy them in those environments — such as financial call centers, congressional phone lines or tech support agencies.