Link to paper: https://arxiv.org/abs/1903.00089

Neural Machine Translation (NMT) is the challenge of using neural networks to translate text from one language to another. Google Translate is famously good at this, and is where the latter two authors of this paper work. NMT is difficult largely due to the plethora of ways humans choose to communicate with one another.

Dialects, jargon, slang: all of these feed into the challenge of performing NMT, even before one considers the sheer number of languages that are spoken. Additionally, many languages are considered “low-resource”, in that there is little parallel translation data available for them.

All of this makes the work of Aharoni, Johnson and Firat very interesting. They perform “Massively Multilingual NMT”, building single models that can translate 102 languages to and from English. Not only do they actually pull this off, but they are also introspective enough to analyze their models’ shortcomings and dig into any underperformance.

The authors use a multilingual dataset constructed from TED Talk transcripts to provide parallel data between languages. They restrict themselves to the “English-centric” approach mentioned above, and also restrict their architecture to the Transformer. This means stacks of encoder and decoder layers, each built from multi-headed attention, feed-forward blocks and residual connections, repeated over and over.
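As a rough sketch of what that stack looks like in code (this is my own illustration in PyTorch, with made-up dimensions rather than the paper’s exact configuration):

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the paper's small/large models use their own settings.
vocab_size = 32000   # assumed shared subword vocabulary
d_model = 512        # embedding / hidden size

embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(
    d_model=d_model,
    nhead=8,                 # multi-headed attention
    num_encoder_layers=6,    # stacked encoder layers...
    num_decoder_layers=6,    # ...and decoder layers, each with residual connections inside
    dim_feedforward=2048,
)
project = nn.Linear(d_model, vocab_size)  # map decoder states back to the vocabulary

# src/tgt are (sequence_length, batch_size) tensors of token ids.
# (Positional encodings are omitted here for brevity.)
src = torch.randint(0, vocab_size, (10, 2))
tgt = torch.randint(0, vocab_size, (7, 2))
logits = project(transformer(embed(src), embed(tgt)))  # (7, 2, vocab_size)
```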

They build a ‘small’ model for a low-resource setting, which has 98 million parameters, and a large one for a high-resource setting, which has 473.7 million parameters. Something I don’t like about this work is that it overloads the word “resource”, using it to mean both the amount of translation data available for a given language and the number of languages being translated.

Two training approaches are evaluated in this work:

  • “Many-to-many”: training data is composed of many language pairs, with English on either side (source or target). A target-language token is included to tell the model which language it should be translating into (see the sketch after this list).
  • “Many-to-one”: training data only has English on the target side.
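As a toy illustration of that target-language token (the “<2xx>” format and helper below are my own assumptions, not the paper’s actual preprocessing):

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend a token telling a many-to-many model which language to produce."""
    return f"<2{target_lang}> {source_sentence}"

# The same model can then be steered in either direction:
print(add_target_token("How are you?", "de"))      # "<2de> How are you?"
print(add_target_token("Wie geht es dir?", "en"))  # "<2en> Wie geht es dir?"
```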

In their low-resource experiment, the authors investigate training the ‘small’ model on 59 languages. For language X → English, their many-to-many model outperforms almost all baselines (Hebrew → English being the outlier). However, the advantage doesn’t hold the other way; for English → X, a dedicated one-to-many model (trained with English only on the source side) is much better.

This isn’t too surprising. As noted in the paper, in many-to-one training, English vocabulary will overlap between training examples, allowing the model to overfit and memorize the training set. Additional target languages, as in the many-to-many case, act as regularizers and prevent overfitting. Figure 1 in the paper is a great example of this.

In the high-resource experiment, the number of languages is almost doubled to 102, again mirrored to and from English, and the model is nearly five times larger (473.7 million vs. 98 million parameters). In this case, the dedicated single-direction models (many-to-one for X → English, one-to-many for English → X) consistently outperform both the baselines and the many-to-many model, with the strange outlier of German → English, which should be relatively easy given the closeness of the two languages. This shows that there’s no silver bullet when making a generalized model: there will always be tradeoffs in accuracy between classes for a given modelling decision. Given the large scale of the model in question, though, it may indicate that Transformers are hitting a ceiling, and that a new modelling innovation is required.

A major implication of this work is that increasing the number of languages a single many-to-many model must handle degrades its per-language performance. The authors dig into this, training the large many-to-many model on successively larger subsets of language pairs: 5-5, 25-25, 50-50, 75-75 and finally 103-103. Evaluating throughout on languages from distant families drawn from the smallest subset (Arabic, French, Russian, Ukrainian, plus English), they find that 5-5 gives the best results.

However, the kind of testing this work (and many like it) performs isn’t perfect for translating the technology from the lab into the wild. There’s no way to capture a translation for every utterance a language is capable of in a single dataset. Instead, models need to be capable of “zero-shot translation”: translating between language pairs they never saw together during training, for example Arabic → French directly rather than pivoting through English. The authors are cognizant of this, and hold out an evaluation set for testing zero-shot translation capabilities. They find that the 50-50 many-to-many model performs best on this task. This makes some sense, since more languages could push the model towards a more general, language-independent representation, but it also points out that the model doesn’t capture everything required to scale up to the 103-103 level.
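Tying this back to the earlier token sketch: under that same hypothetical scheme, a zero-shot request is simply a direction the English-centric training data never contained, for example:

```python
# Training pairs were English-centric (fr->en, en->de, ...), so fr->de never
# appeared; asking for it anyway is a zero-shot translation request.
zero_shot_input = add_target_token("Le chat est sur la table.", "de")
# Whether the model produces good German from this French input is exactly
# what the held-out zero-shot evaluation measures.
```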