Link to paper: https://arxiv.org/abs/1903.00041

An interesting aspect of training deep NLP models is that the order in which you present training data to them matters. This can be very useful for transfer learning: train the model on a lot of general data from a large corpus, then fine-tune it on less noisy, more in-domain data that you actually care about. This way, modellers get both a broad grasp of semantics and syntax and a narrower, in-depth understanding of the task at hand.
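Roughly, that recipe looks like the sketch below: a hypothetical two-phase training loop that first samples from a large general corpus and then from a smaller in-domain set. The model, corpora, and `train_step` function are stand-ins, not any particular library's API.

```python
# A minimal sketch of the two-step recipe above: pretrain on a large general
# corpus, then fine-tune on a smaller in-domain set. The model, corpora, and
# `train_step` function are hypothetical stand-ins, not any particular library.

def two_step_training(model, general_corpus, in_domain_corpus, train_step,
                      pretrain_steps=100_000, finetune_steps=10_000):
    # Step 1: broad exposure to syntax and semantics from the large, noisier corpus.
    for _ in range(pretrain_steps):
        train_step(model, general_corpus.sample_batch())

    # Step 2: adapt to the cleaner, in-domain data we actually care about.
    for _ in range(finetune_steps):
        train_step(model, in_domain_corpus.sample_batch())

    return model
```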

One can think of this transfer learning method as a two-step curriculum. But what if we went deeper and instead decided the order in which each sentence in the dataset is presented to the model during training?

A team of Google researchers recently investigated this by grading input sentences by noise, and training a neural machine translation (NMT) model while gradually decreasing the proportion of noisy sentences it was taught to translate (Wang et al. 2018). This allowed the model to perform well in general, while being robust to severe noise.
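As a rough illustration of that kind of schedule, here is a hypothetical batch sampler that linearly anneals the share of noisy sentences in each batch toward zero. The corpora, batch size, and linear schedule are my assumptions, not Wang et al.'s exact recipe.

```python
import random

# A hypothetical annealing schedule in the spirit of Wang et al. (2018): each
# batch mixes clean and noisy sentence pairs, and the noisy share shrinks
# linearly toward zero over training.

def noisy_fraction(step, total_steps, start=0.5, end=0.0):
    """Linearly anneal the share of noisy examples in each batch."""
    progress = min(step / total_steps, 1.0)
    return start + (end - start) * progress

def sample_batch(clean_pairs, noisy_pairs, step, total_steps, batch_size=64):
    n_noisy = int(batch_size * noisy_fraction(step, total_steps))
    batch = random.sample(noisy_pairs, n_noisy)
    batch += random.sample(clean_pairs, batch_size - n_noisy)
    random.shuffle(batch)
    return batch
```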

Kumar et al. take this a step further, letting the machine choose its own training curriculum, at least with respect to noise, by casting curriculum selection as meta-learning in a reinforcement learning (RL) context. They divide their training corpus into six bins of increasing noise, and before each training step the RL agent chooses how much noise to train on by picking a bin.
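The binning step might look something like the sketch below, where `noise_scores` stands in for whatever per-sentence noise score is used to grade the data, and the corpus is simply cut at score quantiles.

```python
import numpy as np

# An illustrative way to split a parallel corpus into six bins of increasing
# noise. `noise_scores` is an assumed per-sentence noise/quality score.

def bin_by_noise(sentence_pairs, noise_scores, n_bins=6):
    order = np.argsort(noise_scores)          # indices sorted cleanest-first
    index_bins = np.array_split(order, n_bins)
    return [[sentence_pairs[i] for i in idx] for idx in index_bins]
```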

The RL paradigm they use is called “deep Q-learning” (DQN): the RL agent receives an observation from the environment (here, the state of NMT training), conditions on it, and produces an action to execute in the environment (which noise bin to draw the next batch from). The reward it receives reflects how good that action was.
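To make the moving parts concrete, here is a minimal, hypothetical version of that interaction loop. A linear Q-function stands in for the paper's deep Q-network, and `env` is a placeholder for the NMT-training environment (`env.observe()` returns a feature vector, `env.step(bin_idx)` trains the NMT on a batch from that bin and returns a reward). None of this is the authors' implementation.

```python
import numpy as np

# Minimal Q-learning loop over noise bins. A linear Q-function replaces the
# deep Q-network for brevity; `env` is a placeholder NMT-training environment.

class LinearQAgent:
    def __init__(self, obs_dim, n_actions=6, epsilon=0.1, lr=0.01, gamma=0.9):
        self.w = np.zeros((n_actions, obs_dim))   # one weight row per noise bin
        self.n_actions, self.epsilon = n_actions, epsilon
        self.lr, self.gamma = lr, gamma

    def q_values(self, obs):
        return self.w @ obs

    def act(self, obs):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)   # explore
        return int(np.argmax(self.q_values(obs)))      # exploit

    def update(self, obs, action, reward, next_obs):
        # One-step Q-learning backup: move Q(obs, action) toward the TD target.
        target = reward + self.gamma * np.max(self.q_values(next_obs))
        td_error = target - self.q_values(obs)[action]
        self.w[action] += self.lr * td_error * obs

def train_with_agent(env, agent, steps=1000):
    obs = env.observe()
    for _ in range(steps):
        action = agent.act(obs)        # which noise bin to draw the next batch from
        reward = env.step(action)      # train the NMT on that bin; get a reward
        next_obs = env.observe()
        agent.update(obs, action, reward, next_obs)
        obs = next_obs
```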

In this case, the observation is the NMT’s performance on sentences drawn from each noise-level bin. Performance is a function of log likelihood, which changes dramatically in the early stages of training; because of this, the authors first train the NMT without reinforcement, letting it level out a little before switching the agent on. The reward is the change in the log likelihood of translations from a held-out validation set.
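Concretely, the observation and reward might be computed along these lines. `nmt_log_likelihood` is an assumed helper returning the mean per-sentence log likelihood under the current model, `prototype_batches` holds one small batch drawn from each noise bin, and `dev_set` is the held-out validation data; the warm-up phase (plain NMT training before the agent is switched on) is omitted.

```python
import numpy as np

# Hypothetical observation and reward for the curriculum agent.

def observe(model, prototype_batches, nmt_log_likelihood):
    # Observation: the NMT's current performance on a sample from each noise bin.
    return np.array([nmt_log_likelihood(model, batch) for batch in prototype_batches])

def delta_reward(model, dev_set, prev_dev_ll, nmt_log_likelihood):
    # Reward: the change in held-out log likelihood since the last agent step.
    dev_ll = nmt_log_likelihood(model, dev_set)
    return dev_ll - prev_dev_ll, dev_ll
```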

The RL behavior that results from this training schedule is fascinating. Instead of following the heuristics of Wang et al. (2018), it very rapidly settles on a constant 80/20 split between the cleanest and second-noisiest bins, respectively. Its performance is similar to that of Wang et al.’s model.
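In other words, the learned policy reduces to something as simple as the sketch below, assuming bins are indexed 0 (cleanest) through 5 (noisiest); this is purely illustrative.

```python
import random

# Sketch of the learned policy: draw 80% of batches from the cleanest bin
# and 20% from the second-noisiest (index 4 of 0..5). Purely illustrative.

def sample_bin():
    return 0 if random.random() < 0.8 else 4
```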

Unfortunately, they don’t dig into this interesting behavior, leaving it to future work. I look forward to reading their findings.