Link to paper: https://arxiv.org/abs/1903.05987

When building deep learning NLP models, a common way to speed up the training process is to get someone else to do most of the work for you. That is, transfer learn from a pre-trained model. Good pre-trained models are those that have been exposed to a wide variety of text, so their representation of semantic/syntactic space is useful for many tasks. One can adapt these models by:

  1. using them to transform text features into a more useful (vector) representation and feeding those representations into a downstream task, or

  2. fine-tuning the model, retraining it (typically with a much smaller learning rate) to perform a given task (both approaches are sketched below).
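
To make the distinction concrete, here is a minimal sketch of both approaches using PyTorch and the Hugging Face Transformers library (my choice for illustration, not the paper's actual setup; the model name, toy task head, and learning rate are placeholder assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # toy task head

inputs = tokenizer("A sentence to classify.", return_tensors="pt")

# 1. Feature extraction: freeze the encoder and train only the task head.
for param in encoder.parameters():
    param.requires_grad = False
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] vector
logits = classifier(features)

# 2. Fine-tuning: unfreeze the encoder and update all weights end-to-end,
#    typically with a much smaller learning rate than training from scratch.
for param in encoder.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(classifier.parameters()), lr=2e-5
)
logits = classifier(encoder(**inputs).last_hidden_state[:, 0])
```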

Peters, Ruder & Smith address the question: is one method preferable to the other for a given task or a given pre-trained model? Specifically, they compare the two adaptation methods for the pre-trained BERT and ELMo models on a wide variety of tasks. A fantastic summary of these models can be found here, but briefly, BERT and ELMo differ in their approach to modelling language. BERT uses stacked Transformer layers to encode sentences, drawing upon the self-attention mechanism to form an encoding for each word based on its similarity to the other words in the sentence. ELMo, on the other hand, uses bidirectional LSTMs to read the whole sentence forwards and backwards before generating an embedding. The authors are interested in whether the different notions of “context” in these two models affect their performance as seed models for other tasks.
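
To give a flavour of the self-attention mechanism at the heart of BERT, here is a toy sketch of scaled dot-product self-attention in PyTorch (my own illustration, not code from the paper; the dimensions and random projection matrices are placeholders):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token embeddings; w_*: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise token similarities
    weights = F.softmax(scores, dim=-1)      # attention weights over the sentence
    return weights @ v                       # context-aware vector per token

d_model, d_k, seq_len = 16, 8, 5
x = torch.randn(seq_len, d_model)                              # toy "sentence"
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))  # toy projections
out = self_attention(x, w_q, w_k, w_v)                         # (seq_len, d_k)
```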

(As an FYI, they use a lot of emojis in this paper, so it’s worth scrolling through it just for that!)

The authors test BERT and ELMo in feature extraction and fine-tuning modes on challenges including named entity recognition, sentiment analysis, and a range of sentence-pair tasks. They find that ELMo typically performs better as a feature extraction model, whereas BERT performs better when fine-tuned. Moreover, BERT saw a bigger performance boost when fine-tuned on semantic similarity tasks, whereas ELMo saw a bigger boost when used for feature extraction on natural language inference tasks. The authors point out that this is in line with the models' encoding methods, with BERT explicitly computing similarities between tokens via self-attention and ELMo drawing upon its language-modelling objective.

Digging deeper, the authors find no significant impact of the target corpus on the performance of the models. BiLSTMs are harder to fine-tune than Transformers, which might explain ELMo's weaker results in that setting. That said, by probing the hidden layers of each model, they do see a performance boost in both BERT and ELMo from fine-tuning.
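
As a rough illustration of what probing hidden layers can look like, here is a toy sketch using the Hugging Face Transformers library and scikit-learn (my own example, not the paper's diagnostic setup; the sentences and labels are made-up placeholders):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The cat sat on the mat.", "Stocks fell sharply on Monday."]
labels = np.array([0, 1])  # placeholder task labels, purely illustrative

inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = encoder(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors of shape (batch, seq, hidden).
# Train one simple linear probe per layer on that layer's [CLS] vectors.
for i, layer in enumerate(outputs.hidden_states):
    cls_vectors = layer[:, 0, :].numpy()
    probe = LogisticRegression(max_iter=1000).fit(cls_vectors, labels)
    print(f"layer {i}: train accuracy {probe.score(cls_vectors, labels):.2f}")
```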

NLP practitioners such as myself rely on strong pre-trained models from companies like Google and research institutions like Allen AI, which have access to very large corpora and huge processing power, to seed our machine learning products. Peters, Ruder & Smith provide empirical evidence that the architectural choices behind these pre-trained models affect performance on downstream tasks. One needs to choose a pre-trained model that captures semantics in a way that suits the task at hand.