Link to paper:

I spent 4 years doing PhD research at the University of Pennsylvania, and not once did I realize that there was such a strong NLP department two blocks down the road from the Physics building. But that’s grad school for you, at least in the sciences!

This paper is a short-and-sweet investigation of how much Named Entity Recognition (NER) and Part of Speech (PoS) tagging models rely on word capitalization. The quick answer is: a lot. A bidirectional LSTM NER model based on ELMo vectors falls from an F1 score of 92.45 to 34.46 when all characters are lowercased. The decrease for a similar model performing PoS tagging is less stark, but still appreciable: from 97.85 on cased data to 88.66 on uncased data.

NLP scientists often lowercase their data almost as common wisdom: it's just another processing stage in standard pipelines. Could this be driving down the effectiveness of existing models? Conversely, models are sometimes given uncased data at inference time: this is especially true in online communities.

One approach to this problem is to train a truecasing model. Such a model, given lowercased strings, predicts which characters should be capitalized. Mayhew, Tsygankova & Roth train such a model (a bidirectional LSTM with a linear, binary classification layer on top) and deploy it in a number of experiments that test the robustness of NER and PoS models to different forms of “uncased-ness”:

  • Baseline – train on cased data.
  • Lowercase everything.
  • Train on a 50-50 mix of cased and uncased data, evaluate on cased data.
  • Train on cased data, test on truecased data (this lets them test the effectiveness of truecasing alone).
  • Train and test on truecased data.
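To make the truecasing setup concrete, here is a minimal sketch of the character-level framing such a model is trained on: lowercase the text and keep a 0/1 label per character marking where the capitals were. This is not the authors' BiLSTM, just the label extraction and recasing steps; the function names are my own.

```python
def make_labels(cased: str) -> tuple[str, list[int]]:
    """Lowercase a string and emit one binary label per character
    (1 = this character was uppercase in the original)."""
    lowered = cased.lower()
    labels = [1 if c.isupper() else 0 for c in cased]
    return lowered, labels

def apply_labels(lowered: str, labels: list[int]) -> str:
    """Recase a lowercased string from predicted per-character labels."""
    return "".join(c.upper() if y else c for c, y in zip(lowered, labels))

# Round trip: lowercasing plus gold labels recovers the original casing.
text = "Mayhew works at the University of Pennsylvania."
lowered, labels = make_labels(text)
assert apply_labels(lowered, labels) == text
```

A trained truecaser replaces the gold labels with per-character predictions from the BiLSTM, and `apply_labels` turns those predictions back into cased text for the downstream NER or PoS model.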

Their results for NER are not surprising. Of course the baseline takes the top spot when testing on cased data – this is the optimal situation. When testing on uncased data, if a truecasing model is used, performance also jumps – this is getting as close as possible to the optimal situation. On average, the 50-50 mix model wins out – this is what it was trained to do! It’s a similar story for PoS, except that truecasing proves less effective (this also makes sense, since case is less important for this task).

If the results are unsurprising, this does not at all mean that Mayhew, Tsygankova & Roth’s work is not important and timely. Their analysis of the seemingly innocuous effect of a standard piece of NLP pipelines is systematic and well thought out. As the field of NLP broadens, this self-analytical lens will become increasingly important.

(I also believe that their results on the 50-50 experiment support the growing consensus that multi-task learning is the most effective training regimen for NLP models.)