Link to paper: https://arxiv.org/abs/1903.05041

This is a short-and-sweet paper that answers a well-defined question: do a language's morphology and orthography (the way its words are built and written down) change the way in which recurrent models encode it?

To perform part-of-speech (POS) tagging (identifying nouns, adjectives, etc.), modellers typically employ recurrent neural nets such as LSTMs to read through the text and estimate, for each word, the probability that it belongs to a given class. The models studied here don't move word by word but subword by subword, where the subwords can be characters or byte pairs. More recently, bidirectional recurrent encoders (e.g. Bi-LSTMs) have become common: they build a hidden representation of a word or sentence by reading it both forwards and backwards.
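To make the setup concrete, here is a minimal sketch of a character-level Bi-LSTM word encoder for POS tagging, written in PyTorch. This is an illustrative toy, not the authors' exact architecture; the class name and dimensions are my own.

```python
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    """Toy character-level Bi-LSTM word encoder for POS tagging."""

    def __init__(self, n_chars, n_tags, char_dim=50, hidden_dim=100):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        self.bilstm = nn.LSTM(char_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, char_ids):
        # char_ids: (batch, word_length) integer indices of each word's characters
        embedded = self.char_embed(char_ids)    # (batch, word_length, char_dim)
        _, (h_n, _) = self.bilstm(embedded)     # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states
        word_repr = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.out(word_repr)              # (batch, n_tags) tag scores
```

The design choice at the heart of the paper is hidden in `bidirectional=True`: the same word is read once left-to-right and once right-to-left, and both readings contribute to the word representation.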

Consider the English sequence of characters:

c h a r a c t e r i z i n g

This should be tagged as a verb (specifically, a present participle). Is it more efficient to read it forwards (starting with c), backwards (starting with g), or both ways around, to perform the tagging? How many layers should be used for each direction? One might expect a backwards encoder to pick up the -ing suffix sooner, allowing it to tag the word more efficiently.
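To make the directional question concrete, here is a small illustration (an untrained LSTM with made-up dimensions, not anything from the paper): run one unidirectional character LSTM over the word and over its reversal, and keep the hidden state produced after each character.

```python
import torch
import torch.nn as nn

char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
embed = nn.Embedding(len(char_vocab), 16)
lstm = nn.LSTM(16, 32, batch_first=True)

def per_step_states(word):
    ids = torch.tensor([[char_vocab[c] for c in word]])  # (1, word_length)
    outputs, _ = lstm(embed(ids))                         # (1, word_length, 32)
    return outputs.squeeze(0)                             # one hidden state per character

forward_states = per_step_states("characterizing")         # suffix arrives last
backward_states = per_step_states("characterizing"[::-1])  # 'g', 'n', 'i' arrive first

# After three steps the backward pass has already consumed the "-ing" suffix,
# so its early hidden states can in principle carry the verb signal; the
# forward pass only sees the suffix in its final few steps.
print(backward_states.shape)  # torch.Size([14, 32])
```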

By investigating the activations of forward, backward and bidirectional encoders trained to perform POS tagging, Pinter, Marone & Eisenstein found that the answer to this question is language-specific.
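One simple way to poke at this yourself (not necessarily the authors' exact analysis) is to split a trained Bi-LSTM's word representations into their forward and backward halves and train a linear probe on each; the half that supports the better probe carries more of the POS signal. Here `word_reprs`, `tags` and `hidden_dim` are hypothetical inputs you would extract from a trained tagger like the sketch above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_directions(word_reprs: np.ndarray, tags: np.ndarray, hidden_dim: int):
    """Compare how well each direction's final hidden state predicts the POS tag."""
    halves = {
        "forward": word_reprs[:, :hidden_dim],   # final forward hidden state
        "backward": word_reprs[:, hidden_dim:],  # final backward hidden state
    }
    scores = {}
    for name, feats in halves.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            feats, tags, test_size=0.2, random_state=0)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores[name] = probe.score(X_te, y_te)
    return scores  # e.g. {"forward": 0.91, "backward": 0.78} (numbers illustrative)
```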

Their own conclusion puts it best:

While character-level Bi-LSTM models compute meaningful word representations across many languages, the way they do it depends on each language’s typological properties. These observations can guide model selection: for example, in agglutinative languages we observe a strong preference for a single direction of analysis, motivating the use of unidirectional character-level LSTMs for at least this type of language.

I love it when authors delve into the specifics of their trained models' weights. It lets you trace how information flows through the neural architecture, giving insight into an otherwise black box. This is a short, thorough and well-written example of that!