Link to paper: https://arxiv.org/abs/1903.06353

The two sentences:

Gotta see both sides of the story

You have to consider both sides of the story

contain the same content, but one is more formal than the other. Performing the “translation” of one into the other can be framed as a style transfer task. Style transfer is typically associated with the artistic domain, which makes this NLP paper particularly interesting.

Xu, Ge & Wei point out that framing the problem as a translation task, where one could use a seq2seq model, generally requires large amounts of parallel training data, which does not currently exist for this problem. They propose a hybrid remedy: keep the seq2seq formulation, but augment it so that it can train on both parallel data and labelled data, where the labels indicate whether the input is “formal” or “informal”.
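
As a concrete (purely illustrative) picture of what mixing parallel and labelled data means here, the sketch below shows the two kinds of training examples such a hybrid setup can consume. The sentences, field names, and helper function are my own stand-ins, not the authors’:

```python
# Purely illustrative: the two kinds of training data a hybrid
# formality-transfer setup can consume.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    sentence: str
    formality: str                 # "formal" or "informal" label
    target: Optional[str] = None   # parallel translation, when one exists

# A parallel pair (scarce, expensive to collect).
parallel = Example(
    "Gotta see both sides of the story",
    formality="informal",
    target="You have to consider both sides of the story",
)

# A label-only sentence (cheap and plentiful).
label_only = Example("Thanks a bunch for the help!", formality="informal")

def supports_translation_loss(ex: Example) -> bool:
    """Only parallel pairs can supply a supervised translation loss; label-only
    examples still contribute to the classifier- and reconstruction-based
    losses described below."""
    return ex.target is not None

print(supports_translation_loss(parallel))    # True
print(supports_translation_loss(label_only))  # False
```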

Their modelling approach uses an autoencoder to project the input sentence into a “style” subspace, which is then decoded into its formality translation. With parallel data, a standard translation loss can be computed to update the autoencoder. The authors perform this update whenever parallel data is available, but they also combine it with three other loss functions (a toy sketch combining all four terms appears after the list below):

  • Classification-guided loss: a formal/informal classifier is trained separately, and the classifier’s probability that the decoded sentence belongs to its target class is used as a training signal to update the autoencoder weights.

If the model used only the classification-guided loss, the autoencoder could simply learn to output keywords that the classifier labels as formal or informal. To avoid this, two reconstruction losses are also implemented:

  • Self-reconstruction loss: train the autoencoder to reconstruct the input – this is equivalent to keeping the input formality label the same as the desired output formality.
  • Cycled-reconstruction loss: include a “looped transformation”. Given an input x and output S(x), treat S(x) as an additional input data point with “pseudo-parallel translation data” x, and compute the reconstruction loss of this x -> S(x) -> x translation cycle.
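
To make the mixed objective concrete, here is a minimal PyTorch sketch of how the four terms could be combined into a single training loss. The toy model is a linear autoencoder over fixed-size sentence vectors with a style flag, standing in for the paper’s seq2seq model; the equal loss weights, dimensions, and random placeholder “sentence embeddings” are my assumptions, not the authors’ settings:

```python
# Hypothetical sketch of the combined objective; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 32  # toy sentence-vector size (placeholder)

class ToyStyleTransfer(nn.Module):
    """Linear stand-in for the seq2seq autoencoder: encode a sentence vector
    plus a target-style flag, decode back to a sentence vector."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.encoder = nn.Linear(dim + 1, dim)  # +1 for the style flag
        self.decoder = nn.Linear(dim, dim)

    def forward(self, x, target_style):
        # target_style: 0.0 = informal, 1.0 = formal
        flag = torch.full((x.size(0), 1), target_style)
        z = torch.tanh(self.encoder(torch.cat([x, flag], dim=1)))
        return self.decoder(z)

model = ToyStyleTransfer()
style_clf = nn.Linear(DIM, 1)      # stand-in for the separately trained classifier
for p in style_clf.parameters():   # keep the classifier frozen
    p.requires_grad_(False)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x_informal = torch.randn(8, DIM)  # placeholder informal sentence vectors
y_formal = torch.randn(8, DIM)    # placeholder parallel formal translations

for step in range(3):
    out = model(x_informal, target_style=1.0)  # informal -> formal

    # 1. Translation loss: only available when a parallel target exists.
    loss_translation = F.mse_loss(out, y_formal)

    # 2. Classification-guided loss: push the output toward the target class
    #    as judged by the frozen formal/informal classifier.
    loss_cls = F.binary_cross_entropy_with_logits(
        style_clf(out), torch.ones(x_informal.size(0), 1))

    # 3. Self-reconstruction loss: keep the target style equal to the input
    #    style and ask the model to reproduce the input.
    loss_self = F.mse_loss(model(x_informal, target_style=0.0), x_informal)

    # 4. Cycle-reconstruction loss: treat the formal output as a pseudo-parallel
    #    input (hence the detach) and translate it back to the original.
    loss_cycle = F.mse_loss(model(out.detach(), target_style=0.0), x_informal)

    loss = loss_translation + loss_cls + loss_self + loss_cycle  # equal weights (assumption)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the real model the decoder produces token sequences, so the translation and reconstruction terms would be token-level cross-entropy rather than MSE; the sketch is only meant to show how the four signals are combined into one update.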

There are few formalized benchmarks for the authors to compare their algorithm against, so they also implement a number of alternative methods as baselines. Their model outperforms all of them. They also evaluate whether they can maintain or transfer the sentiment of a sentence on unseen datasets. They don’t always beat established benchmarks there, but they are in the same ballpark in terms of accuracy or BLEU score.

For me, the main takeaway from this paper is the hybrid training approach: the authors leverage different kinds of training data to accomplish their goal and mix loss functions from different components of the model in interesting ways.