Link to paper: https://arxiv.org/abs/1902.09875

In this work, Schmidt provides a simple and coherent derivation of what he calls an “optimal embedding” for a document. Optimality is defined with respect to downstream tasks: the embedding should maximize the similarity of related documents. For example, as an employee at TripAdvisor, he would like to group reviews about the same location.

Schmidt approaches the calculation of an optimal embedding through a weighted sum of word2vec skip-gram embeddings. One of the key innovations of this work is the introduction of a 𝛿 factor for each token as a weight. 𝛿 is defined as the difference between a token's term frequency in the document being embedded and its term frequency in the overall corpus. A positive 𝛿 implies the token is over-represented in the document; a negative 𝛿 implies it is under-represented.
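
To make the weighting concrete, here is a minimal sketch of how a 𝛿-weighted document embedding could be computed. The function name, the dict of pre-trained skip-gram vectors, and the use of relative term frequencies are my own assumptions for illustration, not code from the paper.

```python
from collections import Counter

import numpy as np


def delta_weighted_embedding(doc_tokens, corpus_counts, corpus_total, word_vectors):
    """Sum pre-trained word vectors, weighting each token by delta:
    its term frequency in this document minus its term frequency
    in the overall corpus (illustrative sketch, not the paper's code)."""
    doc_counts = Counter(doc_tokens)
    doc_total = len(doc_tokens)
    dim = len(next(iter(word_vectors.values())))
    embedding = np.zeros(dim)
    for token, count in doc_counts.items():
        if token not in word_vectors:
            continue  # skip out-of-vocabulary tokens
        tf_doc = count / doc_total                              # frequency within this document
        tf_corpus = corpus_counts.get(token, 0) / corpus_total  # frequency across the corpus
        delta = tf_doc - tf_corpus                              # > 0: over-represented in the document
        embedding += delta * np.asarray(word_vectors[token])
    return embedding
```

Tokens that appear no more often in the document than in the corpus at large contribute little (or negatively), while distinctive tokens dominate the sum.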

If this sounds like TF-IDF, it should. The scheme lets Schmidt calculate document vectors over a large corpus in a very interpretable fashion, using a sparse, frequency-based representation as the weights for a dense representation of the document.

Comparing against doc2vec on the CQADupStack benchmark (finding duplicate StackExchange questions), he requires an additional principal component projection step to show a consistent improvement (and even then doesn't always beat the doc2vec baseline).
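
The write-up above doesn't pin down the exact projection, so treat the following as a hedged sketch of one common variant: removing the top shared principal component(s) from the matrix of document embeddings before computing similarities. The function name and the use of scikit-learn's TruncatedSVD are my choices, not details from the paper.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD


def remove_top_components(doc_embeddings, n_components=1):
    """Project out the top principal direction(s) shared across all
    document embeddings (one common post-processing variant for
    bag-of-vector document representations; illustrative only)."""
    svd = TruncatedSVD(n_components=n_components)
    svd.fit(doc_embeddings)
    # Subtract each embedding's projection onto the top component(s).
    proj = doc_embeddings @ svd.components_.T @ svd.components_
    return doc_embeddings - proj
```

The intuition behind such a step is that the leading component often captures corpus-wide effects rather than document content, so subtracting it can sharpen pairwise similarities.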

Importantly, all of his methods outperform simply summing the dense vectors together for a given document. A simple sum is the M.O. for quick NLP analyses. Given the simplicity and effectiveness of Schmidt’s weighting scheme, I’m going to start implementing it more often.