Link to paper:

Topic models are a major interest of mine, I think largely due to their underlying simplicity. How do you cluster words in meaningful ways?

In Latent Dirichlet Allocation (LDA), the model learns vector representations of documents as mixtures of “topics”, where each topic is a multinomial distribution over words. The underlying assumption is that each topic is a mixture of words and each document is a mixture of topics, with the mixtures drawn from Dirichlet distributions. From this assumption, one can build an unsupervised generative model that learns each topic from the words shared between documents. Each word in an unseen document then provides evidence about which topics that document represents.

Topic words are typically presented in order of how strongly they contribute to the given topic, i.e. by the probability that they contribute to a document being labelled with that topic. For example, a topic referring to “science museums” might be represented by the words:

space, museum, years, history, science, earth, mission, ...

where space and museum boost the probability of a document being labelled with the topic “science museum” more than the word mission does.

The authors of this paper pursue the question: does the order in which we present these words affect the interpretability of the topic? How might we re-rank them? To cut the story short, the answers are: yes, definitely, and use TF-IDF.

Alokaili, Aletras & Stevenson consider four ranking schemes:

  • Orig: simply ranking by the probability a word contributes to a topic, as demonstrated above.
  • Norm: ranking by a “normalized probability”, which is the probability a word contributes to a topic, normalized by the sum of probabilities for that word over all of the topics.
  • TF-IDF: ranking by the probability the word contributes to a topic, multiplied by the log of the probability of the word normalized by the sum of the log probabilities of the word over all topics. [This one sounds complicated, but it’s simple enough if you look at Equation 2 in the paper.]
  • IDF: ranking by the probability the word contributes to the topic, multiplied by the log rate at which the word appears in the corpus.

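The four schemes can be written down directly from the topic-word matrix. The sketch below is my reading of the descriptions above, not the paper's code: the TF-IDF form in particular is my interpretation (it resembles the familiar "term score" of comparing a word's log probability against its average across topics), so check Equation 2 of the paper for the exact definition, and `corpus_word_rate` is a hypothetical input I introduce for the IDF scheme.

```python
import numpy as np

def rank_words(phi, corpus_word_rate, vocab, scheme="tfidf", top_n=7):
    """Rank each topic's words under one of the four schemes.

    phi:              shape (n_topics, n_words), p(word | topic)
    corpus_word_rate: shape (n_words,), fraction of corpus occurrences
                      per word (only used by 'idf')
    Returns a list of top-n word lists, one per topic.
    """
    if scheme == "orig":
        # rank by p(word | topic) directly
        scores = phi
    elif scheme == "norm":
        # normalize each word's probability by its total over all topics
        scores = phi / phi.sum(axis=0, keepdims=True)
    elif scheme == "tfidf":
        # p(word | topic) times its log probability relative to the
        # average log probability of the word over all topics
        log_phi = np.log(phi)
        scores = phi * (log_phi - log_phi.mean(axis=0, keepdims=True))
    elif scheme == "idf":
        # down-weight words that are frequent across the whole corpus
        scores = phi * np.log(1.0 / corpus_word_rate)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [[vocab[i] for i in np.argsort(-row)[:top_n]] for row in scores]

# Toy example: a common word ("the") is demoted by the IDF scheme.
vocab = ["film", "the", "vampire"]
phi = np.array([[0.5, 0.4, 0.1],
                [0.1, 0.4, 0.5]])
rate = np.array([0.10, 0.80, 0.02])
print(rank_words(phi, rate, vocab, "orig", top_n=3)[0])  # → ['film', 'the', 'vampire']
print(rank_words(phi, rate, vocab, "idf", top_n=3)[0])   # → ['film', 'vampire', 'the']
```

Note how the TF-IDF-style score is exactly zero for a word whose probability is the same in every topic: such a word carries no information about which topic is which.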
Each of these methods behaves in very interesting ways. Consider one of their example topics, about movies:

*Orig*: film, even, movie, world, stars, man, much, little, ...
*Norm*: vampire, que, winchell, tomei, westin, swain, marisa, ...
*TF-IDF*: film, movie, stars, vampire, rating, spielberg, hollywood, ...
*IDF*: film, movie, stars, vampire, rating, star, spielberg, ...

Training an LDA model on a large, generic corpus, they conduct experiments based on both human and automatic interpretability of generated topics.

In the human experiment, participants were shown a document along with a list of candidate topics, one of which was the “correct” topic. Each topic was represented by its words ranked in one of the four ways, with either 5, 10 or 20 words shown. The authors measured both the time taken to choose a label and the accuracy of the labelling.

In the automatic experiment, the top-ranked 5, 10 or 20 words were used as queries for automatic retrieval of documents, evaluated against a gold-standard set of documents that each correspond to one of the topics.
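As a toy illustration of the retrieval idea (this is my simplification, not the paper's actual retrieval model): use a topic's top-ranked words as a query, score each document by how many query words it contains, and return the best matches.

```python
# Score documents by query-word overlap and return the top-k matches
# (indices of documents containing at least one query word).
def retrieve(docs, query_words, k=2):
    scores = [(sum(w in doc.split() for w in query_words), i)
              for i, doc in enumerate(docs)]
    return [i for s, i in sorted(scores, reverse=True)[:k] if s > 0]

docs = [
    "the film stars a vampire in hollywood",
    "the space mission studied the earth",
    "a new movie rating for the spielberg film",
]
print(retrieve(docs, ["film", "movie", "stars", "vampire"]))  # → [0, 2]
```

The better the ranking scheme, the more discriminative the query words, and the closer the retrieved set lands to the gold standard.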

Both tests showed a strong preference for TF-IDF weighting, with a weak dependence on the number of words shown: when only the 5 top-ranked words were shown, IDF occasionally outperformed it. The authors sum it up well:

> Re-ranking the topic words was found to improve the interpretability of topics and therefore should be used as a post-processing step to improve topic representation. The most effective re-ranking schemes were those which combined information about the importance of words both within topics and their relative frequency in the entire corpus, thereby ensuring that less informative words are not used.

Like my previous post on weighted document embedding, it’s always great to see relatively simple weighting schemes show so much promise in more complicated or nuanced settings. It also makes me feel better that TF-IDF is always my first stop when it comes to weighting my data!