NER and PoS when nothing is capitalized
Link to paper: https://arxiv.org/abs/1903.11222
I spent 4 years doing PhD research at the University of Pennsylvania, and not once did I realize that there was such a strong NLP department two blocks down the road from the Physics building. But that’s grad school for you, at least in the sciences!
This paper is a short-and-sweet investigation of how much Named Entity Recognition (NER) and Part of Speech (PoS) tagging models rely on word capitalization.
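You can see the effect for yourself with an off-the-shelf tagger. The snippet below is my own quick illustration, not the paper’s experimental setup; it assumes spaCy and its small English model are installed (python -m spacy download en_core_web_sm).

```python
import spacy

# Run the same sentence through a pretrained pipeline with and without
# capitalization and compare the entities and PoS tags it produces.
nlp = spacy.load("en_core_web_sm")

sentence = "Apple hired Katie Johnson to lead sales in Paris."
for text in (sentence, sentence.lower()):
    doc = nlp(text)
    print([(ent.text, ent.label_) for ent in doc.ents])   # named entities
    print([(tok.text, tok.pos_) for tok in doc])           # part-of-speech tags
```

On the lowercased version, entities like “Apple” and “Paris” tend to disappear or get mislabeled, which is exactly the sensitivity the paper quantifies.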
wav2vec: unsupervised pre-training for speech recognition
Link to paper: https://arxiv.org/abs/1904.05862
Creating speech recognition models is hard. The data are dense and noisy, and labels are expensive to generate at scale. Schneider et al. provide a scalable approach to overcoming these problems by encoding a large amount of audio data into an information-rich subspace. Once their encoder is trained, they can use it to represent transcribed audio as rich feature vectors, and build speech recognition models that are really just decoders.
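Here’s a heavily simplified sketch of the idea, not the authors’ exact architecture or hyperparameters: a convolutional encoder maps raw audio to latent frames, a context network aggregates them, and a contrastive loss trains the context vectors to pick out the true future frame from negatives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Wav2VecSketch(nn.Module):
    """Toy version: encoder f maps raw audio to latent frames z, context
    network g aggregates them into c, and a contrastive loss pushes c_t to
    identify the true future frame z_{t+k} among other frames as negatives."""
    def __init__(self, dim=256, k=1):
        super().__init__()
        self.k = k
        self.encoder = nn.Sequential(            # strided convs downsample raw audio
            nn.Conv1d(1, dim, 10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, 8, stride=4), nn.ReLU(),
        )
        self.context = nn.Conv1d(dim, dim, 3, padding=1)  # local aggregation (the paper uses a deeper causal stack)
        self.project = nn.Linear(dim, dim)                # step-k prediction head

    def forward(self, wav):                  # wav: (batch, 1, samples)
        z = self.encoder(wav)                # (batch, dim, frames)
        c = self.context(z)                  # (batch, dim, frames)
        z, c = z.transpose(1, 2), c.transpose(1, 2)
        pred = self.project(c[:, :-self.k])  # predictions for z_{t+k}
        target = z[:, self.k:]               # true future frames
        # Contrastive objective: softmax over all frames, true future as label.
        logits = torch.einsum("btd,bsd->bts", pred, target)
        labels = torch.arange(logits.size(1), device=wav.device)
        labels = labels.unsqueeze(0).expand(logits.size(0), -1)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1))

# After pre-training, context(encoder(wav)) can replace spectrogram features
# as input to a downstream acoustic model.
loss = Wav2VecSketch()(torch.randn(2, 1, 16000))
loss.backward()
```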
A CNN for Language-Agnostic Source Code Summarization
Link to paper: https://arxiv.org/abs/1904.00805
Hear me out: what if your code could comment itself? It’s the dream of many a software developer, myself included.
Building a deep NLP architecture for summarizing code comes with major challenges, including different syntaxes for each language (or even versions of the same language; think Python 2.7 vs 3.6) and enormous vocabularies (everyone has their own naming conventions for variables). Moore, Gelman & Slater introduce an interesting encoder-decoder translation architecture to deal with these issues.
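To make that concrete, here is a hedged sketch of what such an encoder-decoder could look like; it is my own simplification, not the paper’s exact model or tokenization. Reading code at the character level sidesteps both language-specific tokenizers and open-ended identifier vocabularies, and a recurrent decoder emits the summary one word at a time.

```python
import torch
import torch.nn as nn

class CodeSummarizer(nn.Module):
    def __init__(self, n_chars=128, n_words=10_000, dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.encoder = nn.Sequential(            # character-level convolutions over code
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.word_emb = nn.Embedding(n_words, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_words)

    def forward(self, code_chars, summary_words):
        # code_chars: (batch, chars_in_snippet)   summary_words: (batch, length)
        x = self.char_emb(code_chars).transpose(1, 2)   # (batch, dim, chars)
        enc = self.encoder(x).mean(dim=2)               # pooled code representation
        h0 = enc.unsqueeze(0)                           # initialize decoder with code vector
        dec, _ = self.decoder(self.word_emb(summary_words), h0)
        return self.out(dec)                            # summary-word logits per position

model = CodeSummarizer()
logits = model(torch.randint(0, 128, (2, 400)), torch.randint(0, 10_000, (2, 12)))
```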
Re-Ranking Words to Improve Interpretability of Automatically Generated Topics
Link to paper: https://arxiv.org/abs/1903.12542
Topic models are a major interest of mine, I think largely due to their underlying simplicity. How do you cluster words in meaningful ways?
In Latent Dirichlet Allocation (LDA), the model learns vector representations of documents as mixtures of “topics”, where each topic is a multinomial distribution over words. The underlying assumption is that each topic is a mixture of words and each document is a mixture of topics, with the topic proportions drawn from a Dirichlet distribution.
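As a toy illustration of the paper’s theme (my own example and ranking heuristic, not the authors’ re-ranking metrics): fit LDA with scikit-learn, then re-rank each topic’s top words by how specific they are to that topic rather than by raw probability, which demotes words that are frequent everywhere.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell as markets slid", "the market rallied on trade news"]

counts = CountVectorizer().fit(docs)
X = counts.transform(docs)
vocab = np.array(counts.get_feature_names_out())

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

for k, dist in enumerate(topic_word):
    default = vocab[np.argsort(dist)[::-1][:5]]        # rank by within-topic probability
    specificity = dist / topic_word.sum(axis=0)         # probability relative to all topics
    reranked = vocab[np.argsort(specificity)[::-1][:5]]
    print(f"topic {k}: {list(default)} -> {list(reranked)}")
```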
Distributed Vector Representations of Folksong Motifs
Link to paper: https://arxiv.org/abs/1903.06353
Following the distributional hypothesis in semantics, the goal of this research is to adopt the skip-gram version of the word2vec model for the distributional representation of melodic units.
I’m not an expert in music theory, but apparently there is some evidence that monophonic melodies adhere to the distributional hypothesis! That is, “you shall know a motif by the company it keeps (those motifs that are nearby it)”.
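A minimal sketch of how this plays out in code (the motif tokens below are made up for illustration; the paper’s motif encoding differs): treat each folksong as a “sentence” of motif labels and train gensim’s skip-gram word2vec on them, so motifs that occur in similar melodic contexts end up with nearby vectors.

```python
from gensim.models import Word2Vec

# Each inner list is one folksong, written as a sequence of (invented) motif labels.
songs = [
    ["+2+2-1", "+2-2+2", "-2-1+3", "+2+2-1"],
    ["+2-2+2", "+2+2-1", "-3+1+2"],
    ["-2-1+3", "-3+1+2", "+2-2+2"],
]

# sg=1 selects the skip-gram objective; window controls how much melodic context counts.
model = Word2Vec(songs, sg=1, vector_size=32, window=2, min_count=1, epochs=50)
print(model.wv.most_similar("+2+2-1", topn=2))   # motifs that keep similar company
```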