NER and PoS when nothing is capitalized
Link to paper: https://arxiv.org/abs/1903.11222
I spent 4 years doing PhD research at the University of Pennsylvania, and not once did I realize that there was such a strong NLP department two blocks down the road from the Physics building. But that’s grad school for you, at least in the sciences!
This paper is a short-and-sweet investigation of how much Named Entity Recognition (NER) and Part of Speech (PoS) tagging models rely on word capitalization.
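You can see the effect for yourself with an off-the-shelf tagger. The snippet below is my own quick illustration, not the paper’s experimental setup; it assumes spaCy and its small English model are installed (python -m spacy download en_core_web_sm).

```python
import spacy

# Run the same sentence through a pretrained pipeline with and without
# capitalization and compare the entities and PoS tags it produces.
nlp = spacy.load("en_core_web_sm")

sentence = "Apple hired Katie Johnson to lead sales in Paris."
for text in (sentence, sentence.lower()):
    doc = nlp(text)
    print([(ent.text, ent.label_) for ent in doc.ents])   # named entities
    print([(tok.text, tok.pos_) for tok in doc])           # part-of-speech tags
```

On the lowercased version, entities like “Apple” and “Paris” tend to disappear or get mislabeled, which is exactly the sensitivity the paper quantifies.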
wav2vec: unsupervised pre-training for speech recognition
Link to paper: https://arxiv.org/abs/1904.05862
Creating speech recognition models is hard. The data are dense and noisy, and labels are expensive to generate at scale. Schneider et al. provide a scalable approach to overcoming these problems by encoding a large amount of audio data into an information-rich subspace. Once their encoder is trained, they can use it to represent transcribed audio as rich feature vectors, and build speech recognition models that are really just decoders.
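Here’s a heavily simplified sketch of the idea, not the authors’ exact architecture or hyperparameters: a convolutional encoder maps raw audio to latent frames, a context network aggregates them, and a contrastive loss trains the context vectors to pick out the true future frame from negatives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Wav2VecSketch(nn.Module):
    """Toy version: encoder f maps raw audio to latent frames z, context
    network g aggregates them into c, and a contrastive loss pushes c_t to
    identify the true future frame z_{t+k} among other frames as negatives."""
    def __init__(self, dim=256, k=1):
        super().__init__()
        self.k = k
        self.encoder = nn.Sequential(            # strided convs downsample raw audio
            nn.Conv1d(1, dim, 10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, 8, stride=4), nn.ReLU(),
        )
        self.context = nn.Conv1d(dim, dim, 3, padding=1)  # local aggregation (the paper uses a deeper causal stack)
        self.project = nn.Linear(dim, dim)                # step-k prediction head

    def forward(self, wav):                  # wav: (batch, 1, samples)
        z = self.encoder(wav)                # (batch, dim, frames)
        c = self.context(z)                  # (batch, dim, frames)
        z, c = z.transpose(1, 2), c.transpose(1, 2)
        pred = self.project(c[:, :-self.k])  # predictions for z_{t+k}
        target = z[:, self.k:]               # true future frames
        # Contrastive objective: softmax over all frames, true future as label.
        logits = torch.einsum("btd,bsd->bts", pred, target)
        labels = torch.arange(logits.size(1), device=wav.device)
        labels = labels.unsqueeze(0).expand(logits.size(0), -1)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1))

# After pre-training, context(encoder(wav)) can replace spectrogram features
# as input to a downstream acoustic model.
loss = Wav2VecSketch()(torch.randn(2, 1, 16000))
loss.backward()
```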
A CNN for Language-Agnostic Source Code Summarization
Link to paper: https://arxiv.org/abs/1904.00805
Hear me out: what if your code could comment itself? It’s the dream of many a software developer, myself included.
Building a deep NLP architecture for summarizing code comes with major challenges, including different syntaxes for each language (or even versions of the same language; think Python 2.7 vs 3.6) and enormous vocabularies (everyone has their own naming conventions for variables). Moore, Gelman & Slater introduce an interesting encoder-decoder translation architecture to deal with these issues.
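To make that concrete, here is a hedged sketch of what such an encoder-decoder could look like; it is my own simplification, not the paper’s exact model or tokenization. Reading code at the character level sidesteps both language-specific tokenizers and open-ended identifier vocabularies, and a recurrent decoder emits the summary one word at a time.

```python
import torch
import torch.nn as nn

class CodeSummarizer(nn.Module):
    def __init__(self, n_chars=128, n_words=10_000, dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.encoder = nn.Sequential(            # character-level convolutions over code
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.word_emb = nn.Embedding(n_words, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_words)

    def forward(self, code_chars, summary_words):
        # code_chars: (batch, chars_in_snippet)   summary_words: (batch, length)
        x = self.char_emb(code_chars).transpose(1, 2)   # (batch, dim, chars)
        enc = self.encoder(x).mean(dim=2)               # pooled code representation
        h0 = enc.unsqueeze(0)                           # initialize decoder with code vector
        dec, _ = self.decoder(self.word_emb(summary_words), h0)
        return self.out(dec)                            # summary-word logits per position

model = CodeSummarizer()
logits = model(torch.randint(0, 128, (2, 400)), torch.randint(0, 10_000, (2, 12)))
```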
Re-Ranking Words to Improve Interpretability of Automatically Generated Topics
Link to paper: https://arxiv.org/abs/1903.12542
Topic models are a major interest of mine, I think largely due to their underlying simplicity. How do you cluster words in meaningful ways?
In Latent Dirichlet Allocation (LDA), the model learns vector representations of documents as mixtures of “topics”, where each topic is a multinomial distribution over words. The underlying assumption is that each topic is a mixture of words and each document is a mixture of topics, with the topic proportions drawn from a Dirichlet distribution.
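As a toy illustration of the paper’s theme (my own example and ranking heuristic, not the authors’ re-ranking metrics): fit LDA with scikit-learn, then re-rank each topic’s top words by how specific they are to that topic rather than by raw probability, which demotes words that are frequent everywhere.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell as markets slid", "the market rallied on trade news"]

counts = CountVectorizer().fit(docs)
X = counts.transform(docs)
vocab = np.array(counts.get_feature_names_out())

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

for k, dist in enumerate(topic_word):
    default = vocab[np.argsort(dist)[::-1][:5]]        # rank by within-topic probability
    specificity = dist / topic_word.sum(axis=0)         # probability relative to all topics
    reranked = vocab[np.argsort(specificity)[::-1][:5]]
    print(f"topic {k}: {list(default)} -> {list(reranked)}")
```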
Distributed Vector Representations of Folksong Motifs
Link to paper: https://arxiv.org/abs/1903.06353
Following the distributional hypothesis in semantics, the goal of this research is to adopt the skip-gram version of the word2vec model for the distributional representation of melodic units.
I’m not an expert in music theory, but apparently there is some evidence that monophonic melodies adhere to the distributional hypothesis! That is, “you shall know a motif by the company it keeps (those motifs that are nearby it)”.
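A minimal sketch of how this plays out in code (the motif tokens below are made up for illustration; the paper’s motif encoding differs): treat each folksong as a “sentence” of motif labels and train gensim’s skip-gram word2vec on them, so motifs that occur in similar melodic contexts end up with nearby vectors.

```python
from gensim.models import Word2Vec

# Each inner list is one folksong, written as a sequence of (invented) motif labels.
songs = [
    ["+2+2-1", "+2-2+2", "-2-1+3", "+2+2-1"],
    ["+2-2+2", "+2+2-1", "-3+1+2"],
    ["-2-1+3", "-3+1+2", "+2-2+2"],
]

# sg=1 selects the skip-gram objective; window controls how much melodic context counts.
model = Word2Vec(songs, sg=1, vector_size=32, window=2, min_count=1, epochs=50)
print(model.wv.most_similar("+2+2-1", topn=2))   # motifs that keep similar company
```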