Link to paper: https://arxiv.org/abs/1904.00805

Hear me out: what if your code could comment itself? It’s the dream of many a software developer, myself included.

Building a deep NLP architecture for summarizing code comes with major challenges, including different syntaxes for each language (or even versions of the same language; think Python 2.7 vs 3.6) and enormous vocabularies (everyone has their own naming conventions for variables). Moore, Gelman & Slater introduce an interesting encoder-decoder translation architecture to deal with these issues.

Their encoder contains a few important steps. Given some method as input:

def hello_world():
    boringString = "hello, world!"
    return boringString

they encode each character with its own learned embedding. This gives them an open vocabulary, letting them overcome many of the aforementioned challenges, at the cost of a huge jump in model complexity.
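To make the character-embedding step concrete, here is a minimal sketch in PyTorch of what it might look like; the vocabulary size, embedding width, and variable names are my own assumptions, not values from the paper:

import torch
import torch.nn as nn

# Treat every character as a token: the "vocabulary" is just the 256 possible
# byte values, so no word-level vocabulary ever has to be built.
char_vocab_size = 256
embedding_dim = 64

char_embedding = nn.Embedding(char_vocab_size, embedding_dim)

source = 'def hello_world():\n    boringString = "hello, world!"\n    return boringString'
char_ids = torch.tensor([[ord(c) % char_vocab_size for c in source]])  # shape: (1, seq_len)
char_vectors = char_embedding(char_ids)                                # shape: (1, seq_len, 64)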

The encoder convolves these character-vector sequences with 1D convolutional filters of widths 1, 2, 3, 4, and 5, and sums their outputs “over time”. This is really a kind of sequential, or cumulative, pooling. While the convolutions learn useful combinations of characters, the pooling layer lets them generate fixed-length outputs for arbitrary-length inputs. The output of their encoder is termed a thought vector: a dense abstraction of the input code.
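Here is a rough sketch of how that multi-width convolution plus sum-over-time pooling could be wired up in PyTorch; the filter counts and dimensions are illustrative assumptions rather than the paper's actual hyperparameters:

import torch
import torch.nn as nn

class CharConvEncoder(nn.Module):
    def __init__(self, embedding_dim=64, filters_per_width=128, widths=(1, 2, 3, 4, 5)):
        super().__init__()
        # One 1D convolution per filter width, all reading the character embeddings.
        self.convs = nn.ModuleList(
            nn.Conv1d(embedding_dim, filters_per_width, kernel_size=w)
            for w in widths
        )

    def forward(self, char_vectors):              # (batch, seq_len, embedding_dim)
        x = char_vectors.transpose(1, 2)          # Conv1d expects (batch, channels, seq_len)
        # Summing over the time axis collapses any input length to a fixed size.
        pooled = [conv(x).sum(dim=2) for conv in self.convs]
        return torch.cat(pooled, dim=1)           # the "thought vector"

encoder = CharConvEncoder()
dummy_input = torch.randn(1, 73, 64)              # a 73-character method, already embedded
thought_vector = encoder(dummy_input)             # shape: (1, 5 * 128), regardless of input length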

The decoder is a more standard LSTM, generating a sequential output of characters that forms a summary of the input code. The authors include an important addition to the vanilla decoder architecture: to help the decoder recognize that it is spelling out a variable name, which could be camel-cased, underscored, or what have you, they introduce two special tokens that bound variable names, <BEGIN SPELL> and <END SPELL>.
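A hedged sketch of how those spell tokens might wrap identifiers in the target sequence; the helper below and its regex are my own illustration, not the authors' preprocessing code:

import re

BEGIN_SPELL, END_SPELL = "<BEGIN SPELL>", "<END SPELL>"

def mark_identifiers(summary, identifiers):
    """Surround known variable/method names with spell tokens, so the decoder
    can learn it is spelling out a name character by character."""
    def wrap(match):
        word = match.group(0)
        return f"{BEGIN_SPELL}{word}{END_SPELL}" if word in identifiers else word
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", wrap, summary)

print(mark_identifiers("returns boringString unchanged", {"boringString"}))
# -> returns <BEGIN SPELL>boringString<END SPELL> unchanged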

Drawing on the MUSE corpus, the authors gathered 16 million methods, along with their definitions, from Java, Python, and C++. After some light cleaning, they train their encoder-decoder model on all of the data at once, not just one language at a time.

For Java, where they have an existing study to compare against, they achieve state-of-the-art BLEU scores for their summaries. The model performs well on C++, and just OK on Python. Digging into the Python training data, the authors find that the grammar is poorer and misspellings are more rife than in the other languages! In any case, theirs are the first reported baseline results for summarizing C++ and Python code, and they are enthusiastic about the model as it stands, suggesting the next step of integrating it into existing IDEs.

While the results are exciting, I’m not convinced the authors have done enough digging to know whether their results reflect genuine prediction or simple memorization. I worry about large overlaps in methods between the training and testing sets – pursuing something along the lines of bloom embeddings to compare the two may be worthwhile.
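As one quick, purely illustrative sanity check (my own suggestion, not something the paper reports), exact-duplicate overlap between the splits could be measured by hashing normalized method bodies:

import hashlib

def method_fingerprints(methods):
    """Hash each method's whitespace-normalized source for cheap set comparison."""
    return {hashlib.sha1(" ".join(m.split()).encode()).hexdigest() for m in methods}

train_methods = ['def hello_world():\n    return "hello, world!"']   # placeholder data
test_methods = ['def hello_world():\n    return "hello, world!"']    # placeholder data

overlap = method_fingerprints(train_methods) & method_fingerprints(test_methods)
print(f"{len(overlap)} exact duplicates shared between train and test")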

I do love the interesting mix of embeddings, convolutions and LSTMs in the model architecture. I’d like to see the same creative mindset exercised in encoder-decoder architectural design in the future!