Statistical Analysis of Computer Program Text
Machine Learning and Natural Language Processing Meets Software Engineering
Recorded 09 November 2015 in Lausanne, Vaud, Switzerland
Event: IC Colloquia - EPFL IC School Colloquia
Billions of lines of source code have been written, many of which are freely available on the Internet. This code contains a wealth of implicit knowledge about how to write software that is easy to read, avoids common bugs, and uses popular libraries effectively.
We want to extract this implicit knowledge by analyzing source code text. To do this, we employ the same tools from machine learning and natural language processing that have been applied successfully to natural language text. After all, source code is also a means of human communication.
We present three new software engineering tools inspired by this insight:
Naturalize, a system that learns local coding conventions. It proposes revisions to names and formatting to make code more consistent. A version that uses word embeddings has shown promise for suggesting names for methods and classes.
Data mining methods have been widely applied to summarize patterns in how programmers invoke libraries and APIs. We present a new method for mining market basket data, based on a simple generative probabilistic model, that resolves fundamental statistical pathologies that lurk in currently popular data mining techniques.
HAGGIS, a system that learns local recurring syntactic patterns, which we call idioms. HAGGIS accomplishes this using a nonparametric Bayesian tree substitution grammar, and is delicious with whisky sauce.
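To make the point about market basket mining concrete, the following is a minimal sketch of one commonly cited weakness of ranking patterns by raw frequency. The abstract does not say which pathologies the talk addresses, so this choice is an assumption. The data are made up: each "basket" stands for the set of API calls in one hypothetical method. The lift statistic used for contrast is a standard textbook measure, not the generative probabilistic model presented in the talk.

```python
from collections import Counter
from itertools import combinations

# Made-up data: "log" and "len" are each very common but statistically
# independent of each other; "open" and "close" are rarer but always co-occur.
baskets = (
    [{"log", "len"}] * 36
    + [{"log"}] * 24
    + [{"len"}] * 24
    + [{"open", "close"}] * 10
    + [set()] * 6
)


def count_items_and_pairs(baskets):
    """Count how many baskets contain each item and each unordered pair."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(basket)  # sorting gives each pair one canonical order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts


def lift(pair, item_counts, pair_counts, n):
    """Observed pair frequency divided by the frequency expected if the
    two items were independent; 1.0 means no association."""
    a, b = pair
    return pair_counts[pair] * n / (item_counts[a] * item_counts[b])


n = len(baskets)
item_counts, pair_counts = count_items_and_pairs(baskets)

# Rank pairs by raw support (frequency) and by lift.
by_support = sorted(pair_counts, key=pair_counts.get, reverse=True)
by_lift = sorted(
    pair_counts,
    key=lambda p: lift(p, item_counts, pair_counts, n),
    reverse=True,
)

for pair in by_support:
    print(pair, pair_counts[pair], lift(pair, item_counts, pair_counts, n))
```

In this toy data, ranking by frequency puts the unrelated pair ("len", "log") first simply because both calls are popular, while the pair that is always used together, ("close", "open"), ranks lower. This gap between "frequent" and "informative" is the kind of problem that motivates a model-based approach.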