I completed my PhD as a part-time student in the Computing & Communications Department at the Open University from 2008 to 2015. My dissertation describes the development of improved techniques for the analysis of Java identifier names and their application to investigate the structure and content of class, field, parameter and variable names.
My research interests focus on identifier naming, particularly the way identifier names are used in practice. I am interested in what this implies for the design and implementation of tools that analyse identifier names in source code and support software engineers in program comprehension and source code maintenance. I am also interested in the implementation of approaches to analysing names, including the application of lexical analysis and natural language processing.
My recent research has investigated the identifier names found in a corpus of sixty open source Java projects through the development of new techniques for the tokenisation of names, and the application of natural language processing methods. Details of the projects included in the corpus are provided to facilitate replication of my results. The research was conducted using a collection of software tools that mine and analyse Java identifier names, some of which are open source.
simon [at] facetus [dot] org [dot] uk
To support my identifier naming research I developed some software tools that are components of a source code fact extraction and analysis process. The principal tool is a Java identifier name mining tool, which gathers identifier names and metadata from source code and records them in an RDBMS. Some older versions of the software are available for download in binary form, while current versions of the projects are open source.
JIM is a Java Identifier Miner. JIM is written in Java and uses parsers created using the JavaCC parser generator and ANTLR to extract identifier names from Java source code and to collect metadata. The identifier names are then tokenised before being stored, with metadata, in an Apache Derby database.
A key aspect of the identifier name miner's functionality is the tokenisation of identifier names. The tokeniser is available as a Java library named INTT (identifier name tokenisation tool). The techniques used in INTT are described in a paper presented at ECOOP 2011.
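To illustrate what tokenising an identifier name involves, the following is a minimal sketch of conventional boundary splitting: it splits on underscores, dollar signs and camel-case transitions, and lower-cases the resulting tokens. It is not INTT's algorithm or API; the class name NaiveNameTokeniser and the method tokenise are hypothetical, and the sketch leaves digits and single-case names unsplit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is not the INTT library.
final class NaiveNameTokeniser {

    // Splits a name on '_' and '$' separators and on camel-case boundaries.
    static List<String> tokenise(String name) {
        List<String> tokens = new ArrayList<>();
        for (String part : name.split("[_$]+")) {
            if (part.isEmpty()) {
                continue; // produced by a leading separator, e.g. "_count"
            }
            int start = 0;
            for (int i = 1; i < part.length(); i++) {
                char prev = part.charAt(i - 1);
                char cur = part.charAt(i);
                // Boundary such as "parse|HTML": lower case followed by upper case.
                boolean lowerToUpper =
                        Character.isLowerCase(prev) && Character.isUpperCase(cur);
                // Boundary such as "HTML|Document": the last capital of a run
                // of capitals starts the next word.
                boolean acronymEnd = Character.isUpperCase(prev)
                        && Character.isUpperCase(cur)
                        && i + 1 < part.length()
                        && Character.isLowerCase(part.charAt(i + 1));
                if (lowerToUpper || acronymEnd) {
                    tokens.add(part.substring(start, i).toLowerCase(Locale.ROOT));
                    start = i;
                }
            }
            tokens.add(part.substring(start).toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenise("parseHTMLDocument"));
        System.out.println(tokenise("MAX_VALUE"));
    }
}
```

Names that use digits, unconventional capitalisation or no separators at all are exactly the cases where such a simple splitter gives up, which is why more careful tokenisation techniques are needed.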
MDSC is a spell checking library intended to be used with identifier names.
Nominal is a Java library to evaluate the compliance of identifier names with naming conventions. Nominal includes a simple language for defining identifier naming conventions.
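As a rough illustration of what checking names against conventions can look like, the sketch below maps a few kinds of identifier to regular expressions based on the common Java typographical conventions. It does not use Nominal's API or its convention language, neither of which is shown here; the class ConventionSketch, the enum Kind and the rules themselves are assumptions made for the example.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the Nominal library.
final class ConventionSketch {

    enum Kind { CLASS, FIELD, CONSTANT }

    // Assumed rules: UpperCamelCase classes, lowerCamelCase fields,
    // UPPER_CASE constants with single underscores between words.
    private static final Map<Kind, Pattern> RULES = new EnumMap<>(Kind.class);
    static {
        RULES.put(Kind.CLASS, Pattern.compile("[A-Z][A-Za-z0-9]*"));
        RULES.put(Kind.FIELD, Pattern.compile("[a-z][A-Za-z0-9]*"));
        RULES.put(Kind.CONSTANT, Pattern.compile("[A-Z][A-Z0-9]*(_[A-Z0-9]+)*"));
    }

    // Returns true when the whole name matches the rule for its kind.
    static boolean complies(Kind kind, String name) {
        return RULES.get(kind).matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(complies(Kind.CLASS, "HashMap"));
        System.out.println(complies(Kind.FIELD, "HashMap"));
        System.out.println(complies(Kind.CONSTANT, "MAX_VALUE"));
    }
}
```

Regular expressions only capture typography; conventions about the words or phrase structure of a name need the kind of tokenisation and language analysis described above.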
Our investigations of identifier names are conducted using a corpus of sixty open source Java projects. The projects were selected from a range of application types and domains. The intention is to allow us to analyse the entire corpus, and to focus on specific domains or application types as required.
We give details of the projects below to allow others to replicate our results.
|Project||Version||Desktop Application||Project Management||Language Tool||Programmer Tool||Programming Language||IDE||SDK||Library||Server||Framework/
|Google Web Toolkit||2.0.4||*|
|Java 6 library||6u20-b02||*|
|Jython||2.5.2 beta 1 (svn r7109)||*|
Other researchers have created their own corpora for empirical research. The criteria for project selection used to create the corpora vary and not all are publicly available, though corpora are often specified in individual papers. Of particular interest is the Qualitas Corpus curated by Ewan Tempero. The November 2010 version of the Qualitas Corpus consists of 585 versions of 106 Java systems.