Simon Butler

I am currently a postdoctoral researcher in the Informatics Department at the University of Skövde where I work on the LIM-IT project.


I completed my PhD as a part-time student in the Computing & Communications Department at the Open University from 2008 to 2015. My dissertation describes the development of improved techniques for the analysis of Java identifier names and their application to investigate the structure and content of class, field, parameter and variable names.

My research interests focus on identifier naming, particularly how identifier names are used in practice and what this implies for the design and implementation of tools that analyse identifier names in source code and support software engineers in program comprehension and source code maintenance. I am also interested in approaches to analysing names, including the application of lexical analysis and natural language processing.

My recent research has investigated the identifier names found in a corpus of sixty open source Java projects through the development of new techniques for the tokenisation of names, and the application of natural language processing methods. Details of the projects included in the corpus are provided to facilitate replication of my results. The research was conducted using a collection of software tools that mine and analyse Java identifier names, some of which are open source.

My doctoral research was supervised by Michel Wermelinger, Yijun Yu and Helen Sharp.

Contact Details

simon [at] facetus [dot] org [dot] uk



Simon Butler, Michel Wermelinger, and Yijun Yu ‘A Survey of the Forms of Java Reference Names’, Proceedings of the 23rd International Conference on Program Comprehension (ICPC 2015), May 18–19 2015, Firenze, Italy. [pdf] [BibTeX]

Simon Butler, Michel Wermelinger, and Yijun Yu ‘Investigating Naming Convention Adherence in Java Reference Names’, Proceedings of the 31st International Conference on Software Maintenance and Evolution (ICSME 2015), Sep 29–Oct 1 2015, Bremen, Germany. [pdf] [BibTeX]


Simon Butler, Michel Wermelinger, Yijun Yu, and Helen Sharp ‘INVocD: Identifier Name Vocabulary Dataset’, Proceedings of the 10th Working Conference on Mining Software Repositories (MSR 2013), May 18–19 2013, San Francisco, CA, USA. [pdf] [BibTeX]


Simon Butler ‘Mining Java Class Identifier Naming Conventions’, Proceedings of the 34th International Conference on Software Engineering (ICSE 2012), June 2–9 2012, Zürich, Switzerland. [BibTeX] [poster]


Simon Butler, Michel Wermelinger, Yijun Yu, and Helen Sharp ‘Mining Java Class Naming Conventions’, Proceedings of the 27th IEEE International Conference on Software Maintenance (ICSM 2011), September 25–30 2011, Williamsburg, VA, USA. [pdf] [BibTeX] [doi] [Slides]

Simon Butler, Michel Wermelinger, Yijun Yu, and Helen Sharp ‘Improving the tokenisation of identifier names’, Proceedings of the 25th European Conference on Object-Oriented Programming, LNCS 6813, Springer Berlin/Heidelberg, 25–29 Jul 2011, Lancaster, UK. [pdf] [BibTeX] [doi]


Simon Butler, Michel Wermelinger, Yijun Yu, and Helen Sharp ‘Exploring the influence of identifier names on code quality: An empirical study’, Proceedings of the 14th European Conference on Software Maintenance and Reengineering, 15–18 March 2010, Madrid, Spain. [pdf] [BibTeX] [doi]


Simon Butler, Michel Wermelinger, Yijun Yu, and Helen Sharp ‘Relating identifier naming flaws and code quality: An empirical study’, Proceedings of the 16th Working Conference on Reverse Engineering, 13–16 October 2009, Lille, France. [pdf] [BibTeX] [doi]

Simon Butler ‘The effect of identifier naming on source code readability and quality’, Proceedings of the ESEC/FSE Doctoral Symposium 2009, 25 August 2009, Amsterdam, Netherlands. [BibTeX] [doi]


To support my identifier naming research I developed software tools that are components of a source code fact extraction and analysis process. The principal tool is a Java identifier name mining tool, which gathers identifier names and metadata from source code and records them in an RDBMS. Some older versions of the software are available for download in binary form, while current versions of the projects are open source.

Identifier Name Miner

JIM is a Java Identifier Miner. It is written in Java and uses parsers generated with the JavaCC parser generator and ANTLR to extract identifier names from Java source code and to collect metadata. The identifier names are then tokenised before being stored, with metadata, in an Apache Derby database.

Older versions of JIM are available for download in binary form, and the source code is available on GitHub.

Identifier Name Tokeniser

A key aspect of the identifier name miner's functionality is the tokenisation of identifier names. The tokeniser is available as a Java library named INTT (identifier name tokenisation tool). The techniques used in INTT are described in a paper presented at ECOOP 2011.

Further information about INTT is available. The development version of INTT is open source software hosted on GitHub.
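To illustrate what tokenising an identifier name involves, the following is a minimal sketch of a conventional splitter that breaks names on separator characters and simple case boundaries. It is not INTT's API or algorithm: the class name `SimpleNameSplitter` and its `split` method are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; this is not part of INTT.
final class SimpleNameSplitter {

    private SimpleNameSplitter() {}

    /** Splits a name on underscores, dollar signs and simple case boundaries. */
    static List<String> split(String name) {
        List<String> tokens = new ArrayList<>();
        // Separator characters are treated as hard boundaries.
        for (String part : name.split("[_$]+")) {
            if (part.isEmpty()) {
                continue; // a leading separator yields an empty first part
            }
            // Boundary 1: lower case letter or digit followed by an upper case letter (fooBar).
            // Boundary 2: end of an upper case run followed by a capitalised word (XMLFile).
            String boundary = "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";
            for (String token : part.split(boundary)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

With this sketch, `SimpleNameSplitter.split("parseXMLFile")` yields `parse`, `XML` and `File`, and `SimpleNameSplitter.split("MAX_VALUE")` yields `MAX` and `VALUE`. A name with no separators or case changes is returned as a single token, which is one reason tokenising real-world names needs more than a case-boundary rule.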


MDSC is a spell checking library intended to be used with identifier names.


Nominal is a Java library to evaluate the compliance of identifier names with naming conventions. Nominal includes a simple language for defining identifier naming conventions.
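As a rough illustration of what checking names against a convention can look like, the sketch below tests the typography of a name against three widely used Java conventions. It does not use Nominal's API or its convention definition language: the enum `NamingConvention`, its constants and the `accepts` method are hypothetical, and the regular expressions are simplifying assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Nominal's API or convention language.
enum NamingConvention {
    // Class names: begin with an upper case letter, e.g. HttpRequest.
    CLASS("[A-Z][A-Za-z0-9]*"),
    // Fields, parameters and local variables: begin with a lower case letter, e.g. lineCount.
    VARIABLE("[a-z][A-Za-z0-9]*"),
    // Constants: upper case words separated by single underscores, e.g. MAX_VALUE.
    CONSTANT("[A-Z][A-Z0-9]*(_[A-Z0-9]+)*");

    private final Pattern pattern;

    NamingConvention(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** Returns true when the whole name matches this convention's typography. */
    boolean accepts(String name) {
        return pattern.matcher(name).matches();
    }
}
```

For example, `NamingConvention.VARIABLE.accepts("lineCount")` is true while `NamingConvention.VARIABLE.accepts("line_count")` is false. The check is purely typographical: it says nothing about the words used in a name or their grammatical structure.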


Our investigations of identifier names are conducted using a corpus of sixty open source Java projects. The projects were selected from a range of application types and domains. The intention is to allow us to analyse the entire corpus, and to focus on specific domains or application types as required.

We give details of the projects below to allow others to replicate our results.

Project Version Desktop Application Project Management Language Tool Programmer Tool Programming Language IDE SDK Library Server Framework/
AirTODO 1.27 final * *
Ant 1.7.1 *
ANTLR 3.2 *
ArgoUML 0.30.2 *
ASM 3.3 *
AspectJ 1.6.9 *
Azureus (Vuze) 4.5.02 *
BCEL 5.2 *
Beanshell 2.0b4 *
BlueJ 3.0.2 * *
BORG Calendar 1.7.3 * *
Cactus 1.8.1 *
Cobertura *
Derby * *
Eclipse 3.6.0 * * *
EGantt 0.5.3a * *
FindBugs 1.3.9 * *
Freemind 0.9.0RC9 *
GanttProject 2.0.10 * *
Geronimo 3.0-M1 *
Google Web Toolkit 2.0.4 *
Greenfoot 1.5.6 *
Groovy 1.7.4 *
Hibernate 3.5.5-final *
JabRef 2.6 *
JasperReports 3.7.4 *
Java 6 library 6u20-b02 *
Javacc 5.0 *
JBoss 6.0.0 M4 *
JDK 6u21 fcs *
jEdit 4.3.2 * *
Jetty 7.2.0 *
JFreeChart 1.0.13 *
Jin *
JRuby 1.5.2 *
JUnit 4 4.8.2 * *
Jython 2.5.2 beta 1 (svn r7109) *
Kawa 1.10 *
Log4J 1.2.16 *
Lucene 3.0.2 *
Maven 2.2.1 *
Memoranda 1.0-rc3.1 * *
MindRaider 7.6 *
MPXJ 4.0.0 * *
MultiJava 1.3.2 *
NetBeans 6.9.1 * *
OpenProj 1.4 * *
Polyglot 1.3.5 *
Rapla 1.3.2 * *
Rhino 1.7R2 *
Spring 3.0.3 *
Stripes 1.5.3 *
Struts 2.2.1 *
Tapestry *
Tomcat 6.0.29 *
Velocity 1.6.4 *
Wicket 1.4.10 *
Xalan-Java 2.7.1 *
Xerces-j 2.10.0 *
XOM 1.2.6 *

Alternative Corpora

Other researchers have created their own corpora for empirical research. The criteria for project selection used to create the corpora vary and not all are publicly available, though corpora are often specified in individual papers. Of particular interest is the Qualitas Corpus curated by Ewan Tempero. The November 2010 version of the Qualitas Corpus consists of 585 versions of 106 Java systems.