Semester

Summer

Date of Graduation

2014

Document Type

Dissertation

Degree Type

PhD

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Bojan Cukic

Committee Co-Chair

Donald Adjeroh

Committee Member

Mark Culp

Committee Member

Katerina Goseva-Popstojanova

Committee Member

Tim Menzies

Abstract

Issue tracking systems play a critical role in the management and maintenance of software. Developers and users can submit reports describing the problems they observe. A human triager must read each newly submitted report and determine whether it describes an unreported issue (Primary) or an existing issue (Duplicate). If the report is deemed to be of poor quality (Incomplete), to describe an issue with different software (Invalid), to be irreproducible (Worksforme), or to be beyond the scope of the project (Wontfix), it is annotated with the appropriate label. If the report is deemed to describe a new problem, it is assigned to a developer to work on a solution. When the report is a duplicate, it is assigned the report number associated with the original problem report.

In typical large-scale software systems, several hundred problem reports are submitted daily. The triager thus faces a daunting task in ensuring that problem reports are quickly annotated with the correct status and, if necessary, assigned to a developer. Given the effort required to triage a problem report, it is desirable to develop automated systems that assist the triager. Existing research in the field has failed to address the problem of automatically determining whether a report is Primary or Duplicate. Many efforts have used methodologies that do not scale to real-world repositories, or have created artificial datasets that do not adequately model the dynamics of existing repositories.

In this research, we present a fully automated framework that uses multiple document similarity measures, summary statistics describing each report, and user behavior attributes to determine whether the problem at hand is new or a duplicate. The framework makes as few assumptions as possible about the data in order to reflect the dynamics of a repository. If a problem is deemed to be a duplicate, a multi-label classification framework is applied to select the 20 most likely original reports.

To assess the feasibility of the framework, we validate the approach on three large-scale datasets from Eclipse (363,770 problem reports), OpenOffice (124,476 problem reports) and Firefox (111,205 problem reports). Our results show that document similarity, user and specific attributes can be employed to differentiate between primary and duplicate reports, with an in-class recall of around 70%. Furthermore, we show that while no silver-bullet approach exists for determining the correct primary for a duplicate report, a fusion scheme relying on multi-label classification can be used to effectively classify duplicates using simple document classification techniques. Unlike existing research, our results scale to the full size of the large-scale datasets.
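As a rough illustration of the candidate-selection step described above, the sketch below ranks existing reports against a new report and keeps the top 20. It is not the dissertation's method: the abstract describes a fusion of several similarity measures with multi-label classification, whereas this sketch substitutes a single TF-IDF cosine similarity for simplicity. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `top_k_candidates`), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    return text.lower().split()


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    # Smoothed IDF (an assumption; other weightings are equally plausible).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: f * idf[t] for t, f in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k_candidates(new_report, existing, k=20):
    # existing maps report id -> report text.
    # Returns up to k (report_id, score) pairs, most similar first.
    ids = list(existing)
    vectors = tfidf_vectors([existing[i] for i in ids] + [new_report])
    query = vectors[-1]
    scored = [(cosine(query, v), i) for i, v in zip(ids, vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, s) for s, i in scored[:k]]
```

A triager-facing tool could show the returned list as suggested originals for a report already flagged as a duplicate. The default `k=20` mirrors the list size mentioned in the abstract.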
