Semester

Summer

Date of Graduation

2013

Document Type

Thesis

Degree Type

MS

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Timothy Menzies

Abstract

Data mining has been successfully applied in recent decades to automatically discover valuable hidden patterns in vast data. However, current data mining approaches largely exclude business users from the process of knowledge discovery. Most learning algorithms operate as "black boxes" that give predictions based on some input without any hint of how they reach their decisions. Moreover, their output is often difficult for business users to interpret or understand.

This thesis explores how to facilitate users' engagement in the process of data analysis. We present PEEKING2, which combines a set of data reduction and data transformation techniques to create succinct summaries from raw data. In this way, users can "peek" at a small, representative data set and reason over it. After removing uninformative features, PEEKING2 clusters the data by combining FASTMAP projection and grid clustering. A condensed summary of the data is then formed from the centroids of the resulting clusters. Finally, PEEKING2 extrapolates between centroids to predict the class of new instances.

PEEKING2 has been tested on software engineering data for software defect prediction and development effort estimation. Specifically, we applied PEEKING2 to 10 defect data sets and 10 effort data sets from the PROMISE repository. PEEKING2 could reduce large data sets of 800+ rows and 20+ columns to just 10-30 rows and fewer than 6 columns.

To assess its predictive ability, we compared PEEKING2 to more elaborate learners. For defect prediction, PEEKING2 performed almost as well as or better than Naive Bayes and Random Forest on most of the data sets. Similarly, when applied to effort estimation data, PEEKING2 outperformed or matched Linear Regression and M5P in the majority of cases.

These results show that it is possible to "peek" at the data without losing significant information. Consequently, we recommend PEEKING2 as a data summarization tool to assist managers and software engineers in their analysis of project data.
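
The clustering step described in the abstract (a FASTMAP-style projection followed by grid clustering and centroid extraction) can be illustrated with a minimal sketch. This is not the PEEKING2 implementation from the thesis: the function names, the two-dimensional projection, the equal-width grid, and the `bins` parameter are all assumptions made for illustration, and rows are assumed to be numeric tuples with uninformative features already removed.

```python
import math
import random
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def project_2d(rows, seed=0):
    """FastMap-style projection (illustrative, not the thesis code).

    Pick two distant pivots, then place every row on the line between
    them (x, via the cosine rule) and at its offset from that line
    (y, via Pythagoras).
    """
    rng = random.Random(seed)
    start = rng.choice(rows)
    east = max(rows, key=lambda r: dist(start, r))  # far from a random row
    west = max(rows, key=lambda r: dist(east, r))   # far from the first pivot
    c = dist(east, west)
    points = []
    for r in rows:
        a, b = dist(east, r), dist(west, r)
        x = (a * a + c * c - b * b) / (2 * c) if c else 0.0
        y = math.sqrt(max(a * a - x * x, 0.0))
        points.append((x, y))
    return points


def grid_centroids(rows, bins=2, seed=0):
    """Bucket projected rows into a bins x bins grid; return one centroid per
    non-empty cell, computed in the original feature space."""
    points = project_2d(rows, seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def cell(v, lo, hi):
        # Equal-width binning; the maximum value falls into the last bin.
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    cells = defaultdict(list)
    for row, (x, y) in zip(rows, points):
        key = (cell(x, min(xs), max(xs)), cell(y, min(ys), max(ys)))
        cells[key].append(row)
    return [
        tuple(sum(col) / len(members) for col in zip(*members))
        for members in cells.values()
    ]


# Hypothetical toy data: two tight groups of four rows each.
demo_rows = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11)]
summary = grid_centroids(demo_rows, bins=2)
```

Here `summary` holds at most `bins * bins` centroid rows, which stand in for the original data in the way the condensed summary is described above. Predicting the class of a new instance by extrapolating between centroids is not shown.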
