Semester

Summer

Date of Graduation

2019

Document Type

Thesis

Degree Type

MS

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Katerina Goseva-Popstojanova

Committee Co-Chair

Roy Nutter

Committee Member

Roy Nutter

Committee Member

Matthew Valenti

Abstract

As the number of software vulnerabilities and cybersecurity threats increases, classifying bug reports manually is becoming more difficult and time-consuming. This thesis explores techniques that have the potential to improve the automated classification of software bug reports as security-related or non-security-related. For supervised learning, feature selection was used to engineer new feature vectors for the machine learning classifiers. Feature selection changes the vocabulary by keeping the words with the greatest impact on classification. It increased the F-score across the datasets by increasing precision. We also explored unsupervised classification based on clustering. A distribution of software issues was created using variational autoencoders, in which the majority of security-related issues were closely related. However, a portion of non-security issues also ended up in the distribution. Furthermore, we explored recent advances in text mining classification based on deep learning. Specifically, we used recurrent networks for supervised and semi-supervised classification. LSTM networks outperformed the Naive Bayes classifier in projects with a high ratio of security-related issues. Sequence autoencoders were trained on unlabeled data and tuned with labeled data. The results showed that using unlabeled software issues that differ from the testing datasets degraded the results. Sequence autoencoders may be used on large datasets where labeled data is scarce.

Embargo Reason

Publication Pending
