Author

Kelly Cecil

Date of Graduation

2017

Document Type

Thesis

Degree Type

MS

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Elaine Eschen

Committee Co-Chair

Alan Barnes

Committee Member

Roy Nutter

Abstract

We examine two proposed indexing algorithms taking advantage of the new SSMinT libraries. The two algorithms primarily differ in their selection of documents for learning. The batch indexing method selects some random number of documents for learning. The iterative indexing method uses a single randomly selected document to discover semantic signatures, which are then used to find additional related documents. The batch indexing method discovers one to three semantic signatures per document, resulting in poor clustering performance as evaluated by human cross-validation of clusters using the Adjusted Rand Index. The iterative indexing method discovers more semantic signatures per document, resulting in far better clustering performance using the same cross-validation method.

Our new tools enable faster development of new experiments, forensic applications, and more. The experiments show that SSMinT can provide effective indexing for text data such as e-mail or web pages. We conclude with areas of future research which may benefit from utilizing SSMinT. (Abstract shortened by ProQuest.)
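
For readers unfamiliar with the Adjusted Rand Index named in the abstract, the following is a minimal sketch of the standard pair-counting formula in plain Python. It is not taken from the thesis or from SSMinT; the function name `adjusted_rand_index` and the example labels are illustrative assumptions. The score compares two labelings of the same documents (for example, an algorithm's clusters against a human grouping): 1.0 means the partitions agree exactly, and values near 0 mean agreement is no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standard Adjusted Rand Index between two labelings of the same items.

    Illustrative helper (hypothetical name), not code from the thesis.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the partitions trivially agree

    # Contingency counts: how many items share each (label_a, label_b) pair.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Number of item pairs placed together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_a = sum(comb(c, 2) for c in a_counts.values())
    together_b = sum(comb(c, 2) for c in b_counts.values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. everything in a single cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical example: algorithm clusters versus a human grouping.
algorithm = [0, 0, 1, 1, 2, 2]
human = ["a", "a", "b", "b", "c", "c"]
print(adjusted_rand_index(algorithm, human))
```

Cluster label values themselves do not matter, only which items are grouped together, so the two labelings above are treated as identical partitions.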
