Date of Graduation

2014

Document Type

Thesis

Degree Type

MS

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Elaine M Eschen

Committee Co-Chair

Alan V Barnes

Committee Member

Vinod Kulathumani

Abstract

Semantic Signature Mining Tool (SSMinT) is a suite of software tools that helps a data analyst develop semantic signatures that capture targeted content, and then uses these signatures to categorize text documents with unknown content or to retrieve documents of a specific type or interest. SSMinT was developed by Barnes, Eschen, Para, and Peddada in 2010. These tools require expert input. An automated version of the SSMinT software package was developed by Barnes, Eschen, and Kota in 2011 with the aim of reducing manual input and using machine learning techniques to discover semantic signatures. Its key features include automated keyword group generation, automated generation of candidate semantic signatures, and methods to prune redundant relevant semantic signatures from the semantic signature set. Human input is required only when choosing the training documents.

This thesis develops an enhanced version of the Automated Semantic Signature Mining Tool that increases the scope for capturing semantic content from the training documents. In particular, it addresses problems with analyzing very short documents. Improvements to the tools minimize the number of unnecessary keyword groups in the early stages of the learning phase and thereby maximize the number of significant semantic signatures generated in the later stages. As a result, a larger number of documents similar to the training documents are retrieved. The resulting fine-tuned semantic signatures also yield effective categorization of text documents into groups with closely related content. Tools are developed to automate the tedious process of measuring document retrieval rates. A statistical method is also employed to estimate the precision of document retrieval.
