Semester
Fall
Date of Graduation
2011
Document Type
Thesis
Degree Type
MS
College
Statler College of Engineering and Mineral Resources
Department
Lane Department of Computer Science and Electrical Engineering
Committee Chair
Elaine M Eschen
Abstract
Text mining refers to the process of extracting information from text. Massive amounts of data are available today due to enhanced data collection capabilities, inexpensive high-capacity storage, and the proliferation of World Wide Web pages. A substantial portion of this data is in text format. The main goal of text data mining software tools is to help us learn and benefit from this wealth of text data. Because humans cannot cope with this volume of text unaided, the information in text data needs to be filtered, summarized, analyzed, and refined for human analysts.
The concept of a semantic signature is that semantic content in text has characteristic word patterns, such as frequency of words and proximity between words, which can be identified and quantified. A type of quantitative semantic signature was developed by Barnes, Eschen, Para, and Peddada in 2010. This group demonstrated the utility and sensitivity of semantic signatures of this type in capturing semantic content in text data by developing a software package named Semantic Signature Mining Tool (SSMinT). SSMinT is a suite of software tools that help a data analyst develop semantic signatures that capture targeted content and then use these semantic signatures either to categorize a corpus of text documents with unknown content or to retrieve text documents with the targeted content from a corpus of documents with arbitrary content.
Key features of SSMinT are the expert input from the human analyst and the interaction between the analyst and the software; the tool is designed to assist the analyst and does not work independently. This is a strength in that the resulting semantic signatures are tailored by the analyst's expert knowledge of the domain. Barnes, Eschen, Para, and Peddada demonstrated this to be a powerful approach to text data mining.
This thesis develops an automated version of the SSMinT software package that requires minimal input from an analyst. The work includes an automated keyword group generation and refinement algorithm, automated generation of candidate semantic signatures, and methods to prune semantic signatures that are irrelevant, or relevant but redundant, from the semantic signature set. Relieving the analyst of the tedious and time-consuming task of developing semantic signatures is not the only motivation for an automated tool. The automation is designed to discover semantic signatures in text data without human input, except for the choice of training documents. The advantage of automated semantic signature discovery is the ability to identify patterns that an analyst may not recognize because of the large volume of data or the analyst's own point-of-view bias. The effectiveness of Automated SSMinT in categorizing text documents into groups with closely related content, and in retrieving documents with content similar to those in its training set, is demonstrated in experiments on various corpora. These experiments show Automated SSMinT to be an efficient, convenient, and powerful text mining tool.
Recommended Citation
Kota, Rukmini Ravali, "Automated Discovery of Relevant Features for Text Mining" (2011). Graduate Theses, Dissertations, and Problem Reports. 4742.
https://researchrepository.wvu.edu/etd/4742