Semester

Fall

Date of Graduation

2010

Document Type

Thesis

Degree Type

MS

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Elaine M Eschen

Abstract

The rapid development of the Internet and the ability to store data relatively inexpensively have contributed to an information explosion that did not exist a few years ago. A few keystrokes in a search engine on any given subject now return more web pages than ever before. Because the amount of data available to us is so overwhelming, extracting relevant information from it remains a challenge.

Since 80% of the data stored worldwide is text, we need advanced techniques to process this textual data and extract useful information. Text mining is one process that addresses the information explosion problem; it employs techniques such as natural language processing, information retrieval, machine learning algorithms, and knowledge management. In text mining, the text under study undergoes a transformation in which its essential attributes are derived. The attributes that form interesting patterns are chosen, and machine learning algorithms are used to find similar patterns in the desired corpora. Finally, the resulting texts are evaluated and interpreted.

In this thesis we develop a new framework for the text mining process. An investigator chooses target content from training files, and this content is captured in semantic signatures. Semantic signatures characterize the target content, derived from training files, that we are looking for in testing files (whose content is unknown). The semantic signatures work as attributes to fetch and/or categorize the target content from a test corpus. A proof-of-concept software package, consisting of tools that aid an investigator in mining text data, is developed using Visual Studio, C#, and the .NET Framework.

Choosing keywords plays a major role in designing semantic signatures; careful selection of keywords leads to a more accurate analysis, especially in English, which is sensitive to semantics. Words can carry different meanings when they appear in different contexts. We have incorporated stemming within the framework, and its effectiveness is demonstrated using a large corpus. We have conducted experiments to demonstrate the sensitivity of semantic signatures to subtle content differences between closely related documents. These experiments show that the newly developed framework can identify subtle semantic differences to a substantial degree.
