Semester
Spring
Date of Graduation
2013
Document Type
Thesis
Degree Type
MS
College
Statler College of Engineering and Mineral Resources
Department
Lane Department of Computer Science and Electrical Engineering
Committee Chair
Elaine M Eschen
Committee Co-Chair
Alan V Barnes
Committee Member
Arun Ross
Abstract
The quantity of text added to the web in digital form continues to grow, and the quest for tools that can process this huge amount of data and retrieve the data of interest is ongoing. Moreover, observing these large volumes of data over a period of time is a tedious task for any human being. Text mining is very helpful in performing these kinds of tasks. Text mining is the process of observing patterns in text data using sophisticated statistical measures, both quantitatively and qualitatively. Using these text mining techniques and the power of the internet and its technologies, we have developed a tool that uses novel and sensitive classification techniques to retrieve documents concerning topics of interest.

This thesis presents an intelligent web crawler named Intel-Crawl. This tool identifies web pages of interest without the user's guidance or monitoring. Documents of interest are logged (by URL or file name). The package uses automatically generated semantic signatures to identify documents with content of interest. The tool also produces a vector that quantifies a document's content based on the semantic signatures. This provides a rich and sensitive characterization of the document's content. Documents are classified according to content and presented to the user for further analysis and investigation.

Intel-Crawl may be applied to any area of interest. It is likely to be very useful in areas such as law enforcement, intelligence gathering, and monitoring changes in web site contents over time. It is well suited for scrutinizing the web activity of a large collection of web pages with similar content. The utility of Intel-Crawl is demonstrated in various situations using different parameters and classification techniques.
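The abstract does not define how a semantic signature is built or scored, so the following is only a minimal sketch of the general idea of quantifying a document against a set of signatures. It assumes, purely for illustration, that a signature is a weighted set of terms and that a document's vector holds one normalized score per signature. The names `SIGNATURES`, `signature_vector`, and `is_of_interest`, the example terms, and the threshold are all hypothetical and are not taken from the thesis.

```python
import re
from collections import Counter

# Hypothetical signatures: topic -> {term: weight}. Illustrative values only;
# the thesis generates its signatures automatically by a method not shown here.
SIGNATURES = {
    "fraud": {"scam": 1.0, "phishing": 1.0, "fraud": 1.0},
    "weather": {"rain": 1.0, "storm": 1.0, "forecast": 1.0},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def signature_vector(text, signatures=SIGNATURES):
    """Return one score per signature: weighted term hits divided by token count."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return {
        topic: sum(weight * counts[term] for term, weight in terms.items()) / total
        for topic, terms in signatures.items()
    }


def is_of_interest(vector, threshold=0.1):
    """Flag a document if any signature score reaches the (assumed) threshold."""
    return any(score >= threshold for score in vector.values())


if __name__ == "__main__":
    page = "Phishing scam reported: fraud alert"
    vec = signature_vector(page)
    print(vec, is_of_interest(vec))
```

In a crawler built along these lines, each fetched page would be scored this way, and the URL logged only when `is_of_interest` returns true; the per-signature scores could then serve as features for whichever classifier is applied afterwards.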
Recommended Citation
Pinnamaneni, Lingaiah Choudary, "Intelligent Web Crawling using Semantic Signatures" (2013). Graduate Theses, Dissertations, and Problem Reports. 4989.
https://researchrepository.wvu.edu/etd/4989