Date of Graduation
Statler College of Engineering and Mineral Resources
Lane Department of Computer Science and Electrical Engineering
Elaine M Eschen
Alan V Barnes
The quantity of test that is added to the web in the digital form continues to grow and the quest for tools that can process this huge amount of data to retrieve the data of our interest is an ongoing process. Moreover, observing these large volumes of data over a period of time is a tedious task for any human being. Text mining is very helpful in performing these kinds of tasks. Text mining is a process of observing patterns in the text data using sophisticated statistical measures both quantitatively and qualitatively. Using these text mining techniques and the power of the internet and its technologies, we have developed a tool that retrieves documents concerning topics of interest, which utilizes novel and sensitive classification tools.;This thesis presents an intelligent web crawler, named Intel-Crawl. This tool identifies web pages of interest without the user's guidance or monitoring. Documents of interest are logged (by URL or file name). This package uses automatically generated semantic signatures to identify documents with content of interest. The tool also produces a vector that is a quantification of a document's content based on the semantic signatures. This provides a rich and sensitive characterization of the document's content. Documents are classified according to content and presented to the user for further analysis and investigation.;Intel-Crawl may be applied to any area of interest. It is likely to be very useful in areas such as law enforcement, intelligence gathering, and monitoring changes in web site contents over time. It is well-suited for scrutinizing the web activity of large collection of web pages pertaining to similar content. The utility of Intel-Crawl is demonstrated in various situations using different parameters and classification techniques.
Pinnamaneni, Lingaiah Choudary, "Intelligent Web Crawling using Semantic Signatures" (2013). Graduate Theses, Dissertations, and Problem Reports. 4989.