Statler College of Engineering and Mineral Resources


Lane Department of Computer Science and Electrical Engineering

Committee Chair

Katerina Goseva Popstojanova

Committee Co-Chair

Roy Nutter

Committee Member

Arun Ross


Attacks targeting Web system vulnerabilities have increased in recent years. A contributing factor is the deployment of Web 2.0 technologies. Because users can create their own content, Web 2.0 applications have become increasingly popular, which in turn has made them attractive targets for malicious attacks. Given these trends, there is a need to better understand and classify malicious cyber activities. The work presented in this thesis is based on malicious data collected by three high-interaction honeypots and organized into HTTP sessions, each characterized by 43 different features. The data were divided into multiple vulnerability scan and attack classes. Five batch supervised machine learning algorithms (J48, PART, Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and Naive Bayes Learner (NB)) and one stream semi-supervised algorithm (CSL-Stream) were used to study whether machine learning algorithms could distinguish between vulnerability scans and attacks, and also among eleven vulnerability scan classes and nine attack classes. The Information Gain feature selection method and three other feature selection methods were used to determine whether different attacks and vulnerability scans can be characterized by a small number of features (i.e., session characteristics). The results showed that supervised algorithms can be trained to distinguish among different classes of malicious traffic using only a small number of features. The stream semi-supervised algorithm classified the partially labeled data almost as well as the completely labeled data. Classification performance depended on the number of instances in each class, the distinctive features of each class, and the amount of concept drift. The supervised algorithms, however, were better at classifying the completely labeled data.
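
As a rough illustration of the Information Gain ranking mentioned in the abstract, the following minimal Python sketch scores discrete session features by how much they reduce the entropy of the class label. It is not the thesis's implementation: the feature names (`has_post`, `many_requests`), the two-class labels, and the four toy sessions are all hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy sessions; the real data set has 43 features per session.
sessions = [
    {"has_post": 1, "many_requests": 1, "label": "attack"},
    {"has_post": 1, "many_requests": 0, "label": "attack"},
    {"has_post": 0, "many_requests": 1, "label": "scan"},
    {"has_post": 0, "many_requests": 0, "label": "scan"},
]

labels = [s["label"] for s in sessions]
features = ["many_requests", "has_post"]

# Score every feature, then rank from most to least informative.
scores = {f: information_gain([s[f] for s in sessions], labels) for f in features}
ranked = sorted(features, key=lambda f: scores[f], reverse=True)
```

In this toy data, `has_post` separates the two classes perfectly while `many_requests` carries no information about the label, so `has_post` ranks first. Keeping only the top-ranked features is the general idea behind characterizing classes with a small number of session characteristics.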