Date of Graduation


Document Type


Degree Type



Statler College of Engineering and Mineral Resources


Lane Department of Computer Science and Electrical Engineering

Committee Chair

Afzel Noore

Committee Co-Chair

Yaser Fallah

Committee Member

Xin Li


The Support Vector Machine (SVM) is a popular, high-performing classification model with good generalization ability across many classification applications. The method uses kernels to classify data that are not linearly separable. However, the added complexity of the kernels that map data to a higher-dimensional space degrades SVM performance on large datasets. Moreover, choosing an appropriate kernel and finding the best set of its parameters for a given dataset is challenging, and failing at this step can easily lead to overfitting.

In this thesis we propose the Piece-wise Linear SVM (PWLSVM), which uses MagKmeans clustering to address the complexity and computational cost of SVMs. We use a linear SVM to avoid the complexity of dealing with kernels, and MagKmeans clustering to partition the data into balanced groups. MagKmeans, a supervised clustering technique, places an approximately equal number of samples from each class in every cluster. This ensures that the linear SVM trained on each cluster sees balanced training samples for both classes and can attain an accurate model.

The detailed mathematical formulation and modeling of the proposed Distributed MagKmeans (D-MagKmeans) is presented. The algorithm uses a distributed MagKmeans clustering approach to extend the PWLSVM to a Distributed Piece-wise Linear SVM (D-PWLSVM). D-MagKmeans makes MagKmeans clustering work over a distributed network by passing each node's cluster centroids only to its one-hop neighbors. This property makes our approach well suited to distributed processing and decision making while preserving privacy and minimizing communication overhead.

The proposed algorithm was validated on four datasets of varying feature dimensionality and sample size. Pima Indian Diabetes, with 768 samples and 8 features, is the smallest of the four.
We also examined Abalone, with 4,177 samples and 8 features; Waveform, with 5,000 samples and 22 features; and EHarmony, with over half a million samples and 116 features. The results reveal that a reasonable trade-off is required when dealing with a large dataset. They also show that PWLSVM and D-PWLSVM outperform standard SVMs on relatively large datasets such as Abalone, Waveform, and EHarmony.
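The piecewise-linear scheme described above can be sketched in a few lines of Python. This is an illustrative toy, not the thesis's implementation: plain k-means stands in for MagKmeans (its class-balance constraint is omitted for brevity), a sub-gradient hinge-loss solver stands in for a full linear SVM trainer, and all function names (`pwlsvm_fit`, `pwlsvm_predict`) are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means with deterministic seeding. The thesis's MagKmeans
    adds a class-balance term so each cluster holds roughly equal
    samples per class; that constraint is omitted in this sketch."""
    # Initialize centroids from evenly spaced sample points.
    C = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                C[j] = X[assign == j].mean(axis=0)
    return C, assign

def linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Primal linear SVM via sub-gradient descent on the regularized
    hinge loss; labels y must be in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(X)
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1          # margin violators
        w -= lr * (lam * w - (y[viol, None] * X[viol]).sum(0) / n)
        b -= lr * (-y[viol].sum() / n)
    return w, b

def pwlsvm_fit(X, y, k):
    """Cluster the data, then fit one linear SVM per cluster."""
    C, assign = kmeans(X, k)
    models = [linear_svm(X[assign == j], y[assign == j]) for j in range(k)]
    return C, models

def pwlsvm_predict(C, models, X):
    """Route each point to the linear SVM of its nearest centroid."""
    assign = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
    out = np.empty(len(X))
    for j, (w, b) in enumerate(models):
        out[assign == j] = np.sign(X[assign == j] @ w + b)
    return out
```

On XOR-like data that no single linear separator can handle, routing points through per-cluster linear models recovers a piecewise-linear boundary, which is the effect the balanced clustering in PWLSVM is designed to enable.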