Semester

Summer

Date of Graduation

2009

Document Type

Dissertation

Degree Type

PhD

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Donald A Adjeroh

Committee Co-Chair

E James Harner

Abstract

High-dimensional data are widely available in bioinformatics, chemometrics and other applications. For example, in gene expression experiments, tens of thousands of genes are probed. Phenotype data may be clinical data such as tumor types, or quantities measuring biological characteristics of a subject. While such high-dimensional data can be readily generated, successful analysis and modeling of these data are highly challenging.

Random KNN, as proposed in this dissertation, is a novel generalization of traditional nearest-neighbor modeling. Random KNN consists of an ensemble of base k-nearest-neighbor models, each taking a random subset of the input variables. A theoretical and empirical analysis of the performance of Random KNN is performed. Based on the proposed Random KNN, a new feature selection method is devised. To rank the importance of the variables, a criterion, named support, is defined and computed on the Random KNN framework. A two-stage backward model selection method is developed using supports. The present study shows that Random KNN is a more effective and more efficient model for high-dimensional data than existing approaches.

The Random KNN approach can be applied to both qualitative and quantitative responses, i.e., classification and regression problems, and has applications in statistics, machine learning, pattern recognition and bioinformatics.

Keywords: classification, regression, feature selection, bioinformatics, gene expression analysis.
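The ensemble idea described in the abstract (base k-nearest-neighbor models, each restricted to a random subset of the input variables) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. The function names, the majority-vote aggregation, the squared Euclidean distance, the default parameter values and the toy data are all assumptions made for illustration. The support criterion and the two-stage backward selection are not sketched because the abstract does not define them.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, query, features, k):
    """Classify `query` with plain k-NN restricted to the given feature indices."""
    def sq_dist(row):
        # Squared Euclidean distance in the chosen feature subspace (assumed metric).
        return sum((row[j] - query[j]) ** 2 for j in features)

    # Indices of the k training rows closest to the query in the subspace.
    nearest = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i]))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def random_knn_predict(train_X, train_y, query,
                       n_models=25, n_features=2, k=3, seed=0):
    """Majority vote over base k-NN models, each using a random feature subset.

    Hypothetical helper: majority voting is an assumed aggregation rule.
    """
    rng = random.Random(seed)
    n_total = len(train_X[0])
    votes = Counter()
    for _ in range(n_models):
        # Each base model sees its own random subset of the input variables.
        features = rng.sample(range(n_total), n_features)
        votes[knn_predict(train_X, train_y, query, features, k)] += 1
    return votes.most_common(1)[0][0]


# Toy data (made up): two well-separated classes over four variables.
toy_X = [
    [0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 0],
    [10, 9, 10, 9], [9, 10, 9, 10], [9, 9, 10, 10],
]
toy_y = ["a", "a", "a", "b", "b", "b"]

label_low = random_knn_predict(toy_X, toy_y, [0.5, 0.5, 0.5, 0.5])
label_high = random_knn_predict(toy_X, toy_y, [9.5, 9.5, 9.5, 9.5])
```

Because each base model works in a low-dimensional subspace, every distance computation touches only `n_features` variables rather than all of them, which is one plausible reading of the efficiency claim for high-dimensional data. For a quantitative response, the vote could be replaced by an average of the neighbors' values, again as an assumption rather than the author's stated method.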
