Semester
Summer
Date of Graduation
2009
Document Type
Dissertation
Degree Type
PhD
College
Statler College of Engineering and Mineral Resources
Department
Lane Department of Computer Science and Electrical Engineering
Committee Chair
Donald A Adjeroh
Committee Co-Chair
E James Harner
Abstract
High-dimensional data are widely available in bioinformatics, chemometrics and other applications. For example, in gene expression experiments, tens of thousands of genes are probed. Phenotype data may be clinical data, such as tumor types, or quantities measuring biological characteristics of a subject. While such high-dimensional data can be readily generated, successful analysis and modeling of these data is highly challenging.

Random KNN, as proposed in this dissertation, is a novel generalization of traditional nearest-neighbor modeling. Random KNN consists of an ensemble of base k-nearest-neighbor models, each taking a random subset of the input variables. A theoretical and empirical analysis of the performance of Random KNN is performed. Based on the proposed Random KNN, a new feature selection method is devised. To rank the importance of the variables, a criterion, named support, is defined and computed on the Random KNN framework. A two-stage backward model selection method is developed using supports. The present study shows that Random KNN is a more effective and more efficient model for high-dimensional data than existing approaches.

The Random KNN approach can be applied to both qualitative and quantitative responses, i.e., classification and regression problems, and has applications in statistics, machine learning, pattern recognition and bioinformatics.

Keywords: classification, regression, feature selection, bioinformatics, gene expression analysis.
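The ensemble idea in the abstract (many base k-nearest-neighbor models, each restricted to a random subset of the input variables) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. The function names, the parameters (`k`, `n_models`, `m`), the squared Euclidean distance, the toy data, and the use of a simple majority vote to combine base models are all assumptions made for illustration. The support criterion and the two-stage backward selection are not shown, since the abstract does not define them.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k, features):
    """Classify x by majority vote among its k nearest training points,
    using squared Euclidean distance over the given feature indices only."""
    dists = []
    for row, label in zip(train_X, train_y):
        d = sum((row[j] - x[j]) ** 2 for j in features)
        dists.append((d, label))
    dists.sort(key=lambda pair: pair[0])  # sort by distance only
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def random_knn_predict(train_X, train_y, x, k=3, n_models=25, m=2, seed=0):
    """Hypothetical ensemble: each base kNN model sees m randomly chosen
    features; base predictions are combined by majority vote (an assumption)."""
    rng = random.Random(seed)
    p = len(train_X[0])  # total number of input variables
    votes = Counter()
    for _ in range(n_models):
        features = rng.sample(range(p), m)  # random feature subset, no repeats
        votes[knn_predict(train_X, train_y, x, k, features)] += 1
    return votes.most_common(1)[0][0]


# Toy data (invented for illustration): two well-separated classes, four variables.
train_X = [
    [0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1],
    [10, 10, 10, 10], [9, 10, 9, 10], [10, 9, 10, 9],
]
train_y = ["a", "a", "a", "b", "b", "b"]

print(random_knn_predict(train_X, train_y, [0.5, 0.5, 0.5, 0.5]))
print(random_knn_predict(train_X, train_y, [9.5, 9.5, 9.5, 9.5]))
```

Because each base model computes distances on only `m` of the `p` variables, the cost of a single distance computation no longer grows with the full dimensionality, which is one plausible reading of why such an ensemble may suit high-dimensional data.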
Recommended Citation
Li, Shengqiao, "Random KNN modeling and variable selection for high dimensional data" (2009). Graduate Theses, Dissertations, and Problem Reports. 4492.
https://researchrepository.wvu.edu/etd/4492