Statler College of Engineering and Mineral Resources


Lane Department of Computer Science and Electrical Engineering

Committee Chair

Nancy Lan Guo


The main goal of this study is to identify molecular signatures that predict lymph node metastases and recurrence in colon cancer patients. Recent advances in microarray technology have facilitated the building of accurate molecular classifiers and an in-depth understanding of disease mechanisms.

Lymph node metastasis cannot be accurately estimated by morphological assessment, and molecular markers have the potential to improve prognostic accuracy. The first part of our study presents a novel technique for identifying molecular markers that predict disease stage from microarray gene expression data. In the first step, random forests were used for variable selection and a 14-gene signature was identified. In the second step, genes without differential expression between lymph node negative and lymph node positive tumors were removed from the 14-gene signature, leading to a 9-gene signature. The lymph node status prediction accuracy of the 9-gene signature on an independent colon cancer dataset (n=17) was 82.3%. The area under the curve (AUC) obtained from time-dependent ROC curves using the 9-gene signature was 0.85 for relapse-free survival and 0.86 for overall survival. The 9-gene signature significantly stratified patients into low-risk and high-risk groups (log-rank tests, p<0.05, n=73) with distinct relapse-free survival and overall survival. These results suggest that the 9-gene signature could be used to identify lymph node metastases in patients. We further studied the 9-gene signature using correlation analysis on CGH and RNA expression datasets. The gene ITGB1 in the 9-gene signature exhibited a strong relationship between DNA copy number and gene expression. Furthermore, genome-wide correlation analysis was performed on the CGH and RNA data, and sets of three or more consecutive genes with significant correlation between DNA copy number and RNA expression were identified.
These results might be helpful in identifying the regulators of gene expression.

The second part of the study focused on identifying molecular signatures for patients at high risk of recurrence who would benefit from adjuvant chemotherapy. The training set (n=36) consisted of patients who remained disease-free for 5 years and patients who experienced recurrence within 5 years. The remaining patients formed the testing set (n=37). A combinatorial scheme was developed to identify gene signatures predicting colon cancer recurrence. In the first step, preprocessing discarded undifferentiated genes, and missing values were replaced using the k-nearest neighbors algorithm with k=30 and with k=20. Variable selection using the random forests algorithm was then applied to obtain gene subsets. In the second step, the InfoGain feature selection technique was used to drop lower-ranked genes from the gene subsets based on their association with disease outcome. This technique identified a 3-gene signature and a 5-gene signature, depending on the missing value replacement method. Both recurrence gene signatures stratified patients into low-risk and high-risk groups (log-rank tests, p<0.05, n=73) with distinct relapse-free survival and overall survival. A recurrence prediction model built using the LWL classifier on the 3-gene signature had an accuracy of 91.7% on the training set (n=36). Another recurrence prediction model, built using the random tree classifier on the 5-gene signature, had an accuracy of 83.3% on the training set (n=36). The prospective predictions obtained on the testing set using these models will be verified when follow-up information becomes available. The recurrence prediction accuracies of these gene signatures on independent colon cancer datasets ranged from 72.4% to 88.9%.
These prognostic models might help clinicians select more appropriate treatments for patients who are at high risk of developing recurrence. When compared over multiple datasets, the 3-gene signature had better prediction accuracy than the 5-gene signature. The identified lymph node and recurrence gene signatures were validated on rectal cancer data. Time-dependent ROC and Kaplan-Meier analyses were performed and produced significant results. These results suggest that the developed prognostic models could be used to identify rectal cancer patients at high risk of developing recurrence and to estimate their survival times.
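
The abstract reports several log-rank comparisons between low-risk and high-risk groups. As an illustration only, the following is a minimal sketch of a standard two-group log-rank test in plain Python. It is not the study's own code: the function name `logrank_test`, the variable names, and the toy follow-up times are hypothetical and do not come from the datasets described here.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value).

    times_*  : follow-up times for each patient in the group
    events_* : 1 if the event (relapse or death) was observed, 0 if censored
    """
    # Pool all patients as (time, event flag, group index).
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]

    # Only time points with at least one observed event contribute.
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = 0.0  # events observed in group A
    expected_a = 0.0  # events expected in group A if both groups share one hazard
    variance = 0.0    # hypergeometric variance accumulated over event times

    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk in A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk in B
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b

        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0

    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical toy data (months of follow-up), not taken from the study.
low_risk_times, low_risk_events = [10, 12, 14, 16], [0, 0, 0, 0]
high_risk_times, high_risk_events = [1, 2, 3, 4], [1, 1, 1, 1]

chi2, p_value = logrank_test(
    low_risk_times, low_risk_events, high_risk_times, high_risk_events
)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

In practice the two groups would come from a signature-based risk score split at a chosen cutoff, and an established survival analysis package would normally be preferred over a hand-written test.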