Semester

Summer

Date of Graduation

2009

Document Type

Thesis

Degree Type

MS

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Nancy Lan Guo

Abstract

The main objective of this study is to develop a novel network-based methodology for identifying prognostic gene signatures that can predict cancer recurrence. Feature selection algorithms have been widely used to identify gene signatures in genome-wide association studies, but most of them do not discover causal relationships between features and must trade accuracy against complexity. Network-based techniques take the molecular interactions between pairs of genes into account; they are therefore a more efficient means of finding gene signatures and achieve better classification accuracy without a compromise in complexity. Nevertheless, each of the network-based techniques currently in use has limitations. Correlation-based coexpression networks do not provide predictive structure or causal relations among genes. Bayesian networks cannot model feedback loops. Boolean networks can model small-scale molecular networks, but not genome-scale ones. Implication networks induced by prediction logic were therefore chosen to generate genome-wide coexpression networks, because they integrate formal logic and statistics and overcome the limitations of the other network-based techniques.

The first part of the study comprises building an implication network and identifying a set of genes that could form a prognostic signature. The data consisted of 442 samples taken from 4 different sources and were split into a training set, UM/HLM (n=256), and two test sets, DFCI (n=82) and MSK (n=104). The training set was used to generate the implication network and, ultimately, to identify the prognostic signature; the test sets were used to validate that signature. The implication networks were built from the gene expression data associated with two disease states (metastasis or non-metastasis), defined by the period and status of post-operative survival.
The gene interactions that differentiated the two disease states, termed the differential components, were identified. The major cancer hallmarks (E2F, EGF, EGFR, KRAS, MET, RB1, and TP53) were considered, and the genes in the differential components that interacted with all of these hallmarks were selected to form a 31-gene prognostic signature. A software package, written in R with embedded C code, was created to automate this process. Next, the signature was fitted to a Cox proportional hazards model, and the point on the ROC curve nearest to perfect classification was taken as the best scheme for patient stratification on the training set (log-rank p-value=1.97e-08) and on the two test sets, DFCI (log-rank p-value=2.13e-05) and MSK (log-rank p-value=1.24e-04), in Kaplan-Meier analyses.

Prognostic validation was carried out on the test sets using the Concordance Probability Estimate (CPE) and Gene Set Enrichment Analysis (GSEA). The accuracy of the signature was evaluated with CPE, which reached 0.71 on the DFCI test set (log-rank p-value=5.3e-08) and 0.70 on the MSK test set (log-rank p-value=2.1e-07). The hazard ratio of the 31-gene prognostic signature is 2.68 (95% CI: [1.88, 3.82]) on the DFCI dataset and 3.31 (95% CI: [2.11, 5.2]) on the MSK dataset. These results demonstrate that the 31-gene signature was significantly more accurate than previously published signatures on the same datasets. The false discovery rate (FDR) of the signature, computed with GSEA, is 0.21, which shows that it is comparable to other lung cancer prognostic signatures on the same datasets.

Topological validation of the identified signature was performed on the test sets to validate the computationally derived molecular interactions. The interactions from the implication networks were compared with those from Bayesian networks implemented in Tetrad IV.
Various curated databases and bioinformatics tools were used in the topological evaluation, including PRODISTIN, KEGG, PubMed, NCI-Nature pathways, MATISSE, STRING 8, Ingenuity Pathway Analysis, and Pathway Studio 6. The results showed that the implication networks captured all the curated interactions from these tools and databases, whereas the Bayesian networks contained only a few of them. It can thus be concluded that implication networks are capable of generating many more gene or protein interactions than currently used network techniques such as Bayesian networks.
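
The abstract describes choosing the patient stratification scheme as the point on the ROC curve nearest to perfect classification. The following is a minimal Python sketch of that selection rule only, not the thesis's R/C implementation. It assumes a simple binary outcome label per patient and a single numeric risk score (for example, a Cox model risk score), and it ignores censoring; the function names `roc_points` and `closest_to_perfect` are hypothetical.

```python
import math


def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for every distinct score used as a cutoff.

    A sample is called high-risk when its score is >= the threshold.
    labels: 1 for the event of interest (e.g. metastasis), 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one sample of each class")
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def closest_to_perfect(scores, labels):
    """Pick the cutoff whose ROC point is nearest to (FPR=0, TPR=1).

    Distance is Euclidean: sqrt(fpr^2 + (1 - tpr)^2).
    """
    return min(
        roc_points(scores, labels),
        key=lambda p: math.hypot(p[1], 1.0 - p[2]),
    )


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
    toy_labels = [0, 0, 1, 0, 1, 1, 1]
    threshold, fpr, tpr = closest_to_perfect(toy_scores, toy_labels)
    print(threshold, fpr, tpr)
```

Patients with a score at or above the selected threshold would form the high-risk group, and the two resulting groups could then be compared in a Kaplan-Meier analysis with a log-rank test.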
