Semester
Spring
Date of Graduation
2026
Document Type
Dissertation
Degree Type
PhD
College
Statler College of Engineering and Mineral Resources
Department
Lane Department of Computer Science and Electrical Engineering
Committee Chair
Donald Adjeroh
Committee Co-Chair
Gangqing Hu
Committee Member
Gianfranco Doretto
Committee Member
Jeremy Dawson
Committee Member
Elaine Eschen
Committee Member
Sijin Wen
Abstract
This work develops computational methods for identifying molecular signatures from high-throughput genomic data and for modeling the sub-cellular localization of long non-coding RNAs (lncRNAs). The response of multiple myeloma to CB-6644, a selective RUVBL1/2 complex inhibitor with potential anti-tumor activity, is analyzed to identify drug-responsive pathways and molecular signatures. Conventional gene set enrichment analysis (GSEA) often excludes low-expression genes. Here, phenotype comparison is reformulated as a supervised machine learning problem: the genes most informative for discriminating between phenotypes are selected first, and GSEA is then applied to these machine-learning-derived gene sets. This framework improves detection of CB-6644-associated pathways. For lncRNAs, localization signatures are first studied directly from the RNA sequence. Inexact q-mer representations are analyzed for sub-cellular localization prediction, cell-type specificity, and localization-switching lncRNAs. The analyses identify localization-associated sequence segments (signatures), show that part of the signal is cell-type dependent, and indicate that 5’ transcript regions contain stronger localization signals than 3’ regions. Next, a graph neural network (GNN) framework for lncRNA localization is developed in which each transcript is represented as a graph whose nodes are sequence windows and whose edges encode both local adjacency and non-local sequence similarity, while an optional global branch captures transcript-level context. Across benchmark tasks, the GNN achieves competitive performance and shows that graph structure provides useful contextual refinement beyond strong sequence features. Perturbation-based interpretation further highlights biologically plausible sequence windows, including nuclear-retention-like C-rich motifs.
Together, these studies show that machine learning can recover molecular and localization signatures across multiple levels of biological organization, from gene sets to transcript motifs to graph-structured representations. The dissertation contributes new methods for drug-response signature discovery, lncRNA sub-cellular localization prediction, and interpretable sequence modeling, advancing computational approaches for functional genomics.
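As an illustration of the transcript graph described in the abstract (nodes as sequence windows, edges for local adjacency and non-local sequence similarity), the following is a minimal sketch only, not the dissertation's implementation. The window size, stride, k-mer length, the use of cosine similarity over k-mer counts, the similarity threshold, and all function names are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def windows(seq, size, stride):
    """Split a sequence into fixed-size windows; each window is one graph node."""
    last_start = max(len(seq) - size, 0)
    return [seq[i:i + size] for i in range(0, last_start + 1, stride)]


def kmer_profile(window, k):
    """Count the k-mers occurring in one window."""
    return Counter(window[i:i + k] for i in range(len(window) - k + 1))


def cosine(a, b):
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_graph(seq, size=8, stride=4, k=3, threshold=0.8):
    """Return (nodes, edges) for one transcript.

    Edges are (i, j, kind) tuples: 'adjacent' links consecutive windows,
    'similar' links non-consecutive windows whose k-mer profiles are close.
    All default parameter values are illustrative assumptions.
    """
    nodes = windows(seq, size, stride)
    profiles = [kmer_profile(w, k) for w in nodes]
    edges = set()
    # Local adjacency: connect each window to the next one along the transcript.
    for i in range(len(nodes) - 1):
        edges.add((i, i + 1, "adjacent"))
    # Non-local similarity: connect distant windows with similar k-mer content.
    for i, j in combinations(range(len(nodes)), 2):
        if j - i > 1 and cosine(profiles[i], profiles[j]) >= threshold:
            edges.add((i, j, "similar"))
    return nodes, sorted(edges)


if __name__ == "__main__":
    toy_nodes, toy_edges = build_graph("ACGTACGTGGGGGGGGACGTACGT", size=8, stride=8)
    print(toy_nodes)
    print(toy_edges)
```

In a full model, each node would carry a learned or engineered feature vector, and the edge list would be passed to a GNN library; those steps are outside the scope of this sketch.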
Recommended Citation
Yi, Weijun, "Computational Methods for Identification of Molecular Signatures" (2026). Graduate Theses, Dissertations, and Problem Reports. 13300.
https://researchrepository.wvu.edu/etd/13300
Included in
Other Biomedical Engineering and Bioengineering Commons, Other Computer Engineering Commons