Author ORCID Identifier

https://orcid.org/0000-0002-6912-2925

Semester

Spring

Date of Graduation

2024

Document Type

Dissertation

Degree Type

PhD

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Donald Adjeroh

Committee Member

Elaine Eschen

Committee Member

Gianfranco Doretto

Committee Member

Jeremy Dawson

Committee Member

Ivan Martinez

Committee Member

Granger Sutton

Abstract

The applied science of bioinformatics encompasses computational analysis of molecular biology data. Advances in genomics and DNA sequencing technology have enabled computational analysis of ribonucleic acids (RNAs), which play diverse and critical roles in most cells. To assist the study of human RNA, we trained machine learning models on RNA nucleotide sequences alone, without added domain knowledge. We built models that distinguish long non-coding RNA (lncRNA) from protein-coding messenger RNA (mRNA), and models that predict whether a lncRNA localizes preferentially to the cytoplasm or to the nucleus. In a review of published lncRNA subcellular localization classifiers, we show that the commonly used validation protocol generates optimistic performance measures, and we propose a new benchmark for this application of machine learning. To assist the study of plant biology, we applied our own alignment-based method to the analysis of maternal vs. paternal imbalance of mRNA in seeds. We also generated initial results indicating how k-mer-based methods might complement our alignment-based methods. Finally, we developed and published a machine learning method that improved the accuracy of our alignment-based method in the specific case of detecting parental imbalance in interspecies hybrids. Together, these results demonstrate several enhancements to the field of RNA bioinformatics through the application of machine learning.

Embargo Reason

Publication Pending

Available for download on Friday, April 25, 2025
