Semester

Fall

Date of Graduation

2021

Document Type

Thesis

Degree Type

MS

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Donald Adjeroh

Committee Member

Gianfranco Doretto

Committee Member

Ivan Martinez

Abstract

Long non-coding ribonucleic acids (lncRNAs) can be localized to different cellular components, such as the nucleus, exosome, cytoplasm, and ribosome. Their biological functions can be influenced by the region of the cell in which they are located. Many lncRNAs are associated with challenging diseases, so it is crucial to study their subcellular localization. However, compared to the vast number of known lncRNAs, relatively few have annotations for their subcellular localization. Conventional computational methods extract q-mer profiles from lncRNA sequences and then train machine learning models, such as support vector machines and logistic regression, on those profiles. These methods consider only exact q-mers. Given possible sequence mutations and other uncertainties in genomic sequences, and their role in biological function, I hypothesize that accounting for these changes may improve the ability to predict the subcellular localization of lncRNAs. To test this hypothesis, I propose a deep learning model that uses inexact q-mers to predict the localization of lncRNAs in the cell. Using 8-mers with mismatches, the proposed method achieves a high overall accuracy of 94.7%, with an average of 91.3%, on a benchmark dataset. In comparison, exact 8-mers yielded 89.8%. The proposed approach outperformed existing state-of-the-art lncRNA localization predictors on two different datasets. The results therefore support the hypothesis that deep learning models using inexact q-mers can improve the performance of computational lncRNA localization algorithms. The lengths of lncRNAs vary from hundreds to thousands of nucleotides, so this work also examines whether lncRNA length affects prediction accuracy. The results show that the model is most accurate when the lncRNA sequence is between 2000 and 3000 nucleotides long.
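
The difference between exact and inexact q-mer profiles described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (exact_qmer_profile, inexact_qmer_profile), the RNA alphabet ACGU, and the use of Hamming distance as the notion of "mismatch" are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; not specified in the abstract


def exact_qmer_profile(seq, q):
    """Count every length-q window of seq (exact q-mer profile)."""
    return Counter(seq[i:i + q] for i in range(len(seq) - q + 1))


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def inexact_qmer_profile(seq, q, max_mismatches=1):
    """Hypothetical inexact profile: for each possible q-mer, count the
    windows of seq that lie within max_mismatches substitutions of it."""
    exact = exact_qmer_profile(seq, q)
    profile = Counter()
    # Brute-force enumeration of all 4**q candidates; fine for small q only.
    for candidate in map("".join, product(ALPHABET, repeat=q)):
        total = sum(
            count
            for window, count in exact.items()
            if hamming(candidate, window) <= max_mismatches
        )
        if total:
            profile[candidate] = total
    return profile


if __name__ == "__main__":
    seq = "ACGUACGU"
    print(exact_qmer_profile(seq, 2))
    print(inexact_qmer_profile(seq, 2, max_mismatches=1))
```

With mismatches allowed, a q-mer that never occurs exactly in the sequence (for example "AA" in "ACGUACGU") can still receive a nonzero count from near matches, so the profile is less sensitive to single-nucleotide changes. Enumerating all candidates is exponential in q, so a practical pipeline for 8-mers would likely need a more efficient scheme.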
