Date of Graduation
Statler College of Engineering and Mineral Resources
Lane Department of Computer Science and Electrical Engineering
Donald A Adjeroh
Recent improvements in high-throughput next generation sequencing (NGS) technologies have led to an exponential increase in the number, size and diversity of available complete genome sequences. This poses major problems in storage, transmission and analysis of such genomic sequence data. Thus, a substantial effort has been made to develop effective data compression techniques to reduce the storage requirements, improve the transmission speed, and analyze the compressed sequences for possible information about genomic structure or determine relationships between genomes from multiple organisms.;In this thesis, we study the problem of lossless compression of genome resequencing data using a reference-based approach. The thesis is divided in two major parts. In the first part, we perform a detailed empirical analysis of a recently proposed compression scheme called MLCX (Maximal Longest Common Substring/Subsequence). This led to a novel decomposition technique that resulted in an enhanced compression using MLCX. In the second part, we propose SMLCX, a new reference-based lossless compression scheme that builds on the MLCX. This scheme performs compression by encoding common substrings based on a sorted order, which significantly improved compression performance over the original MLCX method. Using SMLCX, we compressed the Homo sapiens genome with original size of 3,080,436,051 bytes to 6,332,488 bytes, for an overall compression ratio of 486. This can be compared to the performance of current state-of-the-art compression methods, with compression ratios of 157 (Wang et.al, Nucleic Acid Research, 2011), 171 (Pinho et.al, Nucleic Acid Research, 2011) and 360 (Beal et.al, BMC Genomics, 2016).
Farheen, Aliya, "Compressing Genome Resequencing Data" (2016). Graduate Theses, Dissertations, and Problem Reports. 5578.