Faculty & Staff Scholarship

A New Algorithm for “the LCS problem” with Application in Compressing Genome Resequencing Data

Richard Beal, West Virginia University
Tazin Afrin, West Virginia University
Aliya Farheen, West Virginia University
Donald Adjeroh, West Virginia University

Document Type

Article

Publication Date

2016

College/Unit

Statler College of Engineering and Mining Resources

Department/Program/Center

Not Listed

Abstract

Background: The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data.

Methods: First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. From the maximal common substrings, we construct a directed acyclic graph (DAG) and determine the LCS as the longest path in that DAG. Then, we introduce an LCS-motivated reference-based compression scheme that uses the components of the LCS, rather than the LCS itself.
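The pipeline described in the Methods paragraph (common substrings, then a DAG, then a longest path) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a quadratic dynamic-programming table as a stand-in for the generalized suffix tree, and the names `maximal_common_substrings` and `chain_blocks` are hypothetical. Because it chains only whole, non-overlapping maximal blocks, it always returns a common subsequence, but it is not guaranteed to reach the true LCS on every input; the abstract does not say how the paper handles overlapping blocks.

```python
from typing import List, Tuple

Block = Tuple[int, int, int]  # (start in a, start in b, length)


def maximal_common_substrings(a: str, b: str, min_len: int = 1) -> List[Block]:
    """Return maximal matching runs between a and b.

    Assumption: a quadratic DP table replaces the generalized suffix tree
    named in the abstract; it finds the same kind of blocks, only slower.
    """
    n, m = len(a), len(b)
    # run[i][j] = length of the common run ending at a[i-1] and b[j-1]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
    blocks: List[Block] = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = run[i][j]
            # A run is right-maximal when the next characters differ
            # or one of the sequences ends.
            extends = i < n and j < m and a[i] == b[j]
            if k >= min_len and not extends:
                blocks.append((i - k, j - k, k))
    return blocks


def chain_blocks(a: str, b: str, min_len: int = 1) -> str:
    """Longest weighted path through the DAG of non-overlapping blocks."""
    blocks = sorted(maximal_common_substrings(a, b, min_len))
    if not blocks:
        return ""
    best = [length for (_, _, length) in blocks]  # best path weight ending at v
    back = [-1] * len(blocks)                     # predecessor on that path
    # Sorting by start position in a gives a valid topological order,
    # because an edge u -> v requires u to end in a before v starts.
    for v, (va, vb, vlen) in enumerate(blocks):
        for u in range(v):
            ua, ub, ulen = blocks[u]
            fits = ua + ulen <= va and ub + ulen <= vb
            if fits and best[u] + vlen > best[v]:
                best[v] = best[u] + vlen
                back[v] = u
    # Walk back from the heaviest endpoint to recover the subsequence.
    v = max(range(len(blocks)), key=best.__getitem__)
    pieces = []
    while v != -1:
        start, _, length = blocks[v]
        pieces.append(a[start:start + length])
        v = back[v]
    return "".join(reversed(pieces))


if __name__ == "__main__":
    print(chain_blocks("ACGTTGCA", "ACGAATGCA"))
```

In this toy input the two blocks `ACG` and `TGCA` are chained. On genome-scale data the quadratic table would be impractical, which is where a suffix-tree-based search for the blocks would matter.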

Results: Our basic scheme compressed the Homo sapiens genome (original size 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, an overall compression ratio of 360. For comparison, the previous state-of-the-art methods achieved compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011).
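The reported ratio can be checked from the byte counts in the Results paragraph, taking compression ratio to mean original size divided by compressed size (an assumption consistent with the stated figure of 360). The constant and function names below are illustrative only.

```python
# Byte counts quoted in the Results paragraph.
ORIGINAL_BYTES = 3_080_436_051
BASIC_BYTES = 15_460_478
IMPROVED_BYTES = 8_556_708


def compression_ratio(original: int, compressed: int) -> float:
    """Original size divided by compressed size (assumed definition)."""
    return original / compressed


if __name__ == "__main__":
    print(f"basic:    {compression_ratio(ORIGINAL_BYTES, BASIC_BYTES):.1f}")
    print(f"improved: {compression_ratio(ORIGINAL_BYTES, IMPROVED_BYTES):.1f}")
```

Under this definition the improved scheme rounds to the stated ratio of 360, and the basic scheme works out to roughly 199, a figure derived here from the byte counts rather than stated in the abstract.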

Conclusion: We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms show that the proposed method achieves higher compression ratios.

Digital Commons Citation

Beal, Richard; Afrin, Tazin; Farheen, Aliya; and Adjeroh, Donald, "A New Algorithm for “the LCS problem” with Application in Compressing Genome Resequencing Data" (2016). Faculty & Staff Scholarship. 1952.
https://researchrepository.wvu.edu/faculty_publications/1952

Source Citation

Beal, R., Afrin, T., Farheen, A., & Adjeroh, D. (2016). A new algorithm for “the LCS problem” with application in compressing genome resequencing data. BMC Genomics, 17(S4). https://doi.org/10.1186/s12864-016-2793-0

Comments

© 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
