Semester

Summer

Date of Graduation

2008

Document Type

Dissertation

Degree Type

PhD

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Donald Adjeroh.

Abstract

The suffix sorting problem is to construct the suffix array for an input sequence. Given a sequence T[0... n -- 1] of size n = |T|, with symbols from a fixed alphabet Sigma, (|Sigma| ≤ n), the suffix array provides a compact representation of all the suffixes of T in a lexicographic order. Traditionally, the suffix array is often constructed by first building the suffix tree for T, and then performing an inorder traversal of the suffix tree. The direct suffix sorting problem is to construct the suffix array of T directly without using the suffix tree data structure. We propose a direct suffix sorting algorithm which rearranges the biological sequences of interests and facilitates high throughput pattern query, retrieval and storage in O(n) time. The improved algorithm requires only 7n bytes of storage, including the n bytes for the original string, and the 4n bytes for the suffix array. The basis of our improved algorithm is an extension of Shannon-Fano-Elias codes used in information theory. This is the first time information-theoretic methods have been used as the basis for solving the suffix sorting problem.;The direct suffix sorting algorithm is then applied to solve the multiple sequence alignment problem. The sequences to be aligned are concatenated and then passed to the suffix sorting algorithm to locate the repetitive regions. Gaps are determined by those repetitive regions and consequently inserted back to the input sequences to attain the optimal alignment with reference to the biological mutation matrix.;We also apply the direct suffix sorting algorithm to the problem of compressibility of protein sequences. The basis is the observation of genome-scale long-range correlation in concatenated protein sequences from the same organism. We propose a method to exploit this unusual redundancy in compressing the protein sequences. The result is a significant reduction in the number of bits required for representing the sequences. The observed long-range correlations could have significant implications beyond compression and complexity analysis of protein sequences.

Share

COinS