Semester

Summer

Date of Graduation

2011

Document Type

Dissertation

Degree Type

PhD

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Donald Adjeroh.

Abstract

The suffix tree is a data structure used to represent all the suffixes in a string. However, a major problem with the suffix tree is its practical space requirement. In this dissertation, we propose an efficient data structure -- the virtual suffix tree (VST) -- which requires less space than other recently proposed data structures for suffix trees and suffix arrays. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form, where n is the length of the sequence.;Markov models are very popular for modeling complex sequences. In this dissertation, we present the probabilistic suffix array (PSA), a space-efficient alternative to the probabilistic suffix tree (PST) used to represent Markov models. The PSA provides all the capabilities of the PST, such as learning and prediction, and maintains the same linear time construction (linearity with respect to sequence length). The PSA, however, has a significantly smaller memory requirement than the PST, for both the construction stage, and at the time of usage.;Using the proposed suffix data structures, we study the circular pattern matching (CPM) problem. We provide a linear time, linear space algorithm to solve the exact circular pattern matching problem. We then present four algorithms to address the approximate circular pattern matching (ACPM) problem. Our bidirectional ACPM algorithm provides the best time complexity when compared with other algorithms proposed in the literature. Further, we define the circular pattern discovery (CPD) problem and present algorithms to solve this problem. Using the proposed circular pattern matching algorithms, we perform experiments on computational analysis and function prediction for multidomain proteins.

Share

COinS