Semester

Fall

Date of Graduation

2018

Document Type

Thesis

Degree Type

MS

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Jeremy M. Dawson

Committee Co-Chair

Donald Adjeroh

Committee Member

Donald Adjeroh

Committee Member

Stephen DiFazio

Abstract

Bacterial communities found in and on the human body are not only used to study human health conditions but are also effective in differentiating individuals because of their distinct profiles. Human palm regions harbor relatively more diverse bacterial communities that are indicative of population group, lifestyle, geographic location, age group, and health condition. Sequences extracted from the V3 hypervariable region of the bacterial 16S rRNA gene in hand bacterial samples from 9 different population groups were classified into Operational Taxonomic Units (OTUs) with the GreenGenes reference taxonomy using the RDP (Ribosomal Database Project) classifier. Frequencies of the identified OTUs were used to study dissimilarities between samples by calculating the Kullback-Leibler Divergence (KLD) between every pair of samples. In addition to OTU frequencies, the frequencies of nucleotide k-mers from each OTU sequence were used to study the dissimilarities between samples. Based on the structure, 65 nucleotides of the V3 hypervariable region were mapped into 47 elements, and the distributions of k-mers from these mapped elements were used to determine dissimilarities between samples. Furthermore, a new technique was applied to classify sequences: sequences were clustered based on their k-mer frequency profiles, and a unique signature was assigned to every cluster. Frequencies of these signature clusters were used to calculate the KLD between different samples. This method classifies the unknown sequences that were ignored in OTU-based methods. An ensemble learning method was applied to each of the above k-mer cases to identify the population group of a given hand bacterial sample. Samples were identified with accuracies ranging from 51% to 98% across the different cases of k-mer distribution. Samples were classified with greater accuracy using k-mer-classified sequences than using OTU sequences.
Though applied to a small group of samples, these results provide a basis for using k-mer distributions to classify and identify individuals, an approach that could perform better on a broader, time-varying dataset drawn from other regions of the 16S rRNA gene or even other bacterial genes, such as 23S rRNA.
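
The dissimilarity measure described in the abstract can be illustrated with a minimal sketch that builds a k-mer frequency distribution for each sample and computes the Kullback-Leibler Divergence between two such distributions. This is not the thesis implementation: the function names, the toy sequences, the choice of k, and the additive pseudocount (used here so that no k-mer has zero probability) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"


def kmer_distribution(sequences, k=3, pseudocount=1.0):
    """Return a smoothed k-mer probability distribution for one sample.

    The pseudocount is an assumption of this sketch; it keeps every
    probability strictly positive so the divergence stays finite.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if all(base in ALPHABET for base in kmer):
                counts[kmer] += 1
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts.values()) + pseudocount * len(all_kmers)
    return {km: (counts[km] + pseudocount) / total for km in all_kmers}


def kl_divergence(p, q):
    """KLD(p || q) in bits; p and q must share the same keys."""
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)


# Hypothetical toy samples, not data from the thesis.
sample_a = ["ACGTACGTAC", "ACGTTTGACA"]
sample_b = ["GGGCCCGGGC", "ACGTACGGGC"]

p = kmer_distribution(sample_a, k=2)
q = kmer_distribution(sample_b, k=2)
print(kl_divergence(p, q), kl_divergence(q, p))
```

Note that KLD is asymmetric, so the two printed values generally differ; whether the thesis uses a one-directional or symmetrized form is not stated in the abstract.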

Embargo Reason

Publication Pending
