Date of Graduation


Document Type

Problem/Project Report

Degree Type



Statler College of Engineering and Mineral Resources


Lane Department of Computer Science and Electrical Engineering

Committee Chair

Donald Adjeroh

Committee Co-Chair

Gangqing Hu

Committee Member

Lan Guo


ATAC-seq is a new high-throughput sequencing technology for measuring chromatin accessibility within genomic samples. It can be used to discover new information about open regions, nucleosome positions, transcription factor binding sites, and DNA methylation. It is especially useful when combined with other next-generation sequencing techniques, such as RNA-seq. Compared with previous technologies, however, ATAC-seq is more sensitive to bacterial contamination, a well-known problem in cell cultures that can lead to incorrect experimental results. Previous studies have measured contamination in public RNA-seq data and found that 5%-10% of samples were contaminated. In this report, we use two popular alignment-based tools, Bowtie 2 and Kraken 2, to investigate the prevalence of contamination in ATAC-seq samples, rather than RNA-seq data, uploaded to the Sequence Read Archive. We then develop an alignment-free method of detection using machine learning and a novel method of estimating DNA fragment lengths from paired-end ATAC-seq data. Our results show that around 5% of ATAC-seq samples are contaminated and that our machine learning method correctly classifies 97% of samples as contaminated or not while using fewer computational resources than the alignment-based tools. Thus, our method shows promise as a preliminary rapid screening tool for contamination in labs with limited access to computational resources.