Author ORCID Identifier

https://orcid.org/0009-0005-5278-2176

Semester

Fall

Date of Graduation

2025

Document Type

Thesis (Campus Access)

Degree Type

MS

College

Statler College of Engineering and Mineral Resources

Department

Industrial and Management Systems Engineering

Committee Chair

Avishek Choudhury

Committee Member

Ashish Nimbarte

Committee Member

JuHeyong Ryu

Abstract

Cancer is one of the leading contributors to the global burden of disease, accounting for nearly 10 million deaths annually. Despite progress in reducing overall cancer mortality, the rising incidence of several major cancer types, combined with persistent racial, ethnic, and geographic disparities, shows that cancer remains a pressing and evolving public health threat in the USA. Environmental pollution (air, water, and land) is recognized as a major determinant of cancer risk, yet most studies continue to examine pollutants in isolation or rely on broad cumulative indices that obscure the role of individual exposures. The absence of studies that model these exposures jointly while still capturing their individual effects limits the understanding of how diverse environmental and social factors together shape cancer prevalence at the national level.

The main objective of this thesis is to evaluate how diverse environmental factors contribute to cancer prevalence across the USA, using an approach that considers them together in a single model while preserving their distinct individual effects. To achieve this, several parametric and nonparametric models, including binomial logistic regression, a random forest classifier, gradient boosting, and an artificial neural network, were formulated and compared to identify the best-performing model on the EJScreen and HDLP2020 datasets.

Performance was moderate across all formulated models, with the tuned random forest model performing best: it achieved an AUC-ROC of 0.691, a sensitivity of 0.65, and a specificity of 0.67. Shapley additive explanations (SHAP) analysis of the random forest model identified older age, people of color, and unemployment as the dominant predictive factors, with secondary contributions from smoking and fine particulate matter levels. Social and demographic covariates were more dominant predictors than pollution-related covariates in the final model. This indicates that social determinants are closely intertwined with pollution exposures and are crucial in determining cancer prevalence. It also suggests the need for multifaceted policies that address both pollution and socioeconomic challenges to reduce cancer prevalence in the USA.

Available for download on Thursday, December 10, 2026
