Author ORCID Identifier

https://orcid.org/0000-0003-0378-6291

Semester

Summer

Date of Graduation

2025

Document Type

Thesis (Campus Access)

Degree Type

MS

College

Statler College of Engineering and Mineral Resources

Department

Industrial and Management Systems Engineering

Committee Chair

Imtiaz Ahmed

Committee Member

Srinjoy Das

Committee Member

Zeyu Liu

Abstract

Sequential decision-making is the process of making a sequence of decisions over time, where each decision can influence future outcomes. This framework is critical in many real-world scenarios where decisions must adapt to observed results. Within this context, contextual multi-armed bandit (CMAB) models have emerged as a powerful tool for balancing the exploration of new actions against the exploitation of known rewards. These models operate under uncertainty and have proven effective in domains such as recommendation systems, adaptive clinical trials, and dynamic resource allocation. However, many existing methods rely on strong assumptions that limit their effectiveness in real-world settings characterized by complex reward structures, evolving user preferences, and high-dimensional feature spaces. As a result, such models often suffer from inadequate exploration or an inability to accurately estimate rewards in non-linear or temporally drifting environments.

To address these limitations, this work proposes a novel model that enhances traditional bandit algorithms through two key innovations: (1) an attention-based exploration mechanism that adaptively modulates exploration over time for each arm, and (2) an adaptive k-Nearest Neighbors (k-NN) regression module that captures local reward variations to better model non-linear relationships. We augment the classic Linear UCB (LinUCB) structure with these two components, leading to the development of a new algorithm called LNUCB-TA. LNUCB-TA combines a global linear estimate of expected rewards with a non-parametric k-NN adjustment, where the number of neighbors is dynamically selected based on the reward variance of each arm, allowing the model to respond flexibly to both stable and volatile conditions. The exploration component is enhanced by a temporal attention mechanism that balances global and local reward trends, ensuring that arms with promising but underexplored behavior receive appropriate attention. This design enables LNUCB-TA to navigate dynamic environments adaptively, balancing the exploration-exploitation trade-off more effectively than existing models.

Theoretical analysis confirms that LNUCB-TA maintains sublinear regret bounds while incorporating non-linear estimation and dynamic exploration. Empirical results on a wide range of benchmark datasets and real-world news recommendation tasks further validate its effectiveness: LNUCB-TA consistently outperforms existing baselines in both average regret and stability across hyperparameter configurations. In addition, this thesis explores how the proposed attention-based exploration mechanism can be integrated into other bandit models, demonstrating consistent performance gains. Finally, ablation studies and error analyses highlight the individual contributions of the k-NN and attention components and reinforce the robustness of the proposed model.
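
To make the per-arm scoring rule concrete, the sketch below shows one plausible way the three ingredients described above could compose: a LinUCB-style ridge estimate, a k-NN residual whose neighbor count k grows with the arm's reward variance, and a confidence-width exploration bonus scaled by a temporal-attention weight. The class name, constants, variance-to-k rule, and softmax attention form are illustrative assumptions, not the thesis's exact specification.

import numpy as np

class LNUCBTAArm:
    """Per-arm state for a hypothetical LNUCB-TA-style scorer (a sketch)."""

    def __init__(self, dim, alpha=1.0, lam=1.0, k_max=10):
        self.A = lam * np.eye(dim)   # ridge-regularized design matrix
        self.b = np.zeros(dim)       # accumulated reward-weighted contexts
        self.alpha = alpha           # base exploration strength
        self.k_max = k_max           # cap on neighbors for the k-NN term
        self.contexts = []           # observed contexts (for k-NN lookups)
        self.rewards = []            # observed rewards

    def _knn_residual(self, x):
        # Local correction: mean reward of the k nearest past contexts
        # minus the arm's global mean. Assumed rule: higher reward
        # variance -> larger k (more smoothing in volatile regimes).
        if len(self.rewards) < 2:
            return 0.0
        k = min(self.k_max, len(self.rewards),
                max(1, int(round(1 + np.var(self.rewards) * self.k_max))))
        dists = np.linalg.norm(np.asarray(self.contexts) - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return float(np.mean(np.asarray(self.rewards)[nearest])
                     - np.mean(self.rewards))

    def _attention_weight(self, window=5):
        # Temporal attention (assumed softmax form): weight on the recent
        # local reward trend relative to the arm's overall mean reward.
        if not self.rewards:
            return 1.0  # unexplored arms keep full exploration pressure
        recent = np.mean(self.rewards[-window:])
        overall = np.mean(self.rewards)
        w = np.exp([recent, overall])
        return float(w[0] / w.sum())

    def score(self, x):
        x = np.asarray(x, dtype=float)
        theta = np.linalg.solve(self.A, self.b)          # ridge estimate
        width = np.sqrt(x @ np.linalg.solve(self.A, x))  # confidence width
        bonus = self.alpha * self._attention_weight() * width
        return float(theta @ x) + self._knn_residual(x) + bonus

    def update(self, x, r):
        x = np.asarray(x, dtype=float)
        self.A += np.outer(x, x)
        self.b += r * x
        self.contexts.append(x)
        self.rewards.append(float(r))

At each round, a policy would call score on every arm's candidate context and update only the arm that was pulled; selecting k per arm from reward variance is what would distinguish a scheme like this from a fixed-k linear/non-parametric hybrid.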

Available for download on Friday, July 31, 2026
