Author ORCID Identifier
Semester
Summer
Date of Graduation
2025
Document Type
Thesis (Campus Access)
Degree Type
MS
College
Statler College of Engineering and Mineral Resources
Department
Industrial and Management Systems Engineering
Committee Chair
Imtiaz Ahmed
Committee Member
Srinjoy Das
Committee Member
Zeyu Liu
Abstract
Sequential decision-making is the process of making a sequence of decisions over time, where each decision can affect future outcomes. This framework is critical in many real-world scenarios in which decisions must be adapted to observed results. Within this context, contextual multi-armed bandit (CMAB) models have emerged as a powerful tool that aims to balance exploration of new actions against exploitation of known rewards. These models operate under uncertainty and have proven effective in domains such as recommendation systems, adaptive clinical trials, and dynamic resource allocation. However, many existing methods rely on strong assumptions, which limit their effectiveness in real-world settings characterized by complex reward structures, evolving user preferences, and high-dimensional feature spaces. As a result, such models often suffer from inadequate exploration or an inability to estimate rewards accurately in non-linear or temporally drifting environments. To address these limitations, this work proposes a novel model that enhances traditional bandit algorithms through two key innovations: (1) an attention-based exploration mechanism that modulates exploration adaptively over time for each arm, and (2) an adaptive k-Nearest Neighbors (k-NN) regression module that captures local reward variations to better model non-linear relationships. We enhance the classic Linear UCB (LinUCB) structure with these two components, leading to the development of a new algorithm called LNUCB-TA. LNUCB-TA combines global linear estimation of expected rewards with a non-parametric k-NN adjustment, where the number of neighbors is dynamically selected based on the reward variance of each arm. This allows the model to respond flexibly to both stable and volatile conditions.
The exploration component is enhanced by a temporal attention mechanism that balances global and local reward trends, ensuring that arms with promising but underexplored behavior receive appropriate attention. This design enables LNUCB-TA to navigate dynamic environments adaptively, balancing the exploration-exploitation trade-off more effectively than existing models. Theoretical analysis confirms that LNUCB-TA maintains a sublinear regret bound while incorporating non-linear estimation and dynamic exploration. Empirical results on a wide range of benchmark datasets and real-world news recommendation tasks further validate its effectiveness. LNUCB-TA consistently outperforms existing baselines in terms of both average regret and stability across hyperparameter configurations. In addition, this thesis explores how the proposed attention-based exploration mechanism can be integrated into other bandit models, demonstrating consistent performance gains. Finally, ablation studies and error analyses highlight the individual contributions of the k-NN and attention components and reinforce the robustness of the proposed model.
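The abstract does not give the LNUCB-TA update rules, so the following is only a minimal sketch of the general idea it describes: a classic per-arm LinUCB estimate combined with a k-NN correction whose neighborhood size depends on the arm's reward variance. The class name `LinUCBKnnArm`, the parameters `alpha`, `k_min`, and `k_max`, the residual-averaging correction, and the rule mapping variance to k are all assumptions made for illustration and are not taken from the thesis. The temporal attention component is omitted, and the bonus is the standard LinUCB confidence width.

```python
import math
from statistics import pvariance


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


class LinUCBKnnArm:
    """One arm: ridge-regression LinUCB plus a k-NN residual correction (illustrative)."""

    def __init__(self, dim, alpha=1.0, k_min=1, k_max=10):
        # Inverse of the regularized Gram matrix (I + sum of x x^T).
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim          # sum of reward-weighted contexts
        self.alpha = alpha            # width of the confidence bonus
        self.k_min, self.k_max = k_min, k_max
        self.history = []             # (context, reward) pairs seen by this arm

    def _choose_k(self):
        # Assumed rule: higher reward variance -> more neighbors (more smoothing).
        var = pvariance([r for _, r in self.history])
        k = self.k_min + round((self.k_max - self.k_min) * var / (1.0 + var))
        return max(1, min(k, len(self.history)))

    def _knn_residual(self, x, theta):
        # Average the linear model's residuals over the k nearest past contexts.
        if not self.history:
            return 0.0
        scored = sorted((math.dist(x, xi), ri - dot(theta, xi)) for xi, ri in self.history)
        nearest = scored[: self._choose_k()]
        return sum(res for _, res in nearest) / len(nearest)

    def ucb(self, x):
        theta = matvec(self.A_inv, self.b)  # ridge estimate of the reward weights
        width = max(0.0, dot(x, matvec(self.A_inv, x)))
        return dot(theta, x) + self._knn_residual(x, theta) + self.alpha * math.sqrt(width)

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A_inv after adding x x^T.
        Ax = matvec(self.A_inv, x)
        denom = 1.0 + dot(x, Ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]
        self.history.append((list(x), float(reward)))


def select_arm(arms, x):
    """Return the index of the arm with the highest upper confidence bound."""
    scores = [arm.ucb(x) for arm in arms]
    return scores.index(max(scores))
```

In use, one would create one `LinUCBKnnArm` per action, call `select_arm(arms, x)` for each incoming context `x`, observe the reward, and pass it to `update` on the chosen arm. Keeping the inverse matrix up to date with a rank-one update avoids a full matrix inversion at every step.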
Recommended Citation
Khosravi, Hamed, "Attentive Optimism in the Face of Uncertainty: Balancing Exploration-Exploitation in Multi-Armed Bandits" (2025). Graduate Theses, Dissertations, and Problem Reports. 13028.
https://researchrepository.wvu.edu/etd/13028