Author ORCID Identifier

https://orcid.org/0000-0002-6719-3228

Semester

Spring

Date of Graduation

2024

Document Type

Dissertation

Degree Type

PhD

College

School of Pharmacy

Department

Pharmaceutical Systems and Policy

Committee Chair

Sabina Nduaguba

Committee Member

Usha Sambamoorthi

Committee Member

Virginia G. Scott

Committee Member

Traci LeMasters

Committee Member

Jay S. Patel

Abstract

Hodgkin’s lymphoma (HL) is a rare malignancy of lymphocytes that predominantly occurs in young adults aged 20-30 years and in older adults aged 65-75 years. Despite its low incidence, there were an estimated 223,512 HL survivors in the US in 2020. HL has a favorable prognosis among young adults, with a cure rate of 85-90%; however, older adults have a poor prognosis, with a 5-year overall survival rate of 40-55% in patients over 60 years. HL survivors incur high total and out-of-pocket (OOP) healthcare expenditures, averaging $78,183 and $4,180 per patient, respectively, in the first year after diagnosis. Despite this considerable economic burden, the existing HL cost literature has focused primarily on young adults and overlooked costs among older adults. Healthcare costs among older HL survivors could be substantial because of their poor prognosis and the complexity arising from factors such as comorbidities and adjusted chemotherapy regimens. Machine learning (ML) methods are increasingly used to predict healthcare costs. However, they pose challenges, including algorithmic bias, which particularly affects underprivileged groups such as females, non-whites, and individuals of lower economic status. Given HL’s low incidence and its prevalence among young adults, who often use the internet and social media to obtain health-related information, we adopt a data-driven approach that leverages claims and social media data to address gaps in the literature on health expenditures among older HL patients and to explore the feasibility of using social media to study HL. We aim to achieve these objectives through three related research aims:

1. Determine the leading predictors of Medicare and OOP healthcare expenditures in older HL survivors across different phases of cancer care using interpretable ML methods.
2. Assess the fairness of ML models in predicting health expenditures in HL patients based on their sensitive attributes: sex, race, and economic status.
3. Assess the feasibility of using social media data to study the disease and treatment characteristics of HL.

We used a retrospective research design, utilizing data from multiple sources to address aims 1 and 2. We used Surveillance, Epidemiology, and End Results (SEER) data linked with fee-for-service (FFS) Medicare claims for patients with a primary diagnosis of incident HL between 2009 and 2017, with a two-year baseline and follow-up period. Along with SEER-Medicare data, we incorporated geographical information from SEER census and zip code files and publicly available data from the Area Health Resource File (AHRF) and County Health Ranking File (CHRF). We employed multiple ML models for analysis, including linear regression, random forest, and XGBoost. Additionally, we used SHapley Additive exPlanations (SHAP) values to determine the contribution of each feature to the model’s prediction. Model fairness was assessed using group fairness metrics, which assess independence, separation, and sufficiency. We also examined individual fairness through a counterfactual analysis, the Flip Test, which alters sensitive attributes from privileged to underprivileged values and assesses the resulting change in model performance. For aim 3, we analyzed data from the X platform spanning January 2010 to October 2022, extracting and identifying pre-defined classes and attributes related to HL using Named Entity Recognition (NER), a Natural Language Processing (NLP) technique.

Our findings showed high Medicare and OOP healthcare expenditures among HL survivors during the pre-diagnosis, treatment, and post-treatment phases. XGBoost outperformed the other models in predicting Medicare and OOP expenditures, with the interpretable ML methods highlighting baseline expenditures and chronic conditions as the leading predictors in the pre-diagnosis phase. In contrast, chemotherapy, immunotherapy, and surgery emerged as the leading predictors of expenditures during the treatment and post-treatment phases. Our fairness assessment showed that model accuracy varied by sensitive attribute, yet the models remained predominantly fair in the group and individual fairness assessments. Aim 3 findings indicated high NER performance, with an accuracy of 86% and an F1 score of 87% in extracting HL-related classes and attributes from the free text of posts, demonstrating the potential of X as a valuable preliminary research source for rare diseases such as Hodgkin’s lymphoma.
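
The two fairness ideas named in the abstract, the independence criterion for group fairness and the Flip Test for individual fairness, can be illustrated with a minimal sketch. Everything below is hypothetical and is not the dissertation's code or data: `predict_cost` is a toy stand-in model that is deliberately biased, the records are made up, and the Flip Test is simplified here to the mean change in prediction when the sensitive attribute is switched from the privileged to the underprivileged value.

```python
from statistics import mean


def predict_cost(record):
    """Hypothetical toy cost model (an assumption, not the study's model)."""
    cost = 1000.0 + 50.0 * record["chronic_conditions"]
    # Deliberately depends on the sensitive attribute so the checks
    # below have something to detect.
    if record["sex"] == "male":
        cost += 100.0
    return cost


def flip_test(records, attribute, privileged, underprivileged, model):
    """Mean change in prediction when the sensitive attribute is flipped
    from the privileged to the underprivileged value, all else fixed."""
    diffs = []
    for r in records:
        if r[attribute] != privileged:
            continue
        flipped = dict(r, **{attribute: underprivileged})
        diffs.append(model(flipped) - model(r))
    return mean(diffs) if diffs else 0.0


def independence_gap(records, attribute, privileged, model, threshold):
    """Independence (demographic parity): difference in the rate of
    'high cost' predictions between privileged and all other records."""
    def high_cost_rate(group):
        return mean(1.0 if model(r) >= threshold else 0.0 for r in group)

    priv = [r for r in records if r[attribute] == privileged]
    other = [r for r in records if r[attribute] != privileged]
    return high_cost_rate(priv) - high_cost_rate(other)


# Made-up records for illustration only.
records = [
    {"sex": "male", "chronic_conditions": 2},
    {"sex": "male", "chronic_conditions": 0},
    {"sex": "female", "chronic_conditions": 2},
    {"sex": "female", "chronic_conditions": 0},
]

flip_shift = flip_test(records, "sex", "male", "female", predict_cost)
parity_gap = independence_gap(records, "sex", "male", predict_cost, threshold=1100.0)
print(flip_shift, parity_gap)
```

A model that ignores the sensitive attribute would give a flip shift of zero; a nonzero shift or parity gap, as in this biased toy model, flags a potential fairness concern that would then need closer examination.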

Embargo Reason

Publication Pending

Available for download on Tuesday, April 22, 2025
