Random Survival Forests Using Linked Data to Measure Illness Burden Among People With Cancer: Development and Internal Validation of the SEER-CAHPS Illness Burden Index

Lisa M. Lines, PhD, MPH
Julia Cohen, MA
Michael T. Halpern, MD, PhD
Erin E. Kent, PhD
Michelle A. Mollica, PhD, MPH, RN

AcademyHealth Annual Research Meeting
June 4, 2019 – Washington, DC
Disclosures

The authors declare no conflicts of interest. Funding for this research was provided to LML, JC, and MTH under National Cancer Institute contract #HHSN-261-2015-00132U.
Background

Linked data:
• SEER (Surveillance, Epidemiology, and End Results): cancer registry data
• Medicare: FFS claims and enrollment data
• CAHPS (Consumer Assessment of Healthcare Providers and Systems): Medicare surveys

Questions:
• Could a simple score identify Medicare CAHPS respondents with high medical needs and serious illness burdens?
• Are care experiences associated with illness burdens? (not presented today)

Approach:
• Supervised machine learning: random survival forests (RSF) using R
• Predict 1-year mortality with whatever data are available for each person

Notes:
• SEER-CAHPS includes many kinds of information about people’s health, with variables differing by year and group – analysis presents a challenge
• Many indices that summarize morbidity use claims data – example: NCI Combined
• More than half of our sample (Medicare Advantage enrollees) does not have claims
• We have self-reported information on a huge range of measures, including validated measures from other instruments (SF-12, PHQ-2) and widely used measures like ADLs
• SEER-CAHPS provides an opportunity to merge information from different data sources to improve our understanding of morbidity burden
• More precise assessment allows more accurate comparisons of illness burden between similar individuals with and without (and before and after) cancer
Regression models vs. RSF methods

RSF…
– does not rely on the proportional hazards assumption
– is non-parametric and makes no assumptions about the underlying distribution of the predictor variables (can handle skewed and multi-modal distributions)
– can handle hundreds of independent variables
– can identify survival risk factors without prior knowledge of interactions among variables
– is robust to outliers and does not suffer from convergence problems
– identifies the independent variables that best separate subgroups as important predictors, and identifies interactions among independent variables
– uses imputation techniques to account for missing data

Survival regression models…
– require assumptions (e.g., proportional hazards)
– may fail to converge when there are too many predictors or outliers
– may fail to converge when there are too many interaction terms
– require laborious effort to account for missing data
– may not be able to handle both imputation and survey weights
Conceptual Model

Domains contributing to the SEER-CAHPS Illness Burden Index (SCIBI):
• Sociodemographic characteristics
• Contextual factors
• Cancer-related morbidity
• Proxy assistance
• Self-reported activity limitations
• Self-reported health status
• Chronic conditions
• Utilization
Cohorts and groups

Assessed for eligibility (n=4,483,388)

Excluded:
• Comparison beneficiaries outside SEER areas (n=3,400,754)
• Surveyed outside of 2007-2013 period (n=519,284)
• Comparison beneficiaries w/ self-reported cancer history (n=32,430)
• Missing sample weight (n=3,706)
• Survey date after date of death (n=2,060)
• Missing diagnosis date or diagnosed on/after death (n=225)

Analyzed (n=524,929)
• People with cancer (n=116,735)
  – Surveyed before diagnosis (n=31,869): MA (n=16,222), FFS (n=15,647)
  – Surveyed after diagnosis (n=84,866): MA (n=42,834), FFS (n=42,032)
• People without cancer (n=408,194): MA (n=216,794), FFS (n=191,400)
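The counts in the cohort flow can be cross-checked with simple arithmetic. The sketch below is an illustration only, written in Python rather than the R used for the analysis. It assumes that n=4,483,388 is the number assessed for eligibility (the extracted slide text attaches that count ambiguously) and that the six listed exclusions are mutually exclusive; the variable names are made up for this example.

```python
# Assumption: 4,483,388 is the number assessed for eligibility.
assessed = 4_483_388

# Exclusion counts as listed on the slide (assumed mutually exclusive).
excluded = {
    "Comparison beneficiaries outside SEER areas": 3_400_754,
    "Surveyed outside of 2007-2013 period": 519_284,
    "Comparison beneficiaries w/ self-reported cancer history": 32_430,
    "Missing sample weight": 3_706,
    "Survey date after date of death": 2_060,
    "Missing diagnosis date or diagnosed on/after death": 225,
}

# The six analytic groups: (cancer status, survey timing, coverage) -> n
groups = {
    ("cancer", "pre-diagnosis", "MA"): 16_222,
    ("cancer", "pre-diagnosis", "FFS"): 15_647,
    ("cancer", "post-diagnosis", "MA"): 42_834,
    ("cancer", "post-diagnosis", "FFS"): 42_032,
    ("no cancer", "n/a", "MA"): 216_794,
    ("no cancer", "n/a", "FFS"): 191_400,
}

total_excluded = sum(excluded.values())
analyzed = assessed - total_excluded
with_cancer = sum(n for (status, _, _), n in groups.items() if status == "cancer")
without_cancer = sum(n for (status, _, _), n in groups.items() if status == "no cancer")

print(total_excluded, analyzed, with_cancer, without_cancer)
```

Under these assumptions the exclusions and the six groups reconcile with the analyzed total of 524,929 reported on the slide.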
Classification: A Simplified Example

Population, split on "Needs Help with Personal Care":
• YES: 65% died
• NO: 38% died

These numbers are for example purposes only!
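A single split like the one in this example can be sketched in a few lines. The records below are hypothetical counts chosen only to reproduce the illustrative 65% and 38% figures; they are not study data, and the function name is made up for this sketch (Python is used for illustration, not the authors' R code).

```python
# Hypothetical records: (needs_help_with_personal_care, died_within_12_months)
records = (
    [(True, True)] * 13 + [(True, False)] * 7      # 20 people who need help
    + [(False, True)] * 19 + [(False, False)] * 31  # 50 people who do not
)


def mortality_by_branch(records):
    """Return the share who died in the YES and NO branches of the split."""
    rates = {}
    for needs_help in (True, False):
        branch = [died for helped, died in records if helped == needs_help]
        rates[needs_help] = sum(branch) / len(branch)
    return rates


rates = mortality_by_branch(records)
print(f"YES: {rates[True]:.0%} died, NO: {rates[False]:.0%} died")
```

A survival tree repeats this kind of split recursively, choosing at each node the variable that best separates groups with different survival.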
Steps in RSF process

1. Split each group into annual subsamples
2. In each of those subsamples, take 500 bootstrap samples from the original data
3. Grow a survival tree for each bootstrapped dataset
   a. At each node of the tree, randomly select 10 variables for splitting on
   b. Split on the variable that optimizes the survival splitting criterion
4. Grow the tree to full size with terminal nodes having at least 50 unique cases
   a. Calculate the tree predictor to generate the cumulative hazard estimate (relative risk) of mortality (SCIBI score)
5. Calculate in-bag and out-of-bag (OOB) estimates by averaging the 500 tree predictors
6. Use the OOB estimator to estimate out-of-sample prediction performance
7. Use OOB estimation to calculate variable importance
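Three building blocks of the steps above can be sketched briefly: the bootstrap draw with its out-of-bag (OOB) cases, the random choice of candidate variables at a node, and a cumulative hazard estimate for one terminal node. This is a minimal Python illustration, not the authors' R code. The values 500, 10, and 50 come from the slide; the function names, the variable names `x0`…`x39`, and the toy follow-up data are hypothetical. The Nelson-Aalen estimator is assumed here because it is the usual choice in RSF; the slide does not name the estimator, and tree growing itself is not shown.

```python
import random

N_TREES = 500    # bootstrap samples per annual subsample (from the slide)
MTRY = 10        # variables tried at each node (from the slide)
NODE_SIZE = 50   # minimum unique cases per terminal node (from the slide)


def bootstrap_split(n_cases, rng):
    """Draw one bootstrap sample; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    oob = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, oob


def candidate_variables(variable_names, rng, mtry=MTRY):
    """Randomly pick the variables considered for splitting at one node."""
    return rng.sample(variable_names, min(mtry, len(variable_names)))


def nelson_aalen(times, events):
    """Cumulative hazard for the cases in one terminal node.

    times  -- follow-up time for each case
    events -- 1 if the case died at that time, 0 if censored
    Returns a list of (event_time, cumulative_hazard) pairs.
    """
    hazard, curve = 0.0, []
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        at_risk = sum(1 for ti in times if ti >= t)
        hazard += deaths / at_risk
        curve.append((t, hazard))
    return curve


rng = random.Random(0)
in_bag, oob = bootstrap_split(1000, rng)
oob_fraction = len(oob) / 1000           # roughly 37% of cases are OOB
names = [f"x{i}" for i in range(40)]     # hypothetical variable names
tried = candidate_variables(names, rng)
toy_curve = nelson_aalen([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Because each bootstrap sample leaves out roughly a third of the cases, every tree has its own held-out set, which is what makes OOB error and OOB variable importance possible without a separate test sample.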
Ability of SCIBI Scores to Differentiate 12-month Mortality Risk

                                                          People with cancer in SEER                      People without cancer in SEER
                                                          Surveyed pre-diagnosis  Surveyed post-diagnosis
                                                          MA          FFS         MA          FFS         MA          FFS
N                                                         16,222      15,647      42,834      42,032      216,794     191,400
Percent who died within 12 months of survey in all years  6%          5%          7%          6%          3%          3%
Percent who died in bottom 25th percentile                0%          0%          0%          0%          0%          0%
Percent who died in 99th percentile                       95%         99%         96%         100%        98%         97%
Error rate                                                37%         11%         25%         9%          23%         12%
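The slide does not define "error rate". RSF software commonly reports it as 1 minus Harrell's concordance index (C-index) computed on OOB predictions, so that 50% corresponds to random guessing and 0% to perfect ranking. Assuming that convention applies here, the calculation can be sketched as follows. This is an illustrative Python version, not the authors' R code; the function names and toy data are hypothetical, and pairs with tied follow-up times are simply skipped for brevity.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk score.

    A pair (i, j) is comparable when case i died (events[i] == 1) strictly
    before case j's follow-up time. It is concordant when case i also has
    the higher risk score; tied scores count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def error_rate(times, events, risk_scores):
    """Prediction error under the assumed convention: 1 - C."""
    return 1.0 - concordance_index(times, events, risk_scores)


# Hypothetical toy data: months of follow-up, death indicator, risk score
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
perfect = error_rate(toy_times, toy_events, [0.9, 0.7, 0.4, 0.1])
reversed_ = error_rate(toy_times, toy_events, [0.1, 0.4, 0.7, 0.9])
uninformative = error_rate(toy_times, toy_events, [0.5, 0.5, 0.5, 0.5])
```

Read this way, the FFS error rates of 9% to 12% indicate that the score ranks most comparable pairs correctly, while the 37% for MA enrollees surveyed pre-diagnosis is closer to chance.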
Variable Importance (x100)

People with cancer, surveyed pre-diagnosis
Rank  MA factor                     VIMP  FFS factor                    VIMP
1     Age                           18    Any hospice                   186
2     Social activity limitations   9     # inpatient stays             31
3     SF12 - mental                 7     Any SNF                       10
4     Mental health                 5     Age                           3
5     Pain                          5     Wheelchair                    2

People with cancer, surveyed post-diagnosis
Rank  MA factor                     VIMP  FFS factor                    VIMP
1     General health                21    Any hospice                   163
2     SF12 - physical               15    # inpatient stays             30
3     Cancer stage                  14    Any SNF                       8
4     Age                           13    Any DME                       4
5     Needs help w/ personal care   10    Needs help w/ personal care   3

People without cancer
Rank  MA factor                     VIMP  FFS factor                    VIMP
1     Age                           44    Any hospice                   93
2     Needs help w/ personal care   13    # inpatient stays             36
3     Proxy                         12    Age                           12
4     General health                8     Any SNF                       8
5     ADL - bathing                 7     Lethargy                      3

Gold – self-report; Blue – Medicare claims; Gray – SEER; Green – Medicare enrollment data

Notes:
• VIMP is a numerical indicator of how important a variable is to the classification algorithm
• Based on the increase (or decrease) in misclassification error on the test data if the variable were not available
• We report the top 5 most influential variables (ranked) and their VIMP for variables included in any random survival forest (RSF) within that group
• Provides important new information about what factors influence mortality risk in our sample
• Results are shaded based on whether data were from claims, registry, or survey data
• For FFS beneficiaries, at least half of the variables ranked most important came from claims data, and self-reported variables were less important
• Among MA enrollees, self-reported information, such as needing help with routine tasks, provided most of the information
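The idea behind VIMP, the change in prediction error when a variable's information is removed, can be illustrated with a permutation sketch. This is a simplified Python illustration, not the authors' R code: it uses plain misclassification error on a toy rule instead of survival prediction error on OOB cases, and the data, the `shoe_size` variable, and the function names are all hypothetical.

```python
import random


def misclassification_error(outcomes, predictions):
    """Share of cases where the prediction differs from the outcome."""
    wrong = sum(1 for y, p in zip(outcomes, predictions) if y != p)
    return wrong / len(outcomes)


def permutation_vimp(rows, outcomes, predict, variable, rng):
    """Error after shuffling one variable minus the baseline error."""
    baseline = misclassification_error(outcomes, [predict(r) for r in rows])
    shuffled = [r[variable] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [{**r, variable: v} for r, v in zip(rows, shuffled)]
    permuted = misclassification_error(
        outcomes, [predict(r) for r in permuted_rows]
    )
    return permuted - baseline


# Hypothetical held-out cases and a hypothetical fitted rule
rows = [
    {"age": a, "shoe_size": s}
    for a, s in [(66, 9), (70, 7), (74, 10), (79, 8),
                 (83, 9), (88, 7), (91, 10), (95, 8)]
]
outcomes = [0, 0, 0, 0, 1, 1, 1, 1]


def toy_predict(row):
    """Toy rule: predict death (1) for people aged 80 or older."""
    return 1 if row["age"] >= 80 else 0


rng = random.Random(42)
# Multiply by 100 to match the "(x100)" scaling used on the slide.
vimp_age = 100 * permutation_vimp(rows, outcomes, toy_predict, "age", rng)
vimp_shoe = 100 * permutation_vimp(rows, outcomes, toy_predict, "shoe_size", rng)
```

A variable the model never uses (here `shoe_size`) has a VIMP of zero, because shuffling it cannot change any prediction; a variable the model depends on cannot lower the error of an already perfect toy rule, so its VIMP is zero or positive.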
Caveats and limitations

• Medicare CAHPS has relatively low response rates (<50%)
  – Used weights to account for non-response
• Hard to compare results with regression-based approaches
  – Literature has other examples
• Error rates much higher for MA enrollees
  – FFS error rates are comparable to or better than prior studies
• Care experience measures do not necessarily correlate with other quality measures
Conclusions

• Among more than 500,000 Medicare beneficiaries, SCIBI scores were relatively accurate as measured by the overall error rate (20%)
  – Individuals in the 99th percentile of the score had an average mortality rate of 97%
• SCIBI = omnibus measure summarizing functional and health status, other conditions, and utilization associated with high medical need, serious illness, frailty, and end of life
• Future research needed:
  – associations between site-specific risk indicators and SCIBI measures?
  – consistency between markers of cancer burden and the care experiences of people with cancer?

Notes:
• This score is available for people both with and without cancer, so that accurate comparisons can be made between populations
Comments or Questions?

Email LLines@RTI.org

More information about SEER-CAHPS: healthcaredelivery.cancer.gov/seer-cahps/