Random Survival Forests Using Linked Data to Measure Illness Burden - PDF document

Random Survival Forests Using Linked Data to Measure Illness Burden Among People With Cancer: Development and Internal Validation of the SEER-CAHPS Illness Burden Index

Lisa M. Lines, PhD, MPH; Julia Cohen, MA; Michael T. Halpern, MD, PhD; Erin E. Kent, PhD; Michelle A. Mollica, PhD, MPH, RN

AcademyHealth Annual Research Meeting, June 4, 2019, Washington, DC

Disclosures

• The authors declare no conflicts of interest. Funding for this research was provided to LML, JC, and MTH under National Cancer Institute contract #HHSN-261-2015-00132U.

Background

Linked data:
• SEER (Surveillance, Epidemiology, and End Results): cancer registry data
• Medicare: FFS claims and enrollment data
• CAHPS (Consumer Assessment of Healthcare Providers and Systems): Medicare CAHPS surveys

Questions:
• Could a simple score identify Medicare CAHPS respondents with high medical needs and serious illness burdens?
• Are care experiences associated with illness burdens? (not presented today)

Approach:
• Supervised machine learning: random survival forests (RSF) using R
• Predict 1-year mortality with whatever data are available for each person

Notes:
• SEER-CAHPS includes many kinds of information about people's health, with variables differing by year and group, so analysis presents a challenge
• Many indices that summarize morbidity use claims data (example: NCI Combined)
• More than half of our sample (Medicare Advantage enrollees) does not have claims
• We have self-reported information on a huge range of measures, including validated measures from other instruments (SF-12, PHQ-2) and widely used measures like ADLs
• SEER-CAHPS provides an opportunity to merge information from different data sources to improve our understanding of morbidity burden
• More precise assessment allows more accurate comparisons of illness burden between similar individuals with and without (and before and after) cancer

Regression models vs. RSF methods

• Survival regression models…
– require assumptions (e.g. proportional hazards)
– may fail to converge when there are too many predictors or outliers
– may fail to converge when there are too many interaction terms
– require laborious effort to account for missing data
– may not be able to handle both imputation and survey weights

• RSF…
– handles the proportionality assumption automatically
– non-parametric, makes no assumptions about underlying distribution of values of the predictor variables (can handle skewed and multi-modal distributions)
– can handle hundreds of independent variables
– can identify survival risk factors without prior knowledge of interactions among variables
– is robust to outliers and does not suffer from convergence problems
– identifies the independent variables that best segregate subgroups as important predictors and identifies interactions among independent variables
– uses imputation techniques to account for missing data

Conceptual Model

[Figure: factors feeding into the SEER-CAHPS Illness Burden Index (SCIBI): sociodemographic characteristics, contextual factors, cancer-related morbidity, proxy assistance, self-reported health status, self-reported activity limitations, chronic conditions, and utilization.]

Cohorts and groups

Assessed for eligibility (n=4,483,388)

Excluded:
• Comparison beneficiaries outside SEER areas (n=3,400,754)
• Surveyed outside of 2007-2013 period (n=519,284)
• Comparison beneficiaries w/ self-reported cancer history (n=32,430)
• Missing sample weight (n=3,706)
• Survey date after date of death (n=2,060)
• Missing diagnosis date or diagnosed on/after death (n=225)

Analyzed (n=524,929):
• People with cancer (n=116,735)
– Surveyed before diagnosis (n=31,869): MA (n=16,222), FFS (n=15,647)
– Surveyed after diagnosis (n=84,866): MA (n=42,834), FFS (n=42,032)
• People without cancer (n=408,194): MA (n=216,794), FFS (n=191,400)

Classification: A Simplified Example

Population, split on "Needs Help with Personal Care":
• YES: 65% died
• NO: 38% died

These numbers are for example purposes only!

Steps in RSF process

1. Split each group into annual subsamples
2. In each of those subsamples, take 500 bootstrap samples from the original data
3. Grow a survival tree for each bootstrapped dataset
   a. At each node of the tree, randomly select 10 variables for splitting on
   b. Split on the variable that optimizes the survival splitting criterion
4. Grow the tree to full size with terminal nodes having at least 50 unique cases
   a. Calculate the tree predictor to generate the cumulative hazard estimate (relative risk) of mortality (SCIBI score)
5. Calculate in-bag and out-of-bag (OOB) estimates by averaging the 500 tree predictors
6. Use the OOB estimator to estimate out-of-sample prediction performance
7. Use OOB estimation to calculate variable importance
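The bootstrap, out-of-bag, and cumulative hazard steps above can be illustrated with a minimal sketch. It is not the authors' R code: it assumes the tree predictor is a Nelson-Aalen cumulative hazard estimate (the usual choice in RSF), uses made-up toy data, and replaces each tree with a single unsplit node, so no splitting criterion is applied. All function names are hypothetical.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag) index lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob


def nelson_aalen(times, events, horizon):
    """Nelson-Aalen cumulative hazard at `horizon`.

    times: follow-up times; events: 1 = died, 0 = censored.
    """
    death_times = sorted({t for t, e in zip(times, events) if e == 1 and t <= horizon})
    chf = 0.0
    for t in death_times:
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        at_risk = sum(1 for ti in times if ti >= t)
        chf += deaths / at_risk
    return chf


def ensemble_chf(times, events, horizon, n_trees, rng):
    """Average the estimate over bootstrap samples.

    Simplification: each "tree" is one node containing its whole bootstrap sample.
    """
    n = len(times)
    total = 0.0
    for _ in range(n_trees):
        in_bag, _oob = bootstrap_split(n, rng)
        total += nelson_aalen(
            [times[i] for i in in_bag], [events[i] for i in in_bag], horizon
        )
    return total / n_trees


# Toy data (months of follow-up, death indicator); purely illustrative.
times = [2, 3, 3, 5, 8, 12, 12, 12]
events = [1, 1, 0, 1, 1, 0, 0, 0]

rng = random.Random(0)
in_bag, oob = bootstrap_split(len(times), rng)
chf_all = nelson_aalen(times, events, horizon=12)
chf_forest = ensemble_chf(times, events, horizon=12, n_trees=500, rng=rng)
print(round(chf_all, 4), round(chf_forest, 4))
```

In a real forest the cases left out of each bootstrap sample (`oob` here) are dropped down that tree, and their predictions are averaged only over the trees that did not see them, which is what makes the OOB estimate an out-of-sample measure.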

Ability of SCIBI Scores to Differentiate 12-month Mortality Risk

Columns: people with cancer in SEER surveyed pre-diagnosis (Pre-dx) or post-diagnosis (Post-dx), and people without cancer in SEER (No cancer), each split into MA and FFS.

Measure                                    Pre-dx MA  Pre-dx FFS  Post-dx MA  Post-dx FFS  No cancer MA  No cancer FFS
N                                          16,222     15,647      42,834      42,032       216,794       191,400
Died within 12 months of survey, all years 6%         5%          7%          6%           3%            3%
Died, bottom 25th percentile               0%         0%          0%          0%           0%            0%
Died, 99th percentile                      95%        99%         96%         100%         98%           97%
Error rate                                 37%        11%         25%         9%           23%           12%
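The slides do not state how the error rate is computed. RSF software such as the randomForestSRC package in R reports it as 1 minus Harrell's concordance index, so the sketch below assumes that convention. The data and the `harrell_c` helper are illustrative, and pairs with tied follow-up times are simply skipped.

```python
def harrell_c(times, events, risk):
    """Harrell's concordance index for right-censored data (simplified).

    A pair (i, j) is usable when subject i died strictly before subject j's
    follow-up time. It is concordant when i also has the higher predicted risk;
    tied risks count as half.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable


# Toy data: follow-up time, death indicator, and a predicted risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]

c_index = harrell_c(times, events, risk)
error_rate = 1.0 - c_index
print(round(c_index, 2), round(error_rate, 2))
```

Under this convention an error rate of 50% corresponds to random ordering of risk and 0% to perfect ordering, which is one way to read the gap between the MA and FFS columns.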

Variable Importance (x100)

Top 5 factors by group, with VIMP in parentheses.

People with cancer, surveyed pre-diagnosis, MA:
1. Age (18)
2. Social activity limitations (9)
3. SF12 - mental (7)
4. Mental health (5)
5. Pain (5)

People with cancer, surveyed pre-diagnosis, FFS:
1. Any hospice (186)
2. # inpatient stays (31)
3. Any SNF (10)
4. Age (3)
5. Wheelchair (2)

People with cancer, surveyed post-diagnosis, MA:
1. General health (21)
2. SF12 - physical (15)
3. Cancer stage (14)
4. Age (13)
5. Needs help w/ personal care (10)

People with cancer, surveyed post-diagnosis, FFS:
1. Any hospice (163)
2. # inpatient stays (30)
3. Any SNF (8)
4. Any DME (4)
5. Needs help w/ personal care (3)

People without cancer, MA:
1. Age (44)
2. Needs help w/ personal care (13)
3. Proxy (12)
4. General health (8)
5. ADL - bathing (7)

People without cancer, FFS:
1. Any hospice (93)
2. # inpatient stays (36)
3. Age (12)
4. Any SNF (8)
5. Lethargy (3)

Color key: Gold: self-report; Blue: Medicare claims; Gray: SEER; Green: Medicare enrollment data

Notes:
• VIMP is a numerical indicator of how important a variable is to the classification algorithm
• It is based on the increase (or decrease) in misclassification error on the test data if the variable were not available
• We report the top 5 most influential variables (ranked) and their VIMP for variables included in any random survival forest (RSF) within that group
• Provides important new information about what factors influence mortality risk in our sample
• Results are shaded based on whether data were from claims, registry, or survey data
• For FFS beneficiaries, at least half of the variables ranked most important came from claims data, and self-reported variables were less important
• Among MA enrollees, self-reported information, such as needing help with routine tasks, provided most of the information
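One common way to approximate "error if the variable were not available" is permutation importance: shuffle one variable, re-score, and take the change in error. The sketch below assumes that approach and an error rate of 1 minus Harrell's C; the slides do not confirm either choice. The data, the variable names, and `toy_risk` (a stand-in for a fitted forest) are hypothetical.

```python
import random


def error_rate(times, events, risk):
    """1 - Harrell's C; pairs with tied follow-up times are skipped."""
    concordant, usable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return 1.0 - concordant / usable


def toy_risk(row):
    # Hypothetical stand-in for a fitted forest; it ignores "noise" entirely.
    return 0.05 * row["age"] + 2.0 * row["needs_help"]


def permutation_vimp(rows, times, events, variable, rng):
    """Change in error after shuffling one variable across subjects."""
    baseline = error_rate(times, events, [toy_risk(r) for r in rows])
    shuffled = [r[variable] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [dict(r, **{variable: v}) for r, v in zip(rows, shuffled)]
    permuted = error_rate(times, events, [toy_risk(r) for r in permuted_rows])
    return permuted - baseline


# Toy data, purely illustrative.
rows = [
    {"age": 88, "needs_help": 1, "noise": 3},
    {"age": 81, "needs_help": 1, "noise": 7},
    {"age": 79, "needs_help": 0, "noise": 1},
    {"age": 74, "needs_help": 0, "noise": 9},
    {"age": 70, "needs_help": 0, "noise": 4},
    {"age": 66, "needs_help": 0, "noise": 2},
]
times = [2, 5, 7, 9, 11, 12]
events = [1, 1, 1, 0, 1, 0]

rng = random.Random(0)
vimp = {v: permutation_vimp(rows, times, events, v, rng) for v in ("age", "needs_help", "noise")}
print({k: round(100 * v, 1) for k, v in vimp.items()})  # scaled x100, as on the slide
```

A variable the model never uses, such as `noise` here, gets a VIMP of exactly zero because shuffling it leaves every prediction unchanged; negative values can occur when shuffling happens to lower the error.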

Caveats and limitations

• Medicare CAHPS has relatively low response rates (<50%)
– Used weights to account for non-response
• Hard to compare results with regression-based approaches
– Literature has other examples
• Error rates much higher for MA enrollees
– FFS error rates are comparable to or better than prior studies
• Care experience measures do not necessarily correlate with other quality measures

Conclusions

• Among more than 500,000 Medicare beneficiaries, SCIBI scores were relatively accurate as measured by the overall error rate (20%)
– Individuals in the 99th percentile of the score had an average mortality rate of 97%
• SCIBI = omnibus measure summarizing functional and health status, other conditions, and utilization associated with high medical need, serious illness, frailty, and end of life
• Future research needed:
– associations between site-specific risk indicators and SCIBI measures?
– consistency between markers of cancer burden and the care experiences of people with cancer?

Notes:
• This score is available for people both with and without cancer, so that accurate comparisons can be made between populations

Comments or Questions?

Email: LLines@RTI.org
More information about SEER-CAHPS: healthcaredelivery.cancer.gov/seer-cahps/
