A k-means approach to clustering disease progressions Duc Thanh Anh Luong Varun Chandola Department of Computer Science & Engineering University at Buffalo IEEE ICHI 2017 August 26, 2017
Outline • Motivation • K-means approach • An application for Chronic Kidney Disease • Generating patient-specific disease profiles
Motivation • Find subgroups of patients that have similar disease progressions • Identify the underlying mechanism of each subgroup • Provide better treatment for each subgroup
Motivation • Different patients have different disease progressions • Consider the case of Chronic Kidney Disease • Are there a few general trends of disease progression? • Can we group patients by their progressions into a few groups? [Figure: eGFR vs. days from first clinical record for ten patients, labeled by patient ID]
Motivation • Trajectories of 500 patients • Trajectories after being clustered [Figure: eGFR vs. days from first clinical record]
Clustering problem and k-means algorithm • Cluster a set of data points into k clusters • Can be solved by K-means approach Bishop, Christopher M. Pattern recognition and machine learning . Springer, 2006.
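For reference, the standard k-means objective as given in the cited Bishop text can be written as below. Here $x_n$ are the data points, $\mu_k$ the cluster centroids, and $r_{nk} \in \{0, 1\}$ indicates whether point $n$ is assigned to cluster $k$; this is the textbook form, not necessarily the exact objective used for disease progressions in this talk.

```latex
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^{2}
```

K-means alternates between minimizing $J$ over the assignments $r_{nk}$ with the centroids fixed and over the centroids $\mu_k$ with the assignments fixed.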
K-means approach • Data object: patient disease progression (eGFR vs. days from first clinical record) • Distance metric: between a patient's eGFR measurements and a centroid (figure) • Centroid: regression line [Figures: eGFR vs. days from first clinical record]
K-means approach • Initial step: randomly assign patients into k clusters • Update step: perform regression for each cluster to obtain its “centroid” • Assignment step: assign each patient to the cluster that has the closest centroid • Did any patient move to another group? Yes: return to the update step. No: end.
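The initial, update, and assignment steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the centroid is taken to be an ordinary least-squares line fitted to the pooled measurements of a cluster, and the distance is assumed to be the mean squared residual of a patient's measurements to that line. The function names (`fit_line`, `distance`, `cluster_progressions`) are hypothetical.

```python
import random

def fit_line(points):
    """Least-squares line through (day, egfr) points; returns (slope, intercept)."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        return 0.0, mean_y  # all measurements on the same day: flat line
    slope = sum((t - mean_t) * (y - mean_y) for t, y in points) / var_t
    return slope, mean_y - slope * mean_t

def distance(patient, line):
    """Assumed metric: mean squared residual of a patient's points to the line."""
    slope, intercept = line
    return sum((y - (slope * t + intercept)) ** 2 for t, y in patient) / len(patient)

def cluster_progressions(patients, k, max_iter=100, seed=0):
    """patients: list of trajectories, each a list of (day, egfr) tuples."""
    rng = random.Random(seed)
    # Initial step: random assignment of patients to k clusters.
    labels = [rng.randrange(k) for _ in patients]
    lines = [None] * k
    for _ in range(max_iter):
        # Update step: one regression per cluster on its pooled measurements.
        for c in range(k):
            pooled = [pt for p, lab in zip(patients, labels) if lab == c for pt in p]
            if pooled:
                lines[c] = fit_line(pooled)  # an empty cluster keeps its old line
        # Assignment step: move each patient to the closest available centroid.
        usable = [c for c in range(k) if lines[c] is not None]
        new_labels = [min(usable, key=lambda c: distance(p, lines[c])) for p in patients]
        if new_labels == labels:  # no patient moved: stop
            break
        labels = new_labels
    return labels, lines
```

Calling `cluster_progressions(patients, k=10)` returns one label per patient and one `(slope, intercept)` pair per non-empty cluster. Because the result depends on the random initial assignment, several seeds would normally be tried.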
Dataset & Preprocessing • DARTNet patients (n = 69,817) • Excluded: invalid birth year and sex value (n = 6,418); number of serum creatinine records < 1 (n = 181); invalid data records (n = 9) • “Preprocessed” DARTNet patients (n = 63,209) • Having eGFR values less than 60 for more than three months (n = 29,585) • Excluded: observation duration < 1 year (n = 5,285); number of serum creatinine records < 10 (n = 17,158) • Final CKD cohort (n = 7,142)
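The last three cohort criteria can be expressed as a per-patient filter. The sketch below is illustrative only and rests on several assumptions not stated on the slide: each patient is a list of `(day, egfr)` tuples, the number of eGFR values stands in for the number of serum creatinine records, and "less than 60 for more than three months" is read as the first and last sub-60 measurements being more than 90 days apart. The function name and parameters are hypothetical.

```python
def in_ckd_cohort(records, min_records=10, min_days=365,
                  low_egfr=60, low_span_days=90):
    """records: list of (day, egfr) tuples for one patient, in any order."""
    # Assumption: one eGFR value per serum creatinine record.
    if len(records) < min_records:
        return False
    days = [d for d, _ in records]
    # Observation duration must be at least one year.
    if max(days) - min(days) < min_days:
        return False
    # Assumed reading of the CKD criterion: sub-60 values spanning > 90 days.
    low_days = [d for d, e in records if e < low_egfr]
    return bool(low_days) and max(low_days) - min(low_days) > low_span_days
```

A cohort would then be built with something like `[p for p in patients if in_ckd_cohort(p)]`.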
Clustering result
Demographic distribution in clusters [Figure: age by cluster, Clusters 1 to 10. Cluster size labels: 0.73% (52), 4.56% (326), 5.91% (422), 9.49% (678), 14.86% (1061), 11.55% (825), 13.72% (980), 9.61% (686), 14.59% (1042), 14.98% (1070)]
Other clinical markers
Generating patient-specific disease profiles • Residuals • Gaussian processes [Figure: eGFR vs. days from first clinical record] Rasmussen, Carl Edward, and Christopher K. I. Williams. Gaussian processes for machine learning. Vol. 1. Cambridge: MIT Press, 2006.
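One plausible reading of this slide is that a patient's residuals from the cluster trajectory are modeled with a Gaussian process, and the individual prediction is the cluster line plus the predicted residual. The sketch below follows that reading under loudly stated assumptions: a zero-mean GP, a squared-exponential kernel, hand-picked hyperparameters (`length`, `var`, `noise`), and limits of two standard deviations. None of these choices come from the talk, and all function names are hypothetical.

```python
import math

def rbf(a, b, length=200.0, var=25.0):
    """Squared-exponential kernel; length and var are assumed values."""
    return var * math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_predict(days, residuals, query_days, noise=1.0):
    """Posterior mean and variance of a zero-mean GP at query_days."""
    # Kernel matrix with the noise variance added on the diagonal.
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(days)]
         for i, a in enumerate(days)]
    alpha = solve(K, residuals)
    means, variances = [], []
    for q in query_days:
        k_star = [rbf(q, d) for d in days]
        v = solve(K, k_star)
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        variances.append(rbf(q, q) - sum(k * w for k, w in zip(k_star, v)))
    return means, variances

def predict_egfr(line, days, egfr, query_day, noise=1.0):
    """Cluster line plus GP-predicted residual, with +/- 2 sd limits."""
    slope, intercept = line
    residuals = [y - (slope * t + intercept) for t, y in zip(days, egfr)]
    means, variances = gp_predict(days, residuals, [query_day], noise)
    center = slope * query_day + intercept + means[0]
    half_width = 2.0 * math.sqrt(max(variances[0], 0.0))
    return center, center - half_width, center + half_width
```

For example, `predict_egfr((-0.01, 40.0), [0, 150, 300, 450, 600], [42.0, 40.5, 39.0, 37.0, 35.5], 700)` returns a predicted eGFR at day 700 with lower and upper limits. Far from any observation the GP mean returns to zero, so the prediction falls back to the cluster trajectory.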
Generating patient-specific disease profiles [Figure: Patient 391, Cluster 5; eGFR vs. day; legend: cluster's trajectory, individual predicted trajectory, upper and lower limit, actual eGFR value]
Conclusion & Future Work • Clustering disease progressions – k-means approach • Generating individual prediction – Gaussian processes • Extend the approach to cope with multiple clinical markers • Give quantitative evaluation of clusters: tightness, separation
Thank you