Big Data Algorithms with Medical Applications


  1. Big Data Algorithms with Medical Applications Yixin Chen

  2. Outline Challenges to big data algorithms Clinical Big Data Our new algorithms

  3. Small data vs. Big data

  4. Small data vs. Big data: general laws vs. specific laws

  5. Small data vs. Big data: causality vs. association; domain knowledge vs. data knowledge

  6. Small data vs. Big data (chart: model quality as a function of data size, comparing models on small data and big data)

  7. Modeling techniques: parametric vs. non-parametric, trading off efficiency, accuracy, and interpretability

  8. Efficiency of big data models. High efficiency comes from: parallelization (constant speedup); algorithmic improvements (e.g., O(N^2) instead of O(N^3)). Example: large-scale manifold learning, Maximum Variance Correction (Chen et al., ICML'13)

  9. Outline Challenges to big data algorithms Clinical Big Data Our new algorithms

  10. The need for clinical prediction. The direct ICU costs per day for survivors are six to seven times those for non-ICU care. Unlike patients in ICUs, general hospital ward (GHW) patients are not under extensive electronic monitoring and nurse care. Clinical studies have found that 4–17% of patients undergo cardiopulmonary or respiratory arrest while in the GHW of a hospital.

  11. Goal: let data speak! Sudden deteriorations (e.g., septic shock, cardiopulmonary or respiratory arrest) of GHW patients can be severe and life threatening. Goal: provide early detection and intervention based on data mining to prevent these serious, often life-threatening events, using both clinical data and wireless body sensor data. An NSF/NIH-funded clinical trial at Washington University / Barnes-Jewish Hospital.

  12. Clinical data: high-dimensional, real-time, time-series data. 34 vital signs: pulse, temperature, oxygen saturation, shock index, respirations, blood pressure, … (figure: vital-sign time series, time in seconds)

  13. Previous work. Medical data mining combines machine learning methods (decision trees, neural networks, SVMs) with medical knowledge; clinical scores such as the APACHE (Acute Physiology And Chronic Health Evaluation) score, the Modified Early Warning Score (MEWS), SCAP, and PSI are used, e.g., to predict renal failures. Main problem: most previous work uses a snapshot method that takes all the features at a given time as input to a model, discarding the temporal evolution of the data.

  14. Machine learning task. Challenges: classification of high-dimensional time-series data; irregular data gaps; measurement errors; class imbalance (bar chart: non-ICU instances vastly outnumber ICU instances).

  15. Solution based on existing techniques: temporal feature extraction; bootstrap aggregating (bagging); exploratory under-sampling; feature selection; exponential moving average smoothing; basic classifier (Mao et al., KDD'12)

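The exponential moving average smoothing step in the pipeline above can be sketched in a few lines. The smoothing factor alpha and the recursive form are illustrative assumptions; the slides do not give the exact settings used:

```python
import numpy as np

def ema_smooth(series, alpha=0.3):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    Recent readings dominate, so the smoothed vital sign tracks trends
    while damping sensor noise and measurement errors.
    """
    out = np.empty(len(series))
    out[0] = series[0]
    for t in range(1, len(series)):
        out[t] = alpha * series[t] + (1 - alpha) * out[t - 1]
    return out

# Example: smooth a noisy synthetic heart-rate trace
noisy = 80 + 10 * np.sin(np.linspace(0, 6, 200)) \
        + np.random.default_rng(0).normal(0, 5, 200)
smooth = ema_smooth(noisy, alpha=0.2)
```

Smoothing like this reduces the influence of isolated measurement errors before temporal features are extracted.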

  17. Desired classifier properties: nonlinear classification ability; interpretability; support for mixed data types; efficiency; multi-class classification. Linear SVM and logistic regression are interpretable and efficient, but linear. SVM with RBF kernels is nonlinear, but not interpretable and inefficient.

  18. Desired Classifier Properties

      Property                              Kernel SVM  Linear SVM  kNN  NB  LR  NN
      Nonlinear classification ability          Y           N        Y    N   N   Y
      Interpretability                          N           Y        N    Y   Y   N
      Direct support for mixed data types       Y           Y        N    N   N   N
      Efficiency                                N           Y        Y    Y   Y   N
      Multi-class classification                Y           Y        Y    Y   N   N

  19. Random kitchen sinks (RKS): a random, nonlinear feature transformation followed by a parametric, linear classifier.
      1. Transform each input x into exp(−i w_k^T x), k = 1, …, K, with w_k ~ a Gaussian distribution p(w).
      2. Learn a linear model ∑_k α_k exp(−i w_k^T x).
      Theory: based on the Fourier transform, RKS converges to the RBF-SVM for large K. Efficient, but not interpretable.
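A minimal numpy sketch of the RKS construction, using the equivalent real-valued cosine features in place of the complex exponential; the bandwidth sigma and the feature count K are illustrative choices, not values from the slides:

```python
import numpy as np

def random_fourier_features(X, K=500, sigma=1.0, seed=0):
    """Map X (n, d) to K random features whose inner products approximate
    the RBF kernel exp(-||x - x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(d, K))  # w_k ~ Gaussian p(w)
    b = rng.uniform(0.0, 2 * np.pi, size=K)        # random phases
    return np.sqrt(2.0 / K) * np.cos(X @ W + b)

# The feature inner product converges to the RBF kernel as K grows,
# so a linear model over these features approximates an RBF-SVM.
X = np.array([[0.0, 0.0], [0.5, -0.5]])
Z = random_fourier_features(X, K=20000)
approx = float(Z[0] @ Z[1])
exact = float(np.exp(-np.sum((X[0] - X[1]) ** 2) / 2.0))
```

Training a linear classifier on Z then costs time linear in n and K, which is the efficiency claim on the slide.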

  20. Outline Challenges to big data algorithms Clinical Big Data Our new algorithms

  21. Key idea: a hybrid model. A non-parametric, nonlinear feature transformation followed by a parametric, linear classifier, combining nonlinearity, efficiency, and interpretability.

  22. Desired Classifier Properties

      Property                              Kernel SVM  Linear SVM  kNN  NB  LR  NN  DLR
      Nonlinear classification ability          Y           N        Y    N   N   Y   Y
      Interpretability                          N           Y        N    Y   Y   N   Y
      Direct support for mixed data types       Y           Y        N    N   N   N   Y
      Efficiency                                N           Y        Y    Y   Y   N   Y
      Multi-class classification                Y           Y        Y    Y   N   N   Y

      DLR: Density-based Logistic Regression (Chen et al., KDD'13)

  23. Logistic regression. Each instance has D features, x = (x_1, …, x_D). Assume P(y = 1 | x) = τ(x) = 1 / (1 + exp(−(w_0 + ∑_d w_d φ_d(x)))). Given a training dataset {(x_i, y_i)}, optimization maximizes the overall log likelihood.

  24. Problem with linear models: what should φ_d(x) be? If we set φ_d(x) = x_d, the model is linear in the raw features and cannot capture nonlinear structure.

  25. Insights on τ(x). By Bayes' rule, τ(x) = P(y=1|x) = P(x|y=1)P(y=1) / (P(x|y=1)P(y=1) + P(x|y=0)P(y=0)). On the other hand, LR assumes τ(x) = 1 / (1 + exp(−(w_0 + ∑_d w_d φ_d(x)))). Hence: w_0 + ∑_d w_d φ_d(x) = ln [P(y=1|x) / P(y=0|x)], the posterior log-odds.

  26. Factorization in DLR. Assumption: the posterior log-odds decomposes additively over dimensions, a naive-Bayes-style factorization in which each feature x_d contributes its own term.

  27. DLR feature transformation: φ_d(x) is defined from the per-dimension posterior log-odds and is an increasing function of P(y = 1 | x_d).

  28. Conditional probability estimation. Categorical x_d: estimated from empirical counts. Numerical x_d: kernel density estimation (a smoothed histogram).

  29. Kernel density estimation. Given a training dataset {x_i}, the density estimate is P̂(x) = (1 / (n h)) ∑_i K((x − x_i) / h), where K is a kernel function and h is the kernel bandwidth.
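The estimate above can be sketched with a Gaussian kernel; the kernel choice, bandwidth, and the temperature-like sample values are illustrative assumptions:

```python
import numpy as np

def kde_pdf(x, samples, h):
    """Gaussian KDE: p(x) = 1/(n*h) * sum_i K((x - x_i)/h),
    with K the standard normal density and h the kernel bandwidth."""
    u = (np.atleast_1d(x)[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

# Estimate P(x_d | y=1) from the x_d values of class-1 training instances,
# e.g. body temperatures of deteriorating patients (made-up numbers):
class1_vals = np.array([36.5, 37.0, 37.2, 38.5, 39.1])
grid = np.linspace(30, 45, 3000)
density = kde_pdf(grid, class1_vals, h=0.5)
```

The resulting curve is the "smoothed histogram" used per dimension and per class in DLR.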

  30. DLR learning. Objective: maximize the overall log likelihood, which is a function of both the weights w and the kernel bandwidth h.

  31. Overview of DLR: initialize h and w; calculate the new feature vector; update w; update h; repeat until converged.

  32. Optimization. Repeat until convergence: fix h and optimize w (using an LR solver); fix w and optimize h (steepest gradient descent). (Figure: decision boundary with the initial h and after iterations 1–3.)
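The alternating scheme can be sketched on a toy problem. This is a hedged sketch, not the paper's implementation: the per-dimension posterior log-odds transform stands in for DLR's exact feature definition, and a small grid search over h replaces steepest gradient descent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kde(x, samples, h):
    """Gaussian KDE evaluated at points x, given training samples."""
    u = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def dlr_features(X, Xtr, ytr, h):
    """Assumed DLR-style transform: phi_d(x) = log P(y=1|x_d) - log P(y=0|x_d),
    built from class-conditional KDEs and class priors."""
    pri1, pri0 = (ytr == 1).mean(), (ytr == 0).mean()
    phi = np.zeros(X.shape)
    for d in range(X.shape[1]):
        p1 = kde(X[:, d], Xtr[ytr == 1, d], h) * pri1
        p0 = kde(X[:, d], Xtr[ytr == 0, d], h) * pri0
        phi[:, d] = np.log(p1 + 1e-12) - np.log(p0 + 1e-12)
    return phi

def fit_lr(Phi, y, iters=500, lr=0.1):
    """Plain gradient-descent LR solver for the 'fix h, optimize w' step."""
    A = np.hstack([Phi, np.ones((len(y), 1))])
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        w -= lr * (A.T @ (sigmoid(A @ w) - y)) / len(y)
    return w

def loglik(w, Phi, y):
    A = np.hstack([Phi, np.ones((len(y), 1))])
    p = sigmoid(A @ w)
    return float(np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))

# Alternating optimization on a toy nonlinear concept (class 1 iff |x| > 1):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (np.abs(X[:, 0]) > 1).astype(float)
h = 0.5
for _ in range(3):
    w = fit_lr(dlr_features(X, X, y, h), y)            # fix h, optimize w
    h = max([0.2, 0.5, 1.0],                            # fix w, improve h
            key=lambda hh: loglik(w, dlr_features(X, X, y, hh), y))
pred = sigmoid(np.hstack([dlr_features(X, X, y, h), np.ones((200, 1))]) @ w) > 0.5
acc = float(np.mean(pred == y))
```

Plain LR on the raw feature cannot separate this concept (the classes are not linearly separable in x), while the density-based feature makes the problem linear.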

  33. Interpretability of DLR. For example, suppose y represents a particular disease and x_d represents the blood pressure (BP) of a patient. On the disease level, ranking the weights w_d can identify the risk factors of the disease. On the patient level, φ_d(x) indicates the abnormality of the patient's BP, and w_d φ_d(x) indicates the extent to which BP contributes to his disease.

  34. Kernels. Ideal kernel: k(x, x') = 1 if x and x' have the same label, 0 otherwise. RBF kernel: k(x, x') = exp(−‖x − x'‖² / (2σ²)); it doesn't consider the label information.

  35. DLR kernel. The DLR feature map induces a kernel that does use label information: a large value indicates the same label, a small value indicates different labels.

  36. DLR on example data. (Figure: test data, the decision boundary of original LR, and the decision boundary of density-based LR.)

  37. Accuracy on UCI datasets (figure: higher is better; results shown separately for numerical and categorical datasets).

  38. Training time (figure: lower is better; results shown separately for numerical and categorical datasets).

  39. Results on clinical data. Accuracy: LR 0.9141, SVM 0.9194, DLR 0.9204. DLR can raise an early alert while the patient still appears normal, even to the best doctors in the world.

  40. DLR for real large data. Kernel density estimation is still too slow for big data: testing time grows as the training set gets larger. Histogram estimation instead: densities are estimated per dimension, so there is no curse of dimensionality, and training and testing are ultra-fast.

  41. DLR with Bins

  42. DLR with bins (figure: the raw histogram estimate is not smooth where there is not enough data).

  43. Histogram KDE smoothing: smooth the binned estimates across neighboring bins, where m_i is the number of instances with the given label in bin i and n_i is the number of instances in bin i.
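One way to realize this idea is to smooth both the per-label counts m_i and the totals n_i with a Gaussian kernel over bin indices before taking their ratio. The exact smoothing weights on the slide are not recoverable, so this is an assumed variant:

```python
import numpy as np

def smoothed_bin_posterior(x, y, n_bins=20, h=1.5):
    """Estimate P(y=1 | bin i) from histogram counts m_i (label-1 count per
    bin) and n_i (total count per bin), smoothing both count vectors across
    neighboring bins with a Gaussian kernel over bin indices."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    m = np.bincount(idx[y == 1], minlength=n_bins).astype(float)
    n = np.bincount(idx, minlength=n_bins).astype(float)
    centers = np.arange(n_bins)
    K = np.exp(-0.5 * ((centers[:, None] - centers[None, :]) / h) ** 2)
    m_s, n_s = K @ m, K @ n                    # kernel-smoothed counts
    return m_s / np.maximum(n_s, 1e-12), edges

# Two well-separated classes: the smoothed posterior rises from ~0 to ~1
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])
y = np.concatenate([np.zeros(500), np.ones(500)])
post, edges = smoothed_bin_posterior(x, y)
```

Smoothing borrows strength from neighboring bins, fixing the "not smooth / not enough data" problem of raw bins while keeping the cost independent of the training-set size at test time.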

  44. Different numbers of bins (figure: estimates with 5 bins, 20 bins, and 100 bins).

  45. Results on accuracy (%)

      Method      Splice  Mush   w5a    w8a    Adult  kddcup
                  (1K)    (8K)   (10K)  (50K)  (30K)  (1.26M)
      linear SVM  75      100    98.15  98.57  60.03  99.99
      LR          77      99.87  97.67  98.24  84.80  99.99
      RBF SVM     80      99.23  97.14  97.20  75.29  N/A
      DLR-b       88      99.95  98.26  98.55  85.54  99.99

  46. Results on efficiency (training time)

      Method      Splice  Mush   w5a    w8a    Adult  kddcup
                  (1K)    (8K)   (10K)  (50K)  (30K)  (1.26M)
      linear SVM  0.12    0.56   1.16   15     2847   81.70
      LR          0.15    0.21   0.18   0.7    2.89   55.66
      RBF SVM     0.09    1.63   1.60   29     217    N/A
      DLR-b       0.22    0.32   2.65   7.6    0.6    17.93

  47. Feature selection ability of DLR. Standard ℓ1-regularization, loss(w) + c ∑_d |w_d|, requires non-smooth optimization. However, in DLR we can simply use the penalty c ∑_d w_d along with the constraints w_d ≥ 0, which yields a smooth optimization problem.
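Because the penalty c ∑ w_d is linear on the feasible set w_d ≥ 0, the objective stays smooth and the constraint can be handled by simple projection (clipping). A projected-gradient sketch; the synthetic data and constants are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr_nonneg(Phi, y, c=0.1, iters=2000, lr=0.1):
    """LR with penalty c * sum(w_d) subject to w_d >= 0.

    On the feasible set the penalty is linear, so the gradient is smooth;
    the constraint is enforced by clipping after each step. Weights driven
    to exactly 0 drop the corresponding features.
    """
    n, d = Phi.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        p = sigmoid(Phi @ w + b)
        gw = Phi.T @ (p - y) / n + c          # smooth gradient incl. penalty
        gb = float(np.mean(p - y))
        w = np.maximum(w - lr * gw, 0.0)      # projection onto w >= 0
        b -= lr * gb
    return w, b

# Feature 0 is informative, feature 1 is pure noise: the linear penalty
# plus the projection zeroes out the noise feature's weight.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 2))
y = (Phi[:, 0] > 0).astype(float)
w, b = fit_lr_nonneg(Phi, y)
```

The nonnegativity constraint is natural here because each DLR feature is an increasing function of P(y=1|x_d), so useful features should receive positive weight.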

  48. Top features selected by DLR: standard deviation of heart rate; ApEn of heart rate; energy of oxygen saturation; LF of oxygen saturation; LF of heart rate; DFA of oxygen saturation; mean of heart rate; HF of heart rate; inertia of heart rate; homogeneity of heart rate; energy of heart rate; linear correlation of heart rate and oxygen saturation.

  49. Conclusions on DLR. DLR satisfies all of the following: nonlinear classification ability; support for mixed data types; interpretability; efficiency; multi-class classification. Try it out: http://www.cse.wustl.edu/~wenlinchen/project/DLR/

  50. Big data algorithms: go hybrid! Non-parametric + parametric; association + causality; generative + discriminative; balance accuracy and speed. For real big data, get rid of heavy machinery and let accuracy grow with data size. A linear model can suffice, given enough nonlinearity/randomness.

  51. Thank you

  52. A challenge of the big data era: talent. McKinsey Global Institute report: big data talent is scarce.

  53. RKS: a linear model over nonlinear features

      Property                              Kernel SVM  Linear SVM  kNN  NB  LR  NN  RKS
      Nonlinear classification ability          Y           N        Y    N   N   Y   Y
      Interpretability                          N           Y        N    Y   Y   N   N
      Direct support for mixed data types       Y           Y        N    N   N   N   N
      Efficiency                                N           Y        Y    Y   Y   N   Y
      Multi-class classification                Y           Y        Y    Y   N   N   N

      RBF SVM: k(x, x') = exp(−‖x − x'‖² / (2σ²))

  54. Gaussian naive Bayes. Assumption: the features are conditionally independent given the class, P(x|y) = ∏_d P(x_d|y). Gaussian: each P(x_d|y) is modeled as a Gaussian distribution.

  55. LR and GNB. Both GNB and LR express P(y|x) in a linear model. GNB learns the parameters under the GNB assumption; LR learns them by maximum likelihood of the data.

  56. Motivation: comparing the NB and LR assumptions about the posterior.

  57. Motivation. GNB assumption: factorizing P(x|y) by naive Bayes vs. an alternative factorization.
