Big Data Algorithms with Medical Applications Yixin Chen
Outline: Challenges to big data algorithms; Clinical Big Data; Our new algorithms
Small data vs. Big data
Small data vs. Big data: general laws vs. particular laws
Small data vs. Big data: causality vs. association; domain knowledge vs. data knowledge
Small data vs. Big data (figure: model quality vs. data size, comparing big-data and small-data models)
Modeling techniques: parametric vs. non-parametric — trade-offs among efficiency, accuracy, and interpretability
Efficiency of big data models. High efficiency via: parallelization (constant speedup); algorithmic improvements (e.g., O(N^3) vs. O(N^2)). Example: Large-scale Manifold Learning — Maximum Variance Correction (Chen et al., ICML'13)
Outline: Challenges to big data algorithms; Clinical Big Data; Our new algorithms
The need for clinical prediction • ICU direct costs per day for survivors are between six and seven times those for non-ICU care. • Unlike ICU patients, general hospital ward (GHW) patients are not under extensive electronic monitoring and nurse care. • Clinical studies have found that 4–17% of patients undergo cardiopulmonary or respiratory arrest while in the GHW of a hospital.
Goal: Let Data Speak! Sudden deteriorations (e.g., septic shock, cardiopulmonary or respiratory arrest) of GHW patients can often be severe and life-threatening. Goal: provide early detection and intervention based on data mining to prevent these serious, often life-threatening events, using both clinical data and wireless body sensor data. An NSF/NIH-funded clinical trial at Washington University / Barnes-Jewish Hospital.
Clinical Data: high-dimensional, real-time time-series data. 34 vital signs: pulse, temperature, oxygen saturation, shock index, respirations, blood pressure, … (Figure: vital-sign traces over time, in seconds.)
Previous Work. Medical data mining combines machine learning methods (decision trees, neural networks, SVMs) with medical knowledge: the Acute Physiology Score, Chronic Health Score, and APACHE score are used to predict renal failures; the Modified Early Warning Score (MEWS), SCAP, and PSI are also used. Main problem: most previous work uses a snapshot method that takes all the features at a given time as input to a model, discarding the temporal evolution of the data.
Machine learning task. Challenges: • classification of high-dimensional time-series data • irregular data gaps • measurement errors • class imbalance (bar chart: far more non-ICU than ICU instances)
Solution based on existing techniques: temporal feature extraction, bootstrap aggregating (bagging), exploratory under-sampling, feature selection, exponential moving average smoothing, basic classifier (Mao et al., KDD'12). (A sketch of the smoothing step follows below.)
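To make the smoothing step concrete, here is a minimal sketch of exponential moving average smoothing applied to one vital-sign series; the function name and the smoothing factor alpha are illustrative choices, not taken from the Mao et al. pipeline.

```python
import numpy as np

def ema_smooth(series, alpha=0.3):
    """Exponentially weighted moving average of a 1-D vital-sign series.

    alpha is a hypothetical smoothing factor; the actual pipeline settings
    are not given on the slides.
    """
    smoothed = np.empty_like(series, dtype=float)
    smoothed[0] = series[0]
    for t in range(1, len(series)):
        # new value = alpha * current reading + (1 - alpha) * previous smoothed value
        smoothed[t] = alpha * series[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

# Example: smooth a noisy pulse-rate series
pulse = np.array([82, 85, 90, 88, 130, 87, 86, 84], dtype=float)
print(ema_smooth(pulse))
```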
Desired Classifier Properties: • nonlinear classification ability • interpretability • support for mixed data types • efficiency • multi-class classification. Linear SVM and logistic regression: interpretable and efficient, but linear. SVM with RBF kernels: nonlinear, but not interpretable and inefficient.
Desired Classifier Properties

Property                              kNN   NB   LR   NN   Linear SVM   Kernel SVM
Nonlinear classification ability       Y    N    N    Y        N            Y
Interpretability                       N    Y    Y    N        Y            N
Direct support for mixed data types    Y    Y    N    N        N            N
Efficiency                             Y    Y    Y    Y        Y            N
Multi-class classification             Y    Y    Y    Y        N            N
Random Kitchen Sinks (RKS): a random nonlinear feature transformation followed by a parametric, linear classifier. 1. Transform each input x into exp(-i w_k^T x), k = 1, …, K, with w_k drawn from a Gaussian distribution p(w). 2. Learn a linear model ∑_k α_k exp(-i w_k^T x). Theory: based on the Fourier transform, RKS converges to an RBF-kernel SVM as K grows. Efficient, but not interpretable.
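As an illustration of the RKS idea, here is a small sketch using the common real-valued random Fourier features cos(w^T x + b) in place of the complex exponentials on the slide; K, gamma, seed, and the helper name are illustrative choices.

```python
import numpy as np

def rks_features(X, K=500, gamma=1.0, seed=0):
    """Random Fourier features approximating the RBF kernel exp(-gamma * ||x - x'||^2).

    Real-valued variant of the complex features on the slide; K, gamma and
    seed are illustrative, not from the talk.
    """
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, K))  # w_k ~ Gaussian p(w)
    b = rng.uniform(0, 2 * np.pi, size=K)                   # random phases
    return np.sqrt(2.0 / K) * np.cos(X @ W + b)

# A linear model (e.g., logistic regression) trained on rks_features(X)
# approximates an RBF-kernel SVM as K grows.
```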
Outline: Challenges to big data algorithms; Clinical Big Data; Our new algorithms
Key Idea: Hybrid Model. A non-parametric, nonlinear feature transformation followed by a parametric, linear classifier — combining nonlinearity, efficiency, and interpretability.
Desired Classifier Properties

Property                              kNN   NB   LR   NN   Linear SVM   Kernel SVM   DLR
Nonlinear classification ability       Y    N    N    Y        N            Y         Y
Interpretability                       N    Y    Y    N        Y            N         Y
Direct support for mixed data types    Y    Y    N    N        N            N         Y
Efficiency                             Y    Y    Y    Y        Y            N         Y
Multi-class classification             Y    Y    Y    Y        N            N         Y

DLR: Density-based Logistic Regression (Chen et al., KDD'13)
Logistic Regression. Each instance has D features: x = (x_1, …, x_D). Assume P(y = 1 | x) = 1 / (1 + exp(-τ(x))), where τ(x) = w_0 + ∑_d w_d φ_d(x). Training dataset: {(x^(i), y^(i))}, i = 1, …, N. Optimization: maximize the overall log-likelihood ∑_i log P(y^(i) | x^(i)).
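For concreteness, a minimal sketch of fitting plain logistic regression by gradient ascent on the log-likelihood; the learning rate, iteration count, and function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, iters=1000):
    """Plain logistic regression trained by gradient ascent on the log-likelihood.

    X: (N, D) feature matrix, y: (N,) labels in {0, 1}.
    lr and iters are illustrative hyperparameters.
    """
    N, D = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])      # prepend a bias column for w_0
    w = np.zeros(D + 1)
    for _ in range(iters):
        p = sigmoid(Xb @ w)                   # P(y = 1 | x) for each instance
        grad = Xb.T @ (y - p) / N             # gradient of the mean log-likelihood
        w += lr * grad
    return w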
Problem with linear models: in τ(x) = w_0 + ∑_d w_d φ_d(x), what should φ_d(x) be? If we set φ_d(x) = x_d, we recover plain logistic regression, which is linear and cannot capture nonlinear structure.
Insights on τ(x). LR: P(y = 1 | x) = 1 / (1 + exp(-τ(x))). On the other hand, P(y = 0 | x) = 1 − P(y = 1 | x). Hence τ(x) = ln [ P(y = 1 | x) / P(y = 0 | x) ]: τ(x) is the log-odds of the positive class.
Factorization in DLR. Assumption: the posterior odds factorize over the individual features, P(y = 1 | x) / P(y = 0 | x) ∝ ∏_d [ P(y = 1 | x_d) / P(y = 0 | x_d) ]^{w_d}.
DLR Feature Transformation: φ_d(x) = ln [ P(y = 1 | x_d) / (1 − P(y = 1 | x_d)) ], where φ_d(x) is an increasing function of P(y = 1 | x_d).
Conditional Probability Estimation. Categorical x_d: estimate P(y = 1 | x_d) by the empirical frequency of y = 1 among training instances with that category value. Numerical x_d: kernel density estimation (a smoothed histogram).
Kernel density estimation. Given the training dataset {(x^(i), y^(i))}: P(y = 1 | x_d) ≈ ∑_i K( (x_d − x_d^(i)) / h_d ) · 1[y^(i) = 1] / ∑_i K( (x_d − x_d^(i)) / h_d ), where K is a kernel (e.g., Gaussian) and h_d is the kernel bandwidth for dimension d.
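A minimal sketch of computing one DLR feature as the estimated log-odds, using a kernel-weighted (Nadaraya–Watson) estimate on a single dimension; the helper names and the eps smoothing constant are illustrative, not from the paper.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2)

def dlr_feature(xd_query, xd_train, y_train, h, eps=1e-6):
    """One DLR feature phi_d(x): estimated log-odds of P(y=1 | x_d).

    Kernel-weighted estimate with bandwidth h on a single numeric dimension;
    eps is a hypothetical smoothing constant to avoid log(0).
    """
    w = gaussian_kernel((xd_query - xd_train) / h)          # weight of each training point
    p1 = (w * (y_train == 1)).sum() / (w.sum() + eps)       # estimated P(y=1 | x_d)
    p1 = np.clip(p1, eps, 1 - eps)
    return np.log(p1 / (1 - p1))                            # phi_d(x) = log-odds
```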
DLR Learning. Objective: maximize the overall log-likelihood, which is a function of both the weights w and the bandwidths h = (h_1, …, h_D).
Overview of DLR: initialize h and w → calculate the new feature vector → update w → update h → repeat until converged.
Optimization: repeat until convergence — fix h and optimize w (using an LR solver); fix w and optimize h (steepest gradient descent). (Figure: the estimated density at the initial h and after iterations 1, 2, and 3.)
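A rough sketch of the alternating loop, reusing dlr_feature and fit_logistic_regression from the sketches above; the bandwidth update probes a simple multiplicative change rather than the analytic steepest-descent step, so treat it as an illustration of the loop structure only.

```python
import numpy as np

def fit_dlr(X, y, n_outer=10, h_step=1.05):
    """Alternating optimization sketch for DLR (uses dlr_feature and
    fit_logistic_regression defined in the earlier snippets).

    Step 1: fix h, compute the features phi and fit w with an LR solver.
    Step 2: fix w, adjust each bandwidth h_d if it improves the likelihood.
    """
    N, D = X.shape
    h = X.std(axis=0) + 1e-3                        # initial bandwidths

    def transform(h_vec):
        Phi = np.empty((N, D))
        for n in range(N):
            for d in range(D):
                Phi[n, d] = dlr_feature(X[n, d], X[:, d], y, h_vec[d])
        return Phi

    def nll(w, Phi):                                # negative mean log-likelihood
        z = np.hstack([np.ones((N, 1)), Phi]) @ w
        return -np.mean(y * z - np.log1p(np.exp(z)))

    for _ in range(n_outer):
        Phi = transform(h)
        w = fit_logistic_regression(Phi, y)         # step 1: update w
        for d in range(D):                          # step 2: update h_d
            for factor in (h_step, 1.0 / h_step):
                h_try = h.copy()
                h_try[d] *= factor
                if nll(w, transform(h_try)) < nll(w, Phi):
                    h, Phi = h_try, transform(h_try)
                    break
    return w, h
```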
Interpretability. DLR: τ(x) = w_0 + ∑_d w_d φ_d(x). For example, suppose y = 1 represents a particular disease and x_d represents the blood pressure (BP) of a patient. On the disease level, ranking the weights w_d can identify the risk factors of this disease. On the patient level, φ_d(x) indicates the abnormality of the patient's BP, and w_d φ_d(x) indicates the extent to which BP contributes to the disease.
Kernels. Ideal kernel: k(x, x') = 1 if x and x' have the same label, 0 otherwise. RBF kernel: k(x, x') = exp(-||x - x'||^2 / (2σ^2)) — it does not consider the label information.
DLR Kernel. The DLR feature map induces a kernel k(x, x') = ⟨φ(x), φ(x')⟩: a large positive value indicates that x and x' likely share the same label, while a negative value indicates different labels.
DLR on example data (figure): test data, decision boundary of original LR, and decision boundary of density-based LR.
Accuracy on UCI datasets (figure: comparison on numerical and categorical datasets; the arrow marks the better direction).
Training time (figure: comparison on numerical and categorical datasets; the arrow marks the better direction).
Results on clinical data. Accuracy: LR 0.9141, SVM 0.9194, DLR 0.9204. Early alert when the patient still appears normal to the best doctors in the world.
DLR for real large data. Estimating P(y = 1 | x_d) by kernel density smoothing is still too slow for big data: testing time grows as the training set gets larger. Instead, estimate P(y = 1 | x_d) with a histogram. Since each estimate is one-dimensional, there is no curse of dimensionality, and training and testing are ultra-fast.
DLR with Bins (figure: histogram estimate of P(y = 1 | x_d) over bins). Problems with plain binning: the estimate is not smooth, and some bins contain not enough data.
Histogram + KDE smoothing: P(y = 1 | x_d) ≈ ∑_i K_h(x_d − c_i) m_i / ∑_i K_h(x_d − c_i) n_i, where m_i is the number of label-1 instances in bin i, n_i is the number of instances in bin i, and c_i is the center of bin i.
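A minimal sketch of the binned estimate with kernel smoothing across neighbouring bins; the bin count, bandwidth (in units of bins), and helper name are illustrative choices.

```python
import numpy as np

def binned_smoothed_posterior(xd_train, y_train, n_bins=20, h=1.5):
    """Histogram estimate of P(y=1 | x_d), smoothed across neighbouring bins.

    Returns the bin edges and the smoothed per-bin probabilities.
    """
    edges = np.linspace(xd_train.min(), xd_train.max(), n_bins + 1)
    idx = np.clip(np.digitize(xd_train, edges) - 1, 0, n_bins - 1)
    n_i = np.bincount(idx, minlength=n_bins)                      # instances per bin
    m_i = np.bincount(idx, weights=(y_train == 1).astype(float),
                      minlength=n_bins)                           # label-1 instances per bin

    centers = np.arange(n_bins)
    # bin-to-bin Gaussian weights used to smooth the counts
    K = np.exp(-0.5 * ((centers[:, None] - centers[None, :]) / h) ** 2)
    p1 = (K @ m_i) / np.maximum(K @ n_i, 1e-12)                   # smoothed P(y=1 | bin)
    return edges, p1
```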
Different number of bins (figure: estimates with 5 bins, 20 bins, and 100 bins).
Results on accuracy

Dataset (size)   Splice (1K)   Mush (8K)   w5a (10K)   w8a (50K)   Adult (30K)   kddcup (1.26M)
Linear SVM           75           100        98.15       98.57        60.03          99.99
LR                   77           99.87      97.67       98.24        84.80          99.99
RBF SVM              80           99.23      97.14       97.20        75.29          N/A
DLR-b                88           99.95      98.26       98.55        85.54          99.99
Results on efficiency

Dataset (size)   Splice (1K)   Mush (8K)   w5a (10K)   w8a (50K)   Adult (30K)   kddcup (1.26M)
Linear SVM          0.12          0.56        1.16        15           2847           81.70
LR                  0.15          0.21        0.18        0.7          2.89           55.66
RBF SVM             0.09          1.63        1.60        29           217            N/A
DLR-b               0.22          0.32        2.65        7.6          0.6            17.93
Feature Selection Ability. DLR: • l1-regularization, loss(w) + c ∑_d |w_d|, requires non-smooth optimization. • However, in DLR we can simply use the penalty c ∑_d w_d along with the constraints w_d ≥ 0 — a smooth optimization problem.
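A sketch of this smooth, constrained formulation using an off-the-shelf bound-constrained solver (SciPy's L-BFGS-B); the regularization strength c and the function name are illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_dlr_feature_selection(Phi, y, c=0.1):
    """Sparse DLR weights via the smooth surrogate loss(w) + c * sum(w_d), with w_d >= 0.

    Phi: (N, D) DLR-transformed features, y: labels in {0, 1};
    the bias w_0 is left unconstrained.
    """
    N, D = Phi.shape
    Xb = np.hstack([np.ones((N, 1)), Phi])

    def objective(w):
        z = Xb @ w
        nll = -np.mean(y * z - np.log1p(np.exp(z)))   # logistic loss
        return nll + c * np.sum(w[1:])                # linear penalty, valid since w_d >= 0

    bounds = [(None, None)] + [(0.0, None)] * D       # w_0 free, each w_d >= 0
    res = minimize(objective, np.zeros(D + 1), method="L-BFGS-B", bounds=bounds)
    return res.x
```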
Top features selected by DLR: standard deviation of heart rate; ApEn of heart rate; energy of oxygen saturation; LF of oxygen saturation; LF of heart rate; DFA of oxygen saturation; mean of heart rate; HF of heart rate; inertia of heart rate; homogeneity of heart rate; energy of heart rate; linear correlation of heart rate and oxygen saturation.
Conclusions on DLR DLR satisfies all the following: • Nonlinear classification ability • Support for mixed data types • Interpretability • Efficiency • Multi-class classification Try it out! http://www.cse.wustl.edu/~wenlinchen/project/DLR/
Big Data Algorithms • Hybrid! - Non-parametric + parametric - Association + causality - Generative + discriminative - Balance accuracy and speed • For real big data, get rid of heavy machinery - Let accuracy grow with data size • A linear model suffices given enough nonlinearity/randomness in the features
Thank you
Challenges of the big data era: talent. McKinsey Global Institute report: big data talent is scarce.
RKS: linear model over nonlinear features

Property                              kNN   NB   LR   NN   Linear SVM   Kernel SVM   RKS
Nonlinear classification ability       Y    N    N    Y        N            Y         Y
Interpretability                       N    Y    Y    N        Y            N         N
Direct support for mixed data types    Y    Y    N    N        N            N         N
Efficiency                             Y    Y    Y    Y        Y            N         Y
Multi-class classification             Y    Y    Y    Y        N            N         N

RBF SVM: k(x, x') = exp(-||x - x'||^2 / (2σ^2))
Gaussian Naive Bayes. Assumption: the features are conditionally independent given the class, P(x | y) = ∏_d P(x_d | y). Gaussian: P(x_d | y = k) = N(x_d; μ_{dk}, σ_d^2).
LR and GNB. Under the GNB assumption, both GNB and LR express P(y = 1 | x) as a logistic function of a linear τ(x). GNB learns the weights from the class-conditional Gaussians under the GNB assumption; LR learns the weights using maximum likelihood of the data.
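A short standard derivation (not on the slides) of why the GNB assumption, with a shared per-feature variance, yields a τ(x) that is linear in x — the same form LR assumes:

```latex
\[
\tau(x) \;=\; \ln\frac{P(y=1\mid x)}{P(y=0\mid x)}
        \;=\; \ln\frac{P(y=1)}{P(y=0)} \;+\; \sum_{d=1}^{D} \ln\frac{P(x_d\mid y=1)}{P(x_d\mid y=0)},
\]
and with Gaussian class-conditionals sharing a per-feature variance,
$P(x_d\mid y=k)=\mathcal{N}(x_d;\mu_{dk},\sigma_d^2)$,
\[
\ln\frac{P(x_d\mid y=1)}{P(x_d\mid y=0)}
   \;=\; \frac{\mu_{d1}-\mu_{d0}}{\sigma_d^2}\,x_d \;+\; \frac{\mu_{d0}^2-\mu_{d1}^2}{2\sigma_d^2},
\]
so $\tau(x)=w_0+\sum_d w_d x_d$ is linear in $x$.
```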
Motivation. Contrast the assumptions: NB factorizes the likelihood P(x | y), while LR models the posterior P(y | x) directly.
Motivation (GNB). Assumption: Naïve Bayes factorizes the likelihood, P(x | y) = ∏_d P(x_d | y); DLR instead factorizes the posterior odds, P(y = 1 | x) / P(y = 0 | x) ∝ ∏_d [ P(y = 1 | x_d) / P(y = 0 | x_d) ]^{w_d}.