Effectiveness of Deep Learning vs. Machine Learning in a Health Care Use Case
RxToDx: A Data Science Machine Learning/Deep Learning Showcase
Dima Rekesh / Julie Zhu / Ravi Rajagopalan
Optum Technology, November 2017
Objective: What have we learned from a deep learning health care use case?
• Evaluate deep learning on a well-known, production machine learning problem using a typical time series data set.
• Seek best practices from an industry leader (Nvidia).
• Impute and predict the likelihood of an individual having a medical condition using members' previous two years of prescription pharmacy claims data.
Deep Learning
How is it different?
• Multiple layers in a neural network, with intermediate data representations that facilitate dimensionality reduction.
• Interprets non-linear relationships in the data.
• Derives patterns from data with very high dimensionality.
Why do we care?
• Ability to create value with little or no domain knowledge required.
• Ability to incorporate data from multiple, seemingly unrelated sources.
• Ability to tolerate very noisy data.
What have we learned?
• Deep learning does not need SMEs' inputs.
• It eliminates manual feature engineering.
• It predicts multiple targets at a time.
• It delivers higher performance with higher data volumes.
• It makes it possible to automate model development.
Results: summary and take-aways
• DL proved to be more accurate than conventional ML methods.
• Neural nets required no manual feature engineering, pointing to a reduction in the person-hours required to create and maintain them.
• A single deep neural network was capable of predicting at least four different diseases more accurately than conventional ML models, which can predict only one disease at a time. This points to a drastic reduction in costs.
• Modern GPUs are required: it takes ~24 hours to train on the full data set of 4.5M records on the latest (Nvidia P100) GPU.
Depression impact: $126M Rx cost per year
• Total annual claims at Optum (for the cohort): $965 million
• Depression-related claims (for the cohort): $126 million
Source: www.slideshare.net/psychiatryjfn/disorders-of-mood1
Deep learning doesn't rely on SMEs' inputs

Conventional ML requires SME inputs plus engineered features; deep learning takes raw data. Examples of the hand-built markers (DEPR_* drug codes) and engineered features the conventional models depend on:

For Logistic Model
| Old Markers                 | Feature Importance |
| DEPR_327                    | 0.0601 |
| DEPR_322                    | 0.1178 |
| DEPR_2                      | 0.0977 |
| DEPR_32306_18_19_20_24_26   | 0.0512 |

For XGB Model
| Old Markers                              | Feature Importance |
| DEPR_320_32005                           | 0.2155 |
| number_prescribers_(10, inf]             | 0.0315 |
| sum_amt_standard_cost_(621.48, inf]      | 0.0286 |
| DEPR_31702                               | 0.0258 |
| DEPR_31705                               | 0.0258 |
| number_rx_(9, inf]                       | 0.0200 |
| DEPR_31700                               | 0.0186 |
| DEPR_31500                               | 0.0172 |
| number_rx_(4, 9]                         | 0.0172 |
| number_rx_(3, 4]                         | 0.0157 |
| tot_drug_units_(5651, inf]               | 0.0143 |
| DEPR_29702                               | 0.0143 |
| number_prescribers_(5, 7]                | 0.0129 |
| tot_days_supply_(7, 27]                  | 0.0129 |
| number_prescribers_(7, 10]               | 0.0129 |
| sum_amt_standard_cost_(270.11, 477.88]   | 0.0114 |
| DEPR_57018                               | 0.0114 |
| tot_days_supply_(1027, 1551]             | 0.0100 |
| DEPR_321                                 | 0.0100 |
| DEPR_606                                 | 0.0100 |

(The DEPR_* markers are drug codes.)

The deep learning model takes raw data, without SME inputs or feature engineering. A sketch of how importances like these are produced follows.
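A minimal sketch of how the XGBoost feature importances in the table above could be produced. The data here is synthetic and the column names are sanitized stand-ins for the bucketed features in the table (XGBoost rejects feature names containing "[", "]" or "<"); the point is only that the conventional model sees manually engineered features and yields per-feature importances.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
# Sanitized stand-ins for the engineered, bucketed features above.
cols = ["DEPR_320_32005", "number_prescribers_gt_10",
        "sum_amt_standard_cost_gt_621", "DEPR_31702"]
X = pd.DataFrame(rng.integers(0, 2, size=(1000, len(cols))), columns=cols)
y = rng.integers(0, 2, size=1000)              # synthetic 0/1 depression label

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

# Per-feature importances, sorted descending as in the table.
print(pd.Series(model.feature_importances_, index=cols)
        .sort_values(ascending=False))
```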
Machine Learning Model Process Flow Chart (diagram)
The revolution: Machine Learning vs. Deep Learning

For decades, ML relied on human-engineered features in fields as diverse as image processing (e.g., edge detection) and NLP (linguistics, stop words). DL renders feature engineering obsolete.

• Machine Learning: feature engineering → class-specific feature creation → predict one class at a time (e.g., Depression: 1 or 0).
• Deep Learning: automatic feature creation (autoencoder/embedding model) → predict multiple classes at a time (e.g., Depression: 1, Asthma: 1, ATDD: 1, ...); see the sketch below.

By directly using raw data without feature engineering, and by predicting multiple targets at a time, the deep learning approach saves more than 50% of model development time and resources.
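A minimal sketch of "predict multiple classes at a time": a single Keras network with one sigmoid unit per disease, trained with binary cross-entropy so each target is an independent yes/no prediction. Layer sizes and the input width are illustrative, not the production configuration.

```python
from keras.models import Sequential
from keras.layers import Dense

DISEASES = ["depression", "asthma", "ATDD", "hypertension"]

model = Sequential([
    Dense(128, activation="relu", input_shape=(500,)),  # 500-wide raw input (assumed)
    Dense(64, activation="relu"),
    Dense(len(DISEASES), activation="sigmoid"),  # one probability per disease
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# Training labels: an (n_samples, 4) matrix of 0/1 values, one column per disease.
```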
Higher performance with higher volume of data

(Chart comparing three models: an LSTM trained on 4.5 million records, an RNN trained on 0.5 million records, and an XGBoost model.)
Deep Learning vs. ML model (XGBoost) – Cost Analysis (Annual Cost)

Count comparison (patients identified):
• Both models: 259K
• Deep learning only: 22K
• ML model only: 3.2K
Deep learning identifies an additional 22K patients that are not identified by the ML model (XGBoost).

Cost comparison (claims*):
• Both models: USD 806M
• Deep learning only: USD 56M
• ML model only: USD 9M
Deep learning identifies an additional USD 56 million in claims that are not identified by the ML model (XGBoost).

* Includes non-depression-related claims
Automated ML/DL platform in an AI system: a Machine Learning/Deep Learning robot
• Autonomous, instant learning
• Hyperparameter tuning
• Free of manual feature engineering and model selection
• Handles sparse, high-dimensional data
• Data-driven results
• Multiple targets at a time
• Able to explain results – how and why
Multi-disease predictions
Hypertension – 4.5M records (chart: specialist single-disease model vs. 4-disease model)
ATDD – 4.5M records (chart: specialist single-disease model vs. 4-disease model)
Depression – 4.5M records (chart: specialist single-disease model vs. 4-disease model)
Asthma – 4.5M records (chart: specialist single-disease model vs. 4-disease model)
1D convolutional networks: simple, fast, local

Fewer weights. Observe that in images, objects are "local"; a 1-D convolution exploits the same locality along the time axis to capture short-range patterns.
• Kernel size: 4
• Stride: 2
• Input length: 16 → feature map length: 7, since (16 − 4) / 2 + 1 = 7 (see the sketch below).
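A minimal Keras sketch of the 1-D convolution above, confirming the feature-map arithmetic:

```python
from keras.models import Sequential
from keras.layers import Conv1D

# One input channel, sequence length 16; kernel size 4, stride 2, no padding.
model = Sequential([
    Conv1D(filters=1, kernel_size=4, strides=2, input_shape=(16, 1)),
])
model.summary()  # output shape (None, 7, 1): (16 - 4) / 2 + 1 = 7
```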
RNNs: n inputs, m outputs

RNNs are pervasively used for NLP and language-to-language translation, e.g., "él fue a la escuela" → "He went to school". At each step the hidden state carries enough information to generate an output, whether a single value at one time or a full sequence over time. (See the sketch below.)
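A minimal Keras sketch of the n-inputs/m-outputs idea: the same LSTM layer can return only its final state (one output per sequence) or one output per time step, as in translation-style models. Sizes are illustrative.

```python
from keras.models import Sequential
from keras.layers import LSTM

# One output per sequence: only the final hidden state is returned.
seq_to_one = Sequential([LSTM(32, input_shape=(None, 8))])

# One output per time step, as in sequence-to-sequence translation.
seq_to_seq = Sequential([LSTM(32, input_shape=(None, 8), return_sequences=True)])

print(seq_to_one.output_shape)  # (None, 32)
print(seq_to_seq.output_shape)  # (None, None, 32)
```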
Zero padding helps with inputs consisting of different-length sequences

Three irregular-length inputs (Input 1, Input 2, Input 3) are zero padded into one rectangular 3 × 8 array before being fed to the neural network. (See the sketch below.)
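A minimal sketch of the padding step with the Keras utility; the three toy sequences are made up, and padding on the left ("pre") is assumed:

```python
from keras.preprocessing.sequence import pad_sequences

inputs = [
    [5, 12, 7],                    # Input 1 (length 3)
    [3, 9, 14, 2, 8],              # Input 2 (length 5)
    [6, 1, 4, 10, 11, 13, 15, 7],  # Input 3 (length 8)
]
padded = pad_sequences(inputs, maxlen=8, padding="pre", value=0)
print(padded.shape)  # (3, 8) -- one rectangular array for the network
```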
Embeddings help with categorical, non-contiguous inputs

• Raw input: each input is a number from 1 to 1,000 (e.g., codes for prescription drugs), so a sequence of 4 codes is a 1 × 4 vector. But codes 779 and 780 are not actually "close".
• One-hot: each input becomes a vector of 1,000 ones and zeros (1000 × 4 for the sequence). Better, but a lot of numbers and a lot of memory.
• Embedding transformation: each input becomes a vector of 3 numbers (embedding dimension = 3), so the sequence becomes 3 × 4. Hopefully, "close" vectors really are close.
Keras learned embeddings use a fully connected layer, learned together with the rest of the model

The one-hot 1000 × 4 input passes through a fully connected layer whose weights are learned; with embedding dimension 3, each input becomes a vector of 3 numbers and the sequence becomes 3 × 4. Hopefully, "close" vectors really are close. (A minimal sketch follows.)
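A minimal sketch of a learned embedding with the Keras 2-era API the deck used (`input_length` was removed in later Keras versions): 1,000 possible drug codes, embedding dimension 3, sequences of 4 codes.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential([Embedding(input_dim=1000, output_dim=3, input_length=4)])

codes = np.array([[779, 780, 12, 45]])  # one sequence of 4 drug codes
print(model.predict(codes).shape)       # (1, 4, 3): each code -> 3 numbers
```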
Word2vec embeddings: unsupervised

• CBOW (Continuous Bag of Words): predict the word given its context.
• Skip-grams: predict the context (including far-away words) given a word.

Training for co-occurrence on the sequence [11, 45, 32] with window = 3:
• Skip-grams: input 11 → output pairs [11, 32], [11, 45]
• CBOW: context inputs [45, 32] → output 11
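A minimal gensim sketch of training the two variants on toy drug-code "sentences"; `sg=0` selects CBOW, `sg=1` skip-gram, and `window=3` matches the slide. (`vector_size` is the gensim ≥ 4 name; older gensim called it `size`.)

```python
from gensim.models import Word2Vec

sequences = [["11", "45", "32"], ["11", "32", "45"], ["45", "32", "11"]]

cbow = Word2Vec(sequences, vector_size=3, window=3, min_count=1, sg=0)
skip = Word2Vec(sequences, vector_size=3, window=3, min_count=1, sg=1)

print(cbow.wv["11"])  # the learned 3-dimensional vector for drug code 11
```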
Word2vec embeddings (CBOW) – network diagram

A DCC sequence (e.g., DCC = 33907) enters as a one-hot input; a hidden layer of linear neurons (weights w_ij) feeds an output softmax classifier giving the probability that each DCC code ("45501", "45502", "45503", ..., "83600") appears at the nearby location (the target DCC). The learned hidden-layer weights are the word2vec output.
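A minimal numpy sketch of the forward pass in the diagram, with illustrative sizes: a one-hot code times an input weight matrix gives the linear hidden layer (the embedding); a second matrix plus softmax gives the probability of each nearby code.

```python
import numpy as np

V, H = 1000, 3                 # vocabulary of DCC codes, hidden (embedding) size
W_in = np.random.randn(V, H)   # input -> hidden weights (the embeddings)
W_out = np.random.randn(H, V)  # hidden -> output weights

x = np.zeros(V)
x[339] = 1.0                   # one-hot input for one DCC code
h = x @ W_in                   # linear hidden layer (no activation)
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()           # softmax: P(nearby code | input code)
print(probs.argmax())          # index of the most probable nearby DCC code
```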
Embeddings: Word2Vec + LSTM

Approach 1:
• Build a Word2Vec model on drug sequences using gensim.
• Replace the drug codes with their respective vectors.
• Use the vectorized inputs for the LSTM model.

Approach 2:
• Build a Word2Vec model on drug sequences using gensim.
• Initialize the weights of the Keras Embedding layer using the Word2vec output.
• Run the Embedding + LSTM model in Keras. (See the sketch below.)

Observations:
• Approach 1, though it gave promising results, wasn't scalable: memory constraints kicked in during vectorization.
• Approach 2 gave good enough results, but not enough to beat a model with a pure Keras Embedding + LSTM.
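A hedged sketch of Approach 2: seed a Keras Embedding layer with gensim word2vec vectors, then fine-tune it together with an LSTM classifier. The sequences are toy data, sizes are illustrative, and vocabulary handling is simplified; a real pipeline must map each drug code to its embedding row index before feeding the model.

```python
import numpy as np
from gensim.models import Word2Vec
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

sequences = [["11", "45", "32"], ["45", "32", "11"]]  # toy drug-code sequences
w2v = Word2Vec(sequences, vector_size=8, window=3, min_count=1)

vocab = sorted(w2v.wv.key_to_index)               # code -> embedding row index
weights = np.vstack([w2v.wv[c] for c in vocab])   # word2vec output as a matrix

model = Sequential([
    Embedding(input_dim=len(vocab), output_dim=8,
              weights=[weights], input_length=10),  # seeded, still trainable
    LSTM(32),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```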
Rx2dx project: evaluated network architectures

Specialist as well as multi-disease networks were examined. From the bottom of the stack up:
• Inputs: zero-padded time sequences (up to 256 steps) and static variables
• Embedding on the sequences
• RNN (up to 256 units) or 1-D CNN
• Concatenate with the static variables
• FC layer (up to 64 units)
• Classifier (1..N classes)
(A sketch of this stack follows.)
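A hedged Keras sketch of this architecture, assuming an LSTM as the RNN (the slide also allows a 1-D CNN); the vocabulary size, static-variable width, and embedding dimension are assumptions, not the production values.

```python
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, concatenate

seq_in = Input(shape=(256,), name="drug_code_sequence")   # zero-padded codes
static_in = Input(shape=(20,), name="static_vars")        # width 20 is assumed

x = Embedding(input_dim=5000, output_dim=64)(seq_in)      # vocab size assumed
x = LSTM(256)(x)                                          # "RNN (up to 256)"
x = concatenate([x, static_in])
x = Dense(64, activation="relu")(x)                       # "FC (up to 64)"
out = Dense(4, activation="sigmoid", name="diseases")(x)  # classifier, 1..N classes

model = Model(inputs=[seq_in, static_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```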
Hardware: IBM Minsky, a 4-GPU server

The only server offering NVLink between CPUs and GPUs, on the Power architecture:
• 20-core Power8 at 3.25 GHz (×8 hardware threads)
• 1024 GB RAM
• 2 × 2.5" 1 TB SSDs
• Mellanox QDR InfiniBand

This architecture makes a difference on mixed workloads with a lot of CPU-to-GPU communication (real-time batch generation).
The software stack: nvidia-docker with framework-enabled docker containers

We predominantly used Keras + Theano, with command line or web/GUI (Jupyter) access. Available framework containers include TensorFlow, Theano, MXNet, and Torch. Swift object storage holds long-term reference data sets, inputs, and results; a docker registry holds images. nvidia-docker provides the base docker image with device pass-through, on top of the CUDA drivers and GPUs 0..N.

• The bare-metal machine is loaded with CUDA drivers; then one installs docker and then nvidia-docker.
• At that point, hundreds of open source DL-enabled containers become available for instant download.
• At Optum, we already have an internal docker registry that we can use to store and manage internal images.