CRADLE: Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries
Hung Viet Pham¹, Thibaud Lutellier¹, Weizhen Qi², Lin Tan³
¹ University of Waterloo, Canada   ² University of Science and Technology of China, China   ³ Purdue University, USA
Deep learning (DL) is pervasive
● Machine translation
● Alzheimer's disease diagnosis
● Autonomous driving
● Virtual assistants
Correct DL systems require correct implementations
[Figure: a DL system consists of algorithms/models and their implementations; CRADLE targets the implementations.]
DL libraries are hard to test and debug
● Intrinsic complexity
● The expected output of a DL system is unknown
○ Correct programs should produce the expected output.
○ The ground truth is not the expected output because models are not perfect.
[Figure: MobileNetV2 example. Expected output: tennis ball; MobileNetV2 on TensorFlow: banana; ground truth: banana.]
Idea: Differential testing
[Figure: the same InceptionResNetV2 model runs on the TensorFlow backend and the CNTK backend; for a "petri dish" image the two backends produce different classifications, i.e., an inconsistency.]
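As a minimal sketch of this comparison step, assume the same pre-trained Keras model has already been run once per backend (e.g., by setting KERAS_BACKEND before each run) and its softmax outputs saved; the file names and shapes below are hypothetical.

```python
import numpy as np

# Hypothetical files: softmax outputs of the same model on the same inputs,
# produced by two separate runs with KERAS_BACKEND=tensorflow and KERAS_BACKEND=cntk.
preds_tf = np.load("preds_tensorflow.npy")    # shape: (num_inputs, num_classes)
preds_cntk = np.load("preds_cntk.npy")

# Flag inputs on which the two backends disagree about the top-1 class.
top1_tf = preds_tf.argmax(axis=1)
top1_cntk = preds_cntk.argmax(axis=1)
disagreements = np.flatnonzero(top1_tf != top1_cntk)
print(f"{len(disagreements)} of {len(top1_tf)} inputs classified differently")
```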
Batch_normalization bug
● The CNTK batch normalization formula was implemented incorrectly.
● The developers fixed the bug after we reported it.
- return (x - mean) / (C.sqrt(var) + epsilon) * gamma + beta
+ return (x - mean) / C.sqrt(var + epsilon) * gamma + beta
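To make the impact of that small change concrete, here is a NumPy illustration of the two formulas with made-up values (NumPy stands in for the CNTK API):

```python
import numpy as np

x, mean, var, gamma, beta, epsilon = 1.0, 0.0, 1e-4, 1.0, 0.0, 1e-3

buggy = (x - mean) / (np.sqrt(var) + epsilon) * gamma + beta   # epsilon added after the square root
fixed = (x - mean) / np.sqrt(var + epsilon) * gamma + beta     # epsilon added inside the square root

print(buggy, fixed)  # roughly 90.9 vs. 30.2; the gap grows as var approaches zero
```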
Differential testing: Challenges
● How to compare two implementations?
○ What metric to use?
○ What should be considered bugs?
● How to localize the faults?
○ How to find faults in complex model executions?
Differential testing: Ideas
● Two metrics measure the severity of an inconsistency over a set of input instances.
● A localization map compares the intermediate states of DL models for fault localization.
CRADLE: Overview
[Figure: CRADLE's two phases. Detection phase: trained models and validation data go through a model output extractor and an output comparator, which report unique inconsistencies and crash bugs. Localization phase: a hidden-states extractor and an inconsistency localizer turn the hidden states into localization maps for the inconsistency bugs.]
CRADLE: Detection phase
[Figure: the overview diagram with the detection phase highlighted (model output extractor and output comparator, reporting unique inconsistencies and crash bugs).]
Output extractor
● Executes the models on different backends to obtain their outputs
● Detects crashes
[Figure: the "petri dish" image fed to the InceptionResNetV2 model on the CNTK backend to obtain its classification.]
Output comparator: Distance metrics
● The metrics calculate the difference relative to the ground truth.
● CLASS-based (classification)
● MAD-based (regression)
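For regression models, a sketch of what the MAD-based metric could look like: each backend's output is compared to the ground truth by mean absolute deviation, and the distance is the gap between the two deviations. The normalization used below is my assumption, not stated on the slide; function names are mine.

```python
import numpy as np

def mad(y, y_prime):
    """Mean absolute deviation between two output vectors."""
    return np.mean(np.abs(np.asarray(y, dtype=float) - np.asarray(y_prime, dtype=float)))

def d_mad(y_truth, y_a, y_b):
    """MAD-based distance between backends A and B, measured relative to the ground truth.
    The normalization (dividing by the summed deviations) is an assumption."""
    dev_a, dev_b = mad(y_truth, y_a), mad(y_truth, y_b)
    return abs(dev_a - dev_b) / max(dev_a + dev_b, 1e-12)  # guard against division by zero
```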
CLASS-based distance example
● Top-5 classification of the "petri dish" image on TensorFlow vs. CNTK:
○ rank_petri-dish,TF = 1, so σ_petri-dish,TF = 2^(5-1) = 16
○ rank_petri-dish,CNTK > 5, so σ_petri-dish,CNTK = 0
○ |σ_petri-dish,TF - σ_petri-dish,CNTK| = 16
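A sketch of the scoring this example implies: the ground-truth label's rank within a backend's top-k predictions is mapped to 2^(k - rank) (0 if it falls outside the top k), and the CLASS-based distance is the absolute difference of the two backends' scores. Function names are mine.

```python
import numpy as np

def class_score(probs, truth_idx, k=5):
    """2**(k - rank) if the ground-truth class is ranked within the top k, else 0 (rank is 1-based)."""
    topk = np.argsort(probs)[::-1][:k]
    hits = np.flatnonzero(topk == truth_idx)
    return 0 if hits.size == 0 else 2 ** (k - (hits[0] + 1))

def d_class(probs_a, probs_b, truth_idx, k=5):
    """CLASS-based distance between two backends' predictions for one input."""
    return abs(class_score(probs_a, truth_idx, k) - class_score(probs_b, truth_idx, k))

# Slide's example: TF ranks "petri dish" 1st (score 16), CNTK outside the top 5 (score 0) -> distance 16.
```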
Inconsistency-triggering input (ITI)
● An input instance that triggers a distance larger than a threshold (T_C or T_M)
○ E.g., the "petri dish" image is an ITI given T_C = 8.
[Figure: example ITIs where the backends disagree, e.g., TensorFlow: banana, CNTK: Arabian camel, Theano: Indian elephant; TensorFlow: groom, CNTK: tennis ball, Theano: tennis ball; TensorFlow: hen, CNTK: groom, Theano: hen.]
Detect inconsistencies
● An inconsistency is reported for a pair of implementations when more than p% of the inputs in the validation set are ITIs.
[Figure: InceptionResNetV2 on TensorFlow vs. CNTK over the validation set, with D_CLASS values such as 16, 6, and 0; T_C = 8, p = 10%.]
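A sketch of this detection rule, reusing the hypothetical d_class helper sketched earlier; the thresholds follow the slide.

```python
def is_inconsistent(preds_a, preds_b, truth_idxs, t_c=8, p=0.10, k=5):
    """Report an inconsistency when more than a fraction p of the validation inputs are ITIs,
    i.e., have a CLASS-based distance above t_c."""
    itis = sum(
        d_class(pa, pb, t, k) > t_c
        for pa, pb, t in zip(preds_a, preds_b, truth_idxs)
    )
    return itis / len(truth_idxs) > p
```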
CRADLE: Localization phase
[Figure: the overview diagram with the localization phase highlighted (hidden-states extractor and inconsistency localizer, producing localization maps for the inconsistency bugs).]
Hidden state extractor
● The "most inconsistent" input per inconsistency is used.
● The network structure plus the hidden states form the network execution graph.
● Hidden states are the outputs of hidden layers.
[Figure: InceptionResNetV2 execution graph on TensorFlow for the input image "jean": Conv2D → BatchNorm → Activation → ... (776 layers omitted) → GloAvgPool → Dense; TensorFlow predicts "jean".]
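One way to extract hidden states with multi-backend Keras (as used in the paper), shown as a sketch; the random placeholder input and the choice to skip the input layer are illustrative.

```python
import numpy as np
import keras  # multi-backend Keras; the backend is chosen via the KERAS_BACKEND environment variable

# Build a probe model that exposes every hidden layer's output (the "hidden states").
model = keras.applications.inception_resnet_v2.InceptionResNetV2(weights="imagenet")
probe = keras.models.Model(inputs=model.input,
                           outputs=[layer.output for layer in model.layers[1:]])

x = np.random.rand(1, 299, 299, 3).astype("float32")  # placeholder for a real preprocessed image
hidden_states = probe.predict(x)  # one array per hidden layer, in graph order
```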
MAD differences
[Figure: the execution graphs of InceptionResNetV2 on TensorFlow (predicts "jean") and CNTK (predicts "mail bag") for the same input "jean", aligned layer by layer (Conv2D → BatchNorm → Activation → ... 776 layers omitted ... → GloAvgPool → Dense). Each pair of corresponding hidden states is annotated with its MAD difference, e.g., 0.0 at the first Conv2D, 0.0002 at BatchNorm, 0.1480 at Activation, 0.0860 at GloAvgPool, and 0.0004 at the final Dense layer.]
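A sketch of this per-layer comparison, assuming hidden_states_tf and hidden_states_cntk are the lists produced by the probe model above for the two backends on the same input:

```python
import numpy as np

def layer_mads(hidden_states_a, hidden_states_b):
    """MAD between corresponding hidden states of two backends, one value per layer."""
    return [float(np.mean(np.abs(a - b))) for a, b in zip(hidden_states_a, hidden_states_b)]
```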
Inconsistency introduction rate
● Calculate the rate of change of the per-layer deviation: R_l = (δ_l - δ_pre) / (δ_pre + ε), where δ_pre is the deviation at the preceding layer(s)
○ ε prevents division by zero
● Highlight executions with R above the third quartile
[Figure: InceptionResNetV2 localization map between TensorFlow (predicts "jean") and CNTK (predicts "mailbag") for the input "jean"; 772 layers omitted. Each layer is annotated with its deviation and rate, e.g., the first Conv2D has δ = 0.0 and R = 0.0, the following BatchNorm jumps to δ = 0.0002 with R = 2048.6, and the final Dense layer has δ = 0.0004 and R = -0.9950. The BatchNorm layers show by far the largest rates (R = 2048.6 and R = 21.186) and are highlighted.]
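A sketch of this localization step for a simple chain of layers; the per-layer MADs come from the layer_mads helper above, and generalizing δ_pre to the maximum over a layer's predecessors for branching graphs is my assumption.

```python
import numpy as np

def introduction_rates(mads, eps=1e-7):
    """Rate of change of the per-layer deviation along a chain of layers."""
    rates = [0.0]
    for prev, cur in zip(mads, mads[1:]):
        rates.append((cur - prev) / (prev + eps))
    return rates

def suspicious_layers(mads):
    """Indices of layers whose rate is above the third quartile: the localization candidates."""
    rates = introduction_rates(mads)
    threshold = np.percentile(rates, 75)
    return [i for i, r in enumerate(rates) if r > threshold]
```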
Results
● 3 backends, 28 models, 11 datasets
● 104 unique inconsistencies
● 7 inconsistency bugs
● 5 crash bugs
7 inconsistency bugs
● Batch normalization (BatchNormalization)
● Padding scheme (Conv2D variants)
● Pooling scheme (AveragePooling2D)
● Parameter organization (trainable Conv)
Localization is helpful
● The localization maps are relevant to the causes of all 104 unique inconsistencies.
[Chart: for each inconsistency, whether the highlighted layer is the first faulty layer, one of the faulty layers, or otherwise relevant.]
Conclusion
● CRADLE applies differential testing to DL implementations and localizes faulty functions by tracking error propagation.
○ Detects 7 confirmed inconsistency bugs and 5 crash bugs
○ Helps find the root causes of all 104 unique inconsistencies using localization maps
● Inconsistencies are common and widespread.
● We call for more attention to the testing of DL libraries.
DL system overview
[Figure: the DL software stack: user code on top of high-level library interfaces (Keras, ...), which run on low-level backend libraries (TensorFlow, Theano, CNTK), which execute on hardware (CPU, GPU).]
Grouping unique inconsistencies
● A unique inconsistency is a group of inconsistencies with the same inconsistency pattern between the same pair of implementations.
○ The inconsistency pattern is the distribution of the metric distances.
Suggested settings
● Grid search over T_C, T_M, and p values
● Optimal settings (most inconsistencies detected without false negatives or false positives):
○ CLASS-based: T_C = 8 and p = 0%
○ MAD-based: T_M = 0.2 and p = 0%
● Confirmed using cross-validation
Datasets and hardware
● Datasets:
○ 11 datasets including ImageNet, MNIST, the Udacity Driving Challenge 2, etc.
○ 30 pre-trained models
● Hardware:
○ Intel Xeon E5-2695
○ NVIDIA Titan Xp
Detected inconsistencies
[Table of detected inconsistencies; the numbers outside and (inside) brackets are the unique and (total) numbers of inconsistencies, respectively.]
Comparison to accuracy
● Baseline: detect an inconsistency if the top-k accuracy difference is above a threshold T_AC
● We pick k between 1 and 5 and T_AC between 0 and 50
● With T_AC = 0, top-1 accuracy detects the most inconsistencies (305) but still misses 35
○ E.g., for the Dog species model, the Batch_normalization bugs induce an inconsistency between TensorFlow and CNTK
○ However, the two backends have identical top-1 (29.9%) and top-5 (64.4%) accuracies
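A sketch of this accuracy-difference baseline for comparison; function names are mine, and accuracies are expressed as fractions here (the slide's T_AC range of 0 to 50 is in percentage points).

```python
import numpy as np

def topk_accuracy(probs, truth_idxs, k=1):
    """Fraction of inputs whose ground-truth class appears among the top-k predictions."""
    topk = np.argsort(probs, axis=1)[:, ::-1][:, :k]
    return float(np.mean([t in row for t, row in zip(truth_idxs, topk)]))

def accuracy_baseline_flags(preds_a, preds_b, truth_idxs, k=1, t_ac=0.0):
    """Flag an inconsistency when the two backends' top-k accuracies differ by more than t_ac."""
    diff = abs(topk_accuracy(preds_a, truth_idxs, k) - topk_accuracy(preds_b, truth_idxs, k))
    return diff > t_ac
```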
Future work
● Detect inconsistencies and bugs in training code
○ Harder since training is non-deterministic
● Generate mutated models using fuzzing to expand the testing set
● Test with only one backend using equivalent models