1 Explaining Datasets through High-Accuracy Regions
Ina Fiterau, Carnegie Mellon University
Artur Dubrawski, Carnegie Mellon University
Women in Machine Learning Workshop, 12th of December 2011
Work under review at the SIAM Data Mining Conference

2 Outline
• Motivation: the need for interpretability
• Explanation-Oriented Partitioning (EOP)
• Evaluation of EOP

3 Example Application: Nuclear Threat Detection
• Border control: vehicles are scanned
• Human in the loop interpreting results
[Diagram labels: vehicle scan, prediction, feedback]

4 Boosted Decision Stumps
• Accurate, but hard to interpret
How is the prediction derived from the input?
[Image obtained with the AdaBoost applet.]

5 Decision Tree – More Interpretable
[Decision tree diagram with yes/no branches]
Decision nodes: Radiation > x%; Payload type = ceramics; Uranium level > max. admissible for ceramics
Further step: Consider balance of Th232, Ra226 and Co60
Leaves: Threat, Clear

6 Motivation
Many users are willing to trade accuracy to better understand the system-yielded results.
Need: simple, interpretable model
Need: explanatory prediction process

7 Explanation-Oriented Partitioning (EOP)

8 Explanation-Oriented Partitioning (EOP) Execution Example – 3D Data
[Figure: 3D scatter plot and its (X,Y) plot; labels: Uniform cube, 2 Gaussians]

9-10 EOP Execution Example – 3D Data
Step 1: Select a projection - (X1, X2)

11-12 EOP Execution Example – 3D Data
Step 2: Choose a good classifier - call it h1

13-14 EOP Execution Example – 3D Data
Step 3: Estimate accuracy of h1 at each point
[Figure labels: OK, NOT OK]

15-16 EOP Execution Example – 3D Data
Step 4: Identify high-accuracy regions

17-18 EOP Execution Example – 3D Data
Step 5: Training points - removed from consideration

19 EOP Execution Example – 3D Data
Finished first iteration

20 EOP Execution Example – 3D Data
Iterate until all data is accounted for or error cannot be decreased
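
The five steps and the stopping rule above can be sketched as a single training loop. This is a minimal illustration, not the authors' implementation. It makes several assumptions the slides do not state: the base classifier is a decision stump, a region is a single axis-aligned box, and the winning projection is the one whose box covers the most remaining points. The names `fit_stump`, `find_box`, `train_eop` and the `min_acc` threshold are hypothetical.

```python
from itertools import combinations

def fit_stump(X, y, proj):
    """Step 2 (assumed base learner): best single-threshold rule on a projected feature."""
    best = None
    for f in proj:
        for t in sorted({x[f] for x in X}):
            for hi_label in (0, 1):
                hits = sum((hi_label if x[f] > t else 1 - hi_label) == lab
                           for x, lab in zip(X, y))
                if best is None or hits > best[0]:
                    best = (hits, f, t, hi_label)
    _, f, t, hi_label = best
    return lambda x: hi_label if x[f] > t else 1 - hi_label

def find_box(X, y, clf, proj, min_acc=0.9):
    """Steps 3-4 (crude placeholder): bounding box of the correctly classified
    points, kept only if training accuracy inside the box reaches min_acc."""
    ok = [x for x, lab in zip(X, y) if clf(x) == lab]
    if not ok:
        return None
    lo = [min(x[f] for x in ok) for f in proj]
    hi = [max(x[f] for x in ok) for f in proj]
    inside = lambda x: all(l <= x[f] <= h for f, l, h in zip(proj, lo, hi))
    members = [(x, lab) for x, lab in zip(X, y) if inside(x)]
    acc = sum(clf(x) == lab for x, lab in members) / len(members)
    return inside if acc >= min_acc else None

def train_eop(X, y, max_iters=10):
    """Returns (model, remaining): a list of (projection, classifier, region)
    triples and the indices of training points no region accounts for."""
    remaining = list(range(len(X)))
    model = []
    for _ in range(max_iters):
        if not remaining:                      # all data accounted for
            break
        pts = [X[i] for i in remaining]
        labs = [y[i] for i in remaining]
        best = None
        for proj in combinations(range(len(X[0])), 2):   # Step 1: candidate projections
            clf = fit_stump(pts, labs, proj)
            region = find_box(pts, labs, clf, proj)
            if region is None:
                continue
            covered = [i for i in remaining if region(X[i])]
            if best is None or len(covered) > len(best[3]):
                best = (proj, clf, region, covered)
        if best is None:                       # no acceptable region: stop
            break
        model.append(best[:3])
        done = set(best[3])
        remaining = [i for i in remaining if i not in done]  # Step 5
    return model, remaining

# Made-up 3D toy data: the label depends only on the first feature.
X = [(0.1, 0.2, 0.9), (0.2, 0.8, 0.1), (0.3, 0.5, 0.5),
     (0.7, 0.1, 0.4), (0.8, 0.9, 0.6), (0.9, 0.4, 0.2)]
y = [0, 0, 0, 1, 1, 1]
model, remaining = train_eop(X, y)
```

The loop stops either when every training point falls in some accepted region or when no projection yields a region that meets the accuracy threshold, mirroring the stopping rule on slide 20.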

21 Learned Model – Processing Query [x1 x2 x3]
[x1 x2] in R1?  yes: h1(x1 x2)  no: next test
[x2 x3] in R2?  yes: h2(x2 x3)  no: next test
[x1 x3] in R3?  yes: h3(x1 x3)  no: Default Value
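
The query procedure on this slide is a decision list: test the regions in order, answer with the classifier of the first region that contains the projected query, and fall back to a default value otherwise. The sketch below illustrates that control flow only. The box-shaped regions, the toy classifiers and the names `in_box`, `predict` and `rules` are hypothetical stand-ins for the learned R1..R3 and h1..h3.

```python
def in_box(lo, hi):
    """Region test for an axis-aligned box in the projected space."""
    return lambda p: all(l <= v <= h for v, l, h in zip(p, lo, hi))

def predict(rules, x, default):
    """rules: ordered list of (projection, region test, classifier)."""
    for proj, in_region, clf in rules:
        p = tuple(x[i] for i in proj)      # project the query onto this rule's features
        if in_region(p):
            return clf(p)                  # first matching region decides
    return default                         # no region contains the query

# Hypothetical learned model on projections (x1,x2), (x2,x3), (x1,x3).
rules = [
    ((0, 1), in_box((0, 0), (1, 1)), lambda p: "A" if p[0] > 0.5 else "B"),
    ((1, 2), in_box((2, 2), (3, 3)), lambda p: "B"),
    ((0, 2), in_box((5, 5), (6, 6)), lambda p: "A"),
]

label = predict(rules, (0.7, 0.2, 9.0), default="unknown")
```

Because each answer comes from exactly one region and one low-dimensional classifier, the explanation for a prediction is the pair (region, classifier) that fired.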

22-24 Parametric Regions of High Confidence (Bounding Polyhedra)
• Enclose points in simple convex shapes (multiple per iteration)
• Grow contour while train error is ≤ ε
• Calibration on hold-out set - remove shapes that:
  • do not contain calibration points
  • over which the classifier is not accurate
• Intuitive, visually appealing - hyper-rectangles/spheres
[Figure labels: decision, incorrectly classified, correctly classified]
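
A minimal sketch of the two ideas on these slides, assuming hyper-rectangles as the convex shape: grow a box from a seed point while the training error inside it stays at or below ε, then calibrate on a hold-out set. The greedy nearest-first growth order, the `min_acc` threshold and the names `grow_box`, `calibrate` and `contains` are assumptions made for illustration; the slides do not specify them.

```python
def contains(lo, hi, p):
    return all(l <= v <= h for v, l, h in zip(p, lo, hi))

def grow_box(points, correct, seed, eps=0.0):
    """Grow an axis-aligned box from points[seed] (assumed correctly classified).
    correct[i] is True when the classifier is right on points[i]."""
    lo, hi = list(points[seed]), list(points[seed])

    def error(lo, hi):
        inside = [c for p, c in zip(points, correct) if contains(lo, hi, p)]
        return 1 - sum(inside) / len(inside)   # the seed is always inside

    def dist2(i):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[seed]))

    # Try to absorb correctly classified points, nearest to the seed first.
    for i in sorted((i for i, c in enumerate(correct) if c), key=dist2):
        new_lo = [min(l, v) for l, v in zip(lo, points[i])]
        new_hi = [max(h, v) for h, v in zip(hi, points[i])]
        if error(new_lo, new_hi) <= eps:       # grow only while train error <= eps
            lo, hi = new_lo, new_hi
    return tuple(lo), tuple(hi)

def calibrate(boxes, cal_points, cal_correct, min_acc=0.9):
    """Drop boxes that hold no calibration points or where the classifier is inaccurate."""
    kept = []
    for lo, hi in boxes:
        inside = [c for p, c in zip(cal_points, cal_correct) if contains(lo, hi, p)]
        if inside and sum(inside) / len(inside) >= min_acc:
            kept.append((lo, hi))
    return kept

# Made-up 2D example: the point (2, 2) is misclassified.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (3, 3)]
correct = [True, True, True, True, False, True]
tight = grow_box(points, correct, seed=0, eps=0.0)
loose = grow_box(points, correct, seed=0, eps=0.2)
```

With ε = 0 the box stops short of the misclassified point; a larger ε lets it absorb that point in exchange for covering more data.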

25 Outline
• Motivation: the need for interpretability
• Explanation-Oriented Partitioning (EOP)
• Evaluation of EOP
• Summary

26-28 Benefits of EOP - Avoiding Needless Complexity
Typical XOR dataset
CART
• is accurate
• takes many iterations
• does not uncover or leverage structure of data
EOP
• equally accurate
• uncovers structure
[Figure: regions found in Iteration 1 and Iteration 2; classes marked + and o]

29 Comparison to Boosting
• What is the price of understandability?
• Why boosting?
  • It is an [arguably] good black-box classifier
  • Learns an ensemble using any type of classifier
  • Iteratively targets data misclassified earlier
• Criterion: complexity of the resulting model = number of vector operations to make a prediction
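
The slide does not say how vector operations are counted, so the following is only one plausible convention, stated as an assumption: a boosted ensemble evaluates every base classifier on each query, while a decision list pays one operation per region test plus one classifier evaluation for the region that fires. The names `boosting_ops` and `eop_ops` are hypothetical.

```python
def boosting_ops(n_base_classifiers):
    """Assumed cost model: every base classifier is evaluated for every query."""
    return n_base_classifiers

def eop_ops(rules, x):
    """Assumed cost model: one operation per region test, plus one classifier
    evaluation once a region contains the query.
    rules: ordered list of (projection, region test, classifier)."""
    ops = 0
    for proj, in_region, _clf in rules:
        ops += 1                                   # region membership test
        if in_region(tuple(x[i] for i in proj)):
            return ops + 1                         # evaluate that region's classifier
    return ops                                     # fell through to the default value

# Hypothetical two-rule model; classifiers are irrelevant to the count.
rules = [
    ((0, 1), lambda p: p[0] < 0, None),
    ((1, 2), lambda p: p[0] > 0, None),
]
cost = eop_ops(rules, (1, 2, 3))
```

Under this convention the decision-list cost depends on the query, so a per-model figure would have to be averaged over a set of queries.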

30 Comparison to Boosting - Setup
• Problem: binary classification
• 10D Gaussians/uniform cubes for each class
• Statistical significance: repeat experiment with several datasets and compute paired t-test p-values
• Results obtained through 5-fold cross-validation
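
The evaluation protocol can be sketched with the standard library alone: split each dataset into 5 folds, score both methods per dataset, and compare the paired scores. The sketch computes only the paired t statistic and its degrees of freedom; turning that into a p-value needs the t distribution (for example `scipy.stats.ttest_rel`). The interleaved fold assignment and the score lists are made up for illustration and are not results from the talk.

```python
from math import sqrt
from statistics import mean, stdev

def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.
    Folds are interleaved; shuffle the data first if its order is meaningful."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched score lists."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d))), len(d) - 1

# Made-up per-dataset scores for two methods (one pair per dataset).
method_a = [3, 5, 7, 9]
method_b = [1, 2, 6, 7]
t_stat, dof = paired_t(method_a, method_b)
splits = list(k_fold_indices(10, k=5))
```

Pairing matters here because both methods are run on the same datasets, so the test is applied to the per-dataset differences rather than to the two score lists independently.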

31 EOP vs AdaBoost - SVM Base Classifiers
• EOP is often less accurate, but not significantly
• The reduction of complexity is statistically significant
[Figure: accuracy and complexity on datasets 1 to 10, Boosting vs EOP (nonparametric)]
Accuracy p-value: 0.832
Complexity p-value: 0.003

32 EOP (Stumps as Base Classifiers) vs CART - Data from the UCI Repository
• CART is the most accurate
• Parametric EOP yields the simplest models
[Figure: accuracy and complexity of CART, EOP N. and EOP P. on BT, V, MB and BCW]

Dataset         # of Features   # of Points
Breast Tissue   10              1006
Vowel           9               990
MiniBOONE       10              5000
Breast Cancer   10              596

33 Explaining Real Data - Spambase
1st iteration
• Classifier labels everything as spam
• High-confidence regions enclose mostly spam, where:
  • incidence of the word ‘your’ is low
  • length of text in capital letters is high

34 Explaining Real Data - Spambase
2nd iteration
• The threshold for the incidence of ‘your’ is lowered
• The required incidence of capitals is increased
• The square region on the left also encloses examples that will be marked as ‘not spam’

35 Explaining Real Data - Spambase
3rd iteration
• Classifier marks everything as spam
• Frequencies of ‘your’ and ‘hi’ determine the regions

36 Summary
• EOP maintains classification accuracy but uses less complex models when compared to Boosting
• EOP with decision stumps finds less complex models than CART at the price of a small decrease in accuracy
• EOP gives interpretable high-accuracy regions
• We are currently testing EOP in a range of practical application scenarios

37 Thank You

38 Extra Results

39 Explaining Real Data - Fuel
