Explaining Datasets through High-Accuracy Regions



  1. Explaining Datasets through High-Accuracy Regions
     Ina Fiterau, Carnegie Mellon University
     Artur Dubrawski, Carnegie Mellon University
     Women in Machine Learning Workshop, 12th of December 2011
     Work under review at the SIAM Data Mining Conference

  2. Outline
     - Motivation: the need for interpretability
     - Explanation-Oriented Partitioning (EOP)
     - Evaluation of EOP

  3. Example Application: Nuclear Threat Detection
     - Border control: vehicles are scanned
     - Human in the loop interpreting results
     [Figure labels: vehicle scan, prediction, feedback]

  4. Boosted Decision Stumps
     - Accurate, but hard to interpret
     - How is the prediction derived from the input?
     Image obtained with the AdaBoost applet.

  5. Decision Tree – More Interpretable
     [Figure: decision tree with yes/no branches]
     - Radiation > x%
     - Payload type = ceramics
     - Uranium level > max. admissible for ceramics
     - Consider balance of Th232, Ra226 and Co60
     - Leaves: Threat, Clear

  6. Motivation
     Many users are willing to trade accuracy for a better understanding of the results the system yields.
     - Need: simple, interpretable model
     - Need: explanatory prediction process


  8. Explanation-Oriented Partitioning (EOP) Execution Example – 3D Data
     [Figure: 3D data drawn from a uniform cube and 2 Gaussians, with its (X,Y) plot]

  9. EOP Execution Example – 3D Data
     Step 1: Select a projection, (X1, X2)

  10. EOP Execution Example – 3D Data
      Step 1: Select a projection, (X1, X2)

  11. EOP Execution Example – 3D Data
      Step 2: Choose a good classifier, call it h1

  12. EOP Execution Example – 3D Data
      Step 2: Choose a good classifier, call it h1

  13. EOP Execution Example – 3D Data
      Step 3: Estimate accuracy of h1 at each point
      [Figure labels: OK, NOT OK]

  14. EOP Execution Example – 3D Data
      Step 3: Estimate accuracy of h1 for each point

  15. EOP Execution Example – 3D Data
      Step 4: Identify high-accuracy regions

  16. EOP Execution Example – 3D Data
      Step 4: Identify high-accuracy regions

  17. EOP Execution Example – 3D Data
      Step 5: Training points in the high-accuracy regions are removed from consideration

  18. EOP Execution Example – 3D Data
      Step 5: Training points in the high-accuracy regions are removed from consideration

  19. EOP Execution Example – 3D Data
      Finished first iteration

  20. EOP Execution Example – 3D Data
      Iterate until all data is accounted for or the error cannot be decreased
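
The five steps on slides 9 to 20 can be sketched as a short training loop. This is only a minimal illustration under several assumptions that the slides do not state: the base classifier is a decision stump, a region is a single axis-aligned box found by brute force, and the projection kept in each iteration is the one whose box covers the most remaining points. All function names (`fit_stump`, `best_box`, `train_eop`) are hypothetical.

```python
from itertools import combinations

def in_box(z, lo, hi):
    """True if the 2-D point z lies inside the axis-aligned box [lo, hi]."""
    return all(l <= v <= u for l, v, u in zip(lo, z, hi))

def fit_stump(zs, ys):
    """Step 2 (assumed base classifier): best single-axis threshold rule."""
    best = None
    for axis in (0, 1):
        values = sorted({z[axis] for z in zs})
        # the extra threshold above the maximum allows constant predictions
        for t in values + [values[-1] + 1.0]:
            for sign in (1, -1):
                acc = sum(int(sign * (z[axis] - t) >= 0) == y
                          for z, y in zip(zs, ys))
                if best is None or acc > best[0]:
                    best = (acc, axis, t, sign)
    _, axis, t, sign = best
    return lambda z: int(sign * (z[axis] - t) >= 0)

def best_box(zs, correct, eps):
    """Step 4 (assumed region shape): among boxes spanned by two correctly
    classified points, return the one with training error <= eps that
    covers the most points, or None if there is none."""
    good = [z for z, c in zip(zs, correct) if c]
    best, best_cover = None, 0
    for a in good:
        for b in good:
            lo = (min(a[0], b[0]), min(a[1], b[1]))
            hi = (max(a[0], b[0]), max(a[1], b[1]))
            flags = [c for z, c in zip(zs, correct) if in_box(z, lo, hi)]
            err = 1 - sum(flags) / len(flags)
            if err <= eps and len(flags) > best_cover:
                best, best_cover = (lo, hi), len(flags)
    return best

def train_eop(X, y, eps=0.0, max_iter=10):
    """Return (model, remaining): a list of (projection, classifier, box)
    and the indices of training points no region accounts for."""
    remaining = list(range(len(X)))
    model = []
    for _ in range(max_iter):
        if not remaining:
            break
        best = None
        for i, j in combinations(range(len(X[0])), 2):    # Step 1: projection
            zs = [(X[k][i], X[k][j]) for k in remaining]
            ys = [y[k] for k in remaining]
            h = fit_stump(zs, ys)                          # Step 2: classifier
            correct = [h(z) == t for z, t in zip(zs, ys)]  # Step 3: OK / NOT OK
            box = best_box(zs, correct, eps)               # Step 4: region
            if box is None:
                continue
            covered = [k for k, z in zip(remaining, zs) if in_box(z, *box)]
            if best is None or len(covered) > len(best[3]):
                best = ((i, j), h, box, covered)
        if best is None:          # no acceptable region: stop iterating
            break
        proj, h, box, covered = best
        model.append((proj, h, box))
        gone = set(covered)                                # Step 5: remove points
        remaining = [k for k in remaining if k not in gone]
    return model, remaining
```

With `eps=0.0`, every stored box contains only training points that its classifier labels correctly, so the loop trades coverage per iteration for purity; a larger `eps` would allow bigger regions at the cost of some training error inside them.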

  21. Learned Model – Processing Query [x1 x2 x3]
      - Is [x1 x2] in R1? yes: predict h1(x1 x2); no: continue
      - Is [x2 x3] in R2? yes: predict h2(x2 x3); no: continue
      - Is [x1 x3] in R3? yes: predict h3(x1 x3); no: return the default value
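
The query procedure on slide 21 behaves like a decision list: regions are tested in order and the first one that contains the projected query decides which classifier answers. The sketch below assumes box-shaped regions, and the classifiers, bounds and the `predict` and `box` names are invented purely for illustration.

```python
def box(lo, hi):
    """Return a membership test for the axis-aligned box [lo, hi]."""
    return lambda z: all(l <= v <= u for l, v, u in zip(lo, z, hi))

def predict(model, x, default):
    """Try each (dims, classifier, region test) in order; fall back to default."""
    for dims, h, in_region in model:
        z = tuple(x[d] for d in dims)      # project the query onto this pair
        if in_region(z):
            return h(z)
    return default

# Hypothetical learned model mirroring the slide: h1 on (x1, x2) with R1,
# h2 on (x2, x3) with R2, h3 on (x1, x3) with R3 (0-based feature indices).
model = [
    ((0, 1), lambda z: int(z[0] > 0.5), box((0.0, 0.0), (1.0, 1.0))),
    ((1, 2), lambda z: int(z[1] > 2.0), box((1.0, 1.0), (3.0, 3.0))),
    ((0, 2), lambda z: 1,               box((5.0, 5.0), (6.0, 6.0))),
]

print(predict(model, (0.9, 0.2, 9.0), default=0))   # answered by h1 in R1
print(predict(model, (9.0, 9.0, 9.0), default=-1))  # no region matches
```

Because each answer comes from exactly one classifier in one low-dimensional region, a user can be shown the pair of features, the region, and the rule that produced the prediction.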

  22. Parametric Regions of High Confidence (Bounding Polyhedra)
      - Enclose points in simple convex shapes (multiple per iteration)
      - Grow contour while train error is ≤ ε
      [Figure legend: decision, incorrectly classified, correctly classified]

  23. Parametric Regions of High Confidence (Bounding Polyhedra)
      - Enclose points in simple convex shapes (multiple per iteration)
      - Grow contour while train error is ≤ ε
      - Calibration on hold-out set: remove shapes that
        - do not contain calibration points
        - over which the classifier is not accurate
      [Figure legend: decision, incorrectly classified, correctly classified]

  24. Parametric Regions of High Confidence (Bounding Polyhedra)
      - Enclose points in simple convex shapes (multiple per iteration)
      - Grow contour while train error is ≤ ε
      - Calibration on hold-out set: remove shapes that
        - do not contain calibration points
        - over which the classifier is not accurate
      - Intuitive, visually appealing: hyper-rectangles/spheres
      [Figure legend: decision, incorrectly classified, correctly classified]
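
One possible reading of "grow contour while train error is ≤ ε", followed by the hold-out calibration, is sketched below for hyper-rectangles. The growth order (nearest correctly classified points first), the stopping rule, and the `min_acc` threshold are assumptions; the slides do not specify them, and `grow_box` and `calibrate` are hypothetical names.

```python
def in_box(p, lo, hi):
    """True if point p lies inside the axis-aligned box [lo, hi]."""
    return all(l <= v <= u for l, v, u in zip(lo, p, hi))

def grow_box(points, correct, seed, eps):
    """Grow a box from the correctly classified point `seed`, absorbing
    correct points from nearest to farthest, and stop at the first
    expansion that would push the training error inside above eps."""
    lo = hi = tuple(points[seed])
    order = sorted(
        (i for i, c in enumerate(correct) if c),
        key=lambda i: max(abs(a - b) for a, b in zip(points[i], points[seed])),
    )
    for i in order:
        new_lo = tuple(min(l, v) for l, v in zip(lo, points[i]))
        new_hi = tuple(max(h, v) for h, v in zip(hi, points[i]))
        flags = [c for p, c in zip(points, correct) if in_box(p, new_lo, new_hi)]
        if 1 - sum(flags) / len(flags) > eps:
            break
        lo, hi = new_lo, new_hi
    return lo, hi

def calibrate(boxes, cal_points, cal_correct, min_acc):
    """Drop boxes with no calibration points or with hold-out accuracy
    below min_acc (an assumed cut-off)."""
    kept = []
    for lo, hi in boxes:
        flags = [c for p, c in zip(cal_points, cal_correct) if in_box(p, lo, hi)]
        if flags and sum(flags) / len(flags) >= min_acc:
            kept.append((lo, hi))
    return kept

# Toy 2-D training points; `correct` marks where the classifier is right.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (3, 3), (4, 4), (5, 5)]
correct = [True, True, True, True, False, True, True]
boxes = [grow_box(points, correct, seed, eps=0.0) for seed in (0, 5)]
print(boxes)
```

Growing from several seeds yields the multiple shapes per iteration mentioned on the slide, and calibration then removes those that the hold-out set does not support.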

  25. Outline
      - Motivation: the need for interpretability
      - Explanation-Oriented Partitioning (EOP)
      - Evaluation of EOP
      - Summary


  27. Benefits of EOP – Avoiding Needless Complexity
      Typical XOR dataset
      CART:
      - is accurate
      - takes many iterations
      - does not uncover or leverage structure of data

  28. Benefits of EOP – Avoiding Needless Complexity
      Typical XOR dataset
      CART:
      - is accurate
      - takes many iterations
      - does not uncover or leverage structure of data
      EOP:
      - equally accurate
      - uncovers structure
      [Figure: EOP regions for Iteration 1 and Iteration 2, classes marked + and o]

  29. Comparison to Boosting
      - What is the price of understandability?
      - Why boosting?
        - It is an [arguably] good black-box classifier
        - Learns an ensemble using any type of classifier
        - Iteratively targets data misclassified earlier
      - Criterion: complexity of the resulting model = number of vector operations to make a prediction

  30. Comparison to Boosting – Setup
      - Problem: binary classification
      - 10D Gaussians/uniform cubes for each class
      - Statistical significance: repeat experiment with several datasets and compute paired t-test p-values
      - Results obtained through 5-fold cross-validation
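
For reference, the paired t-test named in the setup compares two methods on the same datasets by testing whether the mean of the per-dataset differences is zero. The sketch below computes only the test statistic and its degrees of freedom with the standard library; the function name `paired_t` is hypothetical and the numbers are made up for illustration, not taken from the experiments.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Made-up complexity scores of two methods on four datasets.
t, df = paired_t([10, 12, 9, 11], [8, 9, 8, 7])
print(t, df)
```

The p-value follows from the t distribution with `df` degrees of freedom; SciPy's `scipy.stats.ttest_rel` returns the statistic and the p-value together.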

  31. EOP vs AdaBoost – SVM Base Classifiers
      - EOP is often less accurate, but not significantly
      - The reduction of complexity is statistically significant
      [Figure: accuracy and complexity on datasets 1 to 10, Boosting vs EOP (nonparametric)]
      Accuracy p-value: 0.832
      Complexity p-value: 0.003

  32. EOP (Stumps as Base Classifiers) vs CART – Data from the UCI Repository
      - CART is the most accurate
      - Parametric EOP yields the simplest models
      [Figure: accuracy and complexity of CART, EOP N. and EOP P. on datasets BT, V, MB and BCW]

      Dataset          # of Features   # of Points
      Breast Tissue    10              1006
      Vowel            9               990
      MiniBOONE        10              5000
      Breast Cancer    10              596

  33. Explaining Real Data – Spambase
      - 1st iteration
        - classifier labels everything as spam
        - high-confidence regions do enclose mostly spam, and there:
          - incidence of the word 'your' is low
          - length of text in capital letters is high

  34. Explaining Real Data – Spambase
      - 2nd iteration
        - the threshold for the incidence of 'your' is lowered
        - the required incidence of capitals is increased
        - the square region on the left also encloses examples that will be marked as 'not spam'

  35. Explaining Real Data – Spambase
      - 3rd iteration
        - classifier marks everything as spam
        - frequencies of 'your' and 'hi' determine the regions

  36. Summary
      - EOP maintains classification accuracy but uses less complex models when compared to Boosting
      - EOP with decision stumps finds less complex models than CART at the price of a small decrease in accuracy
      - EOP gives interpretable high-accuracy regions
      - We are currently testing EOP in a range of practical application scenarios

  37. Thank You

  38. Extra Results

