
Multiple Instance Learning for Fast, Stable and Early RNN Predictions



  1. The Edge of Machine Learning: Multiple Instance Learning for Fast, Stable and Early RNN Predictions. Don Dennis, Microsoft Research India. Joint work with Chirag P., Harsha and Prateek. Accepted to NIPS ’18. 1

  2. Algorithms for the IDE - EdgeML • A library of machine learning algorithms • Trained on the cloud • Ability to run on the tiniest of IoT devices (e.g., Arduino Uno) 2

  3. Previous Work: EdgeML Classifiers • ProtoNN (Gupta et al., ICML ’17) • Bonsai (Kumar et al., ICML ’17) • Fast(G)RNN (Kusupati et al., NIPS ’18) Code: https://github.com/Microsoft/EdgeML 3

  4. Previous Work: EdgeML Applications • GesturePod (Patil et al., to be submitted) • Wake Word (work in progress) Code: en route 4

  5. Problem 5

  6. Problem • Given a time series data point, classify it into one of a set of classes. • GesturePod: – Data: accelerometer and gyroscope readings – Task: detect whether a gesture was performed 6

  7. Problem 7

  8. Problem 8

  9. Problem ProtoNN and Bonsai 9

  10. Problem Expensive! Prohibitive on IoT Devices ProtoNN and Bonsai 10

  11. RNNs are Expensive • For a time series data point X = (x_1, …, x_T): • T RNN updates are performed (one per time step). • T is determined by the data labelling process. Example: GesturePod – 2 seconds. 11

  12. RNNs are Expensive • For a time series data point X = (x_1, …, x_T): • T RNN updates are performed (one per time step). • T is determined by the data labelling process. Example: GesturePod – 2 seconds. 12
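A minimal sketch of the sequential update cost (a plain vanilla-RNN cell in NumPy with illustrative weight names W, U, b, v; not the EdgeML cell): the hidden state must be updated once per time step, so compute and latency grow linearly with T.

    import numpy as np

    def rnn_classify(X, W, U, b, v):
        """Classify a (T, d_in) time series with a vanilla RNN and a linear readout.

        W: (d_hid, d_in), U: (d_hid, d_hid), b, v: (d_hid,) -- illustrative names.
        """
        h = np.zeros(W.shape[0])
        for x_t in X:                         # T sequential updates; cannot be parallelised
            h = np.tanh(W @ x_t + U @ h + b)
        return v @ h                          # prediction score from the final hidden state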

  13. RNNs are Expensive Observe that k << T, where k is the length of the actual signature inside the T-step data point. • RNN runs over the longer data point – unnecessarily large T and prediction time. • Predictors must recognize signatures at different offsets – requires larger predictors. • Compute is sequential. • Prediction lag. 13

  14. RNNs are Expensive Solution? Approach 1 of 2: Exploit the fact that k << T and learn a smaller classifier. How? 14

  15. How? • STEP 1: Divide X into smaller instances. 15

  16. How? • STEP 1: Divide X into smaller instances. 16

  17. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. 17

  18. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. 18

  19. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. 19
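A minimal sketch of STEP 1, assuming instances are fixed-width sliding windows; the window width, stride and sampling rate below are illustrative hyperparameters, not values from the slides.

    import numpy as np

    def make_instances(X, width, stride):
        """Split a (T, d_in) series into overlapping (width, d_in) instances."""
        T = X.shape[0]
        starts = range(0, T - width + 1, stride)
        return np.stack([X[s:s + width] for s in starts])

    # e.g. a 2 s window at an assumed 50 Hz (T = 100, 6 sensor channels),
    # split into 1 s instances every 0.25 s: make_instances(X, 50, 12) -> (5, 50, 6)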

  20. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. Note! Most of the instances are just noise. 20

  21. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. 21

  22. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. Candidate approach: Robust Learning. 22

  23. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. Robust Learning – standard techniques don’t apply: • Too much noise. • Ignores temporal structure of the data. 23

  24. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. Robust Learning – standard techniques don’t apply: • Too much noise. • Ignores temporal structure of the data. Candidate approach: Traditional Multiple Instance Learning (MIL). 24

  25. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. Robust Learning – standard techniques don’t apply: • Too much noise. • Ignores temporal structure of the data. Traditional Multiple Instance Learning (MIL) – standard techniques don’t apply: • Heterogeneous. • Ignores temporal structure of the data. 25

  26. How? Exploit temporal locality with MIL/Robust Learning techniques. Property 1: Positive instances are clustered together. Property 2: The number of positive instances can be estimated. 26

  27. Algorithm: MI-RNN Two-phase algorithm – alternates between identifying positive instances and training on the positive instances. 27

  28. Algorithm: MI-RNN • Step 1: Assign labels. Each instance initially inherits the label of its source data point. 28

  29. Algorithm: MI-RNN • Step 1: Assign labels. Each instance initially inherits the label of its source data point. 29

  30. Algorithm: MI-RNN • Step 1: Assign labels. Each instance initially inherits the label of its source data point. 30

  31. Algorithm: MI-RNN • Step 2: Train classifier on this data 31

  32. Algorithm: MI-RNN • Step 2: Train classifier on this data. True positive instances are correctly labeled. 32

  33. Algorithm: MI-RNN • Step 2: Train classifier on this data. True positive instances are correctly labeled; mislabeled instances are common to all classes. 33

  34. Algorithm: MI-RNN • Step 2: Train classifier on this data. Mislabeled instances are common to all classes. 34

  35. Algorithm: MI-RNN • Step 2: Train classifier on this data. Mislabeled instances are common to all classes, so the classifier will be confused on them – low prediction confidence. 35

  36. Algorithm: MI-RNN • Step 3: Top-κ. Wherever possible, use the classifier’s prediction score to pick the top-κ instances. The selection should satisfy Property 1 and Property 2. 36

  37. Algorithm: MI-RNN • Step 3: Top-κ. Wherever possible, use the classifier’s prediction score to pick the top-κ instances. The selection should satisfy Property 1 and Property 2. 37
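One way Step 3 can respect both properties, sketched below: among all runs of κ consecutive instances, keep the run with the highest summed score for the bag’s class and mark everything else as noise. This is an illustrative selection rule, not necessarily the paper’s exact one.

    import numpy as np

    def relabel_top_kappa(scores, kappa, bag_label, noise_label=0):
        """scores: (n,) classifier confidence for the bag's class, one per instance.

        Property 1 -> positives are contiguous, so only windows of length kappa are searched.
        Property 2 -> kappa itself is known or estimated beforehand.
        """
        n = len(scores)
        best = max(range(n - kappa + 1), key=lambda s: scores[s:s + kappa].sum())
        labels = np.full(n, noise_label)
        labels[best:best + kappa] = bag_label
        return labels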

  38. Algorithm: MI-RNN • Step 4: Repeat with new labels 38
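Putting Steps 1–4 together, a rough sketch of the alternating two-phase loop. train_classifier and predict_scores are placeholders for instance-level RNN training and scoring, relabel_top_kappa is the helper sketched above, and for simplicity every bag is treated as a positive bag; the actual schedule is in the paper and the EdgeML code.

    import numpy as np

    def mi_rnn(bags, bag_labels, kappa, rounds, train_classifier, predict_scores):
        """bags: list of (n_instances, width, d_in) arrays; bag_labels: their class ids."""
        # Round 0: every instance inherits the label of its source data point (Step 1).
        inst_labels = [np.full(len(b), y) for b, y in zip(bags, bag_labels)]
        model = None
        for _ in range(rounds):
            model = train_classifier(bags, inst_labels)                      # Step 2
            inst_labels = [relabel_top_kappa(predict_scores(model, b)[:, y], kappa, y)
                           for b, y in zip(bags, bag_labels)]                # Steps 3 + 4
        return model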

  39. MI-RNN: Does It Work? 39

  40. MI-RNN: Does It Work? • Of course! 40

  41. MI-RNN: Does It Work? • Of course! • Theoretical analysis: converges to the global optimum in linear time for nice data. 41

  42. MI-RNN: Does It Work? • Of course! • Theoretical analysis: converges to the global optimum in linear time for nice data. • Experiments: significantly improves accuracy while saving computation – various tasks: activity recognition, audio keyword detection, gesture recognition. 42

  43. MI-RNN: Does It Work?

    Dataset                        Hidden Dim   LSTM    MI-RNN   Savings %
    HAR-6 (Activity detection)      8           89.54   91.92    62.5
                                   16           92.90   93.89
                                   32           93.04   91.78
    Google-13 (Audio)              16           86.99   89.78    50.5
                                   32           89.84   92.61
                                   64           91.13   93.16
    WakeWord-2 (Audio)              8           98.07   98.08    50.0
                                   16           98.78   99.07
                                   32           99.01   98.96

  44. MI-RNN: Does It Work? MI-RNN better than LSTM almost always.

    Dataset                        Hidden Dim   LSTM    MI-RNN   Savings %
    HAR-6 (Activity detection)      8           89.54   91.92    62.5
                                   16           92.90   93.89
                                   32           93.04   91.78
    Google-13 (Audio)              16           86.99   89.78    50.5
                                   32           89.84   92.61
                                   64           91.13   93.16
    WakeWord-2 (Audio)              8           98.07   98.08    50.0
                                   16           98.78   99.07
                                   32           99.01   98.96

  45. MI-RNN: Does It Work? MI-RNN better than LSTM almost always.

    Dataset                           Hidden Dim   LSTM    MI-RNN   Savings %
    GesturePod-6 (Gesture detection)   8           -       98.00    50
                                      32           94.04   99.13
                                      48           97.13   98.43
    DSA-19 (Activity detection)       32           84.56   87.01    28
                                      48           85.35   89.60
                                      64           85.17   88.11

  46. MI-RNN: Savings?

    Dataset        LSTM Hidden Dim   LSTM    MI-RNN Hidden Dim   MI-RNN   Savings   Savings at ~1% drop
    HAR-6          32                93.04   16                  93.89    10.5x     42x
    Google-13      64                91.13   32                  92.61    8x        32x
    WakeWord-2     32                99.01   16                  99.07    8x        32x
    GesturePod-6   48                97.13   8                   98.00    72x       -
    DSA-19         64                85.17   32                  87.01    5.5x      -

  47. MI-RNN: Savings? MI-RNN achieves the same or better accuracy with ½ or ¼ of the LSTM hidden dimension.

    Dataset        LSTM Hidden Dim   LSTM    MI-RNN Hidden Dim   MI-RNN   Savings   Savings at ~1% drop
    HAR-6          32                93.04   16                  93.89    10.5x     42x
    Google-13      64                91.13   32                  92.61    8x        32x
    WakeWord-2     32                99.01   16                  99.07    8x        32x
    GesturePod-6   48                97.13   8                   98.00    72x       -
    DSA-19         64                85.17   32                  87.01    5.5x      -
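A back-of-the-envelope check on where such savings can come from (all numbers below are illustrative, not the paper’s accounting): per-step LSTM cost grows roughly as 4·d_hid·(d_hid + d_in) multiply-accumulates, and MI-RNN additionally runs over a short instance rather than the full data point.

    def lstm_cost(d_hid, d_in, steps):
        """Approximate multiply-accumulates: 4 gates of size d_hid x (d_hid + d_in), per step."""
        return 4 * d_hid * (d_hid + d_in) * steps

    # Hypothetical setting: 32 input features, full point of 128 steps vs. an
    # instance of 48 steps (62.5% fewer, as for HAR-6), hidden dim 32 vs. 16.
    print(lstm_cost(32, 32, 128) / lstm_cost(16, 32, 48))   # ~7x, same ballpark as the table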

  48. MI-RNN in Action Synthetic MNIST: Detecting the presence of Zero. 48

  49. MI-RNN in Action 49

  50. RNNs are Expensive Solution? Approach 2 of 2: Early Prediction. How? 50

  51. Can we do even better? • In many cases, looking at only a small prefix is enough to classify/reject. Early Prediction 51

  52. Can we do even better? • Existing work: – Assumes a pretrained classifier and uses secondary classifiers – Template-matching approaches – A separate policy for early classification • Not feasible! 52

  53. Early Prediction: Our Approach. Inference: predict at each step – stop as soon as prediction confidence is high. Training: incentivize early prediction by rewarding correct and early detections. 53
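A minimal sketch of the inference rule; the softmax-style readout and the 0.9 confidence threshold are assumptions for illustration, not values from the slides.

    import numpy as np

    def predict_early(X, step_fn, readout_fn, threshold=0.9):
        """step_fn(h, x_t) -> next hidden state (h=None means 'initialise');
        readout_fn(h) -> class probabilities. Returns (class, steps consumed)."""
        h, probs = None, None
        for t, x_t in enumerate(X, start=1):
            h = step_fn(h, x_t)
            probs = readout_fn(h)
            if probs.max() >= threshold:        # confident enough -> stop early
                return int(probs.argmax()), t
        return int(probs.argmax()), len(X)      # otherwise fall back to the full series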

  54. Algorithm: E-RNN Regular Loss: Early Loss: 54

  55. Algorithm: E-RNN Regular Loss: Early Loss: incentivizes early and consistent prediction. 55
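The loss terms above were equations that did not survive extraction. One plausible instantiation of the idea, assumed here for illustration rather than taken from the paper, is to average the classification loss over every prefix instead of applying it only at the final step.

    import numpy as np

    def early_loss(step_probs, y):
        """step_probs: (T, n_classes) class probabilities after each RNN step; y: true class.

        Regular loss would use only the final step (per_step[-1]); averaging over all
        steps rewards predictions that become correct early and stay correct.
        """
        per_step = -np.log(step_probs[:, y] + 1e-12)   # cross-entropy at each prefix length
        return per_step.mean()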

  56. E-RNN: How well does it work? 56

  57. E-RNN: How well does it work? • Abysmally bad. 57

  58. E-RNN: How well does it work? • Abysmally bad. • On GesturePod-6, we lose 10–12% accuracy when attempting to predict early. 58

  59. E-RNN: How well does it work? • Abysmally bad. • On GesturePod-6, we lose 10–12% accuracy when attempting to predict early. • Gets confused easily due to common prefixes! (Positive and negative data points can share a prefix.) 59
