Model Assertions for Monitoring and Improving ML Models


SLIDE 1

Model Assertions for Monitoring and Improving ML Models

Daniel Kang*, Deepti Raghavan*, Peter Bailis, Matei Zaharia (DAWN Project, Stanford InfoLab)

http://dawn.cs.stanford.edu/

SLIDE 2

Machine learning is deployed in mission-critical settings with few checks

» Errors can have life-changing consequences
» No standard way of doing quality assurance!

Tesla’s autopilot repeatedly accelerated towards lane dividers

Uber autonomous vehicle involved in fatal crash

SLIDE 3

Software 1.0 is also deployed in mission- critical settings!

Important software goes through a rigorous engineering / QA process:
» Assertions
» Unit tests
» Regression tests
» Fuzzing
» …

Software powers medical devices, etc.

SLIDE 4

This talk: Model assertions

a method for checking outputs of models for both runtime monitoring and improving model quality

Our research: Can we design QA methods that work across the ML deployment stack?

SLIDE 5

Key insight: models can make systematic errors

» Boxes of cars should not highly overlap
» Cars should not flicker in and out of video

We can specify errors in models without knowing root causes or fixes!

(see paper for examples)

SLIDE 6

“As the [automated driving system] changed the classification of the pedestrian several times — alternating between vehicle, bicycle, and other — the system was unable to correctly predict the path of the detected object,” the board’s report states.

SLIDE 7

Model assertions at deployment time

Frame 1 Frame 2 Frame 3

assert(cars should not flicker in and out)

Triggered assertions enable runtime monitoring and corrective action

SLIDE 8

Model assertions at train time

The set of inputs that triggered assertions is labeled for model retraining, either by humans (active learning, producing human-generated labels) or by correction rules (weak supervision, producing weak labels)

SLIDE 9

Outline

» Using model assertions
  » Overview
  » For active learning
  » For weak supervision
  » For monitoring

» Model assertions API & examples

» Evaluation of model assertions

SLIDE 10

Model assertions in context

Many users, potentially not the model builders, can collaboratively add assertions

SLIDE 11

Outline

» Using model assertions
  » Overview
  » For active learning
  » For weak supervision
  » For monitoring

» Model assertions API & examples

» Evaluation of model assertions

SLIDE 12

How should we select data points to label for active learning?

» Many assertions can flag the same data point
» The same assertion can flag many data points
» Which points should we label?

(Figure: data points flagged by Assertion 1 and Assertion 2 overlap)

SLIDE 13

How should we select data points to label for active learning?

» We designed a bandit algorithm for data selection (BAL)
» Idea: select the model assertions with the highest reduction in assertions triggered
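The selection idea can be sketched as a simple epsilon-greedy bandit. This is an illustrative stand-in for the BAL algorithm, not its exact implementation; the reward signal (per-assertion reduction in trigger rate after the last retraining round) and the epsilon value are assumptions.

```python
import random

def select_assertion(reductions, counts, epsilon=0.1, rng=random):
    """Pick which assertion to draw the next labeling batch from.

    reductions[i]: observed drop in assertion i's trigger rate after the
    last retraining round (the bandit's reward signal, an assumption).
    counts[i]: how many times assertion i has been selected so far.
    """
    # Occasionally explore an assertion at random.
    if rng.random() < epsilon:
        return rng.randrange(len(reductions))

    # Otherwise exploit: highest average reward per selection, with
    # unexplored assertions prioritized.
    def avg_reward(i):
        return float("inf") if counts[i] == 0 else reductions[i] / counts[i]

    return max(range(len(reductions)), key=avg_reward)
```

For example, with three assertions whose last-round reductions were 0.5, 2.0, and 1.0 (each selected once), the greedy arm is the second assertion.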

SLIDE 14

Outline

» Using model assertions
  » Overview
  » For active learning
  » For weak supervision
  » For monitoring

» Model assertions API & examples

» Evaluation of model assertions

SLIDE 15

Correction rules for weak supervision: flickering

Frame 1 Frame 2 Frame 3

Frame two is filled in from surrounding frames
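A minimal sketch of this correction rule, assuming each frame's detection for one tracked object is a `(x1, y1, x2, y2)` box or `None` when missing; the averaging rule is an illustrative assumption, not the paper's exact method.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def fill_flicker(boxes: List[Optional[Box]]) -> List[Optional[Box]]:
    """Fill in a detection missing for exactly one frame by averaging
    the boxes from the two surrounding frames."""
    filled = list(boxes)
    for i in range(1, len(filled) - 1):
        if filled[i] is None and filled[i - 1] is not None and filled[i + 1] is not None:
            filled[i] = tuple(
                (a + b) / 2 for a, b in zip(filled[i - 1], filled[i + 1])
            )
    return filled
```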

SLIDE 16

Automatic correction rules: consistency API

Identifier   Time stamp   Attribute 1 (gender)   Attribute 2 (hair color)
1            1            M                      Brown
1            2            M                      Black
1            4            F                      Brown
2            5            M                      Grey

Propose ‘M’ as an updated label
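One plausible sketch of how such a proposal could be computed, assuming rows are `(identifier, timestamp, attribute)` tuples and a simple majority-vote rule; both the schema and the rule are illustrative, not the paper's exact consistency API.

```python
from collections import Counter
from typing import List, Tuple

def propose_label(rows: List[Tuple[int, int, str]], identifier: int) -> str:
    """Propose the majority attribute value observed for one identifier
    across its time stamps as the corrected (weak) label."""
    values = [attr for ident, _, attr in rows if ident == identifier]
    return Counter(values).most_common(1)[0][0]
```

On the table above, identifier 1 has values M, M, F, so the majority proposal is ‘M’.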

SLIDE 17

Outline

» Using model assertions
» Model assertions API & examples
» Evaluation of model assertions

SLIDE 18

Specifying model assertions: black-box functions over model inputs and outputs

def flickering(
    recent_frames: List[PixelBuf],
    recent_outputs: List[BoundingBox]
) -> Float

Model assertion inputs are a history of inputs and predictions

Model assertions output a severity score, where 0 is an abstention
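A minimal sketch of what such an assertion body might look like. The logic is assumed: it ignores per-object matching and simply scores frames where all detections vanish for exactly one frame, so it is a simplification of the flickering assertion, not the paper's code. A return value of 0.0 is the abstention.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def flickering(recent_frames: List[object],
               recent_outputs: List[List[Box]]) -> float:
    """Severity score for one-frame disappearances: both neighboring
    frames contain detections but the middle frame has none.
    recent_frames is unused in this simplified sketch."""
    severity = 0.0
    for prev, cur, nxt in zip(recent_outputs, recent_outputs[1:], recent_outputs[2:]):
        if prev and nxt and not cur:
            severity += 1.0
    return severity
```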

SLIDE 19

Predictions from different AV sensors should agree

SLIDE 20

Assertions can be specified in little code

def sensor_agreement(lidar_boxes, camera_boxes):
    failures = 0
    for lidar_box in lidar_boxes:
        if no_overlap(lidar_box, camera_boxes):
            failures += 1
    return failures
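The `no_overlap` helper is not shown on the slide; one plausible IoU-based sketch (boxes as `(x1, y1, x2, y2)` tuples; the 0.1 threshold is an assumption):

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def no_overlap(lidar_box: Box, camera_boxes: List[Box], thresh: float = 0.1) -> bool:
    """A LIDAR box has "no overlap" if no camera box exceeds the IoU
    threshold (threshold value is an assumption)."""
    return all(iou(lidar_box, cb) < thresh for cb in camera_boxes)
```

With this helper, `sensor_agreement` counts LIDAR boxes that no camera box corroborates.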

SLIDE 21

Specifying model assertions: consistency API

Identifier   Time stamp   Attribute 1 (gender)   Attribute 2 (hair color)
1            1            M                      Brown
1            2            M                      Black
1            4            F                      Brown
2            5            M                      Grey

» Transitions cannot happen too quickly
» Attributes with the same identifier must agree

SLIDE 22

Model assertions for TV news analytics

Overlapping boxes in the same scene should agree on attributes
Automatically specified via consistency assertions

SLIDE 23

Model assertions for ECG readings

Normal → AF → Normal
Classifications should not change from normal to AF and back within 30 seconds
Automatically specified via consistency assertions
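A sketch of this ECG consistency check, assuming predictions arrive as `(timestamp_seconds, label)` pairs; the schema is illustrative and this is not the paper's consistency API itself.

```python
from typing import List, Tuple

def rapid_transition(preds: List[Tuple[float, str]], window_s: float = 30.0) -> float:
    """Severity score counting Normal -> AF -> Normal flips that complete
    within window_s seconds; 0.0 means the assertion abstains."""
    severity = 0.0
    for (t0, a), (_, b), (t2, c) in zip(preds, preds[1:], preds[2:]):
        if a == "Normal" and b == "AF" and c == "Normal" and (t2 - t0) <= window_s:
            severity += 1.0
    return severity
```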

SLIDE 24

Outline

» Using model assertions
» Model assertions API & examples
» Evaluation of model assertions
  » Evaluation setup
  » Evaluating the precision of model assertions (monitoring)
  » Evaluating the accuracy gains from model assertions (training)

SLIDE 25

Evaluation setup: datasets and tasks

Setting               Task                        Model           Assertions
Visual analytics      Object detection            SSD             Flicker, appear, multibox
Autonomous vehicles   Object detection            SSD, VoxelNet   Consistency, multibox
ECG analysis          AF detection                ResNet-34       Consistency
TV news               Identifying TV news hosts   Several         Consistency

SLIDE 26

Evaluation Setup: Examples

» Security camera footage, original SSD
» Point cloud data (NuScenes)
» Medical time series data

SLIDE 27

Outline

» Using model assertions
» Model assertions API & examples
» Evaluation of model assertions
  » Evaluation setup
  » Evaluating the precision of model assertions (monitoring)
  » Evaluating the accuracy gains from model assertions (training)

SLIDE 28

Evaluating Model Assertion Precision: Can assertions catch mistakes?

Assertion    True Positive Rate
Flickering   96%
Multibox     100%
Appearing    88%
LIDAR        100%
ECG          100%

SLIDE 29

Outline

» Using model assertions
» Model assertions API & examples
» Evaluation of model assertions
  » Evaluation setup
  » Evaluating the precision of model assertions (monitoring)
  » Evaluating the accuracy gains from model assertions (training)

SLIDE 30

Evaluating Model Quality after Retraining: Metrics

» Video analytics: box mAP
» Autonomous vehicle sensing: box mAP
» AF classification: accuracy

SLIDE 31

Evaluating Model Quality after Retraining (multiple assertions): Can collecting training data via assertions improve model quality via active learning?

» Fine-tuned the model with 100 examples each round
» 3 assertions to choose frames from:
  » Flickering
  » Multibox
  » Appearing
» Compare against:
  » Random sampling
  » Uncertainty sampling
  » Randomly sampling from assertions

SLIDE 32

Model assertions can be used for active learning more efficiently than alternatives (video analytics)

Using assertions outperforms uncertainty and random sampling.

Our bandit algorithm outperforms uniformly sampling from assertions.

SLIDE 33

Model assertions also outperform on autonomous vehicle datasets (NuScenes)

Using assertions outperforms uncertainty and random sampling.

SLIDE 34

Evaluating Model Quality after Retraining: Can correction rules improve model quality without human labeling via weak supervision?

Using weak supervision to label training examples caught by assertions improves model quality.

Full experimental details in paper

SLIDE 35

Further results in paper

» Model assertions can find high-confidence errors
» Model assertions for validating human labels (video analytics)
» Active learning results with a single model assertion (ECG)

Incorrect annotation from Scale AI

SLIDE 36

Future work

» What is the language to specify model assertions?
» How can we choose thresholds in model assertions automatically?
» How can we apply model assertions to other domains such as text?

SLIDE 37

Conclusion: Assertions can be Useful in ML!

ddkang@stanford.edu

» No standard way of doing quality assurance for ML
» Model assertions can be used for:
  » Monitoring ML at deployment time
  » Improving models at train time

» Preliminary results show significant model improvement
