Model Assertions for Monitoring and Improving ML Models
Daniel Kang*, Deepti Raghavan*, Peter Bailis, Matei Zaharia
DAWN Project, Stanford InfoLab
http://dawn.cs.stanford.edu/
Machine learning is deployed in mission-critical settings:
» Tesla’s Autopilot repeatedly accelerated towards lane dividers
» An Uber autonomous vehicle was involved in a fatal crash
» Software powers medical devices and other safety-critical systems
Example assertions:
» Boxes of cars should not highly overlap
» Cars should not flicker in and out of the video
We can specify errors in models without knowing root causes or fixes!
(see paper for examples)
“As the [automated driving system] changed the classification of the pedestrian several times—alternating between vehicle, bicycle, and an other—the system was unable to correctly predict the path of the detected object,” the board’s report states.
[Figure: three consecutive video frames showing a car detection flickering in and out]
assert(cars should not flicker in and out)
Model assertions can be used for runtime monitoring and corrective action.
The set of inputs that triggered an assertion can be used for model retraining:
» Active learning with human-generated labels
» Weak supervision via correction rules, producing weak labels
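As a rough illustration, this loop might look like the following sketch; the helper names (label_fn, retrain_fn) and the per-input assertion signature are assumptions, not from the slides, and the real pipeline is described in the paper.

def retraining_loop(model, assertions, unlabeled_data, label_fn, retrain_fn):
    # Collect inputs on which any assertion fires (nonzero severity).
    flagged = []
    for x in unlabeled_data:
        y = model(x)
        if any(a(x, y) > 0 for a in assertions):
            flagged.append(x)
    # label_fn may be a human labeler (active learning) or a correction
    # rule that produces weak labels (weak supervision).
    labeled = [(x, label_fn(x)) for x in flagged]
    return retrain_fn(model, labeled)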
Many users, potentially not the model builders, can collaboratively add assertions
» Many assertions can flag the same data point
» The same assertion can flag many data points
» Which points should we label?
[Figure: overlapping sets of data points flagged by Assertion 1 and Assertion 2]
» We designed a bandit algorithm for data selection (BAL)
» Idea: select the model assertions with the highest reduction in assertions triggered
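The slides only state the idea; a hedged epsilon-greedy sketch of it follows. The paper's actual BAL algorithm differs, and flagged_by, retrain_fn, and count_triggers_fn are assumed helpers.

import random

def bal_step(model, assertions, flagged_by, history,
             retrain_fn, count_triggers_fn, epsilon=0.1):
    # One round: pick an assertion (arm), label and train on the points
    # it flagged, and record the reduction in triggered assertions.
    def avg_reward(a):
        rewards = history.get(a, [])
        # Untried arms get +inf so each is tried at least once.
        return sum(rewards) / len(rewards) if rewards else float("inf")

    if random.random() < epsilon:        # explore
        arm = random.choice(assertions)
    else:                                # exploit the best arm so far
        arm = max(assertions, key=avg_reward)

    before = count_triggers_fn(model)
    model = retrain_fn(model, flagged_by[arm])
    after = count_triggers_fn(model)
    history.setdefault(arm, []).append(before - after)  # reward: fewer triggers
    return model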
[Figure: three consecutive video frames illustrating the flickering example]
Identifier | Time stamp | Attribute 1 (gender) | Attribute 2 (hair color)
1          | 1          | M                    | Brown
1          | 2          | M                    | Black
1          | 4          | F                    | Brown
2          | 5          | M                    | Grey
def flickering(
    recent_frames: List[PixelBuf],
    recent_outputs: List[BoundingBox],
) -> float:
    ...
Model assertion inputs are a history of recent inputs and outputs.
Model assertions output a severity score, where 0 is an abstention.
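A minimal sketch of a flickering assertion under this API follows. Two assumptions not in the slides: recent_outputs holds one list of boxes per frame, and match_fn is a caller-supplied predicate (e.g., IoU-based) deciding whether two boxes are the same object.

def flickering(recent_frames, recent_outputs, match_fn) -> float:
    if len(recent_outputs) < 3:
        return 0.0                          # not enough history: abstain
    prev_boxes, mid_boxes, next_boxes = recent_outputs[-3:]
    flickers = 0
    for box in prev_boxes:
        seen_next = any(match_fn(box, b) for b in next_boxes)
        seen_mid = any(match_fn(box, b) for b in mid_boxes)
        if seen_next and not seen_mid:      # present, gone, present again
            flickers += 1
    return float(flickers)                  # severity: count of flickering objects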
def sensor_agreement(lidar_boxes, camera_boxes):
    failures = 0
    for lidar_box in lidar_boxes:
        if no_overlap(lidar_box, camera_boxes):
            failures += 1
    return failures
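no_overlap is not defined on the slide; one plausible sketch, assuming boxes are (x1, y1, x2, y2) tuples and an IoU threshold, is:

def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def no_overlap(lidar_box, camera_boxes, threshold=0.1):
    # True if the LIDAR box overlaps no camera box above the threshold.
    return all(iou(lidar_box, cam_box) < threshold for cam_box in camera_boxes)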
For the TV news identity table above, consistency assertions can be specified automatically:
» Attributes with the same identifier must agree
» Transitions cannot happen too quickly
» Overlapping boxes in the same scene should agree on attributes
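A minimal sketch of the attribute-agreement assertion over rows like those above (the row format is an assumption; the paper generates these checks automatically):

from collections import defaultdict

def attributes_agree(rows) -> float:
    # Severity = number of identifiers whose attribute values disagree.
    seen = defaultdict(set)
    for identifier, _timestamp, *attributes in rows:
        seen[identifier].add(tuple(attributes))
    return float(sum(1 for values in seen.values() if len(values) > 1))

rows = [
    (1, 1, "M", "Brown"),
    (1, 2, "M", "Black"),  # identifier 1's attributes disagree across rows
    (1, 4, "F", "Brown"),
    (2, 5, "M", "Grey"),
]
assert attributes_agree(rows) == 1.0  # one identifier with disagreements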
[Figure: ECG classification timeline: Normal → AF → Normal]
Classifications should not change from normal to AF (atrial fibrillation) and back within 30 seconds; this rule is also specified automatically via consistency assertions.
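A minimal sketch of this temporal check, assuming predictions arrive as (timestamp_seconds, label) pairs; the format and the exact rule are assumptions.

def no_rapid_reversion(predictions, min_duration=30.0) -> float:
    # Severity = number of label runs shorter than min_duration that
    # revert to the previous label (e.g., normal -> AF -> normal).
    runs = []  # (label, start_time) for each run of identical labels
    for t, label in predictions:
        if not runs or runs[-1][0] != label:
            runs.append((label, t))
    violations = 0
    for i in range(1, len(runs) - 1):
        label, start = runs[i]
        if runs[i - 1][0] == runs[i + 1][0] and runs[i + 1][1] - start < min_duration:
            violations += 1
    return float(violations)  # 0 abstains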
» Evaluation setup
» Evaluating the precision of model assertions (monitoring)
» Evaluating the accuracy gains from model assertions (training)
Setting             | Task                      | Model         | Assertions
Visual analytics    | Object detection          | SSD           | Flicker, appear, multibox
Autonomous vehicles | Object detection          | SSD, VoxelNet | Consistency, multibox
ECG analysis        | AF detection              | ResNet-34     | Consistency
TV news             | Identifying TV news hosts | Several       | Consistency
Datasets: security camera footage, point cloud data (NuScenes), and medical time series data.
Evaluating the precision of model assertions (monitoring):
Assertion  | True Positive Rate
Flickering | 96%
Multibox   | 100%
Appearing  | 88%
LIDAR      | 100%
ECG        | 100%
Evaluating the accuracy gains from model assertions (training):
Assertions used:
» Flickering
» Multibox
» Appearing
Baselines:
» Random sampling
» Uncertainty sampling
» Randomly sampling from assertions
[Figure: retraining accuracy using assertions vs. random sampling]
[Figure: retraining accuracy of our bandit algorithm vs. randomly sampling from assertions]
[Figure: model quality using assertions vs. random sampling]
Using weak supervision to label training examples caught by assertions improves model quality.
Full experimental details in paper
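As one hedged illustration of such a correction rule (not necessarily the paper's exact rule): when a box flickers out for a single frame, impute it from the neighboring frames' boxes to produce a weak label.

def impute_missing_box(prev_box, next_box):
    # Weak label: average the coordinates of the neighboring frames' boxes.
    return tuple((p + n) / 2 for p, n in zip(prev_box, next_box))

weak_box = impute_missing_box((10, 10, 50, 50), (14, 10, 54, 50))
assert weak_box == (12.0, 10.0, 52.0, 50.0)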
[Figure: an incorrect annotation from Scale AI]
» What is the right language for specifying model assertions?
» How can we choose thresholds in model assertions automatically?
» How can we apply model assertions to other domains, such as text?
Model assertions enable:
» Monitoring ML at deployment time
» Improving models at train time

Contact: ddkang@stanford.edu