FlyingSquid: Speeding Up Weak Supervision with Triplet Methods
Dan Fu*, Mayee Chen*, Fred Sala, Sarah Hooper, Kayvon Fatahalian, Chris Ré
Paper: Daniel Y. Fu*, Mayee F. Chen*, Frederic Sala, Sarah M. Hooper, Kayvon Fatahalian, and Christopher Ré. Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods. ICML 2020.
* Denotes equal contribution.
The Training Data Bottleneck in ML
Collecting training data can be slow and expensive.
Weak Supervision: A Response
Noisy sources of supervision include user-defined functions, crowd workers, and external knowledge bases, e.g.:
def L_1(comment): return SPAM if "http" in comment
def L_2(comment): return NOT_SPAM if "love" in comment
How do we best use multiple noisy sources of supervision? A runnable version of these labeling functions is sketched below.
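A minimal sketch of the two labeling functions on this slide, written as plain Python. The SPAM / NOT_SPAM / ABSTAIN encoding and the else-abstain branch are assumptions added for illustration; any consistent {+1, -1, 0} scheme works.

```python
SPAM, NOT_SPAM, ABSTAIN = 1, -1, 0

def L_1(comment):
    # Vote spam if the comment contains a link, otherwise abstain.
    return SPAM if "http" in comment else ABSTAIN

def L_2(comment):
    # Vote not-spam if the comment contains "love", otherwise abstain.
    return NOT_SPAM if "love" in comment else ABSTAIN

print(L_1("check out http://cheap-deals.example"))  # 1  (SPAM)
print(L_2("love this video!"))                       # -1 (NOT_SPAM)
```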
Data Programming: Unifying Weak Supervision [1]
[Figure: pipeline from unlabeled inputs X_1, X_2, X_3, through labeling functions λ_1, ..., λ_9 and a latent variable label model with accuracy parameters μ(λ, Y), to probabilistic training labels (Y_1: 0.95, Y_2: 0.87, Y_3: 0.09) used to train an end model.]
1. Users write labeling functions (e.g., def S_1: label +1 if Bernie on screen; def S_2: label +1 if background blue; def S_3: return CROWD_WORKER_VOTE).
2. Model labeling function behavior to de-noise them.
3. Use the probabilistic labels to train a downstream model.
[1] Ratner et al. Snorkel: Rapid Training Data Creation with Weak Supervision. VLDB 2018.
Modeling Labeling Functions Is Critical, but Can Be Slow…
[Same pipeline figure as the previous slide; step 2, fitting the latent variable label model, requires expensive SGD iterations.]
FlyingSquid: Reduce the Turnaround Time
▪ Background: labeling functions and graphical models
▪ Closed-form solution for model parameters, no SGD
▪ Theoretical bounds and guarantees
▪ Run orders of magnitude faster, without losing accuracy; weakly-supervised online learning
Background
Problem Setup
Unlabeled data $\{X_i\}_{i=1}^n$ → $m$ labeling functions $S_1: \mathcal{X} \rightarrow \lambda_1 \in \{\pm 1, 0\}, \ldots, S_m: \mathcal{X} \rightarrow \lambda_m \in \{\pm 1, 0\}$ → probabilistic labels $\{\hat{Y}_i\}_{i=1}^n$ → downstream end model $f_w: \mathcal{X} \rightarrow \mathcal{Y}$.
We want to learn the joint distribution $P(\lambda, Y)$ without observing $Y$!
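A sketch of this setup, with toy placeholder labeling functions and data: apply the $m$ labeling functions to $n$ unlabeled points to build an $n \times m$ vote matrix with entries in {-1, 0, +1}. The true labels $Y$ are never observed; the label model must be estimated from these votes alone.

```python
import numpy as np

# Toy labeling functions and unlabeled points (illustrative placeholders).
lfs = [lambda x: 1 if "http" in x else 0,
       lambda x: -1 if "love" in x else 0]
unlabeled = ["buy now http://spam.example", "love this talk", "great results"]

# Vote matrix: one row per data point, one column per labeling function.
votes = np.array([[lf(x) for lf in lfs] for x in unlabeled])
print(votes)        # shape (n, m) = (3, 2), entries in {-1, 0, +1}
```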
Model Labeling Functions with Latent Graphical Models
[Figure: left, a single hidden variable Y (the true label) with observed labeling function outputs λ_1, ..., λ_5; right, a sequential model with hidden Y_1, Y_2, Y_3 linked by temporal dependencies, each with observed labeling function outputs among λ_1, ..., λ_9.]
Technical problem: learning the parameters of these graphical models.
Main challenge: recovering the accuracies μ of the labeling functions.
[2] Varma et al. Learning Dependency Structures for Weak Supervision Models. ICML 2019.
Parameter Recovery
Existing Iterative Approaches Can Be Slow
SGD over a loss function: Ratner et al. 2016, Ratner et al. 2018, Ratner et al. 2019, Bach et al. 2019 (Gibbs), Sala et al. 2019, Zhan et al. 2019, Safranchik et al. 2020.
Disadvantages: SGD can take a long time and has many hyperparameters (learning rate, momentum, etc.) to tune.
Solve Triplets of Labeling Function Parameters at a Time
[Figure: the latent variable label model broken into triplets of labeling functions.]
For conditionally independent $\lambda_1, \lambda_2, \lambda_3$ attached to $Y_1$:
$\mathbb{E}[\lambda_1 Y_1]\,\mathbb{E}[\lambda_2 Y_1] = \mathbb{E}[\lambda_1 \lambda_2]$
$\mathbb{E}[\lambda_2 Y_1]\,\mathbb{E}[\lambda_3 Y_1] = \mathbb{E}[\lambda_2 \lambda_3]$
$\mathbb{E}[\lambda_3 Y_1]\,\mathbb{E}[\lambda_1 Y_1] = \mathbb{E}[\lambda_3 \lambda_1]$
Method of moments: break the problem up into pieces and get closed-form solutions.
Triplets of Conditionally-Independent Labeling Functions
Each pairwise moment relates unobservable accuracy parameters to observable agreements: $\mu_i \mu_j = N_{i,j}$, where $N_{i,j} = \mathbb{E}[\lambda_i \lambda_j]$.
Form triplets of these equations and get closed-form solutions:
$\mu_1 \mu_4 = N_{1,4}$, $\mu_1 \mu_5 = N_{1,5}$, $\mu_4 \mu_5 = N_{4,5}$
$\Rightarrow |\mu_1| = \sqrt{N_{1,4} N_{1,5} / N_{4,5}}, \quad |\mu_4| = \sqrt{N_{1,4} N_{4,5} / N_{1,5}}, \quad |\mu_5| = \sqrt{N_{1,5} N_{4,5} / N_{1,4}}$
All we need to do is count how often the labeling functions agree - no SGD!
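Below is a minimal sketch of the closed-form triplet step (not the FlyingSquid library API): estimate $|\mathbb{E}[\lambda_i Y]|$ for three conditionally independent labeling functions purely from how often they agree. It assumes votes in {-1, +1} with no abstains, and leaves sign recovery and class balance to the rest of the pipeline, as in the paper.

```python
import numpy as np

def triplet_accuracies(L, i, j, k):
    """Closed-form estimates of |E[lambda * Y]| for a conditionally-independent triplet."""
    # Observable pairwise agreements N_{a,b} = empirical mean of lambda_a * lambda_b.
    N = lambda a, b: np.mean(L[:, a] * L[:, b])
    mu_i = np.sqrt(N(i, j) * N(i, k) / N(j, k))
    mu_j = np.sqrt(N(i, j) * N(j, k) / N(i, k))
    mu_k = np.sqrt(N(i, k) * N(j, k) / N(i, j))
    return mu_i, mu_j, mu_k

# Toy check: three voters that agree with a hidden Y with probability 0.9, 0.8, 0.7,
# so E[lambda * Y] = 2p - 1 = 0.8, 0.6, 0.4.
rng = np.random.default_rng(0)
Y = rng.choice([-1, 1], size=100_000)
L = np.stack([Y * rng.choice([1, -1], size=Y.size, p=[p, 1 - p])
              for p in (0.9, 0.8, 0.7)], axis=1)
print(triplet_accuracies(L, 0, 1, 2))   # approximately (0.8, 0.6, 0.4) -- no SGD needed
```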
Theoretical Analysis
Bounding Sampling Error (Informal)
Theorem 1 (how sampling error scales in $n$): the error in the parameter estimate satisfies $\mathbb{E}[\|\hat{\mu} - \mu\|_2] \leq O(n^{-1/2})$, where $n$ is the number of unlabeled data points.
Theorem 2 (optimal scaling rate): $\mathbb{E}[\|\hat{\mu} - \mu\|_2] \geq \Omega(n^{-1/2})$ ⇒ the bound is tight; this is the best possible scaling rate with unlabeled data.
[Figure: Y with conditionally-independent labeling functions λ_1, ..., λ_5.]
End Model Generalization Error (Informal)
Theorem 3 (end model generalization error): if you use the estimated parameters $\hat{\mu}$ to generate labels $\hat{Y}$ and train an end model $f_{\hat{w}}$, the end model generalization error satisfies $\mathbb{E}[L(\hat{w}, X, Y) - L(w^*, X, Y)] = O(n^{-1/2})$.
This is the same asymptotic rate as with supervised data!
More theory nuggets (check out our paper for details):
▪ We can achieve these rates even with model misspecification (the graph is incorrect)
▪ Bounds for distributional drift over time in the online setting
Evaluation & Implications
We run faster, and get high quality
Label model training times (s):
                     Snorkel   Temporal   FlyingSquid
Snorkel Benchmarks   3.0       --         0.06
Video Tasks          41.5      292.3      0.20

End model accuracies (F1):
                     Snorkel   Temporal   FlyingSquid
Snorkel Benchmarks   74.6      --         77.0
Video Tasks          47.4      75.2       76.2
Re-training in the End Model Training Loop
No SGD → we can re-train the label model inside the training loop of an end model.
[Figure: inputs X_1, X_2, X_3 flow through the end model; loss and gradients flow back through a FlyingSquid loss layer.]
PyTorch integration: a FlyingSquid loss layer.
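A rough sketch of how this can look in PyTorch. This is not the actual FlyingSquid loss layer; the synthetic data, the tiny end model, and the inline triplet fit are toy stand-ins. The point is only that, because the label model is solved in closed form from agreement counts, it can be cheaply re-fit inside the end model's training loop and its probabilistic labels used as training targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n = 4096
Y = (torch.rand(n) < 0.5).float() * 2 - 1              # hidden labels, never used directly
flip = lambda p: (torch.rand(n) < p).float() * 2 - 1   # +1 with probability p, else -1
lf_votes = torch.stack([Y * flip(p) for p in (0.9, 0.8, 0.7)], dim=1)
features = Y.unsqueeze(1) + torch.randn(n, 16)         # inputs correlated with Y

end_model = nn.Linear(16, 1)
opt = torch.optim.SGD(end_model.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

for step in range(20):
    # Closed-form label model "re-fit": triplet accuracies from agreement rates.
    N = lambda a, b: (lf_votes[:, a] * lf_votes[:, b]).mean()
    mu = torch.stack([torch.sqrt(N(0, 1) * N(0, 2) / N(1, 2)),
                      torch.sqrt(N(0, 1) * N(1, 2) / N(0, 2)),
                      torch.sqrt(N(0, 2) * N(1, 2) / N(0, 1))])
    soft_labels = torch.sigmoid(lf_votes @ mu)          # crude stand-in for P(Y = 1 | lambda)

    loss = bce(end_model(features).squeeze(-1), soft_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```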
Speedups Enable Online Learning
Online learning: re-train on a rolling window to adapt to distributional drift over time.
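One way to picture the rolling-window variant (again an illustrative sketch, not the library API): keep only the most recent vote vectors in a fixed-size buffer and re-solve the same closed form as new data streams in, so the accuracy estimates track drift. The window size and the three-labeling-function setup are assumptions.

```python
from collections import deque
import numpy as np

WINDOW = 2000                      # assumed window size
buffer = deque(maxlen=WINDOW)      # most recent labeling-function vote vectors

def observe(votes):
    """Add one vote vector (entries in {-1, +1} for 3 LFs) and refresh the estimates."""
    buffer.append(votes)
    if len(buffer) < 100:          # wait for enough samples before solving
        return None
    L = np.array(buffer)
    N = lambda a, b: np.mean(L[:, a] * L[:, b])
    # Closed-form triplet accuracies for the current window only.
    return np.sqrt([N(0, 1) * N(0, 2) / N(1, 2),
                    N(0, 1) * N(1, 2) / N(0, 2),
                    N(0, 2) * N(1, 2) / N(0, 1)])
```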
Thank you!
Contact: Dan Fu (danfu@cs.stanford.edu, @realDanFu)
Code: https://github.com/HazyResearch/flyingsquid
Blog Post (Towards Interactive Weak Supervision with FlyingSquid): http://hazyresearch.stanford.edu/flyingsquid
Paper (Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods): https://arxiv.org/abs/2002.11955