FlyingSquid: Speeding Up Weak Supervision with Triplet Methods


  1. FlyingSquid: Speeding Up Weak Supervision with Triplet Methods. Daniel Y. Fu*, Mayee F. Chen*, Frederic Sala, Sarah M. Hooper, Kayvon Fatahalian, and Christopher Ré. Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods. ICML 2020. * Denotes equal contribution.

  2. The Training Data Bottleneck in ML. Collecting training data can be slow and expensive.

  3. Weak Supervision - A Response. Supervision can come from many noisy sources: user-defined functions, crowd workers, and external knowledge bases. For example:
    def L_1(comment): return SPAM if "http" in comment
    def L_2(comment): return NOT_SPAM if "love" in comment
  How to best use multiple noisy sources of supervision?
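A runnable version of the slide's pseudocode (a minimal sketch; the SPAM/NOT_SPAM/ABSTAIN constants and the "0 means abstain" convention are our assumptions for illustration):

```python
# Labeling functions vote on unlabeled comments; 0 means "abstain".
SPAM, NOT_SPAM, ABSTAIN = 1, -1, 0

def L_1(comment):
    # Links are weak evidence of spam.
    return SPAM if "http" in comment else ABSTAIN

def L_2(comment):
    # "love" is weak evidence of a genuine comment.
    return NOT_SPAM if "love" in comment else ABSTAIN

comments = ["check out http://buy-now.example", "love this video!"]
L = [[lf(c) for lf in (L_1, L_2)] for c in comments]
print(L)   # [[1, 0], [0, -1]]
```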

  4. Data Programming: Unifying Weak Supervision [1]
  [Figure: pipeline from unlabeled inputs X_1, X_2, X_3 through labeling functions λ_1, ..., λ_9 to a latent variable label model with accuracy parameters μ(λ_i, Y_j), which outputs probabilistic training labels (0.95, 0.87, 0.09) for a downstream end model. Example labeling functions: S_1: label +1 if Bernie on screen; S_2: label +1 if background blue; S_3: return CROWD_WORKER_VOTE.]
  1. Users write labeling functions.
  2. Model labeling function behavior to de-noise them.
  3. Use the probabilistic labels to train a downstream model.
  [1] Ratner et al. Snorkel: Rapid Training Data Creation with Weak Supervision. VLDB 2018.
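To make the three steps concrete with this deck's own system, here is a minimal sketch in the style of the FlyingSquid repository's README. The import path and method names (flyingsquid.label_model.LabelModel, fit, predict_proba) are assumed from that README and should be checked against the released code; the vote matrix here is simulated.

```python
import numpy as np
from flyingsquid.label_model import LabelModel   # assumed import path; check the repo README

# Step 1: labeling functions give an n x m vote matrix in {-1, 0, +1}.
# Here we simulate one with three noisy voters of known accuracy.
rng = np.random.default_rng(0)
n = 1_000
Y = rng.choice([-1, 1], size=n)                  # true labels, unknown in practice
L_train = np.stack(
    [np.where(rng.random(n) < acc, Y, -Y) for acc in (0.9, 0.75, 0.6)], axis=1)

# Step 2: fit the label model; parameter recovery is closed form (no SGD).
label_model = LabelModel(L_train.shape[1])       # assumed constructor: one node per labeling function
label_model.fit(L_train)

# Step 3: probabilistic labels for training a downstream end model.
probs = label_model.predict_proba(L_train)       # assumed method; .predict gives hard labels
```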

  6. Weak Supervision: A Response. Modeling labeling functions is critical, but it can be slow: the label model (step 2) requires expensive SGD iterations.
  [Figure: the same pipeline as slide 4, with the label model step marked as the bottleneck ("Expensive SGD iterations").]

  7. FlyingSquid: Reduce the Turnaround Time
  ▪ Background: labeling functions and graphical models
  ▪ Closed-form solution to model parameters, no SGD
  ▪ Theoretical bounds and guarantees
  ▪ Orders-of-magnitude faster runtime without losing accuracy; enables weakly-supervised online learning

  8. Background

  9. Problem Setup. Unlabeled data $\{X_i\}_{i=1}^n$ and $m$ labeling functions $S_j : \mathcal{X} \to \lambda_j \in \{\pm 1, 0\}$ (0 means abstain) are used to produce probabilistic labels $\{\hat{Y}_i\}_{i=1}^n$, which train a downstream end model $f_w : \mathcal{X} \to \mathcal{Y}$. We want to learn the joint distribution $P(\lambda, Y)$ without observing $Y$!

  10. Model Labeling Functions with Latent Graphical Models.
  [Figure: left, a single hidden variable Y (the true label) with observed variables λ_1, ..., λ_5 (the labeling function outputs); right, hidden variables Y_1, Y_2, Y_3 linked by temporal dependencies, each with its own labeling functions among λ_1, ..., λ_9.]
  Technical problem: learning the parameters of these graphical models. Main challenge: recovering the accuracies μ of the labeling functions.
  [2] Varma et al. Learning Dependency Structures for Weak Supervision Models. ICML 2019.

  11. Parameter Recovery

  12. Existing Iterative Approaches Can Be Slow.
  Prior label models learn parameters iteratively, e.g. SGD over a loss function (Ratner et al. 2016; Ratner et al. 2018; Ratner et al. 2019; Sala et al. 2019; Zhan et al. 2019; Safranchik et al. 2020) or Gibbs sampling (Bach et al. 2019).
  Disadvantages: SGD can take a long time, and there are many hyperparameters (learning rate, momentum, etc.) to tune.

  13. Solve Triplets of Labeling Function Parameters at a Time.
  [Figure: the label model from slide 4; the accuracies μ(λ_i, Y_j) are recovered one triplet of labeling functions at a time.]
  For a triplet of conditionally independent labeling functions λ_1, λ_2, λ_3 attached to Y_1:
  $\mathbb{E}[\lambda_1 Y_1]\,\mathbb{E}[\lambda_2 Y_1] = \mathbb{E}[\lambda_1 \lambda_2]$
  $\mathbb{E}[\lambda_2 Y_1]\,\mathbb{E}[\lambda_3 Y_1] = \mathbb{E}[\lambda_2 \lambda_3]$
  $\mathbb{E}[\lambda_3 Y_1]\,\mathbb{E}[\lambda_1 Y_1] = \mathbb{E}[\lambda_3 \lambda_1]$
  Method of moments: break the problem into pieces and get closed-form solutions.
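Why these identities hold (a one-line sketch under simplifying assumptions not spelled out on the slide: the labeling functions never abstain and have class-symmetric accuracies, i.e. $\mathbb{E}[\lambda_i \mid Y_1] = a_i Y_1$ with $Y_1 \in \{\pm 1\}$; the conditional independence $\lambda_1 \perp \lambda_2 \mid Y_1$ comes from the graph):

\[
\mathbb{E}[\lambda_1 \lambda_2]
  = \mathbb{E}\big[\,\mathbb{E}[\lambda_1 \mid Y_1]\,\mathbb{E}[\lambda_2 \mid Y_1]\,\big]
  = \mathbb{E}[a_1 a_2 Y_1^2]
  = a_1 a_2,
\qquad
\mathbb{E}[\lambda_1 Y_1] = \mathbb{E}[a_1 Y_1^2] = a_1 .
\]

So the unobservable product of accuracies $\mathbb{E}[\lambda_1 Y_1]\,\mathbb{E}[\lambda_2 Y_1]$ equals the observable agreement rate $\mathbb{E}[\lambda_1 \lambda_2]$; the other two equations follow in the same way.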

  14. Triplets of Conditionally Independent Labeling Functions.
  Moment: $N_{i,j} = \mathbb{E}[\lambda_i \lambda_j] = \mu_i \mu_j$, where the accuracy parameters $\mu_i$ are unobservable and the agreement rates $N_{i,j}$ are observable.
  Form triplets of these equations:
  $\mu_1 \mu_4 = N_{1,4}$, $\mu_1 \mu_5 = N_{1,5}$, $\mu_4 \mu_5 = N_{4,5}$
  and solve them in closed form:
  $|\mu_1| = \sqrt{N_{1,4} N_{1,5} / N_{4,5}}$, $|\mu_4| = \sqrt{N_{1,4} N_{4,5} / N_{1,5}}$, $|\mu_5| = \sqrt{N_{4,5} N_{1,5} / N_{1,4}}$
  All we need to do is count how often the labeling functions agree; no SGD!
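A minimal NumPy sketch of this recovery (the synthetic data and helper names are ours, not the released FlyingSquid code): estimate the agreement rates, then apply the closed-form formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Y = rng.choice([-1, 1], size=n)

def noisy_vote(acc):
    # A labeling function that matches Y with probability `acc` (never abstains).
    return np.where(rng.random(n) < acc, Y, -Y)

lam = np.stack([noisy_vote(0.9), noisy_vote(0.75), noisy_vote(0.6)], axis=1)

# Observable agreement rates: N[i, j] estimates E[lambda_i * lambda_j].
N = lam.T @ lam / n

# Closed-form triplet solutions: |mu_i| = sqrt(N_ij * N_ik / N_jk).
mu_hat = np.sqrt(np.array([N[0, 1] * N[0, 2] / N[1, 2],
                           N[0, 1] * N[1, 2] / N[0, 2],
                           N[0, 2] * N[1, 2] / N[0, 1]]))
print(mu_hat)                            # estimated accuracies E[lambda_i * Y]
print((lam * Y[:, None]).mean(axis=0))   # ground truth, only known here because Y is simulated
```

With the accuracies of 0.9, 0.75, and 0.6 used above, the true $\mu$ values are 0.8, 0.5, and 0.2, and the printed estimates should match them to roughly two decimal places.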

  15. Theoretical Analysis

  16. Bounding Sampling Error (Informal).
  Theorem 1 (how sampling error scales in $n$): $\mathbb{E}[\|\hat{\mu} - \mu\|_2] \le O(n^{-1/2})$, where $n$ is the number of unlabeled data points.
  Theorem 2 (optimal scaling rate): $\mathbb{E}[\|\hat{\mu} - \mu\|_2] \ge \Omega(n^{-1/2})$, so the bound is tight: this is the best possible scaling rate with unlabeled data.
  [Figure: $Y$ with conditionally independent labeling functions $\lambda_1, \dots, \lambda_5$.]
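A quick empirical check of the $n^{-1/2}$ rate (a sketch reusing the synthetic setup from the previous code block; the accuracies, trial counts, and sample sizes are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
accs = np.array([0.9, 0.75, 0.6])
mu_true = 2 * accs - 1                     # E[lambda_i * Y] under the symmetric model

def mean_triplet_error(n, trials=50):
    errs = []
    for _ in range(trials):
        Y = rng.choice([-1, 1], size=n)
        lam = np.where(rng.random((n, 3)) < accs, Y[:, None], -Y[:, None])
        N = lam.T @ lam / n
        # abs() guards against tiny negative moment estimates at small n.
        mu_hat = np.sqrt(np.abs([N[0, 1] * N[0, 2] / N[1, 2],
                                 N[0, 1] * N[1, 2] / N[0, 2],
                                 N[0, 2] * N[1, 2] / N[0, 1]]))
        errs.append(np.linalg.norm(mu_hat - mu_true))
    return float(np.mean(errs))

for n in [2_000, 8_000, 32_000, 128_000]:
    # Quadrupling n should roughly halve the error if the O(n^{-1/2}) bound is tight.
    print(f"n = {n:>7}: mean error = {mean_triplet_error(n):.4f}")
```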

  17. End Model Generalization Error (Informal).
  Theorem 3 (end model generalization error): if you use the estimated parameters $\hat{\mu}$ to generate labels $\hat{Y}$ and train an end model $f_{\hat{w}}$, then
  $\mathbb{E}[L(\hat{w}, X, Y) - L(w^*, X, Y)] = O(n^{-1/2})$.
  This is the same asymptotic rate as with supervised data!
  More theory nuggets (check out our paper for details):
  ▪ We can achieve these rates even with model misspecification (the graph is incorrect).
  ▪ Bounds for distributional drift over time in the online setting.

  18. Evaluation & Implications

  19. We run faster and get high quality.
  Label model training times (s):
                         Snorkel   Temporal   FlyingSquid
    Snorkel Benchmarks     3.0        --          0.06
    Video Tasks           41.5      292.3         0.20
  End model accuracies (F1):
                         Snorkel   Temporal   FlyingSquid
    Snorkel Benchmarks    74.6        --         77.0
    Video Tasks           47.4      75.2         76.2

  20. Re-training in the End Model Training Loop.
  No SGD means the label model can be re-trained inside the training loop of an end model.
  [Figure: loss gradients flow back from the end model while the label model is re-fit on inputs X_1, X_2, X_3.]
  PyTorch integration: a FlyingSquid loss layer.
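The deck does not show the loss layer's code. Below is a hypothetical sketch of the idea only: re-fit a simplified triplet-based label model every epoch and use its probabilistic labels as soft targets. The data, model sizes, and naive-Bayes vote combination are our assumptions; this is not the released FlyingSquid PyTorch integration.

```python
import numpy as np
import torch
import torch.nn as nn

# Synthetic setup (assumption): 2-class problem, 3 non-abstaining labeling functions.
rng = np.random.default_rng(0)
n, d = 2_000, 16
Y = rng.choice([-1, 1], size=n)                               # unknown in practice
X = rng.normal(size=(n, d)) + 0.5 * Y[:, None]                # weakly informative features
accs = np.array([0.85, 0.75, 0.65])
lf_votes = np.where(rng.random((n, 3)) < accs, Y[:, None], -Y[:, None])

def fit_label_model(votes):
    """Closed-form triplet recovery, then P(Y = +1 | votes) per example.

    A simplified stand-in for FlyingSquid's label model (3 conditionally
    independent labeling functions, no abstains)."""
    N = votes.T @ votes / len(votes)
    mu = np.sqrt(np.abs([N[0, 1] * N[0, 2] / N[1, 2],
                         N[0, 1] * N[1, 2] / N[0, 2],
                         N[0, 2] * N[1, 2] / N[0, 1]]))
    acc = (mu + 1) / 2                                        # P(lambda_i = Y)
    log_odds = votes @ np.log(acc / (1 - acc))                # naive-Bayes vote combination
    return 1.0 / (1.0 + np.exp(-log_odds))

end_model = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(end_model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
X_t = torch.tensor(X, dtype=torch.float32)

for epoch in range(5):
    # The label model is cheap enough to re-fit every epoch (e.g. if new
    # labeling functions or new unlabeled data arrive mid-training).
    soft_labels = torch.tensor(fit_label_model(lf_votes), dtype=torch.float32)
    logits = end_model(X_t).squeeze(-1)
    loss = loss_fn(logits, soft_labels)                       # soft targets in [0, 1]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.3f}")
```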

  21. Speedups Enable Online Learning.
  Online learning: re-train on a rolling window to adapt to distributional drift over time.
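A sketch of the rolling-window idea (the window size, refit cadence, and drifting synthetic stream are all assumptions for illustration; the triplet fit is the same simplified one used in the earlier sketches):

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)

def lf_votes_at(t, n=1_000):
    """Synthetic batch at time t; labeling-function accuracies drift with t."""
    Y = rng.choice([-1, 1], size=n)
    accs = np.clip([0.9 - 1e-4 * t, 0.75, 0.6 + 1e-4 * t], 0.55, 0.95)
    return np.where(rng.random((n, 3)) < accs, Y[:, None], -Y[:, None])

def triplet_fit(votes):
    # Closed-form accuracy recovery, as in the earlier sketches.
    N = votes.T @ votes / len(votes)
    return np.sqrt(np.abs([N[0, 1] * N[0, 2] / N[1, 2],
                           N[0, 1] * N[1, 2] / N[0, 2],
                           N[0, 2] * N[1, 2] / N[0, 1]]))

window = deque(maxlen=5)                       # keep only the 5 most recent batches
for t in range(0, 2_000, 100):
    window.append(lf_votes_at(t))
    mu = triplet_fit(np.concatenate(window))   # cheap refit on the rolling window
    print(t, np.round(mu, 3))                  # tracks the drifting accuracies
```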

  22. Thank you!
  Contact: Dan Fu (danfu@cs.stanford.edu, @realDanFu)
  Code: https://github.com/HazyResearch/flyingsquid
  Blog post (Towards Interactive Weak Supervision with FlyingSquid): http://hazyresearch.stanford.edu/flyingsquid
  Paper (Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods): https://arxiv.org/abs/2002.11955
  Dan Fu, Mayee Chen, Fred Sala, Sarah Hooper
