  1. Follow the Leader with Dropout Perturbations. Tim van Erven, COLT 2014. Joint work with: Wojciech Kotłowski, Manfred Warmuth

  2. Neural Network

  3. Neural Network

  4. Dropout Training
● Stochastic gradient descent
● Randomly remove every hidden/input unit with probability 1/2 before each gradient descent update [Hinton et al., 2012]
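A minimal sketch of one such update in Python (the one-hidden-layer net, ReLU activation, and squared loss are illustrative assumptions, not details from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_sgd_step(W1, W2, x, y, lr=0.1, p_drop=0.5):
    # Forward pass with hidden units randomly removed (dropped) with
    # probability p_drop before the update [Hinton et al., 2012].
    h = np.maximum(0.0, W1 @ x)             # hidden layer (ReLU, illustrative)
    mask = rng.random(h.shape) >= p_drop    # keep each unit with prob 1 - p_drop
    h = h * mask                            # dropped units contribute nothing
    y_hat = W2 @ h

    # Backward pass for a squared loss; gradients flow only through kept units.
    err = y_hat - y
    grad_W2 = np.outer(err, h)
    grad_h = (W2.T @ err) * mask * (W1 @ x > 0)
    grad_W1 = np.outer(grad_h, x)
    return W1 - lr * grad_W1, W2 - lr * grad_W2
```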

  5. Dropout Training
● Very successful in e.g. image classification and speech recognition
● Many people are trying to analyse why it works [Wager, Wang, Liang, 2013]

  6. Prediction with Expert Advice
● Every round $t = 1, 2, \dots$:
1. (Randomly) choose expert $k_t \in \{1, \dots, K\}$
2. Observe expert losses $\ell_{t,1}, \dots, \ell_{t,K}$
3. Our loss is $\ell_{t,k_t}$
● Goal: minimize expected regret $R_T = \mathbb{E}\big[\sum_{t=1}^T \ell_{t,k_t}\big] - L_T^*$, where $L_T^* = \min_k \sum_{t=1}^T \ell_{t,k}$ is the loss of the best expert
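A sketch of this protocol as a Python loop (the array layout and names are my own, not from the talk):

```python
import numpy as np

def run_protocol(losses, choose_expert):
    # losses: (T, K) array of expert losses; choose_expert maps the
    # cumulative past losses to an expert index in 0..K-1.
    T, K = losses.shape
    cum = np.zeros(K)                # cumulative expert losses L_{t,k}
    our_loss = 0.0
    for t in range(T):
        k = choose_expert(cum)       # 1. (randomly) choose an expert
        our_loss += losses[t, k]     # 3. our loss is the chosen expert's loss
        cum += losses[t]             # 2. observe all expert losses
    return our_loss - cum.min()      # regret relative to L_T^*
```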

  7. Follow-the-Leader
● Deterministically choose the expert that has predicted best in the past: $k_{t+1} = \arg\min_k \sum_{s=1}^{t} \ell_{s,k}$ is the leader
● Can be fooled: regret grows linearly in $T$ for adversarial data
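As a choose_expert rule for the protocol sketch above, FTL is a single argmin (numpy's tie-breaking stands in for whatever convention the talk assumes):

```python
import numpy as np

def follow_the_leader(cum_losses):
    # Deterministically pick the expert with the smallest cumulative loss.
    return int(np.argmin(cum_losses))
```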

  8. Dropout Perturbations
● Independently drop (set to 0) each past loss with probability $\alpha$, freshly in every round: $\tilde\ell_{s,k} = Z_{s,k}\,\ell_{s,k}$ with $Z_{s,k} \sim \mathrm{Bernoulli}(1-\alpha)$ i.i.d.
● $k_{t+1} = \arg\min_k \sum_{s=1}^{t} \tilde\ell_{s,k}$ is the perturbed leader
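A sketch of the perturbed leader under this reading: each individual past loss is dropped independently with fresh randomness every round, so unlike FTL the rule needs the full loss history rather than just the cumulative sums:

```python
import numpy as np

def dropout_perturbed_leader(past_losses, alpha, rng):
    # past_losses: (t, K) array of all losses observed so far.
    keep = rng.random(past_losses.shape) >= alpha   # drop each loss w.p. alpha
    return int(np.argmin((keep * past_losses).sum(axis=0)))
```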

  9. Dropout Perturbations for Binary Losses
● For losses in $\{0,1\}$ it works: $R_T = O\big(\sqrt{L_T^* \log K} + \log K\big)$ for any dropout probability $\alpha \in (0,1)$
● No tuning required!

  10. Dropout Perturbations for Binary Losses
● For losses in $\{0,1\}$ it works: $R_T = O\big(\sqrt{L_T^* \log K} + \log K\big)$ for any dropout probability $\alpha \in (0,1)$
● No tuning required!
● But it does not work for continuous losses in [0,1]: there exist loss sequences on which the regret grows much faster than $\sqrt{L_T^* \log K}$

  11. Binarized Dropout Perturbations: Continuous Losses
● The right generalization for losses in [0,1]: replace each past loss $\ell_{s,k}$ by an independent binary perturbation $\tilde\ell_{s,k} \sim \mathrm{Bernoulli}\big((1-\alpha)\,\ell_{s,k}\big)$, then follow the leader
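A sketch of the binarized rule, assuming the Bernoulli((1 - alpha) * loss) form above; for {0,1} losses it coincides with plain dropout:

```python
import numpy as np

def binarized_dropout_leader(past_losses, alpha, rng):
    # Replace each past loss l in [0,1] by an independent Bernoulli draw
    # with mean (1 - alpha) * l, then follow the leader.
    perturbed = rng.random(past_losses.shape) < (1.0 - alpha) * past_losses
    return int(np.argmin(perturbed.sum(axis=0)))
```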

  12. Small Regret for IID Data
● If loss vectors are
– independent, identically distributed between trials,
– with a gap between the expected loss of the best expert and the rest,
then the regret is constant: $R_T = O(1)$ w.h.p., independent of $T$
● Algorithms that rely on the doubling trick for $T$ or $L_T^*$ do not get this
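A hypothetical experiment to illustrate the claim, reusing dropout_perturbed_leader from the earlier sketch (the horizon, gap, and loss means are arbitrary choices of mine):

```python
import numpy as np

def regret_on_iid_losses(T=2000, K=5, gap=0.2, alpha=0.5, seed=1):
    rng = np.random.default_rng(seed)
    means = np.full(K, 0.5)
    means[0] -= gap                  # expert 0 beats the rest by `gap`
    losses = (rng.random((T, K)) < means).astype(float)
    our_loss = 0.0
    for t in range(T):
        k = dropout_perturbed_leader(losses[:t], alpha, rng)
        our_loss += losses[t, k]
    # With a gap, this difference should stay bounded as T grows.
    return our_loss - losses.sum(axis=0).min()
```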

  13. Instance of Follow-the-Perturbed-Leader
● Follow-the-Perturbed-Leader [Kalai, Vempala, 2005]: we have data-dependent perturbations that differ between experts
● Standard analysis: bound the probability of a leader change in the be-the-leader lemma
● Elegant simple bound for the perturbations of Kalai & Vempala, but not for ours

  14. Related Work: RWP
● Random walk perturbation [Devroye et al., 2013]: perturb each expert's cumulative loss by an independent random walk $\sum_{s=1}^{t} X_{s,k}$ for centered Bernoulli variables $X_{s,k}$
● Equivalent to dropout (with $\alpha = 1/2$) when all losses equal 1
● But the perturbations do not adapt to the data, so no $\sqrt{L_T^*}$-bound
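For contrast, a sketch of RWP under my reading: the perturbation is a persistent random walk with centered Bernoulli(1/2) steps, drawn independently of the observed losses:

```python
import numpy as np

class RandomWalkPerturbedLeader:
    def __init__(self, K, seed=0):
        self.rng = np.random.default_rng(seed)
        self.walk = np.zeros(K)      # one random walk per expert

    def choose(self, cum_losses):
        # Each round the walk takes a centered step in {-1/2, +1/2}.
        self.walk += self.rng.integers(0, 2, size=self.walk.shape) - 0.5
        return int(np.argmin(cum_losses + self.walk))
```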

  15. Proof Outline ● Find worst-case loss sequence

  16. Proof Outline
● Find worst-case loss sequence: e.g. for 3 experts with cumulative losses 1, 3 and 5

  17. Proof Outline
● Find worst-case loss sequence: e.g. for 3 experts with cumulative losses 1, 3 and 5
1. Cumulative losses approximately equal: apply the lemma from RWP roughly once per K rounds
2. Expert 1 has much smaller cumulative loss: Hoeffding

  18. Summary
● Simple algorithm: follow-the-leader on losses that are perturbed by binarized dropout
● No tuning necessary
● On any losses in [0,1]: $R_T = O\big(\sqrt{L_T^* \log K} + \log K\big)$
● On i.i.d. loss vectors with a gap between the best expert and the rest: constant regret w.h.p.

  19. Many Open Questions
To discuss at the poster!
● Can we use dropout for:
– Tracking the best expert?
– Combinatorial settings (e.g. online shortest path)?
● Need to reuse randomness between experts
● What about variations on the dropout perturbations?
– Drop the whole loss vector at once?

  20. References
● Hinton, Srivastava, Krizhevsky, Sutskever, Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
● Wager, Wang, Liang. Dropout training as adaptive regularization. NIPS, 2013.
● Kalai, Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
● Devroye, Lugosi, Neu. Prediction by random-walk perturbation. COLT, 2013.
● Van Erven, Kotłowski, Warmuth. Follow the leader with dropout perturbations. COLT, 2014.
