Online Algorithms: Learning & Optimization with No Regret


  1. Online Algorithms: Learning & Optimization with No Regret. CS/CNS/EE 253, Daniel Golovin.

  2. The Setup
  Optimization:
  ● Model the problem (objective, constraints)
  ● Pick the best decision from a feasible set.
  Learning:
  ● Model the problem (objective, hypothesis class)
  ● Pick the best hypothesis from a feasible set.

  3. Online Learning/Optimization
  In each round $t$: choose an action $x_t \in X$, then get reward $f_t(x_t)$ and feedback, where $f_t : X \to [0, 1]$.
  ● Same feasible set $X$ in each round $t$
  ● Different reward models: stochastic; arbitrary but oblivious; adaptive and arbitrary
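  To make the round structure concrete, here is a minimal Python sketch of the protocol above; the `learner.act`/`learner.observe` interface and the `run_online` name are illustrative, not from the slides.

```python
# Minimal sketch of the online learning protocol on slide 3.
# The learner interface (act/observe) is a hypothetical choice of mine.
def run_online(learner, reward_fns, feasible_set):
    """Play T rounds: pick x_t, receive f_t(x_t), then see full feedback."""
    total_reward = 0.0
    for t, f_t in enumerate(reward_fns):      # f_t : X -> [0, 1]
        x_t = learner.act(feasible_set, t)    # choose an action x_t in X
        total_reward += f_t(x_t)              # collect reward f_t(x_t)
        learner.observe(f_t, t)               # full-information feedback
    return total_reward
```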

  4. Concrete Example: Commuting
  Pick a path $x_t$ from home to school. Pay cost $f_t(x_t) := \sum_{e \in x_t} c_t(e)$. Then see all edge costs for that round.
  Dealing with limited feedback: later in the course.
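  A toy version of the commuting example, with made-up edge names and costs; `path_cost` simply evaluates the sum $\sum_{e \in x_t} c_t(e)$ from the slide.

```python
# Toy commuting example; edge names and costs below are illustrative only.
def path_cost(path_edges, edge_costs):
    """f_t(x_t) = sum of this round's costs c_t(e) over edges e in the chosen path."""
    return sum(edge_costs[e] for e in path_edges)

# Round t: the commuter picks a path, pays its cost, then sees all edge costs.
x_t = ["home->bridge", "bridge->school"]
c_t = {"home->bridge": 0.3, "bridge->school": 0.2,
       "home->tunnel": 0.1, "tunnel->school": 0.6}
print(path_cost(x_t, c_t))   # 0.5
```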

  5. Other Applications
  ● Sequential decision problems
  ● Streaming algorithms for optimization/learning with large data sets
  ● Combining weak learners into strong ones (“boosting”)
  ● Fast approximate solvers for certain classes of convex programs
  ● Playing repeated games

  6. Binary Prediction with a Perfect Expert
  ● $n$ hypotheses (“experts”) $h_1, h_2, \ldots, h_n$
  ● Guaranteed that some hypothesis is perfect.
  ● Each round, get a data point $p_t$ and classifications $h_i(p_t) \in \{0, 1\}$
  ● Output a binary prediction $x_t$, then observe the correct label
  ● Goal: minimize the number of mistakes
  Any suggestions?

  7. A Weighted Majority Algorithm
  ● Each expert “votes” for its classification.
  ● Only votes from experts who have never been wrong are counted.
  ● Go with the majority.
  Claim: the number of mistakes satisfies $M \le \log_2(n)$.
  Proof sketch: let $w_{it} = \mathbf{1}(h_i \text{ correct on first } t \text{ rounds})$ and $W_t = \sum_i w_{it}$. Then $W_0 = n$ and $W_T \ge 1$ (the perfect expert always survives). A mistake on round $t$ implies $W_{t+1} \le W_t / 2$, so $1 \le W_T \le W_0 / 2^M = n / 2^M$.
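  A short Python sketch of this majority-of-never-wrong-experts rule (the function name and input format are my own): it tracks which experts are still consistent and predicts with their majority vote.

```python
# Sketch of the majority-of-consistent-experts algorithm from slide 7,
# assuming a perfect expert exists. Input format is illustrative:
# expert_preds[t][i] in {0, 1}, labels[t] is the true label.
def perfect_expert_majority(expert_preds, labels):
    n = len(expert_preds[0])
    alive = [True] * n                     # w_i = 1 while expert i has never erred
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        votes_1 = sum(1 for i in range(n) if alive[i] and preds[i] == 1)
        votes_0 = sum(1 for i in range(n) if alive[i] and preds[i] == 0)
        guess = 1 if votes_1 >= votes_0 else 0
        if guess != y:
            mistakes += 1                  # each mistake at least halves the alive set
        for i in range(n):
            if preds[i] != y:
                alive[i] = False           # drop any expert that was ever wrong
    return mistakes
```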

  8. Weighted Majority [Littlestone & Warmuth '89]
  What if there's no perfect expert?
  ● Each expert $i$ has a weight $w(i)$ and “votes” for its classification in $\{-1, 1\}$.
  ● Go with the weighted majority: predict $\mathrm{sign}(\sum_i w_i x_i)$.
  ● Halve the weights of wrong experts.
  Let $m$ = number of mistakes of the best expert. How many mistakes $M$ do we make?
  Set $w_{it} = (1/2)^{\#\text{ mistakes by } i \text{ on first } t \text{ rounds}}$ and $W_t := \sum_i w_{it}$. Note $W_0 = n$ and $W_T \ge (1/2)^m$. A mistake on round $t$ implies $W_{t+1} \le \frac{3}{4} W_t$, so $(1/2)^m \le W_T \le W_0 (3/4)^M = n \cdot (3/4)^M$. Thus $(4/3)^M \le n \cdot 2^m$ and $M \le 2.41(m + \log_2(n))$.
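  A corresponding sketch of deterministic Weighted Majority with weight halving, again with an illustrative input format: predictions and labels in $\{-1, +1\}$.

```python
# Sketch of deterministic Weighted Majority (Littlestone & Warmuth) from slide 8.
# expert_preds[t][i] in {-1, +1}; labels[t] in {-1, +1}.
def weighted_majority(expert_preds, labels):
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        score = sum(w[i] * preds[i] for i in range(n))
        guess = 1 if score >= 0 else -1    # predict the sign of the weighted vote
        if guess != y:
            mistakes += 1
        for i in range(n):
            if preds[i] != y:
                w[i] *= 0.5                # halve the weight of each wrong expert
    return mistakes
```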

  9. Can we do better?
  $M \le 2.41(m + \log_2(n))$
  Consider two experts, $e_1 \equiv -1$ and $e_2 \equiv 1$, with mistakes per round:

  Experts \ Time   | 1 | 2 | 3 | 4
  $e_1 \equiv -1$  | 0 | 1 | 0 | 1
  $e_2 \equiv 1$   | 1 | 0 | 1 | 0

  An adversary can alternate the labels so that a deterministic algorithm errs every round while each expert errs only half the time.
  ● No deterministic algorithm can get $M < 2m$.
  ● What if there are more than 2 choices?

  10. Regret
  “Maybe all one can do is hope to end up with the right regrets.” – Arthur Miller
  ● Notation: define loss or cost functions $c_t$ and define the regret of $x_1, x_2, \ldots, x_T$ as
  $$R_T = \sum_{t=1}^T c_t(x_t) - \sum_{t=1}^T c_t(x^*), \qquad \text{where } x^* = \operatorname*{argmin}_{x \in X} \sum_{t=1}^T c_t(x).$$
  A sequence has “no regret” if $R_T = o(T)$.
  ● Questions:
    ● How can we improve Weighted Majority?
    ● What is the lowest regret we can hope for?
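  Computed directly from this definition, regret is just the algorithm's total cost minus the best fixed action's total cost; a small illustrative helper (names are mine):

```python
# Regret R_T of a play sequence against the best fixed action (slide 10 definition).
def regret(cost_fns, plays, feasible_set):
    """R_T = sum_t c_t(x_t) - min_{x in X} sum_t c_t(x)."""
    alg_cost = sum(c_t(x_t) for c_t, x_t in zip(cost_fns, plays))
    best_fixed = min(sum(c_t(x) for c_t in cost_fns) for x in feasible_set)
    return alg_cost - best_fixed
```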

  11. The Hedge/WMR Algorithm* [Freund & Schapire '97]
  Hedge($\epsilon$):
    Initialize $w_{i0} = 1$ for all $i$.
    In each round $t$:
      Let $p_t(i) := w_{it} / \sum_j w_{jt}$.
      Choose expert $e_t$ from the categorical distribution $p_t$.
      Select $x_t = x(e_t, t)$, the advice/prediction of $e_t$.
      For each $i$, set $w_{i,t+1} = w_{it}(1 - \epsilon)^{c_t(x(e_i, t))}$.
  ● How does this compare to WM?
  * Pedantic note: Hedge is often called “Randomized Weighted Majority” and abbreviated “WMR”, though WMR was published in the context of binary classification, unlike Hedge.

  12. The Hedge/WMR Algorithm
  Hedge($\epsilon$):
    Initialize $w_{i0} = 1$ for all $i$.
    In each round $t$:
      Let $p_t(i) := w_{it} / \sum_j w_{jt}$.
      Choose expert $e_t$ from the categorical distribution $p_t$ (randomization).
      Select $x_t = x(e_t, t)$, the advice/prediction of $e_t$.
      For each $i$, set $w_{i,t+1} = w_{it}(1 - \epsilon)^{c_t(x(e_i, t))}$ (an expert's influence shrinks exponentially with its cumulative loss).
  Intuitively: either we do well on a round, or the total weight drops, and the total weight can't drop too much unless every expert is lousy.
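  A minimal Python sketch of Hedge($\epsilon$) as stated on slides 11–12, assuming full information: `expert_costs[t][i]` holds $c_t(x(e_i, t)) \in [0,1]$ (this cost-matrix interface is my choice, not the slides').

```python
import random

# Minimal sketch of Hedge(eps); expert_costs[t][i] = c_t(x(e_i, t)) in [0, 1].
def hedge(expert_costs, eps):
    n = len(expert_costs[0])
    w = [1.0] * n                                  # w_{i,0} = 1 for all i
    total_cost = 0.0
    for costs in expert_costs:
        W = sum(w)
        p = [w_i / W for w_i in w]                 # p_t(i) = w_{it} / sum_j w_{jt}
        i_t = random.choices(range(n), weights=p)[0]   # sample e_t ~ p_t
        total_cost += costs[i_t]                   # pay c_t(x(e_t, t))
        w = [w[i] * (1 - eps) ** costs[i] for i in range(n)]  # multiplicative update
    return total_cost
```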

  13. Hedge Performance
  Theorem: Let $x_1, x_2, \ldots$ be the choices of Hedge($\epsilon$). Then
  $$\mathbb{E}\!\left[\sum_{t=1}^T c_t(x_t)\right] \le \frac{\mathrm{OPT}_T}{1 - \epsilon} + \frac{\ln(n)}{\epsilon},$$
  where $\mathrm{OPT}_T := \min_i \sum_{t=1}^T c_t(x(e_i, t))$.
  If $\epsilon = \Theta\!\left(\sqrt{\ln(n)/\mathrm{OPT}}\right)$, the regret is $\Theta\!\left(\sqrt{\mathrm{OPT}\,\ln(n)}\right)$.
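  As an illustrative numeric check (not from the slides), plugging $n = 100$ and $\mathrm{OPT} = 1000$ into the bound with the suggested $\epsilon$:

```python
import math

# Plugging numbers into E[cost] <= OPT/(1 - eps) + ln(n)/eps from slide 13.
# The values n = 100, OPT = 1000 are made up for illustration.
n, OPT = 100, 1000.0
eps = math.sqrt(math.log(n) / OPT)          # eps = Theta(sqrt(ln(n)/OPT))
bound = OPT / (1 - eps) + math.log(n) / eps
regret_bound = bound - OPT                  # roughly Theta(sqrt(OPT * ln(n)))
print(eps, bound, regret_bound)             # ~0.068, ~1140.7, ~140.7
```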

  14. Hedge Analysis
  Intuitively: either we do well on a round, or the total weight drops, and the total weight can't drop too much unless every expert is lousy.
  Let $W_t := \sum_i w_{it}$. Then $W_0 = n$ and $W_{T+1} \ge (1-\epsilon)^{\mathrm{OPT}}$.
  $$\begin{aligned}
  W_{t+1} &= \sum_i w_{it}(1-\epsilon)^{c_t(x_{it})} && (1)\\
          &= W_t \sum_i p_t(i)\,(1-\epsilon)^{c_t(x_{it})} && (2)\ [\text{def. of } p_t(i)]\\
          &\le W_t \sum_i p_t(i)\,\bigl(1 - \epsilon \cdot c_t(x_{it})\bigr) && (3)\ [\text{Bernoulli's ineq.: if } x > -1,\ r \in (0,1),\ \text{then } (1+x)^r \le 1 + rx]\\
          &= W_t \bigl(1 - \epsilon \cdot \mathbb{E}[c_t(x_t)]\bigr) && (4)\\
          &\le W_t \cdot \exp\bigl(-\epsilon \cdot \mathbb{E}[c_t(x_t)]\bigr) && (5)\ [1 - x \le e^{-x}]
  \end{aligned}$$

  15. Hedge Analysis
  $$\frac{W_{T+1}}{W_0} \le \exp\!\left(-\epsilon \sum_{t=1}^T \mathbb{E}[c_t(x_t)]\right), \qquad \frac{W_0}{W_{T+1}} \ge \exp\!\left(\epsilon \sum_{t=1}^T \mathbb{E}[c_t(x_t)]\right).$$
  Recall $W_0 = n$ and $W_{T+1} \ge (1-\epsilon)^{\mathrm{OPT}}$. Therefore
  $$\mathbb{E}\!\left[\sum_{t=1}^T c_t(x_t)\right] \le \frac{1}{\epsilon}\ln\!\left(\frac{W_0}{W_{T+1}}\right) \le \frac{\ln(n)}{\epsilon} - \frac{\mathrm{OPT}\,\ln(1-\epsilon)}{\epsilon} \le \frac{\ln(n)}{\epsilon} + \frac{\mathrm{OPT}}{1-\epsilon}.$$

  16. Lower Bound
  If $\epsilon = \Theta\!\left(\sqrt{\ln(n)/\mathrm{OPT}}\right)$, the regret is $\Theta\!\left(\sqrt{\mathrm{OPT}\,\ln(n)}\right)$. Can we do better?
  Let $c_t(x) \sim \mathrm{Bernoulli}(1/2)$ for all $x$ and $t$. Let $Z_i := \sum_{t=1}^T c_t(x(e_i, t))$. Then $Z_i \sim \mathrm{Bin}(T, 1/2)$ is roughly normally distributed, with $\sigma = \frac{1}{2}\sqrt{T}$, and $\Pr[Z_i \le \mu - k\sigma] = \exp(-\Theta(k^2))$.
  We get about $\mu = T/2$, while the best choice is likely to get $\mu - \Theta\!\left(\sqrt{T \ln(n)}\right) = \mu - \Theta\!\left(\sqrt{\mathrm{OPT}\,\ln(n)}\right)$.
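  A quick Monte Carlo check of this argument (my own, not from the slides): with Bernoulli(1/2) costs, the best of $n$ experts typically beats the mean $T/2$ by roughly $\sqrt{T \ln(n)/2}$.

```python
import math, random

# Simulation of the slide-16 lower-bound argument with Bernoulli(1/2) costs.
T, n = 10_000, 100
Z = [sum(random.random() < 0.5 for _ in range(T)) for _ in range(n)]  # Z_i ~ Bin(T, 1/2)
best, mean = min(Z), T / 2
print(mean - best, math.sqrt(T * math.log(n) / 2))   # both are on the order of ~150 here
```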

  17. What have we shown?
  ● A simple algorithm that learns to do nearly as well as the best fixed choice.
  ● Hedge can exploit any pattern that the best choice does.
  ● Works for adaptive adversaries.
  ● Suitable for playing repeated games. Related ideas appear in the algorithmic game theory literature.

  18. Related Questions
  ● Optimize and get no regret against richer classes of strategies/experts:
    – All distributions over experts
    – All sequences of experts that have $K$ transitions [Auer et al '02]
    – Various classes of functions of input features [Blum & Mansour '05]
      ● E.g., consider the time of day when choosing a driving route.
    – Arbitrary convex sets of experts, metric spaces of experts, etc., with linear, convex, or Lipschitz costs [Zinkevich '03, Kleinberg et al '08]
    – All policies of a $K$-state, initially unknown Markov Decision Process that models the world [Auer et al '08]
    – Arbitrary sets of strategies in $\mathbb{R}^n$ with linear costs that we can optimize offline [Hannan '57, Kalai & Vempala '02]

  19. Related Questions
  ● Other notions of regret (see, e.g., [Blum & Mansour '05])
  ● Time selection functions:
    – get low regret on Mondays, on rainy days, etc.
  ● Sleeping experts:
    – if the rule “if(P) then predict Q” is right 90% of the time it applies, be right 89% of the time P applies.
  ● Internal regret & swap regret:
    – if you played $x_1, \ldots, x_T$, then have no regret against $g(x_1), \ldots, g(x_T)$ for every $g: X \to X$.

  20. Sleeping Experts [Freund et al '97, Blum '97, Blum & Mansour '05]
  ● If the rule “if(P) then predict Q” is right 90% of the time it applies, be right 89% of the time P applies. Get this for every rule simultaneously.
  ● Idea: generate lots of hypotheses that “specialize” on certain inputs, some good, some lousy, and combine them into a great classifier.
  ● Many applications: document classification, spam filtering, adaptive UIs, ...
    – if (“physics” in D) then classify D as “science”.
  ● Predicates can overlap.

  21. Sleeping Experts
  ● Predicates can overlap.
  ● E.g., predict a student's college major given the set of classes C they're enrolled in:
    – if (ML-101, CS-201 in C) then CS
    – if (ML-101, Stats-201 in C) then Stats
  ● What do we predict for students enrolled in ML-101, CS-201, and Stats-201?

  22. Sleeping Experts [Algorithm from Blum & Mansour '05]
  SleepingExperts($\beta$, $E$, $F$)
  Input: $\beta \in (0, 1)$, experts $E$, time selection functions $F$.
    Initialize $w^0_{e,f} = 1$ for all $e \in E$, $f \in F$.
    In each round $t$:
      Let $w^t_e = \sum_f f(t)\, w^t_{e,f}$. Let $W^t = \sum_e w^t_e$. Let $p^t_e = w^t_e / W^t$.
      Choose expert $e_t$ from the categorical distribution $p^t$.
      Select $x_t = x(e_t, t)$, the advice/prediction of $e_t$.
      For each $e \in E$, $f \in F$: set $w^{t+1}_{e,f} = w^t_{e,f}\, \beta^{\,f(t)\,(c_t(e) - \beta\, \mathbb{E}[c_t(e_t)])}$.
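  A sketch of one round of this update with my own data-structure choices; `awake[f]` plays the role of $f(t)$, and the expected cost $\mathbb{E}[c_t(e_t)]$ is taken under the sampling distribution $p^t$.

```python
import random

# Sketch of one SleepingExperts round (slide 22); beta in (0, 1), awake[f] = f(t),
# costs[e] = c_t(e) in [0, 1], and w[(e, f)] are the per-pair weights.
def sleeping_experts_round(w, beta, awake, costs, experts, selectors):
    w_e = {e: sum(awake[f] * w[(e, f)] for f in selectors) for e in experts}
    W = sum(w_e.values())
    p = [w_e[e] / W for e in experts]                  # p^t_e = w^t_e / W^t
    e_t = random.choices(experts, weights=p)[0]        # sample e_t ~ p^t
    exp_cost = sum(p_i * costs[e] for p_i, e in zip(p, experts))  # E[c_t(e_t)]
    for e in experts:
        for f in selectors:                            # multiplicative update
            w[(e, f)] *= beta ** (awake[f] * (costs[e] - beta * exp_cost))
    return e_t
```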

  23. Sleeping Experts [Algorithm from Blum & Mansour '05]
  $$w^{t+1}_{e,f} = w^t_{e,f}\, \beta^{\,f(t)\,(c_t(e) - \beta\, \mathbb{E}[c_t(e_t)])}$$
  This ensures the total sum of weights can never increase, so $\sum_{e,f} w^t_{e,f} \le nm$ for all $t$:
  $$w^T_{e,f} = \prod_{t \ge 0} \beta^{\,f(t)\,(c_t(e) - \beta\, \mathbb{E}[c_t(e_t)])} = \beta^{\,\sum_{t \ge 0} f(t)\,(c_t(e) - \beta\, \mathbb{E}[c_t(e_t)])} \le nm$$
