  1. CS286r Presentation
     James Burns, March 7, 2006
     • Calibrated Learning and Correlated Equilibrium
       – by Dean Foster and Rakesh Vohra
     • Regret in the On-Line Decision Problem
       – by Dean Foster and Rakesh Vohra

  2. Outline
     • Correlated Equilibria
     • Forecasts and Calibration
     • Calibration and Correlated Equilibria
     • Loss and Regret
     • Existence of a no-regret forecasting scheme
     • Further results and discussion

  3. Correlated Equilibria
     • Motivation
       – It is difficult to find learning rules that guarantee convergence to Nash equilibria (NE)
       – CE are easy to compute
       – Consistent with the Bayesian perspective (Aumann 1987)
       – CE can Pareto-dominate NE – is this relevant?
     • Drawback
       – The problem of multiplicity of equilibria is worse!

  4. Forecasts
     • $f(t) = \{p_1(t), \dots, p_n(t)\}$
     • $p_j(t)$ is the forecasted probability that event $j$ occurs at time $t$
     • Let $N(p, t)$ be the number of times that $f$ generates the forecast $p$ up to time $t$

  5. Calibration
     • Let $\chi(j, t) = 1$ if event $j$ occurs at time $t$
     • We now define $\rho(p, j, t)$, the empirical frequency of event $j$ given the forecast $p$:
       $$\rho(p, j, t) = \begin{cases} 0 & \text{if } N(p, t) = 0 \\[4pt] \dfrac{\sum_{s=1}^{t} I_{f(s) = p}\, \chi(j, s)}{N(p, t)} & \text{otherwise} \end{cases}$$
     • For the forecasting scheme to be calibrated we require:
       $$\lim_{t \to \infty} \sum_{p} \frac{\left| \rho(p, j, t) - p_j \right| N(p, t)}{t} = 0$$
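The calibration score above translates directly into code. Below is a minimal sketch, not taken from the papers; the function name `calibration_score` and the dictionary bookkeeping are illustrative choices, and forecasts are assumed to come from a finite set (tuples) so they can serve as dictionary keys. The score here is summed over all events $j$ for convenience.

```python
from collections import defaultdict

def calibration_score(forecasts, outcomes):
    """Calibration score  sum_p |rho(p, j, t) - p_j| * N(p, t) / t,  summed over events j.

    forecasts: list of forecast vectors p(s) = (p_1(s), ..., p_n(s)), one per period.
    outcomes:  list of realized event indices j(s) in {0, ..., n-1}, one per period.
    """
    t = len(forecasts)
    n = len(forecasts[0])
    counts = defaultdict(int)                # N(p, t)
    hits = defaultdict(lambda: [0] * n)      # sum_s I{f(s) = p} * chi(j, s), per event j
    for p, j in zip(forecasts, outcomes):
        counts[p] += 1
        hits[p][j] += 1
    score = 0.0
    for p, N in counts.items():
        for j in range(n):
            rho = hits[p][j] / N             # empirical frequency of event j given forecast p
            score += abs(rho - p[j]) * N / t
    return score                             # a calibrated scheme drives this toward 0 as t grows
```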

  6. Example: Forecasting the Weather
     • Pick a forecasting scheme to predict whether or not it will rain
     • $f(t) = p(t)$ is the forecasted probability that it will rain at time $t$
     • Let $N(p, t)$ be the number of times that $f(t) = p$ up to time $t$
     • $\rho(p, t)$ is the frequency with which it rained given that rain was forecast with probability $p$
     • For the forecasting scheme to be calibrated we require:
       $$\lim_{t \to \infty} \sum_{p} \left| \rho(p, t) - p \right| \frac{N(p, t)}{t} = 0$$
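As a toy illustration of the binary rain case (synthetic data, and the forecaster choice is purely illustrative), the following self-contained sketch scores a running-frequency forecaster, coarsened to 0.1-wide forecast bins, using the calibration sum above:

```python
import random
from collections import defaultdict

# Hypothetical i.i.d. weather history and a running-frequency forecaster.
random.seed(0)
rain = [random.random() < 0.3 for _ in range(2000)]

counts = defaultdict(int)      # N(p, t): times the forecast p was issued
rained = defaultdict(int)      # times it rained when the forecast was p
wet = 0
for s, r in enumerate(rain):
    p = round(wet / s, 1) if s else 0.5   # running rain frequency, binned to 0.1
    counts[p] += 1
    rained[p] += r
    wet += r

t = len(rain)
score = sum(abs(rained[p] / N - p) * N / t for p, N in counts.items())
print(score)   # small: this forecaster is roughly calibrated on i.i.d. data
```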

  7. How does fictitious play fit in?
     • Fictitious play is a particular forecasting scheme: the forecast equals the agent's prior updated by the unconditioned empirical frequency of events
     • This means that if the forecast converges, we have
       $$p_j(t) \to \frac{1}{t} \sum_{s=1}^{t} \chi(j, s)$$
       where $\chi(j, s) = 1$ if event $j$ occurs at time $s$
     • In fictitious play, forecasts converge to empirical frequencies, whereas calibration requires that forecasts converge to empirical frequencies conditioned on the forecasts
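A minimal sketch of a fictitious-play style forecast as described above, assuming the prior is represented as pseudo-counts; the function and variable names are illustrative, not from the papers.

```python
def fictitious_play_forecast(history, prior_counts):
    """Forecast = prior pseudo-counts updated by the unconditional empirical counts.

    history:      list of observed event indices so far.
    prior_counts: list of nonnegative pseudo-counts, one per event.
    """
    counts = list(prior_counts)
    for j in history:
        counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts]   # p_j(t) -> empirical frequency as t grows

# Example: with a uniform prior the forecast tracks the raw frequency of each event,
# unconditioned on past forecasts -- the key contrast with calibration.
print(fictitious_play_forecast([0, 1, 1, 0, 1], [1, 1]))   # [0.428..., 0.571...]
```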

  8. Calibrated Forecasts and Correlated Equilibrium
     • Consider a two-player game $G$. We can characterize a CE in the set of all CE of the game, $\pi(G)$, by the induced joint distribution over the agents' strategy sets $S(1) \times S(2)$.
     • We denote this joint distribution by $D(x, y)$. Further, let $D_t(x, y)$ be the empirical frequency with which $(x, y)$ is played up to time $t$.
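The empirical joint distribution $D_t(x, y)$ is straightforward to compute from a play history; the sketch below is illustrative (the function name and Counter-based representation are assumptions, not from the papers).

```python
from collections import Counter

def empirical_joint_distribution(plays):
    """D_t(x, y): fraction of the first t periods in which the pair (x, y) was played.

    plays: list of (x, y) strategy pairs, one per period.
    """
    t = len(plays)
    counts = Counter(plays)
    return {pair: c / t for pair, c in counts.items()}

# Example on a hypothetical play history of a 2x2 game:
history = [("U", "L"), ("D", "R"), ("U", "L"), ("U", "R")]
print(empirical_joint_distribution(history))
# {('U', 'L'): 0.5, ('D', 'R'): 0.25, ('U', 'R'): 0.25}
```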

  9. • Theorem 1 (VF, 97): If each player uses a forecast that is calibrated against the other's sequence of plays, and then makes a best response to this forecast, then
       $$\min_{D \in \pi(G)} \; \max_{x \in S(1),\, y \in S(2)} \left| D_t(x, y) - D(x, y) \right| \to 0$$
     • Important assumption: players use a deterministic tie-breaking rule in making best responses.
     • What does this actually claim?
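Theorem 1 says the empirical distribution $D_t$ approaches the set of correlated equilibria. Membership in that set can be checked directly from the CE incentive constraints; the sketch below is an illustrative check (the function and argument names are assumptions, and payoffs are treated as utilities to be maximized rather than losses).

```python
def is_correlated_equilibrium(D, u1, u2, tol=1e-9):
    """Check the CE constraints: for each player, each recommended action x, and each
    deviation x', sum_y D(x, y) * (u1(x, y) - u1(x', y)) >= 0 (symmetrically for player 2).

    D:  dict mapping (x, y) -> probability.
    u1: dict (x, y) -> payoff of player 1;  u2: dict (x, y) -> payoff of player 2.
    """
    xs = {x for x, _ in u1}
    ys = {y for _, y in u1}
    for x in xs:
        for x_dev in xs:
            gain = sum(D.get((x, y), 0.0) * (u1[(x, y)] - u1[(x_dev, y)]) for y in ys)
            if gain < -tol:
                return False
    for y in ys:
        for y_dev in ys:
            gain = sum(D.get((x, y), 0.0) * (u2[(x, y)] - u2[(x, y_dev)]) for x in xs)
            if gain < -tol:
                return False
    return True

# Example: a public-coin correlated equilibrium of a coordination game.
u1 = {("A", "A"): 2, ("A", "B"): 0, ("B", "A"): 0, ("B", "B"): 1}
u2 = {("A", "A"): 1, ("A", "B"): 0, ("B", "A"): 0, ("B", "B"): 2}
D  = {("A", "A"): 0.5, ("B", "B"): 0.5}
print(is_correlated_equilibrium(D, u1, u2))   # True
```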

  10. Outline of Proof
      • $D_t(x, y)$ lies in the $(nm - 1)$-dimensional unit simplex, which is closed and bounded
      • The Bolzano–Weierstrass theorem therefore implies that $D_t(x, y)$ has a convergent subsequence $D_{t_i}(x, y)$
      • Let $D^*$ be the limit of $D_{t_i}(x, y)$; we show that $D^*$ is a correlated equilibrium
      • Basic argument: show that the vector whose $y$th component is $D^*(x, y) / \sum_{c \in S(2)} D^*(x, c)$ is in the set of mixtures over $S(2)$ for which $x$ is a best response. This holds because the forecasting rule is calibrated.

  11. • Missing step: if the theorem did not hold, there would be a subsequence $D_{t_j}(x, y)$ such that
        $$\left| D_{t_j}(x, y) - D(x, y) \right| > \epsilon$$
        for some $\epsilon > 0$ and all $j$. However, this subsequence must itself have a convergent subsequence which, by the argument above, converges to a CE, contradicting our assumption.

  12. Calibration and CE, continued
      • Theorem 2 (VF, 97): For almost every game, the set of distributions to which calibrated learning rules can converge is identical to the set of correlated equilibria.
        – The proof is constructive
        – Is this theorem useful? What can it really tell us?

  13. • Theorem 3 (VF, 97): There exists a randomized forecast that player 1 can use such that, no matter what learning rule player 2 uses, player 1 will be calibrated.
        – The proof gives an algorithm for constructing a randomized forecast rule that is calibrated, but it is not intuitive
        – It is based on a regret measure
        – Each step in the procedure requires computing an invariant vector of increasing size

  14. • We consider an on-line decision problem (ODP) in which an agent incurs a loss in every period as a function of the decision made and the state of the world in that period. The objective of the agent is to minimize the total loss incurred, e.g. guessing a sequence of 0s and 1s.

  15. Loss
      • Notation
        – Let $D = \{d_1, d_2, \dots, d_n\}$ be the set of possible decisions at time $t$
        – $L_t^j \le 1$ is the loss incurred at time $t$ from taking action $j$
        – We represent a decision-making scheme $S$ by the probability vectors $w_t$, where $w_t^j$ is the probability that decision $j$ is chosen at time $t$
      • Define $L(S)$, the expected loss from using scheme $S$ over $T$ periods, as
        $$L(S) = \sum_{t=1}^{T} \sum_{d_j \in D} w_t^j L_t^j$$
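The expected loss $L(S)$ can be transcribed directly; the sketch below assumes losses and probability vectors are given as per-period lists indexed by decision, and all names and data are illustrative.

```python
def expected_loss(weights, losses):
    """L(S) = sum_{t=1..T} sum_j w_t^j * L_t^j.

    weights: list over periods, each a list w_t of probabilities over decisions.
    losses:  list over periods, each a list L_t of per-decision losses (each <= 1).
    """
    return sum(
        w * l
        for w_t, L_t in zip(weights, losses)
        for w, l in zip(w_t, L_t)
    )

# Example: two decisions, three periods of hypothetical losses.
weights = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
losses  = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.3]]
print(expected_loss(weights, losses))   # 0.5 + 0.1 + 0.44 ~= 1.04
```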

  16. Regret
      • We now compare the loss under the scheme $S$ with the loss that would have been incurred had a different scheme been used.
      • In particular, we consider the change in loss from replacing an action $d_j$ with another action $d_i$.
      • Given a scheme $S$ that uses decision $d_j$ in period $t$ with probability $w_t^j$, define the pairwise regret of switching from decision $d_j$ to $d_i$ as
        $$R_T^{j \to i}(S) = \sum_{t=1}^{T} w_t^j \left( L_t^j - L_t^i \right)$$
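The pairwise regret $R_T^{j \to i}(S)$ is a one-line computation; a minimal, self-contained sketch with hypothetical data (all names are illustrative):

```python
def pairwise_regret(weights, losses, j, i):
    """R_T^{j->i}(S) = sum_t w_t^j * (L_t^j - L_t^i): the expected reduction in loss
    had decision i been played every time the scheme put weight on decision j."""
    return sum(w_t[j] * (L_t[j] - L_t[i]) for w_t, L_t in zip(weights, losses))

# Hypothetical three-period run with two decisions:
weights = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
losses  = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.3]]
print(pairwise_regret(weights, losses, 0, 1))   # 0.5 - 0.9 + 0.14 ~= -0.26
```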

  17. • Define the regret incurred by $S$ from using decision $d_j$ up to time $T$ as
        $$R_T^j(S) = \sum_{i \in D} \left[ R_T^{j \to i}(S) \right]^+ \quad \text{where} \quad \left[ R_T^{j \to i}(S) \right]^+ = \max\left\{ 0,\, R_T^{j \to i}(S) \right\}$$
      • Define the regret from using $S$ as
        $$R_T(S) = \sum_{j \in D} R_T^j(S)$$
      • We say that the scheme $S$ has the no-internal-regret property if its expected regret is small:
        $$R_T(S) = o(T)$$
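Putting the pieces together, the total internal regret $R_T(S)$ sums the positive parts of all pairwise regrets; a scheme has the no-internal-regret property when this total grows sublinearly, i.e. `internal_regret(weights, losses) / T` tends to zero. A minimal, illustrative sketch (names are assumptions):

```python
def internal_regret(weights, losses):
    """R_T(S) = sum_j sum_i [R_T^{j->i}(S)]^+, with
    R_T^{j->i}(S) = sum_t w_t^j * (L_t^j - L_t^i)."""
    n = len(losses[0])
    total = 0.0
    for j in range(n):
        for i in range(n):
            r = sum(w_t[j] * (L_t[j] - L_t[i]) for w_t, L_t in zip(weights, losses))
            total += max(0.0, r)      # positive part [.]^+ ; the j == i terms contribute 0
    return total
```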

  18. Existence of a No-Regret Scheme
      • Proof for the case where $|D| = 2$
      • We have defined
        $$R_T(S) = \sum_{i \in D} \sum_{j \in D} \left[ R_T^{i \to j}(S) \right]^+$$
      • But $\left[ R_T^{0 \to 0}(S) \right]^+ = \left[ R_T^{1 \to 1}(S) \right]^+ = 0$
      • Goal: show that the time averages of $\left[ R_T^{1 \to 0}(S) \right]^+$ and $\left[ R_T^{0 \to 1}(S) \right]^+$ go to zero.

  19. • Define the following game:
        – The agent chooses between strategy "0" and strategy "1" in each period
        – Payoffs are vectors: the payoff for using strategy "0" in period $t$ is $\left( L_t^0 - L_t^1,\, 0 \right)$, and the payoff for using strategy "1" is $\left( 0,\, L_t^1 - L_t^0 \right)$
      • Suppose that the agent follows a scheme that chooses strategy "0" with probability $w_t$. Then the time-averaged payoffs at round $T$ are
        $$\left( \frac{\sum_{t=1}^{T} w_t \left( L_t^0 - L_t^1 \right)}{T},\; \frac{\sum_{t=1}^{T} (1 - w_t)\left( L_t^1 - L_t^0 \right)}{T} \right)$$
      • Note that we have defined the payoffs such that the time-averaged payoffs equal $\left( R_T^{0 \to 1}(S)/T,\; R_T^{1 \to 0}(S)/T \right)$ as defined above.

  20. • Blackwell's Approachability Theorem: a convex set $G$ is approachable iff every tangent hyperplane of $G$ is approachable.
      • Our target set is the nonpositive orthant; that is, we want
        $$R_T^{1 \to 0}(S)/T \le 0 \quad \text{and} \quad R_T^{0 \to 1}(S)/T \le 0$$
      • If the payoff vector is not in the nonpositive orthant, then we consider the line separating the payoff vector from the target set. The line $\ell$ is given by
        $$\left[ R_T^{0 \to 1}(S) \right]^+ x + \left[ R_T^{1 \to 0}(S) \right]^+ y = 0$$

  21. • The agent must choose "0" with probability $p$ such that the expected payoff vector
        $$\left( p \left( L_{T+1}^0 - L_{T+1}^1 \right),\; (1 - p)\left( L_{T+1}^1 - L_{T+1}^0 \right) \right)$$
        lies on the line $\ell$.
      • This requires:
        $$\left[ R_T^{0 \to 1}(S) \right]^+ p \left( L_{T+1}^0 - L_{T+1}^1 \right) + \left[ R_T^{1 \to 0}(S) \right]^+ (1 - p)\left( L_{T+1}^1 - L_{T+1}^0 \right) = 0$$
      • Which yields:
        $$p = \frac{\left[ R_T^{1 \to 0}(S) \right]^+}{\left[ R_T^{1 \to 0}(S) \right]^+ + \left[ R_T^{0 \to 1}(S) \right]^+}$$
      • Not what is in the paper!
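A sketch of the resulting two-action scheme: in each period, play "0" with probability proportional to $[R_T^{1 \to 0}]^+$, as derived above, falling back to a fair coin when both positive parts are zero. This is an illustrative reading of the construction, not the paper's exact procedure, and all names are assumptions.

```python
import random

def no_internal_regret_play(loss_stream, rng=None):
    """Two-action scheme driven by the positive parts of the pairwise regrets.

    loss_stream: iterable of (L_t^0, L_t^1) loss pairs, revealed after each decision.
    Yields the action (0 or 1) chosen in each period.
    """
    rng = rng or random.Random(0)
    r01 = 0.0   # R_T^{0->1}: regret of not having switched from 0 to 1
    r10 = 0.0   # R_T^{1->0}: regret of not having switched from 1 to 0
    for L0, L1 in loss_stream:
        total = max(0.0, r10) + max(0.0, r01)
        p0 = max(0.0, r10) / total if total > 0 else 0.5   # P(play "0")
        action = 0 if rng.random() < p0 else 1
        yield action
        # Update regrets with the probabilities actually used this period.
        r01 += p0 * (L0 - L1)
        r10 += (1.0 - p0) * (L1 - L0)

# Example on a hypothetical loss stream where action 1 is usually better:
losses = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.7, 0.3), (0.6, 0.4)]
print(list(no_internal_regret_play(losses)))   # the scheme concentrates on the lower-loss action
```

By Blackwell's theorem, this choice of $p$ keeps the time-averaged regret vector approaching the nonpositive orthant, which is exactly the no-internal-regret property.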

  22. • We have solved for the $p$ that will, in expectation, put the payoff vector on the line separating it from the target set. By Blackwell's theorem the target set is therefore approachable, so we have found a no-regret scheme.
      • This result can be generalized to $|D| > 2$, but that requires solving a system of equations.

  23. Further Results
      • The existence of a no-regret scheme implies the existence of an almost calibrated forecasting scheme
      • If all agents in a game play a no-regret strategy, play converges to correlated equilibrium

  24. Further Reading
      • A Simple Adaptive Procedure Leading to Correlated Equilibrium – Hart and Mas-Colell, 2000
      • A General Class of Adaptive Strategies – Hart and Mas-Colell, 2001
      • A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria – Greenwald, Jafari, and Marks
