LEARNING IN GAMES WITH NOISY PAYOFF OBSERVATIONS

  1. LEARNING IN GAMES WITH NOISY PAYOFF OBSERVATIONS
Mario Bravo (Universidad de Santiago de Chile) and Panayotis Mertikopoulos (CNRS – Laboratoire d’Informatique de Grenoble)
ADGO 2016 – Santiago, January 28, 2016

  2. Outline
▸ Background and motivation
▸ Preliminaries
▸ The core scheme
▸ Learning with noisy feedback

  3. Learning in Games
When does the agents’ learning process lead to a “reasonable” outcome? The basic context:
▸ Decision-making: agents choose actions, each seeking to optimize some objective. Example: a trader chooses asset proportions in an investment portfolio.
▸ Payoffs: rewards are determined by the decisions of all interacting agents. Example: asset placements determine returns.
▸ Learning: the agents adjust their decisions and the process continues. Example: change asset proportions based on performance.

  4. Motivation
▸ In many applications, decisions are taken at very fast time-scales. Example: in high-frequency trading (HFT), decision times are on the order of microseconds (µs).
▸ Regulations and physical constraints limit changes in decisions. Example: the SEC requires small differences in HFT orders to reduce volatility.
▸ Fast time-scales have adverse effects on the quality of feedback. Example: volatility estimates are highly inaccurate at the µs time-scale.

  5. The Flash Crash of 2010
A trillion-dollar NYSE crash (and partial rebound) that lasted 35 minutes (14:32–15:07). Aggressive selling due to imperfect volatility estimates induced a huge drop in liquidity and precipitated the crash (Vuorenmaa and Wang, 2014).

  6. What this talk is about
Examine the robustness of a class of continuous-time learning schemes with noisy feedback.

  7. Outline
▸ Background and motivation
▸ Preliminaries
▸ The core scheme
▸ Learning with noisy feedback

  8. Game setup
Throughout this talk, we focus on finite games:
▸ Finite set of players: N = {1, …, N}
▸ Finite set of actions per player: A_k = {α_{k,1}, α_{k,2}, …}
▸ Reward of player k determined by the corresponding payoff function u_k : ∏_k A_k → ℝ, (α_1, …, α_N) ↦ u_k(α_1, …, α_N)
▸ Mixed strategies x_k ∈ X_k ≡ Δ(A_k) yield expected payoffs u_k(x_1, …, x_N) = ∑_{α_1} ⋯ ∑_{α_N} x_{1,α_1} ⋯ x_{N,α_N} u_k(α_1, …, α_N)
▸ Strategy profiles: x = (x_1, …, x_N) ∈ X ≡ ∏_k X_k
▸ Payoff vector of player k: v_k(x) = (v_{kα}(x))_{α∈A_k}, where v_{kα}(x) = u_k(α; x_{−k}) is the payoff to the α-th action of player k in the mixed strategy profile x ∈ X

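To make the setup concrete, here is a minimal numerical sketch (not from the talk; the 2×2 payoff tables and all names are hypothetical) of how mixed strategies yield expected payoffs and payoff vectors in a finite two-player game:

```python
# A minimal sketch (not from the talk): expected payoffs and payoff
# vectors in a finite two-player game. The 2x2 payoff tables are hypothetical.
import numpy as np

# u1[a1, a2] is player 1's payoff when player 1 plays action a1 and
# player 2 plays action a2; u2 is player 2's table (here the transpose,
# i.e. a symmetric game).
u1 = np.array([[3.0, 0.0],
               [5.0, 1.0]])
u2 = u1.T

def expected_payoff(u, x1, x2):
    # u_k(x1, x2) = sum_{a1} sum_{a2} x1[a1] * x2[a2] * u[a1, a2]
    return x1 @ u @ x2

def payoff_vector_1(x2):
    # v_{1a}(x) = u_1(a; x_{-1}): payoff of each pure action a of player 1
    # against player 2's mixed strategy x2
    return u1 @ x2

# Mixed strategies: points of the simplices Delta(A_1) and Delta(A_2)
x1 = np.array([0.5, 0.5])
x2 = np.array([0.25, 0.75])

print(expected_payoff(u1, x1, x2))  # 1.375: player 1's expected payoff
print(payoff_vector_1(x2))          # [0.75 2.0]: payoff vector v_1(x)
```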

  9. Regret
Suppose players follow a trajectory of play x(t) (based on some learning/adjustment rule, to be discussed later). How does x_k(t) compare on average to the “best possible” action α_k ∈ A_k? This is measured by the average regret
max_{α∈A_k} (1/t) ∫_0^t [u_k(α; x_{−k}(s)) − u_k(x(s))] ds
Definition: x(t) leads to no regret if lim sup_{t→∞} (1/t) ∫_0^t [u_k(α; x_{−k}(s)) − u_k(x(s))] ds ≤ 0 for all α ∈ A_k and all k, i.e. if every player’s average regret is non-positive in the long run.
NB: unilateral definition, no need for a game.

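As an illustration of the regret definition (again not from the talk; the trajectory and step size are hypothetical, and the integral is approximated by a Riemann sum on a uniform time grid), player 1’s average regret along a sampled trajectory can be computed as follows:

```python
# A minimal sketch (assumptions: the hypothetical 2x2 game from the
# previous snippet; x(t) is sampled on a uniform grid of step dt, so the
# integral in the regret definition becomes a Riemann sum).
import numpy as np

def average_regret_1(u1, traj_x1, traj_x2, dt):
    # max_a (1/t) * integral_0^t [u_1(a; x_{-1}(s)) - u_1(x(s))] ds on a grid
    t = dt * len(traj_x1)
    pure = np.array([u1 @ x2 for x2 in traj_x2])              # u_1(a; x_{-1}(s))
    mixed = np.array([x1 @ u1 @ x2
                      for x1, x2 in zip(traj_x1, traj_x2)])   # u_1(x(s))
    gaps = dt * (pure - mixed[:, None]).sum(axis=0)           # per-action integral
    return gaps.max() / t

u1 = np.array([[3.0, 0.0],
               [5.0, 1.0]])

# Hypothetical trajectory: both players mix uniformly at every instant.
T, dt = 1000, 0.01
traj = [np.array([0.5, 0.5])] * T
print(average_regret_1(u1, traj, traj, dt))  # 0.75 > 0: constant uniform
# play is NOT a no-regret trajectory in this game
```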
