LEARNING IN GAMES WITH NOISY PAYOFF OBSERVATIONS

Mario Bravo (Universidad de Santiago de Chile)
Panayotis Mertikopoulos (CNRS – Laboratoire d’Informatique de Grenoble)

ADGO 2016 – Santiago, January 28, 2016
Outline

▸ Background and motivation
▸ Preliminaries
▸ The core scheme
▸ Learning with noisy feedback
Learning in Games

When does the agents’ learning process lead to a “reasonable” outcome?

The basic context:
▸ Decision-making: agents choose actions, each seeking to optimize some objective.
  Example: a trader chooses asset proportions in an investment portfolio.
▸ Payoffs: rewards are determined by the decisions of all interacting agents.
  Example: asset placements determine returns.
▸ Learning: the agents adjust their decisions and the process continues.
  Example: change asset proportions based on performance.
Motivation

▸ In many applications, decisions are taken at very fast time-scales.
  Example: in high-frequency trading (HFT), decision times ≈ µs.
▸ Regulations and physical constraints limit changes in decisions.
  Example: the SEC requires small differences in HFT orders to reduce volatility.
▸ Fast time-scales have adverse effects on the quality of feedback.
  Example: volatility estimates are highly inaccurate at the µs time-scale.
The Flash Crash of 2010

A trillion-dollar NYSE crash (and partial rebound) that lasted 35 minutes (14:32–15:07).

Aggressive selling due to imperfect volatility estimates induced a huge drop in liquidity and precipitated the crash (Vuorenmaa and Wang, 2014).
What this talk is about:

Examine the robustness of a class of continuous-time learning schemes with noisy feedback.
Game setup

Throughout this talk, we focus on finite games:

▸ Finite set of players: N = {1, …, N}
▸ Finite set of actions per player: A_k = {α_{k,1}, α_{k,2}, …}
▸ Reward of player k determined by the corresponding payoff function u_k : ∏_k A_k → R,
    (α_1, …, α_N) ↦ u_k(α_1, …, α_N)
▸ Mixed strategies x_k ∈ X_k ≡ ∆(A_k) yield expected payoffs
    u_k(x_1, …, x_N) = ∑_{α_1} ⋯ ∑_{α_N} x_{1,α_1} ⋯ x_{N,α_N} u_k(α_1, …, α_N)
▸ Strategy profiles: x = (x_1, …, x_N) ∈ X ≡ ∏_k X_k
▸ Payoff vector of player k: v_k(x) = (v_{kα}(x))_{α ∈ A_k},
    where v_{kα}(x) = u_k(α; x_{−k}) is the payoff to the α-th action of player k in the mixed strategy profile x ∈ X
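To make these definitions concrete, here is a minimal numerical sketch (illustrative, not from the talk) of the expected payoff and payoff vector in a two-player game; the payoff matrices and strategies below are hypothetical.

import numpy as np

# Hypothetical two-player game: U1[a1, a2] = u_1(a1, a2), U2[a1, a2] = u_2(a1, a2).
U1 = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies, player 1
U2 = -U1                                    # zero-sum: u_2 = -u_1

x1 = np.array([0.7, 0.3])                   # mixed strategy x_1 ∈ ∆(A_1)
x2 = np.array([0.5, 0.5])                   # mixed strategy x_2 ∈ ∆(A_2)

# Expected payoff: u_1(x) = ∑_{α_1} ∑_{α_2} x_{1,α_1} x_{2,α_2} u_1(α_1, α_2)
u1 = x1 @ U1 @ x2

# Payoff vector v_1(x): entry α is u_1(α; x_{−1}), the payoff to pure action α against x2.
v1 = U1 @ x2

assert np.isclose(u1, x1 @ v1)              # u_1(x) = ⟨x_1, v_1(x)⟩

The identity u_k(x) = ⟨x_k, v_k(x)⟩ checked by the assertion is also what the regret computation on the next slide exploits.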
Regret

Suppose players follow a trajectory of play x(t) (based on some learning/adjustment rule, to be discussed later). How does x_k(t) compare on average to the “best possible” action α_k ∈ A_k?

    max_{α ∈ A_k} (1/t) ∫_0^t [u_k(α; x_{−k}(s)) − u_k(x(s))] ds

Definition: x(t) leads to no regret if

    lim sup_{t→∞} max_{α ∈ A_k} (1/t) ∫_0^t [u_k(α; x_{−k}(s)) − u_k(x(s))] ds ≤ 0   for all k ∈ N,

i.e. if every player’s average regret is non-positive in the long run.

NB: unilateral definition, no need for a game.
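The time-averaged regret above can be approximated numerically along a discretized trajectory. The following is a minimal sketch for player 1 in a two-player game, assuming uniformly sampled play; all names, the payoff matrix, and the step size dt are hypothetical.

import numpy as np

U1 = np.array([[1.0, -1.0], [-1.0, 1.0]])     # player 1's payoff matrix (hypothetical)

def average_regret(x1_traj, x2_traj, dt):
    # Riemann-sum approximation of
    #   max_{α} (1/t) ∫_0^t [u_1(α; x_{−1}(s)) − u_1(x(s))] ds
    # x1_traj, x2_traj: arrays of shape (num_steps, num_actions).
    v1 = x2_traj @ U1.T                       # v1[s, α] = u_1(α; x_2(s))
    u1 = np.sum(v1 * x1_traj, axis=1)         # u_1(x(s)) = ⟨x_1(s), v_1(x(s))⟩
    t = len(x1_traj) * dt
    return np.max(dt * np.sum(v1 - u1[:, None], axis=0)) / t

# Example: constant uniform play in matching pennies gives zero average regret
# (every pure action earns 0 against a uniform opponent).
x1_traj = np.tile([0.5, 0.5], (1000, 1))
x2_traj = np.tile([0.5, 0.5], (1000, 1))
print(average_regret(x1_traj, x2_traj, dt=0.01))   # ≈ 0.0

Along a no-regret trajectory, this quantity stays non-positive (up to discretization error) as the horizon grows.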