

  1. Searching for Arms Daniel Fershtman Alessandro Pavan October 1, 2019

  2. Motivation
- Experimentation/sequential learning central to many problems
- In many cases, endogenous set of alternatives/arms: search
- Tradeoff: exploring existing alternatives vs searching for new ones

  3. Motivation: Examples
- Consumer sequentially explores different alternatives within a "consideration set", while expanding the consideration set through search
- Firm interviews candidates, while searching for additional suitable candidates to interview
- Researcher splits time across several ongoing projects of unknown return, while also searching for new projects
- Difference: experimentation is directed; search is undirected

  4. This Paper
- Multi-armed bandit problem with endogenous set of arms
- Optimal policy: index policy (with special index for search)
- Extension to problems with irreversible choice (based on partial information)
- Weitzman: special case where the set of boxes is exogenous and uncertainty is resolved after first inspection

  5. Search Index
Definition:

G_S(ω^S) = sup_{τ,π} E[ Σ_{s=0}^{τ−1} δ^s (r^π_s − c^π_s) | ω^S ] / E[ Σ_{s=0}^{τ−1} δ^s | ω^S ]

Recursive representation:

G_S(ω^S) = E^{χ*}[ Σ_{s=0}^{τ*−1} δ^s (r_s − c_s) | ω^S ] / E^{χ*}[ Σ_{s=0}^{τ*−1} δ^s | ω^S ]

- χ*: policy selecting the physical arm with the highest Gittins index (among those brought by new search) if such index is higher than the search index, and search otherwise
- τ*: first time the search index and the indexes of all physical arms brought by new search fall below the value of the search index at the time search was launched

  6. Difficulties
- Opportunity cost of search depends on entire composition of current choice set
  e.g., profitability of searching for additional candidates depends on observable covariates of current candidates (gender, education, etc.) and past interviews
- Non-stationarity in search technology: search outcome may depend on the type and number of arms previously found, and on past search costs
- Search competes with its own "descendants" (i.e., with arms discovered through past searches): correlation
- Treating search as a "meta arm" requires decisions within the meta arm to be invariant to info outside the meta arm; bandit problems with meta arms (e.g., arms that can be activated with different intensities, "super-processes") rarely admit an index solution

  7. Literature
- Bandits: Gittins and Jones (1974), Rothschild (1974), Rustichini and Wolinsky (1995), Keller and Rady (1999)... Surveys: Bergemann and Valimaki (2008), Horner and Skrzypacz (2017)
- Bandits with time-varying set of alternatives: Whittle (1981), Varaiya et al. (1985), Weiss (1988), Weber (1994)...
- Sequential search for best alternative (Pandora's problem): Weitzman (1979), Olszewski and Weber (2015), Choi and Smith (2016), Doval (2018)...
- Experimentation before irreversible choice: Ke, Shen and Villas-Boas (2016), Ke and Villas-Boas (2018)...
⇒ KEY DIFFERENCE: endogeneity of the set of arms

  8. Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions: irreversible choice, search frictions, multiple search arms, no discounting

  9. Model

  10. Model: Environment
- Discrete time: t = 0, ..., ∞
- Available "physical" arms in period t: I_t = {1, ..., n_t} (I_0 exogenous)
- At each t, DM can: pull an arm in I_t; search for new arms; or opt out (arm i = 0, fixed reward equal to outside option)
- Pulling arm i ∈ I_t: reward r_i ∈ R, transition to new "state"
- Search: costly; brings a stochastic set of new arms I_{t+1} \ I_t

  11. Model: "Physical" Arms
- "State" of a physical arm: ω^P = (ξ, θ) ∈ Ω^P
  ξ ∈ Ξ: persistent "type"
  θ ∈ Θ: evolving state
- Example: ξ: type of research project/idea (theory, empirical, experimental); θ = (σ_m): history of signals about project's impact; r: utility from working on project
- H_{ω^P}: distribution over Ω^P, given ω^P
- Reward: r(ω^P)
- Usual assumptions: arm's state "frozen" when not pulled; time-autonomous processes; evolution of arms' states independent across arms, conditional on arms' types

  12. Model: Search Technology
- State of search technology: ω^S = ((c_0, E_0), (c_1, E_1), ..., (c_m, E_m)) ∈ Ω^S
  m: number of past searches
  c_k: cost of k-th search
  E_k = (n_k(ξ) : ξ ∈ Ξ): result of k-th search
  n_k(ξ) ∈ N: number of arms of type ξ found
- H_{ω^S}: joint distribution over (c, E), given ω^S
- Key assumptions: independence of calendar time; independence of arms' idiosyncratic shocks, θ; correlation through ξ
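The search-technology state above can be made concrete with a toy sampler. This is a minimal sketch, assuming a stationary distribution H; the type space, cost range, and function names are all hypothetical illustrations, not part of the paper:

```python
import random

# Toy stationary search technology: each search draws a cost c_k and,
# for each persistent type xi, a number n_k(xi) of newly found arms.

TYPES = ["theory", "empirical"]  # hypothetical type space Xi

def draw_search_outcome(rng, cost_range=(0.1, 0.5), max_new=2):
    # One search: a cost and a count of new arms per type.
    cost = rng.uniform(*cost_range)
    new_arms = {xi: rng.randint(0, max_new) for xi in TYPES}
    return cost, new_arms

rng = random.Random(0)
print(draw_search_outcome(rng))
```

With a non-stationary technology, `cost_range` and `max_new` would instead depend on the history ((c_0, E_0), ..., (c_m, E_m)).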

  13. Model: Search Technology
Stochasticity in search technology:
- learning about alternatives not yet in the consideration set
- evolution of DM's ability to find new alternatives (e.g., limited set of outside alternatives, fatigue/experience)

  14. Model: States and Policies
- Period-t state: S_t ≡ (ω^S_t, S^P_t)
  ω^S_t: state of search technology
  S^P_t ≡ (S^P_t(ω^P) : ω^P ∈ Ω^P): state of physical arms
  S^P_t(ω^P): number of physical arms in state ω^P ∈ Ω^P
- Definition eliminates dependence on calendar time, while keeping track of relevant information
- Policy χ describes feasible decisions at all histories
- Policy χ optimal if it maximizes the expected discounted sum of net payoffs:

E^χ[ Σ_{t=0}^∞ δ^t ( Σ_{j=1}^∞ x_{jt} r_{jt} − c_t y_t ) | S_0 ]
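The objective above is a discounted sum of per-period net payoffs (arm rewards minus search costs). A minimal sketch of evaluating it along one realized path; the representation of a path as (reward, search_cost) pairs is a hypothetical simplification:

```python
# Discounted net payoff along a realized path.
# path: list of (reward, search_cost) pairs, one per period;
# reward is 0 in a period the DM searches, search_cost is 0 otherwise.

def discounted_payoff(path, delta):
    return sum(delta**t * (r - c) for t, (r, c) in enumerate(path))

# Pull an arm yielding 1.0, then search at cost 0.2:
print(discounted_payoff([(1.0, 0.0), (0.0, 0.2)], delta=0.9))
```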

  15. Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions: irreversible choice, search frictions, multiple search arms, no discounting

  16. Optimal Policy

  17. Indexes for Physical Arms
Index for "physical" arms:

G_P(ω^P) ≡ sup_{τ>0} E[ Σ_{s=0}^{τ−1} δ^s r_s | ω^P ] / E[ Σ_{s=0}^{τ−1} δ^s | ω^P ]

τ: stopping time
Interpretations:
- maximal expected discounted reward, per unit of expected discounted time (Gittins)
- annuity that makes DM indifferent between stopping right away and continuing with the option to retire in the future (Whittle)
- fair charge (Weber)
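For an arm with a known deterministic reward sequence, the supremum over stopping times reduces to a maximum over truncation horizons, so the index can be computed directly. A minimal sketch (the reward sequence is a hypothetical example):

```python
# Gittins index of a deterministic arm:
#   sup over horizons tau >= 1 of
#   sum_{s < tau} delta^s * r_s  /  sum_{s < tau} delta^s
# For deterministic rewards the optimal stopping time is deterministic,
# so scanning horizons is exhaustive.

def gittins_index_deterministic(rewards, delta):
    best = float("-inf")
    num = 0.0  # discounted rewards up to current horizon
    den = 0.0  # discounted time up to current horizon
    for s, r in enumerate(rewards):
        num += delta**s * r
        den += delta**s
        best = max(best, num / den)
    return best

# Front-loaded rewards: the index is attained by stopping after one pull.
print(gittins_index_deterministic([1.0, 0.5, 0.0], delta=0.9))
```

For stochastic arms the same ratio is computed over genuine stopping times, typically via dynamic programming on the arm's state space.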

  18. Index for Search
Index for search:

G_S(ω^S) ≡ sup_{τ,π} E[ Σ_{s=0}^{τ−1} δ^s (r^π_s − c^π_s) | ω^S ] / E[ Σ_{s=0}^{τ−1} δ^s | ω^S ]

τ: stopping time
π: choice among arms discovered AFTER t and FUTURE searches
r^π_s, c^π_s: stochastic rewards/costs under rule π
Interpretation: fair (flow) price for visiting "casinos" found stochastically over time, playing in them, and continuing to search for other casinos
Definition:
- accommodates correlation among arms found over time
- compatible with the possibility that search lasts indefinitely and brings an unbounded set of alternatives

  19. Index Policy
Definition: the index policy selects at each t:
- "search" iff G_S(ω^S_t) ≥ G*(S^P_t), where G*(S^P_t) is the maximal index among available physical arms
- otherwise, any "physical" arm with index G*(S^P_t)
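The one-period decision rule above can be sketched directly. The index values fed in are hypothetical inputs; computing them is the hard part the rest of the talk addresses:

```python
# One period of the index policy: search iff the search index weakly
# exceeds the maximal Gittins index among available physical arms;
# otherwise pull an arm attaining the maximal index.

def index_policy_step(search_index, physical_indexes):
    # physical_indexes: dict arm_id -> current Gittins index
    if not physical_indexes or search_index >= max(physical_indexes.values()):
        return "search"
    return max(physical_indexes, key=physical_indexes.get)

print(index_policy_step(0.7, {"a": 0.60, "b": 0.65}))  # search
print(index_policy_step(0.5, {"a": 0.60, "b": 0.65}))  # pull arm "b"
```

Note that only the maximal physical index matters for the search decision, which is the point of slide 21.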

  20. Optimality of Index Policy
Theorem 1: The index policy is optimal in the bandit problem with search for new arms.

  21. Implications of Index Policy
- Each period, DM must assign a task to a worker; each worker can be ξ = Male or ξ = Female (different processes over signals/rewards); probability search brings a Male: .8
- Fixing the value of the highest index, optimality of searching for new candidates is the same no matter whether you have 49 M and 1 F, or 25 M and 25 F
- Given the highest physical index G*(S^P_t), the composition of the set of physical arms is irrelevant for the decision to search
- However, the opportunity cost of search (value of continuing with current agents) depends on the number of M and F (and past outcomes)
- Maximal index among current arms is NOT a sufficient statistic for the state of current arms when it comes to the continuation payoff with current arms

  22. Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions: irreversible choice, search frictions, multiple search arms, no discounting

  23. Dynamics

  24. Dynamics under Index Policy
- Stationary search technology (H_{ω^S} = H^S for all ω^S): if DM searches at t, all physical arms present at t are never pulled again (search = replacement)
- Result extends to "improving" search technologies: physical arms required to pass more stringent tests over time
- Deteriorating search technology (e.g., finite set of arms): DM may return to arms present before the last search

  25. Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions: irreversible choice, search frictions, multiple search arms, no discounting

  26. Proof of Main Theorem

  27. Proof of Theorem 1: Road Map
1 Characterization of payoff under index policy
  representation uses a "timing process" based on optimal stopping in indexes:
  - physical arms: stop when index drops below its initial value (Mandelbaum, 1986)
  - search: stop when the search index and all indexes of newly arrived arms are smaller than the value of the search index when search began
2 Dynamic programming
  payoff function under index policy solves the dynamic programming equation

  28. Proof: Step 1
κ(v|S) ∈ N ∪ {∞}: minimal time until all indexes (search / existing arms / newly found arms) are weakly below v ∈ R_+

Lemma 1:

V(S_0) = ∫_0^∞ E[ 1 − δ^{κ(v|S_0)} ] dv

where V(S_0) is the expected discounted payoff under the index policy starting from state S_0, and κ(v|S_0) is the time till all indexes drop weakly below v
