Generative Adversarial User Model for Reinforcement Learning Based Recommendation System
Xinshi Chen¹, Shuang Li¹, Hui Li², Shaohua Jiang², Yuan Qi², Le Song¹,²
¹ Georgia Tech, ² Ant Financial
ICML 2019
RL for Recommendation System

[Figure: at each step the system displays items, the user makes a choice, and her state evolves from t to t+1 to t+2.]

• A user's interest evolves over time based on what she observes.
• The recommender's actions can significantly influence this evolution.
• An RL-based recommender can take the user's long-term interest into account.
Challenges

[Figure: the same interaction loop, but the reward at each step is unknown (reward = ?).]

Training an RL policy requires:
1. Lots of interactions with users (the user is the environment).
   e.g. (1) AlphaGo Zero generated 4.9 million games of self-play for training.
        (2) RL for Atari games takes more than 50 hours on a GPU to train.
2. The reward function (a user's interest) is unknown.
Our solution

• We propose a Generative Adversarial User Model
  - to model the user's actions,
  - to recover the user's reward.
• Use the GAN user model as a simulated environment to pre-train the RL policy offline (see the sketch below).

[Figure: RL policy ↔ simulated interactions ↔ GAN user model (simulated environment).]
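A minimal sketch of this offline pre-training loop. It assumes a learned user model with hypothetical methods (initial_state, sample_click, reward, next_state) and a policy with select_items / update; none of these names come from the paper's code.

class UserModelEnv:
    """Wraps a learned generative user model as a simulated environment."""
    def __init__(self, user_model, catalog, k):
        self.user_model = user_model   # learned behavior + recovered reward
        self.catalog = catalog         # all K available items
        self.k = k                     # number of items displayed per step
        self.state = None

    def reset(self):
        self.state = self.user_model.initial_state()
        return self.state

    def step(self, displayed_items):
        # The user model predicts which displayed item the simulated user clicks,
        # and the recovered reward of that click.
        click = self.user_model.sample_click(self.state, displayed_items)
        reward = self.user_model.reward(self.state, click)
        self.state = self.user_model.next_state(self.state, click)
        return self.state, reward

def pretrain(policy, env, episodes=1000, horizon=20):
    """Pre-train the recommendation policy against the simulated environment."""
    for _ in range(episodes):
        state = env.reset()
        for _ in range(horizon):
            items = policy.select_items(state, env.catalog, env.k)   # k-item action
            next_state, reward = env.step(items)
            policy.update(state, items, reward, next_state)          # e.g. a Q-learning update
            state = next_state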
Generative Adversarial User Model

Two components:

• User's reward r(s_t, a_t)
  - a_t is the clicked item.
  - s_t is the user's experience (state).

• User's behavior φ(s_t, A_t)
  - A_t contains the items displayed by the system.
  - The user chooses a_t ∼ φ to maximize her expected reward:

        φ*(s_t, A_t) = argmax_φ  E_φ[ r(s_t, a_t) ] − R(φ)/η

    where R(φ) is a regularizer and η a temperature.
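For intuition, a minimal sketch of the resulting choice model, assuming the regularizer R is negative Shannon entropy; in that case the optimal φ* is a softmax over the rewards of the displayed items. reward_fn is a placeholder for the learned reward function, not the paper's code.

import numpy as np

def user_choice_probs(state, displayed_items, reward_fn, eta=1.0):
    """φ*(s_t, A_t): probability that the user clicks each displayed item."""
    rewards = np.array([reward_fn(state, item) for item in displayed_items])
    logits = eta * rewards            # higher reward -> higher click probability
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_click(state, displayed_items, reward_fn, eta=1.0, rng=np.random):
    """Draw the simulated user's click a_t ~ φ*(s_t, A_t)."""
    probs = user_choice_probs(state, displayed_items, reward_fn, eta)
    return displayed_items[rng.choice(len(displayed_items), p=probs)]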
Generative Adversarial Training

In analogy to GAN:
• φ (behavior) acts as a generator.
• r (reward) acts as a discriminator.

Jointly learned via a mini-max formulation:

    min_r max_φ  ( E_φ[ Σ_{t=1}^T r(s_t, a_t) ] − R(φ)/η ) − Σ_{t=1}^T r(s_t^true, a_t^true)
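A minimal sketch of one alternating update of this mini-max objective (per-sample version), assuming PyTorch and two placeholder networks: reward_net(states, displayed) returning a [B, K] reward matrix (the discriminator r) and behavior_net(states, displayed) returning [B, K] click logits (the generator φ). Shapes and names are illustrative, not the paper's implementation.

import torch
import torch.nn.functional as F

def adversarial_step(reward_net, behavior_net, opt_r, opt_phi, batch, eta=1.0):
    # batch: user states [B, d_s], displayed item features [B, K, d_a],
    # and the index of the item the real user actually clicked [B]
    states, displayed, clicked = batch

    def objective():
        r_all = reward_net(states, displayed)              # [B, K] rewards of displayed items
        logits = behavior_net(states, displayed)           # [B, K] generator's click logits
        phi = F.softmax(logits, dim=-1)
        entropy = -(phi * torch.log(phi + 1e-8)).sum(-1)   # with R = neg. entropy, -R(φ)/η = +H(φ)/η
        expected_r = (phi * r_all).sum(-1)
        true_r = r_all.gather(1, clicked.unsqueeze(1)).squeeze(1)
        return (expected_r + entropy / eta - true_r).mean()

    # max over φ: the generator ascends the objective
    opt_phi.zero_grad()
    (-objective()).backward()
    opt_phi.step()

    # min over r: the discriminator descends the objective
    opt_r.zero_grad()
    objective().backward()
    opt_r.step()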
Model Parameterization

Two architectures for aggregating historical information (i.e. the state s_t):

(1) LSTM: run an LSTM over the embeddings f*_{t−m}, …, f*_{t−1} of the user's recently clicked items and use the hidden state h_{t−1} as s_t.

(2) Position Weight: multiply the stacked embeddings of the last m clicked items by a learned position weight matrix

        W = [ w_11 … w_1n ]
            [  ⋮       ⋮  ]
            [ w_m1 … w_mn ]

    and concatenate the resulting columns to form s_t (see the sketch below).
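A minimal sketch of the position-weight aggregation, assuming the last m clicked items are given as d-dimensional embeddings; the function name and toy shapes are illustrative only.

import numpy as np

def position_weight_state(clicked_embeddings, W):
    """clicked_embeddings: [m, d] embeddings of the m most recent clicks.
    W: [m, n] learned position-weight matrix.
    Returns the state s_t as a flat vector of length n*d."""
    mixed = W.T @ clicked_embeddings   # [n, d]: n position-weighted mixtures of the history
    return mixed.reshape(-1)           # concatenate the mixtures into the state vector

# toy usage
m, d, n = 5, 8, 3
rng = np.random.default_rng(0)
s_t = position_weight_state(rng.normal(size=(m, d)), rng.normal(size=(m, n)))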
Set Recommendation RL Policy

At each step the policy displays k items out of all K available items:

    (a_1*, a_2*, …, a_k*) = argmax_{a_1, …, a_k} Q(s_t, a_1, a_2, …, a_k)

Combinatorial action space of size (K choose k) → intractable computation!
Set Recommendation RL Policy

We design a cascading Q network to compute the optimal action with linear complexity:

    (a_1*, …, a_k*) = argmax_{a_1, …, a_k} Q(s_t, a_1, a_2, …, a_k)

decomposes into

    a_1* = argmax_{a_1} Q^{1*}(s_t, a_1),            where Q^{1*}(s_t, a_1) := max_{a_{2:k}} Q(s_t, a_1, a_{2:k})
    a_2* = argmax_{a_2} Q^{2*}(s_t, a_1*, a_2),      where Q^{2*}(s_t, a_1*, a_2) := max_{a_{3:k}} Q(s_t, a_1*, a_2, a_{3:k})
    …
    a_k* = argmax_{a_k} Q^{k*}(s_t, a_1*, …, a_{k−1}*, a_k)
Set Recommendation RL Policy: Cascading DQN

[Figure: the state s_t feeds k stacked Q networks; Q^1(s, a_1; θ^1) → argmax → a_1*, Q^2(s, a_1*, a_2; θ^2) → argmax → a_2*, …, Q^k(s, a*_{1:k−1}, a_k; θ^k) → argmax → a_k*.]
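A minimal sketch of the cascading selection step, assuming q_nets[j] is the (j+1)-th learned Q network; the names are illustrative. In the paper the k networks are also trained so that their maxima stay consistent with the full Q; the sketch only shows how the k-item action is chosen.

def cascading_select(state, candidates, q_nets, k):
    """Select k items greedily: q_nets[j](state, chosen, a) scores item a for the
    (j+1)-th slot given the items already chosen. Cost is O(k*K) network
    evaluations instead of enumerating all (K choose k) item sets."""
    chosen = []
    remaining = list(candidates)
    for j in range(k):
        best = max(remaining, key=lambda a: q_nets[j](state, tuple(chosen), a))
        chosen.append(best)
        remaining.remove(best)
    return chosen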
Experiments

• Predictive performance of the user model.
• Recommendation policy based on the user model.
Experiments

The cascading-DQN policy pre-trained with the GAN user model can quickly achieve a high CTR even when applied to a new set of users.
Thanks!
Poster: Pacific Ballroom #252, Tue, 06:30 PM
Contact: xinshi.chen@gatech.edu