A Preference-Based Bandit Framework for Personalized Recommendation
Maryam Tavakol and Ulf Brefeld
Paderborn, Nov 8, 2016
Introduction
• Personalized Recommendation
• Preference Learning
• Multi-armed bandits
Recommendation
[illustrative figures]
Preference Model
• Item i: {Shirt, Blue, Women, Cheap}
• Item k: {Polo shirt, White, Women, Expensive}
• Item i ≻ Item k: {Shirt-Polo shirt, Blue-White, Women-Women, Cheap-Expensive}
• Pairwise feature: $z_{i \succ k} := z_i - z_k$
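A minimal sketch (not part of the slides) of how the pairwise feature $z_{i \succ k}$ could be built: items are encoded as binary attribute vectors, and subtracting the less preferred item's vector from the preferred one's gives the pairwise representation. The attribute vocabulary and the one-hot encoding below are illustrative assumptions.

```python
import numpy as np

# Hypothetical attribute vocabulary; the actual feature encoding is an assumption.
VOCAB = ["Shirt", "Polo shirt", "Blue", "White", "Women", "Cheap", "Expensive"]

def encode(attributes):
    """Encode an item's attribute set as a binary vector z over the vocabulary."""
    z = np.zeros(len(VOCAB))
    for a in attributes:
        z[VOCAB.index(a)] = 1.0
    return z

# Item i is preferred over item k: the pairwise feature is the difference z_i - z_k.
z_i = encode(["Shirt", "Blue", "Women", "Cheap"])
z_k = encode(["Polo shirt", "White", "Women", "Expensive"])
z_pref = z_i - z_k
```

Note that shared attributes (here, Women) cancel out, so the pairwise feature only carries the attributes that distinguish the two items.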
Payoff Model
• Personalized model + average component
• One personalized model per user (User 1, User 2, …, User m) plus a shared average component
• $\mathbb{E}[r_{t, i \succ k} \mid u_t = u_j] = \beta_t^\top z_{i \succ k} + \theta^\top z_{i \succ k}$
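The expected payoff on a preference pair combines a per-user (personalized) component and a shared (average) component. A hedged sketch, where beta_user stands for the parameter vector of the active user and theta for the average component; the variable names and example values are assumptions.

```python
import numpy as np

def expected_payoff(z_pref, beta_user, theta):
    """Personalized component plus shared average component for a preference pair."""
    return beta_user @ z_pref + theta @ z_pref

# Illustrative usage: one weight vector per user plus one shared vector theta.
d = 7
rng = np.random.default_rng(0)
beta = {j: rng.normal(size=d) for j in range(3)}   # per-user parameters
theta = rng.normal(size=d)                          # average component
r_hat = expected_payoff(rng.normal(size=d), beta[1], theta)
```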
Personalized Recommendation with Qualitative Bandit
• For t = 1, …, T:
  1. The world generates some context
  2. The learner chooses an action
  3. The world reacts with a reward
• Choose the arm with the highest mean reward + confidence interval (a generalization of LinUCB)
Unified Optimization
• Solving the objective function in dual space
• With an arbitrary loss function
• Using the Fenchel-Legendre conjugate
Squared Loss

$$\max_{\alpha} \; -\frac{1}{2}\,\alpha^\top \Big[\, Z Z^\top + \frac{1}{\mu} \sum_j \big(\varphi_j \otimes \varphi_j^\top\big) \odot Z Z^\top \Big]\,\alpha \;-\; \frac{1}{2C}\,\alpha^\top \alpha \;+\; r^\top \alpha$$

• The problem reduces to a standard quadratic optimization
• Model parameters $(\theta, \beta_j)$ are obtained from $\alpha$
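Because the loss is squared, the dual objective above is an unconstrained concave quadratic in $\alpha$, so setting the gradient to zero gives the closed form $\alpha = (K + I/C)^{-1} r$ with $K = ZZ^\top + \frac{1}{\mu}\sum_j (\varphi_j \varphi_j^\top) \odot ZZ^\top$. The sketch below makes assumptions: Z holds the pairwise features row-wise, r the observed rewards, user_ids the user index of each row, $\varphi_j$ is the 0/1 membership indicator of user j, and the recovery of $\theta$ and $\beta_j$ from $\alpha$ follows the usual kernel-ridge stationarity conditions rather than the authors' exact derivation.

```python
import numpy as np

def solve_dual_squared_loss(Z, r, user_ids, C=1.0, mu=1.0):
    """Solve the quadratic dual in closed form for the squared loss.

    Maximizing -0.5 a^T K a - (1/2C) a^T a + r^T a gives a = (K + I/C)^{-1} r,
    with K = Z Z^T + (1/mu) * sum_j (phi_j phi_j^T) o Z Z^T (o = element-wise product).
    """
    n = Z.shape[0]
    G = Z @ Z.T
    K = G.copy()
    for j in np.unique(user_ids):
        phi = (user_ids == j).astype(float)          # 0/1 indicator of user j's rows
        K += (1.0 / mu) * np.outer(phi, phi) * G     # per-user mask of the Gram matrix
    alpha = np.linalg.solve(K + np.eye(n) / C, r)

    # Recover primal parameters from alpha (assumed form, by analogy with kernel ridge).
    theta = Z.T @ alpha
    beta = {j: (1.0 / mu) * Z[user_ids == j].T @ alpha[user_ids == j]
            for j in np.unique(user_ids)}
    return alpha, theta, beta
```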
Squared Loss
• In the contextual bandit framework:
• Mean: $\beta_t^\top z_{i \succ k} + \theta^\top z_{i \succ k}$
• Confidence bound: $c \sqrt{z_{i \succ k}^\top \big(Z^\top Z + \lambda I\big)^{-1} z_{i \succ k}}$
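A small sketch of the resulting UCB-style score for a candidate preference pair, with lam as the ridge parameter and c as the exploration constant (both parameter names are assumptions).

```python
import numpy as np

def ucb_score(z_pref, beta_user, theta, Z, lam=1.0, c=1.0):
    """Mean payoff plus an exploration bonus, following the confidence bound on the slide."""
    mean = beta_user @ z_pref + theta @ z_pref
    A_inv = np.linalg.inv(Z.T @ Z + lam * np.eye(Z.shape[1]))
    bonus = c * np.sqrt(z_pref @ A_inv @ z_pref)
    return mean + bonus
```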
Algorithm
[algorithm pseudocode figure]
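The algorithm figure on this slide did not survive extraction. The following is a hedged reconstruction of the round-by-round loop implied by the previous slides, not the authors' exact pseudocode; it reuses ucb_score and solve_dual_squared_loss from the sketches above, and candidate_pairs and observe_reward are hypothetical callbacks supplied by the environment.

```python
import numpy as np

def preference_bandit(T, candidate_pairs, observe_reward, d, C=1.0, mu=1.0, lam=1.0, c=1.0):
    """Hedged reconstruction of the learning loop: score candidate preference pairs
    with mean + confidence, recommend the best one, then refit the dual model."""
    Z_hist, r_hist = np.zeros((0, d)), np.zeros(0)
    u_hist = np.zeros(0, dtype=int)
    theta, beta = np.zeros(d), {}
    for t in range(T):
        user, pairs = candidate_pairs(t)                # context: user id and pairwise features
        beta_u = beta.get(user, np.zeros(d))
        Z_reg = Z_hist if len(Z_hist) else np.zeros((1, d))
        scores = [ucb_score(z, beta_u, theta, Z_reg, lam, c) for z in pairs]
        best = int(np.argmax(scores))                   # arm with highest mean + confidence
        r = observe_reward(t, best)                     # the world reacts with a reward
        Z_hist = np.vstack([Z_hist, pairs[best]])
        r_hist = np.append(r_hist, r)
        u_hist = np.append(u_hist, user)
        _, theta, beta = solve_dual_squared_loss(Z_hist, r_hist, u_hist, C, mu)
    return theta, beta
```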
Summary
• Personalized recommendation
• Pairwise learning in a bandit framework
• Optimization in dual space
• Learning algorithm for squared loss
Thanks for your attention
Questions? Email: tavakol@leuphana.de