Unbiased Offline Recommender Evaluation for Missing-Not-At-Random Implicit Feedback
Serge Belongie, Deborah Estrin, Longqi Yang, Yuan Xuan, Chenyang Wang, Yin Cui
Offline Evaluation of Recommendation Algorithm
[Diagram: user interaction history (logged user-item interactions) → recommendation algorithms → rewards R]
Pros:
• Cost effective.
• Efficient.
• Iterate faster.
• Experiment before deployment.
Cons:
• The data is Missing-Not-At-Random (MNAR).
Offline Evaluation Procedure
[Diagram: user-item interaction matrix, where a marked cell means user u interacted with item i; the observed interactions are split into train/test]
Used for both rating-based and implicit feedback-based recommendation systems:
1. Train and validate a recommendation model (policy).
2. Average the performance over held-out (user, item) interaction pairs (Average-Over-All).
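A minimal sketch of step 2, the Average-Over-All estimate, may help make the procedure concrete. The function name `average_over_all` and the `(user, item, score)` tuple layout are assumptions made for illustration; the slides do not prescribe a data format or a particular per-pair metric.

```python
from typing import Hashable, Iterable, Tuple

def average_over_all(held_out: Iterable[Tuple[Hashable, Hashable, float]]) -> float:
    """Pool every held-out (user, item) pair and take the plain mean of its score.

    `held_out` holds (user, item, score) tuples, where `score` is whatever
    per-pair metric the evaluator computes (hypothetical layout).
    """
    scores = [score for _user, _item, score in held_out]
    if not scores:
        raise ValueError("no held-out interactions to average over")
    # Every observed pair gets the same weight, regardless of how likely
    # it was to be observed in the first place.
    return sum(scores) / len(scores)

# Hypothetical held-out pairs with made-up scores.
example = [("u1", "a", 1.0), ("u1", "b", 0.5), ("u2", "c", 0.0)]
aoa = average_over_all(example)
```

Because each observed pair counts equally, items that are observed more often contribute more to the estimate.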
Previous work: Average-Over-All is biased for rating-based recommendation systems, because ratings are MNAR [Marlin et al. 09], [Schnabel et al. 16], [Steck 10], [Steck 11], and [Steck 13].
Previous work: Average-Over-All is unbiased for implicit feedback-based recommendation systems, because implicit feedback is missing uniformly at random [Lim 15].
This work: Average-Over-All is biased for implicit feedback-based recommendation systems, because implicit feedback is NOT missing uniformly at random.
[Images: trending, recommendation]
Popularity bias (users are more likely to be exposed to popular items).
A Hypothetical Example

                                        Popular Items     Long-tail Items
# of liked items (over all items)             1        :        10
# of liked items (over observations)         10        :         1
Algorithm 1 Performance                      0.8                 0
Algorithm 2 Performance                      0.75                0.75

Annotations added in later builds: "Any sensible evaluation" and "Average-Over-All" (the latter beside the "over observations" row).
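The arithmetic behind this example can be sketched as follows. The weighting scheme is an assumption made for illustration: each algorithm's overall score is taken as its per-group performance weighted by the share of liked items in that group. The slide gives only the ratios and the per-group numbers, so `weighted_score` is a hypothetical helper, not the authors' formula.

```python
def weighted_score(perf, weights):
    """Weighted mean of per-group performance (popular, long-tail)."""
    total = sum(weights)
    return sum(p * w for p, w in zip(perf, weights)) / total

# Per-group performance from the slide: (popular items, long-tail items).
alg1 = (0.8, 0.0)
alg2 = (0.75, 0.75)

# Liked-item ratios from the slide.
over_all_items = (1, 10)      # ratio over all items
over_observations = (10, 1)   # ratio over observed interactions

alg1_all = weighted_score(alg1, over_all_items)      # 0.8 / 11
alg1_obs = weighted_score(alg1, over_observations)   # 8 / 11
alg2_all = weighted_score(alg2, over_all_items)      # 0.75
alg2_obs = weighted_score(alg2, over_observations)   # 0.75
```

Under these assumptions, Algorithm 1 scores about 0.07 when weighted over all items but about 0.73 when weighted over observations, while Algorithm 2 stays at 0.75 either way. Weighting by observations therefore shrinks a large gap between the two algorithms to a small one.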
Formalize
Reward: R
Item rankings predicted by an algorithm: \hat{Z}
Ideal evaluation:
\[ R(\hat{Z}) = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|S_u|} \sum_{i \in S_u} c(\hat{Z}_{u,i}) \]
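The ideal evaluation formula can be sketched in code under a few stated assumptions: S_u is read as the set of items user u likes among all items, \hat{Z}_{u,i} as the predicted rank of item i for user u, and c as any per-item scoring function of that rank. The name `ideal_reward`, the nested-dict layout, and the reciprocal-rank choice for c are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Hashable

def ideal_reward(
    ranks: Dict[Hashable, Dict[Hashable, int]],
    c: Callable[[int], float],
) -> float:
    """Compute (1/|U|) * sum_u (1/|S_u|) * sum_{i in S_u} c(Z_hat[u][i]).

    `ranks[u]` maps each item i in S_u to its predicted rank Z_hat[u][i].
    Users with an empty S_u are skipped, so U here means the users with
    at least one liked item (an assumption of this sketch).
    """
    per_user = []
    for _user, item_ranks in ranks.items():
        if not item_ranks:
            continue
        # Inner average over the user's liked items.
        per_user.append(sum(c(r) for r in item_ranks.values()) / len(item_ranks))
    if not per_user:
        raise ValueError("no user has any liked item")
    # Outer average over users.
    return sum(per_user) / len(per_user)

# Hypothetical ranks; c is reciprocal rank, one possible choice.
example_ranks = {"u1": {"a": 1, "b": 2}, "u2": {"c": 4}}
reward = ideal_reward(example_ranks, lambda rank: 1.0 / rank)
```

Note that the formula averages within each user first and then across users, so every user counts equally no matter how many items they like.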