Statistical Learning and Optimization Based on Comparative Judgments
Rob Nowak, www.ece.wisc.edu/~nowak
OSL, Les Houches, January 10, 2013
Learning from Comparative Judgments

Humans are much more reliable and consistent at making comparative judgments than at giving numerical ratings or evaluations (L. L. Thurstone; Bijmolt and Wedel, 1995; Stewart, Brown, and Chater, 2005).

Is model A better than B? Answers = bits.

[Figure: active-learning loop between the data space and the model space]
Machine Learning from Human Judgments

Applications: document classification (labels), recommendation systems, optimizing experimentation (experiments).

Challenge: computing is cheap, but human assistance/guidance is expensive.

Goal: optimize such systems with as little human involvement as possible.
Learning from Paired Comparisons

1. Derivative-Free Optimization using Human Subjects: minimizing a convex function.
2. Ranking from Pairwise Comparisons: ranking objects that embed into a low-dimensional space.
Optimization Based on Human Judgments

Human oracles can provide function values or comparisons, but not function gradients. Methods that don't use gradients are called Derivative-Free Optimization (DFO).

[Figure: convex function to be minimized]
A Familiar Application

"Better, or worse?" [Figure: lens choices in an eye exam; the optimal prescription is a point in the plane of spherical correction vs. cylindrical correction]
Personalized Search

Profile vector $w_A \in \mathbb{R}^d$; Results $\leftarrow$ SEARCH(query = "sebastian bach", $w_A$).

With $w_A = w_{\text{old}}$: Sebastian Bach (1968-present), heavy-metal singer, frontman of "Skid Row".
With $w_A = w_{\text{new}}$: Johann Sebastian Bach (1685-1750), composer.
Optimization Based on Pairwise Comparisons

Assume that the (unknown) function $f$ to be optimized is strongly convex with Lipschitz gradients. The function will be minimized by asking pairwise comparisons of the form: is $f(x) > f(y)$?

Assume that the answers are probably correct: for some $\delta > 0$,
$$P\bigl(\text{answer} = \operatorname{sign}(f(x) - f(y))\bigr) \ge \tfrac{1}{2} + \delta.$$

A simulated version of this oracle is sketched below.
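A minimal simulation of this oracle model, for concreteness; the function `f`, the noise level `delta`, and all names here are illustrative assumptions, not from the slides:

```python
import numpy as np

# Fixed-seed generator so repeated runs are reproducible; purely a
# simulation choice, not part of the slide's model.
_rng = np.random.default_rng(0)

def noisy_comparison(f, x, y, delta=0.3):
    """Simulated oracle for "is f(x) > f(y)?".

    Returns +1 or -1, equal to sign(f(x) - f(y)) with probability
    1/2 + delta, i.e., the "probably correct" model on this slide.
    delta = 0.3 is an arbitrary illustrative noise level.
    """
    truth = 1.0 if f(x) > f(y) else -1.0
    return truth if _rng.random() < 0.5 + delta else -truth
```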
Optimization Based on Pairwise Comparisons

Optimization with Pairwise Comparisons:
  initialize: $x_0$ = random point
  for $n = 0, 1, 2, \ldots$
    1) select one of $d$ coordinates uniformly at random and consider the line along that coordinate passing through $x_n$
    2) minimize along the coordinate using pairwise comparisons and binary search
    3) $x_{n+1}$ = approximate minimizer

[Figure: iterates $x_0, x_1, x_2, x_3, x_4, \ldots$ converging to the minimizer in the plane]

The line search iteratively reduces an interval containing the minimum. Begin with a large interval $[y_0^-, y_0^+]$; its midpoint $y_0$ is the estimate of the minimizer.
Then:
- split the intervals $[y_0^-, y_0]$ and $[y_0, y_0^+]$ and compare the function values at the split points with $f(y_0)$;
- reduce to the smallest interval among these points that contains the minimum, giving $[y_1^-, y_1^+]$ with midpoint $y_1$;
- repeat, producing $[y_2^-, y_2^+]$ with midpoint $y_2$, and so on.

A code sketch of the full procedure follows.
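A sketch of the whole loop under the assumptions above, reusing the `noisy_comparison` oracle from the earlier slide; the interval bounds, tolerance, and iteration counts are arbitrary illustrative choices:

```python
import numpy as np

def line_search(f, compare, x, coord, lo=-1.0, hi=1.0, tol=1e-3):
    """Binary line search along coordinate `coord` using only comparisons.

    Maintains an interval [lo, hi] believed to contain the 1-D minimizer,
    mirroring the slide's [y-, y+]: compare f at the midpoints of the two
    half-intervals against f at the center, then keep the smallest
    interval consistent with convexity.
    """
    def point(t):
        z = x.copy()
        z[coord] = t
        return z

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        left, right = (lo + mid) / 2.0, (mid + hi) / 2.0
        if compare(f, point(left), point(mid)) < 0:     # f(left) < f(mid)
            hi = mid                                    # min lies in [lo, mid]
        elif compare(f, point(right), point(mid)) < 0:  # f(right) < f(mid)
            lo = mid                                    # min lies in [mid, hi]
        else:                                           # center is smallest
            lo, hi = left, right                        # min lies in [left, right]
    return (lo + hi) / 2.0

def pairwise_coordinate_descent(f, compare, d, iters=200):
    """Steps 1-3 of the slide's algorithm."""
    rng = np.random.default_rng(1)
    x = rng.uniform(-1.0, 1.0, size=d)   # x_0 = random point
    for _ in range(iters):
        coord = int(rng.integers(d))     # random coordinate
        x[coord] = line_search(f, compare, x, coord)
    return x

# e.g., minimize a strongly convex quadratic with the noisy oracle:
# x_hat = pairwise_coordinate_descent(lambda z: np.sum(z**2),
#                                     noisy_comparison, d=2)
```

With a noisy oracle, each single comparison should be replaced by a majority vote of repeated comparisons, as the convergence analysis on the next slide describes.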
Convergence Analysis

If we want error $:= E[f(x_k) - f(x^*)] \le \epsilon$, we must solve $k \approx d \log\frac{1}{\epsilon}$ line searches (standard coordinate descent bound), and each must be accurate to within $\sqrt{\epsilon/d}$.

Noiseless case: each line search requires $\frac{1}{2}\log\frac{d}{\epsilon}$ comparisons $\Rightarrow$ a total of $n \approx d \log\frac{1}{\epsilon}\log\frac{d}{\epsilon}$ comparisons $\Rightarrow \epsilon \approx \exp\bigl(-\sqrt{n/d}\bigr)$.

Noisy case: probably correct answers to comparisons, $P(\text{answer} = \operatorname{sign}(f(x) - f(y))) \ge \frac{1}{2} + \delta$; take a majority vote of repeated comparisons to mitigate the noise (a sketch follows).

Bounded noise ($\delta \ge \delta_0 > 0$): line searches require $C \log\frac{d}{\epsilon}$ comparisons, where $C > 1/2$ depends on $\delta_0$ $\Rightarrow \epsilon \approx \exp\bigl(-\sqrt{n/(Cd)}\bigr)$.

Unbounded noise ($\delta \propto |f(x) - f(y)|$): line searches require $(d/\epsilon)^2$ comparisons $\Rightarrow \epsilon \approx \sqrt{d^3/n}$.
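The majority-vote repetition mentioned above, as a sketch on top of the earlier `noisy_comparison`; the repetition count `k` is an illustrative constant (the analysis would choose it so that each vote is correct with high probability):

```python
def majority_comparison(f, x, y, k=15):
    """Repeat a noisy comparison k times (k odd, so no ties) and return
    the majority answer, mitigating oracle noise as on this slide."""
    votes = sum(noisy_comparison(f, x, y) for _ in range(k))
    return 1 if votes > 0 else -1
```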
Lower Bounds

Consider $f_0(x) = |x + \epsilon|^2$ and $f_1(x) = |x - \epsilon|^2$, with minima at $-\epsilon$ and $+\epsilon$.

For unbounded noise, $\delta \propto |f(x) - f(y)|$, the Kullback-Leibler divergence between the response to "$f_0(x) > f_0(y)$?" and the response to "$f_1(x) > f_1(y)$?" is $O(\epsilon^4)$, so the KL divergence between $n$ responses is $O(n\epsilon^4)$ (a short calculation follows).

With $\epsilon \sim n^{-1/4}$:
- the KL divergence is a constant;
- the squared distance between the minima is $\sim n^{-1/2}$;

$\Rightarrow P\bigl(f(x_n) - f(x^*) \ge n^{-1/2}\bigr) \ge$ constant, which matches the $O(n^{-1/2})$ upper bound of the algorithm ($\sqrt{d^3/n}$ in $\mathbb{R}^d$).

Jamieson, Recht, RN (2012)
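A hedged sketch of where the $\epsilon^4$ comes from: near the minima, the comparison biases under $f_0$ and $f_1$ are both $O(\epsilon^2)$ (since $\delta \propto |f(x) - f(y)|$ and the function gaps there are $O(\epsilon^2)$), and the KL divergence between two nearly fair Bernoullis is quadratic in the bias:

```latex
\mathrm{KL}\!\left(\mathrm{Bern}(\tfrac12 + a)\,\middle\|\,\mathrm{Bern}(\tfrac12 - a)\right)
  = 2a \log\frac{1 + 2a}{1 - 2a} \approx 8a^2 \quad (a \to 0),
\qquad
a = \delta \propto |f(x) - f(y)| = O(\epsilon^2)
\;\Longrightarrow\;
\mathrm{KL} = O(\epsilon^4) \text{ per comparison.}
```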
A Surprise

Could we do better with function evaluations (e.g., ratings instead of comparisons)? Suppose we can obtain noisy function evaluations of the form $f(x) + \text{noise}$.

Function values seem to provide much more information than comparisons alone: e.g., $f(x) = 10$, $f(y) = 9$, $f(z) = 1$ tells us not only that $f(y) < f(x)$ and $f(z) < f(x)$, but also the magnitudes of the gaps.

But the lower bound on optimization error with noisy function evaluations is $\sqrt{d^2/n}$ (O. Shamir, 2012), while the upper bound with noisy pairwise comparisons is $\sqrt{d^3/n}$: evaluations give at best a small improvement over comparisons (see Agarwal, Dekel, Xiao (2010) for similar upper bounds for function evaluations).

If we could measure noisy gradients (and the function is strongly convex), then an $O(d/n)$ convergence rate is possible (Nemirovski et al., 2009).
Preference Learning

Bartender: "What beer would you like?"
Philippe: "Hmm... I prefer French wine."
Bartender: "Try these two samples. Do you prefer A or B?"
Philippe: "B."
Bartender: "OK, try these two: C or D?"
...
Ranking Based on Pairwise Comparisons

Consider 10 beers ranked from best to worst: D < G < I < C < J < E < A < H < B < F. Entry $(i, j)$ of the comparison matrix is $+1$ if beer $i$ is ranked above beer $j$ and $-1$ otherwise:

     A   B   C   D   E   F   G   H   I   J
A    0   1  -1  -1  -1   1  -1   1  -1  -1
B   -1   0  -1  -1  -1   1  -1  -1  -1  -1
C    1   1   0  -1   1   1  -1   1  -1   1
D    1   1   1   0   1   1   1   1   1   1
E    1   1  -1  -1   0   1  -1   1  -1  -1
F   -1  -1  -1  -1  -1   0  -1  -1  -1  -1
G    1   1   1  -1   1   1   0   1   1   1
H   -1   1  -1  -1  -1   1  -1   0  -1  -1
I    1   1   1  -1   1   1  -1   1   0   1
J    1   1  -1  -1   1   1  -1   1  -1   0

Which pairwise comparisons should we ask? How many are needed?

Assumption: responses to pairwise comparisons are consistent with the ranking.

A short script reproducing this table follows.
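A small sketch that rebuilds the comparison matrix from the stated ranking; variable names are illustrative:

```python
import numpy as np

items = list("ABCDEFGHIJ")
ranking = list("DGICJEAHBF")   # best to worst, as on the slide
rank = {beer: pos for pos, beer in enumerate(ranking)}

M = np.zeros((len(items), len(items)), dtype=int)
for i, a in enumerate(items):
    for j, b in enumerate(items):
        if a != b:
            M[i, j] = 1 if rank[a] < rank[b] else -1   # +1: a ranked above b
```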
Ranking Based on Pairwise Comparisons

Consider 10 beers ranked from best to worst: D < G < I < C < J < E < A < H < B < F.

Problem: there are $n!$ possible rankings, so identifying one requires about $n \log n$ bits of information.

Select $m$ pairwise comparisons at random:
- perfect recovery: almost all pairs must be compared, i.e., about $n(n-1)/2$ comparisons. That's a lot of beer!
- approximate recovery: fraction of pairs misordered $\le c\,\frac{n \log n}{m}$

Adaptive selection: binary insertion sort requires only about $n \log n$ comparisons, matching the information bound (a sketch follows below).
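A sketch of the adaptive strategy, assuming (as the slide does) that comparison answers are consistent with the ranking; `prefer` is a hypothetical stand-in for asking Philippe:

```python
def binary_insertion_sort(items, prefer):
    """Recover a full ranking (best first) with ~n log n pairwise queries.

    prefer(a, b) answers "is a ranked above b?". Each new item finds its
    slot by binary search over the already-ranked list, so inserting the
    k-th item costs about log2(k) comparisons.
    """
    ranked = []
    for item in items:
        lo, hi = 0, len(ranked)
        while lo < hi:
            mid = (lo + hi) // 2
            if prefer(item, ranked[mid]):   # item beats ranked[mid]
                hi = mid
            else:
                lo = mid + 1
        ranked.insert(lo, item)
    return ranked

# Using the rank table above as a stand-in oracle recovers the slide's order:
# binary_insertion_sort(items, lambda a, b: rank[a] < rank[b])
# -> ['D', 'G', 'I', 'C', 'J', 'E', 'A', 'H', 'B', 'F']
```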
Low-Dimensional Assumption: Beer Space

Suppose beers can be embedded (according to their characteristics) into a low-dimensional Euclidean space, and let $w$ be Philippe's latent preferences in "beer space" (e.g., hoppiness, lightness, maltiness, ...). Then

$$\|x_i - w\| < \|x_j - w\| \iff x_i \succ x_j,$$

i.e., beer $x_i$ is preferred to beer $x_j$ exactly when it is closer to $w$ (a one-line oracle sketch follows).

[Figure: beers A-G embedded in the plane around the ideal point $w$]
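The slide's preference model as a one-line oracle; the points and the embedding dimension are illustrative:

```python
import numpy as np

def embedding_preference(x_i, x_j, w):
    """x_i is preferred to x_j exactly when it is closer to the latent
    ideal point w in the embedding, per the slide's model."""
    return np.linalg.norm(x_i - w) < np.linalg.norm(x_j - w)
```

Plugged into the binary insertion sort above, this oracle ranks embedded beers by their distance to $w$.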