Statistical Learning and Optimization Based on Comparative Judgments
Rob Nowak, www.ece.wisc.edu/~nowak
OSL, Les Houches, January 10, 2013
Learning from Comparative Judgments

Humans are much more reliable and consistent at making comparative judgments than at giving numerical ratings or evaluations (L. L. Thurstone; Bijmolt and Wedel, 1995; Stewart, Brown, and Chater, 2005).

Is model A better than B? Answers = bits.

[Figure: active-learning loop between the data space and the model space]
Machine Learning from Human Judgments

Applications: document classification (labels), recommendation systems, optimizing experimentation (experiments).

Challenge: computing is cheap, but human assistance/guidance is expensive.

Goal: optimize such systems with as little human involvement as possible.
Learning from Paired Comparisons

1. Derivative-Free Optimization using Human Subjects: minimizing a convex function.
2. Ranking from Pairwise Comparisons: ranking objects that embed into a low-dimensional space.
Optimization Based on Human Judgments

Human oracles can provide function values or comparisons, but not function gradients. Methods that don't use gradients are called Derivative-Free Optimization (DFO).

[Figure: convex function to be minimized]
A Familiar Application

"Better, or worse?" [Figure: lens choices in an eye exam; the optimal prescription is a point in the plane of spherical correction vs. cylindrical correction]
Personalized Search

Profile vector $w_A \in \mathbb{R}^d$; Results $\leftarrow$ SEARCH(query = "sebastian bach", $w_A$).

With $w_A = w_{\text{old}}$: Sebastian Bach (1968-present), heavy-metal singer, frontman of "Skid Row".
With $w_A = w_{\text{new}}$: Johann Sebastian Bach (1685-1750), composer.
Optimization Based on Pairwise Comparisons

Assume that the (unknown) function $f$ to be optimized is strongly convex with Lipschitz gradients. The function will be minimized by asking pairwise comparisons of the form: is $f(x) > f(y)$?

Assume that the answers are probably correct: for some $\delta > 0$,
$$P\bigl(\text{answer} = \operatorname{sign}(f(x) - f(y))\bigr) \ge \tfrac{1}{2} + \delta.$$

A simulated version of this oracle is sketched below.
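A minimal simulation of this oracle model, for concreteness; the function `f`, the noise level `delta`, and all names here are illustrative assumptions, not from the slides:

```python
import numpy as np

# Fixed-seed generator so repeated runs are reproducible; purely a
# simulation choice, not part of the slide's model.
_rng = np.random.default_rng(0)

def noisy_comparison(f, x, y, delta=0.3):
    """Simulated oracle for "is f(x) > f(y)?".

    Returns +1 or -1, equal to sign(f(x) - f(y)) with probability
    1/2 + delta, i.e., the "probably correct" model on this slide.
    delta = 0.3 is an arbitrary illustrative noise level.
    """
    truth = 1.0 if f(x) > f(y) else -1.0
    return truth if _rng.random() < 0.5 + delta else -truth
```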
Optimization Based on Pairwise Comparisons

Optimization with Pairwise Comparisons:
  initialize: $x_0$ = random point
  for $n = 0, 1, 2, \ldots$
    1) select one of $d$ coordinates uniformly at random and consider the line along that coordinate passing through $x_n$
    2) minimize along the coordinate using pairwise comparisons and binary search
    3) $x_{n+1}$ = approximate minimizer

[Figure: iterates $x_0, x_1, x_2, x_3, x_4, \ldots$ converging to the minimizer in the plane]

The line search iteratively reduces an interval containing the minimum. Begin with a large interval $[y_0^-, y_0^+]$; its midpoint $y_0$ is the estimate of the minimizer.
Then:
- split the intervals $[y_0^-, y_0]$ and $[y_0, y_0^+]$ and compare the function values at the split points with $f(y_0)$;
- reduce to the smallest interval among these points that contains the minimum, giving $[y_1^-, y_1^+]$ with midpoint $y_1$;
- repeat, producing $[y_2^-, y_2^+]$ with midpoint $y_2$, and so on.

A code sketch of the full procedure follows.
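A sketch of the whole loop under the assumptions above, reusing the `noisy_comparison` oracle from the earlier slide; the interval bounds, tolerance, and iteration counts are arbitrary illustrative choices:

```python
import numpy as np

def line_search(f, compare, x, coord, lo=-1.0, hi=1.0, tol=1e-3):
    """Binary line search along coordinate `coord` using only comparisons.

    Maintains an interval [lo, hi] believed to contain the 1-D minimizer,
    mirroring the slide's [y-, y+]: compare f at the midpoints of the two
    half-intervals against f at the center, then keep the smallest
    interval consistent with convexity.
    """
    def point(t):
        z = x.copy()
        z[coord] = t
        return z

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        left, right = (lo + mid) / 2.0, (mid + hi) / 2.0
        if compare(f, point(left), point(mid)) < 0:     # f(left) < f(mid)
            hi = mid                                    # min lies in [lo, mid]
        elif compare(f, point(right), point(mid)) < 0:  # f(right) < f(mid)
            lo = mid                                    # min lies in [mid, hi]
        else:                                           # center is smallest
            lo, hi = left, right                        # min lies in [left, right]
    return (lo + hi) / 2.0

def pairwise_coordinate_descent(f, compare, d, iters=200):
    """Steps 1-3 of the slide's algorithm."""
    rng = np.random.default_rng(1)
    x = rng.uniform(-1.0, 1.0, size=d)   # x_0 = random point
    for _ in range(iters):
        coord = int(rng.integers(d))     # random coordinate
        x[coord] = line_search(f, compare, x, coord)
    return x

# e.g., minimize a strongly convex quadratic with the noisy oracle:
# x_hat = pairwise_coordinate_descent(lambda z: np.sum(z**2),
#                                     noisy_comparison, d=2)
```

With a noisy oracle, each single comparison should be replaced by a majority vote of repeated comparisons, as the convergence analysis on the next slide describes.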
Convergence Analysis

If we want error $:= E[f(x_k) - f(x^*)] \le \epsilon$, we must solve $k \approx d \log\frac{1}{\epsilon}$ line searches (standard coordinate descent bound), and each must be accurate to within $\sqrt{\epsilon/d}$.

Noiseless case: each line search requires $\frac{1}{2}\log\frac{d}{\epsilon}$ comparisons $\Rightarrow$ a total of $n \approx d \log\frac{1}{\epsilon}\log\frac{d}{\epsilon}$ comparisons $\Rightarrow \epsilon \approx \exp\bigl(-\sqrt{n/d}\bigr)$.

Noisy case: probably correct answers to comparisons, $P(\text{answer} = \operatorname{sign}(f(x) - f(y))) \ge \frac{1}{2} + \delta$; take a majority vote of repeated comparisons to mitigate the noise (a sketch follows).

Bounded noise ($\delta \ge \delta_0 > 0$): line searches require $C \log\frac{d}{\epsilon}$ comparisons, where $C > 1/2$ depends on $\delta_0$ $\Rightarrow \epsilon \approx \exp\bigl(-\sqrt{n/(Cd)}\bigr)$.

Unbounded noise ($\delta \propto |f(x) - f(y)|$): line searches require $(d/\epsilon)^2$ comparisons $\Rightarrow \epsilon \approx \sqrt{d^3/n}$.
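The majority-vote repetition mentioned above, as a sketch on top of the earlier `noisy_comparison`; the repetition count `k` is an illustrative constant (the analysis would choose it so that each vote is correct with high probability):

```python
def majority_comparison(f, x, y, k=15):
    """Repeat a noisy comparison k times (k odd, so no ties) and return
    the majority answer, mitigating oracle noise as on this slide."""
    votes = sum(noisy_comparison(f, x, y) for _ in range(k))
    return 1 if votes > 0 else -1
```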
Lower Bounds

Consider $f_0(x) = |x + \epsilon|^2$ and $f_1(x) = |x - \epsilon|^2$, with minima at $-\epsilon$ and $+\epsilon$.

For unbounded noise, $\delta \propto |f(x) - f(y)|$, the Kullback-Leibler divergence between the response to "$f_0(x) > f_0(y)$?" and the response to "$f_1(x) > f_1(y)$?" is $O(\epsilon^4)$, so the KL divergence between $n$ responses is $O(n\epsilon^4)$ (a short calculation follows).

With $\epsilon \sim n^{-1/4}$:
- the KL divergence is a constant;
- the squared distance between the minima is $\sim n^{-1/2}$;

$\Rightarrow P\bigl(f(x_n) - f(x^*) \ge n^{-1/2}\bigr) \ge$ constant, which matches the $O(n^{-1/2})$ upper bound of the algorithm ($\sqrt{d^3/n}$ in $\mathbb{R}^d$).

Jamieson, Recht, RN (2012)
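A hedged sketch of where the $\epsilon^4$ comes from: near the minima, the comparison biases under $f_0$ and $f_1$ are both $O(\epsilon^2)$ (since $\delta \propto |f(x) - f(y)|$ and the function gaps there are $O(\epsilon^2)$), and the KL divergence between two nearly fair Bernoullis is quadratic in the bias:

```latex
\mathrm{KL}\!\left(\mathrm{Bern}(\tfrac12 + a)\,\middle\|\,\mathrm{Bern}(\tfrac12 - a)\right)
  = 2a \log\frac{1 + 2a}{1 - 2a} \approx 8a^2 \quad (a \to 0),
\qquad
a = \delta \propto |f(x) - f(y)| = O(\epsilon^2)
\;\Longrightarrow\;
\mathrm{KL} = O(\epsilon^4) \text{ per comparison.}
```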
A Surprise

Could we do better with function evaluations (e.g., ratings instead of comparisons)? Suppose we can obtain noisy function evaluations of the form $f(x) + \text{noise}$.

Function values seem to provide much more information than comparisons alone: e.g., $f(x) = 10$, $f(y) = 9$, $f(z) = 1$ tells us not only that $f(y) < f(x)$ and $f(z) < f(x)$, but also the magnitudes of the gaps.

But the lower bound on optimization error with noisy function evaluations is $\sqrt{d^2/n}$ (O. Shamir, 2012), while the upper bound with noisy pairwise comparisons is $\sqrt{d^3/n}$: evaluations give at best a small improvement over comparisons (see Agarwal, Dekel, Xiao (2010) for similar upper bounds for function evaluations).

If we could measure noisy gradients (and the function is strongly convex), then an $O(d/n)$ convergence rate is possible (Nemirovski et al., 2009).
Preference Learning

Bartender: "What beer would you like?"
Philippe: "Hmm... I prefer French wine."
Bartender: "Try these two samples. Do you prefer A or B?"
Philippe: "B."
Bartender: "OK, try these two: C or D?"
...
Ranking Based on Pairwise Comparisons

Consider 10 beers ranked from best to worst: D < G < I < C < J < E < A < H < B < F. Entry $(i, j)$ of the comparison matrix is $+1$ if beer $i$ is ranked above beer $j$ and $-1$ otherwise:

     A   B   C   D   E   F   G   H   I   J
A    0   1  -1  -1  -1   1  -1   1  -1  -1
B   -1   0  -1  -1  -1   1  -1  -1  -1  -1
C    1   1   0  -1   1   1  -1   1  -1   1
D    1   1   1   0   1   1   1   1   1   1
E    1   1  -1  -1   0   1  -1   1  -1  -1
F   -1  -1  -1  -1  -1   0  -1  -1  -1  -1
G    1   1   1  -1   1   1   0   1   1   1
H   -1   1  -1  -1  -1   1  -1   0  -1  -1
I    1   1   1  -1   1   1  -1   1   0   1
J    1   1  -1  -1   1   1  -1   1  -1   0

Which pairwise comparisons should we ask? How many are needed?

Assumption: responses to pairwise comparisons are consistent with the ranking.

A short script reproducing this table follows.
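A small sketch that rebuilds the comparison matrix from the stated ranking; variable names are illustrative:

```python
import numpy as np

items = list("ABCDEFGHIJ")
ranking = list("DGICJEAHBF")   # best to worst, as on the slide
rank = {beer: pos for pos, beer in enumerate(ranking)}

M = np.zeros((len(items), len(items)), dtype=int)
for i, a in enumerate(items):
    for j, b in enumerate(items):
        if a != b:
            M[i, j] = 1 if rank[a] < rank[b] else -1   # +1: a ranked above b
```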
Ranking Based on Pairwise Comparisons

Consider 10 beers ranked from best to worst: D < G < I < C < J < E < A < H < B < F.

Problem: there are $n!$ possible rankings, so identifying one requires about $n \log n$ bits of information.

Select $m$ pairwise comparisons at random:
- perfect recovery: almost all pairs must be compared, i.e., about $n(n-1)/2$ comparisons. That's a lot of beer!
- approximate recovery: fraction of pairs misordered $\le c\,\frac{n \log n}{m}$

Adaptive selection: binary insertion sort requires only about $n \log n$ comparisons, matching the information bound (a sketch follows below).
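A sketch of the adaptive strategy, assuming (as the slide does) that comparison answers are consistent with the ranking; `prefer` is a hypothetical stand-in for asking Philippe:

```python
def binary_insertion_sort(items, prefer):
    """Recover a full ranking (best first) with ~n log n pairwise queries.

    prefer(a, b) answers "is a ranked above b?". Each new item finds its
    slot by binary search over the already-ranked list, so inserting the
    k-th item costs about log2(k) comparisons.
    """
    ranked = []
    for item in items:
        lo, hi = 0, len(ranked)
        while lo < hi:
            mid = (lo + hi) // 2
            if prefer(item, ranked[mid]):   # item beats ranked[mid]
                hi = mid
            else:
                lo = mid + 1
        ranked.insert(lo, item)
    return ranked

# Using the rank table above as a stand-in oracle recovers the slide's order:
# binary_insertion_sort(items, lambda a, b: rank[a] < rank[b])
# -> ['D', 'G', 'I', 'C', 'J', 'E', 'A', 'H', 'B', 'F']
```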
Low-Dimensional Assumption: Beer Space

Suppose beers can be embedded (according to their characteristics) into a low-dimensional Euclidean space, and let $w$ be Philippe's latent preferences in "beer space" (e.g., hoppiness, lightness, maltiness, ...). Then

$$\|x_i - w\| < \|x_j - w\| \iff x_i \succ x_j,$$

i.e., beer $x_i$ is preferred to beer $x_j$ exactly when it is closer to $w$ (a one-line oracle sketch follows).

[Figure: beers A-G embedded in the plane around the ideal point $w$]
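The slide's preference model as a one-line oracle; the points and the embedding dimension are illustrative:

```python
import numpy as np

def embedding_preference(x_i, x_j, w):
    """x_i is preferred to x_j exactly when it is closer to the latent
    ideal point w in the embedding, per the slide's model."""
    return np.linalg.norm(x_i - w) < np.linalg.norm(x_j - w)
```

Plugged into the binary insertion sort above, this oracle ranks embedded beers by their distance to $w$.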