Active Learning and Search on Low-Rank Matrices

Active Learning and Search on Low-Rank Matrices
Dougal J. Sutherland, with Barnabás Póczos and Jeff Schneider

Collaborative prediction
• “Netflix problem”: how can we predict whether users will like movies?
• Basic idea: similar users should have similar feelings about similar items
• Actually: assume the ratings matrix is low rank

\[
\underbrace{\begin{pmatrix} 3 & 2 & 3 & 5 & 2 \\ 4 & 2 & 5 & 5 & 3 \\ 5 & 4 & 5 & 5 & 3 \end{pmatrix}}_{\text{ratings matrix } R}
\approx
\underbrace{\begin{pmatrix} 0.1 & 2.5 \\ 1.1 & 3.6 \\ 0.4 & 4.7 \end{pmatrix}}_{\text{user latent factors } U}
\cdot
\underbrace{\begin{pmatrix} 0.1 & 0.0 & 0.6 & 0.5 & 0.5 \\ 1.1 & 0.8 & 1.2 & 1.9 & 0.6 \end{pmatrix}}_{\text{item latent factors } V^T}
\]
(Rows of R and U: Alice, Bob, Carlos.)
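As a rough illustration of the low-rank assumption, the sketch below multiplies the latent factors from the slide's example and compares the product to the ratings. Clipping and rounding to a 1 to 5 scale is an assumption made here for display; the slide only claims an approximate match.

```python
import numpy as np

# User latent factors from the slide's example (rows: Alice, Bob, Carlos).
U = np.array([[0.1, 2.5],
              [1.1, 3.6],
              [0.4, 4.7]])
# Item latent factors, stored transposed: one column per movie.
Vt = np.array([[0.1, 0.0, 0.6, 0.5, 0.5],
               [1.1, 0.8, 1.2, 1.9, 0.6]])

R_hat = U @ Vt  # rank-2 reconstruction, shape (3, 5)

# Assumption: ratings live on a 1-5 scale, so clip and round for display.
R_rounded = np.clip(np.rint(R_hat), 1, 5).astype(int)
print(R_rounded)
```

Only 3·2 + 5·2 = 16 numbers are stored for the factors here, against 15 for the full matrix, so the saving only shows up at realistic sizes where N and M are much larger than the rank.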

Widely applicable
Eriksson & van den Hengel, CVPR 2010
Adams, Dahl, & Murray, UAI 2010

Active collaborative prediction
In practice, we rarely have a fixed training set. Sometimes we can choose to query specific points; we want the algorithm to tell us which ones to try.

Overall process
[Figure: partially observed input $R_O$ → imputed complete matrix $\hat{R}$ → point to query.]

Learning goals
Prediction: minimize prediction error on unknown entries
\[ \min \; E\left[ (R_{ij} - \hat{R}_{ij})^2 \mid (i,j) \notin O \right] \]
Model: minimize uncertainty in the distribution of models
\[ \min \; H[\text{model} \mid R_O] \]
Magnitude search: query the largest-valued points possible
\[ \max \sum_{(i,j) \in A} R_{ij} \]
Search: query as many positive points as possible
\[ \max \sum_{(i,j) \in A} \mathbb{1}(R_{ij} \in +) \]

Probabilistic Matrix Factorization
Generative model for matrices of fixed rank D (Salakhutdinov & Mnih, NIPS 2007)

[Figure: the Alice/Bob/Carlos ratings matrix R factorized as R ≈ U · V^T, with user latent factors U and item latent factors V^T.]

\[
R_{ij} \sim \mathcal{N}\left(U_i^T V_j, \sigma^2\right), \qquad
U_i \sim \mathcal{N}\left(0, \sigma_U^2 I_D\right), \qquad
V_j \sim \mathcal{N}\left(0, \sigma_V^2 I_D\right)
\]
\[
-\ln p(U, V \mid R_O) = \frac{1}{2\sigma^2} \left\| I \circ \left(R - U V^T\right) \right\|_F^2 + \frac{1}{2\sigma_U^2} \|U\|_F^2 + \frac{1}{2\sigma_V^2} \|V\|_F^2 + C
\]
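The negative log posterior above can be minimized directly to get a MAP point estimate. The following is a minimal sketch, assuming that I is a 0/1 mask of observed entries and using plain gradient descent; the step size, iteration count, initialization, and the names `neg_log_posterior` and `fit_map` are illustrative choices made here, not taken from the slides.

```python
import numpy as np

def neg_log_posterior(U, V, R, mask, sigma=1.0, sigma_u=1.0, sigma_v=1.0):
    """PMF negative log posterior, dropping the additive constant C."""
    resid = mask * (R - U @ V.T)  # zero out unobserved entries
    return (np.sum(resid ** 2) / (2 * sigma ** 2)
            + np.sum(U ** 2) / (2 * sigma_u ** 2)
            + np.sum(V ** 2) / (2 * sigma_v ** 2))

def fit_map(R, mask, D=2, steps=5000, lr=0.005,
            sigma=1.0, sigma_u=1.0, sigma_v=1.0, seed=0):
    """Plain gradient descent on the objective above (step size is a guess)."""
    rng = np.random.default_rng(seed)
    N, M = R.shape
    U = 0.1 * rng.standard_normal((N, D))
    V = 0.1 * rng.standard_normal((M, D))
    for _ in range(steps):
        resid = mask * (R - U @ V.T)
        grad_U = -(resid @ V) / sigma ** 2 + U / sigma_u ** 2
        grad_V = -(resid.T @ U) / sigma ** 2 + V / sigma_v ** 2
        U, V = U - lr * grad_U, V - lr * grad_V
    return U, V

R = np.array([[3., 2., 3., 5., 2.],
              [4., 2., 5., 5., 3.],
              [5., 4., 5., 5., 3.]])
mask = np.ones_like(R)
mask[0, 3] = mask[2, 1] = 0.0  # pretend two ratings are unobserved

# Same seed and draw order as fit_map, so this reproduces its starting point.
rng = np.random.default_rng(0)
U0 = 0.1 * rng.standard_normal((3, 2))
V0 = 0.1 * rng.standard_normal((5, 2))
before = neg_log_posterior(U0, V0, R, mask)
U, V = fit_map(R, mask)
after = neg_log_posterior(U, V, R, mask)
```

The result is a single (U, V) pair with no measure of uncertainty attached.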

PMF Limitations
\[
-\ln p(U, V \mid R_O) = \frac{1}{2\sigma^2} \left\| I \circ \left(R - U V^T\right) \right\|_F^2 + \frac{1}{2\sigma_U^2} \|U\|_F^2 + \frac{1}{2\sigma_V^2} \|V\|_F^2 + C
\]
• PMF is only really suited to a point estimate of U, V
• To do active learning, we need some information about our uncertainty in the model and/or the predictions

Variational PMF
One way to get posterior distribution info:
• Approximate the joint posterior p(U, V | R_O) with a parametric family q(U, V)
• Find the best parameters by minimizing the KL divergence

\[
\begin{aligned}
\mathrm{KL}(q \,\|\, p) &= \int q(U, V) \ln \frac{q(U, V)}{p(U, V \mid R_O)} \, d\{U, V\} \\
&= -H[q] - E_q\left[\ln p(U, V \mid R_O)\right] \\
&= -H[q] + C + \frac{1}{2\sigma_U^2} \sum_{i=1}^{N} \sum_{k=1}^{D} E_q\left[U_{ik}^2\right] + \frac{1}{2\sigma_V^2} \sum_{j=1}^{M} \sum_{k=1}^{D} E_q\left[V_{jk}^2\right] \\
&\quad + \frac{1}{2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \left( \sum_{k=1}^{D} \sum_{\ell=1}^{D} E_q\left[U_{ik} V_{jk} U_{i\ell} V_{j\ell}\right] - 2 R_{ij} \sum_{k=1}^{D} E_q\left[U_{ik} V_{jk}\right] + R_{ij}^2 \right)
\end{aligned}
\]

Variational PMF: full normal
• One option: a normal distribution over the vector of all entries in U, V
– Expectations we need are in closed form (Isserlis' theorem)
– Can optimize with projected gradient descent
– $O(D^2 (N+M)^2)$ memory, $O(D^3 (N+M)^3)$ time to project
[Figure: the entries of U and V stacked into one vector, with mean μ of size $D(N+M)$ and full covariance Σ of size $(D(N+M))^2$.]
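The fourth-order expectations in the KL objective are the terms that need Isserlis' theorem. A minimal sketch of that closed form for a jointly normal vector with nonzero mean follows; the function name `fourth_moment` and the index convention (positions into the stacked vector) are assumptions made here for illustration.

```python
import numpy as np

def fourth_moment(mu, Sigma, a, b, c, d):
    """E[x_a x_b x_c x_d] for x ~ N(mu, Sigma), via Isserlis' theorem.

    Writing x = mu + z with z zero-mean, the odd moments of z vanish, leaving
    a product of means, six mean-mean-covariance terms, and the three
    pairings of covariances.
    """
    m, S = mu, Sigma
    means_only = m[a] * m[b] * m[c] * m[d]
    mixed = (S[a, b] * m[c] * m[d] + S[a, c] * m[b] * m[d]
             + S[a, d] * m[b] * m[c] + S[b, c] * m[a] * m[d]
             + S[b, d] * m[a] * m[c] + S[c, d] * m[a] * m[b])
    pairings = S[a, b] * S[c, d] + S[a, c] * S[b, d] + S[a, d] * S[b, c]
    return means_only + mixed + pairings

# Scalar sanity check: for x ~ N(1, 2), E[x^4] = mu^4 + 6 mu^2 s^2 + 3 s^4.
scalar_case = fourth_moment(np.array([1.0]), np.array([[2.0]]), 0, 0, 0, 0)

# Independent pair: E[x0^2 x1^2] = (1 + 1) * (4 + 1) for means 1, 2 and unit variances.
indep_case = fourth_moment(np.array([1.0, 2.0]), np.eye(2), 0, 0, 1, 1)
```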

Variational PMF: fully factorized
• Another: assume each element of U and V is independent
– (Silva & Carin, KDD 2012)
– $O(D (N+M))$ memory, projection is trivial
[Figure: the entries of U and V stacked into one vector, with mean μ of size $D(N+M)$ and diagonal covariance Σ of size $D(N+M)$.]
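Under full factorization, the mean and variance of each inner product $U_i^T V_j$ follow from the moments of a product of independent normals. The sketch below assumes the variational parameters are stored as arrays of per-entry means and variances (hypothetical names `mu_U`, `var_U`, `mu_V`, `var_V`) and leaves out the observation noise $\sigma^2$.

```python
import numpy as np

def factorized_moments(mu_U, var_U, mu_V, var_V):
    """Mean and variance of U_i^T V_j when every entry is an independent normal.

    For independent X, Y: Var[XY] = mu_X^2 var_Y + mu_Y^2 var_X + var_X var_Y.
    The D terms of the inner product are independent, so their variances add.
    """
    mean = mu_U @ mu_V.T
    var = (mu_U ** 2) @ var_V.T + var_U @ (mu_V ** 2).T + var_U @ var_V.T
    return mean, var

# One user, one item, D = 1: X ~ N(2, 1), Y ~ N(3, 4).
mean, var = factorized_moments(np.array([[2.0]]), np.array([[1.0]]),
                               np.array([[3.0]]), np.array([[4.0]]))
```

The returned `var` array has one entry per (user, item) pair.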

Variational PMF: matrix normal
• In between: matrix normal over stacked U, V
– Decompose the covariance into a user/item covariance and a latent-dimension covariance
– Expectations / gradient descent basically the same
– $O(D^2 + (N+M)^2)$ memory, $O(D^3 + (N+M)^3)$ time to project
[Figure: U and V stacked into an $(N+M) \times D$ matrix with mean μ of size $D(N+M)$; the covariance is a Kronecker product (⊗) of a row covariance Σ of size $(N+M)^2$ and a column covariance Ω of size $D^2$.]
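To make the memory comparison concrete, here is a small sketch that counts covariance entries for both parameterizations and draws one matrix normal sample through Cholesky factors. The sizes, the example covariances, and the helper name `sample_matrix_normal` are assumptions for illustration only.

```python
import numpy as np

def sample_matrix_normal(mean, row_cov, col_cov, rng):
    """Draw X with Cov[vec(X)] = col_cov ⊗ row_cov (column-major vec)."""
    A = np.linalg.cholesky(row_cov)  # row_cov = A @ A.T
    B = np.linalg.cholesky(col_cov)  # col_cov = B @ B.T
    Z = rng.standard_normal(mean.shape)
    return mean + A @ Z @ B.T

N, M, D = 3, 5, 2
rows = N + M
full_cov_entries = (D * rows) ** 2     # full normal covariance
kron_cov_entries = D ** 2 + rows ** 2  # matrix normal: two small factors

rng = np.random.default_rng(0)
row_cov = np.eye(rows) + 0.1  # any symmetric positive definite matrix works
col_cov = np.array([[1.0, 0.5], [0.5, 1.0]])
X = sample_matrix_normal(np.zeros((rows, D)), row_cov, col_cov, rng)
```

With these toy sizes the full covariance has 256 entries and the Kronecker form 68.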

Markov chain Monte Carlo
Another way to get posterior info for PMF is to draw samples from it (approximately, asymptotically…).
BPMF (Salakhutdinov & Mnih, ICML 2008) lets the normal priors on U and V have arbitrary means and covariances, with Gaussian-Wishart hyperpriors.
– Can sample through Gibbs
– We use Hamiltonian MCMC with the No-U-Turn Sampler (Hoffman & Gelman, JMLR in press)

Myopic selection criteria
– Prediction: element with highest variance (uncertainty sampling): $\arg\max_{(i,j)} \mathrm{Var}[R_{ij}]$
– Model: ?
– Magnitude search: element with highest mean: $\arg\max_{(i,j)} E[R_{ij}]$
– Search: element with highest probability of being positive: $\arg\max_{(i,j)} P[R_{ij} \in +]$
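Given posterior samples of U and V (for example from MCMC), the three defined myopic criteria reduce to simple statistics over the sampled matrices. The sketch below is one possible reading: the function name `myopic_pick`, the array layout, and the treatment of "positive" as exceeding a hypothetical `threshold` are all assumptions made here, and observation noise is ignored.

```python
import numpy as np

def myopic_pick(U_samples, V_samples, observed, goal, threshold=3.5):
    """Pick the unobserved entry (i, j) maximizing a myopic criterion.

    U_samples: (S, N, D) and V_samples: (S, M, D) posterior samples.
    observed:  (N, M) boolean mask of entries that are already known.
    """
    # One predicted matrix U V^T per posterior sample: shape (S, N, M).
    R_samples = np.einsum("snd,smd->snm", U_samples, V_samples)
    if goal == "prediction":
        score = R_samples.var(axis=0)                   # Var[R_ij]
    elif goal == "magnitude":
        score = R_samples.mean(axis=0)                  # E[R_ij]
    elif goal == "search":
        score = (R_samples > threshold).mean(axis=0)    # P[R_ij positive]
    else:
        raise ValueError(f"unknown goal: {goal}")
    score = np.where(observed, -np.inf, score)          # never re-query known entries
    i, j = np.unravel_index(np.argmax(score), score.shape)
    return int(i), int(j)

# Two hand-written "samples" with N = M = 2 and D = 1.
U_s = np.array([[[1.0], [1.0]], [[1.0], [3.0]]])
V_s = np.array([[[1.0], [2.0]], [[1.0], [2.0]]])
nothing_observed = np.zeros((2, 2), dtype=bool)
pick_var = myopic_pick(U_s, V_s, nothing_observed, "prediction")
pick_mean = myopic_pick(U_s, V_s, nothing_observed, "magnitude")
pick_pos = myopic_pick(U_s, V_s, nothing_observed, "search")
```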

Lookahead criteria
Integrate over possible outcomes (Garnett et al., ICML 2012):
\[ \int \hat{P}(R_{ij} = x) \, E\left[ f(q) \mid R_O, R_{ij} = x \right] dx \]
– Prediction: entropy of predicted matrix: $f(q) = H[R]$
– Model: entropy of posterior over U and V: $f(q) = H[U, V]$
– Magnitude search: mean of found elements: $f(q) = R_{ij} + \max_{(k,l) \in P - (i,j)} E[R_{kl}]$
– Search: expected number of positives found: $f(q) = \mathbb{1}(R_{ij} \in +) + \max_{(k,l) \in P - (i,j)} P(R_{kl} \in +)$

Other work
• Only deals with the Prediction goal
• Substantial amount of work on active learning for recommender systems, especially the new-user case
• Little for general matrix factorization settings:
– Silva & Carin, KDD 2012
  • assumes fully factorized distribution: more limited model
  • handles much larger datasets
– Rish & Tesauro, ISAIM 2008 workshop
  • uses max-margin matrix factorization
  • picks points near the boundary

Toy problems
[Figure: panels for the matrix normal variational and MCMC methods, labeled $\mathrm{Var}[R_{ij}]$, $\mathrm{Var}_q[R_{ij}]$, $E[H[R]]$, and $E_q[H[U, V]]$.]

Toy problems
Prediction results on 10×10 rank-4 matrices, values 1 to 5.

Toy problems
Search results on 10×10 rank-4 matrices, values 1 to 5.

MovieLens
Most of MovieLens-100k: 472 users × 413 movies, ~60k ratings. Start with 5% known; test on a different 5%.
