  1. A log-linear model with latent features for dyadic prediction
     Aditya Krishna Menon and Charles Elkan
     University of California, San Diego
     December 17, 2010

  2. Outline
     Dyadic prediction: definition and goals
     A simple log-linear model for dyadic prediction
     Adding latent features to the log-linear model
     Experimental results
     Conclusion

  3. The movie rating prediction problem
     ◮ Given users' ratings of movies they have seen, predict ratings on the movies they have not seen
     ◮ Popular solution strategy is collaborative filtering: leverage everyone's ratings to determine individual users' tastes

  4. Generalizing the problem: dyadic prediction
     ◮ In dyadic prediction, our training set is $\{((r_i, c_i), y_i)\}_{i=1}^{n}$, where each pair $(r_i, c_i)$ is called a dyad, and each $y_i$ is a label
     ◮ Goal: predict the label $y'$ for a new dyad $(r', c')$
     ◮ Equivalently, matrix completion with the $r_i$'s as rows and the $c_i$'s as columns: a matrix with rows $r_1, \ldots, r_m$, columns $c_1, \ldots, c_n$, and "?" marking the unobserved entries
     ◮ The choice of $r_i$, $c_i$ and $y_i$ yields different problems
     ◮ In movie rating prediction, $r_i$ = user ID, $c_i$ = movie ID, and $y_i$ is the user's rating of the movie
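
To make the matrix-completion view concrete, here is a minimal sketch (not from the paper; the IDs, ratings, and helper name `to_dense` are made up) of a dyadic training set stored as (row, column, label) triples and viewed as a partially observed matrix:

```python
import numpy as np

# A toy dyadic training set: (user ID r_i, movie ID c_i, rating y_i) triples.
triples = [(0, 1, 5), (0, 3, 2), (1, 0, 4), (2, 2, 1)]

def to_dense(triples, num_users, num_movies):
    """View the triples as a partially observed ratings matrix.

    Unobserved dyads are NaN (the '?' entries in the matrix-completion view).
    """
    X = np.full((num_users, num_movies), np.nan)
    for r, c, y in triples:
        X[r, c] = y
    return X

print(to_dense(triples, num_users=3, num_movies=4))
```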

  5. Different instantiations of dyadic prediction
     ◮ Dyadic prediction captures problems in a range of fields:
       ◮ Collaborative filtering: will a user like a movie?
       ◮ Link prediction: do two people know each other?
       ◮ Item response theory: how will a person respond to a multiple-choice question?
       ◮ Political science: how will a senator vote on a bill?
       ◮ ...
     ◮ Broadly, two major ways to instantiate different problems:
       ◮ $r_i$, $c_i$ could be unique identifiers, feature vectors, or both
       ◮ $y_i$ could be ordinal (e.g. 1–5 stars) or nominal (e.g. {friend, colleague, family})

  6. Proposed desiderata of a dyadic prediction model
     ◮ Bolstered by the Netflix challenge, there has been significant effort on improving the accuracy of dyadic prediction models
     ◮ However, other factors have not received as much attention:
       ◮ Predicting well-calibrated probabilities over the labels, e.g. Pr[Rating = 5 stars | user, movie]
         ◮ Essential when we want to make decisions based on users' predicted preferences
       ◮ Ability to handle nominal labels in addition to ordinal ones
         ◮ e.g. user–user interactions of {friend, colleague, family}, user–item interactions of {viewed, purchased, returned}, ...
       ◮ Allowing both unique identifiers and feature vectors
         ◮ Helpful for accuracy and for cold-start dyads respectively
         ◮ Want them to complement each other's strengths

  7. This work
     ◮ We are interested in designing a simple yet flexible dyadic prediction model meeting these desiderata
     ◮ To this end, we propose a log-linear model with latent features (LFL)
       ◮ Mathematically simple to understand and train
       ◮ Able to exploit the flexibility of the log-linear framework
     ◮ Experimental results show that our model meets the new desiderata without sacrificing accuracy

  8. Outline
     Dyadic prediction: definition and goals
     A simple log-linear model for dyadic prediction
     Adding latent features to the log-linear model
     Experimental results
     Conclusion

  9. The log-linear framework
     ◮ Given inputs $x \in \mathcal{X}$ and labels $y \in \mathcal{Y}$, a log-linear model assumes the probability
       $$p(y \mid x; w) = \frac{\exp\left(\sum_i w_i f_i(x, y)\right)}{\sum_{y'} \exp\left(\sum_i w_i f_i(x, y')\right)}$$
       where $w$ is a vector of weights, and each $f_i : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a feature function
     ◮ Freedom to pick the $f_i$'s means this is a very flexible class of models
       ◮ Captures logistic regression, CRFs, ...
     ◮ A useful basis for a dyadic prediction model:
       ◮ Directly models probabilities of labels given examples
       ◮ Natural mechanism for combining identifiers and side-information descriptions of the inputs $x$
       ◮ Labels $y$ can be nominal
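
As a concrete illustration of the framework (not code from the paper), a minimal Python sketch of a log-linear model over a small label set; the feature functions, weights, and the input field "age" are all made up:

```python
import math

# Illustrative label set; x can be any input object with the fields the features expect.
LABELS = [1, 2, 3, 4, 5]

def feature_functions(x, y):
    # Toy features f_i(x, y): a per-label bias and a per-label interaction with one input field.
    bias = [1.0 if y == lab else 0.0 for lab in LABELS]
    side = [x["age"] if y == lab else 0.0 for lab in LABELS]
    return bias + side

def log_linear_prob(x, w):
    """p(y | x; w) = exp(sum_i w_i f_i(x, y)) / sum_y' exp(sum_i w_i f_i(x, y'))."""
    scores = {y: sum(wi * fi for wi, fi in zip(w, feature_functions(x, y)))
              for y in LABELS}
    m = max(scores.values())                     # subtract the max for numerical stability
    unnorm = {y: math.exp(s - m) for y, s in scores.items()}
    Z = sum(unnorm.values())
    return {y: v / Z for y, v in unnorm.items()}

w = [0.1] * 10                                   # one weight per feature function
print(log_linear_prob({"age": 0.5}, w))
```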

  10. A simple log-linear model for dyadic prediction
     ◮ For a dyad $x$ with members $(r(x), c(x))$ that are unique identifiers, we can construct sets of indicator feature functions:
       $$f^1_{ry'}(x, y) = \mathbf{1}[r(x) = r, y = y'] \qquad f^2_{cy'}(x, y) = \mathbf{1}[c(x) = c, y = y'] \qquad f^3_{y'}(x, y) = \mathbf{1}[y = y']$$
     ◮ For simplicity, we'll call each $r(x)$ a user, each $c(x)$ a movie, and each $y$ a rating
     ◮ Using these feature functions yields the probability model
       $$p(y \mid x; w) = \frac{\exp(\alpha^y_{r(x)} + \beta^y_{c(x)} + \gamma^y)}{\sum_{y'} \exp(\alpha^{y'}_{r(x)} + \beta^{y'}_{c(x)} + \gamma^{y'})}$$
       where $w = \{\alpha^y_r\} \cup \{\beta^y_c\} \cup \{\gamma^y\}$ for simplicity
     ◮ $\alpha^y_{r(x)}$ = affinity of user $r(x)$ for rating $y$, and so on
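
A minimal sketch (not the authors' implementation) of this simple model's predicted distribution, with per-user, per-movie, and global per-rating weights stored as NumPy arrays; the names alpha, beta, gamma mirror the slide's notation and the sizes and random values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_movies, num_ratings = 3, 4, 5     # illustrative sizes

# alpha[r, y]: affinity of user r for rating y; beta[c, y]: same for movie c;
# gamma[y]: global bias for rating y.
alpha = rng.normal(size=(num_users, num_ratings))
beta = rng.normal(size=(num_movies, num_ratings))
gamma = rng.normal(size=num_ratings)

def simple_log_linear_prob(r, c):
    """p(y | (r, c); w) for the simple (no-interaction) log-linear model."""
    scores = alpha[r] + beta[c] + gamma          # one score per rating value y
    scores -= scores.max()                       # numerical stability
    p = np.exp(scores)
    return p / p.sum()

print(simple_log_linear_prob(r=0, c=2))          # distribution over the 5 ratings
```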

  11. Incorporating side-information into the model
     ◮ If the dyad $x$ has a vector $s(x)$ of side-information, we can simply augment our probability model to use this information:
       $$p(y \mid x; w) = \frac{\exp(\alpha^y_{r(x)} + \beta^y_{c(x)} + \gamma^y + (\delta^y)^T s(x))}{\sum_{y'} \exp(\alpha^{y'}_{r(x)} + \beta^{y'}_{c(x)} + \gamma^{y'} + (\delta^{y'})^T s(x))}$$
     ◮ Additional weights $\{\delta^y\}$ used to exploit the extra information
     ◮ Corresponds to adding more feature functions based on $s(x)$
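
The same kind of sketch with the side-information term added (again illustrative rather than the paper's code): each rating value gets a weight vector delta that is dotted with the dyad's side-information vector s:

```python
import numpy as np

rng = np.random.default_rng(1)
num_users, num_movies, num_ratings, side_dim = 3, 4, 5, 2   # illustrative sizes

# Per-user / per-movie / global weights as before, plus delta[y]: weights on s(x) for rating y.
alpha = rng.normal(size=(num_users, num_ratings))
beta = rng.normal(size=(num_movies, num_ratings))
gamma = rng.normal(size=num_ratings)
delta = rng.normal(size=(num_ratings, side_dim))

def prob_with_side_info(r, c, s):
    """p(y | (r, c); w) for the simple model augmented with a (delta^y)^T s(x) term."""
    scores = alpha[r] + beta[c] + gamma + delta @ s
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

print(prob_with_side_info(r=0, c=2, s=np.array([1.0, -0.5])))
```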

  12. Are we done?
     ◮ This log-linear model is conceptually and practically simple
       ◮ Parameters can be learnt by optimizing conditional log-likelihood using stochastic gradient descent
     ◮ But some questions remain:
       ◮ Is it rich enough to be a useful method?
       ◮ Is it suitable for ordinal labels?
     ◮ In fact, the model is not sufficiently expressive: there is no interaction between users' and movies' weights
       ◮ The ranking of all movies $c_1, \ldots, c_n$ according to the probability $p(y \mid x; w)$ is independent of the user!

  13. Outline
     Dyadic prediction: definition and goals
     A simple log-linear model for dyadic prediction
     Adding latent features to the log-linear model
     Experimental results
     Conclusion

  16. Capturing interaction effects: the LFL model
     ◮ To explicitly model interactions between users and movies, we modify the probability distribution:
       $$p(y \mid x; w) = \frac{\exp\left(\sum_{k=1}^{K} \alpha^y_{r(x)k} \beta^y_{c(x)k} + \gamma^y\right)}{\sum_{y'} \exp\left(\sum_{k=1}^{K} \alpha^{y'}_{r(x)k} \beta^{y'}_{c(x)k} + \gamma^{y'}\right)}$$
     ◮ For each rating value $y$, we keep a matrix $\alpha^y \in \mathbb{R}^{|\mathcal{R}| \times K}$ of weights, and similarly for movies
     ◮ Thus user $r$ has an associated vector $\alpha^y_r \in \mathbb{R}^K$, so that
       $$p(y \mid x; w) \propto \exp\left((\alpha^y_{r(x)})^T \beta^y_{c(x)} + \gamma^y\right)$$
     ◮ We think of $\alpha^y_{r(x)}$, $\beta^y_{c(x)}$ as latent feature vectors, and so we call the model latent feature log-linear, or LFL
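
A minimal sketch of the LFL prediction rule under the slide's definitions, with one latent-feature matrix per rating value; this is an illustration rather than the authors' code, and the sizes and random initial values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_movies, num_ratings, K = 3, 4, 5, 2   # illustrative sizes

# One latent-feature matrix per rating value y:
# alpha[y] is |R| x K (users), beta[y] is |C| x K (movies), gamma[y] is a scalar bias.
alpha = rng.normal(scale=0.1, size=(num_ratings, num_users, K))
beta = rng.normal(scale=0.1, size=(num_ratings, num_movies, K))
gamma = rng.normal(scale=0.1, size=num_ratings)

def lfl_prob(r, c):
    """p(y | (r, c); w) proportional to exp((alpha^y_r)^T beta^y_c + gamma^y)."""
    scores = np.array([alpha[y, r] @ beta[y, c] + gamma[y] for y in range(num_ratings)])
    scores -= scores.max()                            # numerical stability
    p = np.exp(scores)
    return p / p.sum()

print(lfl_prob(r=0, c=2))
```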

  17. LFL and matrix factorization
     ◮ The LFL model is a matrix factorization, but in log-odds space: if $P^{yy'}_{rc} := \log \frac{p(y \mid (r,c); w)}{p(y' \mid (r,c); w)}$, then (up to the bias terms $\gamma$)
       $$P^{yy'} = \alpha^y (\beta^y)^T - \alpha^{y'} (\beta^{y'})^T$$
     ◮ Fixing some $y_0$ as the base class, with $\alpha^{y_0} \equiv \beta^{y_0} \equiv 0$:
       $$Q^y := P^{yy_0} = \alpha^y (\beta^y)^T$$
     ◮ Therefore, we have a series of factorizations, one for each possible rating $y$
     ◮ We will combine these factorizations in a slightly different way than in standard collaborative filtering
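
A one-line expansion of this claim (my own derivation step, using the LFL distribution from the previous slide) shows why the shared softmax normalizer cancels in the log-odds:

$$\log \frac{p(y \mid (r,c); w)}{p(y' \mid (r,c); w)} = \log \frac{\exp\big((\alpha^y_r)^T \beta^y_c + \gamma^y\big) / Z_{rc}}{\exp\big((\alpha^{y'}_r)^T \beta^{y'}_c + \gamma^{y'}\big) / Z_{rc}} = (\alpha^y_r)^T \beta^y_c - (\alpha^{y'}_r)^T \beta^{y'}_c + \gamma^y - \gamma^{y'},$$

where $Z_{rc} = \sum_{y''} \exp\big((\alpha^{y''}_r)^T \beta^{y''}_c + \gamma^{y''}\big)$ is the normalizer; the leftover $\gamma^y - \gamma^{y'}$ is a constant offset that does not affect the low-rank structure.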

  18. Using the model: prediction and training
     ◮ The model's prediction, and in turn the training objective, both depend on whether the labels $y_i$ are nominal or ordinal
     ◮ In both cases, as with the simple model, we can use stochastic gradient descent for large-scale optimization
     ◮ We'll study both cases in turn under the following setup:
       Input: matrix $X$ with observed entries $\mathcal{O}$, with $X_{rc}$ being the training-set label for dyad $(r, c)$
       Output: prediction matrix $\hat{X}$ with unobserved entries filled in

  19. Prediction and training: nominal labels
     ◮ For nominal labels, we predict the mode of the distribution: $\hat{X}_{rc} = \operatorname{argmax}_y \, p(y \mid (r, c); w)$
     ◮ We use conditional log-likelihood as the objective, which does not impose any structure on the labels:
       $$\mathrm{Obj}_{\mathrm{nom}} = -\sum_{(r,c) \in \mathcal{O}} \log p(X_{rc} \mid (r, c); w) + \sum_y \left( \frac{\lambda_\alpha}{2} \|\alpha^y\|_F^2 + \frac{\lambda_\beta}{2} \|\beta^y\|_F^2 \right)$$
     ◮ We use $\ell_2$ regularization of parameters to prevent overfitting
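
Below is a compact, illustrative sketch of this training setup for the LFL model with nominal labels: the regularized negative log-likelihood for one observed dyad and a plain stochastic-gradient-descent update. It is a from-scratch reconstruction under the slide's definitions; the learning rate, sizes, toy data, and the choice to apply the regularizer per touched row are my own, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_movies, num_ratings, K = 50, 40, 5, 3       # illustrative sizes
lam_alpha = lam_beta = 0.1                                 # regularization strengths
lr = 0.05                                                  # SGD learning rate

alpha = 0.01 * rng.normal(size=(num_ratings, num_users, K))
beta = 0.01 * rng.normal(size=(num_ratings, num_movies, K))
gamma = np.zeros(num_ratings)

def probs(r, c):
    """p(y | (r, c); w) under the LFL model."""
    scores = np.einsum("yk,yk->y", alpha[:, r], beta[:, c]) + gamma
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

def sgd_step(r, c, y_obs):
    """One SGD step on -log p(X_rc | (r, c); w) plus the l2 penalty for this dyad's rows."""
    p = probs(r, c)
    err = p.copy()
    err[y_obs] -= 1.0                                      # d(-log p)/d(score_y) = p_y - 1[y = y_obs]
    for y in range(num_ratings):
        grad_a = err[y] * beta[y, c] + lam_alpha * alpha[y, r]
        grad_b = err[y] * alpha[y, r] + lam_beta * beta[y, c]
        alpha[y, r] -= lr * grad_a
        beta[y, c] -= lr * grad_b
        gamma[y] -= lr * err[y]

# Toy training loop over made-up observed dyads.
observed = [(rng.integers(num_users), rng.integers(num_movies), rng.integers(num_ratings))
            for _ in range(200)]
for epoch in range(5):
    for r, c, y in observed:
        sgd_step(r, c, y)

# Predict the mode of the distribution for an unobserved dyad.
print(int(np.argmax(probs(0, 1))))
```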
