User Profiling in Text-Based Recommender Systems Based on Distributed Word Representations

Anton Alekseev and Sergey I. Nikolenko

Steklov Institute of Mathematics at St. Petersburg
National Research University Higher School of Economics, St. Petersburg
Kazan (Volga Region) Federal University, Kazan, Russia
Deloitte Analytics Institute, Moscow, Russia

April 7, 2016
intro: word embeddings .
overview .
• Very brief overview of the paper:
  • we want to recommend full-text items to users;
  • in the input data, users like full-text items, and we would like to construct thematic user profiles from these likes;
  • to do so, we cluster the word embeddings of keywords;
  • then we propose a conceptual way to downweight meaningless clusters of common words.
word embeddings .
• In this work, we construct user profiles based on texts.
• To do so, we use distributed word representations (word embeddings).
• Distributed word representations map each word in the dictionary to a point in a Euclidean space, attempting to capture semantic relationships between words as geometric relationships between their vectors.
word embeddings .
• Started back in (Bengio et al., 2003), exploded after the works of Bengio et al. and Mikolov et al. (2009–2011), now used everywhere.
• Basic idea:
  • shallow neural networks trained to reconstruct contexts from words or words from contexts;
  • skip-gram: predict context words $c$ from the word $w$, modeling $p(c \mid w)$;
  • CBOW: predict the word $w$ from its context $c$, modeling $p(w \mid c)$;
  • GloVe: train a decomposition of the word co-occurrence matrix.
• Word embeddings serve as building blocks for neural network approaches to NLP.
word embeddings .
• Two main architectures: [figure: CBOW and skip-gram network diagrams]
• We use CBOW embeddings trained on a very large Russian dataset (thanks to Nikolay Arefyev and Alexander Panchenko!).
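As a side note, here is a minimal sketch of how CBOW embeddings like the ones used here could be trained with gensim; the corpus file name and hyperparameters are illustrative assumptions, not the settings used for the Russian model mentioned above.

```python
# Illustrative sketch: training CBOW word embeddings with gensim (4.x API;
# older versions use `size` instead of `vector_size`).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One tokenized sentence per line in "corpus_ru.txt" (hypothetical file).
sentences = LineSentence("corpus_ru.txt")

model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the embedding space
    window=5,          # context window size
    sg=0,              # sg=0 selects CBOW; sg=1 would select skip-gram
    min_count=5,
    workers=4,
)

vec = model.wv["пример"]                        # 300-dimensional word vector
print(model.wv.most_similar("пример", topn=5))  # nearest neighbors in the space
```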
methods .
tf-idf document profiles .
• We begin with baseline approaches.
• Using distributed representations trained on a huge Russian corpus, we:
  • clustered the word vectors, obtaining semantic clusters;
  • represented each document as the tf-idf-weighted sum of the embeddings of its words;
  • stored baseline user profiles as simple weighted sums of the users' liked documents in this representation;
  • trained baseline recommender algorithms that use these profiles: ranking by cosine similarity, user-based and item-based collaborative filtering.
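A minimal sketch of this baseline pipeline, assuming a dict `embeddings` mapping words to d-dimensional numpy vectors, a list of tokenized documents, and per-user lists of liked document indices; all names are illustrative, not the authors' code.

```python
import numpy as np
from collections import Counter

def idf_weights(docs):
    """Inverse document frequencies over a list of tokenized documents."""
    df = Counter(w for doc in docs for w in set(doc))
    n = len(docs)
    return {w: np.log(n / df[w]) for w in df}

def doc_vector(doc, embeddings, idf, dim):
    """tf-idf weighted sum of the word embeddings of a document (L2-normalized)."""
    vec, tf = np.zeros(dim), Counter(doc)
    for w, cnt in tf.items():
        if w in embeddings:
            vec += cnt * idf.get(w, 0.0) * embeddings[w]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def user_profile(liked_indices, doc_vecs):
    """Baseline profile: plain sum of the vectors of liked documents."""
    return doc_vecs[liked_indices].sum(axis=0)

def recommend_by_cosine(profile, doc_vecs, top_n=10):
    """Rank documents by cosine similarity to the user profile
    (rows of doc_vecs are already normalized)."""
    scores = doc_vecs @ profile / (np.linalg.norm(profile) + 1e-12)
    return np.argsort(-scores)[:top_n]
```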
new ideas and results .
• Main problem:
  • we have a clustering in the word vector space $\mathbb{R}^d$, which can also be applied to documents represented as vectors in $\mathbb{R}^d$;
  • we also have a users × documents likes matrix;
  • how do we best compress it into individual user profiles?
• We have tried decomposing this matrix with SVD and pLSA, but with no good results. Two problems:
  • there are only likes in the dataset, no dislikes;
  • "junk" clusters of common words always fill up user profiles, whatever we did.
new ideas and results .
• We can use the following natural idea:
  • represent a document as a vector of cluster likelihoods $p(c \mid d)$;
  • treat each user independently;
  • for every user, construct a logistic regression problem that models the probability of a like, with weights corresponding to clusters;
  • train the logistic regression; its weights constitute the user profile.
• But this also seems to suffer from the same problems: where do we get negative examples for the regression, and what do we do with "junk" clusters?
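A sketch of the cluster-likelihood document representation, assuming the word clusters were obtained with k-means over the embeddings (centroids in `centroids`); the softmax-like scoring over distances is an illustrative choice, not necessarily the one used in the paper.

```python
import numpy as np

def cluster_likelihoods(doc, embeddings, centroids, temperature=1.0):
    """Return a K-dimensional vector describing how strongly the document's
    words are associated with each of the K word clusters."""
    scores = np.zeros(len(centroids))
    for w in doc:
        if w not in embeddings:
            continue
        d = np.linalg.norm(centroids - embeddings[w], axis=1)
        scores += np.exp(-d / temperature)   # soft assignment of the word to clusters
    total = scores.sum()
    return scores / total if total > 0 else scores
```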
new ideas and results .
• We solve both problems with one stroke:
  • train several hundred balanced logistic regressions, choosing negative examples uniformly at random among not-liked items;
  • then use the statistics of the weights (e.g., mean and variance) as the user profile;
  • this way, each logistic regression is balanced;
  • moreover, junk clusters of common words will now often appear in negative examples too, so they will have significantly higher variance than informative clusters!
• Having constructed these profiles, how do we make recommendations?
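A sketch of this profile construction for a single user, assuming `doc_features` is the (documents × clusters) matrix of cluster likelihoods and `liked` is the set of document indices liked by that user; names and the number of resampled regressions are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def user_profile_stats(doc_features, liked, n_models=200, seed=0):
    """Per-cluster mean and variance of logistic regression weights over
    many balanced resamples; this pair of vectors is the user profile."""
    rng = np.random.default_rng(seed)
    liked = np.array(sorted(liked))
    not_liked = np.setdiff1d(np.arange(doc_features.shape[0]), liked)
    weights = []
    for _ in range(n_models):
        # Balanced problem: as many random negatives as there are likes
        # (assumes there are at least as many not-liked documents as likes).
        neg = rng.choice(not_liked, size=len(liked), replace=False)
        X = np.vstack([doc_features[liked], doc_features[neg]])
        y = np.concatenate([np.ones(len(liked)), np.zeros(len(neg))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        weights.append(clf.coef_[0])
    weights = np.array(weights)
    # "Junk" clusters also occur in the random negatives, so their weights
    # fluctuate between resamples and end up with high variance.
    return weights.mean(axis=0), weights.var(axis=0)
```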
new ideas and results .
• Recommender algorithm:
  • from the posterior distribution of the weights (we used a normal distribution with the posterior mean and variance), sample several hundred different weight combinations;
  • predict the probabilities of likes for all these combinations;
  • rank documents according to the mean predicted like probability.
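A sketch of this recommendation step, assuming the normal posterior approximation from the previous slide (per-cluster mean `mu` and variance `var` of the regression weights); names and sample counts are illustrative.

```python
import numpy as np

def recommend(doc_features, mu, var, n_samples=200, top_n=10, seed=0):
    rng = np.random.default_rng(seed)
    # Sample weight vectors from the approximate posterior N(mu, var).
    W = rng.normal(mu, np.sqrt(var), size=(n_samples, len(mu)))
    # Predicted like probabilities for every document under every sampled weight vector.
    probs = 1.0 / (1.0 + np.exp(-doc_features @ W.T))   # shape (docs, n_samples)
    mean_probs = probs.mean(axis=1)
    # Rank documents by mean predicted like probability.
    return np.argsort(-mean_probs)[:top_n]
```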
sample user profile .
#     μ      σ      Words
427   0.042  …      associate attitude seems quite horoscope ideal religious face era...
413   0.406  0.080  feel glad remember worrying offended jealous inhale pity envy suffer autumn...
366   0.385  0.073  hijack bombing raid to steal loot bomb...
798   0.385  0.080  uro missile air defense mine RL submarine Vaenga Red Banner Pacific Fleet...
…     0.396  0.165  youtube blog net mail facebook player online yandex user tor ado...
867   0.772  0.010  hours two-hour break minute half-hour five-minute two-hour ten-hour...
424   0.833  0.202  kissing call cry silent scream laughing nod dare restrain angry slam...
837   0.399  …      …
experimental evaluation .
algorithms .
• So far we compare three baseline algorithms and our regression-based algorithm:
  (1) cosine: find the documents nearest to a linear user profile with respect to cosine similarity;
  (2) user-based collaborative filtering: find the k nearest neighbors of a user and recommend documents according to their likes;
  (3) item-based collaborative filtering: find the k nearest neighbors of a document and recommend documents similar to the ones a user liked;
  (4) regression-based algorithm: sample weights according to the posterior distribution, recommend according to the averaged predictions.
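For completeness, a minimal sketch of the user-based CF baseline (item (2) above), assuming a binary users × documents like matrix `R`; the neighbor count and scoring are illustrative choices, not necessarily those used in the experiments.

```python
import numpy as np

def user_based_cf(R, user, k=20, top_n=10):
    """Recommend documents liked by the k users most similar to `user`."""
    norms = np.linalg.norm(R, axis=1) + 1e-12
    sims = (R @ R[user]) / (norms * norms[user])   # cosine similarity to all users
    sims[user] = -np.inf                           # exclude the user themselves
    neighbors = np.argsort(-sims)[:k]
    scores = sims[neighbors] @ R[neighbors]        # similarity-weighted like counts
    scores[R[user] > 0] = -np.inf                  # do not re-recommend liked items
    return np.argsort(-scores)[:top_n]
```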
evaluation: metrics .
Algorithm        AUC    NDCG   Top1   Top5   Top10
Cosine           0.514  0.779  0.511  2.471  4.757
User-based CF    0.456  0.686  0.101  1.418  3.851
Item-based CF    0.495  0.780  0.523  2.493  4.813
Regression       0.530  0.796  0.562  2.667  5.153
• In experimental evaluation, the regression-based recommender clearly outperforms all other methods.
• Demo...
thank you! .
Thank you for your attention!