Bayesian Modeling for Analyzing Online Content and Users
Bin Bi
Computer Science Department, University of California, Los Angeles
bbi@cs.ucla.edu

Online Content Explosion
[Figure: infographic by Domo]

Information Overload
The sheer amount of online content
Blessing
• Investigating topics of interest
• Checking facts
• Getting advice about problems
Curse
• Confusion
• Sub-optimum decisions
• Dissatisfaction

Goal: Learning to discover high-quality information
❑ Two schemes
1. Discern good content from the bad
2. Identify users who generate high-quality content
❑ Two domains
1. Social media
2. Search engine

Influencer Discovery on Microblogs
[Figure: the query "healthy food" goes into SKIT, which returns Michelle Obama, Jimmy Oliver, …]
Scalable Topic-specific Influence Analysis on Microblogs
Bin Bi, Yuanyuan Tian, Yannis Sismanis, Andrey Balmin, Junghoo Cho [WSDM 2014]

Motivation
❑ Popular microblogging sites produce a huge amount of textual and social information
• As reported in March 2012, Twitter had over 500 million users creating over 340 million tweets daily
❑ A popular resource for marketing campaigns
• Monitor opinions of consumers
• Launch viral advertising
❑ Identifying social influencers is crucial for market analysis

Existing Social Influence Analysis
❑ Most existing influence analyses are based only on network structure
• e.g., influence maximization work [Kleinberg et al., KDD’03]
• Valuable textual content is ignored
• Only one global influence score is computed for each user
❑ But we need
• Topic-specific influence based on both network and content
• Influence differentiated across different aspects of life (topics)

Existing Topic-Specific Influence Analysis
❑ Separate analysis of content and networks
• e.g., Topic-sensitive PageRank (TSPR), TwitterRank
[Figure: pipeline with stages "Text analysis on content", "Topic collection", "Influence analysis on network"]
Problem: Content and links are often correlated in microblog networks
• A user tends to follow another who tweets on similar topics

Solution Overview
❑ Goal: Identify topic-specific key influencers on microblogs
[Figure: the query "healthy food" goes into SKIT, which returns Michelle Obama, Jimmy Oliver, …]
❑ Leverage both content and network
❑ Tight coupling of content and network analysis
❑ Followship-LDA (FLDA) Model
• Our topic-specific influence model for microblog networks
• Solid probabilistic foundation in Bayesian modeling

Review of Typical LDA
❑ Latent Dirichlet Allocation (LDA) [Blei et al., JMLR ’03] is a generative topic model for latent topic discovery in a text corpus
❑ Intuition:
[Figure: illustration from Blei’s slides]

Review of Typical LDA (cont’d)
[Figure labels: Topic assignments: z; Per-document topic distribution: θ; Per-topic word distribution: ϕ]
• Each topic is a distribution over words
• Each document is a distribution over topics
• Each word is drawn from one of those topics

Review of Typical LDA (cont’d)
[Figure labels: Topic assignments: z; Per-document topic distribution: θ; Per-topic word distribution: ϕ]
• In reality, we only observe the documents
• The rest of the structure consists of hidden variables

Statistical Modeling
❑ Generative probabilistic modeling
• Treats data as observations
• Contains hidden variables
• Specifies a probabilistic procedure by which the observations are generated
❑ Inference
• Input: observed data
• Output: values of hidden variables

Graphical Model for LDA
[Figure: plate diagram over α, θ, z, w, φ, β with plates N_m, M, and K]
Generative Process for LDA
❑ For each document d
▪ Sample θ_d from Dirichlet(α)
▪ For each word position in d
• Sample a topic z from θ_d
• Sample a word w from φ_z
Notation
θ_d: topic distribution for document d
φ_z: word distribution for topic z
M: number of documents
N_m: number of words in document m
K: number of topics

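The generative process for LDA can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names, the symmetric scalar hyperparameters `alpha` and `beta`, and the fixed document length are all assumptions. Drawing each φ_z from Dirichlet(β) follows standard LDA; the slide lists β in the diagram but does not spell out that step.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, doc_len, vocab, num_topics, alpha, beta, seed=0):
    """Generate (theta, [(topic, word), ...]) pairs following the LDA process."""
    rng = random.Random(seed)
    # phi[z]: word distribution of topic z (assumed drawn from Dirichlet(beta)).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # theta: topic distribution of this document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(num_topics), weights=theta)[0]  # topic from theta
            w = rng.choices(vocab, weights=phi[z])[0]             # word from phi_z
            doc.append((z, w))
        corpus.append((theta, doc))
    return phi, corpus
```

Inference runs this process in reverse: only the words are observed, and θ, φ, and z must be recovered from them.
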
Topic-specific Influence Analysis on Microblogs
❑ Intuition
① Each microblog user has tweet text and a set of followees
User    Content                                  Followees
Alice   web, organic, veggie, cookie, cloud, ……  Michelle Obama, Mark Zuckerberg, Barack Obama
② A user tweets on multiple topics
• Alice tweets about technology and food
③ A topic is a distribution over words
• web and cloud are more likely to appear in the technology topic

Topic-specific Influence Analysis on Microblogs
❑ Intuition (cont’d)
④ A user follows another for different reasons
• Content-based: follow for similar topics
• Content-independent: follow for popularity
⑤ Topics of content-based followships differ from each other
• Mark Zuckerberg is more likely to be followed for the technology topic
• Topic-specific influence: the probability of a user being followed for a given topic

Followship-LDA (FLDA)
❑ FLDA: A fully Bayesian generative model specifically designed for microblog networks
• Specifies a stochastic process by which the content and links of each user are generated
• Introduces hidden structure
▪ Topics, reasons for a user following another, topic-specific influence, etc.
• Inference: reverse the generative process
▪ What hidden structure is most likely to have generated the observed data?

Hidden Structure in FLDA

❑ Per-user topic distribution θ
User      Tech   Food   Poli
Alice     0.8    0.19   0.01
Bob       0.6    0.3    0.1
Mark      …      …      …
Michelle  …      …      …
Barack    …      …      …

❑ Per-topic word distribution ϕ
Topic  web    cookie  veggie  organic  congress  …
Tech   0.3    0.1     0.001   0.001    0.001     …
Food   0.001  0.15    0.3     0.1      0.001     …
Poli   0.005  0.001   0.001   0.002    0.25      …

❑ Per-user followship preference µ
User      Follow for Topic  Follow for Pop.
Alice     0.75              0.25
Bob       0.5               0.5
Mark      …                 …
Michelle  …                 …
Barack    …                 …

❑ Per-topic followee distribution σ (topic-specific influence)
Topic  Alice  Bob   Mark  Michelle  Barack
Tech   0.1    0.05  0.7   0.05      0.1
Food   0.04   0.1   0.06  0.75      0.05
Poli   0.03   0.02  0.05  0.1       0.8

❑ Global followee distribution π (global popularity)
        Alice  Bob    Mark   Michelle  Barack
Global  0.005  0.001  0.244  0.25      0.5

Graphical Model for FLDA
[Figure: plate notation with user, tweet, and link plates]
For the m-th user:
❑ Pick a topic distribution θ_m
❑ Pick a followship preference µ_m
❑ For the n-th word position
• Pick a topic z from topic distribution θ_m
• Pick a word w from word distribution φ_z
❑ For the l-th followee
• Choose the cause based on followship preference µ_m
• If content-related
▪ Pick a topic z from topic distribution θ_m
▪ Pick a followee from per-topic followee distribution σ_z
• Otherwise
▪ Pick a followee from global followee distribution π
Example (Alice)
• Topic distribution: Tech: 0.8, Food: 0.19, Poli: 0.01 (sampled topic: Tech)
• Followship preference: Follow for content: 0.75, not: 0.25 (sampled cause: Follow for content)
• Content: web, organic, …
• Followees: Michelle Obama, Barack Obama, …

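The per-user steps above can be traced with a small simulation. The sketch below is illustrative only: it reuses the toy σ and π values from the "Hidden Structure in FLDA" slide, restricts each word distribution to the five listed words (the remaining mass is elided on the slide, so the weights are effectively renormalized, which is an assumption), and fixes θ and µ to Alice's example values instead of drawing them from priors. The function name `generate_user` is hypothetical.

```python
import random

TOPICS = ["Tech", "Food", "Poli"]
USERS = ["Alice", "Bob", "Mark", "Michelle", "Barack"]
WORDS = ["web", "cookie", "veggie", "organic", "congress"]

# Per-topic word weights, truncated to the listed words (assumption: renormalized).
PHI = {
    "Tech": [0.3, 0.1, 0.001, 0.001, 0.001],
    "Food": [0.001, 0.15, 0.3, 0.1, 0.001],
    "Poli": [0.005, 0.001, 0.001, 0.002, 0.25],
}
# Per-topic followee distribution (topic-specific influence).
SIGMA = {
    "Tech": [0.1, 0.05, 0.7, 0.05, 0.1],
    "Food": [0.04, 0.1, 0.06, 0.75, 0.05],
    "Poli": [0.03, 0.02, 0.05, 0.1, 0.8],
}
# Global followee distribution (global popularity).
PI = [0.005, 0.001, 0.244, 0.25, 0.5]

def generate_user(theta, mu_content, num_words, num_followees, rng):
    """Generate one user's words and (followee, topic-or-None) pairs."""
    words = []
    for _ in range(num_words):
        z = rng.choices(TOPICS, weights=theta)[0]          # topic from theta_m
        words.append(rng.choices(WORDS, weights=PHI[z])[0])  # word from phi_z
    followees = []
    for _ in range(num_followees):
        if rng.random() < mu_content:                      # content-related follow
            z = rng.choices(TOPICS, weights=theta)[0]
            followees.append((rng.choices(USERS, weights=SIGMA[z])[0], z))
        else:                                              # follow for popularity
            followees.append((rng.choices(USERS, weights=PI)[0], None))
    return words, followees

rng = random.Random(42)
alice_words, alice_followees = generate_user([0.8, 0.19, 0.01], 0.75, 10, 5, rng)
```

Each followee is returned together with the topic that caused the follow, or `None` for a popularity-driven follow; these causes are exactly the latent variables that inference has to recover.
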
Bayesian Learning for FLDA
❑ Gibbs Sampling
• A Markov chain Monte Carlo algorithm
1. Begin with some initial value for each latent variable
2. Iteratively sample each variable conditioned on the current values of the other variables
3. Use the samples to approximate the posterior distribution

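To make the three steps concrete, here is a minimal Gibbs sampler for a toy target that has nothing to do with FLDA: a standard bivariate normal with correlation `rho`, whose conditionals are known in closed form (x given y is normal with mean rho·y and variance 1 − rho²). The function names and the burn-in length are assumptions chosen for illustration.

```python
import random

def gibbs_bivariate_normal(rho, num_samples, burn_in=500, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho * rho) ** 0.5
    x, y = 0.0, 0.0                       # step 1: initial values
    samples = []
    for step in range(burn_in + num_samples):
        x = rng.gauss(rho * y, cond_sd)   # step 2: sample x given current y
        y = rng.gauss(rho * x, cond_sd)   #         sample y given the new x
        if step >= burn_in:
            samples.append((x, y))        # step 3: keep samples after burn-in
    return samples

def correlation(samples):
    """Empirical Pearson correlation of the sampled (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
    return cov / (var_x * var_y) ** 0.5
```

With enough samples, `correlation(gibbs_bivariate_normal(0.8, 20000))` should land close to 0.8, which is how the samples approximate the target distribution.
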
Gibbs Sampler for FLDA
❑ Derived conditionals for the FLDA Gibbs sampler
• Probability of the topic of the n-th word of the m-th user, given the current values of all other variables
• Probability of the topic of the l-th content-based followship of the m-th user, given the current values of all other variables
• Probability that the l-th followship of the m-th user is independent of content, given the current values of all other variables
[Equations not recoverable from the slide text]

Gibbs Sampler for FLDA (cont’d)
❑ In each pass over the data, for the m-th user
• Sample latent variables from their respective conditionals
• Keep counters while sampling
▪ c^m_{w,z}: number of times w is assigned to z for the m-th user
▪ d^m_{e,z}: number of times e is assigned to z for the m-th user
❑ Estimate distributions for the hidden structure
• Per-user topic distribution
• Per-topic followee distribution (influence)
[Equations not recoverable from the slide text]

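The estimation formulas did not survive in the slide text, so the sketch below only shows the general pattern commonly used with collapsed Gibbs samplers for LDA-style models: add a Dirichlet pseudo-count to each counter and normalize. The function name, the prior values, the counter values, and the way counts are aggregated per user or per topic are all assumptions, not the talk's actual estimators.

```python
def normalize_counts(counts, prior):
    """Turn raw assignment counts into a smoothed probability distribution."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]

# Hypothetical topic-assignment counts for one user (Tech, Food, Poli).
topic_counts_alice = [8, 2, 0]
theta_alice = normalize_counts(topic_counts_alice, prior=0.1)

# Hypothetical counts of how often each user (Alice, Bob, Mark, Michelle,
# Barack) was assigned as a followee under the Food topic, over all followers.
food_followee_counts = [1, 2, 1, 15, 1]
sigma_food = normalize_counts(food_followee_counts, prior=0.01)
```

The pseudo-count keeps unseen outcomes (such as Alice's zero Poli count) at a small positive probability instead of zero.
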
Distributed Gibbs Sampling for FLDA
❑ Challenge: the Gibbs Sampling process is sequential
• Each sampling step relies on the most recent values of all other variables
❑ Observation: the dependency between variable assignments is weak
❑ Solution: Distributed Gibbs Sampling for FLDA
• Relax the sequential requirement of Gibbs Sampling
• Implemented on Spark

Search Framework for Topic-Specific Influencers
❑ SKIT: search framework for topic-specific key influencers
• Input: free text (e.g., healthy food)
• Output: ranked list of key influencers (e.g., Michelle Obama, Jimmy Oliver, …)
① Derive the topics of interest from the input text
② Compute an influence score for each user
③ Sort users by influence score and return the top influencers

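Steps ② and ③ can be sketched as follows. The slide does not say how the scores are computed, so this example assumes one plausible rule: a user's score is the per-topic influence σ weighted by the query's topic distribution. The query topic weights for "healthy food" are made up, σ reuses the toy values from the "Hidden Structure in FLDA" slide, and `rank_influencers` is a hypothetical name.

```python
USERS = ["Alice", "Bob", "Mark", "Michelle", "Barack"]

# Toy per-topic followee distribution (topic-specific influence).
SIGMA = {
    "Tech": [0.1, 0.05, 0.7, 0.05, 0.1],
    "Food": [0.04, 0.1, 0.06, 0.75, 0.05],
    "Poli": [0.03, 0.02, 0.05, 0.1, 0.8],
}

def rank_influencers(topic_weights, sigma, users, top_k=3):
    """Score users by topic-weighted influence and return the top_k."""
    scores = {user: 0.0 for user in users}
    for topic, weight in topic_weights.items():
        for user, influence in zip(users, sigma[topic]):
            scores[user] += weight * influence      # assumed scoring rule
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Step 1 stand-in: hypothetical topic weights derived from "healthy food".
query_topics = {"Tech": 0.05, "Food": 0.9, "Poli": 0.05}
top_influencers = rank_influencers(query_topics, SIGMA, USERS)
```

With these toy numbers the query is dominated by the Food topic, so Michelle ranks first.
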