Applying Link-based Classification to Label Blogs Graham Cormode - PowerPoint PPT Presentation

Applying Link-based Classification to Label Blogs
Graham Cormode, Smriti Bhagat, Irina Rozenbaum

Blogs as Multigraphs

Many interesting new data sources are best modelled as multigraphs, with multiple attributes and link types. “Blogs” are an important emerging example of such data:
• Intersect with web, email, chat data, social networks
• React rapidly to major news, defining opinion and identifying articles of interest
• Raise problems of trustworthiness, finding leaders, classifying for expertise and bias
We study labeling problems on these large multigraphs.

[Figure: annotated blog page showing timestamp, headline, text, profile data, links, author, tags, and reader comments (with commenter id and timestamp). Static links: “blogroll”.]

[Figure: annotated profile page showing personal info (A/S/L: Age, Sex, Location), free-text info, links to friends on the same host, and instant messenger and email ids.]

Learning Labels on Multigraphs
• Blogs, blog links, web links, comments etc. implicitly define a (massive) multigraph
• We focus on problems of learning labels
• Our focus is on properties of the blog author such as age
• As with all supervised learning, cannot always trust the training data… apparently some people lie about their age

[Figure: multigraph with webpage, blog, and blog entry nodes; blog nodes carry numeric labels (31, 22, 33) and one blog is unlabeled (?).]
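To make the multigraph idea concrete, here is a minimal sketch of one possible in-memory representation. The class name `Multigraph`, the node kinds ("blog", "web"), and the link-type strings are illustrative assumptions and are not taken from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Multigraph:
    """Hypothetical container: typed nodes, optional labels, typed edges."""
    node_type: dict = field(default_factory=dict)  # node id -> "blog" or "web"
    label: dict = field(default_factory=dict)      # node id -> known label (e.g. age)
    # (source node, link type) -> list of destination nodes
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node, kind, label=None):
        self.node_type[node] = kind
        if label is not None:
            self.label[node] = label

    def add_edge(self, src, dst, link_type):
        # Two nodes may be joined by several edges of different types.
        self.edges[(src, link_type)].append(dst)

    def neighbors(self, node, link_types):
        # Collect neighbors over the chosen combination of link types.
        return [d for t in link_types for d in self.edges.get((node, t), [])]
```

Restricting `neighbors` to a chosen list of link types is one way to switch between edge combinations such as friends only or blog plus web.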

Prior Work on (Multi)graph Learning
• Relational learning: classify objects represented by a relational database (see work by Getoor et al.)
• Typically builds complex models, e.g. Relational Markov Networks, on relatively small examples (a few thousand nodes)
• Our problem is also an instance of semi-supervised learning (input is a mix of labeled and unlabeled examples)
• Several works apply matrix decomposition, which does not scale well to massive (multi)graphs
• Some work on similar labeling problems on the web graph in addition to text (Chakrabarti et al., 1998)

Simple Learning on Graphs

Local: Iterative
Hypothesis: Nodes point to other nodes with similar labels (homophily)
• Label is computed from the votes by its neighbors
• Labels are computed iteratively using weighted voting by neighbors

Global: Nearest Neighbor
Hypothesis: Nodes with similar neighborhoods have similar labels (co-citation regularity)
• Label is inferred by searching for similar neighborhoods of labeled nodes

[Figure: two example graphs with nodes labeled by age (values such as 18, 19, 20, 29, 31, 32), one per method.]
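The local iterative method can be sketched as follows. This is only an illustration under stated assumptions: every vote has equal weight, updates are synchronous, ties go to the smallest label, and the function name `iterative_label` and the `rounds` parameter are hypothetical rather than the authors' implementation.

```python
from collections import Counter


def iterative_label(out_links, known, rounds=5):
    """out_links: node -> list of neighbors; known: node -> training label."""
    labels = dict(known)
    for _ in range(rounds):
        updates = {}
        for node, nbrs in out_links.items():
            if node in known:
                continue  # training labels stay fixed
            # Each currently labeled neighbor casts one vote for its label.
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                # Majority vote; ties are broken toward the smallest label.
                updates[node] = max(sorted(votes), key=votes.get)
        if all(labels.get(n) == lab for n, lab in updates.items()):
            break  # nothing changed in this round
        labels.update(updates)
    return labels
```

Labels spread one hop per round, so a node that only points to unlabeled nodes can still receive a label once those nodes have been labeled in an earlier round.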

Extend Learning to Multigraphs

Iterative: Pseudo Labels
Hypothesis: Web pages link similar communities of bloggers
• Webpages assigned a pseudo label, based on votes by its neighbors

Nearest Neighbor: Set Similarity
Hypothesis: Distance computation is improved with additional features
• Augment distance with similarity between sets of neighboring web-nodes

[Figure: example graphs with age-labeled blog nodes (18, 19, 20), unlabeled nodes (?), and web nodes w, w1, w2, w3.]
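The pseudo-label idea might look like the sketch below: each web page takes the majority label of the labeled blogs that link to it, and an unlabeled blog then votes over the pseudo labels of the pages it links to. The helper names (`majority`, `pseudo_labels`, `label_via_web`) and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def majority(values):
    """Most common value; ties go to the smallest value; None if empty."""
    votes = Counter(values)
    return max(sorted(votes), key=votes.get) if votes else None


def pseudo_labels(web_links, blog_labels):
    """web_links: blog -> list of web pages it links to."""
    votes_for = {}
    for blog, pages in web_links.items():
        if blog in blog_labels:
            for page in pages:
                votes_for.setdefault(page, []).append(blog_labels[blog])
    # Each web page gets the majority label of the blogs pointing at it.
    return {page: majority(v) for page, v in votes_for.items()}


def label_via_web(blog, web_links, web_pseudo):
    """Label an unlabeled blog from the pseudo labels of its web links."""
    pages = web_links.get(blog, [])
    return majority(web_pseudo[p] for p in pages if p in web_pseudo)
```

In a combined setting the web pseudo labels would simply be added to the votes from ordinary blog neighbors; that step is omitted here to keep the sketch short.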

Implementation Issues
• Preliminary experiments guided choice of settings:
– Choice of similarity function for NN classifier: used correlation coefficient between vectors of adjacent labels
– Smoothed feature vector with triangular kernel because of continuity of ages
– In the multigraph case with additional features, extended by blending with the Jaccard coefficient of set similarity of features
– Iterative algorithm allocates label based on majority voting
• Experimented with a variety of edge combinations: friends only, blog only, blog+friends, blog+web
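A rough sketch of how these similarity pieces could fit together is given below. The age range, kernel width, Pearson form of the correlation, and the blending weight `alpha` are all assumptions chosen for illustration; the slides do not state the actual values or the exact blending formula.

```python
import math

AGES = range(10, 61)  # assumed range of age labels


def smoothed_histogram(neighbor_ages, width=2):
    """Histogram of neighbor ages, spread with a triangular kernel."""
    hist = {a: 0.0 for a in AGES}
    for age in neighbor_ages:
        for offset in range(-width, width + 1):
            if age + offset in hist:
                # Weight falls off linearly with distance from the true age.
                hist[age + offset] += 1.0 - abs(offset) / (width + 1)
    return [hist[a] for a in AGES]


def correlation(x, y):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def jaccard(s, t):
    """Jaccard coefficient of two sets; 0.0 if both are empty."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def blended_similarity(ages_a, ages_b, web_a, web_b, alpha=0.5):
    """Blend label-vector correlation with Jaccard similarity of web links."""
    corr = correlation(smoothed_histogram(ages_a), smoothed_histogram(ages_b))
    return (1 - alpha) * corr + alpha * jaccard(web_a, web_b)
```

The smoothing means that a node whose neighbors are mostly 18 still looks similar to one whose neighbors are mostly 19, which a raw histogram would treat as unrelated.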

Data Collection Summary

                    Data set 1            Data set 2          Data set 3
Profiles crawled    400K                  300K                780K
Labeled             50K (12.5%)           124K (41%)          500K (64%)
Blog nodes          41K                   200K                535K
Blog links          190K                  404K                3000K
Web nodes           331K                  289K                74K
Web links           997K                  1089K               895K
Median blog links   4                     2                   5
Median web links    3                     4                   2

Most popular weblinks
1.                  news.google.com       maps.google.com     members.msn.com
2.                  picasa.google.com     www.myspace.com     wwp.icq.com
3.                  en.wikipedia.org      photobucket.com     edit.yahoo.com
4.                  www.flickr.com        www.youtube.com     www.gottem.net
5.                  www.statcounter.com   quizilla.com        www.crazyarcades.com

• 50GB of data collected

Accuracy on Age Label
• Similar results on age for both methods; some data sets are “easier” than others, due to density and connectivity
• Local algorithm takes a few seconds to assign labels; NN takes tens of minutes (due to exhaustive comparisons)

Multigraph Labeling for Age
• Adding web links and using pseudo labels does not significantly change accuracy, but increases coverage
• Assigned age reflects the webpage, e.g. the bands Slipknot (17) vs. Radiohead (28), but also the demographics of the blog network

Learning Location Labels
• Local algorithm predicts country and continent with high (80%+) accuracy over all data sets, validating hypothesis
• Errors come from over-representing common labels: N. America has high recall, low precision; Africa vice versa
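The precision and recall pattern described above can be checked per label with a small helper like the one below. The function name `precision_recall` and the toy label lists in the usage lines are made up for illustration and are not results from the study.

```python
def precision_recall(true_labels, predicted, label):
    """Precision and recall for one label, given parallel lists."""
    pairs = list(zip(true_labels, predicted))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    predicted_pos = sum(1 for _, p in pairs if p == label)
    actual_pos = sum(1 for t, _ in pairs if t == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Toy data: a classifier that over-predicts the most common label.
truth = ["N. America", "N. America", "N. America", "Africa", "Africa", "Europe"]
guess = ["N. America", "N. America", "N. America", "N. America", "Africa", "N. America"]
common = precision_recall(truth, guess, "N. America")  # lower precision, full recall
rare = precision_recall(truth, guess, "Africa")        # full precision, lower recall
```

Over-predicting a common label pulls in nodes that belong elsewhere, which lowers its precision while keeping recall high; the rarer labels lose recall for the same reason.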

Conclusions
• Analyzed performance of simple classifiers for blog data using link and label information only
– Richness of setting leads to many details: choice of distance, smoothing and voting functions, etc.
– Links alone still hold a lot of information: 80% accuracy, better than naïve use of standard classifiers
• Simple models are quite limited and do not extend easily
– Work better for some labels, rely on hypotheses
– Open question: how to apply and scale richer models (Relational Markov Networks) to blogs
• Need to understand the benefit of additional attributes
– In our experiments, extra features did not seem to help
