Graph-Based Methods for Multilingual Text and Web Mining
Mark Last
Department of Information Systems Engineering
Ben-Gurion University of the Negev
In cooperation with
Horst Bunke (University of Bern)
Abraham Kandel, Adam Schenker (University of South Florida)
Alex Markov, Marina Litvak, Guy Danon (Ben-Gurion University)
E-mail: mlast@bgu.ac.il
Home Page: http://www.bgu.ac.il/~mlast/
Text Mining Day 2009 at BGU, May 25, 2009
Agenda
• Introduction and Motivation
• Graph-Based Representations of Text and Web Documents
• Graph-Based Categorization and Clustering Algorithms
• The Hybrid Approach to Web Document Categorization
• Graph-Based Keyword Extraction
• Summary
Prof. Mark Last (BGU)
INTRODUCTION AND MOTIVATION
Web Mining Tasks
[Diagram: taxonomy of Web mining tasks]
• Web Mining branches: Web Structure Mining, Web Usage Mining, Web Content Mining
• Tasks shown: PageRank, Information Search and Retrieval, Document Categorization, Document Clustering, Keyword Extraction and Summarization
The Vector-Space Model (Salton et al., 1975)
• A text document is considered a "bag of words (terms / features)"
– Document d_j = (w_{1j}, …, w_{|T|j}), where T = (t_1, …, t_{|T|}) is the set of terms (features) that occur at least once in at least one document (the vocabulary)
• Term: n-gram, single word, noun phrase, keyphrase, etc.
• Term weights: binary, frequency-based, etc.
• Meaningless ("stop") words are removed
• Stemming operations may be applied
– Leaders => Leader
– Expiring => expire
• The ordering and position of words, as well as document logical structure and layout, are completely ignored
May 29, 2009
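The bag-of-words idea above can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the crude tokenizer, and the function names (tokenize, build_vocabulary, to_vector) are assumptions made for the example, not part of the original slides, and stemming is left out.

```python
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Split on whitespace, strip punctuation, lowercase, drop stop words."""
    tokens = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def build_vocabulary(docs):
    """T: every term that occurs at least once in at least one document."""
    return sorted({t for d in docs for t in tokenize(d)})

def to_vector(doc, vocabulary, binary=False):
    """Return d_j = (w_1j, ..., w_|T|j) with binary or frequency weights."""
    counts = Counter(tokenize(doc))
    if binary:
        return [1 if counts[t] > 0 else 0 for t in vocabulary]
    return [counts[t] for t in vocabulary]

docs = ["The bomb exploded in Baghdad.", "Baghdad police and the bomb squad."]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

Because only term counts are kept, to_vector("bomb Baghdad", vocab) and to_vector("Baghdad bomb", vocab) return the same vector, which is exactly the loss of word order noted in the last bullet.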
Advantages of the Vector-Space Model (based on Joachims, 2002)
• A simple and straightforward representation for English and other languages, where words have a clear delimiter
• Most weighting schemes require a single scan of each document
• A fixed-size vector representation makes unstructured text accessible to most classification algorithms (from decision trees to SVMs)
• Consistently good results in the information retrieval domain (mainly, on English corpora)
Limitations of the Vector-Space Model
• Text documents
– Ignoring the word position in the document
– Ignoring the ordering of words in the document
• Web documents
– Ignoring the information contained in HTML tags (e.g., document sections)
• Multilingual documents
– Word separation may be tricky in some languages (e.g., Latin, German, Chinese)
– No comprehensive evaluation on large non-English corpora
Word Separation in Ancient Latin
[Photo: The Arch of Titus, Rome (1st Century AD)]
[Photo: Dedication to Julius Caesar (1st Century BC)]
Words are separated by triangles
GRAPH-BASED REPRESENTATIONS OF TEXT AND WEB DOCUMENTS
Introduced in Schenker et al., 2005
Relevant Definitions (based on Bunke and Kandel, 2000)
• A (labeled) graph G is a 4-tuple G = (V, E, α, β), where V is a set of nodes (vertices), E ⊆ V × V is a set of edges connecting the nodes, α is a function labeling the nodes, and β is a function labeling the edges.
[Figure: example graph with node labels A, B, C and edge labels x, y]
• Node and edge IDs are omitted for brevity
• Graph size: |G| = |V| + |E|
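As a rough illustration of the 4-tuple definition, the sketch below stores V, E, α, and β in a small Python class and computes the graph size |G| = |V| + |E|. The class name LabeledGraph, its methods, and the particular wiring of the A, B, C example (A to B labeled x, B to C labeled y) are assumptions made for this example; the slide does not specify which edges the figure contains.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledGraph:
    nodes: set = field(default_factory=set)          # V: node IDs
    edges: set = field(default_factory=set)          # E: subset of V x V
    node_labels: dict = field(default_factory=dict)  # alpha: node ID -> label
    edge_labels: dict = field(default_factory=dict)  # beta: (src, dst) -> label

    def add_node(self, node_id, label):
        self.nodes.add(node_id)
        self.node_labels[node_id] = label

    def add_edge(self, src, dst, label=None):
        # E must be a subset of V x V, so both endpoints have to exist.
        if src not in self.nodes or dst not in self.nodes:
            raise ValueError("both endpoints must already be in V")
        self.edges.add((src, dst))
        self.edge_labels[(src, dst)] = label

    def size(self):
        """Graph size |G| = |V| + |E|."""
        return len(self.nodes) + len(self.edges)

# Hypothetical wiring of the three-node example (assumed, not from the slide).
g = LabeledGraph()
for node_id, label in enumerate(["A", "B", "C"]):
    g.add_node(node_id, label)
g.add_edge(0, 1, "x")
g.add_edge(1, 2, "y")
graph_size = g.size()
```

With three nodes and two edges, the size of this example graph is 5.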
The Graph-Based Model of Web Documents: Basic Ideas
• At most one node for each unique term in a document
• If a word B follows a word A, there is a directed edge from A to B
– Unless the words are separated by certain punctuation marks (periods, question marks, and exclamation points)
• Stop words are removed
• Graph size may be limited by including only the most frequent terms
• Stemming
– Alternate forms of the same term (singular/plural, past/present/future tense, etc.) are conflated to the most frequently occurring form
• Several variations for node and edge labeling (see the next slides)
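The construction rules above might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list and the function name build_graph are hypothetical, stemming is omitted, adjacency is computed after stop-word removal, and when the graph is limited to the most frequent terms only edges whose two endpoints are both kept survive. Splitting on ".", "?", and "!" is deliberately crude (it would also split "CNN.com").

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "has"}

def build_graph(text, max_nodes=None):
    """Return (nodes, edges): one node per unique term, directed edge A -> B
    when B follows A within the same sentence."""
    # Sentence boundaries (. ? !) block edges, as stated on the slide.
    sentences = re.split(r"[.?!]", text.lower())
    token_lists = [
        [w for w in re.findall(r"[a-z0-9]+", s) if w not in STOP_WORDS]
        for s in sentences
    ]
    counts = Counter(w for tokens in token_lists for w in tokens)
    if max_nodes is None:
        nodes = set(counts)
    else:
        # Assumption: keep the max_nodes most frequent terms.
        nodes = {w for w, _ in counts.most_common(max_nodes)}
    edges = {
        (a, b)
        for tokens in token_lists
        for a, b in zip(tokens, tokens[1:])
        if a in nodes and b in nodes
    }
    return nodes, edges

nodes, edges = build_graph("Yahoo news service reports. Reuters news reports more.")
```

In this example no edge runs from "reports" to "reuters", because a period separates the two words.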
The Standard Representation
• Edges are labeled according to the document section where the words are followed by each other
– Title (TI) contains the text related to the document's title and any provided keywords (meta-data);
– Link (L) is the "anchor text" that appears in clickable hyper-links on the document;
– Text (TX) comprises any of the visible text in the document (this includes anchor text but not title and keyword text)
[Figure: example graph with nodes YAHOO, NEWS, MORE, SERVICE, REPORTS, REUTERS and edges labeled TI, L, and TX]
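A hedged sketch of section-labeled edges follows. The function name build_standard_graph and the sample page are hypothetical: the page reuses the node names from the figure, but its wording and the resulting edges are invented for illustration and are not taken from the slide. Stop-word removal and stemming are omitted to keep the example short. Each edge is stored as a (source, target, section) triple, so the same pair of words can be linked under more than one section label.

```python
import re

SENTENCE_END = re.compile(r"[.?!]")
WORD = re.compile(r"[a-z0-9]+")

def build_standard_graph(sections):
    """sections maps a section label ("TI", "L", "TX") to its text.
    Returns (nodes, edges) with edges as (source, target, label) triples."""
    nodes, edges = set(), set()
    for label, text in sections.items():
        for sentence in SENTENCE_END.split(text.lower()):
            words = WORD.findall(sentence)
            nodes.update(words)
            edges.update((a, b, label) for a, b in zip(words, words[1:]))
    return nodes, edges

# Hypothetical page content; the TX section repeats the anchor text because
# visible text includes anchor text.
page = {
    "TI": "Yahoo News",
    "L": "More News",
    "TX": "Reuters news service reports. More news.",
}
nodes, edges = build_standard_graph(page)
```

Here the pair ("more", "news") appears twice, once labeled L and once labeled TX, which is how the anchor text ends up in both sections.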
The Simple Representation
• The graph is based only on the visible text on the page (title and meta-data are ignored)
• Edges are not labeled
[Figure: example graph with nodes NEWS, MORE, REPORTS, REUTERS, SERVICE and unlabeled edges]
Other Representations
• The n-distance Representation
– Look up to n terms ahead and connect the succeeding terms with an edge that is labeled with the distance between them (n)
• The n-simple Representation
– Look up to n terms ahead and connect the succeeding terms with an unlabeled edge
• The Absolute Frequency Representation
– Each node and edge is labeled with an absolute frequency measure
• The Relative Frequency Representation
– Each node and edge is labeled with a relative frequency measure
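The n-distance and n-simple variants can be illustrated with a short sketch. The function names are hypothetical, and two points are assumptions rather than statements from the slide: adjacent terms are given distance 1, and the input is the term list of a single sentence after preprocessing, so no edge crosses a sentence boundary.

```python
def n_distance_edges(words, n):
    """Connect each term to the terms up to n positions ahead; the edge label
    is the distance between the two terms (1 for adjacent terms)."""
    edges = set()
    for i, source in enumerate(words):
        for distance in range(1, n + 1):
            if i + distance < len(words):
                edges.add((source, words[i + distance], distance))
    return edges

def n_simple_edges(words, n):
    """Same look-ahead window as n-distance, but edges carry no label."""
    return {(a, b) for a, b, _ in n_distance_edges(words, n)}

terms = ["car", "bomb", "explod", "baghdad"]
labeled = n_distance_edges(terms, 2)
unlabeled = n_simple_edges(terms, 2)
```

With n = 1, n_simple_edges yields the same edges as the simple representation, which links only directly adjacent terms.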
Graph Based Document Representation Example (source: www.cnn.com, 24/05/2005)
Graph Based Document Representation - Parsing
[Screenshot: the example page with its title, link, and text regions marked]
Graph Based Document Representation - Preprocessing
TITLE
CNN.com International
Text
A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqi Prime Minister Ibrahim al-Jaafari and his driver were killed in a drive-by shooting.
Links
Iraq bomb: Four dead, 110 wounded. FULL STORY.
[Callouts on the slide: stop word removal; stemming (e.g., "killed" is conflated to "killing")]
Graph Based Document Representation - Preprocessing
TITLE
CNN.com International
Text
A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqis Prime Minister Ibrahim al-Jaafari and his driver were killing in a driver shooting.
Links
Iraqis bomb: Four dead, 110 wounding. FULL STORY.
Standard Graph Based Document Representation
Ten most frequent terms are used

Word            Frequency
Iraq            3
Kill            2
Bomb            2
Wound           2
Drive           2
Explod          1
Baghdad         1
International   1
CNN             1
Car             1

[Figure: graph over the nodes KILL, CAR, DRIVE, IRAQ, BOMB, WOUND, EXPLOD, BAGHDAD, INTERNATIONAL, CNN; edge labels: TX = Text, L = Link, TI = Title]
Simple Graph Based Document Representation
Ten most frequent terms are used

Word            Frequency
Iraq            3
Kill            2
Bomb            2
Wound           2
Drive           2
Explod          1
Baghdad         1
International   1
CNN             1
Car             1

[Figure: graph over the nodes KILL, CAR, DRIVE, IRAQ, BOMB, WOUND, EXPLOD, BAGHDAD with unlabeled edges]
GRAPH-BASED CATEGORIZATION AND CLUSTERING ALGORITHMS
Based on Schenker et al., 2005