Ben-Gurion University of the Negev

Model-Based Classification of Web Documents Represented by Graphs

Alex Markov and Mark Last
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel

Abraham Kandel
National Institute for Applied Computational Intelligence, University of South Florida, Tampa, FL, USA

E-mail: mlast@bgu.ac.il
Home Page: http://www.ise.bgu.ac.il/faculty/mlast/

WebKDD 2006 Workshop on Knowledge Discovery on the Web, Aug. 20, 2006, at KDD 2006, Philadelphia, PA, USA

Content
• Introduction and Motivation
• Graph-based Representation of Web Documents
• The Hybrid Methodology for Web Document Representation and Classification
  – The Naïve Approach
  – The Smart Approach
  – The Smart Approach with Fixed Threshold
• Comparative Evaluation
• Conclusions and Future Research

Motivation
• Most Web document classification algorithms
  – Treat Web documents the same way as text documents
    • HTML tags are completely ignored
• The popular Vector-Space model
  – Ignores the position of words in the document
  – Ignores the order of words in the document
• Solution – structure-sensitive document representation
  – Graph representation in this research

Text Categorization (TC): Relevant Definitions
• TC – the task of assigning a Boolean value from $\{T, F\}$ to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D = (d_1, \ldots, d_{|D|})$ is the domain of documents and $C = (c_1, \ldots, c_{|C|})$ is the set of pre-defined categories (classes)
• Single-label TC – only one category can be assigned to each document
• Multi-label TC – overlapping categories are allowed
• Ranking categorization – the degree of relevance of every document to each category is calculated

Graph-Based Document Representation: Example
[Figure: the example Web page. Source: www.cnn.com, 24/05/2005]

Graph-Based Document Representation – Parsing
[Figure: the example page with its title, link, and text sections marked]

Graph-Based Document Representation – Preprocessing
• Steps: stemming, stop word removal
• Title: CNN.com International
• Text: A car bomb has exploded outside a popular Baghdad restaurant, killing three Iraqis and wounding more than 110 others, police officials said. Earlier an aide to the office of Iraqi Prime Minister Ibrahim al-Jaafari and his driver were killed in a drive-by shooting.
• Links: Iraq bomb: Four dead, 110 wounded. FULL STORY.

Graph-Based Document Representation – Graph Construction

Word            Frequency
Iraq            3
Kill            2
Bomb            2
Wound           2
Drive           2
Explod          1
Baghdad         1
International   1
CNN             1
Car             1

[Figure: document graph with nodes KILL, CAR, DRIVE, IRAQ, BOMB, WOUND, EXPLOD, BAGHDAD, INTERNATIONAL, CNN. Edge labels: TI = Title, L = Link, TX = Text]

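The construction step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that a directed edge links each term to the term that immediately follows it within the same section, and that the edge carries the section label (TI, L, or TX). The function name `build_document_graph` and the toy term lists are hypothetical, and the sketch ignores sentence boundaries.

```python
from collections import Counter

def build_document_graph(sections):
    """Build a graph with one node per unique term.

    `sections` maps a section label ("TI", "L", "TX") to a list of terms
    that are already stemmed and stripped of stop words, in original order.
    Assumption: an edge joins each term to its immediate successor.
    """
    nodes = Counter()   # term -> frequency over all sections
    edges = set()       # (source term, target term, section label)
    for label, terms in sections.items():
        nodes.update(terms)
        for a, b in zip(terms, terms[1:]):
            if a != b:  # no self-loops for repeated words
                edges.add((a, b, label))
    return nodes, edges

# Hypothetical toy input, loosely modelled on the example page.
sections = {
    "TI": ["cnn", "international"],
    "L": ["iraq", "bomb", "wound"],
    "TX": ["car", "bomb", "explod", "baghdad"],
}
nodes, edges = build_document_graph(sections)
```

Because every term becomes exactly one node, the term "bomb" is shared by the link and text sections, while its edges remain distinguishable by label.
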
Web Document Classification with Graph-Based Models
• Advantages (Schenker et al., 2004)
  – Keep HTML structure information
  – Retain the original order of words
• Limitation
  – Can work only with "lazy" classifiers, which have a very low classification speed
    • Example: k-Nearest Neighbors classifier
• Conclusion
  – Graph models cannot be used directly for model-based classification of Web documents
• Solution
  – The hybrid approach: represent a document as a vector of sub-graphs

Graph-Based Document Representation – Subgraph Extraction
• Naïve Method
  – Input:
    • G – training set of directed graphs with unique nodes
    • t_min – threshold (minimum sub-graph frequency)
  – Output:
    • Set of classification-relevant sub-graphs
  – Process:
    • For each class, find the frequent sub-graphs: SCF > t_min (SCF = Subgraph Class Frequency)
    • Combine all sub-graphs into one set
• Classification-relevant sub-graphs are those that are frequent in a specific category

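The naïve selection rule can be sketched as follows. The sketch assumes that SCF is the fraction of a class's training graphs that contain the sub-graph, and it represents every graph and sub-graph as a frozenset of directed edges, so containment reduces to a subset test (valid only because node labels are unique). The names `scf` and `naive_relevant_subgraphs` and the toy data are hypothetical, and the candidate sub-graphs are assumed to be given.

```python
def scf(subgraph, class_graphs):
    """Fraction of the graphs of one class that contain the sub-graph."""
    hits = sum(1 for g in class_graphs if subgraph <= g)
    return hits / len(class_graphs)

def naive_relevant_subgraphs(candidates, graphs_by_class, t_min):
    """Union, over all classes, of the candidates with SCF > t_min."""
    relevant = set()
    for class_graphs in graphs_by_class.values():
        relevant |= {sg for sg in candidates if scf(sg, class_graphs) > t_min}
    return relevant

# Hypothetical toy training set: each graph is a frozenset of edges.
ab = ("arab", "bank")
wb = ("west", "bank")
tw = ("team", "win")
graphs_by_class = {
    "politics": [frozenset({ab}), frozenset({ab, wb})],
    "sport": [frozenset({tw}), frozenset({ab, tw})],
}
candidates = {frozenset({ab}), frozenset({wb}), frozenset({tw})}
selected = naive_relevant_subgraphs(candidates, graphs_by_class, t_min=0.6)
```

With this threshold the sub-graph containing only the edge arab→bank is kept because it is frequent in "politics", even though it also appears in half of the "sport" graphs.
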
Graph-Based Document Representation – Subgraph Extraction
• Smart Method
  – Input:
    • G – training set of directed graphs with unique nodes
    • CR_min – minimum Classification Rate
  – Output:
    • Set of classification-relevant sub-graphs
  – Process:
    • For each class, find the sub-graphs with CR > CR_min
    • Combine all sub-graphs into one set
• Classification-relevant sub-graphs are those that are more frequent in a specific category than in other categories

Graph-Based Document Representation – Subgraph Extraction
• Smart with Fixed Threshold Method
  – Input:
    • G – training set of directed graphs with unique nodes
    • t_min – threshold (minimum sub-graph frequency)
    • CR_min – minimum Classification Rate
  – Output:
    • Set of classification-relevant sub-graphs
  – Process:
    • For each class, find the sub-graphs with SCF > t_min and CR > CR_min
    • Combine all sub-graphs into one set
• Classification-relevant sub-graphs are those that are frequent in a specific category and not frequent in other categories

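The combined filter can be sketched as below. The slides do not define the Classification Rate, so the `classification_rate` function here is an assumption made purely for illustration: a TF-IDF-style product of SCF in the target class and a log-scaled inverse frequency in all other classes. All function names, the smoothing for sub-graphs unseen in other classes, and the toy data are hypothetical. Graphs are frozensets of edges, as is only possible with unique node labels, and at least two classes are required.

```python
import math

def scf(subgraph, class_graphs):
    """Fraction of the graphs of one class that contain the sub-graph."""
    return sum(1 for g in class_graphs if subgraph <= g) / len(class_graphs)

def classification_rate(subgraph, target, graphs_by_class):
    """ASSUMED measure (not taken from the slides): SCF x inverse frequency."""
    others = [g for c, gs in graphs_by_class.items() if c != target for g in gs]
    hits = sum(1 for g in others if subgraph <= g)
    # Assumed smoothing: an unseen sub-graph counts as half an occurrence.
    isf = math.log2(len(others) / hits) if hits else math.log2(2 * len(others))
    return scf(subgraph, graphs_by_class[target]) * isf

def smart_fixed_threshold(candidates, graphs_by_class, t_min, cr_min):
    """Keep candidates that pass both thresholds in at least one class."""
    relevant = set()
    for target, class_graphs in graphs_by_class.items():
        for sg in candidates:
            if (scf(sg, class_graphs) > t_min
                    and classification_rate(sg, target, graphs_by_class) > cr_min):
                relevant.add(sg)
    return relevant

# Hypothetical toy data: e1 is common to both classes, e3 is specific to B.
e1, e2, e3 = ("a", "b"), ("b", "c"), ("x", "y")
graphs_by_class = {
    "A": [frozenset({e1, e2}), frozenset({e1})],
    "B": [frozenset({e1, e3}), frozenset({e3})],
}
candidates = {frozenset({e1}), frozenset({e2}), frozenset({e3})}
selected = smart_fixed_threshold(candidates, graphs_by_class, t_min=0.6, cr_min=1.5)
```

In this toy run the sub-graph built from `e1` is frequent in class A but is rejected because it also occurs in class B, which is the behavior the slide describes for this method.
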
Predictive Model Induction with Hybrid Representation
[Figure: flow chart] Web or text documents → Graph Construction → Sub-graph Extraction → Text representation → Feature selection → Creation of prediction model → Document classification rules
• Web or text documents: set of documents with known category (the training set)
• Graph construction: graph representation of the documents
• Sub-graph extraction: extraction of sub-graphs relevant for classification
• Text representation: representation of all documents as vectors with Boolean values for every sub-graph in the set
• Feature selection: identification of the best attributes (Boolean features) for classification
• Creation of prediction model: finally, prediction model construction and extraction of classification rules

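The vector representation step can be illustrated with a short sketch. It assumes each document graph and each selected sub-graph is stored as a frozenset of directed edges, which works only because node labels are unique; the function name `to_boolean_vector` and the toy data are hypothetical. Any standard model-based classifier could then be trained on the resulting vectors.

```python
def to_boolean_vector(doc_graph, subgraph_features):
    """One Boolean per selected sub-graph: True if the document contains it."""
    return [sg <= doc_graph for sg in subgraph_features]

# Hypothetical ordered feature list (the classification-relevant sub-graphs).
features = [
    frozenset({("arab", "bank")}),
    frozenset({("arab", "bank"), ("west", "bank")}),
    frozenset({("team", "win")}),
]
document = frozenset({("arab", "bank"), ("west", "bank"), ("bank", "politic")})
vector = to_boolean_vector(document, features)
```

The feature order must stay fixed across all documents so that each vector position always refers to the same sub-graph.
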
Frequent Subgraph Extraction: Notations

Notation   Description
G          Set of document graphs
t_min      Subgraph frequency threshold
k          Number of edges in the graph
g          Single graph
sg         Single subgraph
sg_k       Subgraph with k edges
F_k        Set of frequent subgraphs with k edges
E_k        Set of extension subgraphs with k edges
C_k        Set of candidate subgraphs with k edges

Frequent Subgraph Extraction: Algorithm
(based on the FSG algorithm by Kuramochi and Karypis, 2004)

F_0 ← Detect all frequent 1-node subgraphs (nodes) in G
k ← 1
While F_(k-1) ≠ Ø Do
    For Each subgraph sg_(k-1) ∈ F_(k-1) Do
        For Each graph g ∈ G Do
            If sg_(k-1) is a subgraph of g Then
                E_k ← Detect all possible k-edge extensions of sg_(k-1) in g
                For Each subgraph sg_k ∈ E_k Do
                    If sg_k is already a member of C_k Then
                        {sg_k ∈ C_k}.Count++
                    Else
                        sg_k.Count ← 1
                        C_k ← C_k ∪ {sg_k}
    F_k ← {sg_k in C_k | sg_k.Count > t_min * |G|}
    k++
Return F_1, F_2, … F_(k-2)

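A runnable Python sketch of this level-wise search is given below. It is an interpretation, not the authors' code: each graph is a pair of a node set and a set of directed `(source, target)` edges, node labels are assumed unique, and edge labels are omitted for brevity. Unlike the pseudocode, the sketch iterates over graphs in the outer loop and collects extensions in a set, so a subgraph reachable from two different parents is counted once per graph. The function name `frequent_subgraphs` and the toy graphs are hypothetical.

```python
from collections import Counter

def frequent_subgraphs(graphs, t_min):
    """Return {k: set of frequent subgraphs with k edges} for k >= 1.

    A subgraph is a (frozenset of nodes, frozenset of edges) pair.
    """
    n = len(graphs)
    node_counts = Counter(v for nodes, _ in graphs for v in nodes)
    # F_0: frequent single nodes.
    frequent = {0: {(frozenset([v]), frozenset())
                    for v, c in node_counts.items() if c > t_min * n}}
    k = 1
    while frequent[k - 1]:
        counts = Counter()
        for nodes, edges in graphs:
            extensions = set()  # E_k restricted to this graph
            for sg_nodes, sg_edges in frequent[k - 1]:
                if sg_nodes <= nodes and sg_edges <= edges:
                    # Add one more edge of g that touches the subgraph.
                    for u, v in edges - sg_edges:
                        if u in sg_nodes or v in sg_nodes:
                            extensions.add((sg_nodes | {u, v},
                                            sg_edges | {(u, v)}))
            counts.update(extensions)  # one count per containing graph
        frequent[k] = {sg for sg, c in counts.items() if c > t_min * n}
        k += 1
    # Drop the single-node level and the empty final level.
    return {i: f for i, f in frequent.items() if i >= 1 and f}

# Hypothetical toy graphs.
g1 = ({"arab", "bank", "west"}, {("arab", "bank"), ("west", "bank")})
g2 = ({"arab", "bank", "politic"}, {("arab", "bank"), ("arab", "politic")})
g3 = ({"arab", "bank", "west"}, {("arab", "bank"), ("west", "bank")})
result = frequent_subgraphs([g1, g2, g3], t_min=0.5)
```

With `t_min = 0.5` a subgraph must occur in at least two of the three graphs, so the edge arab→politic, present only in `g2`, is pruned at the first level.
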
Frequent Subgraph Extraction: Complexity

Subgraph isomorphism
Isomorphism between graph $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and part of graph $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ can be found by two simple actions:
1. Determine that $V_1 \subseteq V_2$: $O(|V_1| \cdot |V_2|)$
2. Determine that $E_1 \subseteq E_2$: $O(|V_1|^2)$
Total complexity: $O(|V_1| \cdot |V_2| + |V_1|^2) \le O(|V_2|^2)$

Graph isomorphism
Isomorphism between graphs $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ can be found by two simple actions:
1. Determine $G_1 \subseteq G_2$: $O(|V_2|)$
2. Determine $G_2 \subseteq G_1$: $O(|V_2|)$
Total complexity: $O(|V_2|)$

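Both checks reduce to set inclusion because every node label occurs at most once in a graph, so no search over node mappings is needed. The sketch below assumes a graph is a pair of a node-label set and a set of `(source, target, label)` edges; the function names and toy graphs are hypothetical. Python's hash-based sets make each inclusion test run in expected time linear in the size of the smaller graph, which stays within the bounds stated above.

```python
def is_subgraph(g1, g2):
    """True if g1 is contained in g2: V1 ⊆ V2 and E1 ⊆ E2."""
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    return nodes1 <= nodes2 and edges1 <= edges2

def is_isomorphic(g1, g2):
    """With unique node labels, isomorphism is mutual containment."""
    return is_subgraph(g1, g2) and is_subgraph(g2, g1)

# Hypothetical toy graphs with labelled edges.
small = ({"iraq", "bomb"}, {("iraq", "bomb", "L")})
large = ({"iraq", "bomb", "wound"},
         {("iraq", "bomb", "L"), ("bomb", "wound", "TX")})
```

An edge with the same endpoints but a different section label is treated as a different edge, so a link-section edge does not match a text-section edge.
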
Frequent Subgraph Extraction Example
[Figure: three panels, "Subgraphs", "Document Graph", and "Extensions", drawn over the nodes Arab, Bank, West, and Politic]