
Page Segmentation by Web Content Clustering (Sadet Alcic)


  1. Page Segmentation by Web Content Clustering Sadet Alcic Heinrich-Heine-University of Duesseldorf Department of Computer Science Institute for Databases and Information Systems May 26, 2011 WIMS ’11 1 / 19

  2. Outline
     1 Introduction
       Motivation
       Related Work
     2 Web Page Segmentation by Clustering
       General Idea
       Distance functions for web contents
       Clustering methods
     3 Evaluation Studies
       Distance functions
       Clustering
     4 Conclusion and Future work

  3. Introduction: Motivation

  4. Motivation
     A web page is cluttered with different kinds of content:
     ◮ Different news articles
     ◮ Link lists
     ◮ Commercials
     ◮ Template elements
     ◮ Functional elements

  5. Motivation: Web Page Segmentation
     ◮ Separation of web contents into structurally and semantically cohesive blocks

  7. Motivation: Applications
     ◮ Web Content Search
     ◮ Web Page Categorization
     ◮ Web Page Adaptation for Mobile Devices
     ◮ Web Image Indexing
     ◮ ...

  8. Overview of Related Work on Web Page Segmentation
     ◮ TOP-DOWN page segmentation:
       ◮ KDD’02: Lin and Ho. Discovering Informative Content Blocks from Web Documents (Table properties)
       ◮ APWeb’03: Cai et al. Extracting content structure for web pages based on visual representation (Heuristic rules on visual and DOM properties)
       ◮ TKDE’05: Kao et al. Web Intrapage Informative Structure Mining Based on DOM (Term entropy based on heuristics)
     ◮ BOTTOM-UP page segmentation:
       ◮ CIKM’02: Li et al. Using Micro Information Units for Internet Search (Heuristic rules)
       ◮ WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage segmentation (Graph partitioning)
       ◮ CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web page segmentation (Partitioning of a histogram of text density)

  9. Related Work: Basic Idea of TOP-DOWN Page Segmentation
     (KDD’02: Lin and Ho; APWeb’03: Cai et al.; TKDE’05: Kao et al.)
     ◮ Start with the complete page as the initial block
     ◮ Decide for each block:
       ◮ Should the block be separated?
       ◮ If yes, where to separate?
     ◮ Note: these decisions are based on heuristics

  10. Related Work: Basic Idea of BOTTOM-UP Page Segmentation
     (CIKM’02: Li et al.; WWW’08: Chakrabarti et al.; CIKM’08: Kohlschuetter and Nejdl)
     ◮ Start with the smallest content units (e.g., DOM leaves)
     ◮ Group them into blocks of coherent content
     ◮ How? Existing methods use heuristic rules, graph partitioning, or partitioning of a histogram of text density

  11. Related Work: Our Approach
     ◮ Belongs to the BOTTOM-UP methods
     ◮ DOM leaves are used as basic web objects
     ◮ Idea: group web objects into blocks by clustering

  12. Web Page Segmentation by Clustering

  13. General Idea: Page Segmentation by Clustering
     General definition of clustering:
     ◮ Clustering is the process of organizing objects into groups whose members are similar in some way
     ◮ A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters
     Open questions addressed in this work:
     ◮ How can the similarity (or dissimilarity) of web objects be estimated?
     ◮ Which representation is best suited to represent web objects?
     ◮ Which clustering method should be applied?

  14. Different Representations of Web Objects
     ◮ Geometric representation
       ◮ the web browser places every object of a web page in a 2-dimensional plane
       ◮ extract the bounding rectangle of each object
     ◮ Semantic representation
       ◮ elements in the DOM contain textual content
       ◮ extract keywords from the corresponding text
     ◮ DOM-based representation
       ◮ each object is a node in the DOM tree of the page
       ◮ use the position of the object in the DOM tree to characterize it
     ⇒ Different distance measures are possible

  15. Geometric Distance
     ◮ Let $R = [(r_x, r_y), (r_x', r_y')]$ and $S = [(s_x, s_y), (s_x', s_y')]$ be two bounding rectangles
     ◮ The geometric distance of $R$ to $S$ is given by

       $$\mathrm{dist}(R, S) = \Big( \sum_{i \in \{x, y\}} t_i^2 \Big)^{1/2}, \quad \text{with } t_i = \begin{cases} r_i - s_i' & \text{if } r_i > s_i', \\ s_i - r_i' & \text{if } r_i' < s_i, \\ 0 & \text{otherwise.} \end{cases}$$

     ◮ Visually: [Figure: rectangles R and S in a plane with origin (0, 0) and axes x and y; S spans (s_x, s_y) to (s_x', s_y'), R spans (r_x, r_y) to (r_x', r_y'), and mindist(R, S) is the gap between them]
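
The geometric distance can be sketched in a few lines of Python. This is only an illustration, assuming each bounding rectangle is given as a pair of corners whose first corner has the smaller coordinates on both axes; the function name `rect_distance` and the sample rectangles are hypothetical and not taken from the slides.

```python
import math


def rect_distance(r, s):
    """Minimum distance between two axis-aligned rectangles.

    Each rectangle is ((x, y), (x2, y2)) with x <= x2 and y <= y2
    (an assumption of this sketch).
    """
    (rx, ry), (rx2, ry2) = r
    (sx, sy), (sx2, sy2) = s

    def gap(lo_r, hi_r, lo_s, hi_s):
        # Per-axis term t_i of the distance formula.
        if lo_r > hi_s:   # R starts after S ends on this axis
            return lo_r - hi_s
        if hi_r < lo_s:   # R ends before S starts on this axis
            return lo_s - hi_r
        return 0.0        # the projections overlap

    tx = gap(rx, rx2, sx, sx2)
    ty = gap(ry, ry2, sy, sy2)
    return math.hypot(tx, ty)  # sqrt(tx^2 + ty^2)


# Hypothetical example rectangles.
a = ((0, 0), (10, 10))
b = ((13, 14), (20, 20))
print(rect_distance(a, b))  # 5.0 (gaps of 3 and 4 on the two axes)
```

Rectangles that overlap or touch get distance 0, so objects that are laid out directly next to each other are treated as maximally close.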

  16. Semantic Distance
     ◮ Given $T_1$ = (dog, run, street), $T_2$ = (puppy, walk, road)
     ◮ Cosine similarity measure (Information Retrieval)
       ◮ Lexical word-to-word matching → $\mathrm{sim}(T_1, T_2) = 0$
       ◮ too strict: e.g., synonym and hyponym relationships are not considered
     ◮ Instead: text similarity measure based on WordNet [Corley 05]
       ◮ Words are mapped to concepts in WordNet (concept-to-concept matching)

       $$\mathrm{sim}(T_1, T_2) = \frac{\sum_{w_i \in T_1} \mathrm{maxSim}(w_i, T_2) \cdot \mathrm{idf}(w_i)}{\sum_{w_i \in T_1} \mathrm{idf}(w_i)}$$
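
The similarity formula can be illustrated with a small Python sketch. It does not use WordNet: the word-to-word scores in `TOY_SIM` and the uniform idf are made-up placeholder values chosen only to show the mechanics, and all function names are hypothetical.

```python
def directional_similarity(t1, t2, word_sim, idf):
    """idf-weighted average, over the words of t1, of the best match in t2."""
    numerator = 0.0
    denominator = 0.0
    for w in t1:
        # maxSim(w, T2): best word-to-word score against any word of t2
        max_sim = max((word_sim(w, v) for v in t2), default=0.0)
        numerator += max_sim * idf(w)
        denominator += idf(w)
    return numerator / denominator if denominator > 0 else 0.0


# Placeholder scores (assumption): NOT real WordNet values.
TOY_SIM = {("dog", "puppy"): 0.9, ("run", "walk"): 0.7, ("street", "road"): 0.8}


def toy_word_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


def exact_match(a, b):
    # Lexical word-to-word matching, as in the cosine-similarity case
    return 1.0 if a == b else 0.0


def uniform_idf(word):
    # Assumption: every word has the same weight
    return 1.0


t1 = ("dog", "run", "street")
t2 = ("puppy", "walk", "road")
concept_score = directional_similarity(t1, t2, toy_word_sim, uniform_idf)
lexical_score = directional_similarity(t1, t2, exact_match, uniform_idf)
print(round(concept_score, 3), lexical_score)
```

With exact lexical matching the two texts share no word and the score is 0, whereas any word-level measure that relates dog to puppy yields a positive score; this is the motivation given on the slide.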

  17. DOM-based Distance
     [Figure: example DOM tree. Root A is on level 0 (level degree 2) with children B and C on level 1 (level degree 3); B has the leaves 1, 2, 3 and C has the leaves 4, 5, 6 on level 2 (level degree 0).]

  18. DOM-based Distance: Requirements
     ◮ Nodes under the same parent are closer than nodes under different parents
     ◮ Nodes on a higher tree level are closer than nodes on a lower level

  19. DOM-based Distance
     ◮ Traverse the DOM tree in preorder: $P = (A, B, 1, 2, 3, C, 4, 5, 6)$

  20. DOM-based Distance
     ◮ For each element $p_i$ of the preorder sequence $P = (A, B, 1, 2, 3, C, 4, 5, 6)$, define a weight $w_{p_i}$ that expresses the cost of reaching $p_i$ from its predecessor in $P$

  21. DOM-based Distance
     ◮ The distance between $p_a, p_b \in P$ (w.l.o.g. $a < b$) is defined as:

       $$d(p_a, p_b) = \sum_{i = a + 1}^{b} w_{p_i}$$

       Example: $d(2, 4) = w_3 + w_C + w_4$

  22. DOM-based Distance
     ◮ The weight $w_i$ of a node $p_i \in P$ depends on the level $l$ and the level degree $d_l$ of $p_i$:

       $$w(l) = \begin{cases} c & : d_l = 0 \\ d_l \cdot w(l + 1) & : d_l > 0 \end{cases} \tag{1}$$

       e.g., $w(2) = c$, $w(1) = 3 \cdot w(2) = 3c$, $w(0) = 2 \cdot w(1) = 6c$
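
The DOM-based distance can be traced on the example tree with the following Python sketch. The slides do not say how the level degree is computed for irregular trees; this sketch assumes it is the maximum number of children of any node on that level. The dictionary encoding of the tree and all function names are hypothetical.

```python
def preorder(tree, root):
    """Return [(node, level), ...] in preorder; `tree` maps node -> children."""
    order = []
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        order.append((node, level))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(tree.get(node, [])):
            stack.append((child, level + 1))
    return order


def level_weights(tree, order, c=1.0):
    """w(l) = c if d_l == 0, else d_l * w(l + 1)."""
    depth = max(level for _, level in order)
    degree = [0] * (depth + 1)
    for node, level in order:
        # Assumption: level degree = maximum child count on that level.
        degree[level] = max(degree[level], len(tree.get(node, [])))
    weights = [c] * (depth + 1)
    for level in range(depth - 1, -1, -1):
        if degree[level] > 0:
            weights[level] = degree[level] * weights[level + 1]
    return weights


def dom_distance(order, weights, a, b):
    """Sum of the weights of the nodes after p_a up to and including p_b."""
    names = [name for name, _ in order]
    i, j = sorted((names.index(a), names.index(b)))
    return sum(weights[level] for _, level in order[i + 1 : j + 1])


# Example tree from the slides: A -> (B, C), B -> (1, 2, 3), C -> (4, 5, 6)
tree = {"A": ["B", "C"], "B": ["1", "2", "3"], "C": ["4", "5", "6"]}
order = preorder(tree, "A")
weights = level_weights(tree, order, c=1.0)
print(weights)                                 # [6.0, 3.0, 1.0], i.e. 6c, 3c, c
print(dom_distance(order, weights, "2", "4"))  # 5.0 = w_3 + w_C + w_4
print(dom_distance(order, weights, "1", "2"))  # 1.0, siblings are close
```

Crossing from leaf 3 to leaf 4 costs $w_C + w_4 = 4c$, while moving between siblings costs only $c$, which matches the requirement that nodes under the same parent are closer than nodes under different parents.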

  23. Clustering Methods
     ◮ Partitioning clustering
       ◮ k-medoid (similar to k-means, but cluster representatives are real objects)
     ◮ Agglomerative hierarchical clustering
       ◮ single-link method applied to compute the distance between sets of objects
     ◮ Density-based clustering
       ◮ DBSCAN variant (able to find clusters of different density levels)
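
As an illustration of the agglomerative variant, the following Python sketch runs single-link clustering on a precomputed distance matrix. The slides do not state a stopping criterion; merging until the closest pair of clusters is farther apart than a `threshold` is an assumption of this sketch, and the function name and the toy matrix are hypothetical.

```python
def single_link_clusters(dist, threshold):
    """Agglomerative single-link clustering on a symmetric distance matrix.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold` (assumed stopping rule).
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: distance of the closest pair of members
                d = min(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy distance matrix for four objects (made-up values).
toy = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 4.0, 5.0],
    [5.0, 4.0, 0.0, 1.5],
    [6.0, 5.0, 1.5, 0.0],
]
print(single_link_clusters(toy, threshold=2.0))  # [[0, 1], [2, 3]]
```

Because the function only needs a distance matrix, any of the geometric, semantic, or DOM-based distances could supply the input.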

  24. Evaluation Studies

  25. Distance-Matrix Visualization
     ◮ A distance matrix contains all pairwise distances of the objects to be clustered, e.g.

             a     b     c
         a   0     1.9   1.1
         b   1.9   0     2.3
         c   1.1   2.3   0
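
A distance matrix like the one in the example can be built from any pairwise distance function. The Python sketch below assumes the distance is symmetric and zero for identical objects, so only the upper triangle is computed; the function name and the lookup-table distance are hypothetical, and the three values are the ones from the example.

```python
def distance_matrix(objects, dist):
    """All pairwise distances; assumes dist is symmetric with dist(x, x) = 0."""
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(objects[i], objects[j])
            matrix[i][j] = d
            matrix[j][i] = d  # mirror into the lower triangle
    return matrix


# Pairwise distances of the objects a, b, c from the example.
PAIRS = {("a", "b"): 1.9, ("a", "c"): 1.1, ("b", "c"): 2.3}


def lookup(x, y):
    return PAIRS[(x, y)]


matrix = distance_matrix(["a", "b", "c"], lookup)
for row in matrix:
    print(row)
```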

  26. Distance-Matrix Visualization
     [Figure: distance-matrix visualization; blocks are labeled with the object index ranges 1-19, 20-28, 29-43, 44-54, 55-56, 57-65, 66-68]
