Extracting Logical Hierarchical Structure of HTML Documents Based on Headings
Tomohiro Manabe and Keishi Tajima
Graduate School of Informatics, Kyoto Univ., Sakyo, Kyoto 606-8501 Japan
{manabe@dl.kuis, tajima@i}.kyoto-u.ac.jp
Background
• Understanding the structure of web pages is important for many applications
  • Web search
  • Automatic summarization of web pages
  • Web information extraction
Structure in web pages
• Web pages contain various types of structures
  • Layout structure, list or table structure, …
• We focus on hierarchical heading structure
  • 78% of pages contain it
[Figure: a page layout with header, menu, and content body; a list of items (Item 1, Item 2, Item 3); big and small headings]
Hierarchical heading structure
• Heading
  • Topic description of a segment
• Block
  • A segment together with its heading
  • Blocks may contain other blocks
• Hierarchical heading structure
  • Composed of these headings and blocks
Example page:
Kyoto Aquarium
is an aquarium in Kyoto, Japan.
  Overview
  One of the largest inland aquariums.
  Information
    Holidays
    Open throughout the year.
    Opening Hours
    From 9 a.m. to 5 p.m.
  History
    2010
    • Jul. Construction started.
    2012
    • Feb. Construction finished.
    • Mar. Opened just as planned.
    • Jul. Welcomed the 1Mth visitor.
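The definitions of heading and block can be written down as a small recursive data structure. The sketch below is only an illustration: the `Block` class, its field names, and the `outline` helper are hypothetical names introduced here, and the nesting of the Kyoto Aquarium page is one reading of the example slide, not data taken from the authors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A segment together with its heading (hypothetical representation)."""
    heading: str                                           # topic description of the segment
    content: List[str] = field(default_factory=list)       # text directly under the heading
    children: List["Block"] = field(default_factory=list)  # nested blocks


def outline(block: Block, depth: int = 0) -> List[str]:
    """Return the headings of a block and its descendants, indented by depth."""
    lines = ["  " * depth + block.heading]
    for child in block.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Assumed nesting of the example page from the slide.
page = Block("Kyoto Aquarium", ["is an aquarium in Kyoto, Japan."], [
    Block("Overview", ["One of the largest inland aquariums."]),
    Block("Information", [], [
        Block("Holidays", ["Open throughout the year."]),
        Block("Opening Hours", ["From 9 a.m. to 5 p.m."]),
    ]),
    Block("History", [], [
        Block("2010", ["Jul. Construction started."]),
        Block("2012", ["Feb. Construction finished.",
                       "Mar. Opened just as planned.",
                       "Jul. Welcomed the 1Mth visitor."]),
    ]),
])

print("\n".join(outline(page)))
```

Each `Block` owns exactly one heading, and the hierarchy is expressed purely by containment, which matches the statement that blocks may contain other blocks.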
Importance of heading structure
• Example query on the Kyoto Aquarium page: “2010 Mar”
• Traditional search engines:
  • This page contains both words
  • They return this page incorrectly
• Heading-aware engines (heading-aware Boolean retrieval):
  • “Mar.” occurs under “2012”, not “2010”
  • They correctly do not return this page
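The difference between the two kinds of engine can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' retrieval model: the page is represented as hypothetical (heading path, text) pairs, and the heading-aware rule assumed here is that all query words must occur in one segment's text or in the headings above it.

```python
import re

# Hypothetical representation: one entry per segment, holding the path of
# headings above the text and the text itself.
SEGMENTS = [
    (("Kyoto Aquarium",), "is an aquarium in Kyoto, Japan."),
    (("Kyoto Aquarium", "Overview"), "One of the largest inland aquariums."),
    (("Kyoto Aquarium", "Information", "Holidays"), "Open throughout the year."),
    (("Kyoto Aquarium", "Information", "Opening Hours"), "From 9 a.m. to 5 p.m."),
    (("Kyoto Aquarium", "History", "2010"), "Jul. Construction started."),
    (("Kyoto Aquarium", "History", "2012"),
     "Feb. Construction finished. Mar. Opened just as planned. "
     "Jul. Welcomed the 1Mth visitor."),
]


def tokens(text):
    """Lowercased alphanumeric words of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def flat_match(segments, query):
    """Traditional Boolean AND: every query word occurs somewhere on the page."""
    page_tokens = set()
    for path, text in segments:
        page_tokens |= tokens(" ".join(path)) | tokens(text)
    return tokens(query) <= page_tokens


def heading_aware_match(segments, query):
    """Assumed rule: every query word occurs in one segment or its headings."""
    wanted = tokens(query)
    return any(
        wanted <= (tokens(" ".join(path)) | tokens(text))
        for path, text in segments
    )


print(flat_match(SEGMENTS, "2010 Mar"))           # both words are on the page
print(heading_aware_match(SEGMENTS, "2010 Mar"))  # but never under the same headings
print(heading_aware_match(SEGMENTS, "2012 Mar"))  # "Mar." sits under "2012"
```

Under this assumed rule the query “2010 Mar” is rejected because no single segment carries both words in its text or heading path, while “2012 Mar” is accepted.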
Problem to be solved
• Hierarchical heading structure is useful
• It seems easy to extract the structure
• In fact, it’s NOT easy
Our research problem: Extraction of hierarchical heading structure
Hierarchical heading structure extraction is NOT easy
• HTML has tags for describing headings
  • H1 to H6 and DT tags
• These tags are often not used, or are used incorrectly
In our data set:
  • Only 32% of headings were tagged with these tags
  • Only 67% of components tagged with these tags were headings
• A more sophisticated extraction method is necessary
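The tag-based baseline that these numbers argue against can be sketched with Python's standard `html.parser` module. Everything concrete below is an assumption made for illustration: the sample HTML, the hand-labelled gold headings, and the helper names are invented, and the resulting scores have nothing to do with the 32% and 67% measured on the authors' data set.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "dt"}


class TagHeadingExtractor(HTMLParser):
    """Collects the text of every H1-H6 and DT element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._open = None   # heading tag currently open, if any
        self._buf = []      # text fragments seen inside it

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS and self._open is None:
            self._open = tag
            self._buf = []

    def handle_data(self, data):
        if self._open is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            text = " ".join("".join(self._buf).split())
            if text:
                self.headings.append(text)
            self._open = None


def extract_tagged_headings(html):
    parser = TagHeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings


def precision_recall(predicted, gold):
    """Precision and recall of predicted headings against hand labels."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Invented sample: one real heading is tagged, one real heading is only
# styled with <b>, and one H3 is misused for a non-heading banner.
SAMPLE_HTML = """
<h1>Kyoto Aquarium</h1>
<p>is an aquarium in Kyoto, Japan.</p>
<div><b>History</b></div>
<h3>Follow us on social media</h3>
"""
GOLD_HEADINGS = ["Kyoto Aquarium", "History"]  # hypothetical hand labels

found = extract_tagged_headings(SAMPLE_HTML)
print(found)
print(precision_recall(found, GOLD_HEADINGS))
```

The baseline misses the heading that is only styled and wrongly reports the misused H3, which is the kind of failure the two percentages on the slide quantify.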
Humans use visual style
• How do humans extract hierarchical heading structure?
• They use visual style
  • Visual style consists of various visual attributes of components
  • e.g. font-size, color
Visual style can be easily detected
• Visual style is assigned to each DOM node
• A DOM node is an element (a pair of tags) or a text fragment split by tags
Example: <LI> <B> Jul. </B> Construction started. </LI>
[Figure: DOM tree for the example, LI → B → text “Jul.”]
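How the example markup splits into DOM nodes, each carrying a style, can be illustrated as follows. This is a toy sketch, not the authors' method: real visual style is the computed style produced by a browser's rendering engine, whereas the `TOY_TAG_STYLES` table, the single `font-weight` attribute, and the class name below are assumptions introduced only for this example.

```python
from html.parser import HTMLParser

# Assumption: a tiny stand-in for a browser's default style sheet.
TOY_TAG_STYLES = {"b": {"font-weight": "bold"}}
DEFAULT_STYLE = {"font-weight": "normal"}


class StyledTextCollector(HTMLParser):
    """Records each text node with its element path and a toy inherited style."""

    def __init__(self):
        super().__init__()
        self.stack = []       # names of the currently open elements
        self.text_nodes = []  # (element path, text, style) triples

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open element.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return  # ignore whitespace-only fragments
        style = dict(DEFAULT_STYLE)
        for tag in self.stack:  # outermost first, so inner elements override
            style.update(TOY_TAG_STYLES.get(tag, {}))
        self.text_nodes.append(("/".join(self.stack), text, style))


collector = StyledTextCollector()
collector.feed("<LI> <B> Jul. </B> Construction started. </LI>")
collector.close()

for path, text, style in collector.text_nodes:
    print(path, repr(text), style)
```

The text fragment inside B ends up with a different style from the fragment directly inside LI, which is the kind of per-node difference a style-based method can exploit.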