Content Extraction from Webpages Using Machine Learning
Master's Thesis
Hamza Yunis
Bauhaus Universität
26.01.2017
Supervised by: Prof. Benno Stein, Dr. Andreas Jakoby
Advised by: Johannes Kiesel
Motivation
Hamza Yunis (Bauhaus Universität) Content Extraction from Webpages Using Machine Learning 2 /35
What is the Main Content?
Definition (i): The main content is what the webpage is supposed to communicate according to the publisher.
- We cannot always tell what the webpage publisher wants to communicate.
- A single webpage may have different publishers, each wanting to communicate a different type of information.
Definition (ii): The main content is what makes the webpage interesting to the user.
- Different users may have different interests in the webpage.
Definition (iii): The main content of a webpage consists of information that cannot be found in other webpages.
- This definition is usually used in template recognition.
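Definition (iii) can be illustrated with a small sketch: a text block counts as main content if it does not also appear on other pages, and as template material otherwise. This is only an illustration of the definition, not the method used in the thesis. The function name `unique_blocks`, the representation of a page as a list of text blocks, and the sample strings are all assumptions made for the example.

```python
def unique_blocks(page_blocks, other_pages):
    """Return the blocks of one page that occur on none of the other pages.

    page_blocks: list of text blocks from the page under inspection.
    other_pages: list of pages, each given as a list of text blocks.
    (Hypothetical representation; a real system would first segment the HTML.)
    """
    seen_elsewhere = set()
    for blocks in other_pages:
        # Compare on stripped text so stray whitespace does not hide a match.
        seen_elsewhere.update(block.strip() for block in blocks)
    return [block for block in page_blocks if block.strip() not in seen_elsewhere]


# Hypothetical sample data: two pages that share a navigation bar and a footer.
page = ["Home | About | Contact", "Text that appears only on this page.", "Legal notice"]
others = [["Home | About | Contact", "Text of a different page.", "Legal notice"]]

print(unique_blocks(page, others))
```

Exact string matching is the simplest possible comparison; it would miss template blocks that vary slightly between pages, such as a footer containing the current date.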
The main content is the non-noisy content!
What is the Noisy Content?
- Advertisements.
- Navigation links.
- Links to promoted webpages.
- Legal information.
- Irrelevant information.
- Input elements.
Types of HTML Elements
- Content elements.
- Inline semantic elements.
- Sectioning elements.

    <ul>
      <li>List item 1.</li>
      <li>List item 2.</li>
    </ul>
    <div>
      <p>This is the <span class="important">first</span> paragraph.</p>
      <p>This is the <span class="important">second</span> paragraph.</p>
    </div>
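The three element types can be made concrete by labelling each tag in the example markup. The sketch below uses Python's standard `html.parser` module. The mapping `ELEMENT_TYPES` is an assumption made for illustration: the slide names the three types but does not state which tag belongs to which, so the grouping here (`<p>` and `<li>` as content, `<span>` as inline semantic, `<ul>` and `<div>` as sectioning) is a guess. The names `TagTypeCollector` and `element_types` are likewise hypothetical.

```python
from html.parser import HTMLParser

# Assumed mapping from tag name to element type (not taken from the slides).
ELEMENT_TYPES = {
    "p": "content",
    "li": "content",
    "span": "inline semantic",
    "ul": "sectioning",
    "div": "sectioning",
}


class TagTypeCollector(HTMLParser):
    """Record every start tag together with its assumed element type."""

    def __init__(self):
        super().__init__()
        self.seen = []  # (tag, type) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, ELEMENT_TYPES.get(tag, "unknown")))


def element_types(html):
    collector = TagTypeCollector()
    collector.feed(html)
    collector.close()
    return collector.seen


SNIPPET = """
<ul>
  <li>List item 1.</li>
  <li>List item 2.</li>
</ul>
<div>
  <p>This is the <span class="important">first</span> paragraph.</p>
  <p>This is the <span class="important">second</span> paragraph.</p>
</div>
"""

for tag, kind in element_types(SNIPPET):
    print(f"<{tag}>: {kind}")
```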
Elements to Be Classified
- Paragraph elements: <p>.
- <div> elements.
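Collecting the elements to be classified could look roughly like the following sketch, which gathers every `<p>` and `<div>` element together with its text so that each can later be labelled as main or noisy content. It uses only Python's standard `html.parser` and assumes well-formed markup in which every `<p>` and `<div>` is explicitly closed. The names `CandidateCollector` and `candidate_elements` are hypothetical, and the thesis may extract its candidates differently.

```python
from html.parser import HTMLParser

CANDIDATE_TAGS = {"p", "div"}  # the two element kinds named on the slide


class CandidateCollector(HTMLParser):
    """Collect <p> and <div> elements with their whitespace-normalized text."""

    def __init__(self):
        super().__init__()
        self.open = []        # stack of (tag, text_parts) for open candidates
        self.candidates = []  # (tag, text) pairs, in the order the tags close

    def handle_starttag(self, tag, attrs):
        if tag in CANDIDATE_TAGS:
            self.open.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every open candidate, so a <div> also receives
        # the text of the <p> elements nested inside it.
        for _, parts in self.open:
            parts.append(data)

    def handle_endtag(self, tag):
        # Assumes well-formed nesting; a mismatched end tag is ignored.
        if tag in CANDIDATE_TAGS and self.open and self.open[-1][0] == tag:
            _, parts = self.open.pop()
            text = " ".join("".join(parts).split())
            self.candidates.append((tag, text))


def candidate_elements(html):
    collector = CandidateCollector()
    collector.feed(html)
    collector.close()
    return collector.candidates


EXAMPLE = """
<ul><li>List item 1.</li><li>List item 2.</li></ul>
<div>
  <p>This is the <span class="important">first</span> paragraph.</p>
  <p>This is the <span class="important">second</span> paragraph.</p>
</div>
"""

for tag, text in candidate_elements(EXAMPLE):
    print(f"<{tag}>: {text}")
```

The list items are skipped because `<ul>` and `<li>` are not among the candidate tags, while the `<div>` is reported after its two paragraphs because candidates are recorded when their end tag is reached.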