Content Extraction from Webpages Using Machine Learning (Master’s Thesis Presentation)


  1. Content Extraction from Webpages Using Machine Learning. Master’s Thesis. Hamza Yunis, Bauhaus Universität, 26.01.2017. Supervised by: Prof. Benno Stein, Dr. Andreas Jakoby. Advised by: Johannes Kiesel.

  2. Motivation

  10. What is the Main Content?
      Definition (i): The main content is what the webpage is supposed to communicate according to the publisher.
        - We cannot always tell what the webpage publisher wants to communicate.
        - A single webpage may have different publishers, each wanting to communicate a different type of information.
      Definition (ii): The main content is what makes the webpage interesting to the user.
        - Different users may have different interests in the webpage.
      Definition (iii): The main content of a webpage consists of information that cannot be found in other webpages.
        - Usually used in template recognition.
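Definition (iii) can be illustrated with a small sketch. It assumes that several pages from the same site have already been split into text blocks, and it treats any block that occurs on more than one page as template material. The function name `split_template_and_unique` and the sample pages are hypothetical; this is not the method used in the thesis, only a minimal illustration of the idea behind template recognition.

```python
from collections import Counter


def split_template_and_unique(pages):
    """Split each page's text blocks into unique and shared (template) blocks.

    `pages` maps a page name to a list of text blocks. A block that occurs on
    more than one page is treated as template; every other block is unique.
    """
    # Count on how many distinct pages each block appears.
    page_counts = Counter()
    for blocks in pages.values():
        page_counts.update(set(blocks))

    result = {}
    for name, blocks in pages.items():
        unique = [b for b in blocks if page_counts[b] == 1]
        template = [b for b in blocks if page_counts[b] > 1]
        result[name] = {"unique": unique, "template": template}
    return result


# Hypothetical pages from the same site, already split into text blocks.
pages = {
    "article-1": ["Home | News | Contact", "Article one body text.", "Copyright notice"],
    "article-2": ["Home | News | Contact", "Article two body text.", "Copyright notice"],
}

split = split_template_and_unique(pages)
print(split["article-1"]["unique"])  # ['Article one body text.']
```

Exact string matching is the simplest possible comparison; real templates often differ slightly between pages, so a practical system would need a more tolerant notion of "the same block".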

  21. What is the Main Content? The main content is the non-noisy content!

  22. What is the Noisy Content?
        - Advertisements.
        - Navigation links.
        - Links to promoted webpages.
        - Legal information.
        - Irrelevant information.
        - Input elements.

  29. Types of HTML Elements
        - Content elements.
        - Inline semantic elements.
        - Sectioning elements.

      <ul>
        <li>List item 1.</li>
        <li>List item 2.</li>
      </ul>
      <div>
        <p>This is the <span class="important">first</span> paragraph.</p>
        <p>This is the <span class="important">second</span> paragraph.</p>
      </div>
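The three element types can be made concrete by walking the slide's HTML sample and grouping its tags. The slide does not say which tag belongs to which type, so the mapping in `ELEMENT_TYPES` below is an assumption made for illustration (`<p>` and `<li>` as content elements, `<span>` as an inline semantic element, `<ul>` and `<div>` as sectioning elements). The class name `ElementTypeCollector` is also hypothetical. The sketch uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumed tag-to-type mapping; the slide does not spell it out.
ELEMENT_TYPES = {
    "p": "content",
    "li": "content",
    "span": "inline semantic",
    "ul": "sectioning",
    "div": "sectioning",
}


class ElementTypeCollector(HTMLParser):
    """Group the start tags of a document by their assumed element type."""

    def __init__(self):
        super().__init__()
        self.tags_by_type = {}

    def handle_starttag(self, tag, attrs):
        # Tags missing from the mapping are collected under "other".
        kind = ELEMENT_TYPES.get(tag, "other")
        self.tags_by_type.setdefault(kind, []).append(tag)


# The HTML sample shown on the slide.
SAMPLE = """
<ul>
  <li>List item 1.</li>
  <li>List item 2.</li>
</ul>
<div>
  <p>This is the <span class="important">first</span> paragraph.</p>
  <p>This is the <span class="important">second</span> paragraph.</p>
</div>
"""

parser = ElementTypeCollector()
parser.feed(SAMPLE)
parser.close()

for kind, tags in parser.tags_by_type.items():
    print(kind, tags)
```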

  34. Types of HTML Elements: Elements to Be Classified
        - Paragraph elements: <p>.
        - <div> elements.
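As a rough sketch of what "elements to be classified" could mean in practice, the following collects every `<p>` and `<div>` element of a document together with its text, so that each one can later be passed to a classifier. The class name `CandidateCollector` and the choice to represent each candidate as a `(tag, text)` pair are assumptions for illustration; the slides do not describe how candidates are represented or which features are extracted.

```python
from html.parser import HTMLParser

# Tags named on the slide as elements to be classified.
CANDIDATE_TAGS = {"p", "div"}


class CandidateCollector(HTMLParser):
    """Collect <p> and <div> elements with their whitespace-normalized text."""

    def __init__(self):
        super().__init__()
        self._open = []       # stack of (tag, text_parts) for open candidates
        self.candidates = []  # (tag, text) pairs, in the order elements close

    def handle_starttag(self, tag, attrs):
        if tag in CANDIDATE_TAGS:
            self._open.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every candidate element that is currently open,
        # so a <div> also receives the text of the <p> elements inside it.
        for _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag in CANDIDATE_TAGS and self._open and self._open[-1][0] == tag:
            open_tag, parts = self._open.pop()
            text = " ".join("".join(parts).split())
            self.candidates.append((open_tag, text))


SAMPLE = """
<ul>
  <li>List item 1.</li>
</ul>
<div>
  <p>This is the <span class="important">first</span> paragraph.</p>
  <p>This is the <span class="important">second</span> paragraph.</p>
</div>
"""

collector = CandidateCollector()
collector.feed(SAMPLE)
collector.close()

for tag, text in collector.candidates:
    print(tag, "->", text)
```

The list items are skipped because only `<p>` and `<div>` are treated as candidates here. The sketch assumes well-formed markup with explicit closing tags; unclosed elements would need extra handling.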
