Introducing Web Fragments: An exploration of web archives beyond the webpages

  1. Introducing Web Fragments: An exploration of web archives beyond the webpages. Quentin Lobbé (LTCI, Télécom ParisTech & Inria Paris). DBWeb seminar, May 31, 2017

  2. The e-diasporas Atlas > A collection of online migrant collectives (examples: marocainsdumonde.gov.ma, larbi.org, yabiladi.com, moroccoboard.com) > A migrant website is a website created or managed by migrants and/or that deals with them > An e-diaspora is a directed network of migrant websites linked by URLs > 10,000 migrant websites crawled, categorized and organized among 30 e-diasporas

  3. The e-diasporas Atlas > A tool for sociological analysis > Moroccan e-diasporas (yabiladi.com)

  4. The e-diasporas Atlas > A tool for sociological analysis > Moroccan e-diasporas (yabiladi.com) [Diagram labels: Associations, Blogs, Institutions]

  5. Facing the evolutions of e-Diasporas ... > new website > alternative spaces of expression > death of blogs > new link > Moroccan e-diasporas (yabiladi.com)

  6. … we build a corpus of web archives > To keep a trace of the evolutions of websites [Diagram: record 1 at time 1 and record 2 at time 2, each holding archived pages (page 1, page 2, page 3)] > Our corpus is a 70 TB web archive, categorized by e-diaspora, crawled weekly or monthly between 2010 and 2015, and hosted at the INA

  7. Our original research questions > Considering the e-Diasporas archived corpus: are the structure and content of the archived e-Diasporas permeable to the effects of shocks and external events such as political and social mobilizations? > Considering any archived corpus: how can we follow traces through web archives in order to deal with a given event and its genesis by restoring it in the dual temporality of the web and the real?

  8. The naive approach > focusing on the particular case of yabiladi.com: a hub at the center of the network; an old and hybrid website (forum, videos, news, dating), online since 2002 > 2.8 million archived pages

  9. The naive approach > considering all the archived pages as traces of activities on the website [Chart: number of new archived pages by day] > Are those peaks and valleys relevant?
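
The naive count plotted on this slide can be sketched as follows. This is only an illustration: the `first_seen` mapping (URL to date of first archived download) and its values are made up, and the slides do not say how the original counts were computed.

```python
from collections import Counter
from datetime import date


def new_pages_per_day(first_seen):
    """Count, for each day, the pages whose first archived copy dates from that day."""
    return Counter(first_seen.values())


# Hypothetical data: URL -> date of the first archived download.
first_seen = {
    "/forum/1": date(2012, 3, 1),
    "/forum/2": date(2012, 3, 1),
    "/news/7": date(2012, 3, 8),
}

counts = new_pages_per_day(first_seen)
for day in sorted(counts):
    print(day.isoformat(), counts[day])
```

Because the dates used here are download dates, the resulting peaks follow the crawler's schedule as much as the site's own activity.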

  10. The naive approach > considering all the archived pages as traces of activities on the website

  11. Web archives are not direct traces of the web > web archives should be considered as direct traces of the crawler [Diagram: continuous Web versus discrete archives, with downloads at date 1, date 2 and date 3] > We saw what we call a crawl legacy effect

  12. To avoid the crawl legacy effect > We propose to conduct an exploratory analysis of web archives which would go beyond the level of the webpages

  13. The original scale of web archives is the webpage > what can we learn from the structure of web archive files? [Diagram: .WARC and .DAFF records at t1 and t2, each with meta (crawler date, download date) and data (HTML content)] > by definition, web archives are built on top of webpages

  14. Archiving is all about selecting and destroying > as webpages change over time > structural changes: move, copy, delete, insert, update, … > attribute changes: CSS, font, … > type changes: <div> to <p> > semantic changes > [Image: "Boulevard du Temple", Louis Daguerre, 1838]

  15. Archiving on top of webpages comes with many challenges > Crawler blindness and archive quality [Diagram: edition dates, crawler dates, download dates, archived periods] > Web archiving comes with construction locks

  16. Archiving on top of webpages comes with many challenges > Archive consistency across pages [Diagram: p1 changes, p2 changes, p1 and p2 archives; is the href from p1 to p2 still consistent?] > Web archiving comes with navigation locks

  17. Archiving on top of webpages comes with many challenges > Pages with archive-like content [Diagram: p1 changes, p1 archives] > Archiving comes with discrete and continuous interpretation locks

  18. To face or reduce these challenges > We propose to build a new entity based on web archives, called web fragments [Diagram: meta, data, ?, web page]

  19. The web fragment > A structured part of a webpage with highly informative content (examples: an article, a news item, a comment) > New structure for web archives: meta (crawler date); data page (download date, page content); data frag (edition date, author, frag content)
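
As a rough sketch of this structure, the record below nests fragments inside an archived page. The field names follow the labels on the slide; the class names, types, URL and sample values are assumptions made for illustration, not the actual DAFF schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class WebFragment:
    """'data frag' level: one article, news item or comment."""
    content: str
    author: Optional[str] = None
    edition_date: Optional[datetime] = None


@dataclass
class ArchivedPage:
    """'meta' and 'data page' levels, plus the fragments found in the page."""
    url: str
    crawler_date: datetime
    download_date: datetime
    page_content: str
    fragments: List[WebFragment] = field(default_factory=list)

    def oldest_edition_date(self) -> Optional[datetime]:
        """Earliest edition date among the fragments, if any is known."""
        dates = [f.edition_date for f in self.fragments if f.edition_date]
        return min(dates) if dates else None


# Hypothetical page downloaded two days after its comment was written.
page = ArchivedPage(
    url="http://example.org/forum/thread-1.html",
    crawler_date=datetime(2017, 5, 10, 3, 0),
    download_date=datetime(2017, 5, 10, 3, 5),
    page_content="<html>...</html>",
    fragments=[WebFragment("blabla", "Kim", datetime(2017, 5, 8, 12, 28))],
)
print(page.oldest_edition_date())
```

Keeping the edition date on the fragment rather than on the page is what lets an analysis reach further back than the download date.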

  20. Finding web fragments > We must see a webpage as a front-end and back-end object [Diagram: front end (screen) next to back end (HTML source)]: <div id="com-" class="news news_plus" style="padding-top:0px;"> <div class="effet_special"> <div class="com-header"> <div class="comment-subject"> <div class="icone-comment iconuser_m" style=""> Kim </div> </div> <div class="com-info"> <a style="font-weight:400;" href="">Kim</a> 08 mai 2017 à 12h28 <br> <span class="com-auteur">08 mai 2017 à 12h28</span> </div> </div> Blabla <span id="nombre-" style="display:none;"></span> <div class="buzz"> </div> </div> <div class="com-content" id="content-comment8537568"> Blabla </div> <div style="float:left;width:100%;"> </div> </div> > Related works: a flat file, or an ordered tree, or an unordered tree

  21. Finding web fragments > A webpage is a 2D hierarchical list of HTML nodes (axes: depth and sequence): <div id="com-8537568" class="new comment"> <div class="effet_special"> <div class="com-header"> <div class="com-info"> <a class="com-author" href="/profil/24368/kim.html">Kim</a> <br> <span class="com-date">le 08 mai 2017 à 12h28</span> </div> </div> <span id="nombre-" style="display:none;"></span> <div class="buzz"> </div> </div> <div class="com-content" id="content-comment8537568"> blabla </div> <div style="float:left;width:100%;"> </div> </div> > Nodes are categorized among: title, author, date and text
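
One possible way to obtain such a depth-by-sequence list is sketched below with Python's standard `html.parser`. The `NodeLister` class, the `VOID_TAGS` set and the shortened sample markup are illustrative assumptions, not the extractor used in the talk.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NodeLister(HTMLParser):
    """Flatten a page into (sequence, depth, tag, class) tuples."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.nodes.append((len(self.nodes), self.depth, tag, css_class))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


# Shortened version of the comment markup shown on the slide.
html = (
    '<div class="comment"><div class="com-info">'
    '<a class="com-author">Kim</a><br>'
    '<span class="com-date">le 08 mai 2017 à 12h28</span></div>'
    '<div class="com-content">blabla</div></div>'
)

lister = NodeLister()
lister.feed(html)
lister.close()
for node in lister.nodes:
    print(node)
```

Each tuple gives a node's position in document order (sequence) and its nesting level (depth), the two axes named on the slide.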

  22. Finding web fragments > Nodes are selected based on markup, class and id using regex, e.g. <h1 id="title" class="title_comment"> Hello archives </h1> > Nodes are incrementally grouped into web fragments using ad-hoc rules: [ ∪ text ] or [ text ∪ _text ] or [ title ∪ text ] or [ date ∪ _text ] or [ author ∪ date ] ... > Algorithm: 1. Select nodes in DOM; 2. Group in fragments; 3. Group by list of fragments [Diagram: worked example in which a sequence of title, author, date and text nodes is grouped step by step into fragments such as [ text text ], [ title author text ] and [ author date text ]]
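
The selection and grouping steps could look roughly like the sketch below. The regular expressions in `RULES` and the single grouping rule (close a fragment when a non-text node follows a text node) are simplifying assumptions chosen for illustration; the ad-hoc rules actually used are only partially visible on the slide.

```python
import re

# Hypothetical patterns, tried in order against the tag, class and id of a node.
RULES = [
    ("title", re.compile(r"title|^h[1-6]$", re.I)),
    ("author", re.compile(r"author|auteur", re.I)),
    ("date", re.compile(r"date", re.I)),
    ("text", re.compile(r"content|text|^p$", re.I)),
]


def categorize(tag, css_class="", node_id=""):
    """Return 'title', 'author', 'date' or 'text', or None if nothing matches."""
    for category, pattern in RULES:
        if any(pattern.search(value) for value in (tag, css_class, node_id) if value):
            return category
    return None


def group(categories):
    """Group a sequence of node categories into fragments.

    Assumed rule: a fragment is closed when a non-text node follows a text node.
    """
    fragments, current = [], []
    for category in categories:
        if category is None:
            continue  # uncategorized nodes are ignored
        if current and current[-1] == "text" and category != "text":
            fragments.append(current)
            current = []
        current.append(category)
    if current:
        fragments.append(current)
    return fragments


print(categorize("h1", "title_comment", "title"))
print(group(["title", "author", "text", "author", "date", "text", "text"]))
```

With this rule, consecutive text nodes stay in the same fragment and each new title, author or date after a text node opens the next one.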

  23. Rethinking archive challenges using web fragments > Crawler blindness can be reduced and archive quality increased [Diagram: download date, edition date 1, edition date 2]: Yabiladi's oldest fragments go back to 2003 > We introduce a more permissive archive consistency based on fragments and user requests [Diagram: page 1 and page 2 linked by href, with stable fragments and a new fragment]

  24. Rethinking archive challenges using web fragments > Pages with archive-like content are no longer a problem when web fragments are the base search unit: identical fragments share the same id (sha256) > Web fragments help us expand web archives beyond web pages > Now let's see how we can concretely conduct an exploratory archive analysis ...
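
A minimal sketch of such a shared identifier is given below. The slide only states that matching fragments share a sha256 id; which fields are hashed and how they are normalized are assumptions made here.

```python
import hashlib


def fragment_id(author, edition_date, content):
    """Hash the whitespace-normalized fields of a fragment with SHA-256."""
    # Collapse runs of whitespace so that layout differences do not change the id.
    parts = [" ".join(part.split()) for part in (author, edition_date, content)]
    # Join with a separator that cannot appear after normalization.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


# The same comment seen on the thread page and on an archive-like listing page.
on_thread = fragment_id("Kim", "2017-05-08 12:28", "blabla")
on_listing = fragment_id("Kim", "2017-05-08 12:28", "  blabla\n")
other = fragment_id("Kim", "2017-05-08 12:28", "something else")

print(on_thread == on_listing, on_thread == other)
```

Indexing on this id means a fragment repeated across several pages is counted once.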

  25. Exploratory analysis of Web archives > Following John Wilder Tukey's work: an iterative process deliberately grounded in a logic of observation, discovery and astonishment > Steps: Acquire, Parse, Filter, Mine, Represent, Refine, Interpret

  26. Archives extraction engine (steps: Acquire, Parse, Filter, Mine, Represent, Refine, Interpret) > The Web Archives Explorer (part 1) [Diagram components: yabiladi.com, Crawler, .DAFF, ArchiveMiner, Fragments Extractor, External Resources; records with meta (crawler date), data page (download date, page content) and data frag (edition date, author, frag content)]

  27. Archives exploration engine (steps: Acquire, Parse, Filter, Mine, Represent, Refine, Interpret) > The Web Archives Explorer (part 2) [Diagram components: Index of Events, Index of Pages & Fragments, Full text, Facet, Ngrams, ArchiveSearch, ArchiveViz; records with meta (crawler date), data page (download date, page content) and data frag (edition date, author, frag content)]

  28. The validation of web fragments (steps: Acquire, Parse, Filter, Mine, Represent, Refine, Interpret) > Using an event detection system: 1. threshold-based detection; 2. identification with titles of news articles; 3. field and expert interpretations > Let's see the Web Archives Explorer in action (video presentation for CIKM2017)
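
Step 1 could be sketched as below. The slide does not specify the threshold, so the rule used here (mean plus `k` standard deviations of the daily fragment counts) and the sample counts are assumptions for illustration only.

```python
from statistics import mean, pstdev


def detect_events(daily_counts, k=2.0):
    """Return the days whose count exceeds mean + k * standard deviation."""
    values = list(daily_counts.values())
    threshold = mean(values) + k * pstdev(values)
    return [day for day, n in sorted(daily_counts.items()) if n > threshold]


# Made-up number of new fragments per day (by edition date).
daily_counts = {
    "2017-05-01": 10,
    "2017-05-02": 12,
    "2017-05-03": 11,
    "2017-05-04": 9,
    "2017-05-05": 95,
    "2017-05-06": 10,
    "2017-05-07": 12,
    "2017-05-08": 11,
}

events = detect_events(daily_counts)
print(events)
```

Days flagged this way would then go to steps 2 and 3, where they are matched against news titles and interpreted.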
