Introducing Web Fragments
An exploration of web archives beyond the webpages
Quentin Lobbé (LTCI, Télécom ParisTech & Inria Paris)
DBWeb seminar – May 31, 2017
The e-diasporas Atlas > A collection of online migrant collectives | marocainsdumonde.gov.ma | larbi.org
A migrant website is a website created or managed by migrants and/or that deals with them | yabiladi.com
An e-Diaspora is a directed network of migrant websites linked by URLs | moroccoboard.com
10,000 migrant websites crawled, categorized and organized among 30 e-diasporas
The e-diasporas Atlas > A tool for sociological analysis
yabiladi.com
> the Moroccan e-diaspora
The e-diasporas Atlas > A tool for sociological analysis
Associations | Blogs | Institutions
yabiladi.com
> the Moroccan e-diaspora
Facing the evolutions of e-Diasporas …
> new websites
> alternative spaces of expression
> death of blogs
> new links
yabiladi.com
> the Moroccan e-diaspora
… we built a corpus of web archives > To keep a trace of the evolutions of websites
[diagram: record 1 at time 1 captures pages 1–2; record 2 at time 2 captures pages 1–3]
> Our corpus is a 70 TB web archive, categorized by e-diaspora, crawled weekly or monthly between 2010 and 2015, and hosted at the INA
Our original research questions
> Considering the e-Diasporas archived corpus: Can the structure and content of the archived e-Diasporas be permeable to the effects of shocks and external events such as political and social mobilizations?
> Considering any archived corpus: How can we follow traces through web archives in order to deal with a given event and its genesis, by restoring it in the dual temporality of the web and of the real world?
The naive approach > focusing on the particular case of yabiladi.com
> a hub at the center of the network
> a long-standing and hybrid website (forum, news, videos, dating), online since 2002
> 2.8 million archived pages
The naive approach > considering all the archived pages as traces of activity on the website
[plot: number of new archived pages per day]
> Are those peaks and valleys relevant?
The naive approach > considering all the archived pages as traces of activity on the website
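As a rough illustration of this naive approach, here is a minimal Python sketch that counts newly archived pages per day from (url, download date) pairs; the record layout is an assumption for the sketch, not the actual archive schema.

```python
from collections import Counter
from datetime import date

def new_pages_per_day(records):
    """Count, for each day, how many pages are archived for the first time."""
    first_seen = {}                      # url -> earliest download date
    for url, download_date in records:
        if url not in first_seen or download_date < first_seen[url]:
            first_seen[url] = download_date
    return Counter(first_seen.values())  # date -> number of new pages

records = [
    ("yabiladi.com/a", date(2012, 3, 1)),
    ("yabiladi.com/a", date(2012, 3, 8)),   # re-crawl of the same page
    ("yabiladi.com/b", date(2012, 3, 8)),
]
print(new_pages_per_day(records))
# Counter({date(2012, 3, 1): 1, date(2012, 3, 8): 1})
```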
Web archives are not direct traces of the web > web archives should be considered as direct traces of the crawler
[diagram: the continuous Web vs. discrete archives, downloaded at dates 1, 2 and 3]
> We observe what we call a crawl legacy effect
To avoid the crawl legacy effect, we propose to conduct an exploratory analysis of web archives that goes beyond the level of the webpage
The original scale of web archives is the webpage > what can we learn from the structure of web archive files?
[diagram: .WARC / .DAFF records at t1 and t2, each made of meta (crawler date, download date) and data (HTML content)]
> by definition, web archives are built on top of webpages
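For illustration only, a hedged sketch of such a page-level record in Python; the field names mirror the slide's diagram but are not the real .WARC/.DAFF schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ArchivedPage:
    """Page-level archive record, as sketched on the slide (illustrative names)."""
    url: str
    crawler_date: datetime   # when the crawl was scheduled / launched
    download_date: datetime  # when the page was actually fetched
    html_content: str        # the raw payload: the only data at page scale
```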
Archiving is all about selecting and destroying > as webpages change over time
> structural changes: move, copy, delete, insert, update …
> attribute changes: CSS, font …
> type changes: <div> to <p>
> semantic changes
["Boulevard du Temple", Louis Daguerre, 1838]
Archiving on top of webpages goes with many challenges > Crawler blindness and archive quality
[timeline: edition dates, crawler dates, download dates, archived periods]
> Web archiving goes with construction locks
Archiving on top of webpages goes with many challenges > Archive consistency across pages
[diagram: p1 and p2 change at different rates; following an href between their archives raises a consistency question]
> Web archiving goes with navigation locks
Archiving on top of webpages goes with many challenges > Pages with archive-like content
[diagram: how p1 changes vs. how p1 is archived]
> Archiving goes with discrete and continuous interpretation locks
To face or reduce these challenges, we propose to build a new entity based on web archives, called web fragments
[diagram: where do fragments sit between the metadata and the data of an archived webpage?]
The web fragment > A structured part of a webpage with high informational content
> New structure for web archives: a page record keeps its meta (crawler date, download date) and its page content, and now also carries fragments (e.g. an article, a news item, a comment), each with its own edition date, author and fragment content
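A possible way to picture this new structure, extending the page record sketched earlier with a list of fragments; the names are assumptions, not the actual implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class WebFragment:
    """A structured part of a page: e.g. an article, a news item, a comment."""
    edition_date: Optional[datetime]  # date written inside the page itself
    author: Optional[str]
    content: str

@dataclass
class FragmentedPage:
    url: str
    crawler_date: datetime
    download_date: datetime
    page_content: str
    fragments: List[WebFragment] = field(default_factory=list)
```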
Finding web fragments > We must see a webpage as a front-end & back-end object
> front end: what the reader sees on screen (e.g. a comment posted by Kim on 08 mai 2017 à 12h28)
> back end: the underlying HTML source, for instance

  <div id="com-" class="news news_plus" style="padding-top:0px;">
    <div class="com-header">
      <a style="font-weight:400;" href="">Kim</a>
      <span class="com-auteur">08 mai 2017 à 12h28</span>
    </div>
    <div class="com-content" id="content-comment8537568">Blabla</div>
  </div>

> the back end can be modelled as a flat file, an ordered tree or an unordered tree
Related works:
Finding web fragments > A webpage is a 2D hierarchical list of HTML nodes, ordered by depth (nesting) and sequence (document order)

  <div id="com-8537568" class="new comment">
    <div class="com-header">
      <div class="com-info">
        <a class="com-author" href="/profil/24368/kim.html">Kim</a>
        <span class="com-date">le 08 mai 2017 à 12h28</span>
      </div>
    </div>
    <div class="com-content" id="content-comment8537568">blabla</div>
  </div>

> Nodes are categorized as: title, author, date and text
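One way to obtain this 2D view with only the standard library is sketched below: the parser records each element's nesting depth and document order. This is an illustrative reconstruction, not the talk's actual extractor.

```python
from html.parser import HTMLParser

# Void elements (<br>, <img>, ...) are treated as leaves so they do not distort depth.
VOID = {"br", "img", "hr", "meta", "link", "input"}

class FlatteningParser(HTMLParser):
    """Flatten an HTML document into (depth, sequence, tag, attrs) tuples."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.sequence = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append((self.depth, self.sequence, tag, dict(attrs)))
        self.sequence += 1
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.depth -= 1

parser = FlatteningParser()
parser.feed('<div class="comment"><a class="com-author">Kim</a>'
            '<span class="com-date">le 08 mai 2017 à 12h28</span></div>')
print(parser.nodes)
# [(0, 0, 'div', {'class': 'comment'}), (1, 1, 'a', {...}), (1, 2, 'span', {...})]
```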
Finding web fragments
> Nodes are selected based on markup, class and id, using regex, e.g. <h1 id="title" class="title_comment">Hello archives</h1>
> Nodes are incrementally grouped into web fragments using ad-hoc rules: [ U text ] or [ text U _text ] or [ title U text ] or [ date U _text ] or [ author U date ] …
> Algorithm: 1. select nodes in the DOM; 2. group nodes into fragments; 3. group fragments into lists of fragments (a sketch follows below)
> Example trace: a stream of labelled nodes (text, text, title, author, text, author, date, text, …) is progressively folded into fragments such as [ title author text ] and [ author date text ]
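Below is a hedged sketch of the first two steps (node labelling with regexes, greedy grouping into fragments); the patterns and the grouping rule are illustrative stand-ins for the ad-hoc rules mentioned on the slide, not those used for yabiladi.com.

```python
import re

# Illustrative patterns: label a node from its tag, class and id.
PATTERNS = {
    "title":  re.compile(r"h1|h2|title|subject", re.I),
    "author": re.compile(r"author|auteur|user|profil", re.I),
    "date":   re.compile(r"date|time|\d{1,2}\s+\w+\s+\d{4}", re.I),
    "text":   re.compile(r"content|comment|news|p\b", re.I),
}

def label(tag, attrs):
    """Step 1: categorize a DOM node as title / author / date / text (or None)."""
    hint = " ".join([tag, attrs.get("class", ""), attrs.get("id", "")])
    for category, pattern in PATTERNS.items():
        if pattern.search(hint):
            return category
    return None

def group(labelled_nodes):
    """Step 2: greedy grouping; a fragment is closed when a label repeats."""
    fragments, current = [], []
    for node in labelled_nodes:
        if node["label"] in {n["label"] for n in current}:
            fragments.append(current)
            current = []
        current.append(node)
    if current:
        fragments.append(current)
    return fragments

print(label("span", {"class": "com-date"}))  # 'date'
nodes = [{"label": l} for l in ["title", "author", "text", "author", "date", "text"]]
print([[n["label"] for n in frag] for frag in group(nodes)])
# [['title', 'author', 'text'], ['author', 'date', 'text']]
```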
Rethinking archive challenges using web fragments
> Crawler blindness can be reduced and archive quality increased: fragments carry edition dates that predate the download date; Yabiladi's oldest fragments go back to 2003
> We introduce a more permissive archive consistency, based on fragments and user requests: when following an href from page 1 to page 2, the stable fragments must still be present, while newly added fragments are tolerated (see the sketch below)
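A minimal sketch of this fragment-level consistency check, assuming fragments are identified by ids (for instance the sha256 ids sketched after the next slide):

```python
def split_fragments(p1_fragment_ids, p2_fragment_ids):
    """Partition page 2's fragments relative to page 1's archived version."""
    p1, p2 = set(p1_fragment_ids), set(p2_fragment_ids)
    stable = p1 & p2    # fragments present in both archived versions
    new = p2 - p1       # fragments added between the two crawls (tolerated)
    missing = p1 - p2   # fragments that disappeared (potential inconsistency)
    return stable, new, missing

stable, new, missing = split_fragments({"f1", "f2"}, {"f1", "f2", "f3"})
print(not missing)  # True: p2 is consistent with p1, f3 is simply a new fragment
```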
Rethinking archive challenges using web fragments
> Pages with archive-like content are no longer a problem once web fragments become the base search unit: duplicated fragments share the same id (sha256)
> Web fragments help us expand web archives beyond webpages
Now let's see how we can concretely conduct an exploratory archive analysis …
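A possible implementation of the shared id, assuming it is derived from the normalized fragment content; the normalization step itself is a guess, not the actual one.

```python
import hashlib

def fragment_id(author, edition_date, content):
    """sha256 id of a fragment, so duplicated fragments collapse to one entry."""
    normalized = "|".join([
        (author or "").strip().lower(),
        (edition_date or "").strip(),
        " ".join(content.split()),   # collapse whitespace differences
    ])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

index = {}
for frag in [("Kim", "2017-05-08", "Blabla"), ("Kim", "2017-05-08", "  Blabla ")]:
    index.setdefault(fragment_id(*frag), []).append(frag)
print(len(index))  # 1: both occurrences share the same id
```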
Exploratory analysis of Web archives > Following John Wilder Tukey's work
An iterative process that is deliberately part of a logic of observation, discovery and astonishment
Acquire → Parse → Filter → Mine → Represent → Refine → Interpret
Archives extraction engine > The Web Archives Explorer (part 1)
Acquire → Parse → Filter → Mine → Represent → Refine → Interpret
[pipeline: a Crawler fetches yabiladi.com and external resources, a Fragments Extractor and an ArchiveMiner produce .DAFF records holding the page meta (crawler date, download date), the page content and the fragments (edition date, author, fragment content)]
Archives exploration engine > The Web Archives Explorer (part 2)
Acquire → Parse → Filter → Mine → Represent → Refine → Interpret
[pipeline: pages and fragments feed an Index of Pages & Fragments and an Index of Events; ArchiveSearch exposes full-text, faceted and n-gram queries; ArchiveViz renders the results]
The validation of web fragments > Using an event detection system
Acquire → Parse → Filter → Mine → Represent → Refine → Interpret
1. threshold-based detection
2. identification with the titles of news articles
3. field and expert interpretations
> Let's see the Web Archives Explorer in action (video presentation for CIKM2017)
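As an illustration of step 1, here is a minimal threshold-based peak detector over a daily series of counts; the series, the threshold rule (mean + k·stdev) and the value of k are assumptions for the sketch, not the system's actual parameters.

```python
from statistics import mean, stdev

def detect_peaks(daily_counts, k=2.0):
    """Flag days whose count exceeds the mean by k standard deviations."""
    values = list(daily_counts.values())
    threshold = mean(values) + k * stdev(values)
    return [day for day, count in daily_counts.items() if count > threshold]

series = {f"2011-02-{day:02d}": 5 for day in range(1, 11)}
series["2011-02-11"] = 120   # a burst of activity
print(detect_peaks(series))  # ['2011-02-11']
```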