WEB ARCHIVING @ LIP6
Stéphane Gançarski, Z. Pehlivan, M. Cord, M. Ben-Saad, A. Sanoja, N. Thome, M. Law
French ANR project Cartec (ended)
European project Scape (ends 9/2014)
Web Archives
The Web is ephemeral and constantly evolving: we need to preserve information.
WEB: 50,000,000,000 pages (Google@2012), non-cumulative index, concern: freshness.
WEB ARCHIVES: 165,000,000,000 pages (Internet Archive@2012), cumulative index, concerns: coherence, completeness, preservation.
Issues
Crawling: avoid duplicates, temporal and spatial completeness, coherence, discovery.
Access: temporal queries, IR, indexing, navigation, coherence.
Preservation: emulation, migration, cloud computing.
1. Efficient crawling using segmentation and patterns
Efficient: maximize temporal completeness and coherence under limited resources (bandwidth, politeness, storage…).
Temporal completeness: how well the archive captures the history of a page. Relevant for medium-size archives (e.g. INA legal deposit); for large archives, spatial completeness matters more.
Temporal coherence: capture versions of different pages that were present at the same time on the Web.
Temporal completeness
temporal completeness = (importance of captured versions) / (importance of all versions that appeared on the Web)
Importance of a page version: ω(v_{i,j}) = ω(P_i) × impCh(v_{i,j}, v_{i,j-1})
where ω(P_i) is the page importance and impCh the importance of the changes between consecutive versions. How to measure it?
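The ratio above can be sketched in a few lines. This is an illustrative reading of the formula, not the project's code; the function and variable names (`version_importance`, `temporal_completeness`) are hypothetical, and each version is represented by its page importance ω(P_i) and change importance impCh.

```python
# Hypothetical sketch of the temporal-completeness ratio:
# importance of captured versions over importance of all versions.

def version_importance(page_importance, change_importance):
    """Importance of one page version: page weight times change importance."""
    return page_importance * change_importance

def temporal_completeness(captured, all_versions):
    """Ratio of importance of captured versions to that of all versions
    that appeared on the Web (1.0 for an empty history)."""
    total = sum(version_importance(p, c) for p, c in all_versions)
    got = sum(version_importance(p, c) for p, c in captured)
    return got / total if total else 1.0

# Example: two pages; one version was missed by the crawler.
all_vs = [(1.0, 0.8), (1.0, 0.2), (0.5, 0.9)]
caught = [(1.0, 0.8), (0.5, 0.9)]
print(round(temporal_completeness(caught, all_vs), 3))  # 0.862
```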
Measuring change importance
What are the changes (update, insert, …) between version n-1 and version n? Are they important? It depends on where the change occurs.
Change importance is somehow related to what users see on the page:
- render pages before analysis;
- users see blocks of information, so use web page segmentation.
18 Novembre 2011 — Qualité des Archives Web : Modélisation et Optimisation
Global overview
[Pipeline diagram: the crawler fetches versions V(n-1), V(n) from the Web into the archive; change detection produces time series; pattern discovery yields patterns, which feed importance estimation.]
Segmentation
BOM (Block-o-Matic) is an extension of VIPS [Cai03]: a page is segmented into a hierarchy of blocks (B1, B2, B3; B2.1, B2.2, B2.3, …), each block carrying its links, images and texts. The result is stored as a Vi-XML document, roughly:

  <xml>
    <Page url='' version='' …>
      <Block ref='B1' pos=''>
        <Links id=''>
          <link name='' adr=''/>
          <link name='' adr=''/>
        </Links>
        <Images id=''>
          <img name='' src=''/>
        </Images>
        <Texts id='' text=''/>
      </Block>
      <Block ref='B2' id='' …>
        …
      </Block>
    </Page>
  </xml>
Changes detection
Vi-DIFF compares the Vi-XML documents of two versions V(n-1) and V(n) of a page and detects:
- structural changes (O(n²))
- content changes (O(n log n))
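A toy version of what such a diff produces: compare the block ids of two versions to find structural changes (inserted/deleted blocks) and flag shared blocks whose content changed. Real Vi-DIFF works on the XML trees; this flat dictionary form is a simplification and all names are illustrative.

```python
# Simplified block-level diff: old/new map block id -> content string.

def vi_diff(old, new):
    """Return inserted, deleted and updated block ids between two versions."""
    inserted = sorted(set(new) - set(old))          # structural change: new block
    deleted = sorted(set(old) - set(new))           # structural change: removed block
    updated = sorted(b for b in set(old) & set(new) if old[b] != new[b])
    return {"insert": inserted, "delete": deleted, "update": updated}

v1 = {"B1": "menu", "B2.1": "news of monday", "B3": "footer"}
v2 = {"B1": "menu", "B2.1": "news of tuesday", "B2.2": "ad"}
print(vi_diff(v1, v2))
# {'insert': ['B2.2'], 'delete': ['B3'], 'update': ['B2.1']}
```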
From changes to crawl scheduling
From successive delta files we can compute change patterns for pages. Blocks are weighted; following [Song&al@WWW04], the change importance between page versions is the (normalized) weighted sum of the block changes.
From page patterns and the last crawl date, we compute an urgency function that estimates the change importance accumulated on the page since the last crawl. We crawl the page with maximum urgency.
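The scheduling step above can be sketched as follows. This is a minimal sketch under assumptions not in the slides: a pattern is taken to be a 24-value list of expected change importance per hour of day, and `urgency`/`pick_next` are hypothetical names.

```python
# Urgency-based crawl scheduling sketch: accumulate expected change
# importance since the last crawl, then crawl the most urgent page.

def urgency(pattern, last_crawl_hour, now_hour):
    """Change importance accumulated on a page since its last crawl."""
    return sum(pattern[h % 24] for h in range(last_crawl_hour, now_hour))

def pick_next(pages, now_hour):
    """Select the page with maximum urgency."""
    return max(pages, key=lambda p: urgency(p["pattern"], p["last_crawl"], now_hour))

pages = [
    {"url": "a", "pattern": [0.1] * 24, "last_crawl": 0},  # changes slowly, stale
    {"url": "b", "pattern": [0.5] * 24, "last_crawl": 9},  # changes fast, fresh
]
# Page "a" accumulated 10 h * 0.1 = 1.0, page "b" only 1 h * 0.5 = 0.5.
print(pick_next(pages, 10)["url"])  # a
```

A slowly changing but long-neglected page can thus overtake a fast-changing, recently crawled one, which is the point of accumulating importance rather than ranking by change rate alone.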
Experiments
Page versions are crawled from a "complete" archive, so that we can compute the completeness achieved by each strategy.
2. Accessing Web archives
What if we click on this link?
Coherent navigation
Related work (navigating from P1[tq] to a version of P2):
- Recent: returns the closest version captured before tq → P2[t1]
- Nearest: returns the version minimizing |tq − tx| → P2[t2]
Our approach: choose the version of P2 that has the maximum probability of being coherent with P1[tq], according to P1's and P2's change patterns.
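The two baseline policies are easy to make concrete; the coherence-based choice itself depends on the learned change patterns and is not reproduced here. A version list is represented as (capture timestamp, version label) pairs; function names are illustrative.

```python
# Baseline version-selection policies for following a link at query time tq.

def recent(versions, tq):
    """Closest capture time at or before tq (None if nothing precedes tq)."""
    before = [t for t, _ in versions if t <= tq]
    return max(before) if before else None

def nearest(versions, tq):
    """Capture time minimizing |tq - tx|, regardless of side."""
    return min((t for t, _ in versions), key=lambda t: abs(tq - t))

p2 = [(10, "v1"), (25, "v2")]
print(recent(p2, 20), nearest(p2, 20))  # 10 25
```

The example shows why the two policies can disagree: at tq = 20, Recent picks the capture at t = 10 while Nearest prefers t = 25, and neither choice is guaranteed to be coherent with the page the user came from.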
Experiments
We count how many times the coherent version is chosen.
Dataset: 60 France TV channels, 1,000 hourly crawls.
Simulation: links between pages, crawling.
Results: 15% better than Nearest, 40% better than Recent.
Access to WACs today
Wayback Machine, Internet Archive
Access to WACs today
Full-text search
Why a query language?
Web users ≠ WAC users (historians, journalists, researchers, web philologists, web archaeologists).
For Web users, full-text search and navigation are usually sufficient; for WAC users, they are not.
Operators
- Classic operators
- InBlock: search only inside blocks
- Get the version of page p at time t (Wayback). Incomplete: if p is not crawled at t, apply Nearest/Recent/Coherent
- Navigational operators: in, out, jump (e.g. following links from page A to pages B, C, D, E)
Static index pruning
Index compression: discard the postings (term, doc) least likely to affect retrieval performance, so that (part of) the index fits in main memory. Done off-line.
State of the art:
- give a score to each posting;
- filter out a part of the postings based on a threshold;
- obtain a prune ratio (% of removed postings).
Methods: Random, TCP (Carmel et al., SIGIR '01), IP-u (Chen et al., CIKM '12), 2N2P (Thota et al., ECIR '11), PRPP (Blanco et al., ACM '10).
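The score-and-threshold scheme can be sketched as follows. The scoring values here are stand-ins (e.g. a tf-like score); the cited methods (TCP, IP-u, 2N2P, PRPP) use more elaborate scoring functions, and `prune_index` is a hypothetical name.

```python
# Threshold-based static index pruning sketch: rank postings by score,
# keep only the top (1 - prune_ratio) fraction.

def prune_index(postings, prune_ratio):
    """postings: list of (term, doc, score). Returns the retained postings."""
    ranked = sorted(postings, key=lambda p: p[2], reverse=True)
    keep = int(len(ranked) * (1 - prune_ratio))
    return ranked[:keep]

index = [("iraq", 1, 3.2), ("iraq", 2, 0.4), ("war", 1, 2.1), ("war", 3, 0.1)]
kept = prune_index(index, 0.5)  # prune ratio 50%
print([(t, d) for t, d, _ in kept])  # [('iraq', 1), ('war', 1)]
```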
Introducing diversification
Existing pruning techniques are not designed with time in mind (consider the query "Iraq War", which spans several periods). The temporal dimension should be preserved while pruning: an ANOVA test shows a link between temporal coverage and retrieval performance.
We design three methods that take diversification into account for pruning. Temporal aspects are modelled with windows: fixed size (simple and sliding) or dynamic (Gaussian mixture model).
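The fixed-window idea can be illustrated as follows: instead of one global threshold, apply the prune ratio inside each time window, so every period keeps its best postings. This is a simplification of the fixed/sliding/GMM variants in the slides, with hypothetical names and data.

```python
# Diversification-aware pruning sketch: prune within each fixed-size
# time window so no period is wiped out entirely.
from collections import defaultdict

def prune_with_windows(postings, prune_ratio, window):
    """postings: (term, doc, score, timestamp). Keep top postings per window."""
    buckets = defaultdict(list)
    for p in postings:
        buckets[p[3] // window].append(p)   # bucket by time window
    kept = []
    for bucket in buckets.values():
        bucket.sort(key=lambda p: p[2], reverse=True)
        kept.extend(bucket[: max(1, int(len(bucket) * (1 - prune_ratio)))])
    return kept

docs = [("war", 1, 3.0, 1990), ("war", 2, 0.4, 1991),
        ("war", 3, 0.3, 2003), ("war", 4, 0.1, 2004)]
# Global top-2 by score would keep only the 1990s postings (docs 1, 2);
# 10-year windows keep the best posting of each decade instead.
print(sorted(d for _, d, _, _ in prune_with_windows(docs, 0.5, 10)))  # [1, 3]
```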
Experiments - Dataset
The English Wikipedia of July 7, 2009, with temporal queries and relevance judgements [Berberich et al. ECIR'10]: 2,300,277 Wikipedia articles with temporal expressions; 40 queries with a temporal dimension and related relevance judgements.
Our methods outperform the existing ones, mostly when the prune ratio is high. We are currently experimenting on larger archives (Portuguese Web Archive).
Results
3. Web page segmentation
Detect blocks of information in a page. Many applications in data preservation and access:
- Crawl scheduling (cf. first part of the talk)
- Emulation control: check whether an archived page can be properly rendered with a new browser (if not, keep the old browser)
- Migration control: archived pages must be migrated (e.g. change of archive file format); check that the rendering is the same before/after migration
- Mobile devices (small screens): display blocks, not the whole page
- HTML4-to-HTML5 migration (map blocks to HTML5 tags, current work)
- Etc.
Block-o-Matic: web page segmentation
Content categories: root, tabular, forms, links, …
Labels: header, navigation, article, content, …
Pipeline: rendering (page W → DOM), analysis (DOM → content structure), understanding (content → logic structure/flow), reconstruction (→ segmented page W').
Evaluation
Method: count the number of elements in common between the MoB tool segmentation and the ground-truth segmentation.
F. Shafait, D. Keysers, and T. Breuel. Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):941–954, 2008.
Experiments
Future/current work
Block-based information extraction:
- Detecting blocks is an extraction task; information extraction = adding semantics to blocks
- Automatic classification: specific rules based on geometry, text proportion, …
- Extract the information contained in blocks
- Leverage the segmentation to optimize the extraction process (e.g. objects extracted from the same block, adjacent blocks, …); related with linked objects
- ML/image-processing techniques (cf. N. Thome presentation)
- Enhance the comparison algorithm (classifier: similar/different)
- Learning block weights (go beyond Song's approach)
Obrigado. (Thank you.)