Offline Data Processing: Tasks and Infrastructure Support
T. Yang, UCSB 290N

Table of Contents
• Offline incremental data processing: case study
• Example of content analysis
• System support

Offline Architecture for Ask.com Search

Content Management
• Organize the vast number of crawled pages to facilitate online search.
  – Data preprocessing
  – Inverted index
  – Compression
  – Classify and partition data
• Collect additional content and ranking signals.
  – Links, anchor text, log data
• Extract and structure content
• Duplicate detection

Classifying and Partitioning Data
• Classify
  – Content quality, language/country, etc.
• Partition
  – Based on languages and countries. Geographical distribution based on data center locations
    (diagram: English partitions labeled Main, UK, Australia)
  – Partition based on quality
    – First tier (Tier 1): high chance that users will access
      – Quality indicator
      – Click feedback
    – Second tier (Tier 2): lower chance
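The classification and partitioning steps above can be sketched roughly as follows. This is a minimal illustration, not the method used at Ask.com: the field names (`quality`, `click_score`), the equal weighting, and the 0.5 threshold are all assumptions made for the example.

```python
from collections import defaultdict

def assign_partition(doc, tier1_threshold=0.5):
    """Return a (language, country, tier) partition key for one document."""
    # Assumed scoring: equal-weight mix of a quality indicator and a
    # click-feedback score, both expected to lie in [0, 1].
    score = 0.5 * doc["quality"] + 0.5 * doc["click_score"]
    tier = 1 if score >= tier1_threshold else 2
    return (doc["language"], doc["country"], tier)

def partition(docs, tier1_threshold=0.5):
    """Group document URLs by their partition key."""
    partitions = defaultdict(list)
    for doc in docs:
        partitions[assign_partition(doc, tier1_threshold)].append(doc["url"])
    return dict(partitions)

# Hypothetical documents used only for illustration.
docs = [
    {"url": "http://example.com/a", "language": "en", "country": "UK",
     "quality": 0.9, "click_score": 0.7},
    {"url": "http://example.com/b", "language": "en", "country": "UK",
     "quality": 0.2, "click_score": 0.1},
    {"url": "http://example.com/c", "language": "en", "country": "AU",
     "quality": 0.6, "click_score": 0.6},
]
partitions = partition(docs)
```

Keeping the tier inside the partition key means a Tier 1 index for one language and country can be built and served separately from the corresponding Tier 2 index.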

Examples of Context Extraction/Analysis
• Identify key phrases that capture the meaning of this document.
  – For example, title, section title, highlighted words.
    – HTML vs PDF
• Identify parts of a document representing the meaning of this document.
  – Many web pages contain a side menu, which is less relevant to the main content of the document.
• Capture page content through JavaScript analysis.
  – Page rendering and JavaScript evaluation within a page

Example of Content Analysis
• Detect content blocks related to the main content of a page
  – Non-content text/link material is de-prioritized during the indexing process
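One common way to approximate content-block detection is a link-density heuristic: blocks that are short or mostly link text (menus, footers) are treated as non-content. The slides do not say which technique is used, so the sketch below is an assumption, and the block format (`id`, `text`, `link_chars`) and both thresholds are hypothetical.

```python
def link_density(block):
    """Fraction of a block's characters that sit inside links."""
    text_len = len(block["text"])
    if text_len == 0:
        return 1.0  # treat empty blocks as pure navigation
    return min(1.0, block["link_chars"] / text_len)

def split_blocks(blocks, max_link_density=0.5, min_chars=40):
    """Separate likely main-content blocks from blocks to de-prioritize."""
    main, deprioritized = [], []
    for block in blocks:
        is_content = (len(block["text"]) >= min_chars
                      and link_density(block) <= max_link_density)
        (main if is_content else deprioritized).append(block["id"])
    return main, deprioritized

# Hypothetical page split into two blocks.
blocks = [
    {"id": "menu", "text": "Home News Sports Weather Contact", "link_chars": 28},
    {"id": "article",
     "text": "Offline processing removes near-duplicate pages "
             "before the index is built. " * 3,
     "link_chars": 12},
]
main, deprioritized = split_blocks(blocks)
```

During indexing, terms from the de-prioritized blocks could be given a lower weight rather than dropped entirely.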

Redundant Content Removal in Search Engines
• Over 1/3 of Web pages crawled are near duplicates
• When to remove near duplicates?
  – Offline removal
    (diagram: Web pages → offline data processing → duplicate filtering → online index)
  – Online removal with query-based duplicate removal
    (diagram: user query → online index → matching & result ranking → duplicate removal → final results)
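As a rough illustration of offline near-duplicate filtering, the sketch below compares pages by the Jaccard similarity of their word shingles. The slides do not name a specific algorithm; shingling is swapped in here as one well-known option, and the shingle size and the 0.8 threshold are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def remove_near_duplicates(pages, threshold=0.8, k=3):
    """Keep the first page of each near-duplicate group, in input order."""
    kept = []  # (url, shingle set) of pages kept so far
    for url, text in pages:
        s = shingles(text, k)
        if all(jaccard(s, other) < threshold for _, other in kept):
            kept.append((url, s))
    return [url for url, _ in kept]

# Hypothetical pages: the second differs from the first by one word.
p1 = "the quick brown fox jumps over the lazy dog near the river bank today"
p2 = "the quick brown fox jumps over the lazy dog near the river bank now"
p3 = "offline data stores support update random access compression and scan operations"
pages = [("http://example.com/a", p1),
         ("http://example.com/b", p2),
         ("http://example.com/c", p3)]
unique_urls = remove_near_duplicates(pages)
```

This pairwise comparison is quadratic in the number of pages, so it only illustrates the idea; at crawl scale, compact fingerprints such as MinHash or SimHash are the usual substitutes.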

Why are there so many duplicates?
• Same content, different URLs, often with different session IDs.
• Crawling time difference
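The session-ID case can be handled in part by canonicalizing URLs before comparing pages. The sketch below is a minimal example under stated assumptions: the list of session parameter names is hypothetical, and only query-string parameters are handled (session IDs embedded in the path are not).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names of session parameters; real sites use many variants.
SESSION_PARAMS = {"sid", "sessionid", "session_id", "phpsessid", "jsessionid"}

def canonical_url(url):
    """Drop session parameters, sort the rest, and lowercase scheme and host."""
    parts = urlsplit(url)
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    query.sort()
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, urlencode(query), ""))

u1 = canonical_url("http://Example.com/page?id=7&sid=abc123")
u2 = canonical_url("http://example.com/page?sid=zzz999&id=7")
```

Both example URLs map to the same canonical form, so a crawler could treat them as one page instead of fetching and storing the content twice.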

Tradeoff of online vs. offline removal
• Online-dominating approach
  – Impact to offline: high precision, low recall; remove fewer duplicates
  – Impact to online: more burden to online deduplication
  – Impact to overall cost: higher serving cost
• Offline-dominating approach
  – Impact to offline: high precision, high recall; remove most of duplicates; higher offline burden
  – Impact to online: less burden to online deduplication
  – Impact to overall cost: lower serving cost

Software Infrastructure Support at Ask.com
• Programming support (multi-threading/exception handling, Hadoop MapReduce)
• Data stores for managing billions of objects
  – Distributed hash tables, queues, etc.
• Communication and data exchange among machines/services
• Execution environment
  – Controllable (stop, pause, restart)
  – Service registration and invocation
  – Service monitoring
  – Logging and test framework

Requirements for Data Repository Support in Offline Systems
• Update
  – Handling large volumes of modified documents
  – Adding new content
• Random access
  – Request the content of a document based on its URL
• Compression and large files
  – Reducing storage requirements and efficient access
• Scan
  – Scan documents for text mining
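The four requirements above can be made concrete with a toy in-memory store. This is only a sketch of the interface: the class name `DocumentStore` and its methods are hypothetical, and a real repository would be distributed and disk-based rather than a Python dictionary.

```python
import zlib

class DocumentStore:
    """Toy document repository keyed by URL, with compressed values."""

    def __init__(self):
        self._blobs = {}  # url -> zlib-compressed UTF-8 bytes

    def put(self, url, content):
        """Update: add a new document or overwrite a modified one."""
        self._blobs[url] = zlib.compress(content.encode("utf-8"))

    def get(self, url):
        """Random access: return the content for a URL, or None if absent."""
        blob = self._blobs.get(url)
        return None if blob is None else zlib.decompress(blob).decode("utf-8")

    def scan(self):
        """Scan: yield (url, content) pairs in URL order, e.g. for text mining."""
        for url in sorted(self._blobs):
            yield url, zlib.decompress(self._blobs[url]).decode("utf-8")

    def stored_bytes(self):
        """Total size of the compressed values."""
        return sum(len(blob) for blob in self._blobs.values())

store = DocumentStore()
store.put("http://example.com/b", "search engine " * 200)
store.put("http://example.com/a", "first version")
store.put("http://example.com/a", "second version")  # update of a modified page
```

In this example the repetitive 2,800-character page compresses to far fewer bytes, which is the storage saving the compression requirement refers to.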

Options for Data Stores
• Bigtable at Google
• Dynamo at Amazon
• Open source software (technology; language/platform; users/sponsors)
  – Apache Cassandra: Bigtable, Dynamo; Java/Hadoop; Apache
  – Hypertable: Bigtable; C++/Hadoop; Baidu
  – HBase: Bigtable; Java/Hadoop; Apache
  – LevelDB: Bigtable; C++; Google
  – MongoDB: C++
