Interactive Wrapper Generation with Minimal User Effort Utku Irmak - PowerPoint PPT Presentation

Interactive Wrapper Generation with Minimal User Effort
Utku Irmak and Torsten Suel
CIS Department, Polytechnic University, Brooklyn, NY 11201
uirmak@cis.poly.edu and suel@poly.edu

Introduction
- Information on the WWW is usually unstructured in nature, and presented via HTML
  - Not appropriate for (certain types of) automatic processing
- Significant amount of embedded structured data
  - Stock data, product/price data, various statistics, …
  - Expressed through layout, HTML structure
- Wrapper: a software tool and set of rules for extracting such structured data from web pages
- Challenge: different sites, variations within sites

An Example: Meta Search Engine

An Example: Meta Search Engine
Rank | Title | URL | Snippet
1 | Parallel and Distributed Databases | www.csse.monash... | ... Introduction …
2 | distributed and parallel databases | springerlink.com/app... |
3 | Shared Cache – The Future of Parallel Databases | csdl2.computer.org… | … Shared Cache – The future …
4 | Distributed and Parallel Databases | www.informatik.uni-trier.edu/... | … Distributed and Parallel…

Introduction
- Extract the relevant data embedded in web pages and store it in a relational structure for further processing
- Specialized software programs called wrappers
  - Manual wrappers: e.g., Perl scripts …
- Due to the shortcomings of manually developing wrappers, many tools have been proposed for generating wrappers
  - Semi-automatic (interactive and non-interactive)
  - Fully automatic

Our Goal in this Work
- Design a complete interactive system for generating wrappers
  - Developed for industrial application
- Overcome common obstacles such as
  - Missing (multiple) attributes
  - Visual variations
- Minimize user effort
- Create robust and reliable wrappers on future pages

Related Work
- Semi-automatic approaches
  - WIEN, SoftMealy, STALKER
  - Active learning techniques are employed by Muslea et al.
- Semi-automatic interactive approaches
  - W4F, XWrap, Lixto
- Fully-automatic approaches
  - IEPAD, RoadRunner, work by Zhai et al.

Our Contributions
- We describe a new system for semi-automatic wrapper generation based on
  - an interactive interface
  - a powerful extraction language
  - ranking of likely candidate sets
- To implement the interface, we describe a framework based on active learning
- We propose the use of a category utility function for ranking the tuple sets
- We perform a detailed experimental evaluation

Framework
Diagram components: user, wrapper generation system, training webpage, verification set
Input: a training webpage and a number of verification pages
(1) User highlights a tuple on the training webpage
(2) Selected tuple is submitted to our system, which generates several wrappers
(3a) System presents the user with a candidate tuple set
(3b) System presents the user with another candidate tuple set
(3c) System presents the user with another candidate tuple set
(4) User selects one of the proposed candidate tuple sets
(5) System refines the wrapper and tests it on the verification set
(6) System finds one page where the wrapper “disagrees”
(7a) System presents the user with a candidate tuple set on this page in the verification set
(7b) System presents the user with another candidate tuple set on this page in the verification set
(8) User selects one of the proposed candidate tuple sets
(9) System outputs the final wrapper

Definition: Wrapper
- A wrapper is a set of extraction rules that agree on all pages considered thus far (i.e., that extract exactly the same set of tuples on these pages)
- The extraction rules within a wrapper may disagree on web pages not yet encountered
- In this case, a wrapper can be refined by removing some of the extraction rules
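
The definition above can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the system's actual representation: a page is a list of (css_class, text) pairs, a rule is a plain Python function, and the names `by_class`, `by_dollar`, `rules_agree`, and `refine` are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Toy stand-ins (assumptions): a page is a list of (css_class, text) pairs,
# and an extraction rule maps a page to the set of texts it extracts.
Page = Sequence[Tuple[str, str]]
Rule = Callable[[Page], FrozenSet[str]]


def by_class(page: Page) -> FrozenSet[str]:
    """Hypothetical rule: extract nodes whose class is 'price'."""
    return frozenset(text for cls, text in page if cls == "price")


def by_dollar(page: Page) -> FrozenSet[str]:
    """Hypothetical rule: extract nodes whose text starts with '$'."""
    return frozenset(text for cls, text in page if text.startswith("$"))


def rules_agree(rules: List[Rule], pages: List[Page]) -> bool:
    """True if, on every page, all rules extract exactly the same set."""
    return all(len({rule(page) for rule in rules}) <= 1 for page in pages)


def refine(rules: List[Rule], page: Page, correct: FrozenSet[str]) -> List[Rule]:
    """Keep only the rules that extract the user-confirmed set on this page."""
    return [rule for rule in rules if rule(page) == correct]


training = [("price", "$5"), ("price", "$7"), ("title", "Book")]
new_page = [("price", "$5"), ("shipping", "$2"), ("title", "Book")]
wrapper = [by_class, by_dollar]

agree_on_training = rules_agree(wrapper, [training])          # rules agree here
agree_on_both = rules_agree(wrapper, [training, new_page])    # shipping cost splits them
refined = refine(wrapper, new_page, frozenset({"$5"}))        # user confirms {"$5"}
```

On the training page both rules extract the same prices, so together they form one wrapper. The second page contains a shipping cost that only `by_dollar` picks up, so the rules disagree, and refinement drops the rule that does not match the confirmed set.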

Summary of Interaction Steps:
- User highlights a tuple on the training page
- This allows the system to generate a number of wrappers that capture different candidate tuple sets
- System presents candidate tuple sets on the training page to the user, in order of “plausibility”
- User selects the correct tuple set
- System tests the resulting wrapper on the verification set to find any “disagreements”
- For any disagreement, the user selects the correct set from a ranked list of choices
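
As a rough illustration of the first steps, the sketch below groups rules into candidate wrappers by the tuple set each extracts on the training page, then orders the candidates. The page model, the rule names, and the `candidate_wrappers` and `ranked` helpers are all hypothetical. The plausibility score used here (number of supporting rules) is only a placeholder assumption, not the category utility ranking the system uses.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Page = Sequence[Tuple[str, str]]            # toy page: (css_class, text) pairs
Rule = Callable[[Page], FrozenSet[str]]


def candidate_wrappers(
    rules: Dict[str, Rule], page: Page, highlighted: str
) -> Dict[FrozenSet[str], List[str]]:
    """Group rule names by the tuple set they extract on the training page.

    Only sets that contain the user's highlighted tuple are kept.
    """
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for name, rule in rules.items():
        extracted = rule(page)
        if highlighted in extracted:
            groups[extracted].append(name)
    return dict(groups)


def ranked(groups: Dict[FrozenSet[str], List[str]]) -> List[FrozenSet[str]]:
    """Order candidate sets by a placeholder plausibility: supporting-rule count."""
    return sorted(groups, key=lambda s: len(groups[s]), reverse=True)


# Hypothetical rules for a toy price-extraction scenario.
rules: Dict[str, Rule] = {
    "class_price": lambda p: frozenset(t for c, t in p if c == "price"),
    "class_prefix": lambda p: frozenset(t for c, t in p if c.startswith("pri")),
    "starts_dollar": lambda p: frozenset(t for c, t in p if t.startswith("$")),
}
training = [("price", "$5"), ("price", "$7"), ("shipping", "$2")]

groups = candidate_wrappers(rules, training, highlighted="$5")
order = ranked(groups)
chosen = order[0]   # a simulated user accepts the first candidate shown
```

Here two rules extract only the prices and one also captures the shipping cost, so the user would see two candidate tuple sets and pick one.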

A Real Example: half.ebay.com
- Extract tuples with attributes:
  - Price, Total Price, Shipping, Seller
- Only extract those tuples:
  - That are listed in “Like New Items” and
  - Whose sellers have been awarded a Red Star

A Real Example: half.ebay.com Training page:

Observations:
- There can be a lot of unexpected cases and variations on real websites
- A powerful language is needed to specify extraction rules
- Simple extraction followed by SQL filtering conditions will often not work
- The final wrapper may still contain many extraction rules and may disagree on webpages encountered in the future

User Effort:
(0) Cost of defining the table structure: number of attributes, their names, maybe types
(1) Cost of highlighting one (or maybe two) tuples on training pages
(2) Cost of one or more selections from a ranked list of candidate tuple sets

To Implement We Need:
(0) User interface based on browser extensions
(1) Powerful extraction language
(2) Algorithms for generating extraction rules and grouping them into wrappers
(3) Techniques for ranking wrappers in terms of plausibility

System Architecture Overview

Document Representation

Extraction Language Overview
- Based on DOM-tree with auxiliary properties
- Extraction patterns consist of a sequence of expressions on the path from root to a tuple attribute
- Each expression consists of conjunctions and disjunctions of predicates
- If a node at depth i
  - Satisfies its expression: Accept
  - Otherwise: Reject
- Only children of accepted nodes are checked further for the expression defined at depth i+1
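
The level-by-level evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Node` class, the combinators `all_of` and `any_of`, and the `extract` function are hypothetical, and `tag_name` and `tag_attr` only borrow their names from the predicate list; their exact semantics in the real language are assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Toy DOM node (assumption): tag, attributes, text, children."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["Node"] = field(default_factory=list)


Expr = Callable[[Node], bool]


def tag_name(name: str) -> Expr:
    return lambda n: n.tag == name


def tag_attr(key: str, value: str) -> Expr:
    return lambda n: n.attrs.get(key) == value


def all_of(*preds: Expr) -> Expr:      # conjunction of predicates
    return lambda n: all(p(n) for p in preds)


def any_of(*preds: Expr) -> Expr:      # disjunction of predicates
    return lambda n: any(p(n) for p in preds)


def extract(root: Node, pattern: List[Expr]) -> List[Node]:
    """Apply pattern[i] to nodes at depth i; descend only below accepted nodes."""
    frontier = [root]
    for depth, expr in enumerate(pattern):
        accepted = [n for n in frontier if expr(n)]
        if depth == len(pattern) - 1:
            return accepted
        frontier = [child for n in accepted for child in n.children]
    return []


table = Node("table", children=[
    Node("tr", {"class": "header"}, children=[Node("td", text="Price")]),
    Node("tr", {"class": "item"}, children=[Node("td", text="$5")]),
    Node("tr", {"class": "item"}, children=[Node("td", text="$7")]),
])
pattern = [
    tag_name("table"),
    all_of(tag_name("tr"), tag_attr("class", "item")),
    tag_name("td"),
]
prices = [n.text for n in extract(table, pattern)]
```

The header row is rejected at depth 1, so its cell is never examined at depth 2, which mirrors the accept/reject rule on the slide.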

Predicates in the Extraction Language
Element Nodes          | Text Nodes
tagName                | textNode
tagAttr                | textSiblingPosition
tagAttrArray           | syntax
elementSiblingPosition | leftTextNode
tagPstn                | leftElementNode
…                      | …

The Wrapper Structure

Wrapper Generation Algorithm
- Creating dom_path and LCA objects
- Creating patterns that extract tuple attributes
- Creating initial wrappers
- Generating the tuple validation rules and new wrappers
- Combining the wrappers
- Ranking the tuple sets
- Getting confirmation from the user
- Testing the wrapper on the verification set

Ranking the Tuple Sets
- We adopt the concept of category utility:
  - Maximize inter-cluster dissimilarity
  - Maximize intra-cluster similarity
- Dom-Path, specific value, missing attributes, indexing, content specification
- Terms of the category utility function:
  1) The weight of attribute A
  2) The probability that an item has value v for attribute A, given that it belongs to cluster C
  3) The probability that an item belongs to cluster C, given that it has value v for attribute A
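
A small sketch may help make the three terms concrete. It assumes, as in common formulations of category utility, that the product of the three terms is summed over all attributes, values, and clusters; the slide does not show the exact formula, so that combination is an assumption. The function name `category_utility_score`, the dictionary-based item representation, and the `dom_path` attribute values are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Item = Dict[str, str]   # toy item (assumption): attribute name -> value


def category_utility_score(
    clusters: Dict[str, List[Item]], weights: Dict[str, float]
) -> float:
    """Sum of weight(A) * P(A=v | C) * P(C | A=v) over attributes, values, clusters."""
    all_items = [item for items in clusters.values() for item in items]
    score = 0.0
    for attr, weight in weights.items():
        # How often each value of this attribute occurs across all clusters.
        overall = Counter(item.get(attr) for item in all_items)
        for items in clusters.values():
            in_cluster = Counter(item.get(attr) for item in items)
            for value, count in in_cluster.items():
                p_value_given_cluster = count / len(items)
                p_cluster_given_value = count / overall[value]
                score += weight * p_value_given_cluster * p_cluster_given_value
    return score


weights = {"dom_path": 1.0}

# Candidate A: tuples share one DOM path, non-tuples share another.
clean_split = {
    "tuples": [{"dom_path": "table/tr/td"}, {"dom_path": "table/tr/td"}],
    "non_tuples": [{"dom_path": "div/p"}, {"dom_path": "div/p"}],
}
# Candidate B: both clusters mix the two DOM paths.
mixed_split = {
    "tuples": [{"dom_path": "table/tr/td"}, {"dom_path": "div/p"}],
    "non_tuples": [{"dom_path": "table/tr/td"}, {"dom_path": "div/p"}],
}

clean_score = category_utility_score(clean_split, weights)
mixed_score = category_utility_score(mixed_split, weights)
```

The clean split scores higher than the mixed one, so a candidate tuple set whose members resemble each other and differ from the non-tuples would be ranked first.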

Ranking: Discussion
- Note: we are ranking tuple sets and wrappers
- A wrapper is more plausible if the tuples it extracts are very similar to each other, and if those tuples are very different from the non-tuples
- One could also try to rank extraction patterns, say using MDL

Experimental Evaluations
- Results on four previously used data sets from RISE
  - Okra, BigBook, Internet Address Finder, Quote Server
- Number of training tuples required by our system and previous works

Experimental Evaluations
- We chose ten well-known web sites and collected fifty web pages from each:
  - AltaVista, CNN, Google, Hotjobs, IMDb, YMB (Yahoo! Message Board), MSN Q (MSN Money - Quotes), Weather, Art, and BN (Barnes & Noble)

Experimental Evaluation
- Updating Term Weights (effect of adaptive approach):
  - The effect of pregenerating wrappers for the same extraction scenario on Art and BN websites
