
Wrapper Learning, by Craig Knoblock (PowerPoint presentation)


  1. Wrapper Learning
     Craig Knoblock, University of Southern California
     This presentation is based on slides prepared by Ion Muslea and Chun-Nan Hsu.

  2. Wrappers & Information Agents
     [Figure: a user request (GIVE ME: Thai food, < $20, “A”-rated) is handed to an AGENT.]
     USC Information Sciences Institute (ISI)

  3. Roadmap to Wrapper Building
     • Today:
       • Part 1: Wrapper Learning
       • Part 2: Agent Builder
         • Extracting information from a page
         • Executing wrappers
     • Next time:
       • Automatic Wrapper Generation
       • Advanced Agent Builder
         • Navigating through a site

  4. Wrapper Induction
     Problem description:
     • Web sources present data in a human-readable format:
       • take the user query
       • apply it to a database
       • present the results in a “template” HTML page
     • To integrate data from multiple sources, one must first extract the relevant information from Web pages
     • Task: learn extraction rules based on labeled examples
       • Hand-writing rules is tedious, error-prone, and time-consuming

  5. Example of Extraction Task
     NAME:   Casablanca Restaurant
     STREET: 220 Lincoln Boulevard
     CITY:   Venice
     PHONE:  (310) 392-5751

  6. In this part of the lecture …
     • Wrapper Induction Systems
       • WIEN:
         • The rules
         • Learning WIEN rules
       • SoftMealy
     • The STALKER approach to wrapper induction
       • The rules
       • The ECTs
       • Learning the rules

  7. WIEN [Kushmerick et al. ’97, ’00]
     • Assumes items are always in a fixed, known order:
         … Name: J. Doe ; Address: 1 Main ; Phone: 111-1111 . <p>
           Name: E. Poe ; Address: 10 Pico ; Phone: 777-1111 . <p> …
     • Introduces several types of wrappers
       • LR (left and right delimiter per item):
           Name: [Name] ;   : [Addr] ;   : [Phone] .
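
Applying an LR wrapper of this kind needs nothing more than repeated string search. The following is a minimal sketch, assuming the delimiters read off the slide (`Name:` … `;`, `:` … `;`, `:` … `.`). The function name `apply_lr` and the sample page string are illustrative and are not taken from WIEN itself.

```python
def apply_lr(page, wrapper):
    """Extract records from `page` with an LR wrapper.

    `wrapper` is a list of (attribute, left_delimiter, right_delimiter)
    triples in the fixed order the items appear on the page.
    """
    records, pos = [], 0
    while True:
        record = {}
        for attr, left, right in wrapper:
            start = page.find(left, pos)
            if start == -1:
                return records  # no further complete record
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record[attr] = page[start:end].strip()
            pos = end + len(right)
        records.append(record)


# Sample page modeled on the slide (the surrounding "…" is omitted).
PAGE = ("Name: J. Doe ; Address: 1 Main ; Phone: 111-1111 . <p> "
        "Name: E. Poe ; Address: 10 Pico ; Phone: 777-1111 . <p>")

WRAPPER = [("Name", "Name:", ";"), ("Addr", ":", ";"), ("Phone", ":", ".")]

records = apply_lr(PAGE, WRAPPER)
for record in records:
    print(record)
```

Because the items are located strictly in sequence, a missing or reordered item would shift every later extraction, which is the limitation the later summary slide points out.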

  8. Rule Learning
     • Machine learning:
       • Use past experience to improve performance
     • Rule learning:
       • INPUT:
         • Labeled examples: training & testing data
         • Admissible rules (hypothesis space)
         • Search strategy
       • Desired output:
         • A rule that performs well on both training and testing data

  9. Learning LR extraction rules
     <html> Name:<b> Kim’s </b> Phone:<b> (800) 757-1111 </b> …
     <html> Name:<b> Joe’s </b> Phone:<b> (888) 111-1111 </b> …

  10. Learning LR extraction rules
      <html> Name:<b> Kim’s </b> Phone:<b> (800) 757-1111 </b> …
      <html> Name:<b> Joe’s </b> Phone:<b> (888) 111-1111 </b> …
      • Admissible rules:
        • prefixes & suffixes of items of interest
      • Search strategy:
        • start with shortest prefix & suffix, and expand until correct

  11. Learning LR extraction rules
      <html> Name:<b> Kim’s </b> Phone:<b> (800) 757-1111 </b> …
      <html> Name:<b> Joe’s </b> Phone:<b> (888) 111-1111 </b> …
      • Admissible rules:
        • prefixes & suffixes of items of interest
      • Search strategy:
        • start with shortest prefix & suffix, and expand until correct
      Candidate delimiters at this step:  <   >   >   <   (labels: Phone, Name)

  12. Learning LR extraction rules
      <html> Name:<b> Kim’s </b> Phone:<b> (800) 757-1111 </b> …
      <html> Name:<b> Joe’s </b> Phone:<b> (888) 111-1111 </b> …
      • Admissible rules:
        • prefixes & suffixes of items of interest
      • Search strategy:
        • start with shortest prefix & suffix, and expand until correct
      Candidate delimiters at this step:  <   >   b>   <   (labels: Phone, Name)

  13. Learning LR extraction rules
      <html> Name:<b> Kim’s </b> Phone:<b> (800) 757-1111 </b> …
      <html> Name:<b> Joe’s </b> Phone:<b> (888) 111-1111 </b> …
      • Admissible rules:
        • prefixes & suffixes of items of interest
      • Search strategy:
        • start with shortest prefix & suffix, and expand until correct
      Candidate delimiters at this step:  <   b>   b>   <   (labels: Phone, Name)

  14. Learning LR extraction rules
      <html> Name:<b> Kim’s </b> Phone:<b> (800) 757-1111 </b> …
      <html> Name:<b> Joe’s </b> Phone:<b> (888) 111-1111 </b> …
      • Admissible rules:
        • prefixes & suffixes of items of interest
      • Search strategy:
        • start with shortest prefix & suffix, and expand until correct
      Candidate delimiters at this step:  <   <b>   b>   <   (labels: Phone, Name)
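
The search strategy on these slides (start with the shortest delimiter and expand until the extraction is correct) can be sketched as follows. This is a simplified illustration and not Kushmerick's actual algorithm: the names `extract` and `learn_lr` are hypothetical, the whitespace around the items in the slide's example pages has been removed, and each labeled value is assumed to occur only once in its page.

```python
def extract(page, wrapper):
    """Apply (left, right) delimiter pairs in order; return the values or None."""
    values, pos = [], 0
    for left, right in wrapper:
        start = page.find(left, pos)
        if start == -1:
            return None
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            return None
        values.append(page[start:end])
        pos = end + len(right)
    return values


def learn_lr(examples):
    """Learn one (left, right) pair per item from (page, labeled_values) examples."""
    page, labels = examples[0]
    wrapper, cursor = [], 0
    for k, value in enumerate(labels):
        begin = page.index(value, cursor)
        before = page[cursor:begin]            # text preceding the item
        after = page[begin + len(value):]      # text following the item
        # Candidates, shortest first: suffixes of `before`, prefixes of `after`.
        lefts = [before[-n:] for n in range(1, len(before) + 1)]
        rights = [after[:n] for n in range(1, len(after) + 1)]
        found = next(
            ((left, right) for left in lefts for right in rights
             if all(extract(p, wrapper + [(left, right)]) == vals[:k + 1]
                    for p, vals in examples)),
            None,
        )
        if found is None:
            raise ValueError(f"no consistent delimiters for item {k}")
        wrapper.append(found)
        cursor = begin + len(value)
    return wrapper


EXAMPLES = [
    ("<html>Name:<b>Kim's</b>Phone:<b>(800) 757-1111</b>",
     ["Kim's", "(800) 757-1111"]),
    ("<html>Name:<b>Joe's</b>Phone:<b>(888) 111-1111</b>",
     ["Joe's", "(888) 111-1111"]),
]

learned = learn_lr(EXAMPLES)
print(learned)
```

On these two pages the left delimiter `>` is rejected because it first matches inside `<html>`, so the search lengthens it until the labeled values are reproduced on every example.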

  15. Summary
      • Advantages:
        • Fast to learn & extract
      • Drawbacks:
        • Cannot handle permutations and missing items
        • Must label entire page
        • Requires large number of examples

  16. In this part of the lecture …
      • Wrapper Induction Systems
        • WIEN:
          • The rules
          • Learning WIEN rules
        • SoftMealy
      • The STALKER approach to wrapper induction
        • The rules
        • The ECTs
        • Learning the rules

  17. SoftMealy [Hsu & Dung, ’98]
      • Learns a transducer
      [Figure: transducer with states Name, Addr, and Phone; transitions are labeled “Name:”, “Addr:”, “Phone:”, “;”, and “.”]

  18. SoftMealy: extractor representation formalism
      • Variation of a finite-state transducer (a.k.a. Mealy machine)
      • Simple enough to be learnable from a small number of example extractions
        • fixed graph structure, or a strictly confined search space for graph structures
        • fewer edges, fewer outgoing edges
      • Complex enough to handle irregular attribute permutations
        • missing attributes
        • multiple attribute values
        • variant attribute ordering

  19. How SoftMealy extractors work
      <LI><A HREF="mani.html"> Mani Chandy</A>, <I>Professor of Computer Science</I> and <I>Executive Officer for Computer Science</I>
      [Figure: skip and extract transitions over this input, between states labeled A and N, each transition governed by contextual rules]

  20. Contextual rule
      • A contextual rule looks like:
          TRANSFER FROM state N TO state N
          IF left context = capitalized string
             right context = HTML tag “</A>”
      • When the “master” read head stops at the boundary between two tokens, the “secondary” read head scans the left and right context and matches what it reads against the contextual rules
      • A contextual rule need not use both the left context and the right context
      • A contextual rule may have disjunctions
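
The boundary test described on this slide can be illustrated with a small sketch. It is a minimal illustration only: the tokenizer, the predicate names, and the dictionary representation of a rule are assumptions made for this example and are not SoftMealy's actual data structures.

```python
import re

# Hypothetical tokenizer: an HTML tag, a word, or a single punctuation mark.
TOKEN = re.compile(r"<[^>]+>|\w+|[^\w\s]")


def is_capitalized(token):
    """Token class 'capitalized string' (assumed: first character is uppercase)."""
    return token[:1].isupper()


def is_tag(name):
    """Build a predicate that matches one specific HTML tag."""
    return lambda token: token == name


# A rule is a list of disjuncts; each disjunct may constrain the left
# context, the right context, or both.
RULE = [{"left": is_capitalized, "right": is_tag("</A>")}]


def fires(rule, tokens, i):
    """True if some disjunct matches the boundary between tokens[i-1] and tokens[i]."""
    for disjunct in rule:
        left = disjunct.get("left")
        right = disjunct.get("right")
        left_ok = left is None or (i > 0 and left(tokens[i - 1]))
        right_ok = right is None or (i < len(tokens) and right(tokens[i]))
        if left_ok and right_ok:
            return True
    return False


text = ('<LI><A HREF="mani.html"> Mani Chandy</A>, '
        '<I>Professor of Computer Science</I>')
tokens = TOKEN.findall(text)
boundaries = [i for i in range(len(tokens) + 1) if fires(RULE, tokens, i)]
print(tokens)
print(boundaries)
```

In this sketch the rule fires only at the boundary between `Chandy` and `</A>`, which is where the name in the slide's example ends.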

  21. Summary
      • Advantages:
        • Also learns the order of items
        • Allows item permutations & missing items
        • Uses wildcards (e.g., Number, AllCaps)
      • Drawback:
        • Must “see” all possible permutations

  22. In this part of the lecture …
      • Wrapper Induction Systems
        • WIEN:
          • The rules
          • Learning WIEN rules
        • SoftMealy
      • The STALKER approach to wrapper induction
        • The rules
        • The ECTs
        • Learning the rules

  23. STALKER [Muslea et al., ’98, ’99, ’01]
      • Hierarchical wrapper induction
        • Decomposes a hard problem into several easier ones
      • Extracts items independently of each other
      • Each rule is a finite automaton

  24. STALKER: The Wrapper Architecture
      [Figure: architecture diagram with the components Query, Data, Information Extractor, EC Tree, and Extraction Rules]

  25. Extraction Rules
      Extraction rule: sequence of landmarks
        SkipTo(Phone) SkipTo(<i>)        SkipTo(</i>)
      Name: Joel’s <p> Phone: <i> (310) 777-1111 </i><p> Review: …
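
A rule made of SkipTo landmarks can be read as a chain of searches, each starting where the previous one stopped. The sketch below assumes landmarks are plain substrings and, as a simplification, searches forward from the start position for the end landmark `</i>`; neither assumption is claimed to match how STALKER itself represents or applies its rules. The helper name `skip_to` is hypothetical.

```python
def skip_to(page, landmarks, pos=0):
    """Consume each landmark in turn.

    Returns the index just past the last landmark, or None if any
    landmark cannot be found.
    """
    for landmark in landmarks:
        found = page.find(landmark, pos)
        if found == -1:
            return None
        pos = found + len(landmark)
    return pos


page = "Name: Joel's <p> Phone: <i> (310) 777-1111 </i><p> Review: ..."

# Start rule from the slide: SkipTo(Phone) SkipTo(<i>)
start = skip_to(page, ["Phone", "<i>"])

# Simplified end rule: the first "</i>" after the start position.
end = page.find("</i>", start)

phone = page[start:end].strip()
print(phone)
```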

  26. More about Extraction Rules
      Name: Joel’s <p> Phone: <i> (310) 777-1111 </i><p> Review: …
      Name: Kim’s <p> Phone (toll free) : <b> (800) 757-1111 </b> …
      Name: Kim’s <p> Phone:<b> (888) 111-1111 </b><p>Review: …
      Start: EITHER SkipTo( Phone : <i> )
             OR     SkipTo( Phone ) SkipTo( : <b> )
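
The EITHER/OR start rule can be sketched as an ordered list of alternatives, where the first alternative that matches determines the start position. This is a minimal illustration under stated assumptions: landmarks are treated as sequences of consecutive tokens, the tokenizer is a hypothetical one written for this example, and the trailing "…" of the slide's pages is omitted.

```python
import re

# Hypothetical tokenizer: a simple HTML tag, a word, or one punctuation mark.
TOKEN = re.compile(r"</?\w+>|\w+|[^\w\s]")


def tokenize(text):
    return TOKEN.findall(text)


def skip_to(tokens, landmark, pos):
    """Index just past the first occurrence of `landmark` at or after `pos`, else None."""
    n = len(landmark)
    for i in range(pos, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None


def apply_rule(tokens, rule):
    """Apply a sequence of SkipTo landmarks; None if any landmark is missing."""
    pos = 0
    for landmark in rule:
        pos = skip_to(tokens, landmark, pos)
        if pos is None:
            return None
    return pos


def apply_disjunction(tokens, disjuncts):
    """Try each alternative in order and return the first match."""
    for rule in disjuncts:
        pos = apply_rule(tokens, rule)
        if pos is not None:
            return pos
    return None


# EITHER SkipTo(Phone : <i>)  OR  SkipTo(Phone) SkipTo(: <b>)
START = [
    [["Phone", ":", "<i>"]],
    [["Phone"], [":", "<b>"]],
]

PAGES = [
    "Name: Joel's <p> Phone: <i> (310) 777-1111 </i><p> Review:",
    "Name: Kim's <p> Phone (toll free) : <b> (800) 757-1111 </b>",
    "Name: Kim's <p> Phone:<b> (888) 111-1111 </b><p>Review:",
]

area_codes = []
for page in PAGES:
    tokens = tokenize(page)
    pos = apply_disjunction(tokens, START)
    # tokens[pos] is "(" and tokens[pos + 1] is the area code.
    area_codes.append(tokens[pos + 1])
print(area_codes)
```

The first alternative only covers the page that uses `<i>`; the second alternative skips any text between `Phone` and `: <b>`, which is what lets it handle the "(toll free)" variant.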
