Snowball: Extracting Relations from Large Plain-Text Collections


  1. Snowball: Extracting Relations from Large Plain-Text Collections
     Eugene Agichtein, Luis Gravano
     Department of Computer Science, Columbia University

  2. Extracting Relations from Documents
     Text documents hide valuable structured information. If we manage to extract this information:
     • We can answer user queries more accurately
     • We can run data mining tasks (e.g., finding trends)

  3. GOAL: Extract All Tuples “Hidden” in the Document Collection
     The system must:
     • Require minimal training for each new task
     • Recover from noise
     • Exploit redundancy of information in documents

  4. Example Task: Organization/Location
     Redundancy: the same tuple appears in several sentences.
       Organization     Location
       Microsoft        Redmond
       Apple Computer   Cupertino
       Nike             Portland
     Example sentences:
     • Microsoft's central headquarters in Redmond is home to almost every product group and division.
     • Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."
     • Apple's programmers "think different" on a "campus" in Cupertino, Cal.
     • Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.

  5. Extracting Relations from Text Collections
     • Related Work
     • The Snowball System
     • Evaluation Metrics
     • Experimental Results

  6. Related Work
     • Traditional Information Extraction
       – MUCs (Message Understanding Conferences)
       – Significant (manual) training for each new task
     • Bootstrapping
       – Riloff et al. ('99), Collins & Singer ('99) (named-entity recognition)
       – Brin (DIPRE) ('98)

  7. Extracting Relations from Text: DIPRE
     Initial seed tuples:
       ORGANIZATION   LOCATION
       MICROSOFT      REDMOND
       IBM            ARMONK
       BOEING         SEATTLE
       INTEL          SANTA CLARA
     Process diagram: Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table

  8. Extracting Relations from Text: DIPRE
     Seed tuples: <MICROSOFT, REDMOND>, <IBM, ARMONK>, <BOEING, SEATTLE>, <INTEL, SANTA CLARA>
     Occurrences of seed tuples:
     • Computer servers at Microsoft’s headquarters in Redmond…
     • In mid-afternoon trading, shares of Redmond-based Microsoft fell…
     • The Armonk-based IBM introduced a new line…
     • The combined company will operate from Boeing’s headquarters in Seattle.
     • Intel, Santa Clara, cut prices of its Pentium processor.

  9. Extracting Relations from Text: DIPRE
     DIPRE patterns:
     • <STRING1>’s headquarters in <STRING2>
     • <STRING2>-based <STRING1>
     • <STRING1>, <STRING2>
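The three string patterns on this slide can be tried out as regular expressions. The following is a minimal sketch, not DIPRE's actual implementation: the regexes, the `extract` helper, and the restriction of each slot to word tokens are assumptions made for illustration, as is the closing comma required by the third pattern.

```python
import re

# Hypothetical regex renderings of the three slide patterns.
# Assumption: STRING1/STRING2 are single word tokens, except the second
# slot of the comma pattern, which may span several words up to a comma.
PATTERNS = [
    re.compile(r"(?P<s1>\w+)['’]s headquarters in (?P<s2>\w+)"),
    re.compile(r"(?P<s2>\w+)-based (?P<s1>\w+)"),
    re.compile(r"(?P<s1>\w+), (?P<s2>\w[\w ]*),"),
]


def extract(sentence):
    """Return (STRING1, STRING2) pairs matched by any pattern, in pattern order."""
    pairs = []
    for pattern in PATTERNS:
        for m in pattern.finditer(sentence):
            pairs.append((m.group("s1"), m.group("s2")))
    return pairs


if __name__ == "__main__":
    for s in [
        "Computer servers at Microsoft’s headquarters in Redmond",
        "The Armonk-based IBM introduced a new line",
        "Intel, Santa Clara, cut prices of its Pentium processor.",
    ]:
        print(extract(s))
```

Nothing in these patterns checks what kind of string fills each slot, so any text with the same surface shape will match.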

  10. Extracting Relations from Text: DIPRE
      Generate new seed tuples; start new iteration.
        ORGANIZATION    LOCATION
        AG EDWARDS      ST LUIS
        157TH STREET    MANHATTAN
        7TH LEVEL       RICHARDSON
        3COM CORP       SANTA CLARA
        3DO             REDWOOD CITY
        JELLIES         APPLE
        MACWEEK         SAN FRANCISCO
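One iteration of the seed → occurrences → patterns → new tuples loop can be sketched as follows. This is a deliberately simplified toy, not DIPRE as published: a "pattern" here is only the literal text found between the two strings of a seed tuple plus their order, and `learn_patterns`, `apply_patterns`, and `bootstrap` are hypothetical names.

```python
import re


def learn_patterns(sentences, seeds):
    """Collect (first_slot, middle_text) pairs from sentences containing a seed."""
    patterns = set()
    for org, loc in seeds:
        for s in sentences:
            i, j = s.find(org), s.find(loc)
            if i < 0 or j < 0:
                continue  # this sentence does not mention both strings
            if i < j:
                first, middle = "org", s[i + len(org):j]
            else:
                first, middle = "loc", s[j + len(loc):i]
            if middle.strip():  # skip empty middles, which would match anything
                patterns.add((first, middle))
    return patterns


def apply_patterns(sentences, patterns):
    """Match every learned middle against every sentence to propose tuples."""
    found = set()
    for first, middle in patterns:
        regex = re.compile(r"(\w+)" + re.escape(middle) + r"(\w+)")
        for s in sentences:
            for a, b in regex.findall(s):
                found.add((a, b) if first == "org" else (b, a))
    return found


def bootstrap(sentences, seeds, iterations=1):
    """Grow the tuple table by alternating pattern learning and matching."""
    table = set(seeds)
    for _ in range(iterations):
        table |= apply_patterns(sentences, learn_patterns(sentences, table))
    return table


if __name__ == "__main__":
    sentences = [
        "In mid-afternoon trading, shares of Redmond-based Microsoft fell",
        "The Armonk-based IBM introduced a new line",
        "a producer of apple-based jelly",
    ]
    print(bootstrap(sentences, {("Microsoft", "Redmond")}))
```

On these three sentences the single seed yields the valid tuple (IBM, Armonk) but also (jelly, apple), the same kind of noise visible in the JELLIES / APPLE row of the table above.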

  11. Extracting Relations from Text: Potential Pitfalls
      • Invalid tuples generated
        – Degrade quality of tuples on subsequent iterations
        – Must have an automatic way to select high-quality tuples to use as new seeds
      • Pattern representation
        – Patterns must generalize

  12. Extracting Relations from Text Collections
      • Related Work
        – DIPRE
      • The Snowball System:
        – Pattern representation and generation
        – Tuple generation
        – Automatic pattern and tuple evaluation
      • Evaluation Metrics
      • Experimental Results

  13. Extracting Relations from Text: Snowball
      Initial seed tuples:
        ORGANIZATION   LOCATION
        MICROSOFT      REDMOND
        IBM            ARMONK
        BOEING         SEATTLE
        INTEL          SANTA CLARA
      Process diagram: Initial Seed Tuples → Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table

  14. Extracting Relations from Text: Snowball
      Seed tuples: <MICROSOFT, REDMOND>, <IBM, ARMONK>, <BOEING, SEATTLE>, <INTEL, SANTA CLARA>
      Occurrences of seed tuples:
      • Computer servers at Microsoft’s headquarters in Redmond…
      • In mid-afternoon trading, shares of Redmond-based Microsoft fell…
      • The Armonk-based IBM introduced a new line…
      • The combined company will operate from Boeing’s headquarters in Seattle.
      • Intel, Santa Clara, cut prices of its Pentium processor.

  15. Problem: Patterns Excessively General
      Pattern: <STRING2>-based <STRING1>
      • Today's merger with McDonnell Douglas positions Seattle-based Boeing to make major money in space.
      • …, a producer of apple-based jelly, ... → <jelly, apple>

  16. Extracting Relations from Text: Snowball
      Tag Entities: use MITRE’s Alembic Named Entity tagger.
      • Computer servers at Microsoft’s headquarters in Redmond…
      • In mid-afternoon trading, shares of Redmond-based Microsoft fell…
      • The Armonk-based IBM introduced a new line…
      • The combined company will operate from Boeing’s headquarters in Seattle.
      • Intel, Santa Clara, cut prices of its Pentium processor.
      Annotation [Himanshu]: + use of types

  17. Extracting Relations from Text
      Tagged occurrences:
      • Computer servers at Microsoft's headquarters in Redmond ...
      • Exxon, Irving, said it will boost its stake in the...
      • In midafternoon trading, shares of Irving-based Exxon fell…
      • The Armonk-based IBM has introduced a new line ...
      • The combined company will operate from Boeing's headquarters in Seattle.
      • Intel, Santa Clara, cut prices of its Pentium...
      Patterns:
      • <ORGANIZATION>’s headquarters in <LOCATION>
      • <LOCATION>-based <ORGANIZATION>
      • <ORGANIZATION>, <LOCATION>
      PROBLEM: Patterns too specific: they have to match the text exactly.
      Annotation [Akshay]: + inexact pattern matching

  18. Snowball: Pattern Representation
      A Snowball pattern vector is a 5-tuple < left, tag1, middle, tag2, right >, where:
      – tag1, tag2 are named-entity tags
      – left, middle, and right are vectors of weighted terms
      Example: ORGANIZATION 's central headquarters in LOCATION is home to...
        tag1   = ORGANIZATION
        middle = {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}
        tag2   = LOCATION
        right  = {<is 0.75>, <home 0.75>}
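The 5-tuple can be written down directly as a small data structure. This is a sketch under assumptions: the class name `SnowballPattern`, the use of a `dict` for each term vector, and the empty left context (the slide's example shows no left vector) are illustrative choices, not taken from the Snowball system itself.

```python
from dataclasses import dataclass
from typing import Dict

TermVector = Dict[str, float]  # term -> weight


@dataclass
class SnowballPattern:
    """Hypothetical container for < left, tag1, middle, tag2, right >."""
    left: TermVector
    tag1: str
    middle: TermVector
    tag2: str
    right: TermVector

    def as_tuple(self):
        """Return the pattern in the 5-tuple order used on the slide."""
        return (self.left, self.tag1, self.middle, self.tag2, self.right)


# The slide's example: ORGANIZATION 's central headquarters in LOCATION is home to...
example = SnowballPattern(
    left={},  # assumption: the slide shows no left-context vector
    tag1="ORGANIZATION",
    middle={"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},
    tag2="LOCATION",
    right={"is": 0.75, "home": 0.75},
)

if __name__ == "__main__":
    print(example.as_tuple())
```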

  19. Snowball: Pattern Generation
      Tagged occurrences of seed tuples:
      • Computer servers at Microsoft’s central headquarters in Redmond…
      • In mid-afternoon trading, shares of Redmond-based Microsoft fell…
      • The Armonk-based IBM introduced a new line…
      • The combined company will operate from Boeing’s headquarters in Seattle.

  20. Snowball Pattern Generation: Cluster Similar Occurrences
      Occurrences of seed tuples converted to Snowball representation:
      • {<servers 0.75>, <at 0.75>} ORGANIZATION {<’s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION
      • {<shares 0.75>, <of 0.75>} LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION {<fell 1>}
      • {<the 1>} LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION {<introduced 0.75>, <a 0.75>}
      • {<operate 0.75>, <from 0.75>} ORGANIZATION {<’s 0.7>, <headquarters 0.7>, <in 0.7>} LOCATION
      Annotation [Dinesh]: + vector-based rep of context

  21. Similarity Metric
      P = < L_p, tag1, M_p, tag2, R_p >
      S = < L_s, tag1, M_s, tag2, R_s >
      $$\mathrm{Match}(P, S) = \begin{cases} L_p \cdot L_s + M_p \cdot M_s + R_p \cdot R_s & \text{if the tags match} \\ 0 & \text{otherwise} \end{cases}$$
      Annotation [Ankit]: Could be better?
      Annotation [Yash]: Semantic equivalence of context missing?
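The Match definition translates into a few lines of code. In this sketch the term vectors are plain dictionaries and the helper names `dot` and `match` are assumptions; the weights come from the occurrence vectors on the previous slide, with right contexts that the slide does not show treated as empty.

```python
def dot(u, v):
    """Inner product of two sparse term vectors stored as dicts."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def match(p, s):
    """Match(P, S): sum of the three context dot products, or 0 if tags differ."""
    lp, p_tag1, mp, p_tag2, rp = p
    ls, s_tag1, ms, s_tag2, rs = s
    if (p_tag1, p_tag2) != (s_tag1, s_tag2):
        return 0.0
    return dot(lp, ls) + dot(mp, ms) + dot(rp, rs)


# Occurrence vectors from the slide deck; empty right contexts are an assumption.
microsoft = ({"servers": 0.75, "at": 0.75}, "ORGANIZATION",
             {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},
             "LOCATION", {})
boeing = ({"operate": 0.75, "from": 0.75}, "ORGANIZATION",
          {"'s": 0.7, "headquarters": 0.7, "in": 0.7}, "LOCATION", {})
redmond_based = ({"shares": 0.75, "of": 0.75}, "LOCATION",
                 {"-": 0.75, "based": 0.75}, "ORGANIZATION", {"fell": 1.0})

if __name__ == "__main__":
    print(match(microsoft, boeing))         # about 1.05: three shared middle terms
    print(match(microsoft, redmond_based))  # 0.0: the tag order differs
```

Only exact term overlap contributes to the score, which is the limitation the annotations on this slide point at.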

  22. Snowball Pattern Generation: Clustering
      Cluster 1:
      • {<servers 0.75>, <at 0.75>} ORGANIZATION {<’s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION
      • {<operate 0.75>, <from 0.75>} ORGANIZATION {<’s 0.7>, <headquarters 0.7>, <in 0.7>} LOCATION
      Cluster 2:
      • {<shares 0.75>, <of 0.75>} LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION {<fell 1>}
      • {<the 1>} LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION {<introduced 0.75>, <a 0.75>}
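The slide does not say which clustering procedure produces these two clusters. The sketch below assumes a simple single-pass scheme: each occurrence joins the existing cluster whose first member it matches best, provided the score reaches a hypothetical threshold `tau_sim`, and otherwise starts a new cluster. The helper names and the threshold value are illustrative.

```python
def dot(u, v):
    """Inner product of two sparse term vectors stored as dicts."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def match(p, s):
    """Sum of left/middle/right dot products, or 0 when the tags differ."""
    if (p[1], p[3]) != (s[1], s[3]):
        return 0.0
    return dot(p[0], s[0]) + dot(p[2], s[2]) + dot(p[4], s[4])


def single_pass_cluster(occurrences, tau_sim):
    """Group occurrence indices; compare against each cluster's first member."""
    clusters = []
    for i, occ in enumerate(occurrences):
        best, best_score = None, tau_sim
        for cluster in clusters:
            score = match(occurrences[cluster[0]], occ)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([i])  # nothing similar enough: open a new cluster
        else:
            best.append(i)
    return clusters


# The four occurrences from the slides; empty right contexts are an assumption.
OCCURRENCES = [
    ({"servers": 0.75, "at": 0.75}, "ORGANIZATION",
     {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5}, "LOCATION", {}),
    ({"shares": 0.75, "of": 0.75}, "LOCATION",
     {"-": 0.75, "based": 0.75}, "ORGANIZATION", {"fell": 1.0}),
    ({"the": 1.0}, "LOCATION",
     {"-": 0.75, "based": 0.75}, "ORGANIZATION", {"introduced": 0.75, "a": 0.75}),
    ({"operate": 0.75, "from": 0.75}, "ORGANIZATION",
     {"'s": 0.7, "headquarters": 0.7, "in": 0.7}, "LOCATION", {}),
]

if __name__ == "__main__":
    print(single_pass_cluster(OCCURRENCES, tau_sim=0.5))
```

With `tau_sim=0.5` the two "headquarters in" occurrences end up together and the two "-based" occurrences end up together, the same grouping as Cluster 1 and Cluster 2 on the slide.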
