A Reference-Set Approach to Information Extraction from Unstructured, Ungrammatical Data Sources

Matthew Michelson, Ph.D. Defense, Nov. 3rd, 2008. Sections: Introduction, Unsupervised IE, Building Reference Sets, Supervised IE, Conclusion.


  1. A Reference-Set Approach to Information Extraction from Unstructured, Ungrammatical Data Sources. Matthew Michelson, Ph.D. Defense, Nov. 3rd, 2008

  2. Motivation: Data Integration
     Query: "Average price for a 3-star crash-rated Honda, and reviews."
     Diagram: the user sends the query to a mediator, which reaches structured and semi-structured sources (car reviews, NHTSA ratings) through wrappers. For unstructured, ungrammatical sources (classified ads, auction listings, etc.) it is open how to query and integrate them (marked "??????").

  3. Motivation: Data Integration
     Query: "Average price for a 3-star crash-rated Honda, and reviews."
     Same diagram, with "THESIS" in the slot previously marked "??????": querying and integrating the unstructured, ungrammatical sources (classified ads, auction listings, etc.) alongside the structured and semi-structured ones (car reviews, NHTSA ratings).

  4. Unstructured, Ungrammatical Data: “Posts”

  6. Query? … Information Extraction/Annotation!
     Extracted from the post: Model: Civic; Trim: SI; Price: $2900; Year: 91
     Annotation: MAKE: HONDA (implied!); MODEL: CIVIC; TRIM: 2 Door SI; YEAR: 1991

  7. Difficulties
     - Unstructured
       - No assumptions on structure
       - “Rule/Pattern” based techniques unsuited
     - Ungrammatical
       - Does not conform to English grammar
       - Natural-Language Processing techniques unsuited

  8. Reference-Set Based Extraction/Annotation
     Post: "91 Civic SI RHD SHELL - $2900 -"
     Record linkage against the reference set(s) gives the annotation: HONDA CIVIC 2 Door SI 1991
     Information extraction gives the extracted attributes: Civic SI 91 $2900
     Both results feed Query and Integrate.

  9. Reference Sets
     - Collections of entities and their attributes
     - List cars: <make, model, trim, …>
     - Scrape make, model, trim, year for all cars from 1990-2005…

  10. Contributions
      - Automatic matching and extraction algorithm that exploits a given reference set
      - Automatically select the appropriate reference sets from a repository of reference sets
      - Automatic method for building reference sets from the posts themselves
      - Suggest the number of posts required to sufficiently build reference set
      - Algorithm to determine whether automatic method will work, or user should create reference set
      - Supervised machine learning for high-accuracy
      - High accuracy, even in the face of ambiguity

  11. Contributions: 3 reference-set based extraction methods

      Method 1 (ARX) [IJDAR 07]
        Summary: 1. Automatically select reference set from repository; 2. Automatic extraction
        Advantages: State-of-the-art extraction; automatic, given reference set repository
      Method 2 (ILA) [JAIR, review]
        Summary: 1. Automatically build reference set
        Advantages: Cannot build reference set (difficult attributes); fully automatic; competitive state-of-the-art
      Method 3 (Phoebus) [JAIR, 08]
        Summary: 1. Supervised approach to extraction
        Advantages: Highest-accuracy extraction; deals with ambiguity

  12. Automatic method: Three steps (IJDAR, 2007)
      Inputs: posts; reference set repository (Hotels, Restaurants, Edmunds Cars)
      1) Select reference set(s)
      2) Find best matches (unsupervised)
      3) Extraction using matches (unsupervised)
      ARX: Automatic Reference-set based eXtraction

  13. Selecting the Reference Set(s)
      Vector space model: the set of posts is 1 doc, each reference set is 1 doc. Select the reference set most similar to the set of posts…
      Posts: "FORD Thunderbird - $4700"; "2001 White Toyota Corrolla CE Excellent Condition - $8200"
      Cars: SIM 0.7; Hotels and Restaurants: not yet scored
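The "one document per collection" idea can be sketched in a few lines. This is only an illustration under stated assumptions: plain term-frequency vectors with cosine similarity stand in for whatever weighting the thesis uses (the slides do not say), and the three toy reference sets are invented, so the scores will not reproduce the SIM values on the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def as_document(records):
    """Collapse a whole collection of strings into one bag-of-words vector."""
    bag = Counter()
    for record in records:
        bag.update(tokenize(record))
    return bag


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The two posts shown on the slide.
posts = [
    "FORD Thunderbird - $4700",
    "2001 White Toyota Corrolla CE Excellent Condition - $8200",
]

# Toy reference sets, invented for illustration only.
repository = {
    "Cars": ["Ford Thunderbird 2001", "Toyota Corolla CE 2001"],
    "Hotels": ["Hilton Downtown 3 star", "Marriott Airport 4 star"],
    "Restaurants": ["Thai Palace", "Luigi Pizza"],
}

post_doc = as_document(posts)
scores = {name: cosine(post_doc, as_document(rows)) for name, rows in repository.items()}
best = max(scores, key=scores.get)
```

With this toy data only the Cars set shares any tokens with the posts, so it ranks first; a realistic repository would give every set a non-zero score, which is why a selection rule over the ranked scores is needed.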

  14. Selecting the Reference Set(s)
      Same posts, next reference set scored.
      Cars: SIM 0.7; Hotels: SIM 0.4; Restaurants: not yet scored

  15. Selecting the Reference Set(s)
      Same posts, all reference sets scored.
      Cars: SIM 0.7; Hotels: SIM 0.4; Restaurants: SIM 0.3

  16. Selecting the Reference Set(s)
      Vector space model: the set of posts is 1 doc, each reference set is 1 doc. Select the reference set most similar to the set of posts…
      Posts: "FORD Thunderbird - $4700"; "2001 White Toyota Corrolla CE Excellent Condition - $8200"

      Reference set   SIM
      Cars            0.7     PD(C,H) = 0.75 > T
      Hotels          0.4     PD(H,R) = 0.33 < T
      Restaurants     0.3
      Avg.            0.47

      Selected: Cars
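The numbers on this slide are consistent with PD being a percent difference between neighbouring scores, (0.7 - 0.4) / 0.4 = 0.75 and (0.4 - 0.3) / 0.3 = 0.33. The sketch below is one plausible reading of the selection rule, not the thesis's definitive algorithm: walk down the ranking, keep sets that score above the average, and stop at the first gap larger than T. The value T = 0.6 and the decision to select nothing when no large gap exists are assumptions.

```python
def percent_difference(higher, lower):
    """Relative gap between two neighbouring scores in the ranking."""
    return (higher - lower) / lower if lower else float("inf")


def select_reference_sets(scores, threshold):
    """Return the sets ranked before the first large gap (assumed rule).

    A set is kept only if it scores above the average; if the ranking
    drops to the average before any gap exceeds the threshold, nothing
    stands out and the result is empty. The lowest-ranked set has no
    successor to compare against, so it is never selected.
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    average = sum(scores.values()) / len(scores)
    selected = []
    for (name, score), (_, next_score) in zip(ranked, ranked[1:]):
        if score <= average:
            return []
        selected.append(name)
        if percent_difference(score, next_score) > threshold:
            return selected
    return []


# Scores from the slide; T = 0.6 is a hypothetical threshold.
slide_scores = {"Cars": 0.7, "Hotels": 0.4, "Restaurants": 0.3}
chosen = select_reference_sets(slide_scores, threshold=0.6)
```

On the slide's scores this picks Cars alone: the Cars-to-Hotels gap exceeds the threshold, so the walk stops before Hotels is considered.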

  17. Unsupervised matching between the posts and reference set
      Posts: "new 2007 altima"; "02 M3 Convertible .. Absolute beauty!!!"; "Awesome car for sale! Cheap too!"
      "new 2007 altima": candidates {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007} and {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}; match {NISSAN, ALTIMA, 2007}

  19. Unsupervised matching between the posts and reference set
      "new 2007 altima": candidates {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007} and {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}; match {NISSAN, ALTIMA, 2007}
      "02 M3 Convertible .. Absolute beauty!!!": {BMW, M3, 2 Dr STD Convertible, 2002}
      "Awesome car for sale! Cheap too!": {LINCOLN, TOWN CAR, 4 Dr, 2001} and {RENAULT, LE CAR, 2 Dr, 1987}

  20. Unsupervised matching between the posts and reference set
      "new 2007 altima": candidates {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007} and {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}; match {NISSAN, ALTIMA, 2007}
      "02 M3 Convertible .. Absolute beauty!!!": {BMW, M3, 2 Dr STD Convertible, 2002}
      "Awesome car for sale! Cheap too!": {LINCOLN, TOWN CAR, 4 Dr, 2001} and {RENAULT, LE CAR, 2 Dr, 1987} are false positives. Prune false positives! Result: { }
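A minimal sketch of how posts could be mapped to the tuples shown in these matching slides. It is not the thesis's matcher: the Dice token-overlap score, the hypothetical cutoff `min_score=0.25` used for pruning, and the rule of keeping only the attributes on which tied best candidates agree are all assumptions chosen to reproduce the slide's three outcomes.

```python
import re

ATTRIBUTES = ("make", "model", "trim", "year")

# Reference tuples taken from the slides.
REFERENCE_SET = [
    ("NISSAN", "ALTIMA", "4 Dr 3.5 SE Sedan", "2007"),
    ("NISSAN", "ALTIMA", "4 Dr 2.5 S Sedan", "2007"),
    ("BMW", "M3", "2 Dr STD Convertible", "2002"),
    ("LINCOLN", "TOWN CAR", "4 Dr", "2001"),
    ("RENAULT", "LE CAR", "2 Dr", "1987"),
]


def tokenize(text):
    """Lowercased token set; keeps decimals such as '3.5' in one piece."""
    return set(re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower()))


def dice(a, b):
    """Dice coefficient between two token sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a and b else 0.0


def best_matches(post, reference_set, min_score):
    """All tuples tied for the top score, or [] if the top score is too low."""
    post_tokens = tokenize(post)
    scored = [(dice(post_tokens, tokenize(" ".join(t))), t) for t in reference_set]
    top = max(score for score, _ in scored)
    if top < min_score:  # prune weak matches as false positives
        return []
    return [t for score, t in scored if score == top]


def agreement(tuples):
    """Keep only the attributes on which every candidate tuple agrees."""
    merged = {}
    for index, attribute in enumerate(ATTRIBUTES):
        values = {t[index] for t in tuples}
        if len(values) == 1:
            merged[attribute] = values.pop()
    return merged


def annotate(post, min_score=0.25):
    """min_score is a hypothetical cutoff, not a value from the slides."""
    return agreement(best_matches(post, REFERENCE_SET, min_score))


altima = annotate("new 2007 altima")
bmw = annotate("02 M3 Convertible .. Absolute beauty!!!")
generic = annotate("Awesome car for sale! Cheap too!")
```

The two Altima tuples tie, so only make, model and year survive, while the generic post overlaps the reference set only on the word "car" and falls below the cutoff. A fixed cutoff is the simplest possible pruning rule; any data-driven threshold could replace it without changing the structure.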

  21. Unsupervised Extraction
      Post: "91 Civic SI RHD SHELL - $2900 -"
      Matched reference tuple: year: 1991; make: Honda; model: Civic; trim: 2 Dr SI
      Post tokens are compared to the reference attributes by similarity, giving: Civic SI 91
      Final step: clean whole attribute
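The token-to-attribute comparison in this extraction slide can be sketched as follows. This is an illustration under assumptions, not the method from the thesis: `difflib.SequenceMatcher` stands in for whatever string similarity is actually used, the cutoff of 0.6 is hypothetical, and the final "clean whole attribute" step is left out.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(a, b):
    """Character-level similarity in [0, 1] (assumed stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


def extract(post, reference_tuple, min_similarity=0.6):
    """Label each post token with its most similar reference attribute.

    Tokens whose best similarity is below min_similarity (a hypothetical
    cutoff) are left unlabeled.
    """
    extracted = {}
    for token in tokenize(post):
        best_attribute, best_score = None, 0.0
        for attribute, value in reference_tuple.items():
            score = max(
                (similarity(token, ref) for ref in tokenize(value)),
                default=0.0,
            )
            if score > best_score:
                best_attribute, best_score = attribute, score
        if best_attribute is not None and best_score >= min_similarity:
            extracted.setdefault(best_attribute, []).append(token)
    return extracted


# Post and matched reference tuple from the slide.
matched = {"make": "Honda", "model": "Civic", "trim": "2 Dr SI", "year": "1991"}
result = extract("91 Civic SI RHD SHELL - $2900 -", matched)
```

Here "91" is labeled as the year because it is close to "1991", while "RHD", "SHELL" and "2900" resemble no reference attribute strongly enough and stay unlabeled. Price is not in the reference set, so extracting it would need a separate mechanism.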

  22. Results: Information Extraction
      State-of-the-art comparison
      1. Conditional Random Field (structure)
         1. CRF-Orth: orthographic features (cap, start-num, etc.)
         2. CRF-Win: CRF-Orth + 2-word sliding window (more structure!)
      2. Amilcare
         - NLP
         - “Gazetteers” (list of hotels, etc.)
      - ARX = automatic, others = supervised
      - Field-level extractions: all tokens required, no extras (strict!)
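The strict field-level criterion ("all tokens required, no extras") can be made concrete with a short scoring sketch. Reporting precision, recall and F1 is an assumption here, since the slides do not name the metric behind the reported numbers, and the example predictions are invented.

```python
from collections import Counter


def field_level_scores(predicted, gold):
    """Strict field-level scoring for one attribute.

    predicted and gold hold one entry per post: a list of tokens, or None
    when nothing was extracted or nothing should be. An extraction counts
    as correct only if it has exactly the gold tokens, no more, no fewer.
    """
    correct = sum(
        1
        for p, g in zip(predicted, gold)
        if p is not None and g is not None and Counter(p) == Counter(g)
    )
    n_predicted = sum(p is not None for p in predicted)
    n_gold = sum(g is not None for g in gold)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example for a "model" field over three posts.
predicted = [["civic"], ["accord", "lx"], None]
gold = [["civic"], ["accord"], ["camry"]]
precision, recall, f1 = field_level_scores(predicted, gold)
```

The second post shows why the criterion is strict: "accord lx" contains the right model but also an extra token, so it earns no credit.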

  23. Results: Information Extraction

      Craigs Cars Posts (Craigslist); reference set: ~27,000 cars from Edmunds / Super Lamb Auto

      Attribute     ARX     CRF-Orth   CRF-Win   Amilcare
      Make          97.95   83.66      78.67     94.57
      Model         88.61   74.25      68.72     81.24
      Trim          49.70   47.88      38.75     35.94
      Year          86.47   88.04      84.52     88.97

      BFT Posts (biddingfortravel.com); reference set: ~130 hotels from BiddingForTravel.com

      Attribute     ARX     CRF-Orth   CRF-Win   Amilcare
      Star Rating   91.03   94.77      94.21     96.46
      Hotel Name    73.46   67.47      41.33     62.91
      Local Area    71.98   70.19      33.07     68.01

      - ARX
        - Automatic & better than supervised on 5/7 attributes
      - Cases where ARX underperforms
        - w/in 5%
        - Strong numeric component
        - Recall issue
      - CRF-Win
        - Worst on 6/7
        - Can’t rely on structure!
      Automatic, state-of-the-art extraction on posts

  24. Automatic construction of reference sets
      - What if there isn’t already a reference set? Example posts: "HP Pavillion DV2000 laptop"; "Gateway ML6230, Intel Cel …"
      - What about coverage? Example posts: "Ford Focus"; "ACURA TL 3.2 VTEC - 1999"; "Dodge Caravan"
