Creating Specialized and General Corpora Using Automated Search Engine Queries

Marco Baroni, Serge Sharoff
SSLMIT, University of Bologna; CTS, University of Leeds


  1. Creating Specialized and General Corpora Using Automated Search Engine Queries. Marco Baroni, Serge Sharoff. SSLMIT, University of Bologna; CTS, University of Leeds. Birmingham, July 2005. Running title: Web-Corpora via Search Engine Queries.

  2. Outline
     1. Introduction
     2. Building a specialized corpus: Background; The procedure in detail; Conclusions on specialized corpus building
     3. Building a BNC: DIY manual; Analysing macrostructure (composition); Analysing microstructure (lexicon)
     4. Conclusions

  4. Introduction: a “middle ground strategy”. Some relevant work: Ghani and colleagues’ CorpusBuilder project; corpus comparison work, e.g., Rayson and Garside 2000.

  8. What you need: a Unix-like OS and Unix skills; the Google API (or, now, the Yahoo API); our scripts (contact us!); POS taggers, indexers, etc.

  12. Applications
      Uses: technical translation, terminography, populating ontologies...
      Domains: medical, legal, meteorology, arts, food, nautical terminology, (e-)commerce...
      Languages: English, Italian, Japanese, Spanish, German, French, Danish...

  13. The basic idea
      Select initial seeds
      Query Google for random seed combinations
      Retrieve pages and format as text (corpus)
      Extract new seeds via corpus comparison
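The last step of the basic idea, extracting new seeds via corpus comparison, can be sketched as follows. The slides do not say which statistic is used; this sketch assumes the log-likelihood keyness score in the style of the Rayson and Garside (2000) work cited in the introduction. The function names `log_likelihood` and `candidate_seeds`, and the toy token lists, are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness of one word.

    a, b: frequency of the word in the target and reference corpus.
    c, d: total token counts of the target and reference corpus.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def candidate_seeds(target_tokens, reference_tokens, top_n=10):
    """Rank words that are over-represented in the target corpus (assumed selection rule)."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = sum(target.values()), sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference[word]
        if a / c > b / d:  # keep only words relatively more frequent in the target
            scored.append((log_likelihood(a, b, c, d), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]


# Toy data, invented for illustration only.
target = "riff riff riff guitar guitar the the".split()
reference = "the the the the a a guitar".split()
print(candidate_seeds(target, reference))
```

On real data the target would be the retrieved web pages and the reference a general-purpose corpus; the top-ranked words would then feed the next round of queries.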

  16. Example
      7 seeds: black sabbath, led zeppelin, deep purple, motorhead, rainbow, judas priest, iron maiden
      35 3-seed combinations:
        "led zeppelin" rainbow "black sabbath"
        "deep purple" motorhead rainbow
        "deep purple" "judas priest" motorhead
        ...
      20 documents per query
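The count in the example follows from choosing 3 of 7 seeds: C(7,3) = 35. A minimal sketch of generating those queries is shown below. The helper names `quote` and `build_queries` are hypothetical, and quoting multi-word seeds as phrases is inferred from the queries shown on the slide, not from the authors' scripts.

```python
import random
from itertools import combinations

seeds = [
    "black sabbath", "led zeppelin", "deep purple", "motorhead",
    "rainbow", "judas priest", "iron maiden",
]


def quote(seed):
    """Wrap multi-word seeds in double quotes so they are searched as phrases."""
    return f'"{seed}"' if " " in seed else seed


def build_queries(seeds, size=3, rng=None):
    """Return one query string per unordered combination of `size` seeds."""
    queries = [
        " ".join(quote(s) for s in combo)
        for combo in combinations(seeds, size)
    ]
    if rng is not None:
        rng.shuffle(queries)  # optional: randomize the order in which queries are sent
    return queries


queries = build_queries(seeds)
print(len(queries))  # 35
print(queries[0])    # "black sabbath" "led zeppelin" "deep purple"

# Shuffled variant, seeded for reproducibility.
shuffled = build_queries(seeds, rng=random.Random(0))
```

With 20 documents requested per query, 35 queries yield at most 700 URLs before duplicates and failed downloads are removed.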

  18. Document retrieval and processing: automated retrieval of documents is the easy part (e.g., with the Perl LWP module); filtering and cleaning (“boilerplate removal”) is trickier.

  19. Cleaning with a standard HTML formatter (sample output)
      Blackmore’s Night Latest News
      Ritchie Blackmore’s Bio
      Blackmore’s Night Band Bios
      Blackmore’s Night Tour Info
      Blackmore’s Night Merchandise
      Blackmore’s Night Photo Gallery
      Blackmore’s Night Audio Clips
      ...
      Register for Blackmores Night Email Updates! Just enter your email address in the box below and click the ’Sign up’ button!
      ...
      RITCHIE BLACKMORE A MUSICAL HISTORY...
      1967 - RITCHIE BLACKMORE - who has previously played with such bands as the Outlaws, Screaming Lord Sutch, and Neil Christian & The Crusaders - is invited by ex-Artwoods/The Flowerpot Men keybordist Jon Lord (who was invited by The Searchers ex-drummer, Chris Curtis) to form a new band. Other musician’s would be auditioned from a Melody Maker ad in Deeves Hall in Hertfordshire.
      1968- In February, the group would form as Roundabout, consisting of the three (with Chris Curtis on vocals) along with Dave Curtis on bass and Bobby Woodman on drums. After only a month of uncompromising rehearsals, BLACKMORE and LORD would be the only two remaining, ...

  22. Finn’s BTE heuristic (http://www.smi.ucd.ie/hyppia/)
      Basic observation: the content-rich section of a page tends to occur in a low-HTML-density area
      Look for the stretch that maximizes the quantity: N(TOKEN) − N(TAG)

  23. Finn’s BTE heuristic: why it (mostly) works
      TAG TAG TOKEN TOKEN TAG TAG TAG TOKEN TAG TAG TOKEN TAG TAG TAG TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TAG TOKEN TOKEN TAG TOKEN TOKEN TOKEN TAG TAG TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TOKEN TAG TAG TAG TAG TAG TAG TOKEN TAG TAG TOKEN TAG
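The search for the stretch that maximizes N(TOKEN) − N(TAG) can be sketched as a maximum-sum contiguous run, scoring each text token +1 and each HTML tag −1. This is an illustrative reading of the quantity on the slide, not the authors' or Finn's actual implementation; the function name `best_stretch` and the short example sequence are hypothetical.

```python
def best_stretch(items):
    """Find the contiguous slice items[start:end] maximizing #TOKEN - #TAG.

    `items` is a sequence of the strings "TOKEN" and "TAG".
    Returns (start, end, score); (0, 0, 0) if no stretch scores above zero.
    """
    best_score, best_start, best_end = 0, 0, 0
    run_score, run_start = 0, 0
    for i, item in enumerate(items):
        if run_score <= 0:
            # A run that has dropped to zero or below cannot help; restart here.
            run_score, run_start = 0, i
        run_score += 1 if item == "TOKEN" else -1
        if run_score > best_score:
            best_score, best_start, best_end = run_score, run_start, i + 1
    return best_start, best_end, best_score


# Small made-up page: navigation tags, a text body with one inline tag, footer tags.
page = ["TAG", "TOKEN", "TOKEN", "TAG", "TOKEN", "TOKEN", "TOKEN", "TAG", "TAG"]
start, end, score = best_stretch(page)
print(start, end, score)  # 1 7 4
print(page[start:end])
```

The selected stretch tolerates an isolated tag inside running text, as in the example, because one −1 is outweighed by the tokens around it. Tag-heavy regions such as menus and footers fall outside the stretch because including them would lower the score.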
