Building a Web-Scale Dependency-Parsed Corpus from Common Crawl




  1. Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann Building a Web-Scale Dependency-Parsed Corpus from Common Crawl

  2. Introduction May 10, 2018 Building a Web-Scale Dependency-Parsed Corpus from Common Crawl, Panchenko et al. LREC’18 2/24

  3. Introduction: Motivation. Why are large corpora essential for NLP? Unsupervised methods, pre-training, and more: word embeddings [Mikolov et al., 2013]; open information extraction [Banko et al., 2007]; the “unreasonable effectiveness of big data” [Halevy et al., 2009]. Image source: https://goo.gl/egF322


  5. Introduction: Motivation. Some popular datasets used in NLP research: BNC: 0.1 billion tokens; ukWaC: 2 billion tokens; Wikipedia: 3 billion tokens. Web-scale datasets: ClueWeb12: 0.7 billion documents; CommonCrawl 2017: 3 billion documents; the indexed Web: 5 billion documents; the Web: 50 billion documents.


  8. Introduction: Motivation. Difficulties in using Common Crawl directly: documents are not linguistically analyzed; big-data infrastructure and skills are needed. Objectives of this work: make access to web-scale corpora a commodity: (1) easy to use: no download needed, access via API or web interface; (2) linguistically preprocessed; (3) original texts available.


  12. Related Work

  13. Related Work: large-scale text collections.

      Corpus       Tokens (10^9)  Documents (10^6)  Type       Source texts  Preprocessing  NER  Dep. parsed
      WaCkypedia           0.80              1.10   Encyclop.  Yes           Yes            No   Yes
      Wikipedia            2.90              5.47   Encyclop.  Yes           No             No   No
      PukWaC               1.91              5.69   Web        Yes           Yes            No   Yes
      GigaWord             1.76              4.11   News       Yes           No             No   No
      ENCOW16             16.82              9.22   Web        Yes           Yes            Yes  Yes
      ClueWeb12             N/A            733.02   Web        Yes           No             No   No
      Syn. Ngrams        345.00              3.50   Books      No            No             No   Yes

  14. Related Work: Common Crawl as a corpus. [Laippala & Ginter, 2014] used Common Crawl to construct a Finnish Parsebank (1.5 billion tokens, 116 million sentences). [Pennington et al., 2014]: GloVe embeddings trained on English Common Crawl, 42 and 840 billion tokens (tokenization only, no source texts). [Grave et al., 2018]: fastText embeddings trained on Common Crawl for 158 languages (tokenization only, no source texts).

  15. Building a Web-Scale Corpus

  16. Building a Web-Scale Corpus: corpus construction approach. [Pipeline figure; recoverable stages:]
      1. Crawling web pages: the Web is crawled by CCBot (Apache Nutch), producing WARC web crawls.
      2. Preprocessing (§3.1): C4Corpus (Apache Hadoop) produces filtered, preprocessed documents.
      3. Linguistic analysis (§3.2): POS tagging (OpenNLP), lemmatization (Stanford), named entity recognition (Stanford), and dependency parsing (Malt + collapsing), yielding DepCC, the dependency-parsed corpus (§3.3).
      4. Computation of a distributional model (§5.2): JoBimText with lefex (Apache Spark / Apache Hadoop), producing term vectors and a distributional thesaurus.
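The end product of this pipeline is a dependency-parsed corpus stored in CoNLL format. For orientation, here is a standard CoNLL-U-style analysis of a toy sentence; this illustrates the general format only, and the actual DepCC column layout (which also encodes collapsed dependencies) may differ:

```
# text = The web is big.
1   The   the   DET    _  _  2  det    _  _
2   web   web   NOUN   _  _  4  nsubj  _  _
3   is    be    AUX    _  _  4  cop    _  _
4   big   big   ADJ    _  _  0  root   _  _
5   .     .     PUNCT  _  _  4  punct  _  _
```

Each token line carries an index, the surface form, the lemma, a part-of-speech tag, the index of its syntactic head (0 for the root), and the dependency relation.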

  17. Building a Web-Scale Corpus: preprocessing of texts with the C4Corpus tool [Habernal et al., 2016]: (1) language detection, license detection, and removal of boilerplate page elements, such as menus; (2) “exact match” document de-duplication; (3) removal of near-duplicate documents. Available at: s3://commoncrawl/contrib/c4corpus/CC-MAIN-2016-07
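Step 2, “exact match” de-duplication, boils down to keeping only the first copy of each byte-identical document. A minimal sketch of the idea in Python, hashing the full document text (an illustration only; the actual C4Corpus tool implements this as a Hadoop job, not as this function):

```python
import hashlib

def exact_dedup(documents):
    """Keep the first occurrence of each byte-identical document."""
    seen = set()     # SHA-1 digests of documents already emitted
    unique = []
    for doc in documents:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:   # first time we see this exact content
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing avoids holding every document text in memory for pairwise comparison; removing near-duplicates (step 3) requires fuzzier techniques such as shingling, which this sketch does not cover.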

  18. Building a Web-Scale Corpus: stages of development of the corpus, based on the Common Crawl 2016-07 web crawl dump.

      Stage of the processing                      Size (.gz)
      Input raw web crawl (HTML, WARC)             29,539.4 Gb
      Preprocessed corpus (simple HTML)               832.0 Gb
      Preprocessed corpus, English (simple HTML)      683.4 Gb
      Dependency-parsed English corpus (CoNLL)      2,624.6 Gb
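The figures above imply two useful ratios: how much the raw crawl shrinks after boilerplate removal and language filtering, and how much the CoNLL annotation inflates the text. A quick check in Python, using the gzipped sizes from the slide (all values in Gb):

```python
raw_crawl = 29_539.4        # input raw web crawl (HTML, WARC)
preprocessed_en = 683.4     # preprocessed English corpus (simple HTML)
parsed_conll = 2_624.6      # dependency-parsed English corpus (CoNLL)

reduction = raw_crawl / preprocessed_en     # crawl shrinks by roughly 43x
expansion = parsed_conll / preprocessed_en  # annotation inflates text roughly 3.8x
```

So only about 2% of the crawled bytes survive filtering, while the per-token CoNLL annotations nearly quadruple the size of the surviving text.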


  21. Building a Web-Scale Corpus: linguistic analysis of texts. (1) POS tagging and lemmatization: OpenNLP POS tagger; Stanford lemmatizer. (2) Named entity recognition: Stanford NER [Finkel et al., 2005]; 7.48 billion occurrences of entities (251.92 billion tokens). (3) Dependency parsing: Malt parser [Hall et al., 2010], which parses 1 Mb of text per core in 1–4 minutes; Malt is also used in PukWaC [Baroni et al., 2009] and ENCOW16 [Schäfer, 2015]; dependency collapsing with [Ruppert et al., 2015].
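The quoted throughput (about 1 Mb of text per core in 1–4 minutes) allows a back-of-the-envelope estimate of the total parsing cost. As an assumption, the sketch below uses the gzipped size of the preprocessed English corpus (683.4 Gb) as a stand-in for the plain-text volume, so treat the result as an order-of-magnitude figure only:

```python
corpus_mb = 683.4 * 1024        # corpus size in Mb (gzipped; an approximation)

# 1 minute per Mb per core (best case) vs. 4 minutes (worst case),
# converted from core-minutes to core-days.
core_days_best = corpus_mb * 1 / 60 / 24    # roughly 486 core-days
core_days_worst = corpus_mb * 4 / 60 / 24   # roughly 1,944 core-days
```

On a cluster with a few hundred cores, this corresponds to a few days to a couple of weeks of wall-clock time, which is consistent with the Hadoop-based setup described in the pipeline.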
