Building a Web-Scale Dependency-Parsed Corpus from Common Crawl




  1. Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann Building a Web-Scale Dependency-Parsed Corpus from Common Crawl

  2. Introduction May 10, 2018 Building a Web-Scale Dependency-Parsed Corpus from Common Crawl, Panchenko et al. LREC’18 2/24

  3. Introduction: Motivation. Why are large corpora essential for NLP? Unsupervised methods, pre-training, and more: word embeddings [Mikolov et al., 2013]; open information extraction [Banko et al., 2007]; the “unreasonable effectiveness of big data” [Halevy et al., 2009]. Image source: https://goo.gl/egF322


  5. Introduction: Motivation. Some popular datasets used in NLP research: BNC: 0.1 billion tokens; ukWaC: 2 billion tokens; Wikipedia: 3 billion tokens. Web-scale datasets: ClueWeb12: 0.7 billion documents; CommonCrawl 2017: 3 billion documents; the indexed Web: 5 billion documents; the Web: 50 billion documents.


  8. Introduction: Motivation. Difficulties in using Common Crawl directly: documents are not linguistically analyzed; big-data infrastructure and skills are needed. Objectives of this work: make access to web-scale corpora a commodity: (1) easy to use: no download needed, access via API or web interface; (2) linguistically preprocessed; (3) original texts available.


  12. Related Work

  13. Related Work: large-scale text collections.

      Corpus       Tokens (10^9)  Documents (10^6)  Type       Source texts  Preprocessing  NER  Dep. parsed
      WaCkypedia           0.80              1.10   Encyclop.  Yes           Yes            No   Yes
      Wikipedia            2.90              5.47   Encyclop.  Yes           No             No   No
      PukWaC               1.91              5.69   Web        Yes           Yes            No   Yes
      GigaWord             1.76              4.11   News       Yes           No             No   No
      ENCOW16             16.82              9.22   Web        Yes           Yes            Yes  Yes
      ClueWeb12             N/A            733.02   Web        Yes           No             No   No
      Syn. Ngrams        345.00              3.50   Books      No            No             No   Yes

  14. Related Work: Common Crawl as a corpus. [Laippala & Ginter, 2014] used Common Crawl to construct a Finnish Parsebank (1.5 billion tokens, 116 million sentences). [Pennington et al., 2014]: GloVe embeddings trained on English Common Crawl, 42 and 840 billion tokens (tokenization only, no source texts). [Grave et al., 2018]: fastText embeddings trained on Common Crawl for 158 languages (tokenization only, no source texts).

  15. Building a Web-Scale Corpus

  16. Building a Web-Scale Corpus: corpus construction approach. [Pipeline figure; recoverable stages:]
      1. Crawling web pages: the Web is crawled by CCBot (Apache Nutch), producing WARC web crawls.
      2. Preprocessing (§3.1): C4Corpus (Apache Hadoop) produces filtered, preprocessed documents.
      3. Linguistic analysis (§3.2): POS tagging (OpenNLP), lemmatization (Stanford), named entity recognition (Stanford), and dependency parsing (Malt + collapsing), yielding DepCC, the dependency-parsed corpus (§3.3).
      4. Computation of a distributional model (§5.2): JoBimText with lefex (Apache Spark / Apache Hadoop), producing term vectors and a distributional thesaurus.
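The end product of this pipeline is a dependency-parsed corpus stored in CoNLL format. For orientation, here is a standard CoNLL-U-style analysis of a toy sentence; this illustrates the general format only, and the actual DepCC column layout (which also encodes collapsed dependencies) may differ:

```
# text = The web is big.
1   The   the   DET    _  _  2  det    _  _
2   web   web   NOUN   _  _  4  nsubj  _  _
3   is    be    AUX    _  _  4  cop    _  _
4   big   big   ADJ    _  _  0  root   _  _
5   .     .     PUNCT  _  _  4  punct  _  _
```

Each token line carries an index, the surface form, the lemma, a part-of-speech tag, the index of its syntactic head (0 for the root), and the dependency relation.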

  17. Building a Web-Scale Corpus: preprocessing of texts with the C4Corpus tool [Habernal et al., 2016]: (1) language detection, license detection, and removal of boilerplate page elements, such as menus; (2) “exact match” document de-duplication; (3) removal of near-duplicate documents. Available at: s3://commoncrawl/contrib/c4corpus/CC-MAIN-2016-07
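Step 2, “exact match” de-duplication, boils down to keeping only the first copy of each byte-identical document. A minimal sketch of the idea in Python, hashing the full document text (an illustration only; the actual C4Corpus tool implements this as a Hadoop job, not as this function):

```python
import hashlib

def exact_dedup(documents):
    """Keep the first occurrence of each byte-identical document."""
    seen = set()     # SHA-1 digests of documents already emitted
    unique = []
    for doc in documents:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:   # first time we see this exact content
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing avoids holding every document text in memory for pairwise comparison; removing near-duplicates (step 3) requires fuzzier techniques such as shingling, which this sketch does not cover.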

  18. Building a Web-Scale Corpus: stages of development of the corpus, based on the Common Crawl 2016-07 web crawl dump.

      Stage of the processing                      Size (.gz)
      Input raw web crawl (HTML, WARC)             29,539.4 Gb
      Preprocessed corpus (simple HTML)               832.0 Gb
      Preprocessed corpus, English (simple HTML)      683.4 Gb
      Dependency-parsed English corpus (CoNLL)      2,624.6 Gb
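The figures above imply two useful ratios: how much the raw crawl shrinks after boilerplate removal and language filtering, and how much the CoNLL annotation inflates the text. A quick check in Python, using the gzipped sizes from the slide (all values in Gb):

```python
raw_crawl = 29_539.4        # input raw web crawl (HTML, WARC)
preprocessed_en = 683.4     # preprocessed English corpus (simple HTML)
parsed_conll = 2_624.6      # dependency-parsed English corpus (CoNLL)

reduction = raw_crawl / preprocessed_en     # crawl shrinks by roughly 43x
expansion = parsed_conll / preprocessed_en  # annotation inflates text roughly 3.8x
```

So only about 2% of the crawled bytes survive filtering, while the per-token CoNLL annotations nearly quadruple the size of the surviving text.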


  21. Building a Web-Scale Corpus: linguistic analysis of texts. (1) POS tagging and lemmatization: OpenNLP POS tagger; Stanford lemmatizer. (2) Named entity recognition: Stanford NER [Finkel et al., 2005]; 7.48 billion occurrences of entities (251.92 billion tokens). (3) Dependency parsing: Malt parser [Hall et al., 2010], which parses 1 Mb of text per core in 1–4 minutes; Malt is also used in PukWaC [Baroni et al., 2009] and ENCOW16 [Schäfer, 2015]; dependency collapsing with [Ruppert et al., 2015].
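The quoted throughput (about 1 Mb of text per core in 1–4 minutes) allows a back-of-the-envelope estimate of the total parsing cost. As an assumption, the sketch below uses the gzipped size of the preprocessed English corpus (683.4 Gb) as a stand-in for the plain-text volume, so treat the result as an order-of-magnitude figure only:

```python
corpus_mb = 683.4 * 1024        # corpus size in Mb (gzipped; an approximation)

# 1 minute per Mb per core (best case) vs. 4 minutes (worst case),
# converted from core-minutes to core-days.
core_days_best = corpus_mb * 1 / 60 / 24    # roughly 486 core-days
core_days_worst = corpus_mb * 4 / 60 / 24   # roughly 1,944 core-days
```

On a cluster with a few hundred cores, this corresponds to a few days to a couple of weeks of wall-clock time, which is consistent with the Hadoop-based setup described in the pipeline.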
