Corpus Bootstrapping with NLTK by Jacob Perkins
Jacob Perkins
http://www.weotta.com
http://streamhacker.com
http://text-processing.com
https://github.com/japerk/nltk-trainer
@japerk
Problem
- you want to do natural language processing
- there are many proven supervised training algorithms
- but you don't have a training corpus
Solution
- make a custom training corpus
Problems with Manual Annotation
- takes time
- requires expertise
- expert time costs $$$
Solution: Bootstrap
- less time
- less expertise
- costs less
- requires thinking & creativity
Corpus Bootstrapping at Weotta
- review sentiment
- keyword classification
- phrase extraction & classification
Bootstrapping Examples
- english -> spanish sentiment
- phrase extraction
Translating Sentiment
- start with an english sentiment corpus & classifier
- english -> spanish -> spanish
English -> Spanish -> Spanish
1. translate english examples to spanish
2. train classifier
3. classify spanish text into new corpus
4. correct new corpus
5. retrain classifier
6. add to corpus & goto 4 until done
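Step 3 above can be sketched in plain Python. The `classify_to_corpus.py` script in nltk-trainer does something along these lines, but this stand-in is only an illustration: the classifier object with a `classify(text)` method is an assumed interface, not the actual implementation.

```python
import os

def classify_to_corpus(sentences, classifier, corpus_dir):
    """Sort raw sentences into one file per predicted label.

    `classifier` is any object with a classify(text) -> label method,
    e.g. a trained NLTK classifier wrapped with a feature extractor.
    Returns a count of sentences written per label.
    """
    os.makedirs(corpus_dir, exist_ok=True)
    counts = {}
    for sent in sentences:
        label = classifier.classify(sent)
        # append each sentence to the file for its predicted label
        path = os.path.join(corpus_dir, label + '.txt')
        with open(path, 'a', encoding='utf-8') as f:
            f.write(sent + '\n')
        counts[label] = counts.get(label, 0) + 1
    return counts
```

The resulting label files are exactly what the manual-correction step then scans: every sentence the classifier got wrong gets moved to the right file by hand.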
Translate Corpus
$ translate_corpus.py movie_reviews --source english --target spanish
Train Initial Classifier
$ train_classifier.py spanish_movie_reviews
Create New Corpus
$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle
Manual Correction
1. scan each file
2. move incorrect examples to the correct file
Train New Classifier
$ train_classifier.py spanish_sentiment
Adding to the Corpus
- start with >90% probability
- retrain
- carefully decrease the probability threshold
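A minimal sketch of the threshold idea, assuming NLTK's classifier interface: `prob_classify` returns a probability distribution with `max()` and `prob(label)` methods. The function name and the idea of passing raw text (rather than a feature set) are illustrative assumptions.

```python
def confident_examples(texts, classifier, threshold=0.9):
    """Yield (text, label) pairs the classifier is confident about.

    Only examples whose most likely label meets the probability
    threshold are yielded; everything else is left for a later pass
    at a lower threshold, after retraining.
    """
    for text in texts:
        probdist = classifier.prob_classify(text)
        label = probdist.max()
        if probdist.prob(label) >= threshold:
            yield text, label
```

Starting at 0.9 and lowering the threshold only after retraining keeps low-confidence (and therefore more error-prone) guesses out of the corpus early on.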
Add More at a Lower Threshold
$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
When Are You Done?
- what level of accuracy do you need?
- does your corpus reflect real text?
- how much time do you have?
Tips
- garbage in, garbage out
- correct bad data
- clean & scrub text
- experiment with train_classifier.py options
- create custom features
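Custom features usually start from the standard NLTK bag-of-words dict (word -> True), which NaiveBayes-style classifiers consume directly. A sketch of that baseline plus one easy variation; the stopword-filtering helper is an illustrative example, not an nltk-trainer function.

```python
def bag_of_words(words):
    """The standard NLTK feature dict: every word maps to True."""
    return dict((word, True) for word in words)

def bag_of_words_not_in(words, stopwords):
    """Bag of words minus stopwords -- a simple custom feature set."""
    return bag_of_words(w for w in words if w not in stopwords)
```

From here, custom features can add bigrams, normalized casing, or domain-specific markers; the classifier only ever sees the feature dict, so cleaning happens entirely in this function.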
Bootstrapping a Phrase Extractor
1. find a pos tagged corpus
2. annotate raw text
3. train pos tagger
4. create pos tagged & chunked corpus
5. tag unknown words
6. train pos tagger & chunker
7. correct errors
8. add to corpus, goto 5 until done
NLTK Tagged Corpora
- English: brown, conll2000, treebank
- Portuguese: mac_morpho, floresta
- Spanish: cess_esp, conll2002
- Catalan: cess_cat
- Dutch: alpino, conll2002
- Indian Languages: indian
- Chinese: sinica_treebank
see http://text-processing.com/demo/tag/
Train Tagger
$ train_tagger.py treebank --simplify_tags
Phrase Annotation
Hello world, [this is an important phrase].
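One way to read this bracket annotation back out of raw text is a small regex parser; this is a hypothetical helper for illustration, not part of nltk-trainer.

```python
import re

def parse_annotation(line):
    """Split a bracket-annotated line into (plain text, phrases).

    "Hello world, [this is an important phrase]." ->
    ("Hello world, this is an important phrase.",
     ["this is an important phrase"])
    """
    phrases = re.findall(r'\[([^\]]+)\]', line)   # text inside [...]
    text = re.sub(r'[\[\]]', '', line)            # drop the brackets
    return text, phrases
```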
Tag Phrases
$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt
Chunked & Tagged Phrase
Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
Correct Unknown Words
1. find -NONE- tagged words
2. fix tags
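Step 1 can be automated with a short scan over the tagged sentences; `find_unknown_words` is a hypothetical helper name, and `-NONE-` is the placeholder tag for words the tagger could not tag.

```python
def find_unknown_words(tagged_sents, unknown_tag='-NONE-'):
    """Yield (sentence index, word index, word) for every word the
    tagger left untagged, so a human can go fix just those tags."""
    for i, sent in enumerate(tagged_sents):
        for j, (word, tag) in enumerate(sent):
            if tag == unknown_tag:
                yield i, j, word
```

This keeps the manual pass focused on the unknowns instead of re-reading the whole corpus.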
Train New Tagger
$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Train Chunker
$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Extracting Phrases

import collections
import nltk.data
from nltk import tokenize
from nltk.tag import untag

# load the tagger & chunker trained on the custom corpus
tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # collect the words of every non-sentence subtree, keyed by chunk tag
    d = collections.defaultdict(list)
    for sub in t.subtrees(lambda s: s.node != 'S'):
        d[sub.node].append(' '.join(untag(sub.leaves())))
    return d

# text is the raw input string
sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<type 'list'>, {'PHRASE_TAG': ['phrase']})
Final Tips
- error correction is faster than manual annotation
- find close-enough corpora
- use nltk-trainer to experiment
- iterate -> quality
- no substitute for human judgement
Links
http://www.nltk.org
https://github.com/japerk/nltk-trainer
http://text-processing.com