Corpus Bootstrapping with NLTK by Jacob Perkins
Jacob Perkins
http://www.weotta.com
http://streamhacker.com
http://text-processing.com
https://github.com/japerk/nltk-trainer
@japerk
Problem
- you want to do natural language processing
- there are many proven supervised training algorithms
- but you don't have a training corpus
Solution
- make a custom training corpus
Problems with Manual Annotation
- takes time
- requires expertise
- expert time costs $$$
Solution: Bootstrap
- less time
- less expertise
- costs less
- requires thinking & creativity
Corpus Bootstrapping at Weotta
- review sentiment
- keyword classification
- phrase extraction & classification
Bootstrapping Examples
- english -> spanish sentiment
- phrase extraction
Translating Sentiment
- start with an english sentiment corpus & classifier
- english -> spanish -> spanish
English -> Spanish -> Spanish
1. translate english examples to spanish
2. train classifier
3. classify spanish text into new corpus
4. correct new corpus
5. retrain classifier
6. add to corpus & goto 4 until done
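Step 3 above can be sketched in plain Python. The `classify_to_corpus.py` script in nltk-trainer does something along these lines, but this stand-in is only an illustration: the classifier object with a `classify(text)` method is an assumed interface, not the actual implementation.

```python
import os

def classify_to_corpus(sentences, classifier, corpus_dir):
    """Sort raw sentences into one file per predicted label.

    `classifier` is any object with a classify(text) -> label method,
    e.g. a trained NLTK classifier wrapped with a feature extractor.
    Returns a count of sentences written per label.
    """
    os.makedirs(corpus_dir, exist_ok=True)
    counts = {}
    for sent in sentences:
        label = classifier.classify(sent)
        # append each sentence to the file for its predicted label
        path = os.path.join(corpus_dir, label + '.txt')
        with open(path, 'a', encoding='utf-8') as f:
            f.write(sent + '\n')
        counts[label] = counts.get(label, 0) + 1
    return counts
```

The resulting label files are exactly what the manual-correction step then scans: every sentence the classifier got wrong gets moved to the right file by hand.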
Translate Corpus
$ translate_corpus.py movie_reviews --source english --target spanish
Train Initial Classifier
$ train_classifier.py spanish_movie_reviews
Create New Corpus
$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle
Manual Correction
1. scan each file
2. move incorrect examples to the correct file
Train New Classifier
$ train_classifier.py spanish_sentiment
Adding to the Corpus
- start with >90% probability
- retrain
- carefully decrease the probability threshold
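A minimal sketch of the threshold idea, assuming NLTK's classifier interface: `prob_classify` returns a probability distribution with `max()` and `prob(label)` methods. The function name and the idea of passing raw text (rather than a feature set) are illustrative assumptions.

```python
def confident_examples(texts, classifier, threshold=0.9):
    """Yield (text, label) pairs the classifier is confident about.

    Only examples whose most likely label meets the probability
    threshold are yielded; everything else is left for a later pass
    at a lower threshold, after retraining.
    """
    for text in texts:
        probdist = classifier.prob_classify(text)
        label = probdist.max()
        if probdist.prob(label) >= threshold:
            yield text, label
```

Starting at 0.9 and lowering the threshold only after retraining keeps low-confidence (and therefore more error-prone) guesses out of the corpus early on.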
Add More at a Lower Threshold
$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
When Are You Done?
- what level of accuracy do you need?
- does your corpus reflect real text?
- how much time do you have?
Tips
- garbage in, garbage out
- correct bad data
- clean & scrub text
- experiment with train_classifier.py options
- create custom features
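Custom features usually start from the standard NLTK bag-of-words dict (word -> True), which NaiveBayes-style classifiers consume directly. A sketch of that baseline plus one easy variation; the stopword-filtering helper is an illustrative example, not an nltk-trainer function.

```python
def bag_of_words(words):
    """The standard NLTK feature dict: every word maps to True."""
    return dict((word, True) for word in words)

def bag_of_words_not_in(words, stopwords):
    """Bag of words minus stopwords -- a simple custom feature set."""
    return bag_of_words(w for w in words if w not in stopwords)
```

From here, custom features can add bigrams, normalized casing, or domain-specific markers; the classifier only ever sees the feature dict, so cleaning happens entirely in this function.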
Bootstrapping a Phrase Extractor
1. find a pos tagged corpus
2. annotate raw text
3. train pos tagger
4. create pos tagged & chunked corpus
5. tag unknown words
6. train pos tagger & chunker
7. correct errors
8. add to corpus, goto 5 until done
NLTK Tagged Corpora
- English: brown, conll2000, treebank
- Portuguese: mac_morpho, floresta
- Spanish: cess_esp, conll2002
- Catalan: cess_cat
- Dutch: alpino, conll2002
- Indian Languages: indian
- Chinese: sinica_treebank
see http://text-processing.com/demo/tag/
Train Tagger
$ train_tagger.py treebank --simplify_tags
Phrase Annotation
Hello world, [this is an important phrase].
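One way to read this bracket annotation back out of raw text is a small regex parser; this is a hypothetical helper for illustration, not part of nltk-trainer.

```python
import re

def parse_annotation(line):
    """Split a bracket-annotated line into (plain text, phrases).

    "Hello world, [this is an important phrase]." ->
    ("Hello world, this is an important phrase.",
     ["this is an important phrase"])
    """
    phrases = re.findall(r'\[([^\]]+)\]', line)   # text inside [...]
    text = re.sub(r'[\[\]]', '', line)            # drop the brackets
    return text, phrases
```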
Tag Phrases
$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt
Chunked & Tagged Phrase
Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
Correct Unknown Words
1. find -NONE- tagged words
2. fix tags
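Step 1 can be automated with a short scan over the tagged sentences; `find_unknown_words` is a hypothetical helper name, and `-NONE-` is the placeholder tag for words the tagger could not tag.

```python
def find_unknown_words(tagged_sents, unknown_tag='-NONE-'):
    """Yield (sentence index, word index, word) for every word the
    tagger left untagged, so a human can go fix just those tags."""
    for i, sent in enumerate(tagged_sents):
        for j, (word, tag) in enumerate(sent):
            if tag == unknown_tag:
                yield i, j, word
```

This keeps the manual pass focused on the unknowns instead of re-reading the whole corpus.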
Train New Tagger
$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Train Chunker
$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Extracting Phrases

import collections
import nltk.data
from nltk import tokenize
from nltk.tag import untag

# load the tagger & chunker trained on the custom corpus
tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # collect the words of every non-sentence subtree, keyed by chunk tag
    d = collections.defaultdict(list)
    for sub in t.subtrees(lambda s: s.node != 'S'):
        d[sub.node].append(' '.join(untag(sub.leaves())))
    return d

# text is the raw input string
sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<type 'list'>, {'PHRASE_TAG': ['phrase']})
Final Tips
- error correction is faster than manual annotation
- find close-enough corpora
- use nltk-trainer to experiment
- iterate -> quality
- no substitute for human judgement
Links
http://www.nltk.org
https://github.com/japerk/nltk-trainer
http://text-processing.com