corpus bootstrapping with nltk



  1. Corpus Bootstrapping with NLTK by Jacob Perkins

  2. Jacob Perkins http://www.weotta.com http://streamhacker.com http://text-processing.com https://github.com/japerk/nltk-trainer @japerk

  3. Problem: you want to do NLProc; there are many proven supervised training algorithms; but you don’t have a training corpus

  4. Solution: make a custom training corpus

  5. Problems with Manual Annotation: takes time; requires expertise; expert time costs $$$

  6. Solution: Bootstrap (less time; less expertise; costs less; requires thinking & creativity)

  7. Corpus Bootstrapping at Weotta: review sentiment; keyword classification; phrase extraction & classification

  8. Bootstrapping Examples: english -> spanish sentiment; phrase extraction

  9. Translating Sentiment: start with english sentiment corpus & classifier; english -> spanish -> spanish

  10. English -> Spanish -> Spanish: (1) translate english examples to spanish; (2) train classifier; (3) classify spanish text into new corpus; (4) correct new corpus; (5) retrain classifier; (6) add to corpus & goto 4 until done
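The loop on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not how nltk-trainer implements it: `train_keyword_model`, `bootstrap`, and `review` are hypothetical names, and the toy keyword counter stands in for a real classifier such as Naive Bayes. The `review` callable represents the human correction step; here it accepts every guess unchanged.

```python
from collections import Counter, defaultdict

def train_keyword_model(corpus):
    """Toy stand-in for a real classifier: count words seen under each label.

    corpus is a list of (text, label) pairs.
    """
    counts = defaultdict(Counter)
    for text, label in corpus:
        counts[label].update(text.lower().split())

    def classify(text):
        words = text.lower().split()
        # Pick the label whose training words overlap most with the text.
        # Labels are sorted so that ties are broken deterministically.
        return max(sorted(counts),
                   key=lambda lab: sum(counts[lab][w] for w in words))

    return classify

def bootstrap(seed, unlabeled, review, batch_size=2):
    """Grow a labeled corpus by classifying, reviewing, and retraining."""
    corpus = list(seed)
    pending = list(unlabeled)
    while pending:
        classify = train_keyword_model(corpus)                # steps 2 and 5
        batch, pending = pending[:batch_size], pending[batch_size:]
        guesses = [(text, classify(text)) for text in batch]  # step 3
        corpus.extend(review(guesses))                        # steps 4 and 6
    return train_keyword_model(corpus), corpus

# Tiny made-up seed corpus, as if translated from English examples.
seed = [("buena película", "pos"), ("mala película", "neg")]
unlabeled = ["muy buena", "muy mala", "buena historia"]
classify, corpus = bootstrap(seed, unlabeled, review=lambda guesses: guesses)
```

In practice `review` is where a person moves misclassified examples to the right category before the next retraining pass.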

  11. Translate Corpus: $ translate_corpus.py movie_reviews --source english --target spanish

  12. Train Initial Classifier: $ train_classifier.py spanish_movie_reviews

  13. Create New Corpus: $ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle

  14. Manual Correction: (1) scan each file; (2) move incorrect examples to correct file

  15. Train New Classifier: $ train_classifier.py spanish_sentiment

  16. Adding to the Corpus: start with >90% probability; retrain; carefully decrease probability threshold

  17. Add more at a Lower Threshold: $ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
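The thresholding idea can be illustrated without any particular classifier. In this minimal sketch, `select_confident` is a hypothetical helper, and `FAKE_PROBS` holds made-up probabilities standing in for a trained classifier's output. With an NLTK classifier, the equivalent information would come from `prob_classify`, which returns a probability distribution over labels.

```python
from collections import defaultdict

def select_confident(examples, prob_classify, threshold=0.9):
    """Group examples by predicted label, keeping only confident predictions.

    prob_classify(text) must return a dict mapping label -> probability.
    """
    accepted = defaultdict(list)
    for text in examples:
        probs = prob_classify(text)
        label = max(probs, key=probs.get)   # most probable label
        if probs[label] >= threshold:
            accepted[label].append(text)
    return dict(accepted)

# Hypothetical probabilities, for illustration only.
FAKE_PROBS = {
    "me encantó": {"pos": 0.97, "neg": 0.03},
    "no está mal": {"pos": 0.85, "neg": 0.15},
    "aburrida": {"pos": 0.08, "neg": 0.92},
}

examples = list(FAKE_PROBS)
first_pass = select_confident(examples, FAKE_PROBS.get, threshold=0.9)
second_pass = select_confident(examples, FAKE_PROBS.get, threshold=0.8)
```

Lowering the threshold from 0.9 to 0.8 admits the borderline example, which is why each lower-threshold pass deserves a fresh round of manual correction before retraining.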

  18. When are you done? What level of accuracy do you need? Does your corpus reflect real text? How much time do you have?

  19. Tips: garbage in, garbage out; correct bad data; clean & scrub text; experiment with train_classifier.py options; create custom features

  20. Bootstrapping a Phrase Extractor: (1) find a pos tagged corpus; (2) annotate raw text; (3) train pos tagger; (4) create pos tagged & chunked corpus; (5) tag unknown words; (6) train pos tagger & chunker; (7) correct errors; (8) add to corpus, goto 5 until done

  21. NLTK Tagged Corpora (see http://text-processing.com/demo/tag/)
      English: brown, conll2000, treebank
      Portuguese: mac_morpho, floresta
      Spanish: cess_esp, conll2002
      Catalan: cess_cat
      Dutch: alpino, conll2002
      Indian Languages: indian
      Chinese: sinica_treebank

  22. Train Tagger: $ train_tagger.py treebank --simplify_tags

  23. Phrase Annotation: Hello world, [this is an important phrase].

  24. Tag Phrases: $ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt

  25. Chunked & Tagged Phrase: Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
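To make the bracketed word/TAG notation concrete, here is a small standard-library sketch that reads one such line. `parse_chunked` and the `"PHRASE"` chunk label are hypothetical names chosen for illustration. NLTK has its own reader for this notation (`nltk.chunk.tagstr2tree`, which builds a tree), so this is meant only to show what the format encodes.

```python
def parse_chunked(line, chunk_label="PHRASE"):
    """Split a 'word/TAG [ word/TAG ... ]' line into tokens and chunks.

    Returns (tagged, chunks): tagged is every (word, tag) pair in order;
    chunks is a list of (chunk_label, [(word, tag), ...]), one per
    bracketed span.
    """
    tagged, chunks, current = [], [], None
    for token in line.split():
        if token == "[":
            current = []                      # open a new chunk
        elif token == "]":
            chunks.append((chunk_label, current))
            current = None                    # close the chunk
        else:
            # Split on the last slash so tokens like ',/,' parse correctly.
            word, _, tag = token.rpartition("/")
            tagged.append((word, tag))
            if current is not None:
                current.append((word, tag))
    return tagged, chunks

line = "Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./."
tagged, chunks = parse_chunked(line)
phrases = [" ".join(word for word, _ in words) for _, words in chunks]
```

Here `phrases` recovers the annotated span from slide 23 as plain text, while `tagged` keeps the part-of-speech tags needed for training.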

  26. Correct Unknown Words: (1) find -NONE- tagged words; (2) fix tags
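Both steps can be partly scripted. The sketch below assumes the corpus is stored as lines of word/TAG tokens, as on the previous slide. `unknown_words` and `fix_tags` are hypothetical helpers, and the sample lines and corrections are made up. Counting the unknown words first lets you fix the most frequent ones before the rest.

```python
from collections import Counter

def unknown_words(lines, unknown_tag="-NONE-"):
    """Count how often each word carries the unknown tag."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            word, sep, tag = token.rpartition("/")
            if sep and tag == unknown_tag:   # skips '[' and ']' (no slash)
                counts[word] += 1
    return counts

def fix_tags(line, corrections, unknown_tag="-NONE-"):
    """Replace the unknown tag for every word that has a known correction."""
    fixed = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if sep and tag == unknown_tag and word in corrections:
            token = word + "/" + corrections[word]
        fixed.append(token)
    return " ".join(fixed)

# Made-up sample lines in the chunked & tagged format.
lines = [
    "[ the/DET taqueria/-NONE- ] is/V great/ADJ ./.",
    "try/V [ the/DET taqueria/-NONE- burrito/-NONE- ] ./.",
]
counts = unknown_words(lines)
fixed = fix_tags(lines[0], {"taqueria": "N"})
```

The corrections dictionary is still filled in by a person; the script only removes the repetitive part of applying each fix across the corpus.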

  27. Train New Tagger: $ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

  28. Train Chunker: $ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

  29. Extracting Phrases:

          import collections
          import nltk.data
          from nltk import tokenize
          from nltk.tag import untag

          # Load the tagger and chunker trained in the previous steps.
          tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
          chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

          def extract_phrases(t):
              # Map each chunk label to the list of phrases found under it.
              d = collections.defaultdict(list)
              # NLTK 3 uses s.label(); the original NLTK 2 code used s.node.
              for sub in t.subtrees(lambda s: s.label() != 'S'):
                  d[sub.label()].append(' '.join(untag(sub.leaves())))
              return d

          # text is the raw input string to extract phrases from.
          sents = tokenize.sent_tokenize(text)
          words = tokenize.word_tokenize(sents[0])
          d = extract_phrases(chunker.parse(tagger.tag(words)))
          # d maps chunk labels to phrases, e.g. {'PHRASE_TAG': ['phrase']}

  30. Final Tips: error correction is faster than manual annotation; find close enough corpora; use nltk-trainer to experiment; iterate -> quality; no substitute for human judgement

  31. Links: http://www.nltk.org https://github.com/japerk/nltk-trainer http://text-processing.com
