Counting Words: Pre-Processing and Non-Randomness


  1. Counting Words: Pre-Processing and Non-Randomness
     Marco Baroni & Stefan Evert
     Málaga, 11 August 2006

  2. Outline
     ◮ Pre-Processing
     ◮ Non-Randomness
     ◮ The End

  3-6. Pre-processing
     ◮ IT IS IMPORTANT!!! (Evert and Lüdeling 2001)
     ◮ Automated pre-processing is often necessary (13,850 types begin with re- in the BNC; 103,941 types begin with ri- in itWaC)
     ◮ We can rely on:
       ◮ POS tagging
       ◮ Lemmatization
       ◮ Pattern-matching heuristics (e.g., a candidate prefixed form must be analyzable as PRE+VERB, with the VERB independently attested in the corpus; see the sketch below)
     ◮ However...
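
A minimal sketch of such a PRE+VERB heuristic (Python; the function name and the toy verb list are illustrative assumptions, not the authors' actual implementation):

    def is_prefixed_verb(form, prefix, attested_verbs):
        """Accept a candidate only if it can be analyzed as PREFIX + VERB,
        with the VERB part independently attested in the corpus."""
        if not form.startswith(prefix):
            return False
        base = form[len(prefix):].lstrip("-")  # allow dashed forms like ri-cadere
        return base in attested_verbs

    # attested_verbs would be extracted from the POS-tagged, lemmatized corpus;
    # a toy set is used here for illustration
    attested_verbs = {"cadere", "fare", "vedere"}
    print(is_prefixed_verb("ricadere", "ri", attested_verbs))   # True
    print(is_prefixed_verb("ricciolo", "ri", attested_verbs))   # False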

  7-10. The problem with low frequency words
     ◮ Correct analysis of low frequency words is fundamental for measuring productivity and estimating LNRE models
     ◮ Automated tools tend to perform worst on low frequency forms:
       ◮ Statistical tools suffer from lack of relevant training data
       ◮ Manually crafted tools will probably lack the relevant resources
     ◮ Problems arise in both directions (under- and overestimation of hapax counts)
     ◮ Part of the more general "95% performance" problem

  11. Underestimation of hapaxes
     ◮ The Italian TreeTagger lemmatizer is lexicon-based; out-of-lexicon words (e.g., productively formed words containing a prefix) are lemmatized as UNKNOWN
     ◮ No prefixed word with a dash (ri-cadere) is in the lexicon
     ◮ Writers are more likely to use a dash to mark transparent morphological structure

  12. Productivity of ri- with and without an extended lexicon
     [Plot: expected vocabulary growth E[V(N)] against sample size N (up to 1,000,000 tokens), post-cleaning vs. pre-cleaning]

  13. Overestimation of hapaxes
     ◮ "Noise" generates hapax legomena
     ◮ The Italian TreeTagger thinks that dashed expressions containing pronoun-like strings are pronouns
     ◮ Dashed strings can be anything, including full sentences
     ◮ This creates a lot of pseudo-pronoun hapaxes: tu-tu, parapaponzi-ponzi-pò, altri-da-lui-simili-a-lui
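
One way such pseudo-pronoun hapaxes could be filtered out in a cleaning pass (a sketch assuming (form, tag) input pairs with made-up tag names; not the actual procedure used by the authors):

    from collections import Counter

    def drop_pseudo_pronoun_hapaxes(tagged_tokens):
        """Drop hapax 'pronouns' that contain a dash: dashed strings can be
        anything, so a dash plus frequency 1 suggests a tagging artifact."""
        freq = Counter(form for form, _ in tagged_tokens)
        return [
            (form, tag) for form, tag in tagged_tokens
            if not (tag == "PRO" and "-" in form and freq[form] == 1)
        ]

    tokens = [("tu-tu", "PRO"), ("lui", "PRO"), ("ri-cadere", "VER")]
    print(drop_pseudo_pronoun_hapaxes(tokens))
    # [('lui', 'PRO'), ('ri-cadere', 'VER')]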

  14. Productivity of the pronoun class before and after cleaning
     [Plot: expected vocabulary growth E[V(N)] against sample size N (up to 4,000,000 tokens), pre-cleaning vs. post-cleaning]

  15. P (and V) with/without correct post-processing
     ◮ With:
       class    | V    | V1  | N         | P
       ri-      | 1098 | 346 | 1,399,898 | 0.00025
       pronouns | 72   | 0   | 4,313,123 | 0
     ◮ Without:
       class    | V    | V1  | N         | P
       ri-      | 318  | 8   | 1,268,244 | 0.000006
       pronouns | 348  | 206 | 4,314,381 | 0.000048
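
The P column is consistent with the hapax ratio P = V1/N (Baayen's productivity index) in all four rows, e.g. 346 / 1,399,898 ≈ 0.00025:

    def productivity(v1, n):
        """Baayen-style productivity index: hapax legomena per token."""
        return v1 / n

    print(f"{productivity(346, 1_399_898):.5f}")  # 0.00025 (ri-, with cleaning)
    print(f"{productivity(8, 1_268_244):.6f}")    # 0.000006 (ri-, without cleaning)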

  16-17. A final word on pre-processing
     ◮ IT IS IMPORTANT
     ◮ It is often the major roadblock in lexical statistics investigations

  18. Outline
     ◮ Pre-Processing
     ◮ Non-Randomness
     ◮ The End

  19-22. Non-randomness
     ◮ LNRE modeling is based on the assumption that our corpora/datasets are random samples from the population
     ◮ This is obviously not the case
     ◮ Can we pretend that a corpus is random?
     ◮ What are the consequences of non-randomness?

  23. A Brown-sized random sample from a ZM population estimated with Brown
     [Plot: vocabulary growth V(N) against N, 0 to 1,000,000 tokens]
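
In outline, the simulation behind this plot: fit a Zipf-Mandelbrot (ZM) model to Brown, draw a random sample of the same size from it, and track V(N) as the sample grows. A minimal sketch of the sampling step (the parameters a, b and the vocabulary cutoff are illustrative stand-ins, not the values actually estimated from Brown):

    import numpy as np

    def zm_vocab_growth(n_tokens, n_types=100_000, a=2.0, b=1.0, seed=42):
        """Draw n_tokens from a Zipf-Mandelbrot population with
        p_i proportional to (i + b)^(-a), recording V(N) every 10,000 tokens."""
        rng = np.random.default_rng(seed)
        ranks = np.arange(1, n_types + 1)
        p = (ranks + b) ** (-a)
        p /= p.sum()
        sample = rng.choice(n_types, size=n_tokens, p=p)
        seen, growth = set(), []
        for i, token in enumerate(sample, 1):
            seen.add(token)
            if i % 10_000 == 0:
                growth.append((i, len(seen)))  # (N, V(N))
        return growth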

  24. The real Brown
     [Plot: observed vocabulary growth V(N) against N, 0 to 1,000,000 tokens]

  25-27. Where does non-randomness come from?
     ◮ Syntax?
       ◮ Under random sampling, the the should be the most frequent English bigram
       ◮ If the problem is due to syntax, randomizing by sentence will not get rid of it (Baayen 2001, ch. 5)

  28. The Brown randomized by sentence
     [Plot: vocabulary growth V(N) against N for the sentence-randomized Brown, 0 to 1,000,000 tokens]
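
The randomization behind this plot is simple to sketch: shuffle the corpus at the sentence level and recompute the growth curve on the resulting token stream (the input is assumed to be a list of sentences, each a list of tokens):

    import random

    def randomize_by_sentence(sentences, seed=0):
        """Shuffle sentence order; word order inside each sentence is kept,
        so short-span syntactic constraints survive the randomization."""
        shuffled = list(sentences)
        random.Random(seed).shuffle(shuffled)
        return [token for sent in shuffled for token in sent]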

  29-30. Where does non-randomness come from?
     ◮ Not syntax (syntax has a short-span effect; the counts for 10k intervals are OK)
     ◮ Underdispersion of content-rich words
       ◮ The chance of two Noriegas is closer to p/2 than p² (Church 2000)
       ◮ diethylstilbestrol occurs 3 times in Brown, all in the same document (recommendations on feed additives)
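
Underdispersion can be checked directly by comparing a word's corpus frequency with its document frequency (a sketch; documents are assumed to be lists of tokens). For diethylstilbestrol in Brown this would give frequency 3 but document count 1:

    def dispersion(word, documents):
        """Return (corpus frequency, number of documents containing the word).
        Underdispersed content words occur in far fewer documents than
        random sampling would predict from their overall frequency."""
        freq = sum(doc.count(word) for doc in documents)
        doc_freq = sum(1 for doc in documents if word in doc)
        return freq, doc_freq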
