The zipfR library: Words and other rare events in R
Stefan Evert & Marco Baroni
University of Osnabrück, Germany (stefan.evert@uos.de)
University of Bologna, Forlì, Italy (baroni@sslmit.unibo.it)
useR! 2006, Vienna, 15 June 2006

Outline
◮ (Computational) linguistics
◮ Statistical inference in (computational) linguistics
◮ Zipf’s law and the LNRE problem
◮ LNRE models for linguistic populations
◮ Model estimation: The frequency spectrum
◮ The zipfR library
◮ Extrapolation of VGCs
◮ Further work
◮ Availability

What is (computational) linguistics?
The science of linguistics is concerned with . . .
◮ natural language as a formal system (phonology, morphology, syntax, semantics, etc.)
◮ human language production and understanding, including the acquisition of language competence
Computational linguistics . . .
◮ applies computers and electronic resources to linguistic research questions
◮ makes use of linguistic insights to build automatic natural language processing (NLP) systems

Corpora in (computational) linguistics
◮ increasing focus on language use and empirical evidence in recent years
◮ based on corpora = (usually large) machine-readable samples of naturally occurring language
◮ some applications of corpus data
  ◮ test hypotheses about formal system of language
  ◮ validation of linguists’ introspective judgements
  ◮ observable result of human language production
  ◮ model for linguistic experience of human speaker
  ◮ training data for statistical NLP applications
◮ corpus = sample ➜ need for statistical analysis
  ◮ standard methodologies are being established
  ◮ random sample assumption is controversial for most corpora ➜ statistical inference may be unreliable
  ☞ ongoing research into appropriate statistical models

Statistical inference from corpus data
◮ only observable data are corpus frequencies
◮ commonly used terminology: types vs. tokens
  ◮ tokens can be running words, sentences in a text, instances of syntactic constructions, documents, etc.
  ◮ categorization into fixed or open-ended set of types: distinct word forms or lemmas, parts of speech, etc.
◮ of central interest are type frequencies $f(\omega)$
◮ corpus is interpreted as a random sample of tokens ➜ inferences about type probabilities $\pi_\omega$ from $f(\omega)$
◮ linguistic populations are characterized by . . .
  1. finite or countably infinite set of types $\omega$
  2. type probabilities $\pi_\omega$
  ➥ multinomial distribution of observed frequencies
◮ confidence intervals or Bayesian estimates
◮ comparison of type probabilities ($H_0: \pi_1 = \pi_2$)
◮ statistical associations

A characteristic problem: Zipf’s law
◮ linguistic population is usually characterized by a very large or even infinite number of type probabilities
◮ in addition, substantial portion of probability mass is distributed over very infrequent types (≠ normal dist.)
◮ referred to as the LNRE property (Khmaladze 1987) (large number of rare events)
◮ popularly known as Zipf’s law, based on the Zipf-Mandelbrot law for type probabilities $\pi_k = \pi_{w_k}$:
  $$\pi_k \approx \frac{C}{(k + b)^a}$$
  where $b > 0$ and $a > 1$ is usually close to 1
◮ Zipf ranking: $\pi_1 \ge \pi_2 \ge \pi_3 \ge \ldots$
◮ see e.g. Baayen (2001, 101) for Zipf-Mandelbrot law
◮ can be derived from Markov process (Rouault 1978)

Consequences of Zipf’s law
◮ most types occur just once in a sample (hapax legomena) or not at all (out-of-vocabulary, OOV)
◮ hypothesis tests, confidence intervals and Bayesian estimates (for uniform or beta priors) will be inaccurate

  Imagine a population with 500 highly frequent types ($\pi = 10^{-3}$) and 500,000 rare types ($\pi = 10^{-6}$). In a sample of size $N = 1000$ there will be approx. 500 of the rare types among the hapax legomena, but the p-value for each individual occurrence is $p < .001$ (binomial test).

◮ estimators can also be highly biased if unseen types (OOV) are not taken into account

LNRE models
◮ we need a population model for the distribution of type probabilities ➜ LNRE model (Baayen 2001)
◮ such LNRE models have a wide range of applications
  ◮ analyze accuracy of hypothesis tests and confidence interval estimates (Evert 2004b, Ch. 4)
  ◮ better prior distributions for Bayesian estimates
  ◮ estimate population vocabulary size (number of types), e.g. in authorship attribution (Thisted and Efron 1987), stylometry, or early diagnosis of Alzheimer’s disease (Garrard et al. 2005)
  ◮ extrapolate vocabulary growth, e.g. to estimate proportion of OOV types in large amounts of text, or the proportion of typos on the Web
  ◮ extrapolate proportion of hapaxes for measuring morphological productivity in word formation (Baayen 2003; Lüdeling and Evert 2003)
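
The numbers in the example on the "Consequences of Zipf’s law" slide can be checked with a few lines of Python. This is only a sketch: it assumes that each type's frequency is binomial with parameters N and π, and it reads the "binomial test" as the one-sided probability of seeing a type at least once when its true probability is 10^-6. The function and variable names are illustrative and do not come from the slides or from zipfR.

```python
import math

# Population from the slide: 500 frequent types and 500,000 rare types.
N = 1000                          # sample size (tokens)
n_freq, p_freq = 500, 1e-3        # highly frequent types
n_rare, p_rare = 500_000, 1e-6    # rare types


def prob_exactly_one(n, p):
    """P(f = 1) for a type with probability p in a sample of n tokens."""
    return n * p * (1.0 - p) ** (n - 1)


def prob_at_least_one(n, p):
    """P(f >= 1), computed via log1p/expm1 to stay accurate for tiny p."""
    return -math.expm1(n * math.log1p(-p))


# The type probabilities should sum to 1 (0.5 + 0.5).
total_mass = n_freq * p_freq + n_rare * p_rare

# Expected number of rare types that occur exactly once (hapax legomena).
expected_rare_hapaxes = n_rare * prob_exactly_one(N, p_rare)

# One-sided p-value for a single observed occurrence of a rare type
# under H0: pi = 1e-6 (assumed reading of the slide's binomial test).
p_value = prob_at_least_one(N, p_rare)

print(f"total probability mass: {total_mass:.6f}")
print(f"expected rare hapaxes:  {expected_rare_hapaxes:.1f}")
print(f"p-value per occurrence: {p_value:.6f}")
```

Under these assumptions, every rare hapax would look "significant" at the .001 level even though its observed relative frequency (1/1000) overestimates the true probability by a factor of 1000, which is the point the slide makes about unreliable inference.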

LNRE models based on the Zipf-Mandelbrot law
◮ most widely-used LNRE models are based on the Zipf-Mandelbrot law
◮ rewrite Zipf-Mandelbrot equation as distribution function for type probabilities (as r.v.):
  $$F(\rho) := \sum_{\pi_k \le \rho} \pi_k$$
◮ $F$ is an increasing step function with range $[0, 1]$
◮ type distribution function $G$ is more useful:
  $$G(\rho) := \left| \{ \omega_k \mid \pi_k \ge \rho \} \right|$$
◮ $G$ is a decreasing step function
◮ for $\rho \to 0$, we have $G(\rho) \to S$ ($S$ = population vocabulary size, which may be infinite)
◮ can easily be specified for $\rho = \pi_k$

The Zipf-Mandelbrot LNRE model
Some simplifications . . .
◮ use Poisson sampling instead of multinomial distribution (not conditioned on sample size $N$)
◮ approximate step function $G(\rho)$ by continuous function with type density $g(\pi)$:
  $$G(\rho) \approx \int_\rho^\infty g(\pi) \, d\pi$$
➥ the Zipf-Mandelbrot (ZM) model (Evert 2004a)
  $$g(\pi) := \begin{cases} C \cdot \pi^{-\alpha - 1} & 0 \le \pi \le B \\ 0 & \text{otherwise} \end{cases}$$
◮ free parameters are $0 < \alpha < 1$ and $0 < B \le 1$
◮ relation to Zipf-Mandelbrot law: $\alpha = a^{-1}$

The Zipf-Mandelbrot LNRE model
[Figure, two panels with logarithmic x-axis from 1e−10 to 1e−02: "Type distribution (ZM model)", $G(\rho)$ [million types] against $\rho$; "Type density (ZM model)", $g(\pi)$ [log10-transformed, million types] against $\pi$]
◮ type density function of Zipf-Mandelbrot LNRE model
  $$g(\pi) = C \cdot \pi^{-\alpha - 1} \quad (0 \le \pi \le B)$$
  (densities in the images are log10-transformed)

The Zipf-Mandelbrot LNRE model
[Figure, two panels with logarithmic x-axis from 1e−10 to 1e−02: "Distribution function (ZM model)", $F(\rho)$ against $\rho$; "Probability density (ZM model)", $f(\pi)$ [log10-transformed] against $\pi$]
◮ corresponding p.d.f. for type probabilities
  $$f(\pi) = C \cdot \pi^{-\alpha} \quad (0 \le \pi \le B)$$
  (densities in the images are log10-transformed)
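
The ZM model's density definitions can be turned into closed-form expressions and checked numerically. The sketch below is not taken from the slides: it assumes that C is fixed by requiring the p.d.f. f(π) = C · π^(−α) to integrate to 1 over (0, B], which gives C = (1 − α) / B^(1 − α), and it then integrates g and f analytically. The function names and the parameter values α = 0.8, B = 0.01 are hypothetical and chosen only for illustration.

```python
def zm_constant(alpha, B):
    """Normalizing constant C, assuming f(pi) = C * pi**(-alpha) integrates to 1 on (0, B]."""
    return (1.0 - alpha) / B ** (1.0 - alpha)


def zm_type_density(pi, alpha, B):
    """Type density g(pi) = C * pi**(-alpha - 1) on (0, B], 0 otherwise."""
    if pi <= 0.0 or pi > B:
        return 0.0
    return zm_constant(alpha, B) * pi ** (-alpha - 1.0)


def zm_F(rho, alpha, B):
    """F(rho): probability mass of types with pi <= rho, i.e. the integral of pi * g(pi) from 0 to rho."""
    rho = min(max(rho, 0.0), B)
    return (rho / B) ** (1.0 - alpha)


def zm_G(rho, alpha, B):
    """G(rho): (continuous approximation of) the number of types with pi >= rho,
    i.e. the integral of g(pi) from rho to B. Requires rho > 0."""
    if rho >= B:
        return 0.0
    C = zm_constant(alpha, B)
    return C * (rho ** (-alpha) - B ** (-alpha)) / alpha


# Hypothetical parameter values, for illustration only.
alpha, B = 0.8, 0.01

for rho in (1e-8, 1e-6, 1e-4, 1e-2):
    print(f"rho={rho:.0e}  F={zm_F(rho, alpha, B):.4f}  G={zm_G(rho, alpha, B):.1f}")
```

Under this normalization, G(ρ) grows without bound as ρ → 0, so the population vocabulary size S is infinite for this model, while F(ρ) rises from 0 to 1 at ρ = B, in line with the qualitative shapes described for the step functions F and G.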