Statistical Natural Language Processing
Statistical models: learning, inference, estimation, prediction
Çağrı Çöltekin
University of Tübingen, Seminar für Sprachwissenschaft
Summer Semester 2017
Overview
• Many methods/tools we use in NLP can broadly be classified as statistical models
• Statistical models have a central role in ML and statistical data analysis
• We will go through an overview of statistical modeling in this lecture
Models in science and practice
Modeling is a basic activity in science and practice. A few examples:
• Galilean model of the solar system
• Bohr model of the atom
• Animal models in medicine
• Scale models of buildings, bridges, cars, …
• Econometric models
• Models of the atmosphere
What do we do with models?
• Inference: learn more about the reality being modeled
  – verify or compare hypotheses on the model
• Prediction: predict (future) events/behavior using the model
Models are not reality
    All models are wrong, some are useful.
    Box and Draper (1986, p. 424)
• All models make some (simplifying) assumptions that do not match with reality
• (Some) models are useful despite (or, sometimes, because of) these assumptions/simplifications
Statistical models
• Statistical models are mathematical models that take uncertainty into account
• Statistical models are models of data
• We express a statistical model in the form
    outcome = model prediction + error
• The ‘error’ or uncertainty is part of the model description
Parametric models
Most statistical models are described by a set of parameters w:
    y = f(x; w) + ϵ
where
• x is the input to the model
• y is the quantity or label assigned to a given input
• w is the parameter(s) of the model
• f(x; w) is the model’s estimate (ŷ) of y given the input x
• ϵ represents the uncertainty or noise that we cannot explain or account for (it may include additional parameters)
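As a concrete illustration (not from the original slides), the following Python sketch instantiates this scheme for a hypothetical linear f(x; w) = w0 + w1·x with Gaussian noise; all names and values are illustrative:

    # A minimal sketch of a parametric model y = f(x; w) + eps,
    # assuming a linear f and normally distributed noise.
    import random

    def f(x, w):
        """Deterministic part of the model: the prediction y-hat."""
        w0, w1 = w
        return w0 + w1 * x

    def simulate(x, w, sigma):
        """One observed outcome: model prediction plus random error."""
        eps = random.gauss(0.0, sigma)   # the part we cannot explain
        return f(x, w) + eps

    w = (2.0, 0.5)                       # example parameter values
    print(f(10, w))                      # prediction y-hat: 7.0
    print(simulate(10, w, sigma=1.0))    # prediction + noise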
Parametric models
    y = f(x; w) + ϵ
• In machine learning (and in this course), the focus is on prediction: given x, make accurate predictions of y
• In statistics, the focus is on inference (testing hypotheses or explaining the observed phenomena)
  – for example, does x have an effect on y?
• For both purposes, finding a good estimate of w is important
• For inference, the properties of ϵ (e.g., its distribution and variance) are important
What are good estimates / estimators?
Bias of an estimate is the difference between the value being estimated and the expected value of the estimate:
    B(ŵ) = E[ŵ] − w
• An unbiased estimator has 0 bias
Variance of an estimate is simply its variance: the expected value of the squared deviations from the mean estimate:
    var(ŵ) = E[(ŵ − E[ŵ])²]
We want low bias and low variance. But there is a trade-off: reducing one increases the other (low variance results in high bias).
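A small simulation (my own sketch, not from the slides) makes the bias definition tangible: the variance estimator that divides by n is biased, while dividing by n − 1 is unbiased. We approximate E[ŵ] by averaging over many simulated samples from N(0, 1), whose true variance is 1.0:

    import random

    def sample_var(xs, ddof):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

    random.seed(1)
    n, trials = 5, 100_000
    biased, unbiased = 0.0, 0.0
    for _ in range(trials):
        xs = [random.gauss(0, 1) for _ in range(n)]
        biased += sample_var(xs, ddof=0)      # divide by n
        unbiased += sample_var(xs, ddof=1)    # divide by n-1
    print(biased / trials)    # ~0.8 = (n-1)/n, i.e., biased low
    print(unbiased / trials)  # ~1.0, i.e., unbiased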
Estimating parameters: the Bayesian approach
Given the training data x, we find the posterior distribution
    p(w | x) = p(x | w) p(w) / p(x)
• The result, the posterior, is a distribution over the parameter(s)
• One can get a point estimate of w, for example, by calculating the expected value of the posterior
• The posterior distribution also contains the information on the uncertainty of the estimate
• A prior distribution is required for the estimation
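A hedged illustration (my example, not from the slides): for a Bernoulli model of coin flips, a Beta(a, b) prior is conjugate, so the posterior over the heads probability w is available in closed form, and its mean is a natural point estimate:

    # With a Beta(a, b) prior and k heads in n flips, the posterior
    # is Beta(a + k, b + n - k); its mean is (a + k) / (a + b + n).
    a, b = 1, 1                 # uniform prior over w
    data = [1, 0, 1, 1, 0, 1]   # toy coin-flip data
    k, n = sum(data), len(data)
    post_a, post_b = a + k, b + (n - k)
    posterior_mean = post_a / (post_a + post_b)   # E[w | x]
    print(posterior_mean)       # 0.625; cf. the MLE k/n = 0.667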
Estimating parameters: the frequentist approach
Maximum likelihood estimation (MLE)
Given the training data x, we find the value of w that maximizes the likelihood:
    ŵ = arg max_w p(x | w)
• The likelihood function L(w | x) = p(x | w) is a function of the parameters
• The problem becomes searching for the maximum value of a function
• Note that we cannot make probabilistic statements about w
• Uncertainty of the estimate is less straightforward
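To show MLE as search for the maximum of a function, here is a sketch (same toy coin data as above; the grid search stands in for a proper optimizer) that maximizes the Bernoulli log-likelihood numerically:

    import math

    data = [1, 0, 1, 1, 0, 1]

    def log_likelihood(w, xs):
        return sum(math.log(w if x == 1 else 1 - w) for x in xs)

    grid = [i / 1000 for i in range(1, 1000)]   # candidate values of w
    w_hat = max(grid, key=lambda w: log_likelihood(w, data))
    print(w_hat)   # ~0.667 = k/n, the closed-form Bernoulli MLE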
A simple example
Problem: We want to estimate the average number of characters in tweets.
Data: We have two data sets (samples):
small: x = (87, 101, 88, 45, 138)
  – the mean of the sample (x̄) is 91.8
  – the variance of the sample (sd²) is 1111.7 (sd = 33.34)
large: x = (87, 101, 88, 45, 138, 66, 79, 78, 140, 102)
  – x̄ = 92.4
  – sd² = 876.71 (sd = 29.61)
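These summary statistics can be verified with Python’s standard statistics module (which uses the n − 1 sample variance); a quick check, not part of the original slides:

    import statistics

    small = [87, 101, 88, 45, 138]
    large = [87, 101, 88, 45, 138, 66, 79, 78, 140, 102]
    for xs in (small, large):
        print(statistics.mean(xs),       # 91.8 and 92.4
              statistics.variance(xs),   # 1111.7 and 876.71...
              statistics.stdev(xs))      # 33.34... and 29.61...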
A simple example: the task
• We are interested in the mean of all tweets (a large population)
• We only have samples
• Questions:
  – Given a sample, what is the most likely population mean?
  – How certain is our estimate of the population mean?
A simple example: the model
    y = µ + ϵ    where ϵ ∼ N(0, σ²)
Equivalently, the model is
    y ∼ N(µ, σ²)
• The model is known as the mean/constant/intercept model
• It is related to well-known statistical tests such as the t-test (we won’t cover it here)
• We are normally interested in conditional models, models with predictors.
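A minimal simulation sketch of this constant model (the ‘true’ µ and σ below are illustrative assumptions, not values from the slides): draw observations y = µ + ϵ and note that the sample mean recovers µ as the sample grows.

    import random, statistics

    random.seed(0)
    mu, sigma = 92.0, 30.0            # illustrative 'true' values
    y = [mu + random.gauss(0, sigma) for _ in range(10_000)]
    print(statistics.mean(y))         # close to 92.0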
A simple example: Bayesian estimation / inference
We simply use Bayes’ formula:
    p(µ | x) = p(x | µ) p(µ) / p(x)
• With a normal prior, the posterior will also be normal, and can be calculated analytically
• With a vague prior (high variance/entropy), the posterior mean is (almost) the same as the mean of the data
• With a prior with lower variance, the posterior is between the prior and the data mean
• Posterior variance indicates the uncertainty of our estimate. With more data, we get a more certain estimate
A simple example: Bayesian estimation, vague prior, small sample
Prior: N(µ = 70, 1000²)    Likelihood: N(91.8, 33.34²)    Posterior: N(91.78, 14.91²)
[figure: prior, likelihood, and posterior densities plotted over the range 0–200]
A simple example: Bayesian estimation, vague prior, larger sample
Prior: N(µ = 70, 1000²)    Likelihood: N(92.4, 29.61²)    Posterior: N(92.39, 9.36²)
[figure: prior, likelihood, and posterior densities plotted over the range 0–200]
A simple example: Bayesian estimation, stronger prior, small sample
Prior: N(µ = 70, 50²)    Likelihood: N(91.8, 33.34²)    Posterior: N(90.02, 14.29²)
[figure: prior, likelihood, and posterior densities plotted over the range 0–200]
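The three posteriors above follow from the conjugate normal–normal update; a sketch of the computation (my reconstruction, taking σ to be the sample sd as the slides appear to do): precisions (1/variance) add, and the posterior mean is the precision-weighted average of the prior mean and the sample mean.

    import math

    def posterior(prior_mu, prior_sd, xbar, sd, n):
        prior_prec = 1 / prior_sd**2
        data_prec = n / sd**2
        post_var = 1 / (prior_prec + data_prec)
        post_mu = post_var * (prior_prec * prior_mu + data_prec * xbar)
        return post_mu, math.sqrt(post_var)

    print(posterior(70, 1000, 91.8, 33.34, 5))    # ~(91.8, 14.91)
    print(posterior(70, 1000, 92.4, 29.61, 10))   # ~(92.39, 9.36)
    print(posterior(70, 50, 91.8, 33.34, 5))      # ~(90.02, 14.29)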
A simple example: MLE estimation
    µ̂ = arg max_µ L(µ; x)
       = arg max_µ p(x | µ)
       = arg max_µ ∏_{x ∈ x} p(x | µ)
       = arg max_µ ∏_{x ∈ x} 1/(σ√(2π)) · e^(−(x−µ)²/(2σ²))
       = x̄
• For the 5-tweet sample: µ̂ = x̄ = 91.8 (cf. 91.78)
• For the 10-tweet sample: µ̂ = x̄ = 92.4 (cf. 92.39)
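A quick numeric sanity check of this derivation (a sketch of mine, using a grid search rather than the closed form): the normal log-likelihood of the 5-tweet sample peaks exactly at the sample mean.

    import math

    x = [87, 101, 88, 45, 138]
    sigma = 33.34   # treated as known, as in the example above

    def log_lik(mu):
        return sum(-0.5 * ((xi - mu) / sigma) ** 2
                   - math.log(sigma * math.sqrt(2 * math.pi)) for xi in x)

    grid = [i / 100 for i in range(8000, 10001)]   # mu in [80, 100]
    print(max(grid, key=log_lik))   # 91.8 = x-bar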