  1. Machine Learning for Computational Linguistics: Regression
     Çağrı Çöltekin
     Seminar für Sprachwissenschaft, University of Tübingen
     April 26/28, 2016

  2. Practical matters
     ▶ Course credits: 6 ECTS without term paper, 9 ECTS with term paper
     ▶ Homeworks & evaluation: for each homework, you get either 0 (not satisfactory or not submitted) or a grade in [6, 10] (satisfactory and on time)
     ▶ Please follow the instructions precisely!
     ▶ Late homeworks are not accepted

  3–5. Entropy of your random numbers
     [Table: frequencies of the random numbers 1–20 submitted by the class; the individual counts are not recoverable from this transcript]
     H(X) = −∑_x P(x) log₂ P(x) = 2.61
     If the data were really uniformly distributed: H(X) = log₂ 20 = 4.32
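
     Not part of the slides: a minimal Python sketch of the entropy computation above. The counts below are hypothetical (the real class frequencies are not in this transcript); with the actual data the first value should come out to about 2.61, and the uniform baseline over 20 values is log₂ 20 ≈ 4.32.

         import math

         def entropy(counts):
             """Entropy (in bits) of the empirical distribution given by counts."""
             total = sum(counts)
             probs = [c / total for c in counts if c > 0]   # skip zeros: 0 * log 0 = 0
             return -sum(p * math.log2(p) for p in probs)

         # hypothetical frequencies of the numbers 1..20
         counts = [1, 1, 1, 0, 0, 1, 6, 4, 2, 2, 2, 2, 3, 2, 0, 1, 3, 0, 3, 2]
         print(entropy(counts))   # entropy of these example counts (class data gave about 2.61)
         print(math.log2(20))     # entropy if the numbers were uniform: 4.32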

  6. Coding a four-letter alphabet
     letter   prob   code 1   code 2
     a        1/2    00       0
     b        1/4    01       10
     c        1/8    10       110
     d        1/8    11       111
     Average code length of a string under code 1: (1/2)·2 + (1/4)·2 + (1/8)·2 + (1/8)·2 = 2.0 bits
     Average code length of a string under code 2: (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75 bits = H
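
     Not from the slides: a short sketch computing the two averages above from the codeword lengths, assuming code 1 is the fixed-length code and code 2 the prefix code as laid out in the table.

         probs = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
         code1 = {'a': '00', 'b': '01', 'c': '10', 'd': '11'}    # fixed-length code
         code2 = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}   # prefix code

         def avg_length(code):
             """Expected codeword length in bits under the letter probabilities."""
             return sum(probs[letter] * len(word) for letter, word in code.items())

         print(avg_length(code1))   # 2.0 bits
         print(avg_length(code2))   # 1.75 bits, equal to the entropy H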

  7. Statistical inference and estimation
     ▶ Statistical inference is about making generalizations that go beyond the data at hand (training set, or experimental sample)
     ▶ In a typical scenario, we (implicitly) assume that a particular class of models describes the real-world process, and try to find the best model within the class of models
     ▶ In most cases, our models are parametrized: the model is defined by a set of parameters
     ▶ The task, then, becomes estimating the parameters from the training set such that the resulting model is useful for unseen instances

  8. Estimation of model parameters
     A typical statistical model can be formulated as
         y = f(x; w) + ε
     ▶ x is the input to the model
     ▶ y is the quantity or label assigned to a given input
     ▶ w is the parameter(s) of the model
     ▶ f(x; w) is the model's estimate of the output y given the input x, sometimes denoted ŷ
     ▶ ε represents the uncertainty or noise that we cannot explain or account for
     ▶ In machine learning, the focus is on correct prediction of y
     ▶ In statistics, the focus is on inference (testing hypotheses or explaining the observed phenomena)
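
     Not part of the slides: a tiny sketch, assuming a linear f(x; w) = w0 + w1·x with Gaussian noise, just to make the roles of x, y, w, ε and ŷ concrete.

         import numpy as np

         rng = np.random.default_rng(0)
         w = (2.0, 0.5)                       # hypothetical parameters w = (w0, w1)

         def f(x, w):
             """The model's estimate of y given x, i.e. y-hat."""
             return w[0] + w[1] * x

         x = rng.uniform(0, 10, size=20)      # inputs
         eps = rng.normal(0, 1.0, size=20)    # unexplained noise
         y = f(x, w) + eps                    # observed responses: y = f(x; w) + eps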

  9. Estimating parameters: Bayesian approach
     Given the training data X, we find the posterior distribution
         p(w | X) = p(X | w) p(w) / p(X)
     ▶ The result, the posterior, is a probability distribution over the parameter(s)
     ▶ One can get a point estimate of w, for example, by calculating the expected value of the distribution
     ▶ The posterior distribution also contains the information on the uncertainty of the estimate
     ▶ Prior information can be specified by the prior distribution

  10. Estimating parameters: frequentist approach
      Given the training data X, we find the value of w that maximizes the likelihood
          ŵ = argmax_w p(X | w)
      ▶ The likelihood function p(X | w), often denoted L(w | X), is the probability of the data given w for discrete variables, and the value of the probability density function for continuous variables
      ▶ The problem becomes searching for the maximum value of a function
      ▶ Note that we cannot make probabilistic statements about w
      ▶ Uncertainty of the estimate is less straightforward
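
      Not part of the slides: a minimal numerical illustration of "searching for the maximum of the likelihood", using the first tweet-length sample introduced on the next slide. Maximizing the Gaussian log-likelihood over μ recovers the sample mean.

          import numpy as np
          from scipy.optimize import minimize_scalar
          from scipy.stats import norm

          data = np.array([87, 101, 88, 45, 138])

          def neg_log_likelihood(mu, sigma=data.std(ddof=1)):
              """Negative log-likelihood of the data under N(mu, sigma^2)."""
              return -norm.logpdf(data, loc=mu, scale=sigma).sum()

          result = minimize_scalar(neg_log_likelihood, bounds=(0, 300), method='bounded')
          print(result.x)   # approx. 91.8, the sample mean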

  11. A simple example: estimation of the population mean
      We assume that the observed data comes from the model
          y = μ + ε,  where ε ∼ N(0, σ²)
      ▶ An example: let's assume we are estimating the average number of characters in Twitter messages. We will use two data sets:
      ▶ 87, 101, 88, 45, 138
        ▶ The mean of the sample (x̄) is 91.8
        ▶ The variance of the sample (sd²) is 1111.7 (sd = 33.34)
      ▶ 87, 101, 88, 45, 138, 66, 79, 78, 140, 102
        ▶ x̄ = 92.4
        ▶ sd² = 876.71 (sd = 29.61)
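
      Not from the slides: the sample statistics above can be reproduced directly (ddof=1 gives the sample variance used on the slide).

          import numpy as np

          sample5 = np.array([87, 101, 88, 45, 138])
          sample10 = np.array([87, 101, 88, 45, 138, 66, 79, 78, 140, 102])

          for sample in (sample5, sample10):
              print(sample.mean(), sample.var(ddof=1), sample.std(ddof=1))
              # 91.8 1111.7 33.34...   and   92.4 876.71... 29.61...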

  12. Estimating mean: Bayesian way
      We simply use Bayes' formula:
          p(μ | D) = p(D | μ) p(μ) / p(D)
      ▶ With a vague prior (high variance/entropy), the posterior mean is (almost) the same as the mean of the data
      ▶ With a prior with lower variance, the posterior is between the prior and the data mean
      ▶ The posterior variance indicates the uncertainty of our estimate. With more data, we get a more certain estimate

  13. Estimating mean: Bayesian way (vague prior, small sample)
      [Figure: prior, likelihood, and posterior densities over the mean]
      Prior: N(70, 1000)   Likelihood: N(91.8, 33.34)   Posterior: N(91.78, 14.91)

  14. Estimating mean: Bayesian way (vague prior, larger sample)
      [Figure: prior, likelihood, and posterior densities over the mean]
      Prior: N(70, 1000)   Likelihood: N(92.40, 29.61)   Posterior: N(92.39, 9.36)

  15. Estimating mean: Bayesian way (prior with lower variance, larger sample)
      [Figure: prior, likelihood, and posterior densities over the mean]
      Prior: N(70, 50)   Likelihood: N(92.40, 29.61)   Posterior: N(91.64, 9.20)
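
      The slides only show the resulting densities. Below is a minimal sketch (not from the slides) of the conjugate normal-normal update that produces these posteriors, assuming the noise standard deviation is fixed to the sample sd.

          import numpy as np

          def posterior(prior_mean, prior_sd, data, sigma):
              """Posterior (mean, sd) for the mean of a normal with known sigma."""
              data = np.asarray(data, dtype=float)
              n = len(data)
              prec = 1 / prior_sd**2 + n / sigma**2                              # posterior precision
              mean = (prior_mean / prior_sd**2 + data.sum() / sigma**2) / prec   # precision-weighted mean
              return mean, np.sqrt(1 / prec)

          sample5 = [87, 101, 88, 45, 138]
          sample10 = [87, 101, 88, 45, 138, 66, 79, 78, 140, 102]
          print(posterior(70, 1000, sample5, 33.34))   # approx. (91.78, 14.91), as on slide 13
          print(posterior(70, 50, sample10, 29.61))    # approx. (91.64, 9.20), as on slide 15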

  16. Estimating mean: frequentist way
      ▶ The MLE of the mean of the population is the mean of the sample
        ▶ For the 5-tweet sample: μ̂ = x̄ = 91.8
        ▶ For the 10-tweet sample: μ̂ = x̄ = 92.4
      ▶ We express the uncertainty in terms of the standard error of the mean (SE), which corresponds to the standard deviation of the means of (hypothetical) samples of the same size drawn from the same population:
            SE_x̄ = sd_x / √n
        ▶ For the 5-tweet sample: SE_x̄ = 33.34 / √5 = 14.91
        ▶ For the 10-tweet sample: SE_x̄ = 29.61 / √10 = 9.36
      ▶ A rough estimate for a 95% confidence interval is x̄ ± 2 SE_x̄
        ▶ For the 5-tweet sample: 91.8 ± 2 × 14.91 = [61.98, 121.62]
        ▶ For the 10-tweet sample: 92.4 ± 2 × 9.36 = [73.68, 111.12]
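
      Not from the slides: the standard errors and rough 95% intervals above, computed directly.

          import numpy as np

          for sample in ([87, 101, 88, 45, 138],
                         [87, 101, 88, 45, 138, 66, 79, 78, 140, 102]):
              x = np.array(sample, dtype=float)
              se = x.std(ddof=1) / np.sqrt(len(x))   # standard error of the mean
              print(x.mean(), se, (x.mean() - 2 * se, x.mean() + 2 * se))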

  17. Regression
      ▶ Regression is a supervised method for predicting the value of a continuous response variable based on a number of predictors
      ▶ We estimate the conditional expectation of the outcome variable given the predictor(s)
      ▶ If the outcome is a label, the problem is called classification. But the border between the two is often not that clear

  18–19. The linear equation: a reminder
          y = a + bx
      ▶ a (intercept) is where the line crosses the y axis
      ▶ b (slope) is the change in y as x is increased one unit
      ▶ What is the correlation between x and y for each line?
      [Figure: several example lines, including y = 2 + x/2 and y = 1 − x/2]
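
      Not part of the slides: a quick numerical check of the question above. For points lying exactly on a line with nonzero slope, the correlation is +1 or −1, depending only on the sign of the slope (it is undefined for a horizontal line). The intercepts and slopes below are illustrative values.

          import numpy as np

          x = np.linspace(-5, 5, 50)
          for a, b in [(2, 0.5), (1, -0.5)]:          # intercept, slope
              y = a + b * x
              print(a, b, np.corrcoef(x, y)[0, 1])    # +1.0 for positive slope, -1.0 for negative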
