Machine Learning for Computational Linguistics Some probability - PDF document

Machine Learning for Computational Linguistics Some probability distributions … Ç. Çöltekin, SfS / University of Tübingen April 19/21, 2016 5 / 48 Probability theory Information theory 0 0 1 0 0 Probability mass function Example: probabilities for sentence length in words Probability Sentence length 0.1 0.2 1 0 0 0 1 0 0 1 0 0 0 3 Adv Negative Positive Value Outcome Noun Verb Adj … 1 0 0 0 0 Value 1 2 3 A refresher on probability and information theory … …or 2 4 Examples: 6 C. Prob. 1 2 3 4 5 7 Length 8 9 10 11 Ç. Çöltekin, SfS / University of Tübingen April 19/21, 2016 Prob. 1.0 5 SfS / University of Tübingen 6 7 8 9 10 11 Ç. Çöltekin, April 19/21, 2016 0.5 6 / 48 Probability theory Some probability distributions Information theory Cumulative distribution function Cumulative Probability Sentence length Outcome 4 Random variables: mapping outcomes to real numbers April 19/21, 2016 0 the event is impossible 0.5 the event is as likely to happen (or happened) as it is not 1 the event is certain Axioms of probability states that Ç. Çöltekin, SfS / University of Tübingen 2 / 48 What is probability? Probability theory Some probability distributions Information theory Where do probabilities come from Axioms of probability does not specify how to assign probabilities to events. between 0 and 1 Information theory its relative frequency (in the limit) Information theory Çağrı Çöltekin University of Tübingen Seminar für Sprachwissenschaft April 19/21, 2016 Probability theory Some probability distributions Why probability theory? Some probability distributions because of Ç. Çöltekin, SfS / University of Tübingen April 19/21, 2016 1 / 48 Probability theory Two major (rival) ways of assigning probabilities to events are 7 / 48 belief Information theory Ç. Çöltekin, Note: not all of these are numbers on the sample space) of a trial to (a vector of) real numbers (a real valued function April 19/21, 2016 uncertainties 4 / 48 Probability theory Random variables Some probability distributions SfS / University of Tübingen Information theory Some probability distributions Probability theory SfS / University of Tübingen 3 / 48 April 19/21, 2016 Ç. Çöltekin, ▶ Probability theory studies uncertainty ▶ In machine learning we deal with problems with uncertainty, ▶ inherently stochasticity of some physical systems ▶ incomplete/inaccurate measurements ▶ incomplete modeling ▶ Probability is a measure of (un)certainty of an event ▶ We quantify the probability of an event with a number ▶ All possible outcomes of an trial (experiment or observation) ▶ Frequentist (objective) probabilities: probability of an event is is called the sample space ( Ω ) ▶ Bayesian (subjective) probabilities: probabilities are degrees of 1. P ( E ) ∈ R , P ( E ) ⩾ 0 2. P ( Ω ) = 1 3. For disjoint events E 1 and E 2 , P ( E 1 ∪ E 2 ) = P ( E 1 ) + P ( E 2 ) ▶ Continuous Example: frequency of a sound signal: 100 . 5 , 220 . 3 , 4321 . 3 … ▶ A random variable is a variable whose value is subject to ▶ Discrete ▶ A random variable is always a number ( ∈ R for our purposes) ▶ Number of words in a sentence: 2 , 5 , 10 , … ▶ Think of a random variable as mapping between the outcomes ▶ Whether a review is negative or positive: ▶ Example outcomes of uncertain experiments 0 1 ▶ height or weight of a person ▶ The POS tag of a word: ▶ length of a word randomly chosen from a corpus ▶ whether an email is spam or not ▶ the fjrst word of a book, or fjrst word uttered by a baby ▶ F X ( x ) = P ( X ⩽ x ) ▶ Probability mass function of a discrete random variable ( X ) maps every possible ( x ) value to its probability ( P ( X = x ) ). 0 . 16 0 . 16 x P ( X = x ) 0 . 18 0 . 34 0 . 21 0 . 55 0 . 155 0 . 19 0 . 74 0 . 185 0 . 10 0 . 85 0 . 210 0 . 07 0 . 91 0 . 194 0 . 04 0 . 95 0 . 102 0 . 02 0 . 97 0 . 066 0 . 01 0 . 99 0 . 039 0 . 01 0 . 99 0 . 023 0 . 00 1 . 00 0 . 012 1 2 3 4 5 6 7 8 9 10 11 0 . 005 1 2 3 4 5 6 7 8 9 10 11 0 . 004

Probability theory f Probability distribution of letters following probabilities Lett. a b c d e g Some probability distributions h Prob. Probability Letter Some probability distributions 0.2 e a Information theory Probability theory d Mode, median, mean, standard deviation functions Ç. Çöltekin, SfS / University of Tübingen April 19/21, 2016 11 / 48 Probability theory Some probability distributions Information theory Visualization on sentence length example 12 / 48 Probability Sentence length 0.1 0.2 mode = median = 3.0 Ç. Çöltekin, SfS / University of Tübingen April 19/21, 2016 h g Median and mode of a random variable SfS / University of Tübingen b c d e f g h Ç. Çöltekin, April 19/21, 2016 h 14 / 48 Probability theory Some probability distributions Information theory Expected values of joint distributions Ç. Çöltekin, SfS / University of Tübingen April 19/21, 2016 a g c Information theory b f Ç. Çöltekin, SfS / University of Tübingen April 19/21, 2016 13 / 48 Probability theory Some probability distributions Joint and marginal probability f Two random variables form a joint probability distribution . An examaple: consider the bigrams of letters in our earlier example. The joint distribution can be defjned with a table like the one below. a b c d e Mode is the value that occurs most often in the data. 0.1 Information theory random variable Probability Short divergence: Chebyshev’s inequality Information theory Some probability distributions Probability theory 9 / 48 April 19/21, 2016 SfS / University of Tübingen Ç. Çöltekin, Variance and standard deviation 0.11 Information theory Some probability distributions Probability theory 8 / 48 April 19/21, 2016 SfS / University of Tübingen Ç. Çöltekin, Expected value Information theory 0.25 15 / 48 0.04 Ç. Çöltekin, 0.0001 This also shows why standardizing values of random variables, 0.01 Some probability distributions Probability theory 10 / 48 April 19/21, 2016 SfS / University of Tübingen ▶ Expected value (mean) of a random variable X is, ▶ Variance of a random variable X is, n n ∑ ∑ Var ( X ) = σ 2 = P ( x i )( x i − µ ) 2 = E [ X 2 ] − ( E [ X ]) 2 E [ X ] = µ = P ( x i ) x i = P ( x 1 ) x 1 + P ( x 2 ) x 2 + . . . + P ( x n ) x n i = 1 i = 1 ▶ More generally, expected value of a function of X is ▶ It is a measure of spread, divergence from the central tendency n ▶ The square root of variance is called standard deviation ∑ E [ f ( X )] = P ( x i ) f ( x i ) � n i = 1 � ∑ � ( P ( x i ) x 2 ) σ = − µ 2 � i ▶ Expected value is an important measure of central tendency i = 1 ▶ Note: it is not the ‘most likely’ value ▶ Standard deviation is in the same units as the values of the ▶ Expected value is linear ▶ Variance is not linear: σ 2 X + Y ̸ = σ 2 X + σ 2 Y (neither the σ ) E [ aX + bY ] = aE [ X ] + bE [ Y ] For any probability distribution, and k > 1 , Median is the mid-point of a distribution. Median of a random variable is defjned as the number m that satisfjes P ( | x − µ | > kσ ) ⩽ 1 k 2 P ( X ⩽ m ) ⩾ 1 and P ( X ⩾ m ) ⩾ 1 2 2 ▶ Median of 1 , 4 , 5 , 8 , 10 is 5 Distance from µ 2σ 3σ 5σ 10σ 100σ ▶ Median of 1 , 4 , 5 , 7 , 8 , 10 is 6 ▶ Modes appear as peaks in probability mass (or density) z = x − µ σ ▶ Mode of 1 , 4 , 4 , 8 , 10 is 4 ▶ Modes of 1 , 4 , 4 , 8 , 9 , 9 are 4 and 9 makes sense (the normalized quantity is often called the z-score). ▶ We have a hypothetical language with 8 letters with the µ = 3 . 56 0 . 23 0 . 04 0 . 05 0 . 08 0 . 29 0 . 02 0 . 07 0 . 22 9 0 . 2 = σ 1 2 3 4 5 6 7 8 9 10 11 ∑ ∑ E [ f ( X , Y )] = P ( x , y ) f ( x , y ) x y ∑ ∑ µ X = E [ X ] = P ( x , y ) x x y ∑ ∑ µ Y = E [ Y ] = P ( x , y ) y 0 . 04 0 . 02 0 . 02 0 . 03 0 . 05 0 . 01 0 . 02 0 . 06 0 . 23 x y 0 . 01 0 . 00 0 . 00 0 . 00 0 . 01 0 . 00 0 . 00 0 . 01 0 . 04 We can simplify the notation by vector notation, for µ = ( µ x , µ y ) , 0 . 02 0 . 00 0 . 00 0 . 00 0 . 01 0 . 00 0 . 00 0 . 01 0 . 05 0 . 02 0 . 00 0 . 00 0 . 01 0 . 02 0 . 00 0 . 01 0 . 02 0 . 08 ∑ 0 . 06 0 . 02 0 . 01 0 . 03 0 . 08 0 . 01 0 . 01 0 . 07 0 . 29 µ = x P ( x ) 0 . 00 0 . 00 0 . 00 0 . 00 0 . 01 0 . 00 0 . 00 0 . 01 0 . 02 x ∈ XY 0 . 01 0 . 00 0 . 00 0 . 01 0 . 02 0 . 00 0 . 01 0 . 02 0 . 07 0 . 08 0 . 00 0 . 00 0 . 01 0 . 10 0 . 00 0 . 01 0 . 02 0 . 22 where vector x ranges over all possible combinations of the values of random variables X and Y . 0 . 23 0 . 04 0 . 05 0 . 08 0 . 29 0 . 02 0 . 07 0 . 22

Machine Learning for Computational Linguistics Some probability - PDF document

Machine Learning for Computational Linguistics Some probability distributions . ltekin, SfS / University of Tbingen April 19/21, 2016 5 / 48 Probability theory Information theory 0 0 1 0 0 Probability mass function Example:

One-Shot Learning: Language Acquisition for Machine SS16 Computational Linguistics for

Machine Learning for Computational Linguistics ar ltekin University of Tbingen

Machine Learning for Computational Linguistics A refresher on linear algebra ar ltekin

Machine Learning for Computational Linguistics May 3, 2016 regression non-parametric neighbors

Machine Learning for Computational Linguistics Autoencoders + deep learning summary ar

Foundations of Computational Linguistics man-machine communication in natural language R OLAND H

Foundations of Computational Linguistics man-machine communication in natural language R OLAND H

Machine Learning for Computational Linguistics Classifjcation ar ltekin University of

Machine Learning for Computational Linguistics May 24, 2016 . ltekin, label of an unknown

Foundations of Computational Linguistics man-machine communication in natural language R OLAND H

Understanding Machine Learning with Language and Tensors Jon Rawski Linguistics Department

Foundations of Computational Linguistics man-machine communication in natural language R OLAND H

Understanding Machine Learning with Language and Tensors Jon Rawski Linguistics Department

Machine Learning for Computational Linguistics Regression ar ltekin University of

Machine Learning for Computational Linguistics Distributed representations ar ltekin

Machine Learning for Computational Linguistics Recurrent neural networks (RNNs) ar

Overview CS 446 What is machine learning? Machine learning : study of computational

Subregular Complexity and Machine Learning Jeffrey Heinz Linguistics Department Institute for

Computational Learning Theory: Probably Approximately Correct (PAC) Learning Machine Learning 1

Topics in Computational Linguistics Learning to Paraphrase: An Unsupervised Approach Using

Topics in Computational Linguistics Topics in Computational Linguistics March 28, 2014 GIL,

Dictionaries Christian Chiarcos Applied Computational Linguistics Lab

Computational Learning Theory - HT 2018 formerly Advanced Machine Learning (HT 2017) Introduction

4CSLL5 Advanced Computational Linguistics Introduction Phrase Based Machine Trans Martin