Machine Learning Lecture 01-1: Basics of Probability Theory
Nevin L. Zhang
lzhang@cse.ust.hk
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Outline
1 Basic Concepts in Probability Theory
2 Interpretation of Probability
3 Univariate Probability Distributions
4 Multivariate Probability
  Bayes’ Theorem
5 Parameter Estimation
Random Experiments
Probability is associated with a random experiment: a process with uncertain outcomes. The experiment is often kept implicit.
In machine learning, we often assume that data are generated by a hypothetical process (or a model), and the task is to determine the structure and parameters of the model from the data.
Sample Space
Sample space (aka population) Ω: the set of possible outcomes of a random experiment.
Example: rolling two dice, where Ω = {(i, j) | i, j ∈ {1, . . . , 6}}.
The elements of a sample space are called outcomes.
Events
Event: a subset of the sample space.
Example: the event that the two results add to 4 is {(1, 3), (2, 2), (3, 1)}.
Probability Weight Function
A probability weight P(ω) is assigned to each outcome ω ∈ Ω.
In machine learning, we often need to determine the probability weights, or related parameters, from data. This task is called parameter learning.
Probability Measure
The probability P(E) of an event E is
$$P(E) = \sum_{\omega \in E} P(\omega)$$
A probability measure is a mapping from the set of events to [0, 1],
$$P : 2^{\Omega} \to [0, 1],$$
that satisfies Kolmogorov’s axioms:
1 P(Ω) = 1.
2 P(A) ≥ 0 for all A ⊆ Ω.
3 Additivity: P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅.
In a more advanced treatment of probability theory, we would start with the concept of a probability measure instead of probability weights.
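A minimal Python sketch of this definition (an illustration added here, not part of the lecture; the uniform weights assume fair dice): it computes P(E) by summing the probability weights of the outcomes in E, using the two-dice experiment.

```python
from itertools import product

# Two-dice experiment: all 36 outcomes, each with weight 1/36 (assumed fair dice).
sample_space = list(product(range(1, 7), repeat=2))
weights = {omega: 1 / 36 for omega in sample_space}

def prob(event):
    """P(E) = sum of P(omega) over the outcomes omega in E."""
    return sum(weights[omega] for omega in event)

event_sum_4 = [omega for omega in sample_space if sum(omega) == 4]
print(event_sum_4)        # [(1, 3), (2, 2), (3, 1)]
print(prob(event_sum_4))  # 0.0833... = 3/36
```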
Random Variables
A random variable is a function defined on the sample space.
Example: X = sum of the two results; X((2, 5)) = 7, X((3, 1)) = 4.
Why is it random? Because the underlying experiment is random.
Domain of a random variable: the set of all its possible values. Here Ω_X = {2, 3, . . . , 12}.
Random Variables and Events
A random variable X taking a specific value x is an event:
Ω_{X=x} = {ω ∈ Ω | X(ω) = x}
For example, Ω_{X=4} = {(1, 3), (2, 2), (3, 1)}.
Probability Mass Function (Distribution)
Probability mass function P(X): Ω_X → [0, 1], defined by
P(X = x) = P(Ω_{X=x})
For example, P(X = 4) = P({(1, 3), (2, 2), (3, 1)}) = 3/36.
If X is continuous, we have a density function p(X) instead.
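Continuing the two-dice illustration (again an addition to the notes, not part of them), the whole probability mass function of X can be obtained by pushing the uniform weights on Ω through X:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# P(X = x) for X = sum of two fair dice, computed from the outcome weights.
outcomes = list(product(range(1, 7), repeat=2))
counts = Counter(sum(omega) for omega in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in counts.items()}

print(pmf[4])             # 1/12, i.e. 3/36
print(sum(pmf.values()))  # 1, as required of a probability mass function
```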
Interpretation of Probability
Frequentist Interpretation
Probabilities are long-run relative frequencies.
Example: X is the result of a coin toss, Ω_X = {H, T}. P(X = H) = 1/2 means that the relative frequency of getting heads will almost surely approach 1/2 as the number of tosses goes to infinity.
This is justified by the Law of Large Numbers. Let X_i be the result of the i-th toss (1 for H, 0 for T). Then
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{2} \quad \text{with probability } 1.$$
The frequentist interpretation is meaningful only when the experiment can be repeated under the same conditions.
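A quick simulation (an illustration added here; the seed, sample size, and use of Python's random module are all assumptions) shows the running frequency of heads settling near 1/2:

```python
import random

random.seed(0)
n = 100_000
tosses = [random.randint(0, 1) for _ in range(n)]  # 1 = heads, 0 = tails

for k in (10, 100, 1_000, 100_000):
    freq = sum(tosses[:k]) / k
    print(f"after {k:>7} tosses: relative frequency of heads = {freq:.4f}")
```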
Bayesian Interpretation
Probabilities are logically consistent degrees of belief.
Applicable when the experiment is not repeatable; the probability depends on a person’s state of knowledge.
Example: “the probability that the Suez Canal is longer than the Panama Canal”. This does not make sense under the frequentist interpretation. For a subjectivist, it is a degree of belief based on one’s state of knowledge:
A primary school student: 0.5
Me: 0.8
A geographer: 1 or 0
Arguments such as the Dutch book argument are used to explain why one’s probability beliefs must satisfy Kolmogorov’s axioms.
Interpretations of Probability
Now both interpretations are accepted. In practice, subjective beliefs and statistical data complement each other.
We rely on subjective beliefs (prior probabilities) when data are scarce. As more and more data become available, we rely less and less on subjective beliefs.
Often, we also use prior probabilities to impose some bias on the kind of results we want from a machine learning algorithm.
The subjectivist interpretation makes concepts such as conditional independence easy to understand.
Univariate Probability Distributions
Binomial and Bernoulli Distributions
Suppose we toss a coin n times, and at each toss the probability of getting a head is θ. Let X be the number of heads. Then X follows the binomial distribution, written as X ∼ Bin(n, θ):
$$\mathrm{Bin}(X = k \mid n, \theta) = \begin{cases} \binom{n}{k}\, \theta^{k} (1 - \theta)^{n-k} & \text{if } 0 \le k \le n \\ 0 & \text{if } k < 0 \text{ or } k > n \end{cases}$$
If n = 1, then X follows the Bernoulli distribution, written as X ∼ Ber(θ):
$$\mathrm{Ber}(X = x \mid \theta) = \begin{cases} \theta & \text{if } x = 1 \\ 1 - \theta & \text{if } x = 0 \end{cases}$$
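As a sketch of how these pmfs can be evaluated in practice (the use of SciPy and the values n = 10, θ = 0.5 are assumptions for illustration, not prescribed by the slides):

```python
from scipy.stats import binom, bernoulli

n, theta = 10, 0.5
print(binom.pmf(4, n, theta))    # P(X = 4) for X ~ Bin(10, 0.5), about 0.205
print(bernoulli.pmf(1, theta))   # P(X = 1) for X ~ Ber(0.5) = 0.5

# The Bernoulli distribution is the n = 1 special case of the binomial:
print(binom.pmf(1, 1, theta))    # same value as the line above
```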
Multinomial Distribution
Suppose we toss a K-sided die n times, and at each toss the probability of getting result j is θ_j. Let θ = (θ_1, . . . , θ_K)^⊤. Let x = (x_1, . . . , x_K) be a random vector, where x_j is the number of times side j of the die occurs. Then x follows the multinomial distribution, written as x ∼ Multi(n, θ):
$$\mathrm{Multi}(\mathbf{x} \mid n, \boldsymbol{\theta}) = \binom{n}{x_1, \ldots, x_K} \prod_{j=1}^{K} \theta_j^{x_j},$$
where
$$\binom{n}{x_1, \ldots, x_K} = \frac{n!}{x_1! \cdots x_K!}$$
is the multinomial coefficient.
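For concreteness, a small check with SciPy's multinomial distribution (an illustrative assumption; the example uses a fair six-sided die thrown six times):

```python
from scipy.stats import multinomial

n, theta = 6, [1 / 6] * 6        # fair six-sided die, six throws
x = [1, 1, 1, 1, 1, 1]           # each side appears exactly once
print(multinomial.pmf(x, n, theta))  # 6! / 6**6, about 0.0154
```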
Categorical Distribution
In the previous slide, if n = 1, then x = (x_1, . . . , x_K) has one component equal to 1 and the others equal to 0. In other words, it is a one-hot vector. In this case, x follows the categorical distribution, written as x ∼ Cat(θ):
$$\mathrm{Cat}(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{j=1}^{K} \theta_j^{\mathbb{1}(x_j = 1)},$$
where 1(x_j = 1) is the indicator function, whose value is 1 when x_j = 1 and 0 otherwise.
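A direct NumPy transcription of this formula (the function name and example values are illustrative assumptions):

```python
import numpy as np

def cat_pmf(x, theta):
    """Cat(x | theta) = prod_j theta_j ** 1(x_j == 1), for a one-hot vector x."""
    x, theta = np.asarray(x), np.asarray(theta)
    return float(np.prod(theta ** (x == 1)))  # boolean exponent acts as the indicator

theta = np.array([0.2, 0.5, 0.3])
print(cat_pmf([0, 1, 0], theta))  # 0.5, the probability of category 2
```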
Gaussian (Normal) Distribution
The most widely used distribution in statistics and machine learning is the Gaussian, or normal, distribution. Its probability density is given by
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
Here µ = E[X] is the mean (and mode), and σ² = var[X] is the variance.
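A minimal sketch that evaluates this density directly (the function name and test values are assumptions added for illustration):

```python
import math

def normal_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) = exp(-(x - mu)**2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(normal_pdf(0.0, 0.0, 1.0))  # 0.3989..., the standard normal density at its mode
```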
Multivariate Probability
Joint Probability Mass Function
Recall that the probability mass function of a single random variable X is P(X): Ω_X → [0, 1] with P(X = x) = P(Ω_{X=x}).
Suppose there are n random variables X_1, X_2, . . . , X_n. A joint probability mass function P(X_1, X_2, . . . , X_n) over those random variables is a function defined on the Cartesian product of their state spaces,
$$P : \prod_{i=1}^{n} \Omega_{X_i} \to [0, 1],$$
given by
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(\Omega_{X_1 = x_1} \cap \Omega_{X_2 = x_2} \cap \cdots \cap \Omega_{X_n = x_n}).$$
Joint Probability Mass Function
Example:
Population: apartments in the Hong Kong rental market.
Random variables (of a randomly selected apartment):
Rent: {low (≤ 1k), medium ((1k, 2k]), upper medium ((2k, 4k]), high (> 4k)}
Type: {public, private, others}
Joint probability distribution P(Rent, Type):

                 public   private   others
low              .17      .01       .02
medium           .44      .03       .01
upper medium     .09      .07       .01
high             .00      .14       .10
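As a worked illustration (the array encoding below is an addition to the notes), storing the joint table lets us recover the marginal distributions by summing out one of the variables:

```python
import numpy as np

rent_levels = ["low", "medium", "upper medium", "high"]
types = ["public", "private", "others"]
joint = np.array([[0.17, 0.01, 0.02],   # P(Rent, Type): rows = Rent, columns = Type
                  [0.44, 0.03, 0.01],
                  [0.09, 0.07, 0.01],
                  [0.00, 0.14, 0.10]])

print(dict(zip(rent_levels, joint.sum(axis=1).round(2))))  # marginal P(Rent)
print(dict(zip(types, joint.sum(axis=0).round(2))))        # marginal P(Type)
```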
Multivariate Gaussian Distributions
For continuous variables, the most commonly used joint distribution is the multivariate Gaussian distribution N(µ, Σ):
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^{D} |\Sigma|}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$
D: dimensionality.
x: vector of D random variables, representing data.
µ: vector of means.
Σ: covariance matrix; |Σ| denotes the determinant of Σ.
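A NumPy sketch that evaluates this density by computing the quadratic form and normalizing constant directly (the example mean and covariance are assumptions; SciPy's multivariate_normal could serve as a cross-check):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) for a D-dimensional Gaussian."""
    D = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    quad = diff @ np.linalg.inv(Sigma) @ diff   # (x - mu)^T Sigma^{-1} (x - mu)
    return float(np.exp(-0.5 * quad) / norm_const)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
print(mvn_pdf(np.array([0.5, -0.2]), mu, Sigma))
# Cross-check, if SciPy is available:
# from scipy.stats import multivariate_normal
# print(multivariate_normal(mu, Sigma).pdf([0.5, -0.2]))
```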