Probability & Information Theory
Shan-Hung Wu <shwu@cs.nthu.edu.tw>
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


Covariance II
  Var(a x + b y) = a² Var(x) + b² Var(y) + 2ab Cov(x, y)  [Proof]
  Var(x + y) = Var(x) + Var(y) if x and y are independent
  Cov(a x + b, c y + d) = ac Cov(x, y)  [Proof]
  Cov(a x + b y, c w + d v) = ac Cov(x, w) + ad Cov(x, v) + bc Cov(y, w) + bd Cov(y, v)  [Proof]
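These identities are easy to sanity-check by simulation. Below is a minimal NumPy sketch (not part of the original slides; the constants and variable names are illustrative) that compares both sides of the first identity on correlated samples:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    a, b = 2.0, -3.0

    # Two correlated variables: y depends on x plus independent noise.
    x = rng.normal(0.0, 1.0, n)
    y = 0.5 * x + rng.normal(0.0, 2.0, n)

    lhs = np.var(a * x + b * y)
    rhs = (a**2 * np.var(x) + b**2 * np.var(y)
           + 2 * a * b * np.cov(x, y, bias=True)[0, 1])
    print(lhs, rhs)  # the two numbers agree up to floating-point error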

Outline
  1. Random Variables & Probability Distributions
  2. Multivariate & Derived Random Variables
  3. Bayes' Rule & Statistics
  4. Application: Principal Components Analysis
  5. Technical Details of Random Variables
  6. Common Probability Distributions
  7. Common Parametrizing Functions
  8. Information Theory
  9. Application: Decision Trees & Random Forest

Multivariate Random Variables I
  A multivariate random variable is denoted by x = [x_1, ···, x_d]⊤
  Normally, the x_i's (attributes, variables, or features) are dependent on each other
  P(x) is a joint distribution of x_1, ···, x_d
  The mean of x is defined as µ_x = E(x) = [µ_{x_1}, ···, µ_{x_d}]⊤
  The covariance matrix of x is defined as
    Σ_x = [ σ²_{x_1}      σ_{x_1,x_2}   ···  σ_{x_1,x_d}
            σ_{x_2,x_1}   σ²_{x_2}      ···  σ_{x_2,x_d}
               ⋮              ⋮          ⋱       ⋮
            σ_{x_d,x_1}   σ_{x_d,x_2}   ···  σ²_{x_d}   ]
  where σ_{x_i,x_j} = Cov(x_i, x_j) = E[(x_i − µ_{x_i})(x_j − µ_{x_j})] = E(x_i x_j) − µ_{x_i} µ_{x_j}
  Compactly, Σ_x = Cov(x) = E[(x − µ_x)(x − µ_x)⊤] = E(x x⊤) − µ_x µ_x⊤
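A small NumPy sketch (illustrative only; the mixing matrix A and sample size are made up) showing that the two forms of Σ_x coincide, and previewing the symmetry and positive-semidefiniteness claims on the next slide:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100_000, 3

    # Correlated d-dimensional samples: x = A u with independent u.
    A = np.array([[1.0, 0.0, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.2, 0.3, 1.0]])
    x = rng.normal(size=(n, d)) @ A.T

    mu = x.mean(axis=0)
    Sigma_a = (x - mu).T @ (x - mu) / n        # E[(x - mu_x)(x - mu_x)^T]
    Sigma_b = x.T @ x / n - np.outer(mu, mu)   # E(x x^T) - mu_x mu_x^T
    print(np.allclose(Sigma_a, Sigma_b))       # True: the two forms coincide

    # Sigma_x is symmetric and positive semidefinite (eigenvalues >= 0).
    print(np.allclose(Sigma_a, Sigma_a.T))
    print(np.linalg.eigvalsh(Sigma_a).min() >= -1e-12)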

Multivariate Random Variables II
  Σ_x is always symmetric
  Σ_x is always positive semidefinite  [Homework]
  Σ_x is nonsingular iff it is positive definite
  Σ_x being singular implies that x has either:
    deterministic/independent/non-linearly dependent attributes causing zero rows, or
    redundant attributes causing linear dependency between rows

Derived Random Variables
  Let y = f(x; w) = w⊤x be a random variable transformed from x
  µ_y = E(w⊤x) = w⊤E(x) = w⊤µ_x
  σ²_y = w⊤Σ_x w  [Homework]
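A quick numerical check of these two facts (a sketch with arbitrary numbers, not from the slides); note that with matching normalization the equalities hold exactly on the sample estimates:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=(100_000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                                  [1.0, 1.0, 0.0],
                                                  [0.5, 0.2, 0.3]])
    w = np.array([1.0, -2.0, 0.5])

    mu_x = x.mean(axis=0)
    Sigma_x = np.cov(x, rowvar=False, bias=True)

    y = x @ w                                      # y = w^T x for every sample
    print(np.isclose(y.mean(), w @ mu_x))          # mu_y    = w^T mu_x
    print(np.isclose(y.var(), w @ Sigma_x @ w))    # sigma_y^2 = w^T Sigma_x w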

Outline (repeated; next section: Bayes' Rule & Statistics)

What Does Pr(x = x) Mean?
  1. Bayesian probability: it is a degree of belief, or a qualitative level of certainty
  2. Frequentist probability: if we can draw samples of x, then the proportion of samples having the value x equals Pr(x = x)

Bayes' Rule
  P(y | x) = P(x | y) P(y) / P(x) = P(x | y) P(y) / Σ_y P(x | y = y) P(y = y)
  Bayes' rule is so important in statistics (and in ML as well) that each term has a name:
    posterior of y = (likelihood of y) × (prior of y) / evidence
  Why is it so important?
    E.g., a doctor diagnoses you as having a disease by letting x be the "symptom" and y be the "disease"
    P(x | y) and P(y) may be estimated from sample frequencies more easily
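A minimal sketch of the diagnosis example in plain Python. The numbers (1% prevalence, 90% likelihood of the symptom given the disease, 5% given no disease) are hypothetical, chosen only to illustrate how the posterior is computed:

    # Toy diagnosis example; all probabilities below are made up.
    p_disease = 0.01                    # prior P(y = disease)
    p_symptom_given_disease = 0.90      # likelihood P(x = symptom | y = disease)
    p_symptom_given_healthy = 0.05      # P(x = symptom | y = healthy)

    # Evidence P(x = symptom) = sum over y of P(x | y) P(y)
    p_symptom = (p_symptom_given_disease * p_disease
                 + p_symptom_given_healthy * (1 - p_disease))

    # Posterior P(y = disease | x = symptom) by Bayes' rule
    posterior = p_symptom_given_disease * p_disease / p_symptom
    print(posterior)   # ~0.154: the symptom alone is weak evidence of disease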

Point Estimation
  Point estimation is the attempt to estimate some fixed but unknown quantity θ of a random variable by using sample data
  Let {x^(1), ···, x^(n)} be a set of n independent and identically distributed (i.i.d.) samples of a random variable x; a point estimator or statistic is a function of the data:
    θ̂_n = g(x^(1), ···, x^(n))
  θ̂_n is called the estimate of θ

Sample Mean and Covariance
  Given the i.i.d. samples X = [x^(1), ···, x^(n)]⊤ ∈ R^{n×d}, what are the estimates of the mean and covariance of x?
  A sample mean:
    µ̂_x = (1/n) Σ_{i=1}^n x^(i)
  A sample covariance matrix:
    Σ̂_x = (1/n) Σ_{i=1}^n (x^(i) − µ̂_x)(x^(i) − µ̂_x)⊤
    i.e., σ̂_{x_i,x_j} = (1/n) Σ_{s=1}^n (x_i^(s) − µ̂_{x_i})(x_j^(s) − µ̂_{x_j})
  If each x^(i) is centered (by subtracting µ̂_x first), then Σ̂_x = (1/n) X⊤X
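A short NumPy sketch (illustrative data, not from the slides) showing that the centered matrix form (1/n) X⊤X equals the explicit sum over samples:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 4))            # n = 1000 i.i.d. samples, d = 4

    mu_hat = X.mean(axis=0)                   # sample mean
    Xc = X - mu_hat                           # center every sample
    Sigma_hat = Xc.T @ Xc / X.shape[0]        # (1/n) X^T X on centered data

    # The same estimate written as an explicit sum over the samples:
    Sigma_sum = sum(np.outer(x - mu_hat, x - mu_hat) for x in X) / len(X)
    print(np.allclose(Sigma_hat, Sigma_sum))  # True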

Outline (repeated; next section: Application: Principal Components Analysis)

Principal Components Analysis (PCA) I
  Given a collection of data points X = {x^(i)}_{i=1}^N, where x^(i) ∈ R^D
  Suppose we want to lossily compress X, i.e., to find a function f such that f(x^(i)) = z^(i) ∈ R^K, where K < D
  How do we keep the maximum amount of information in X?

Principal Components Analysis (PCA) II
  Let the x^(i)'s be i.i.d. samples of a random variable x
  Let f be linear, i.e., f(x) = W⊤x for some W ∈ R^{D×K}
  Principal Components Analysis (PCA) finds K orthonormal vectors W = [w^(1), ···, w^(K)] such that the transformed variable z = W⊤x has the most "spread out" attributes, i.e., each attribute z_j = w^(j)⊤x has the maximum variance Var(z_j)
  w^(1), ···, w^(K) are called the principal components
  Why do w^(1), ···, w^(K) need to be orthogonal to each other?
    Each w^(j) keeps information that cannot be explained by the others, so together they preserve the most information
  Why ‖w^(j)‖ = 1 for all j?
    Only the directions matter; we don't want to maximize Var(z_j) simply by choosing a long w^(j)

Solving W I
  For simplicity, let's consider K = 1 first
  How do we evaluate Var(z_1)?
    Recall that z_1 = w^(1)⊤x implies σ²_{z_1} = w^(1)⊤ Σ_x w^(1)  [Homework]
  How do we get Σ_x?
    An estimate: Σ̂_x = (1/N) X⊤X (assuming the x^(i)'s are centered first)
  Optimization problem to solve:
    argmax_{w^(1) ∈ R^D} w^(1)⊤ X⊤X w^(1), subject to ‖w^(1)‖ = 1
  X⊤X is symmetric and thus can be eigendecomposed
  By the Rayleigh quotient, the optimal w^(1) is given by the eigenvector of X⊤X corresponding to the largest eigenvalue

Solving W II
  Optimization problem for w^(2):
    argmax_{w^(2) ∈ R^D} w^(2)⊤ X⊤X w^(2), subject to ‖w^(2)‖ = 1 and w^(2)⊤w^(1) = 0
  By the Rayleigh quotient again, w^(2) is the eigenvector corresponding to the 2nd largest eigenvalue
  For the general case where K > 1, w^(1), ···, w^(K) are the eigenvectors of X⊤X corresponding to the K largest eigenvalues
    Proof by induction  [Proof]
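A minimal PCA sketch along these lines (my own illustrative implementation, not the lecture's code): eigendecompose X⊤X of the centered data and keep the eigenvectors with the K largest eigenvalues. It also illustrates the claim in the visualization slide below that the covariance of z is diagonal in the new axes:

    import numpy as np

    def pca(X, K):
        """Top-K principal components of the rows of X (a minimal sketch)."""
        Xc = X - X.mean(axis=0)                        # center the data first
        eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)   # symmetric eigendecomposition
        order = np.argsort(eigvals)[::-1][:K]          # indices of the K largest eigenvalues
        W = eigvecs[:, order]                          # D x K, columns are w^(1), ..., w^(K)
        return W, Xc @ W                               # components and projections z = W^T x

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0],
                                              [1.0, 0.5]])
    W, Z = pca(X, K=2)

    # In the new axes, the estimated covariance matrix of z is (numerically) diagonal.
    print(np.round(np.cov(Z, rowvar=False, bias=True), 3))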

Visualization
  Figure: PCA learns a linear projection that aligns the direction of greatest variance with the axes of the new space. With these new axes, the estimated covariance matrix Σ̂_z = W⊤Σ̂_x W ∈ R^{K×K} is always diagonal.

Outline (repeated; next section: Technical Details of Random Variables)

Sure and Almost Sure Events
  Given a continuous random variable x, we have Pr(x = x) = 0 for any value x
  Will the event x = x occur? Yes!
  An event A happens surely if it always occurs
  An event A happens almost surely if Pr(A) = 1 (e.g., Pr(x ≠ x) = 1)

Equality of Random Variables I
  Definition (Equality in Distribution): two random variables x and y are equal in distribution iff Pr(x ≤ a) = Pr(y ≤ a) for all a.
  Definition (Almost Sure Equality): two random variables x and y are equal almost surely iff Pr(x = y) = 1.
  Definition (Equality): two random variables x and y are equal iff they map the same events to the same values.

Equality of Random Variables II
  What's the difference between "equality in distribution" and "almost sure equality"?
  Almost sure equality implies equality in distribution, but the converse is not true
  E.g., let x and y be independent binary random variables with P_x(0) = P_x(1) = P_y(0) = P_y(1) = 0.5
    They are equal in distribution
    But Pr(x = y) = 0.5 ≠ 1
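The coin example is easy to see by simulation; a small sketch (illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(5)
    n = 1_000_000
    x = rng.integers(0, 2, n)    # fair binary variable
    y = rng.integers(0, 2, n)    # an independent variable with the same distribution

    print(x.mean(), y.mean())    # both ~0.5: equal in distribution
    print((x == y).mean())       # ~0.5, not 1: not almost surely equal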

Convergence of Random Variables I
  Definition (Convergence in Distribution): a sequence of random variables {x^(1), x^(2), ···} converges in distribution to x iff lim_{n→∞} P(x^(n) = x) = P(x = x).
  Definition (Convergence in Probability): a sequence of random variables {x^(1), x^(2), ···} converges in probability to x iff, for any ε > 0, lim_{n→∞} Pr(|x^(n) − x| < ε) = 1.
  Definition (Almost Sure Convergence): a sequence of random variables {x^(1), x^(2), ···} converges almost surely to x iff Pr(lim_{n→∞} x^(n) = x) = 1.

Convergence of Random Variables II
  What's the difference between convergence "in probability" and "almost surely"?
  Almost sure convergence implies convergence in probability, but the converse is not true
  lim_{n→∞} Pr(|x^(n) − x| < ε) = 1 leaves open the possibility that |x^(n) − x| > ε happens an infinite number of times
  Pr(lim_{n→∞} x^(n) = x) = 1 guarantees that, almost surely, |x^(n) − x| > ε occurs only a finite number of times
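Convergence in probability can be illustrated numerically. The sketch below (my own example, not from the slides) takes x^(n) to be the mean of n fair coin flips, which converges to 0.5, and estimates Pr(|x^(n) − 0.5| < ε) over repeated trials:

    import numpy as np

    rng = np.random.default_rng(6)
    eps, trials = 0.01, 2000

    # x^(n) = sample mean of n fair coin flips; it converges in probability to 0.5.
    for n in (100, 1_000, 10_000, 100_000):
        xn = rng.binomial(n, 0.5, size=trials) / n
        print(n, (np.abs(xn - 0.5) < eps).mean())   # Pr(|x^(n) - 0.5| < eps) -> 1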

Distribution of Derived Variables I
  Suppose y = f(x) and f⁻¹ exists; does P(y = y) = P(x = f⁻¹(y)) always hold?
  No, not when x and y are continuous:
    Suppose x ~ Uniform(0, 1) is continuous, with p(x) = c for x ∈ (0, 1)
    Let y = x/2 ~ Uniform(0, 1/2)
    If p_y(y) = p_x(2y), then ∫_{y=0}^{1/2} p_y(y) dy = ∫_{y=0}^{1/2} c dy = 1/2 ≠ 1
    This violates the axioms of probability

Distribution of Derived Variables II
  Recall that Pr(y = y) = p_y(y) dy and Pr(x = x) = p_x(x) dx
  Since f may distort space, we need to ensure that |p_y(f(x)) dy| = |p_x(x) dx|
  We have
    p_y(y) = p_x(f⁻¹(y)) |∂f⁻¹(y)/∂y|   (or p_x(x) = p_y(f(x)) |∂f(x)/∂x|)
  In the previous example: p_y(y) = 2 · p_x(2y)
  In the multivariate case, we have
    p_y(y) = p_x(f⁻¹(y)) |det J(f⁻¹)(y)|,
  where J(f⁻¹)(y) is the Jacobian matrix of f⁻¹ at input y, with J(f⁻¹)(y)_{i,j} = ∂f⁻¹(y)_i / ∂y_j
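The y = x/2 example can be checked with a histogram. A minimal sketch (illustrative only): the estimated density of y on (0, 1/2) should be about 2, matching p_y(y) = 2 · p_x(2y).

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.uniform(0.0, 1.0, 1_000_000)   # p_x(x) = 1 on (0, 1)
    y = x / 2.0                            # f(x) = x/2, so f^{-1}(y) = 2y and |df^{-1}/dy| = 2

    # Estimate p_y on (0, 1/2); it is ~2 everywhere, matching 2 * p_x(2y).
    density, _ = np.histogram(y, bins=50, range=(0.0, 0.5), density=True)
    print(density.mean())                  # ~2.0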

Outline (repeated; next section: Common Probability Distributions)

Random Experiments
  The value of a random variable x can be thought of as the outcome of a random experiment
  This helps us define P(x)

Bernoulli Distribution (Discrete)
  Let x ∈ {0, 1} be the outcome of tossing a coin; we have:
    Bernoulli(x = x; ρ) = ρ if x = 1, and 1 − ρ otherwise; equivalently, ρ^x (1 − ρ)^{1−x}
  Properties:  [Proof]
    E(x) = ρ
    Var(x) = ρ(1 − ρ)

Categorical Distribution (Discrete)
  Let x ∈ {1, ···, k} be the outcome of rolling a k-sided die; we have:
    Categorical(x = x; ρ) = ∏_{i=1}^k ρ_i^{1(x = i)}, where 1⊤ρ = 1
  An extension of the Bernoulli distribution to k states

Multinomial Distribution (Discrete)
  Let x ∈ R^k be a random vector where x_i is the number of occurrences of outcome i after rolling a k-sided die n times:
    Multinomial(x = x; n, ρ) = (n! / (x_1! ··· x_k!)) ∏_{i=1}^k ρ_i^{x_i}, where 1⊤ρ = 1 and 1⊤x = n
  Properties:  [Proof]
    E(x) = nρ
    Var(x) = n (diag(ρ) − ρρ⊤)  (i.e., Var(x_i) = nρ_i(1 − ρ_i) and Cov(x_i, x_j) = −nρ_iρ_j)
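These moments (which cover the Bernoulli and categorical cases as special cases) can be verified empirically. A minimal NumPy sketch with made-up n and ρ:

    import numpy as np

    rng = np.random.default_rng(8)
    n, rho = 20, np.array([0.2, 0.3, 0.5])
    x = rng.multinomial(n, rho, size=200_000)        # each row sums to n

    print(x.mean(axis=0), n * rho)                   # sample mean vs. n * rho
    emp = np.cov(x, rowvar=False, bias=True)
    theory = n * (np.diag(rho) - np.outer(rho, rho)) # n (diag(rho) - rho rho^T)
    print(np.round(emp - theory, 2))                 # ~ all zeros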

Normal/Gaussian Distribution (Continuous)
  Theorem (Central Limit Theorem): the sum x of many independent random variables is approximately normally/Gaussian distributed:
    N(x = x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
  Holds regardless of the original distributions of the individual variables
  µ_x = µ and σ²_x = σ²
  To avoid inverting σ², we can parametrize the distribution using the precision β:
    N(x = x; µ, β⁻¹) = √(β/(2π)) exp(−β(x − µ)²/2)
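A small sketch of the central limit theorem in action (my own example: sums of 50 independent Uniform(0, 1) variables), written with the precision parametrization above. It also previews the confidence-interval slide below: roughly 95% of the samples fall within µ ± 2σ.

    import numpy as np

    rng = np.random.default_rng(9)

    # Sum of k = 50 independent Uniform(0, 1) variables, drawn 200,000 times.
    k = 50
    s = rng.uniform(0.0, 1.0, size=(200_000, k)).sum(axis=1)

    mu, var = k * 0.5, k / 12.0             # mean and variance of the sum
    beta, sigma = 1.0 / var, np.sqrt(var)   # precision parametrization

    # Compare the empirical density with N(mu, 1/beta) written via the precision.
    density, edges = np.histogram(s, bins=100, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    gauss = np.sqrt(beta / (2 * np.pi)) * np.exp(-beta * (centers - mu) ** 2 / 2)
    print(np.abs(density - gauss).max())    # small: the sum is approximately Gaussian

    # About 95% of the samples fall within [mu - 2*sigma, mu + 2*sigma].
    print(((s > mu - 2 * sigma) & (s < mu + 2 * sigma)).mean())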

Confidence Intervals
  Figure: graph of N(µ, σ²).
  We say the interval [µ − 2σ, µ + 2σ] has about 95% confidence

Why Is the Gaussian Distribution Common in ML?
  1. It can model noise in data (e.g., Gaussian white noise)
     Noise can be considered the accumulation of a large number of small, independent latent factors affecting the data collection process
  2. Out of all possible probability distributions (over real numbers) with the same variance, it encodes the maximum amount of uncertainty
     By assuming P(y | x) ~ N, we insert the least amount of prior knowledge into a model
  3. It is convenient for many analytical manipulations
     Closed under affine transformation, summation, marginalization, conditioning, etc.
     Many of the integrals involving Gaussian distributions that arise in practice have simple closed-form solutions
