Probability & Information Theory
Shan-Hung Wu <shwu@cs.nthu.edu.tw>
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


Covariance II
  Var(a x + b y) = a² Var(x) + b² Var(y) + 2ab Cov(x, y)  [Proof]
  Var(x + y) = Var(x) + Var(y) if x and y are independent
  Cov(a x + b, c y + d) = ac Cov(x, y)  [Proof]
  Cov(a x + b y, c w + d v) = ac Cov(x, w) + ad Cov(x, v) + bc Cov(y, w) + bd Cov(y, v)  [Proof]
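These identities are easy to sanity-check by simulation. Below is a minimal NumPy sketch (not part of the original slides; the constants and variable names are illustrative) that compares both sides of the first identity on correlated samples:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    a, b = 2.0, -3.0

    # Two correlated variables: y depends on x plus independent noise.
    x = rng.normal(0.0, 1.0, n)
    y = 0.5 * x + rng.normal(0.0, 2.0, n)

    lhs = np.var(a * x + b * y)
    rhs = (a**2 * np.var(x) + b**2 * np.var(y)
           + 2 * a * b * np.cov(x, y, bias=True)[0, 1])
    print(lhs, rhs)  # the two numbers agree up to floating-point error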

Outline
  1. Random Variables & Probability Distributions
  2. Multivariate & Derived Random Variables
  3. Bayes' Rule & Statistics
  4. Application: Principal Components Analysis
  5. Technical Details of Random Variables
  6. Common Probability Distributions
  7. Common Parametrizing Functions
  8. Information Theory
  9. Application: Decision Trees & Random Forest

Multivariate Random Variables I
  A multivariate random variable is denoted by x = [x_1, ···, x_d]⊤
  Normally, the x_i's (attributes, variables, or features) are dependent on each other
  P(x) is a joint distribution of x_1, ···, x_d
  The mean of x is defined as µ_x = E(x) = [µ_{x_1}, ···, µ_{x_d}]⊤
  The covariance matrix of x is defined as
    Σ_x = [ σ²_{x_1}      σ_{x_1,x_2}   ···  σ_{x_1,x_d}
            σ_{x_2,x_1}   σ²_{x_2}      ···  σ_{x_2,x_d}
               ⋮              ⋮          ⋱       ⋮
            σ_{x_d,x_1}   σ_{x_d,x_2}   ···  σ²_{x_d}   ]
  where σ_{x_i,x_j} = Cov(x_i, x_j) = E[(x_i − µ_{x_i})(x_j − µ_{x_j})] = E(x_i x_j) − µ_{x_i} µ_{x_j}
  Compactly, Σ_x = Cov(x) = E[(x − µ_x)(x − µ_x)⊤] = E(x x⊤) − µ_x µ_x⊤
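A small NumPy sketch (illustrative only; the mixing matrix A and sample size are made up) showing that the two forms of Σ_x coincide, and previewing the symmetry and positive-semidefiniteness claims on the next slide:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100_000, 3

    # Correlated d-dimensional samples: x = A u with independent u.
    A = np.array([[1.0, 0.0, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.2, 0.3, 1.0]])
    x = rng.normal(size=(n, d)) @ A.T

    mu = x.mean(axis=0)
    Sigma_a = (x - mu).T @ (x - mu) / n        # E[(x - mu_x)(x - mu_x)^T]
    Sigma_b = x.T @ x / n - np.outer(mu, mu)   # E(x x^T) - mu_x mu_x^T
    print(np.allclose(Sigma_a, Sigma_b))       # True: the two forms coincide

    # Sigma_x is symmetric and positive semidefinite (eigenvalues >= 0).
    print(np.allclose(Sigma_a, Sigma_a.T))
    print(np.linalg.eigvalsh(Sigma_a).min() >= -1e-12)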

Multivariate Random Variables II
  Σ_x is always symmetric
  Σ_x is always positive semidefinite  [Homework]
  Σ_x is nonsingular iff it is positive definite
  Σ_x being singular implies that x has either:
    deterministic/independent/non-linearly dependent attributes causing zero rows, or
    redundant attributes causing linear dependency between rows

Derived Random Variables
  Let y = f(x; w) = w⊤x be a random variable transformed from x
  µ_y = E(w⊤x) = w⊤E(x) = w⊤µ_x
  σ²_y = w⊤Σ_x w  [Homework]
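A quick numerical check of these two facts (a sketch with arbitrary numbers, not from the slides); note that with matching normalization the equalities hold exactly on the sample estimates:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=(100_000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                                  [1.0, 1.0, 0.0],
                                                  [0.5, 0.2, 0.3]])
    w = np.array([1.0, -2.0, 0.5])

    mu_x = x.mean(axis=0)
    Sigma_x = np.cov(x, rowvar=False, bias=True)

    y = x @ w                                      # y = w^T x for every sample
    print(np.isclose(y.mean(), w @ mu_x))          # mu_y    = w^T mu_x
    print(np.isclose(y.var(), w @ Sigma_x @ w))    # sigma_y^2 = w^T Sigma_x w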

Outline (repeated; next section: Bayes' Rule & Statistics)

What Does Pr(x = x) Mean?
  1. Bayesian probability: it is a degree of belief, or a qualitative level of certainty
  2. Frequentist probability: if we can draw samples of x, then the proportion of samples having the value x equals Pr(x = x)

Bayes' Rule
  P(y | x) = P(x | y) P(y) / P(x) = P(x | y) P(y) / Σ_y P(x | y = y) P(y = y)
  Bayes' rule is so important in statistics (and in ML as well) that each term has a name:
    posterior of y = (likelihood of y) × (prior of y) / evidence
  Why is it so important?
    E.g., a doctor diagnoses you as having a disease by letting x be the "symptom" and y be the "disease"
    P(x | y) and P(y) may be estimated from sample frequencies more easily
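A minimal sketch of the diagnosis example in plain Python. The numbers (1% prevalence, 90% likelihood of the symptom given the disease, 5% given no disease) are hypothetical, chosen only to illustrate how the posterior is computed:

    # Toy diagnosis example; all probabilities below are made up.
    p_disease = 0.01                    # prior P(y = disease)
    p_symptom_given_disease = 0.90      # likelihood P(x = symptom | y = disease)
    p_symptom_given_healthy = 0.05      # P(x = symptom | y = healthy)

    # Evidence P(x = symptom) = sum over y of P(x | y) P(y)
    p_symptom = (p_symptom_given_disease * p_disease
                 + p_symptom_given_healthy * (1 - p_disease))

    # Posterior P(y = disease | x = symptom) by Bayes' rule
    posterior = p_symptom_given_disease * p_disease / p_symptom
    print(posterior)   # ~0.154: the symptom alone is weak evidence of disease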

Point Estimation
  Point estimation is the attempt to estimate some fixed but unknown quantity θ of a random variable by using sample data
  Let {x^(1), ···, x^(n)} be a set of n independent and identically distributed (i.i.d.) samples of a random variable x; a point estimator or statistic is a function of the data:
    θ̂_n = g(x^(1), ···, x^(n))
  θ̂_n is called the estimate of θ

Sample Mean and Covariance
  Given the i.i.d. samples X = [x^(1), ···, x^(n)]⊤ ∈ R^{n×d}, what are the estimates of the mean and covariance of x?
  A sample mean:
    µ̂_x = (1/n) Σ_{i=1}^n x^(i)
  A sample covariance matrix:
    Σ̂_x = (1/n) Σ_{i=1}^n (x^(i) − µ̂_x)(x^(i) − µ̂_x)⊤
    i.e., σ̂_{x_i,x_j} = (1/n) Σ_{s=1}^n (x_i^(s) − µ̂_{x_i})(x_j^(s) − µ̂_{x_j})
  If each x^(i) is centered (by subtracting µ̂_x first), then Σ̂_x = (1/n) X⊤X
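A short NumPy sketch (illustrative data, not from the slides) showing that the centered matrix form (1/n) X⊤X equals the explicit sum over samples:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 4))            # n = 1000 i.i.d. samples, d = 4

    mu_hat = X.mean(axis=0)                   # sample mean
    Xc = X - mu_hat                           # center every sample
    Sigma_hat = Xc.T @ Xc / X.shape[0]        # (1/n) X^T X on centered data

    # The same estimate written as an explicit sum over the samples:
    Sigma_sum = sum(np.outer(x - mu_hat, x - mu_hat) for x in X) / len(X)
    print(np.allclose(Sigma_hat, Sigma_sum))  # True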

Outline (repeated; next section: Application: Principal Components Analysis)

Principal Components Analysis (PCA) I
  Given a collection of data points X = {x^(i)}_{i=1}^N, where x^(i) ∈ R^D
  Suppose we want to lossily compress X, i.e., to find a function f such that f(x^(i)) = z^(i) ∈ R^K, where K < D
  How do we keep the maximum amount of information in X?

Principal Components Analysis (PCA) II
  Let the x^(i)'s be i.i.d. samples of a random variable x
  Let f be linear, i.e., f(x) = W⊤x for some W ∈ R^{D×K}
  Principal Components Analysis (PCA) finds K orthonormal vectors W = [w^(1), ···, w^(K)] such that the transformed variable z = W⊤x has the most "spread out" attributes, i.e., each attribute z_j = w^(j)⊤x has the maximum variance Var(z_j)
  w^(1), ···, w^(K) are called the principal components
  Why do w^(1), ···, w^(K) need to be orthogonal to each other?
    Each w^(j) keeps information that cannot be explained by the others, so together they preserve the most information
  Why ‖w^(j)‖ = 1 for all j?
    Only the directions matter; we don't want to maximize Var(z_j) simply by choosing a long w^(j)

Solving W I
  For simplicity, let's consider K = 1 first
  How do we evaluate Var(z_1)?
    Recall that z_1 = w^(1)⊤x implies σ²_{z_1} = w^(1)⊤ Σ_x w^(1)  [Homework]
  How do we get Σ_x?
    An estimate: Σ̂_x = (1/N) X⊤X (assuming the x^(i)'s are centered first)
  Optimization problem to solve:
    argmax_{w^(1) ∈ R^D} w^(1)⊤ X⊤X w^(1), subject to ‖w^(1)‖ = 1
  X⊤X is symmetric and thus can be eigendecomposed
  By the Rayleigh quotient, the optimal w^(1) is given by the eigenvector of X⊤X corresponding to the largest eigenvalue

Solving W II
  Optimization problem for w^(2):
    argmax_{w^(2) ∈ R^D} w^(2)⊤ X⊤X w^(2), subject to ‖w^(2)‖ = 1 and w^(2)⊤w^(1) = 0
  By the Rayleigh quotient again, w^(2) is the eigenvector corresponding to the 2nd largest eigenvalue
  For the general case where K > 1, w^(1), ···, w^(K) are the eigenvectors of X⊤X corresponding to the K largest eigenvalues
    Proof by induction  [Proof]
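A minimal PCA sketch along these lines (my own illustrative implementation, not the lecture's code): eigendecompose X⊤X of the centered data and keep the eigenvectors with the K largest eigenvalues. It also illustrates the claim in the visualization slide below that the covariance of z is diagonal in the new axes:

    import numpy as np

    def pca(X, K):
        """Top-K principal components of the rows of X (a minimal sketch)."""
        Xc = X - X.mean(axis=0)                        # center the data first
        eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)   # symmetric eigendecomposition
        order = np.argsort(eigvals)[::-1][:K]          # indices of the K largest eigenvalues
        W = eigvecs[:, order]                          # D x K, columns are w^(1), ..., w^(K)
        return W, Xc @ W                               # components and projections z = W^T x

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0],
                                              [1.0, 0.5]])
    W, Z = pca(X, K=2)

    # In the new axes, the estimated covariance matrix of z is (numerically) diagonal.
    print(np.round(np.cov(Z, rowvar=False, bias=True), 3))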

Visualization
  Figure: PCA learns a linear projection that aligns the direction of greatest variance with the axes of the new space. With these new axes, the estimated covariance matrix Σ̂_z = W⊤Σ̂_x W ∈ R^{K×K} is always diagonal.

Outline (repeated; next section: Technical Details of Random Variables)

Sure and Almost Sure Events
  Given a continuous random variable x, we have Pr(x = x) = 0 for any value x
  Will the event x = x occur? Yes!
  An event A happens surely if it always occurs
  An event A happens almost surely if Pr(A) = 1 (e.g., Pr(x ≠ x) = 1)

Equality of Random Variables I
  Definition (Equality in Distribution): two random variables x and y are equal in distribution iff Pr(x ≤ a) = Pr(y ≤ a) for all a.
  Definition (Almost Sure Equality): two random variables x and y are equal almost surely iff Pr(x = y) = 1.
  Definition (Equality): two random variables x and y are equal iff they map the same events to the same values.

Equality of Random Variables II
  What's the difference between "equality in distribution" and "almost sure equality"?
  Almost sure equality implies equality in distribution, but the converse is not true
  E.g., let x and y be independent binary random variables with P_x(0) = P_x(1) = P_y(0) = P_y(1) = 0.5
    They are equal in distribution
    But Pr(x = y) = 0.5 ≠ 1
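The coin example is easy to see by simulation; a small sketch (illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(5)
    n = 1_000_000
    x = rng.integers(0, 2, n)    # fair binary variable
    y = rng.integers(0, 2, n)    # an independent variable with the same distribution

    print(x.mean(), y.mean())    # both ~0.5: equal in distribution
    print((x == y).mean())       # ~0.5, not 1: not almost surely equal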

Convergence of Random Variables I
  Definition (Convergence in Distribution): a sequence of random variables {x^(1), x^(2), ···} converges in distribution to x iff lim_{n→∞} P(x^(n) = x) = P(x = x).
  Definition (Convergence in Probability): a sequence of random variables {x^(1), x^(2), ···} converges in probability to x iff, for any ε > 0, lim_{n→∞} Pr(|x^(n) − x| < ε) = 1.
  Definition (Almost Sure Convergence): a sequence of random variables {x^(1), x^(2), ···} converges almost surely to x iff Pr(lim_{n→∞} x^(n) = x) = 1.

Convergence of Random Variables II
  What's the difference between convergence "in probability" and "almost surely"?
  Almost sure convergence implies convergence in probability, but the converse is not true
  lim_{n→∞} Pr(|x^(n) − x| < ε) = 1 leaves open the possibility that |x^(n) − x| > ε happens an infinite number of times
  Pr(lim_{n→∞} x^(n) = x) = 1 guarantees that, almost surely, |x^(n) − x| > ε occurs only a finite number of times
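Convergence in probability can be illustrated numerically. The sketch below (my own example, not from the slides) takes x^(n) to be the mean of n fair coin flips, which converges to 0.5, and estimates Pr(|x^(n) − 0.5| < ε) over repeated trials:

    import numpy as np

    rng = np.random.default_rng(6)
    eps, trials = 0.01, 2000

    # x^(n) = sample mean of n fair coin flips; it converges in probability to 0.5.
    for n in (100, 1_000, 10_000, 100_000):
        xn = rng.binomial(n, 0.5, size=trials) / n
        print(n, (np.abs(xn - 0.5) < eps).mean())   # Pr(|x^(n) - 0.5| < eps) -> 1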

Distribution of Derived Variables I
  Suppose y = f(x) and f⁻¹ exists; does P(y = y) = P(x = f⁻¹(y)) always hold?
  No, not when x and y are continuous:
    Suppose x ~ Uniform(0, 1) is continuous, with p(x) = c for x ∈ (0, 1)
    Let y = x/2 ~ Uniform(0, 1/2)
    If p_y(y) = p_x(2y), then ∫_{y=0}^{1/2} p_y(y) dy = ∫_{y=0}^{1/2} c dy = 1/2 ≠ 1
    This violates the axioms of probability

Distribution of Derived Variables II
  Recall that Pr(y = y) = p_y(y) dy and Pr(x = x) = p_x(x) dx
  Since f may distort space, we need to ensure that |p_y(f(x)) dy| = |p_x(x) dx|
  We have
    p_y(y) = p_x(f⁻¹(y)) |∂f⁻¹(y)/∂y|   (or p_x(x) = p_y(f(x)) |∂f(x)/∂x|)
  In the previous example: p_y(y) = 2 · p_x(2y)
  In the multivariate case, we have
    p_y(y) = p_x(f⁻¹(y)) |det J(f⁻¹)(y)|,
  where J(f⁻¹)(y) is the Jacobian matrix of f⁻¹ at input y, with J(f⁻¹)(y)_{i,j} = ∂f⁻¹(y)_i / ∂y_j
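The y = x/2 example can be checked with a histogram. A minimal sketch (illustrative only): the estimated density of y on (0, 1/2) should be about 2, matching p_y(y) = 2 · p_x(2y).

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.uniform(0.0, 1.0, 1_000_000)   # p_x(x) = 1 on (0, 1)
    y = x / 2.0                            # f(x) = x/2, so f^{-1}(y) = 2y and |df^{-1}/dy| = 2

    # Estimate p_y on (0, 1/2); it is ~2 everywhere, matching 2 * p_x(2y).
    density, _ = np.histogram(y, bins=50, range=(0.0, 0.5), density=True)
    print(density.mean())                  # ~2.0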

Outline (repeated; next section: Common Probability Distributions)

Random Experiments
  The value of a random variable x can be thought of as the outcome of a random experiment
  This helps us define P(x)

Bernoulli Distribution (Discrete)
  Let x ∈ {0, 1} be the outcome of tossing a coin; we have:
    Bernoulli(x = x; ρ) = ρ if x = 1, and 1 − ρ otherwise; equivalently, ρ^x (1 − ρ)^{1−x}
  Properties:  [Proof]
    E(x) = ρ
    Var(x) = ρ(1 − ρ)

Categorical Distribution (Discrete)
  Let x ∈ {1, ···, k} be the outcome of rolling a k-sided die; we have:
    Categorical(x = x; ρ) = ∏_{i=1}^k ρ_i^{1(x = i)}, where 1⊤ρ = 1
  An extension of the Bernoulli distribution to k states

Multinomial Distribution (Discrete)
  Let x ∈ R^k be a random vector where x_i is the number of occurrences of outcome i after rolling a k-sided die n times:
    Multinomial(x = x; n, ρ) = (n! / (x_1! ··· x_k!)) ∏_{i=1}^k ρ_i^{x_i}, where 1⊤ρ = 1 and 1⊤x = n
  Properties:  [Proof]
    E(x) = nρ
    Var(x) = n (diag(ρ) − ρρ⊤)  (i.e., Var(x_i) = nρ_i(1 − ρ_i) and Cov(x_i, x_j) = −nρ_iρ_j)
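These moments (which cover the Bernoulli and categorical cases as special cases) can be verified empirically. A minimal NumPy sketch with made-up n and ρ:

    import numpy as np

    rng = np.random.default_rng(8)
    n, rho = 20, np.array([0.2, 0.3, 0.5])
    x = rng.multinomial(n, rho, size=200_000)        # each row sums to n

    print(x.mean(axis=0), n * rho)                   # sample mean vs. n * rho
    emp = np.cov(x, rowvar=False, bias=True)
    theory = n * (np.diag(rho) - np.outer(rho, rho)) # n (diag(rho) - rho rho^T)
    print(np.round(emp - theory, 2))                 # ~ all zeros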

Normal/Gaussian Distribution (Continuous)
  Theorem (Central Limit Theorem): the sum x of many independent random variables is approximately normally/Gaussian distributed:
    N(x = x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
  Holds regardless of the original distributions of the individual variables
  µ_x = µ and σ²_x = σ²
  To avoid inverting σ², we can parametrize the distribution using the precision β:
    N(x = x; µ, β⁻¹) = √(β/(2π)) exp(−β(x − µ)²/2)
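A small sketch of the central limit theorem in action (my own example: sums of 50 independent Uniform(0, 1) variables), written with the precision parametrization above. It also previews the confidence-interval slide below: roughly 95% of the samples fall within µ ± 2σ.

    import numpy as np

    rng = np.random.default_rng(9)

    # Sum of k = 50 independent Uniform(0, 1) variables, drawn 200,000 times.
    k = 50
    s = rng.uniform(0.0, 1.0, size=(200_000, k)).sum(axis=1)

    mu, var = k * 0.5, k / 12.0             # mean and variance of the sum
    beta, sigma = 1.0 / var, np.sqrt(var)   # precision parametrization

    # Compare the empirical density with N(mu, 1/beta) written via the precision.
    density, edges = np.histogram(s, bins=100, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    gauss = np.sqrt(beta / (2 * np.pi)) * np.exp(-beta * (centers - mu) ** 2 / 2)
    print(np.abs(density - gauss).max())    # small: the sum is approximately Gaussian

    # About 95% of the samples fall within [mu - 2*sigma, mu + 2*sigma].
    print(((s > mu - 2 * sigma) & (s < mu + 2 * sigma)).mean())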

Confidence Intervals
  Figure: graph of N(µ, σ²).
  We say the interval [µ − 2σ, µ + 2σ] has about 95% confidence

Why Is the Gaussian Distribution Common in ML?
  1. It can model noise in data (e.g., Gaussian white noise)
     Noise can be considered the accumulation of a large number of small, independent latent factors affecting the data collection process
  2. Out of all possible probability distributions (over real numbers) with the same variance, it encodes the maximum amount of uncertainty
     By assuming P(y | x) ~ N, we insert the least amount of prior knowledge into a model
  3. It is convenient for many analytical manipulations
     Closed under affine transformation, summation, marginalization, conditioning, etc.
     Many of the integrals involving Gaussian distributions that arise in practice have simple closed-form solutions
