  1. 6.1 Gaussians (6 Bayesian Kernel Methods), Alexander Smola, Introduction to Machine Learning 10-701, http://alex.smola.org/teaching/10-701-15

  2. Normal Distribution http://www.gaussianprocess.org/gpml/chapters/

  3. The Normal Distribution

  4. Gaussians in Space

  5. Gaussians in Space: samples in $\mathbb{R}^2$

  6. The Normal Distribution
     • Density for scalar variables: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$
     • Density in $d$ dimensions: $p(x) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2}\, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)}$
     • Principal components: eigenvalue decomposition $\Sigma = U^\top \Lambda U$
     • Product representation: $p(x) = (2\pi)^{-d/2} \prod_{i=1}^{d} \Lambda_{ii}^{-1/2}\, e^{-\frac{1}{2}(U(x-\mu))^\top \Lambda^{-1} U(x-\mu)}$
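
     A runnable sketch of the two $d$-dimensional forms above, evaluating the density directly and through the eigendecomposition $\Sigma = U^\top \Lambda U$; the dimension, covariance, and test point are illustrative assumptions:
     % multivariate normal density: direct form vs. eigendecomposition form
     d = 3;
     mu = zeros(d, 1);
     A = randn(d); Sigma = A*A' + eye(d);          % some positive definite covariance
     x = randn(d, 1);
     p_direct = (2*pi)^(-d/2) * det(Sigma)^(-1/2) * ...
                exp(-0.5*(x - mu)'*inv(Sigma)*(x - mu));
     [V, Lambda] = eig(Sigma);                     % Sigma = V*Lambda*V'
     U = V';                                       % so that Sigma = U'*Lambda*U
     y = U*(x - mu);
     p_eig = (2*pi)^(-d/2) * prod(diag(Lambda))^(-1/2) * ...
             exp(-0.5*(y'*diag(1./diag(Lambda))*y));
     % p_direct and p_eig agree up to numerical error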

  7. Recall - Gaussian is in the Exponential Family
     • Binomial distribution: $\phi(x) = x$
     • Discrete distribution: $\phi(x) = e_x$ ($e_x$ is the unit vector for $x$)
     • Gaussian: $\phi(x) = \left(x, \tfrac{1}{2}xx^\top\right)$
     • Poisson (counting measure $1/x!$): $\phi(x) = x$
     • Dirichlet, Beta, Gamma, Wishart, ...

  8. Recall - Gaussian is in the Exponential Family
     • Binomial distribution: $\phi(x) = x$
     • Discrete distribution: $\phi(x) = e_x$ ($e_x$ is the unit vector for $x$)
     • Gaussian: $\phi(x) = \left(x, \tfrac{1}{2}xx^\top\right)$
     • Poisson (counting measure $1/x!$): $\phi(x) = x$
     • Dirichlet, Beta, Gamma, Wishart, ...
     $-\partial_\theta \log p(X; \theta) = m\left[\mathbf{E}[\phi(x)] - \frac{1}{m}\sum_{i=1}^{m} \phi(x_i)\right]$
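
     Applying this condition to the Gaussian, whose sufficient statistic is $\phi(x) = (x, \tfrac{1}{2}xx^\top)$, links it to the estimators two slides ahead (a short derivation added here for completeness):
     $\mathbf{E}[x] = \frac{1}{m}\sum_{i=1}^{m} x_i = \mu, \qquad \mathbf{E}[xx^\top] = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^\top = \Sigma + \mu\mu^\top \;\Longrightarrow\; \Sigma = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^\top - \mu\mu^\top$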

  9. The Normal Distribution: principal components
     $\Sigma = U^\top \Lambda U, \qquad p(x) = (2\pi)^{-d/2} \prod_{i=1}^{d} \Lambda_{ii}^{-1/2}\, e^{-\frac{1}{2}(U(x-\mu))^\top \Lambda^{-1} U(x-\mu)}$

  10. Why do we care?
     • Central limit theorem shows that in the limit all averages behave like Gaussians
     • Easy to estimate parameters (MLE): $\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$ and $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^\top - \mu\mu^\top$
     • Distribution with the largest uncertainty (entropy) for a given mean and covariance
     • Works well even if the assumptions are wrong

  11. Why do we care?
     • Central limit theorem shows that in the limit all averages behave like Gaussians
     • Easy to estimate parameters (MLE): $\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$ and $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^\top - \mu\mu^\top$
     In code, with X the data matrix (one sample per column) and m the sample size:
     mu = (1/m)*sum(X, 2)
     sigma = (1/m)*(X*X') - mu*mu'
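
     A quick sanity check of these two lines (a sketch with assumed ground-truth parameters; X holds one sample per column, as the sum(X, 2) call implies):
     % draw m samples from a known Gaussian and re-estimate its parameters
     d = 2; m = 100000;
     mu_true = [1; -2];
     L = [1 0; 0.5 0.8];                 % ground truth: Sigma_true = L*L'
     X = mu_true + L*randn(d, m);        % one sample per column
     mu = (1/m)*sum(X, 2)                % approaches mu_true
     sigma = (1/m)*(X*X') - mu*mu'       % approaches L*L'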

  12. Sampling from a Gaussian
     • Case 1: we have a standard normal generator (randn) and want $x \sim \mathcal{N}(\mu, \Sigma)$
     • Recipe: $x = \mu + Lz$ where $\Sigma = LL^\top$ and $z \sim \mathcal{N}(0, \mathbf{1})$
     • Proof: $\mathbf{E}\left[(x-\mu)(x-\mu)^\top\right] = \mathbf{E}\left[Lzz^\top L^\top\right] = L\,\mathbf{E}\left[zz^\top\right]L^\top = LL^\top = \Sigma$
     • Case 2: Box-Muller transform for U[0,1] samples:
     $p(x) = \frac{1}{2\pi} e^{-\frac{1}{2}\|x\|^2} \;\Rightarrow\; p(\phi, r) = \frac{r}{2\pi} e^{-\frac{1}{2}r^2}, \qquad F(\phi, r) = \frac{\phi}{2\pi}\left[1 - e^{-\frac{1}{2}r^2}\right]$
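
     Case 1 as a runnable sketch (the specific mu and Sigma are illustrative; chol in MATLAB/Octave returns an upper-triangular R with R'*R = Sigma, so L = R'):
     % sample n points from N(mu, Sigma) via x = mu + L*z
     mu = [0; 1];
     Sigma = [2 0.8; 0.8 1];
     L = chol(Sigma)';          % Sigma = L*L'
     n = 1000;
     Z = randn(2, n);           % z ~ N(0, 1)
     X = mu + L*Z;              % columns of X are samples from N(mu, Sigma)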

  13. Sampling from a Gaussian [figure: the 2-d Gaussian in polar coordinates $(\phi, r)$]
     $p(x) = \frac{1}{2\pi} e^{-\frac{1}{2}\|x\|^2} \;\Rightarrow\; p(\phi, r) = \frac{r}{2\pi} e^{-\frac{1}{2}r^2}, \qquad F(\phi, r) = \frac{\phi}{2\pi}\left[1 - e^{-\frac{1}{2}r^2}\right]$

  14. Sampling from a Gaussian
     • Cumulative distribution function: $F(\phi, r) = \frac{\phi}{2\pi}\left[1 - e^{-\frac{1}{2}r^2}\right]$
     • Draw the radial and angle components separately:
     tmp1 = rand()
     tmp2 = rand()
     r = sqrt(-2*log(tmp1))
     x1 = r*sin(2*pi*tmp2)
     x2 = r*cos(2*pi*tmp2)


  15. Sampling from a Gaussian
     • Cumulative distribution function: $F(\phi, r) = \frac{\phi}{2\pi}\left[1 - e^{-\frac{1}{2}r^2}\right]$
     • Draw the radial and angle components separately (why can we use tmp1 instead of 1-tmp1?):
     tmp1 = rand()
     tmp2 = rand()
     r = sqrt(-2*log(tmp1))
     x1 = r*sin(2*pi*tmp2)
     x2 = r*cos(2*pi*tmp2)
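
     A vectorized version of this recipe with a quick moment check (a sketch; it draws n pairs at once and verifies that both outputs have mean near 0 and variance near 1):
     % vectorized Box-Muller with a moment check
     n = 100000;
     u1 = rand(n, 1);
     u2 = rand(n, 1);
     r  = sqrt(-2*log(u1));
     x1 = r.*sin(2*pi*u2);
     x2 = r.*cos(2*pi*u2);
     [mean(x1), var(x1); mean(x2), var(x2)]    % each row is roughly [0, 1]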


  16. Principal Component Analysis

  17. Data Visualization
             H-WBC   H-RBC   H-Hgb   H-Hct   H-MCV   H-MCH   H-MCHC
       A1    8.0     4.82    14.1    41.0    85.0    29.0    34.0
       A2    7.3     5.02    14.7    43.0    86.0    29.0    34.0
       A3    4.3     4.48    14.1    41.0    91.0    32.0    35.0
       A4    7.5     4.47    14.9    45.0   101.0    33.0    33.0
       A5    7.3     5.52    15.4    46.0    84.0    28.0    33.0
       A6    6.9     4.86    16.0    47.0    97.0    33.0    34.0
       A7    7.8     4.68    14.7    43.0    92.0    31.0    34.0
       A8    8.6     4.82    15.8    42.0    88.0    33.0    37.0
       A9    5.1     4.71    14.0    43.0    92.0    30.0    32.0
     (rows: instances, columns: features)
     • 53 blood and urine samples (features) from 65 people (instances)
     • Difficult to see the correlations between the features

  18. Data Visualization
     [Plot: measurement value (0-1000) vs. measurement index (0-60), one curve per person]
     • Spectral format (65 curves, one for each person)
     • Difficult to compare different patients

  19. Data Visualization
     [Plot: H-Bands vs. person index (0-70)]
     One plot per person ...

  20. Data Visualization
     [Plots: bi-variate scatter of C-LDH vs. C-Triglycerides; tri-variate scatter of M-EPI vs. C-LDH and C-Triglycerides]
     Even 3 dimensions are already difficult. How to extend this?

  21. Compact Summaries via PCA
     $\text{minimize}_{\,\mathrm{rank}\, P = k}\; \frac{1}{m}\sum_{i=1}^{m} \|x_i - Px_i\|^2 \quad \text{where} \quad \frac{1}{m}\sum_{i=1}^{m} x_i = \mu$ (assume centering)
     $= \mathrm{tr}\!\left[\frac{1}{m}\sum_{i=1}^{m}\left(x_i x_i^\top - P x_i x_i^\top P^\top\right)\right] = \mathrm{tr}\,\Sigma - \mathrm{tr}\, P\Sigma P^\top \quad$ (maximize $\mathrm{tr}\, P\Sigma P^\top$)

  22. Compact Summaries via PCA
     • Is there a representation better than the coordinate axes?
     • Is it really necessary to show all the 53 dimensions?
     • What if there are strong correlations between the features?
     • What if there is some additive noise?
     $\text{minimize}_{\,\mathrm{rank}\, P = k}\; \frac{1}{m}\sum_{i=1}^{m} \|x_i - Px_i\|^2 \quad \text{where} \quad \frac{1}{m}\sum_{i=1}^{m} x_i = \mu$ (assume centering)
     $= \mathrm{tr}\!\left[\frac{1}{m}\sum_{i=1}^{m}\left(x_i x_i^\top - P x_i x_i^\top P^\top\right)\right] = \mathrm{tr}\,\Sigma - \mathrm{tr}\, P\Sigma P^\top \quad$ (maximize $\mathrm{tr}\, P\Sigma P^\top$)

  23. Compact Summaries via PCA
     • Is there a representation better than the coordinate axes?
     • Is it really necessary to show all the 53 dimensions?
     • What if there are strong correlations between the features?
     • What if there is some additive noise?
     • Find the smallest subspace that keeps most of the information
     $\text{minimize}_{\,\mathrm{rank}\, P = k}\; \frac{1}{m}\sum_{i=1}^{m} \|x_i - Px_i\|^2 \quad \text{where} \quad \frac{1}{m}\sum_{i=1}^{m} x_i = \mu$ (assume centering)
     $= \mathrm{tr}\!\left[\frac{1}{m}\sum_{i=1}^{m}\left(x_i x_i^\top - P x_i x_i^\top P^\top\right)\right] = \mathrm{tr}\,\Sigma - \mathrm{tr}\, P\Sigma P^\top \quad$ (maximize $\mathrm{tr}\, P\Sigma P^\top$)
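
     A minimal sketch of this optimization in code (the synthetic data, dimensions, and k are illustrative assumptions): center the data, eigendecompose the empirical covariance, and take P as the projection onto the top-k eigenvectors.
     % rank-k projection minimizing the average squared reconstruction error
     d = 5; m = 1000; k = 2;
     X = randn(d, 2)*randn(2, m) + 0.1*randn(d, m);   % toy data: low rank plus noise
     mu = mean(X, 2);
     Xc = X - mu;                                     % centering
     Sigma = (1/m)*(Xc*Xc');
     [U, S] = eig(Sigma);
     [vals, idx] = sort(diag(S), 'descend');
     Uk = U(:, idx(1:k));                             % top-k principal directions
     P  = Uk*Uk';                                     % rank-k projection
     residual = sum(vals(k+1:end))                    % = tr(Sigma) - tr(P*Sigma*P')
     Xhat = mu + P*Xc;                                % rank-k reconstruction of the data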

  24. Compact Summaries via PCA
     Residual $= \mathrm{tr}\,\Sigma - \mathrm{tr}\, P\Sigma P^\top = \sum_{i=1}^{d}\sigma_i^2 - \sum_{i=1}^{k}\sigma_i^2 = \sum_{i=k+1}^{d}\sigma_i^2$ (maximize $\mathrm{tr}\, P\Sigma P^\top$)
     $x = z + \epsilon$ where $z \sim \mathcal{N}(\mu, \Sigma)$ and $\epsilon \sim \mathcal{N}(0, \sigma^2\mathbf{1})$; joint covariance $\Sigma + \sigma^2\mathbf{1}$, joint eigenvalues $\sigma_i^2 + \sigma^2$

  25. Compact Summaries via PCA
     • Subspace optimization: Residual $= \mathrm{tr}\,\Sigma - \mathrm{tr}\, P\Sigma P^\top = \sum_{i=1}^{d}\sigma_i^2 - \sum_{i=1}^{k}\sigma_i^2 = \sum_{i=k+1}^{d}\sigma_i^2$ (maximize $\mathrm{tr}\, P\Sigma P^\top$)
     • Signal-to-noise ratio optimization
     • Assume the data $x$ is generated with additive noise: $x = z + \epsilon$ where $z \sim \mathcal{N}(\mu, \Sigma)$ and $\epsilon \sim \mathcal{N}(0, \sigma^2\mathbf{1})$
     • The joint covariance matrix is $\Sigma + \sigma^2\mathbf{1}$
     • The joint eigenvalues are $\sigma_i^2 + \sigma^2$, so we can ignore everything below the noise threshold
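
     A tiny numerical illustration of the last bullet (a sketch; the toy covariance, the noise level, and the cutoff factor are all assumptions, not from the slides):
     % joint eigenvalues are sigma_i^2 + sigma^2; drop those near the noise floor
     sigma2 = 0.01;                                    % assumed noise variance
     Sigma  = diag([1.0, 0.3, 0.05, 0, 0]);            % toy signal covariance
     vals   = sort(eig(Sigma + sigma2*eye(5)), 'descend');
     k = sum(vals > 5*sigma2)                          % number of components kept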

  26. 2d dataset

  27. First principal axis

  28. Second principal axis

  29. Eigenfaces (PCA on images)

  30. Eigenfaces (PCA on images)
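
     The same PCA recipe applied to vectorized images gives eigenfaces; a hypothetical sketch (random pixels stand in for real face images, and the image size h x w is an assumption):
     % "eigenfaces": principal components of vectorized images
     h = 32; w = 32; m = 200;
     F = rand(h*w, m);                            % stand-in for m vectorized face images
     mu_face = mean(F, 2);
     Fc = F - mu_face;                            % center the images
     [U, S] = eig((1/m)*(Fc*Fc'));
     [~, idx] = sort(diag(S), 'descend');
     eigenface1 = reshape(U(:, idx(1)), h, w);    % first eigenface; view with imagesc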

  31. When projecting strange data • Original images • Reconstruction doesn’t look like the original

  32. Inference [figure: height vs. weight scatter]

  33. Correlating weight and height

  34. Correlating weight and height: assume a Gaussian correlation

  35. $p(\text{weight} \mid \text{height}) = \frac{p(\text{height}, \text{weight})}{p(\text{height})} \propto p(\text{height}, \text{weight})$

  36. $p(x_2 \mid x_1) \propto \exp\left(-\frac{1}{2}\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^\top \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix}^{-1} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}\right)$
     Keep the linear and quadratic terms of the exponent (in $x_2$).

  37. The gory math
     Correlated observations: assume that the random variables $t \in \mathbb{R}^n$, $t' \in \mathbb{R}^{n'}$ are jointly normal with mean $(\mu, \mu')$ and covariance matrix $K$:
     $p(t, t') \propto \exp\left(-\frac{1}{2}\begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}^\top \begin{bmatrix} K_{tt} & K_{tt'} \\ K_{tt'}^\top & K_{t't'} \end{bmatrix}^{-1} \begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}\right)$
     Inference: given $t$, estimate $t'$ via $p(t' \mid t)$. Translated into machine learning language: we learn $t'$ from $t$.
     Practical solution: since $t' \mid t \sim \mathcal{N}(\tilde{\mu}, \tilde{K})$, we only need to collect all terms in $p(t, t')$ depending on $t'$ by matrix inversion, hence
     $\tilde{\mu} = \mu' + K_{tt'}^\top K_{tt}^{-1}(t - \mu)$ and $\tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1} K_{tt'}$
     (Handbook of Matrices, Lütkepohl 1997, is a big timesaver.)
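
     The practical solution in code (a minimal sketch of conditioning a joint Gaussian; the RBF kernel, the inputs, the zero prior mean, and the jitter term are illustrative assumptions, not taken from the slide):
     % condition a joint Gaussian: t' | t ~ N(mu_tilde, K_tilde)
     x  = (0:0.5:5)';                            % inputs where t is observed
     xs = (0.25:0.5:4.75)';                      % inputs where t' is to be inferred
     kf = @(a, b) exp(-0.5*(a - b').^2);         % RBF kernel, unit length scale
     Ktt   = kf(x, x) + 1e-8*eye(numel(x));      % K_tt (jitter for numerical stability)
     Ktts  = kf(x, xs);                          % K_tt'
     Ktsts = kf(xs, xs);                         % K_t't'
     t = sin(x);                                 % observed values (mu = mu' = 0 assumed)
     mu_tilde = Ktts' * (Ktt \ t);               % mu~ = K_tt'^T K_tt^{-1} (t - mu)
     K_tilde  = Ktsts - Ktts' * (Ktt \ Ktts);    % K~ = K_t't' - K_tt'^T K_tt^{-1} K_tt'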

  38. Mini Summary
     • Normal distribution: $p(x) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2}\, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)}$
     • Sampling from $x \sim \mathcal{N}(\mu, \Sigma)$: use $x = \mu + Lz$ where $\Sigma = LL^\top$ and $z \sim \mathcal{N}(0, \mathbf{1})$
     • Estimating mean and variance: $\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$ and $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^\top - \mu\mu^\top$
     • The conditional distribution is Gaussian, too: $p(x_2 \mid x_1) \propto \exp\left(-\frac{1}{2}\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^\top \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix}^{-1} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}\right)$

  39. 6.2 Gaussian Processes (6 Bayesian Kernel Methods), Alexander Smola, Introduction to Machine Learning 10-701, http://alex.smola.org/teaching/10-701-15

  40. Gaussian Process
