Active Learning and Optimized Information Gathering Lecture 6 – Gaussian Process Optimization CS 101.2 Andreas Krause
Announcements Homework 1: out tomorrow Due Thu Jan 29 Project Proposal due Tue Jan 27 Office hours Come to office hours before your presentation! Andreas: Friday 1:30-3pm, 260 Jorgensen Ryan: Wednesday 4:00-6:00pm, 109 Moore 2
Course outline 1. Online decision making 2. Statistical active learning 3. Combinatorial approaches 3
Recap: Bandit problems [Figure: K arms with payoff probabilities p_1, p_2, …, p_k] ε_n-greedy and UCB1 have regret O(K log T) What about infinitely many arms (K = ∞)? Have to make assumptions! 4
Bandits = Noisy function optimization We are given black box access to function f f(x) = mean payoff for arm x [Diagram: x → f → y = f(x) + noise] Evaluating f is very expensive Want to (quickly) find x* = argmax_x f(x) 5
Bandits with ∞-many arms [Examples: linear functions f(x) = w^T x; Lipschitz-continuous functions (bounded slope)] Can only hope to perform well if we make some assumptions 6
Regret depends on complexity Bandit linear optimization over R^n ("strong" assumptions): regret O(T^{2/3} n) Bandit problems for optimizing Lipschitz functions ("weak" assumptions): regret O(C(n) T^{n/(n+1)}), the curse of dimensionality! Today: Flexible (Bayesian) approach for encoding assumptions about function complexity 7
What if we believe the function looks like: Piecewise linear? Analytic (∞-differentiable)? Want a flexible way to encode assumptions about functions! 8
Bayesian inference Two Bernoulli variables A(larm), B(urglar) P(B=1) = 0.1; P(A=1 | B=1)=0.9; P(A=1 | B=0)=0.1 What is P(B | A)? P(B) “prior” P(A | B) “likelihood” P(B | A) “posterior” 9
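A worked instance of Bayes' rule with the numbers on this slide (the computation itself is not spelled out there):

$$P(B=1 \mid A=1) = \frac{P(A=1 \mid B=1)\,P(B=1)}{P(A=1 \mid B=1)\,P(B=1) + P(A=1 \mid B=0)\,P(B=0)} = \frac{0.9 \cdot 0.1}{0.9 \cdot 0.1 + 0.1 \cdot 0.9} = 0.5$$

So observing the alarm raises the probability of a burglary from the prior 0.1 to the posterior 0.5.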
A Bayesian approach Bayesian models for functions: Likelihood P(data | f) Prior P(f) Posterior P(f | data) [Figure: observed data points] Uff… Why is this useful? 10
Probability of data P(y_1,…,y_k) = ∫ P(f, y_1,…,y_k) df Can compute P(y' | y_1,…,y_k) = P(y', y_1,…,y_k) / P(y_1,…,y_k) 11
Regression with uncertainty about predictions! [Figure: predicted mean with confidence bands through the observed data points] 12
How can we do this? Want to compute P(y' | y_1,…,y_k) P(y_1,…,y_k) = ∫ P(f, y_1,…,y_k) df Horribly complicated integral?? Will see how we can compute this (more or less) efficiently In closed form! … if P(f) is a Gaussian Process 13
Gaussian distribution σ = Standard deviation µ = mean 14
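For reference (the slide only shows the bell-curve figure), the univariate Gaussian density with the parameters named above is

$$\mathcal{N}(y;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$$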
Bivariate Gaussian distribution [Figure: density plots of a bivariate Gaussian] 15
Multivariate Gaussian distribution Joint distribution over n random variables P(Y_1,…,Y_n) σ_jk = E[(Y_j − µ_j)(Y_k − µ_k)] Y_j and Y_k independent ⇒ σ_jk = 0 16
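For completeness, the joint density with mean vector µ and covariance matrix Σ = (σ_jk) is the standard multivariate Gaussian:

$$\mathcal{N}(\mathbf{y};\boldsymbol{\mu},\Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\,\exp\!\left(-\tfrac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{y}-\boldsymbol{\mu})\right)$$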
Marginalization Suppose (Y_1,…,Y_n) ~ N(µ, Σ) What is P(Y_1)?? More generally: Let A = {i_1,…,i_k} ⊆ {1,…,n} Write Y_A = (Y_{i_1},…,Y_{i_k}) Then Y_A ~ N(µ_A, Σ_AA) 17
Conditioning Suppose (Y_1,…,Y_n) ~ N(µ, Σ) Decompose as (Y_A, Y_B) What is P(Y_A | Y_B)?? P(Y_A = y_A | Y_B = y_B) = N(y_A; µ_{A|B}, Σ_{A|B}) where µ_{A|B} = µ_A + Σ_{AB} Σ_{BB}^{-1}(y_B − µ_B) and Σ_{A|B} = Σ_{AA} − Σ_{AB} Σ_{BB}^{-1} Σ_{BA} Computable using linear algebra! 18
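A minimal NumPy check of these formulas on a bivariate example; the covariance values and the observed y_B = 0.75 are illustrative, not from the slides:

```python
import numpy as np

# Zero-mean bivariate Gaussian (Y_A, Y_B); covariance values are illustrative
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
y_B = 0.75  # observed value of Y_B

# Gaussian conditioning: mu_{A|B} = mu_A + S_AB S_BB^-1 (y_B - mu_B)
#                        S_{A|B}  = S_AA - S_AB S_BB^-1 S_BA
mu_A_given_B = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y_B - mu[1])
var_A_given_B = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]

print(mu_A_given_B, var_A_given_B)  # 0.6 0.36
```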
Conditioning [Figure: bivariate Gaussian density, the slice at Y_1 = 0.75, and the conditional P(Y_2 | Y_1 = 0.75)] 19
High dimensional Gaussians Gaussian → Bivariate Gaussian → Multivariate Gaussian → Gaussian Process = "∞-variate Gaussian" [Figure: Gaussian densities of increasing dimension] 20
Gaussian process A Gaussian Process (GP) is an (infinite) set of random variables, indexed by some set V i.e., for each x ∈ V there is a RV Y_x Let A ⊆ V, A = {x_1,…,x_k}, |A| < ∞ Then Y_A ~ N(µ_A, Σ_AA), where (µ_A)_j = µ(x_j) and (Σ_AA)_jk = K(x_j, x_k) K: V × V → R is called kernel (covariance) function µ: V → R is called mean function 21
Visualizing GPs [Figure: a GP visualized over x ∈ V] Typically, only care about "marginals", i.e., P(y) = N(y; µ(x), K(x,x)) 22
Mean functions Can encode prior knowledge Typically, one simply assumes µ (x) = 0 Will do that here to simplify notation 23
Kernel functions K must be symmetric: K(x,x') = K(x',x) for all x, x' K must be positive definite: for all finite A, Σ_AA is a positive definite matrix Kernel function K: assumptions about correlation! 24
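A quick numerical sanity check of these requirements, a sketch of my own using the squared-exponential kernel introduced on the next slide (the helper name k_se and the input grid are mine):

```python
import numpy as np

def k_se(x, xp, h=0.3):
    """Squared-exponential kernel K(x, x') = exp(-(x - x')^2 / h^2)."""
    return np.exp(-(x - xp) ** 2 / h ** 2)

# Gram matrix Sigma_AA for a finite set A of inputs
A = np.linspace(0.0, 1.0, 20)
Sigma_AA = k_se(A[:, None], A[None, :])

print(np.allclose(Sigma_AA, Sigma_AA.T))            # symmetric
print(np.linalg.eigvalsh(Sigma_AA).min() > -1e-10)  # eigenvalues >= 0 (up to round-off)
```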
Kernel functions: Examples Squared exponential kernel: K(x,x') = exp(-(x-x')^2/h^2) [Figure: kernel value vs. distance |x-x'|; samples from P(f) for bandwidth h = 0.1 and h = 0.3] 25
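A minimal sketch of how sample paths like those in the figure can be drawn, assuming a zero-mean prior; the input grid, the jitter term, and the number of samples are my choices:

```python
import numpy as np

def se_kernel(X, Xp, h):
    """Squared-exponential kernel evaluated between all pairs of points."""
    d = X[:, None] - Xp[None, :]
    return np.exp(-d ** 2 / h ** 2)

# Draw sample paths f ~ GP(0, K) evaluated on a grid of inputs
x = np.linspace(0.0, 1.0, 200)
for h in (0.1, 0.3):
    K = se_kernel(x, x, h) + 1e-8 * np.eye(len(x))  # jitter for numerical stability
    samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
    # each row of `samples` is one random function on the grid;
    # smaller h gives wigglier samples, larger h gives smoother ones
```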
Kernel functions: Examples Exponential kernel: K(x,x') = exp(-|x-x'|/h) [Figure: kernel value vs. distance |x-x'|; samples from P(f) for bandwidth h = 0.3 and h = 1] 26
Kernel functions: Examples Linear kernel: K(x,x') = x^T x' [Figure: samples from P(f)] Corresponds to linear regression! 27
Kernel functions: Examples Linear kernel with features: K(x,x') = Φ(x)^T Φ(x') E.g., Φ(x) = [0, x, x^2] E.g., Φ(x) = sin(x) [Figure: samples from P(f) for each feature map] 28
Kernel functions: Examples White noise: K(x,x) = 1; K(x,x') = 0 for x' ≠ x [Figure: a sample from P(f) is uncorrelated noise] 29
Constructing kernels from kernels If K_1(x,x') and K_2(x,x') are kernel functions, then α K_1(x,x') + β K_2(x,x') is a kernel for α, β > 0 K_1(x,x') · K_2(x,x') is a kernel 30
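An illustrative sketch of these closure rules, reusing kernels from the previous slides; the function names and the particular values of α and β are mine:

```python
import numpy as np

def se_kernel(x, xp, h=0.3):
    return np.exp(-(x - xp) ** 2 / h ** 2)

def lin_kernel(x, xp):
    return x * xp

# A weighted sum (alpha, beta > 0) of kernels is again a kernel
def sum_kernel(x, xp, alpha=1.0, beta=0.5):
    return alpha * se_kernel(x, xp) + beta * lin_kernel(x, xp)

# A product of kernels is again a kernel
def prod_kernel(x, xp):
    return se_kernel(x, xp) * lin_kernel(x, xp)
```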
GP Regression Suppose we know kernel function K Get data (x_1,y_1),…,(x_n,y_n) Want to predict y' = f(x') for some new x' 31
Linear prediction Posterior mean µ_{x'|D} = Σ_{x',D} Σ_{D,D}^{-1} y_D Hence, µ_{x'|D} = ∑_{i=1}^{n} w_i y_i Prediction µ_{x'|D} depends linearly on the observations y_i! For a fixed data set D = {(x_1,y_1),…,(x_n,y_n)}, can precompute the weights w_i Like linear regression, but the number of parameters w_i grows with the training data ⇒ "Nonparametric regression" ⇒ Can fit any data set!! ☺ 32
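A compact NumPy sketch of this prediction rule; the kernel bandwidth, the observation-noise variance, and the toy data are illustrative assumptions, not from the lecture:

```python
import numpy as np

def se_kernel(X, Xp, h=0.3):
    d = X[:, None] - Xp[None, :]
    return np.exp(-d ** 2 / h ** 2)

def gp_predict(X_train, y_train, X_test, h=0.3, noise_var=0.01):
    """Posterior mean and variance of f at X_test given data D = {(x_i, y_i)}."""
    K_DD = se_kernel(X_train, X_train, h) + noise_var * np.eye(len(X_train))
    K_sD = se_kernel(X_test, X_train, h)
    K_ss = se_kernel(X_test, X_test, h)

    # Weights W = K_sD K_DD^{-1}, so the posterior mean is W @ y = sum_i w_i y_i
    W = np.linalg.solve(K_DD, K_sD.T).T
    mean = W @ y_train
    cov = K_ss - W @ K_sD.T
    return mean, np.diag(cov)

# Toy usage
X = np.array([0.1, 0.4, 0.7, 0.9])
y = np.sin(2 * np.pi * X) + 0.1 * np.random.randn(len(X))
mu, var = gp_predict(X, y, np.linspace(0.0, 1.0, 50))
```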
Learning parameters Example: K(x,x') = exp(-(x-x')^2/h^2) Need to specify h! [Figure: fits to the same data with h too small ("overfit"), h "just right", and h too large ("underfit")] In general, the kernel function has parameters θ Want to learn θ from data 33
Learning parameters Pick parameters that make the data most likely! log P(y | θ) is differentiable if K(x,x') is! ⇒ Can do gradient descent, conjugate gradient, etc. Tends to work well (neither over- nor underfits) in practice! 34
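For a zero-mean GP with kernel matrix K_θ and Gaussian observation-noise variance σ_n² (my notation), the quantity being maximized is the log marginal likelihood (see Rasmussen & Williams, referenced on the next slide):

$$\log P(\mathbf{y} \mid \theta) = -\tfrac{1}{2}\,\mathbf{y}^{\top}\!\left(K_\theta + \sigma_n^2 I\right)^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\left|K_\theta + \sigma_n^2 I\right| \;-\; \tfrac{n}{2}\log 2\pi$$

which is differentiable in θ whenever K(x,x') is.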
Matlab demo [Rasmussen & Williams, Gaussian Processes for Machine Learning] http://www.gaussianprocess.org/gpml/ 35
Gaussian process A Gaussian Process (GP) is an (infinite) set of random variables, indexed by some set V i.e., for each x ∈ V there is a RV Y_x Let A ⊆ V, A = {x_1,…,x_k}, |A| < ∞ Then Y_A ~ N(µ_A, Σ_AA), where (µ_A)_j = µ(x_j) and (Σ_AA)_jk = K(x_j, x_k) K: V × V → R is called kernel (covariance) function µ: V → R is called mean function 36
GPs over other sets GP is collection of random variables, indexed by set V So far: Have seen GPs over V = R Can define GPs over Text (strings) Graphs Sets … Only need to choose appropriate kernel function 37
Example: Using GPs to model spatial phenomena [Figure: spatial phenomenon modeled with a GP] 38
Other extensions (won’t cover here) GPs for classification Nonparametric generalization of logistic regression Like SVMs (but give confidence on predicted labels!) GPs for modeling non-Gaussian phenomena Model count data over space, … Active set methods for fast inference … Still active research area in machine learning 39
Bandits = Noisy function optimization We are given black box access to function f [Diagram: x → f → y = f(x) + noise] Evaluating f is very expensive Want to (quickly) find x* = argmax_x f(x) Idea: Assume f is a sample from a Gaussian Process! ⇒ Gaussian Process optimization (a.k.a. response surface optimization) 40
Upper confidence bound approach UCB(x | D) = µ(x | D) + 2 σ(x | D) Pick point x* = argmax_{x ∈ V} UCB(x | D) [Figure: UCB over x ∈ V with observed data points] 41
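A minimal self-contained sketch of this loop over a finite grid of candidate arms; the objective f, the noise level, the kernel bandwidth, and the number of iterations are all illustrative assumptions, not from the lecture:

```python
import numpy as np

def se_kernel(X, Xp, h=0.2):
    d = X[:, None] - Xp[None, :]
    return np.exp(-d ** 2 / h ** 2)

def gp_posterior(X_obs, y_obs, X_cand, h=0.2, noise_var=0.01):
    """Posterior mean and standard deviation at candidate points given data D."""
    K_DD = se_kernel(X_obs, X_obs, h) + noise_var * np.eye(len(X_obs))
    K_sD = se_kernel(X_cand, X_obs, h)
    W = np.linalg.solve(K_DD, K_sD.T).T
    mean = W @ y_obs
    var = 1.0 - np.sum(W * K_sD, axis=1)          # K(x, x) = 1 for this kernel
    return mean, np.sqrt(np.maximum(var, 0.0))

# Illustrative "unknown" objective with noisy evaluations
f = lambda x: np.sin(3 * x) + 0.5 * np.cos(7 * x)
candidates = np.linspace(0.0, 1.0, 200)

# Start from one random evaluation, then repeatedly pick the argmax of UCB
X_obs = np.array([np.random.uniform(0.0, 1.0)])
y_obs = f(X_obs) + 0.1 * np.random.randn(1)
for _ in range(20):
    mu, sigma = gp_posterior(X_obs, y_obs, candidates)
    ucb = mu + 2.0 * sigma                        # UCB(x | D) = mu(x | D) + 2 sigma(x | D)
    x_next = candidates[np.argmax(ucb)]
    y_next = f(x_next) + 0.1 * np.random.randn()
    X_obs = np.append(X_obs, x_next)
    y_obs = np.append(y_obs, y_next)
```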
Matlab demo 42
Properties Implicitly trades off exploration and exploitation Exploits prior knowledge about the function Can converge to the optimal solution very quickly! ☺ Seems to work well in many applications Can perform poorly if our prior assumptions are wrong 43