

  1. Tutorials on the Gaussian Random Process and its OR Applications
     By Juta Pichitlamken, Department of Industrial Engineering, Kasetsart University, Bangkok
     juta.p@ku.ac.th | http://pirun.ku.ac.th/~fengjtp/
     September 3, 2009, Operations Research Network of Thailand (OR-NET 2009)

  2. Motivation: Dagstuhl Seminar
     • Near Frankfurt, Germany. Web: http://www.dagstuhl.de/
     • The seminars concentrate on computer science, but the one I attended was Sampling-based Optimization in the Presence of Uncertainty, organized by Jurgen Branke (Universitat Karlsruhe, Germany), Barry Nelson (Northwestern University, US), Warren Powell (Princeton University, US), and Thomas J. Santner (Ohio State University, US).
     • Small workshop of 20-30 people (invitation only) from different fields (Statistics, Simulation, OR, Computer Sciences, Business Administration, Mathematics, ...).
     • Despite this diversity, many participants use Gaussian Random Processes (GP) as modeling tools.

  3. Gaussian Process (GP) Application: Spatial Statistics
     • Model spatial distributions of environmental or socioeconomic data (e.g., geographical distributions of cases of type-A (H1N1) influenza, or data from a Geographic Information System (GIS)) for statistical inference.
     • Intuition: "Everything is related to everything else, but near things are more related than distant things." (Waldo Tobler)
     • GP prediction (known as kriging) can be used to model this spatial dependency, i.e., spatial data is viewed as a realization of a stochastic process.
     • To build a model, the stochastic process is assumed stationary: the mean is constant, and the covariance depends only on the distance. Covariance is a measure of how much two variables change together: Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)].
     • Classic textbook: Cressie, N.A.C. 1993. Statistics for Spatial Data. Wiley.
     Source: Câmara, G., A.M. Monteiro, S.D. Fucks, and M.S. Carvalho. 2004. Spatial Analysis and GIS Primer. Downloadable from www.dpi.inpe.br/gilberto/spatial analysis.html

  4. GP Application: Metamodeling of Deterministic Responses
     • In OR, a metamodel is a mathematical model of a set of related models (Xavier 2004).
     • Design and analysis of computer experiments: data are generated from a computer code (e.g., finite element models) whose responses are deterministic, computation may be time consuming, and a large number of factors may be involved.
     • Given the training data {(x_i, y_i)}, i = 1, ..., n, we want to find an approximation (metamodel) to the computer code. Such metamodels are called surrogates in the (deterministic) global optimization literature. (A minimal fitting sketch follows this slide.)
     • Ex: a computational fluid dynamics (CFD) model to study the oil mist separator system in an internal combustion engine (Satoh, Kawai, Ishikawa, and Matsuoka 2000).
     • Computer models with multiple levels of fidelity for optimization: Huang, Allen, Notz, and Miller (2006).
     • Textbooks: Fang, K.-T., R. Li, and A. Sudjianto. 2006. Design and Modeling for Computer Experiments. Taylor and Francis. Santner, T.J., B.J. Williams, and W.I. Notz. 2003. The Design and Analysis of Computer Experiments. Springer-Verlag.
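The slides do not include code for this step; the following is a minimal sketch of fitting a GP metamodel to noise-free computer-code output with scikit-learn. The toy function computer_code, the design size, and the kernel settings are illustrative assumptions, not taken from the presentation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Stand-in for an expensive, deterministic computer code (hypothetical).
def computer_code(x):
    return np.sin(3.0 * x) + 0.5 * x

# Training design: n = 8 evaluations in [0, 3].
X_train = np.linspace(0.0, 3.0, 8).reshape(-1, 1)
y_train = computer_code(X_train).ravel()

# GP metamodel with a squared-exponential (RBF) kernel; alpha is a tiny
# jitter for numerical stability, since the response carries no noise.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-10, normalize_y=True)
gp.fit(X_train, y_train)

# Predict at untried inputs; the GP also reports its own uncertainty.
X_new = np.linspace(0.0, 3.0, 50).reshape(-1, 1)
y_mean, y_std = gp.predict(X_new, return_std=True)
print(y_mean[:5], y_std[:5])
```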

  5. GP Application: Metamodeling of Probabilistic Responses
     Probabilistic responses (e.g., discrete-event simulation outputs). Approaches:
     1. Consider the variability in the observed responses and the fitting errors simultaneously: Ankenman, Nelson, and Staum (2008).
     2. Separate out the trend in the data using least-squares models, then apply kriging models to the residuals: van Beers and Kleijnen (2003, 2007). (A rough sketch of this two-step approach follows this slide.)
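A rough sketch of approach 2 on assumed toy data: detrend with ordinary least squares, then krige the residuals. The linear trend, kernel choice, and noise level are illustrative and are not taken from van Beers and Kleijnen.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

rng = np.random.default_rng(0)

# Noisy "simulation" output: linear trend + smooth signal + noise (hypothetical data).
X = np.linspace(0.0, 5.0, 30).reshape(-1, 1)
y = 2.0 * X.ravel() + np.sin(2.0 * X.ravel()) + rng.normal(0.0, 0.3, X.shape[0])

# Step 1: remove the trend with a least-squares (degree-1) fit.
trend_coef = np.polyfit(X.ravel(), y, deg=1)
residuals = y - np.polyval(trend_coef, X.ravel())

# Step 2: krige the residuals; the WhiteKernel term absorbs the simulation noise.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(X, residuals)

# Prediction = estimated trend + kriged residual.
X_new = np.array([[2.5]])
y_hat = np.polyval(trend_coef, X_new.ravel()) + gp.predict(X_new)
print(y_hat)
```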

  6. GP Application: Machine Learning
     • Supervised learning: learn f : input → output from empirical data (a training data set).
       – For continuous outputs (aka responses or dependent variables), learning f is called regression. Ex: in manufacturing systems, WIP = f(throughput or production rate).
       – For discrete outputs, learning f is known as classification. Ex: classification of handwritten images into digits (0-9).
     • Approach: assume a prior distribution (beliefs over the types of f we expect to observe, e.g., their mean and variance) that is a GP. Once we get actual data, we can reject the f that do not agree with the data, i.e., find the GP parameters that best fit the data.
     • Desirable properties of the prior, such as smoothness and stationarity, are induced by the covariance function.
     • Resources: MacKay, D. 2002. Information Theory, Inference & Learning Algorithms. Cambridge University Press. Rasmussen, C.E., and C.K.I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press. Also available online: www.gaussianprocess.org/gpml

  7. Gaussian Process Regression Demo
     Thanks to Rasmussen, C.E., and C.K.I. Williams. 2006. GPML Code.
     Sample Y from a zero-mean GP with the squared-exponential covariance function
        Cov(Y(x_p), Y(x_q)) = \sigma_f^2 \exp\!\left( -\frac{1}{2\ell^2} (x_p - x_q)^2 \right) + \sigma_n^2 \, 1\{x_p = x_q\}.   (1)
     Note:
     • The covariance between outputs is a function of the inputs.
     • If x_p ≈ x_q, then Corr(Y(x_p), Y(x_q)) ≈ 1.
     • The hyperparameters are the length-scale ℓ = 1, the signal variance σ_f² = 1, and the noise variance σ_n² = 0.01.
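The MATLAB GPML demo itself is not reproduced here; the following is a minimal NumPy sketch that draws one realization of Y at a grid of inputs from a zero-mean multivariate normal whose covariance matrix is built from equation (1), using the hyperparameter values on the slide. The input grid is an assumed choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def covariance_eq1(x, ell=1.0, sigma_f=1.0, sigma_n=0.1):
    """Covariance matrix from equation (1): squared-exponential plus a noise term."""
    d2 = (x[:, None] - x[None, :]) ** 2           # pairwise squared distances
    K = sigma_f**2 * np.exp(-0.5 * d2 / ell**2)   # signal part
    K += sigma_n**2 * np.eye(len(x))              # noise contributes only when x_p = x_q
    return K

# Inputs at which the process is observed.
x = np.linspace(-5.0, 5.0, 100)

# One sample path: a draw from N(0, K) with ell = 1, sigma_f^2 = 1, sigma_n^2 = 0.01.
K = covariance_eq1(x, ell=1.0, sigma_f=1.0, sigma_n=0.1)
y = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K)
print(y[:5])
```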

  8. Gaussian Process Regression Demo (Cont'd)
     Consider the following cases:
     1. Fit a GP assuming zero mean and using the true covariance (1).
     2. Shorten the length-scale to 0.3 (σ_f = 1.08, σ_n = 0.00005); the noise level is much reduced.
     3. Lengthen the length-scale to 3 (σ_f = 1.16, σ_n = 0.89); the noise level is higher.
     4. Assume the hyperparameters are unknown and estimate them as the maximizer of the posterior probability given the training data, using the same covariance family as (1).
     5. Assume the hyperparameters are unknown, using the Matérn covariance family.
     Directory: D:\GP\gpml_matlab\gpml-demo
     (A rough Python analogue of cases 4 and 5 is sketched below.)
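Not the GPML demo itself, but a rough analogue of cases 4 and 5 using scikit-learn, which selects hyperparameters by maximizing the log marginal likelihood (rather than a posterior); the training data below are assumed toy values.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel, ConstantKernel

rng = np.random.default_rng(2)

# Hypothetical training data standing in for the demo data.
X = rng.uniform(-5.0, 5.0, size=(20, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 20)

# Case 4: same covariance family as (1); length-scale, signal variance, and
# noise variance are fitted by maximizing the log marginal likelihood.
k_se = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gp_se = GaussianProcessRegressor(kernel=k_se, n_restarts_optimizer=5).fit(X, y)

# Case 5: Matern covariance family instead (nu controls smoothness).
k_mat = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=2.5) + WhiteKernel(noise_level=0.01)
gp_mat = GaussianProcessRegressor(kernel=k_mat, n_restarts_optimizer=5).fit(X, y)

print("SE fit:    ", gp_se.kernel_, gp_se.log_marginal_likelihood_value_)
print("Matern fit:", gp_mat.kernel_, gp_mat.log_marginal_likelihood_value_)
```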

  9. Definition of GP
     • A probability distribution characterizes a random variable (scalar or vector); a stochastic process governs the properties of functions.
     • A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution.
     • Suppose that X ⊆ R^d has positive d-dimensional volume. We say that Y(x), for x ∈ X, is a Gaussian process if for any L ≥ 1 and any choice of x_1, x_2, ..., x_L, the vector (Y(x_1), Y(x_2), ..., Y(x_L)) has a multivariate normal distribution.
     • Other examples of GPs: Brownian motion processes and Kalman filters.
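A small numerical illustration (not from the slides) of the finite-dimensional definition: pick any L inputs, build their covariance matrix (here the noise-free squared-exponential part of equation (1), an assumed choice), and the resulting vector is multivariate normal; its sub-vectors follow the corresponding sub-blocks.

```python
import numpy as np

rng = np.random.default_rng(3)

def sq_exp_cov(x, ell=1.0, sigma_f=1.0):
    """Squared-exponential covariance matrix (noise-free part of equation (1))."""
    x = np.asarray(x, dtype=float)
    d2 = (x[:, None] - x[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

# Any finite choice of inputs (L = 5 here) yields a multivariate normal vector.
x_all = [0.0, 0.7, 1.5, 2.2, 4.0]
samples = rng.multivariate_normal(np.zeros(5), sq_exp_cov(x_all), size=100_000)

# Consistency: the first three coordinates behave like a draw at (0.0, 0.7, 1.5)
# alone, i.e., their empirical covariance matches the 3 x 3 block.
print(np.round(np.cov(samples[:, :3], rowvar=False), 2))
print(np.round(sq_exp_cov(x_all[:3]), 2))
```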

  10. Review of Joint, Marginal, and Conditional Probability
     Suppose we partition y into two groups, y_A and y_B, so that the joint probability is p(y) = p(y_A, y_B).
     Marginal probability of y_A:
        p(y_A) = \int p(y_A, y_B) \, dy_B.
     Conditional probability:
        p(y_A \mid y_B) = \frac{p(y_A, y_B)}{p(y_B)}, \quad \text{for } p(y_B) > 0.
     Using the definitions of both p(y_A | y_B) and p(y_B | y_A), we get Bayes' theorem:
        p(y_A \mid y_B) = \frac{p(y_A) \, p(y_B \mid y_A)}{p(y_B)}.
     In Bayesian terminology, this is
        \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}.
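The step from the two conditional definitions to Bayes' theorem is implicit on the slide; written out (this derivation is an addition, not part of the original deck):

```latex
\begin{align*}
p(y_A, y_B) &= p(y_A \mid y_B)\, p(y_B) = p(y_B \mid y_A)\, p(y_A)\\
\Longrightarrow\quad p(y_A \mid y_B) &= \frac{p(y_A)\, p(y_B \mid y_A)}{p(y_B)}, \qquad p(y_B) > 0.
\end{align*}
```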

  11. The Univariate Normal Distribution
     Consider Z ~ Normal with mean 0 and variance 1, i.e., Z ~ N(0, 1). We can transform Z to have mean µ and variance σ²:
        Y = \sigma Z + \mu,
     where Y is also normal, with density
        p(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y - \mu)^2}{2\sigma^2} \right).

  12. The Multivariate Normal Distribution (MVN)
     Consider Z_i ~ N(0, 1), i = 1, 2, ..., d, grouped into a random vector:
        Z = (Z_1, Z_2, ..., Z_d)' ~ N_d(0, I).
     As in the N(0, 1) case, we can transform Z to have mean vector µ and covariance matrix Σ:
        Y = \Sigma^{1/2} Z + \mu,
     where \Sigma^{1/2} \Sigma^{1/2} = \Sigma, a positive definite matrix. Y is normal with density
        p(y \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (y - \mu)' \Sigma^{-1} (y - \mu) \right).
     [Figure: density plots for small |Σ| versus large |Σ|.]
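A minimal NumPy sketch (not from the slides) of the transformation Y = Σ^{1/2} Z + µ, using the Cholesky factor as one valid choice of matrix square root; the mean vector and covariance matrix below are assumed example values.

```python
import numpy as np

rng = np.random.default_rng(4)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # a positive definite covariance matrix

# The Cholesky factor L satisfies L @ L.T == Sigma, so it serves as Sigma^{1/2}.
L = np.linalg.cholesky(Sigma)

# Transform i.i.d. standard normals: each row of Y is L z + mu.
Z = rng.standard_normal((100_000, 2))
Y = Z @ L.T + mu

# The sample mean and covariance should be close to mu and Sigma.
print(np.round(Y.mean(axis=0), 2))
print(np.round(np.cov(Y, rowvar=False), 2))
```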

  13. The Multivariate Normal Distribution (Cont'd)
     Suppose that
        \begin{pmatrix} W_1 \\ W_2 \end{pmatrix} \sim N_{m+n}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix} \right).
     Then the marginal distribution of W_1 is N(\mu_1, \Sigma_{1,1}).
     Conditional distribution:
        [W_1 \mid W_2] \sim N_m\!\left( \mu_1 + \Sigma_{1,2} \Sigma_{2,2}^{-1} (W_2 - \mu_2),\; \Sigma_{1,1} - \Sigma_{1,2} \Sigma_{2,2}^{-1} \Sigma_{2,1} \right).   (2)
     The product of two Gaussians gives another (unnormalized) Gaussian:
        N_d(x \mid a, A) \, N_d(x \mid b, B) = Z^{-1} N_d(x \mid c, C),
     where c = C(A^{-1} a + B^{-1} b), C = (A^{-1} + B^{-1})^{-1}, and
        Z^{-1} = (2\pi)^{-d/2} |A + B|^{-1/2} \exp\!\left( -\tfrac{1}{2} (a - b)' (A + B)^{-1} (a - b) \right).
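Equation (2) is the workhorse behind GP prediction (condition the value at a new input on the observed training outputs). A minimal NumPy sketch of it, with assumed toy dimensions and matrices:

```python
import numpy as np

def mvn_conditional(mu1, mu2, S11, S12, S22, w2):
    """Mean and covariance of [W_1 | W_2 = w2], following equation (2)."""
    solve = np.linalg.solve                       # avoids forming S22^{-1} explicitly
    cond_mean = mu1 + S12 @ solve(S22, w2 - mu2)
    cond_cov = S11 - S12 @ solve(S22, S12.T)
    return cond_mean, cond_cov

# Assumed toy partition: W_1 is 1-dimensional, W_2 is 2-dimensional.
mu1, mu2 = np.array([0.0]), np.array([0.0, 0.0])
S11 = np.array([[1.0]])
S12 = np.array([[0.8, 0.3]])
S22 = np.array([[1.0, 0.5],
                [0.5, 1.0]])

# Observing W_2 shifts the mean of W_1 and shrinks its variance below S11.
m, C = mvn_conditional(mu1, mu2, S11, S12, S22, w2=np.array([1.0, -0.5]))
print(m, C)
```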

  14. GP Characterization
     Mostly, we consider strongly stationary GPs: (Y(x_1), Y(x_2), ..., Y(x_L)) and (Y(x_1 + h), Y(x_2 + h), ..., Y(x_L + h)) have the same distribution, i.e., the process is invariant to translation. Therefore, the distribution of Y(x), with its mean and covariance, is the same for all x, and the covariance function must depend only on x_1 − x_2. Thus, we can characterize a GP by:
     1. The mean function E[Y(x)]. Generally, we consider E[Y(x)] = 0.
     2. The process variance C(0), together with the correlation function R(x_1 − x_2) or, equivalently, the covariance function C(x_1 − x_2).
     Because correlation functions dictate the behavior of the GP (e.g., continuity and differentiability), certain families of correlation functions are generally used. Specifying C(x_1 − x_2) implies a distribution over functions. (Two common covariance families are sketched below.)
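A small sketch (not from the slides) of two commonly used stationary covariance families, written as functions of the lag x_1 − x_2 only; the hyperparameter values are illustrative.

```python
import numpy as np

def sq_exp_cov(h, sigma_f=1.0, ell=1.0):
    """Squared-exponential: very smooth (infinitely differentiable) sample paths."""
    return sigma_f**2 * np.exp(-0.5 * (h / ell) ** 2)

def matern32_cov(h, sigma_f=1.0, ell=1.0):
    """Matern with nu = 3/2: rougher, only once-differentiable sample paths."""
    r = np.sqrt(3.0) * np.abs(h) / ell
    return sigma_f**2 * (1.0 + r) * np.exp(-r)

# Stationarity in action: only the lag h = x1 - x2 matters, not the locations.
print(sq_exp_cov(0.0), sq_exp_cov(1.3 - 0.3), sq_exp_cov(10.3 - 9.3))
print(matern32_cov(0.0), matern32_cov(1.0))
```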
