Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Belyaev Mikhail [1,2,3], Burnaev Evgeny [1,2,3], Kapushev Yermek [1,2]
1 Institute for Information Transmission Problems, Moscow, Russia
2 DATADVANCE, llc, Moscow, Russia
3 PreMoLab, MIPT, Dolgoprudny, Russia

ICML 2014 workshop on New Learning Frameworks and Models for Big Data, Beijing, 2014

1/39 Burnaev Evgeny
Approximation task

Problem statement. Let y = g(x) be some unknown function. The training sample is
D = {x_i, y_i}, g(x_i) = y_i, i = 1, ..., N.
The task is to construct an approximation f̂(x) such that f̂(x) ≈ g(x).
Factorial Design of Experiments

Factors: s_k = {x_{i_k}^k ∈ X_k, i_k = 1, ..., n_k}, X_k ⊂ R^{d_k}, k = 1, ..., K;
d_k — dimensionality of the factor s_k.
Factorial Design of Experiments: S = {x_i}_{i=1}^N = s_1 × s_2 × ··· × s_K.
Dimensionality of x ∈ S: d = Σ_{k=1}^K d_k.
Sample size: N = Π_{k=1}^K n_k.
[Figure: a two-dimensional factorial design in the (x_1, x_2) plane]
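The Cartesian product structure of such a design can be generated directly. A minimal sketch in NumPy (the factor values below are made up for illustration):

```python
import numpy as np
from itertools import product

# Two hypothetical one-dimensional factors
s1 = np.array([0.0, 0.5, 1.0])           # n_1 = 3
s2 = np.array([0.0, 0.25, 0.5, 1.0])     # n_2 = 4

# Factorial design: every combination of factor values
S = np.array([np.hstack(p) for p in product(s1, s2)])

print(S.shape)  # (12, 2): N = n_1 * n_2 = 12 points of dimension d = 1 + 1
```

For multidimensional factors each element of `product` would itself be a vector, and `np.hstack` concatenates the factor blocks into one design point.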
Example of Factorial DoE

Figure: Factorial DoE with a multidimensional factor (design points plotted over x_1, x_2, x_3)
DoE in engineering problems

1. Independent groups of variables — factors.
2. Training sample generation procedure:
   - Fix values of the first factor. Perform experiments varying the values of the other factors.
   - Fix another value of the first factor. Perform a new series of experiments.
3. Take into account knowledge from the subject domain.

Data properties:
- Has a special structure
- Can have a large sample size
- Factor sizes can differ significantly
Example of Factorial DoE

Modeling of the pressure distribution over an aircraft wing:
- Angle of attack: s_1 = {0, 0.8, 1.6, 2.4, 3.2, 4.0}.
- Mach number: s_2 = {0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83}.
- Wing point coordinates: s_3 — 5000 points in R^3.

The training sample: S = s_1 × s_2 × s_3.
Dimensionality d = 1 + 1 + 3 = 5. Sample size N = 6 · 7 · 5000 = 210000.
Existing solutions

- Universal techniques. Disadvantages: do not take the sample structure into account ⇒ low approximation quality, high computational complexity.
- Multivariate Adaptive Regression Splines [Friedman, 1991]. Disadvantages: discontinuous derivatives, non-physical behaviour.
- Tensor product of splines [Stone et al., 1997; Xiao et al., 2013]. Disadvantages: only one-dimensional factors, no accuracy evaluation procedure.
- Gaussian Processes on a lattice [Dietrich and Newsam, 1997; Stroud et al., 2014]. Disadvantages: only a two-dimensional grid with equidistant points.
The aim is to develop a computationally efficient algorithm that takes into account the special features of the factorial Design of Experiments.
Gaussian Process Regression

Function model: g(x) = f(x) + ε(x), where f(x) is a Gaussian process (GP) and ε(x) is Gaussian white noise. A GP is fully defined by its mean and covariance function.

The covariance function of f(x):
K_f(x, x') = σ_f^2 exp( − Σ_{i=1}^d θ_i^2 (x^{(i)} − x'^{(i)})^2 ),
where x^{(i)} is the i-th component of the vector x and θ = (σ_f^2, θ_1, ..., θ_d) are the parameters of the covariance function.

The covariance function of g(x):
K_g(x, x') = K_f(x, x') + σ_noise^2 δ(x, x'),
where δ(x, x') is the Kronecker delta.
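A direct transcription of these two covariance functions (function names and the toy inputs are illustrative, not from the talk):

```python
import numpy as np

def cov_f(x, y, sigma_f2, theta):
    """Squared-exponential covariance:
    K_f(x, y) = sigma_f^2 * exp(-sum_i theta_i^2 (x_i - y_i)^2)."""
    x, y, theta = map(np.asarray, (x, y, theta))
    return sigma_f2 * np.exp(-np.sum(theta**2 * (x - y)**2))

def cov_g(x, y, sigma_f2, theta, sigma_noise2):
    """K_g adds the noise variance when the two points coincide (Kronecker delta)."""
    delta = float(np.allclose(x, y))
    return cov_f(x, y, sigma_f2, theta) + sigma_noise2 * delta

print(cov_g([0.0], [0.0], 1.0, [1.0], 0.5))  # 1.5 at a coincident pair
```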
Choosing parameters: Maximum Likelihood Estimation

Loglikelihood:
log p(y | X, θ) = −(1/2) y^T K_g^{-1} y − (1/2) log|K_g| − (N/2) log 2π,
where |K_g| is the determinant of the matrix K_g = ( K_g(x_i, x_j) )_{i,j=1}^N, x_i, x_j ∈ S.

The parameters θ* are chosen such that
θ* = argmax_θ log p(y | X, θ).
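The loglikelihood can be evaluated stably with a Cholesky factorization; a minimal sketch (this is the generic dense computation, not yet the structured fast version developed later in the talk):

```python
import numpy as np

def log_likelihood(y, K_g):
    """Gaussian log marginal likelihood:
    -1/2 y^T K_g^{-1} y - 1/2 log|K_g| - N/2 log(2 pi)."""
    N = len(y)
    L = np.linalg.cholesky(K_g)                   # K_g = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_g^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))    # log|K_g|
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * N * np.log(2 * np.pi)
```

The Cholesky route avoids forming K_g^{-1} explicitly and gives log|K_g| for free from the factor's diagonal.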
Final model

Prediction of g(x) at a point x:
f̂(x) = k^T K_g^{-1} y, where k = ( K_f(x_1, x), ..., K_f(x_N, x) )^T.

Posterior variance:
σ^2(x) = K_f(x, x) − k^T K_g^{-1} k.
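These two formulas translate directly into code; a sketch taking the cross-covariance vector k, the train covariance K_g, and the prior variance K_f(x, x) as inputs (names are illustrative):

```python
import numpy as np

def gp_predict(k, K_g, y, k_xx):
    """Posterior mean k^T K_g^{-1} y and variance k_xx - k^T K_g^{-1} k."""
    alpha = np.linalg.solve(K_g, y)        # K_g^{-1} y
    mean = k @ alpha
    var = k_xx - k @ np.linalg.solve(K_g, k)
    return mean, var
```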
Gaussian Processes for factorial DoE

Issues:
1. High computational complexity: O(N^3). In case of factorial DoE the sample size N can be very large.
2. Degeneracy as a result of significantly different factor sizes.
Estimation of covariance function parameters

Loglikelihood:
log p(y | X, θ) = −(1/2) y^T K_g^{-1} y − (1/2) log|K_g| − (N/2) log 2π.

Derivatives:
∂/∂θ log p(y | X, σ_f, σ_noise) = −(1/2) Tr(K_g^{-1} K') + (1/2) y^T K_g^{-1} K' K_g^{-1} y,
where θ is a parameter of the covariance function (a component of θ_i, σ_noise or σ_{f,i}, i = 1, ..., d), and K' = ∂K/∂θ.
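The derivative formula can be sanity-checked against finite differences on a tiny example. The parameterization below (a single scale θ multiplying a fixed PSD matrix plus noise) is purely illustrative, chosen so that K' = ∂K/∂θ is trivial to write down:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
B = A @ A.T                        # fixed symmetric PSD matrix
y = rng.standard_normal(n)

def K_of(theta):
    return theta * B + 0.1 * np.eye(n)   # K_g(theta); hence K' = dK/dtheta = B

def loglik(theta):
    K = K_of(theta)
    return (-0.5 * y @ np.linalg.solve(K, y)
            - 0.5 * np.linalg.slogdet(K)[1]
            - 0.5 * n * np.log(2 * np.pi))

theta = 1.3
K = K_of(theta)
Kinv_y = np.linalg.solve(K, y)
Kp = B
# Analytic: -1/2 Tr(K^{-1} K') + 1/2 y^T K^{-1} K' K^{-1} y
grad = -0.5 * np.trace(np.linalg.solve(K, Kp)) + 0.5 * Kinv_y @ Kp @ Kinv_y

# Central finite difference of the loglikelihood
eps = 1e-6
grad_fd = (loglik(theta + eps) - loglik(theta - eps)) / (2 * eps)
print(np.allclose(grad, grad_fd, atol=1e-5))  # True
```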
Tensors and Kronecker product

Definition. A tensor Y is a K-dimensional matrix of size n_1 × n_2 × ··· × n_K:
Y = ( y_{i_1, i_2, ..., i_K} ),  i_k = 1, ..., n_k,  k = 1, ..., K.

Definition. The Kronecker product of matrices A and B is the block matrix

A ⊗ B = ( a_{11} B  ···  a_{1n} B
             ⋮       ⋱      ⋮
          a_{m1} B  ···  a_{mn} B ).
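NumPy provides the Kronecker product directly; a small check of the block structure:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

# Each entry a_ij of A is replaced by the block a_ij * B
K = np.kron(A, B)
print(K.shape)  # (4, 4)
```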
Related operations

Operation vec:
vec(Y) = [ Y_{1,1,...,1}, Y_{2,1,...,1}, ..., Y_{n_1,1,...,1}, Y_{1,2,...,1}, ..., Y_{n_1,n_2,...,n_K} ]^T.

Multiplication of a tensor by a matrix along the k-th direction:
Z = Y ⊗_k B  ⇔  Z_{i_1,...,i_{k-1}, j, i_{k+1},...,i_K} = Σ_{i_k} Y_{i_1,...,i_k,...,i_K} B_{i_k j}.

Connection between tensors and the Kronecker product:
vec(Y ⊗_1 B_1 ··· ⊗_K B_K) = (B_1 ⊗ ··· ⊗ B_K) vec(Y).   (1)

Complexity of computing the left-hand side is O(N Σ_k n_k); of the right-hand side, O(N^2).
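Identity (1) can be checked numerically for K = 2. With the column-major vec used in the definition above (first index fastest), the two-factor chain of mode products Y ⊗_1 B_1 ⊗_2 B_2 is the matrix product B_1^T Y B_2, and the matrix form is the classical vec(A Y B) = (B^T ⊗ A) vec(Y). Note the ordering and transposition of the Kronecker factors depends on the vec convention; the sketch below follows NumPy's column-major flattening:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 4))
B1 = rng.standard_normal((3, 3))
B2 = rng.standard_normal((4, 4))

# Mode products as defined on the slide:
# Z_{j1 j2} = sum_{i1, i2} Y_{i1 i2} (B1)_{i1 j1} (B2)_{i2 j2} = (B1^T Y B2)_{j1 j2}
Z = B1.T @ Y @ B2

# Column-major vec: first index fastest
vec = lambda M: M.flatten(order="F")

# Matrix form: vec(B1^T Y B2) = (B2^T kron B1^T) vec(Y)
lhs = vec(Z)
rhs = np.kron(B2.T, B1.T) @ vec(Y)
print(np.allclose(lhs, rhs))  # True
```

The left-hand side costs two small matrix products; the right-hand side forms an N × N Kronecker matrix, which is exactly the complexity gap stated on the slide.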
Covariance function

Form of the covariance function:
K_f(x, y) = Π_{i=1}^K k_i(x_i, y_i),  x_i, y_i ∈ s_i,
where k_i is an arbitrary covariance function for the i-th factor.

Covariance matrix:
K = ⊗_{i=1}^K K_i,
where K_i is the covariance matrix of the i-th factor.
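A sketch of this product covariance for two one-dimensional factors, using an RBF kernel per factor (the grids and parameter values are illustrative):

```python
import numpy as np

def rbf_gram(s, theta):
    """Gram matrix k(x, y) = exp(-theta^2 (x - y)^2) for a 1-D factor s."""
    d = s[:, None] - s[None, :]
    return np.exp(-(theta**2) * d**2)

s1 = np.linspace(0.0, 1.0, 3)
s2 = np.linspace(0.0, 1.0, 4)

K1 = rbf_gram(s1, theta=1.0)
K2 = rbf_gram(s2, theta=2.0)

# Covariance over the full factorial design: product kernel => Kronecker matrix
K = np.kron(K1, K2)
print(K.shape)  # (12, 12): N = n_1 * n_2
```

Only the small factor matrices K_i ever need to be stored; the N × N matrix K is formed here just to illustrate the structure.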
Fast computation of the loglikelihood

Proposition. Let K_i = U_i D_i U_i^T be the Singular Value Decomposition (SVD) of the matrix K_i, where U_i is an orthogonal matrix and D_i is diagonal. Then:

|K_g| = Π_{i_1,...,i_K} D_{i_1,...,i_K},
K_g^{-1} = ( ⊗_k U_k ) ( ⊗_k D_k + σ_noise^2 I )^{-1} ( ⊗_k U_k^T ),   (2)
K_g^{-1} y = vec( ( (Y ⊗_1 U_1 ··· ⊗_K U_K) ∗ D^{-1} ) ⊗_1 U_1^T ··· ⊗_K U_K^T ),

where D is the tensor of the diagonal elements of the matrix σ_noise^2 I + ⊗_k D_k, ∗ denotes elementwise multiplication, and Y is the tensor of outputs, vec(Y) = y.
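A numerical sketch of (2) for two factors: eigendecompose each small factor matrix, then assemble log|K_g| and K_g^{-1}y from the factor eigenvalues. For clarity this sketch forms the dense Kronecker eigenbasis to apply it; in the fast algorithm that step is replaced by the tensor contractions of identity (1), and the dense N × N matrices appear only in the reference check:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, sigma2 = 3, 4, 0.1

# Symmetric PSD factor covariances (random, for illustration)
A1 = rng.standard_normal((n1, n1)); K1 = A1 @ A1.T
A2 = rng.standard_normal((n2, n2)); K2 = A2 @ A2.T
y = rng.standard_normal(n1 * n2)

# Factor eigendecompositions K_i = U_i D_i U_i^T
d1, U1 = np.linalg.eigh(K1)
d2, U2 = np.linalg.eigh(K2)

# Diagonal of (D_1 kron D_2) + sigma^2 I: all products of factor eigenvalues
D = np.kron(d1, d2) + sigma2

# log|K_g| as a sum of log-eigenvalues
logdet_fast = np.sum(np.log(D))

# K_g^{-1} y in the Kronecker eigenbasis U = U_1 kron U_2
U = np.kron(U1, U2)
alpha_fast = U @ ((U.T @ y) / D)

# Reference: dense computation
K_g = np.kron(K1, K2) + sigma2 * np.eye(n1 * n2)
print(np.allclose(logdet_fast, np.linalg.slogdet(K_g)[1]))  # True
print(np.allclose(alpha_fast, np.linalg.solve(K_g, y)))     # True
```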
Computational complexity

Proposition. Calculation of the loglikelihood using (2) has the computational complexity
O( N Σ_{i=1}^K n_i + Σ_{i=1}^K n_i^3 ).

Assuming n_i ≪ N (the number of factors is large and their sizes are comparable, so n_i ≈ N^{1/K}), we get
O( N Σ_i n_i ) = O( N^{1+1/K} ).
Fast computation of derivatives

Proposition. The following statements hold:

Tr(K_g^{-1} K') = ⟨ diag(D̂^{-1}), ⊗_{i=1}^K diag(U_i^T K_i' U_i) ⟩,

y^T K_g^{-1} K' K_g^{-1} y = ⟨ A, A ⊗_1 K_1^T ⊗_2 ··· ⊗_{i-1} K_{i-1}^T ⊗_i (∂K_i/∂θ)^T ⊗_{i+1} K_{i+1}^T ··· ⊗_K K_K^T ⟩,

where D̂ = σ_noise^2 I + ⊗_k D_k and vec(A) = K_g^{-1} y.

The computational complexity is
O( N Σ_{i=1}^K n_i + Σ_{i=1}^K n_i^3 ).
Gaussian Processes for factorial DoE

Issues:
1. High computational complexity: O(N^3). In case of factorial DoE the sample size N can be very large.
2. Degeneracy as a result of significantly different factor sizes.
Degeneracy

Example of degeneracy.
Figure: approximation obtained using GP regression (left) vs. the original function from the GPML toolbox (right); the training set is shown in both plots.
Regularization

Prior distribution:
( θ_k^{(i)} − a_k^{(i)} ) / ( b_k^{(i)} − a_k^{(i)} ) ∼ Be(α, β),  i = 1, ..., d_k,  k = 1, ..., K,
a_k^{(i)} = c_k / max_{x,y ∈ s_k} ( x^{(i)} − y^{(i)} ),  b_k^{(i)} = C_k / min_{x,y ∈ s_k, x ≠ y} ( x^{(i)} − y^{(i)} ),
where Be(α, β) is the Beta distribution, and c_k and C_k are parameters of the algorithm (we use c_k = 0.01 and C_k = 2).

Initialization:
θ_k^{(i)} = [ (1/n_k) ( max_{x ∈ s_k} x^{(i)} − min_{x ∈ s_k} x^{(i)} ) ]^{-1}.
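The initialization rule translates directly into code; a sketch with a hypothetical factor (function name and data are illustrative):

```python
import numpy as np

def init_theta(s_k):
    """Initial inverse length-scales for a factor s_k of shape (n_k, d_k):
    theta_k = [(max - min) / n_k]^{-1}, componentwise."""
    s_k = np.atleast_2d(s_k)
    n_k = s_k.shape[0]
    spread = s_k.max(axis=0) - s_k.min(axis=0)
    return n_k / spread

s = np.array([[0.0], [0.5], [1.0], [2.0]])   # n_k = 4, d_k = 1
print(init_theta(s))  # [2.]
```

This sets each initial inverse length-scale to the reciprocal of the average grid spacing of the factor, which keeps the optimization away from degenerate starting points.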