Multi-task Gaussian Process Prediction

Chris Williams
Joint work with Edwin Bonilla, Kian Ming A. Chai, Stefan Klanke and Sethu Vijayakumar

Institute for Adaptive and Neural Computation
School of Informatics, University of Edinburgh, UK

September 2008
Motivation: Multi-task Learning

• Sharing information across tasks, e.g. exam score prediction, compiler performance prediction, robot inverse dynamics
• Assuming task relatedness can be detrimental (Caruana, 1997; Baxter, 2000)
• Task descriptors may be unavailable or difficult to define
  ◮ e.g. compiler performance prediction: code features, responses
• Here: learning inter-task dependencies based on task identities alone
• Correlations between tasks are induced directly within a GP framework
Outline

• The Model
• Making Predictions and Learning Hyperparameters
• Cancellation of Transfer
• Related Work
• Experiments and Results
• MTL in Robot Inverse Dynamics
• Conclusions and Discussion
Multi-task Setting

• Given a set X of N distinct inputs x_1, ..., x_N
• Complete set of responses:

      y = (y_{11}, ..., y_{N1}, y_{12}, ..., y_{N2}, ..., y_{1M}, ..., y_{NM})^T

  where y_{iℓ} is the response for the ℓth task on the ith input x_i
• Y: N × M matrix such that y = vec Y
• Goal: given observations y^o ⊂ y, make predictions of the unobserved values y^u (see the stacking sketch below)
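As a concrete illustration of the stacking convention, here is a minimal NumPy sketch (the 2-task, 3-input sizes are invented for the example): vec Y stacks the columns of Y, i.e. all N responses for task 1, then all N for task 2, and so on.

```python
import numpy as np

N, M = 3, 2                           # 3 inputs, 2 tasks (illustrative sizes)
Y = np.arange(N * M).reshape(N, M)    # Y[i, l] plays the role of y_{il}

# vec Y stacks the columns of Y: task 1's responses first, then task 2's
y = Y.flatten(order="F")              # column-major ("Fortran") flattening

print(Y)
# [[0 1]
#  [2 3]
#  [4 5]]
print(y)                              # [0 2 4 1 3 5]
```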
The Model: Multi-task GP

We place a (zero-mean) GP prior over the latent functions {f_ℓ}:

      ⟨f_ℓ(x) f_m(x′)⟩ = K^f_{ℓm} k^x(x, x′),      y_{iℓ} ~ N(f_ℓ(x_i), σ²_ℓ)

• K^f: PSD matrix that specifies the inter-task similarities
• k^x: covariance function over inputs
• σ²_ℓ: noise variance for the ℓth task
• Additionally, k^x is a stationary correlation function, e.g. squared exponential (a code sketch follows)
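A minimal sketch of this prior in NumPy (the length-scale, the 2×2 K^f, and the noise variances are all invented for illustration): under the stacking y = vec Y, the covariance of the stacked latent vector is the Kronecker product K^f ⊗ K^x.

```python
import numpy as np

def rbf(X, X2, length_scale=1.0):
    """Squared-exponential correlation function k^x (unit variance)."""
    d2 = (X[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)          # N = 5 training inputs

# Illustrative inter-task similarity matrix K^f (must be PSD) for M = 2 tasks
Kf = np.array([[1.0, 0.8],
               [0.8, 1.0]])
Kx = rbf(x, x)                        # input covariance K^x

# Prior covariance of the stacked latent vector f = vec F is K^f ⊗ K^x
Sigma_f = np.kron(Kf, Kx) + 1e-10 * np.eye(2 * len(x))  # jitter for stability

# Draw correlated latent functions for both tasks, then add task-specific noise
f = rng.multivariate_normal(np.zeros(2 * len(x)), Sigma_f)
sigma2 = np.array([0.01, 0.05])       # illustrative noise variances sigma^2_l
y = f + rng.normal(0.0, np.sqrt(np.repeat(sigma2, len(x))))
```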
Multi-task GP (2)

[Graphical models — left ("Other approaches"): the task functions f_1, f_2, f_3 with outputs y_1, y_2, y_3 are coupled only through shared hyperparameters θ; right ("Our approach"): the latent functions f_1, f_2, f_3 are themselves correlated.]

• Observations on one task can affect predictions on the others
• Bonilla et al. (2007), Yu et al. (2007): K^f_{ℓm} = k^f(t_ℓ, t_m) for task descriptors t_ℓ
• Multi-task clustering is easily modelled
[Figure: latent function f plotted against input x.]
Making Predictions

The mean prediction at a new data point x_* for task ℓ is

      f̄_ℓ(x_*) = (k^f_ℓ ⊗ k^x_*)^T Σ^{-1} y,    with    Σ = K^f ⊗ K^x + D ⊗ I

where:
• k^f_ℓ selects the ℓth column of K^f
• k^x_*: vector of covariances between x_* and the training points
• K^x: matrix of covariances between all pairs of training points
• D: diagonal matrix whose (ℓ, ℓ)th element is σ²_ℓ
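Written out naively, this is a few lines of NumPy (a sketch that forms Σ explicitly rather than exploiting its Kronecker structure, so it is only sensible for small N and M; `rbf`, `x`, `Kf`, `Kx`, `sigma2` and `y` are as in the earlier sketch):

```python
def predict_mean(x_star, ell, x, Kf, Kx, sigma2, y, length_scale=1.0):
    """Naive multi-task GP mean prediction for task `ell` at input `x_star`.

    Forms Sigma = K^f (x) K^x + D (x) I explicitly; fine for small N, M.
    """
    N = len(x)
    kx_star = rbf(np.array([x_star]), x, length_scale).ravel()  # k^x_*
    kf_ell = Kf[:, ell]                                         # l-th column of K^f
    D = np.diag(sigma2)                                         # task noise variances
    Sigma = np.kron(Kf, Kx) + np.kron(D, np.eye(N))
    k_star = np.kron(kf_ell, kx_star)                           # k^f_l (x) k^x_*
    return k_star @ np.linalg.solve(Sigma, y)

print(predict_mean(0.5, 0, x, Kf, Kx, sigma2, y))  # mean for task 0 at x = 0.5
```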
Learning Hyperparameters

Given y^o: learn θ^x (the parameters of k^x), K^f and the σ²_ℓ to maximize p(y^o | X). Note that y | X ~ N(0, Σ).

(a) Gradient-based method:
  ◮ parametrize K^f = LL^T (recall K^f must be PSD)
  ◮ exploit the Kronecker structure

(b) EM (see the sketch below):
  ◮ learning of θ^x and K^f in the M-step is decoupled
  ◮ closed-form updates for K^f and D
  ◮ K^f guaranteed PSD:

      K̂^f = N^{-1} F̂^T (K^x(θ̂^x))^{-1} F̂
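A minimal sketch of that M-step update, assuming the E-step has already produced an N × M matrix `F_hat` of expected latent function values (the E-step itself is omitted here):

```python
def m_step_Kf(F_hat, Kx):
    """Closed-form M-step update: K^f = N^{-1} F^T (K^x)^{-1} F.

    F_hat: (N, M) matrix of expected latent values from the E-step.
    The quadratic form guarantees the updated K^f is PSD whenever K^x is.
    """
    N = F_hat.shape[0]
    return F_hat.T @ np.linalg.solve(Kx, F_hat) / N
```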
Noiseless Observations + Grid = Cancellation of Transfer

[Figure: three tasks f_1, f_2, f_3 observed at common inputs x_1, ..., x_4 (crosses), with a test input x_* (star).]

We can show that with a grid design and no observation noise,

      f̄(x_*, ℓ) = (k^x_*)^T (K^x)^{-1} y_{·ℓ}

• The predictions for task ℓ depend only on the targets y_{·ℓ}
• A similar result holds for the covariances
• This is known as autokrigeability in geostatistics (see the numerical check below)
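A quick numerical check of this cancellation (a sketch reusing `rbf` from the first code block; the data values are invented): with noiseless observations of every task at every input, the multi-task posterior mean coincides with the single-task one, no matter how strongly K^f correlates the tasks.

```python
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 4)            # grid: every task observed at every input
N, M = len(x), 2

Kf = np.array([[1.0, 0.9],
               [0.9, 1.0]])             # strongly correlated tasks
Kx = rbf(x, x) + 1e-10 * np.eye(N)      # jitter for numerical stability
Y = rng.standard_normal((N, M))         # arbitrary noiseless targets
y = Y.flatten(order="F")                # y = vec Y

x_star, ell = 0.3, 0
kx_star = rbf(np.array([x_star]), x).ravel()

# Multi-task prediction (no noise, so Sigma = K^f (x) K^x)
k_star = np.kron(Kf[:, ell], kx_star)
multi = k_star @ np.linalg.solve(np.kron(Kf, Kx), y)

# Single-task prediction using task ell's targets only
single = kx_star @ np.linalg.solve(Kx, Y[:, ell])

print(np.isclose(multi, single))        # True: the transfer cancels
```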
Related Work

• Early work on MTL: Thrun (1996), Caruana (1997)
• Minka (1997) and some later GP work assume that multiple tasks share the same hyperparameters but are otherwise uncorrelated
• Co-kriging in geostatistics
• Evgeniou et al. (2005) induce correlations between tasks via a correlated prior over linear regression parameters
• Conti & O'Hagan (2007): emulating multi-output simulators
• Use of task descriptors, so that K^f_{ℓm} = k^f(t_ℓ, t_m): e.g. Yu et al. (2007), Bonilla et al. (2007)
• The semiparametric latent factor model (SLFM) of Teh et al. (2005) has P latent processes, each with its own covariance function; noiseless outputs are obtained by linearly mixing these latent functions. Our model is similar but simpler, in that all of the P latent processes share the same covariance function; this reduces the number of free parameters to be fitted and should help to minimize overfitting
Experiments: Compiler Performance Prediction

• y: speed-up of a program (task) when a transformation sequence x is applied
• 11 C programs (tasks), 13 transformations, sequences of length 5
• "Bag-of-characters" representation for x (see the sketch below)
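A sketch of what such a featurization might look like (the transformation alphabet and encoding details here are assumptions for illustration; the exact feature map used in the experiments may differ): each sequence is encoded by how many times each transformation occurs, discarding order.

```python
from collections import Counter

# Hypothetical alphabet: one character per compiler transformation (13 total)
TRANSFORMS = "ABCDEFGHIJKLM"

def bag_of_characters(seq):
    """Encode a transformation sequence as counts per transformation."""
    counts = Counter(seq)
    return [counts.get(t, 0) for t in TRANSFORMS]

print(bag_of_characters("ABBAC"))   # [2, 2, 1, 0, ..., 0] -- order is discarded
```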