
The computations of acting agents and the agents acting in computations

Philipp Hennig, Research Group for Probabilistic Numerics, Max Planck Institute for Intelligent Systems, Tübingen, Germany. Presented at ICERM, 5 June 2017.


  1. The computations of acting agents and the agents acting in computations. Philipp Hennig, ICERM, 5 June 2017. Research Group for Probabilistic Numerics, Max Planck Institute for Intelligent Systems, Tübingen, Germany. Some of the presented work was supported by the Emmy Noether Programme of the DFG.

  2. Part I: The computations of acting agents (09:00–09:45)

     - a minimal introduction to machine learning
     - the computational tasks of learning agents
     - some special challenges, some house numbers

     Part II: The agents acting in computations (10:30–11:15)

     - computation is inference
     - new challenges require new answers
     - a computer science view on numerical computations

  3. An Acting Agent: autonomous interaction with a data source (from Hennig, Osborne, Girolami, Proc. Roy. Soc. A, 2015)

     [Diagram: a machine interacting with an environment. Labels: data $D$, parameters $\theta$, variables $x_t$; inference by quadrature; estimation by optimization; prediction ($x_{t+\delta t}$) by analysis; action ($a_t$) by control. Stages: learning / inference / system id., prediction, action.]

  4. The Very Foundation: probabilistic inference

     $$p(x \mid D) = \frac{p(x)\, p(D \mid x)}{\int p(x)\, p(D \mid x)\, dx}$$

     - prior $p(x)$: explicit representation of assumptions about latent variables
     - likelihood $p(D \mid x)$: explicit representation of assumptions about generation of data
     - posterior $p(x \mid D)$: structured uncertainty over prediction
     - evidence $\int p(x)\, p(D \mid x)\, dx$: marginal likelihood of model

     $$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{|2\pi\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^\intercal \Sigma^{-1} (x - \mu) \right)$$

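  5. Gaussian Inference: the link between probabilistic inference and linear algebra

     - products of Gaussians are Gaussians

       $$\mathcal{N}(x; a, A)\, \mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\, \mathcal{N}(a; b, A + B), \qquad C := (A^{-1} + B^{-1})^{-1}, \quad c := C (A^{-1} a + B^{-1} b)$$

     - marginals of Gaussians are Gaussians

       $$\int \mathcal{N}\left( \begin{bmatrix} x \\ y \end{bmatrix}; \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix} \right) dy = \mathcal{N}(x; \mu_x, \Sigma_{xx})$$

     - (linear) conditionals of Gaussians are Gaussians

       $$p(x \mid y) = \frac{p(x, y)}{p(y)} = \mathcal{N}\left( x; \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y),\; \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} \right)$$

     - linear projections of Gaussians are Gaussians

       $$p(z) = \mathcal{N}(z; \mu, \Sigma) \;\Rightarrow\; p(Az) = \mathcal{N}(Az; A\mu, A \Sigma A^\intercal)$$

     Bayesian inference becomes linear algebra: with $p(x) = \mathcal{N}(x; \mu, \Sigma)$ and $p(y \mid x) = \mathcal{N}(y; A^\intercal x + b, \Lambda)$,

     $$p(B^\intercal x + c \mid y) = \mathcal{N}\left[ B^\intercal x + c;\; B^\intercal \mu + c + B^\intercal \Sigma A (A^\intercal \Sigma A + \Lambda)^{-1} (y - A^\intercal \mu - b),\; B^\intercal \Sigma B - B^\intercal \Sigma A (A^\intercal \Sigma A + \Lambda)^{-1} A^\intercal \Sigma B \right]$$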
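As a quick sanity check of the product rule on this slide, the following is a minimal univariate sketch using only the Python standard library. The function names and the numerical values are illustrative assumptions, not part of the talk.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gaussian_product(a, A, b, B):
    """Return (c, C, Z) such that N(x; a, A) N(x; b, B) = Z * N(x; c, C)."""
    C = 1.0 / (1.0 / A + 1.0 / B)      # C := (A^-1 + B^-1)^-1
    c = C * (a / A + b / B)            # c := C (A^-1 a + B^-1 b)
    Z = normal_pdf(a, b, A + B)        # normalisation constant N(a; b, A + B)
    return c, C, Z

# Arbitrary illustrative means and variances.
a, A, b, B = 1.0, 2.0, -0.5, 0.5
c, C, Z = gaussian_product(a, A, b, B)

# Compare both sides of the identity at a few test locations.
max_rel_err = 0.0
for x in (-1.0, 0.0, 0.7, 2.5):
    lhs = normal_pdf(x, a, A) * normal_pdf(x, b, B)
    rhs = Z * normal_pdf(x, c, C)
    max_rel_err = max(max_rel_err, abs(lhs - rhs) / lhs)
```

The multivariate case replaces the scalar divisions with matrix inverses (or, preferably, linear solves) but has the same structure.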

  6. A Minimal Machine Learning Setup: nonlinear regression problem

     [Plot: observations $y$ against $x$]

     $$p(y \mid f_X) = \mathcal{N}(y; f_X, \sigma^2 I)$$

  7. Gaussian Parametric Regression, aka. general linear least-squares

     [Plot: $f(x)$ against $x$]

     $$f(x) = \phi(x)^\intercal w = \sum_i w_i \phi_i(x), \qquad p(w) = \mathcal{N}(w; \mu, \Sigma) \;\Rightarrow\; p(f) = \mathcal{N}(f; \phi^\intercal \mu, \phi^\intercal \Sigma \phi)$$

     $$\phi_i(x) = \mathbb{I}(x > a_i) \cdot c_i (x - a_i) \quad \text{(ReLU)}$$

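  9. Gaussian Parametric Regression, aka. general linear least-squares

     [Plot: $f(x)$ against $x$]

     $$p(y \mid w, \phi_X) = \mathcal{N}(y; \phi_X^\intercal w, \sigma^2 I)$$

     $$p(f_x \mid y, \phi_X) = \mathcal{N}\left( f_x;\; \phi_x^\intercal \mu + \phi_x^\intercal \Sigma \phi_X (\phi_X^\intercal \Sigma \phi_X + \sigma^2 I)^{-1} (y - \phi_X^\intercal \mu),\; \phi_x^\intercal \Sigma \phi_x - \phi_x^\intercal \Sigma \phi_X (\phi_X^\intercal \Sigma \phi_X + \sigma^2 I)^{-1} \phi_X^\intercal \Sigma \phi_x \right)$$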
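The posterior on this slide translates almost line by line into NumPy. The sketch below is a minimal illustration under stated assumptions: ReLU features with unit slopes ($c_i = 1$), a standard-normal prior on the weights, and synthetic data drawn from a sine curve. None of these choices, nor the function names, come from the talk.

```python
import numpy as np

def relu_features(x, a):
    """Feature matrix of shape (F, N) with phi_i(x) = max(x - a_i, 0), i.e. c_i = 1."""
    return np.maximum(x[None, :] - a[:, None], 0.0)

def parametric_posterior(x_train, y, x_test, a, mu, Sigma, sigma):
    """Posterior mean and covariance of f at x_test for f(x) = phi(x)^T w."""
    PX = relu_features(x_train, a)            # phi_X, shape (F, N)
    Px = relu_features(x_test, a)             # phi_x, shape (F, n)
    G = PX.T @ Sigma @ PX + sigma**2 * np.eye(len(x_train))   # (N, N) Gram matrix
    kxX = Px.T @ Sigma @ PX                   # phi_x^T Sigma phi_X, shape (n, N)
    mean = Px.T @ mu + kxX @ np.linalg.solve(G, y - PX.T @ mu)
    cov = Px.T @ Sigma @ Px - kxX @ np.linalg.solve(G, kxX.T)
    return mean, cov

# Synthetic illustration (assumed data, not from the slides).
rng = np.random.default_rng(0)
a = np.linspace(-8.0, 8.0, 20)                # ReLU kink locations a_i
x_train = rng.uniform(-8.0, 8.0, size=30)
y = np.sin(x_train) + 0.1 * rng.standard_normal(30)
x_test = np.linspace(-8.0, 8.0, 50)
mean, cov = parametric_posterior(x_train, y, x_test, a,
                                 mu=np.zeros(20), Sigma=np.eye(20), sigma=0.1)
```

Note that the only expensive operation is the solve against the $N \times N$ matrix $\phi_X^\intercal \Sigma \phi_X + \sigma^2 I$, which is the cost the later slides are concerned with.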

  10. The Choice of Prior Matters: Bayesian framework provides flexible yet explicit modelling language

      [Plot: $f(x)$ against $x$]

      $$\phi_i(x) = \theta \exp\left( -\frac{(x - c_i)^2}{2\lambda^2} \right)$$

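  12. Popular extension no. 1: requires large-scale linear algebra

      $$p(f_x \mid y, \phi_X) = \mathcal{N}\left( f_x;\; \phi_x^\intercal \mu + \phi_x^\intercal \Sigma \phi_X (\phi_X^\intercal \Sigma \phi_X + \sigma^2 I)^{-1} (y - \phi_X^\intercal \mu),\; \phi_x^\intercal \Sigma \phi_x - \phi_x^\intercal \Sigma \phi_X (\phi_X^\intercal \Sigma \phi_X + \sigma^2 I)^{-1} \phi_X^\intercal \Sigma \phi_x \right)$$

      - set $\mu = 0$
      - aim for closed-form expression of kernel $\phi_a^\intercal \Sigma \phi_b$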

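  13. Features are cheap, so let's use a lot: an example [DJC MacKay, 1998]

      - For simplicity, let's fix $\Sigma = \frac{\sigma^2 (c_{\max} - c_{\min})}{F} I$, thus:

        $$\phi(x_i)^\intercal \Sigma \phi(x_j) = \frac{\sigma^2 (c_{\max} - c_{\min})}{F} \sum_{\ell=1}^{F} \phi_\ell(x_i) \phi_\ell(x_j)$$

      - especially, for $\phi_\ell(x) = \exp\left( -\frac{(x - c_\ell)^2}{2\lambda^2} \right)$:

        $$\begin{aligned} \phi(x_i)^\intercal \Sigma \phi(x_j) &= \frac{\sigma^2 (c_{\max} - c_{\min})}{F} \sum_{\ell=1}^{F} \exp\left( -\frac{(x_i - c_\ell)^2}{2\lambda^2} \right) \exp\left( -\frac{(x_j - c_\ell)^2}{2\lambda^2} \right) \\ &= \frac{\sigma^2 (c_{\max} - c_{\min})}{F} \exp\left( -\frac{(x_i - x_j)^2}{4\lambda^2} \right) \sum_{\ell=1}^{F} \exp\left( -\frac{\left( c_\ell - \frac{1}{2}(x_i + x_j) \right)^2}{\lambda^2} \right) \end{aligned}$$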

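  14. Features are cheap, so let's use a lot: an example [DJC MacKay, 1998]

      $$\phi(x_i)^\intercal \Sigma \phi(x_j) = \frac{\sigma^2 (c_{\max} - c_{\min})}{F} \exp\left( -\frac{(x_i - x_j)^2}{4\lambda^2} \right) \sum_{\ell=1}^{F} \exp\left( -\frac{\left( c_\ell - \frac{1}{2}(x_i + x_j) \right)^2}{\lambda^2} \right)$$

      - now increase $F$ so that the number of features in an interval $\delta c$ approaches $\frac{F \cdot \delta c}{c_{\max} - c_{\min}}$:

        $$\phi(x_i)^\intercal \Sigma \phi(x_j) \to \sigma^2 \exp\left( -\frac{(x_i - x_j)^2}{4\lambda^2} \right) \int_{c_{\min}}^{c_{\max}} \exp\left( -\frac{\left( c - \frac{1}{2}(x_i + x_j) \right)^2}{\lambda^2} \right) dc$$

      - let $c_{\min} \to -\infty$, $c_{\max} \to \infty$; the remaining Gaussian integral evaluates to $\sqrt{\pi}\,\lambda$, so

        $$k(x_i, x_j) := \phi(x_i)^\intercal \Sigma \phi(x_j) \to \sqrt{\pi}\, \lambda \sigma^2 \exp\left( -\frac{(x_i - x_j)^2}{4\lambda^2} \right)$$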
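The limit on this slide can be checked numerically. The sketch below, using only the standard library, evaluates the finite feature sum with $\Sigma = \sigma^2 (c_{\max} - c_{\min})/F \cdot I$ on a regular grid of centres and compares it with the closed-form Gaussian integral, whose value is $\sqrt{\pi}\,\lambda$. The grid placement, the parameter values and the function names are assumptions chosen for illustration.

```python
import math

def finite_feature_kernel(xi, xj, F, c_min, c_max, lam, sigma):
    """phi(xi)^T Sigma phi(xj) for F Gaussian features with evenly spaced centres."""
    total = 0.0
    for l in range(F):
        c = c_min + (l + 0.5) * (c_max - c_min) / F    # centre of the l-th cell
        total += (math.exp(-(xi - c) ** 2 / (2 * lam**2))
                  * math.exp(-(xj - c) ** 2 / (2 * lam**2)))
    return sigma**2 * (c_max - c_min) / F * total

def limit_kernel(xi, xj, lam, sigma):
    """Closed form of the F -> infinity, c_min -> -inf, c_max -> inf limit."""
    return math.sqrt(math.pi) * lam * sigma**2 * math.exp(-(xi - xj) ** 2 / (4 * lam**2))

# Wide centre range and many features, so the sum is close to the integral.
approx = finite_feature_kernel(0.3, -1.2, F=2000, c_min=-30.0, c_max=30.0,
                               lam=1.5, sigma=0.8)
exact = limit_kernel(0.3, -1.2, lam=1.5, sigma=0.8)
```

The two values agree to many digits here because the centres are dense relative to $\lambda$ and the range $[c_{\min}, c_{\max}]$ extends far beyond both inputs.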

  15. Gaussian Process Regression, aka. Kriging, kernel-ridge regression, ...

      [Plot: $f(x)$ against $x$]

      $$p(f) = \mathcal{GP}(0, k), \qquad k(a, b) = \exp\left( -\frac{(a - b)^2}{2\lambda^2} \right)$$

  16. Gaussian Process Regression, aka. Kriging, kernel-ridge regression, ...

      [Plot: $f(x)$ against $x$]

      $$p(f \mid y) = \mathcal{GP}\left( f_x;\; k_{xX} (k_{XX} + \sigma^2 I)^{-1} y,\; k_{xx} - k_{xX} (k_{XX} + \sigma^2 I)^{-1} k_{Xx} \right)$$
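A minimal NumPy sketch of the posterior on this slide follows. It uses a Cholesky factorisation of $k_{XX} + \sigma^2 I$ rather than an explicit inverse, which is a common implementation choice and not something the slide prescribes. The training data and function names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, lam=1.0):
    """k(a, b) = exp(-(a - b)^2 / (2 lam^2)) for 1-D input arrays a and b."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * lam**2))

def gp_posterior(X, y, x, sigma=0.1, lam=1.0):
    """Posterior mean and covariance of a zero-mean GP at test inputs x."""
    K = rbf_kernel(X, X, lam) + sigma**2 * np.eye(len(X))    # k_XX + sigma^2 I
    L = np.linalg.cholesky(K)                                # K = L L^T
    # Two triangular systems give alpha = K^{-1} y. A dedicated triangular
    # solver would be faster; the general solver keeps the sketch NumPy-only.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    kxX = rbf_kernel(x, X, lam)
    V = np.linalg.solve(L, kxX.T)                            # L^{-1} k_Xx
    mean = kxX @ alpha
    cov = rbf_kernel(x, x, lam) - V.T @ V                    # k_xx - k_xX K^{-1} k_Xx
    return mean, cov

# With very small noise, the posterior mean nearly interpolates the data.
X = np.array([-4.0, -1.0, 0.0, 2.0, 5.0])
y = np.sin(X)
mean, cov = gp_posterior(X, y, X, sigma=1e-3)
```

Writing the covariance as $k_{xx} - V^\intercal V$ keeps it symmetric by construction, which tends to behave better numerically than forming the inverse.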

  17. The prior still matters: just one other example out of the space of kernels

      [Plot: $f(x)$ against $x$]

      For $\phi_i(x) = \mathbb{I}(x > c_i)(x - c_i)$, an analogous limit gives $p(f) = \mathcal{GP}(0, k)$ with

      $$k(a, b) = \theta^2 \left( \tfrac{1}{3} \min(a, b)^3 + \tfrac{1}{2} |a - b| \min(a, b)^2 \right),$$

      the integrated Wiener process, aka. cubic splines.

      More on GPs in Paris Perdikaris' tutorial. More on nonparametric models in Neil Lawrence's and Tamara Broderick's talks?
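The integrated Wiener kernel can be checked against a finite sum of ReLU features in the same way as the Gaussian case. The sketch below assumes feature locations $c_\ell$ evenly spaced on $[0, c_{\max}]$, non-negative inputs, and a prior covariance scaled as $\theta^2$ times the grid spacing. These are assumptions made by analogy with the Gaussian-feature construction, since the slide does not spell them out.

```python
def relu_feature_kernel(a, b, F, c_max, theta=1.0):
    """theta^2 * h * sum_l relu(a - c_l) * relu(b - c_l), centres on [0, c_max]."""
    h = c_max / F                              # spacing between feature locations
    total = 0.0
    for l in range(F):
        c = (l + 0.5) * h
        total += max(a - c, 0.0) * max(b - c, 0.0)
    return theta**2 * h * total

def integrated_wiener_kernel(a, b, theta=1.0):
    """Closed form: theta^2 (min^3 / 3 + |a - b| min^2 / 2), for a, b >= 0."""
    m = min(a, b)
    return theta**2 * (m**3 / 3.0 + abs(a - b) * m**2 / 2.0)

approx = relu_feature_kernel(2.0, 3.5, F=100000, c_max=10.0)
exact = integrated_wiener_kernel(2.0, 3.5)     # 8/3 + 1.5 * 4 / 2 = 17/3
```

The closed form follows from $\int_0^{m} (a - c)(b - c)\, dc$ with $m = \min(a, b)$, which is what the finite sum approximates.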

  20. The Computational Challenge: large-scale linear algebra

      $$\alpha := (k_{XX} + \sigma^2 I)^{-1} y, \qquad k_{aX} (k_{XX} + \sigma^2 I)^{-1} k_{Xb}, \qquad \log |k_{XX} + \sigma^2 I|$$

      where $k_{XX} + \sigma^2 I \in \mathbb{R}^{N \times N}$ is symmetric positive definite.

      Methods in wide use:

      - exact linear algebra (BLAS), for $N \lesssim 10^4$ (because $O(N^3)$)
      - (rarely:) iterative Krylov solvers (in particular conjugate gradients), for $N \lesssim 10^5$

      For large-scale ($O(NM^2)$):

      - inducing point methods, Nyström, etc., using iid. structure of data: $k_{ab} \approx \tilde{k}_{au} \Omega^{-1} \tilde{k}_{ub}$ with $\Omega^{-1} \in \mathbb{R}^{M \times M}$ (Williams & Seeger, 2001; Quiñonero & Rasmussen, 2005; Snelson & Ghahramani, 2007; Titsias, 2009)
      - spectral expansions, using algebraic properties of kernel (Rahimi & Recht 2008; 2009)
      - in univariate setting: filtering, using Markov structure (Särkkä 2013)

      Both are linear time, with finite error. Bridge to iterative methods is beginning to form, via sub-space recycling (de Roos & P.H., arXiv 1706.00241, 2017).
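To make the iterative option concrete, here is a minimal sketch of the textbook conjugate gradient method applied to $(k_{XX} + \sigma^2 I)\,\alpha = y$. It is not the sub-space recycling method cited on the slide, and the kernel, noise level and data are illustrative assumptions.

```python
import numpy as np

def conjugate_gradients(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive definite matrix A."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                      # residual
    p = r.copy()                       # search direction
    rs = r @ r
    for _ in range(max_iter or len(b)):
        if np.sqrt(rs) < tol:          # converged: residual norm below tolerance
            break
        Ap = A @ p                     # the only O(N^2) operation per iteration
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p      # next A-conjugate direction
        rs = rs_new
    return x

# Illustrative problem: RBF kernel matrix on a grid plus noise variance 0.25.
X = np.linspace(-8.0, 8.0, 200)
K = np.exp(-((X[:, None] - X[None, :]) ** 2) / 2.0)
A = K + 0.5**2 * np.eye(len(X))
y = np.sin(X)
alpha = conjugate_gradients(A, y, max_iter=1000)
```

Each iteration costs one matrix-vector product, so stopping after far fewer than $N$ iterations is what makes the method attractive compared with an $O(N^3)$ factorisation.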

  21. Popular extension no. 2: requires large-scale nonlinear optimization

      Maximum Likelihood estimation: assume $\phi(x) = \phi_\theta(x)$

      $$L(y; \theta, w) = -\log p(y \mid \phi, w) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \| y_i - \phi_\theta(x_i)^\intercal w \|^2 + \text{const.}$$

      [Diagram: a feed-forward network mapping input $x_i$ through features $\phi_1(x_i), \phi_2(x_i), \dots, \phi_M(x_i)$ with parameters $\theta$, combined by weights $w$ into output $y_i$.]

  22. Learning Features: a (in general) non-convex, non-linear optimization problem

      $$L(y; \theta, w) = -\log p(y \mid \phi, w) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \| y_i - \phi_\theta(x_i)^\intercal w \|^2 + \text{const.}$$

      $$\nabla_\theta L = \frac{1}{\sigma^2} \sum_{i=1}^{N} -\left( y_i - \phi_\theta(x_i)^\intercal w \right) \cdot \underbrace{w^\intercal \nabla_\theta \phi_\theta(x_i)}_{\text{“back-propagation”}}$$

      [Plot: $f(x)$ against $x$]
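The gradient on this slide can be written out for a small concrete case and compared with finite differences. The sketch below assumes hypothetical features $\phi_{\theta,j}(x) = \tanh(\theta_j x)$, a fixed weight vector $w$, and made-up data. The slide itself leaves $\phi_\theta$ general.

```python
import math

def predict(x, theta, w):
    """f(x) = phi_theta(x)^T w with assumed features phi_j(x) = tanh(theta_j * x)."""
    return sum(wj * math.tanh(tj * x) for wj, tj in zip(w, theta))

def loss(xs, ys, theta, w, sigma):
    """L = 1 / (2 sigma^2) * sum_i (y_i - phi_theta(x_i)^T w)^2, dropping the constant."""
    return sum((y - predict(x, theta, w)) ** 2 for x, y in zip(xs, ys)) / (2 * sigma**2)

def grad_theta(xs, ys, theta, w, sigma):
    """Analytic gradient: (1 / sigma^2) * sum_i -(y_i - f_i) * w_j * d phi_j / d theta_j."""
    g = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        residual = y - predict(x, theta, w)
        for j, (wj, tj) in enumerate(zip(w, theta)):
            dphi = x * (1.0 - math.tanh(tj * x) ** 2)    # derivative of tanh(theta_j x)
            g[j] += -residual * wj * dphi / sigma**2
    return g

def numeric_grad(xs, ys, theta, w, sigma, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    g = []
    for j in range(len(theta)):
        up = list(theta); up[j] += eps
        down = list(theta); down[j] -= eps
        g.append((loss(xs, ys, up, w, sigma) - loss(xs, ys, down, w, sigma)) / (2 * eps))
    return g

# Made-up data and parameters for the check.
xs = [-2.0, -1.0, 0.5, 1.5]
ys = [math.sin(x) for x in xs]
theta, w, sigma = [0.5, -1.2], [1.0, 0.3], 0.5
grad = grad_theta(xs, ys, theta, w, sigma)
numeric = numeric_grad(xs, ys, theta, w, sigma)
```

For deeper feature maps, the factor $\nabla_\theta \phi_\theta(x_i)$ is itself computed by the chain rule through the layers, which is what automatic differentiation frameworks do.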

  23. Deep Learning (really just a quick peek)

      In practice:

      - multiple input dimensions (e.g. pixel intensities)
      - multi-dimensional output (e.g. structured sentences)
      - multiple feature layers
      - structured layers (convolutions, pooling, pyramids, etc.)

      [Diagram: a multi-layer network for datum $i$, with inputs $x^i_1, x^i_2, \dots, x^i_{M_0}$, feature layers $\phi^i_1, \phi^i_2, \dots, \phi^i_{M_1}$ and $\xi^i_1, \xi^i_2, \dots, \xi^i_{M_2}$, and outputs $y^i_1, y^i_2, \dots, y^i_{M_o}$.]

  24. Deep Learning has become Mainstream: an increasingly professional industry

      Krizhevsky, Sutskever & Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, Adv. in Neural Information Processing Systems (NIPS 2012) 25, pp. 1097–1105
