Bayesian neural networks: a function space view tour


  1. Bayesian neural networks: a function space view tour
     Yingzhen Li, Microsoft Research Cambridge

  2. Neural networks 101
     Let's say we want to classify different types of cats
     • x: input images; y: output label "cat"
     • build a neural network (with param. W): p(y | x, W) = softmax(f_W(x))
     A typical neural network: f_W(x) = W_L φ(W_{L-1} φ(··· φ(W_1 x + b_1) ···) + b_{L-1}) + b_L
     For the l-th layer: h_l = φ(W_l h_{l-1} + b_l), with h_1 = φ(W_1 x + b_1)
     Parameters: W = {W_1, b_1, ..., W_L, b_L}; nonlinearity: φ(·)

  3. Neural networks 101
     Let's say we want to classify different types of cats
     • x: input images; y: output label "cat"
     • build a neural network (with param. W): p(y | x, W) = softmax(f_W(x))
     Typical deep learning solution: train the neural network weights by
     maximum likelihood estimation (MLE) given a dataset D = {(x_n, y_n)}_{n=1}^N:
     W* = argmax_W Σ_{n=1}^N log p(y_n | x_n, W)
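A minimal numpy sketch of the model and objective on slides 2-3: the forward pass f_W(x), the softmax likelihood, and the (negative) log-likelihood that MLE optimises. The tanh nonlinearity, layer sizes, and toy data are illustrative assumptions rather than choices from the talk.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, Ws, bs, phi=np.tanh):
    """f_W(x): h_l = phi(W_l h_{l-1} + b_l) for hidden layers, linear output layer."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = phi(W @ h + b)
    return Ws[-1] @ h + bs[-1]           # logits; p(y | x, W) = softmax(f_W(x))

def neg_log_lik(Ws, bs, X, Y):
    """-sum_n log p(y_n | x_n, W) over inputs X and integer class labels Y."""
    return -sum(np.log(softmax(forward(x, Ws, bs))[y]) for x, y in zip(X, Y))

# Toy run with made-up sizes: 10-dim inputs, 50 hidden units, 3 classes
rng = np.random.default_rng(0)
Ws = [0.1 * rng.standard_normal((50, 10)), 0.1 * rng.standard_normal((3, 50))]
bs = [np.zeros(50), np.zeros(3)]
X, Y = rng.standard_normal((20, 10)), rng.integers(0, 3, size=20)
print(neg_log_lik(Ws, bs, X, Y))   # MLE = argmin of this, i.e. argmax of the log-likelihood
```

In practice this objective would be minimised with stochastic gradient descent in an autodiff framework; the sketch only spells out what is being optimised.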

  4. Bayesian neural networks 101
     Let's say we want to classify different types of cats
     • x: input images; y: output label "cat"
     • build a neural network (with param. W): p(y | x, W) = softmax(f_W(x))
     A Bayesian solution: put a prior distribution p(W) over W
     • compute the posterior p(W | D) given a dataset D = {(x_n, y_n)}_{n=1}^N:
       p(W | D) ∝ p(W) Π_{n=1}^N p(y_n | x_n, W)
     • Bayesian predictive inference: p(y* | x*, D) = E_{p(W|D)}[p(y* | x*, W)]

  5. Bayesian neural networks 101
     Let's say we want to classify different types of cats
     • x: input images; y: output label "cat"
     • build a neural network (with param. W): p(y | x, W) = softmax(f_W(x))
     In practice: p(W | D) is intractable
     • First find an approximation q(W) ≈ p(W | D)
     • In prediction, do Monte Carlo sampling:
       p(y* | x*, D) ≈ (1/K) Σ_{k=1}^K p(y* | x*, W_k),  W_k ∼ q(W)
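A sketch of the Monte Carlo predictive above, assuming a fully factorised Gaussian form for q(W) (the slides only require some approximation q(W) ≈ p(W | D)); it reuses the forward and softmax helpers from the sketch after slide 3.

```python
import numpy as np

def mc_predict(x_star, q_mean, q_std, K=100, seed=0):
    """p(y*|x*,D) ≈ (1/K) sum_k softmax(f_{W_k}(x*)) with W_k ~ q(W).
    q_mean/q_std hold per-parameter means and std devs of an assumed
    mean-field Gaussian q, e.g. q_mean = {"W": [...], "b": [...]}."""
    rng = np.random.default_rng(seed)
    probs = 0.0
    for _ in range(K):
        Ws = [m + s * rng.standard_normal(m.shape) for m, s in zip(q_mean["W"], q_std["W"])]
        bs = [m + s * rng.standard_normal(m.shape) for m, s in zip(q_mean["b"], q_std["b"])]
        probs = probs + softmax(forward(x_star, Ws, bs))   # helpers from the earlier sketch
    return probs / K
```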

  6. Applications of Bayesian neural networks
     Detecting adversarial examples: Li and Gal 2017

  7. Applications of Bayesian neural networks
     Image segmentation: Kendall and Gal 2017

  8. Applications of Bayesian neural networks
     Medical imaging (super-resolution): Tanno et al. 2019

  9. Bayesian neural networks vs Gaussian processes
     Why learn about BNNs in a summer school about GPs?
     • mean-field BNNs have GP limits
     • approximate inference on GPs has links to BNNs
     • approximate inference on BNNs can leverage GP techniques
     Bayesian Deep Learning

  10. BNN → GP

  11. Bayesian neural networks → Gaussian process
      Quick refresher: central limit theorem
      Theorem. Let x_1, ..., x_N be i.i.d. samples from p(x), where p(x) has mean µ and covariance Σ. Then
      (1/N) Σ_{n=1}^N x_n →_d N(µ, (1/N) Σ)  as N → +∞
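A quick numerical check of the refresher, with an arbitrarily chosen base distribution (exponential with mean 1, variance 1): the sample mean of N i.i.d. draws concentrates around µ with variance close to Σ/N.

```python
import numpy as np

rng = np.random.default_rng(0)
N, repeats = 1000, 5000
# Each row is one sample mean of N i.i.d. draws from Exp(1), so mu = 1, sigma^2 = 1
means = rng.exponential(scale=1.0, size=(repeats, N)).mean(axis=1)
print(means.mean(), means.var())   # ≈ 1 and ≈ 1/N = 0.001
```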

  12. Bayesian neural networks → Gaussian process
      Consider a one-hidden-layer BNN with a mean-field prior and bounded non-linearity:
      f(x) = Σ_{m=1}^M v_m φ(w_m^T x + b_m),
      W_1 = [w_1, ..., w_M]^T,  b = [b_1, ..., b_M],  W_2 = [v_1, ..., v_M],  W = {W_1, b, W_2}
      Mean-field prior:
      p(W) = p(W_1) p(b) p(W_2),  p(W_1) = Π_m p(w_m),  p(b) = Π_m p(b_m),  p(W_2) = Π_m p(v_m)
      The same prior for each connection weight/bias:
      p(w_i) = p(w_j),  p(b_i) = p(b_j),  p(v_i) = p(v_j),  ∀ i, j
      (Radford Neal's derivation in his PhD thesis, 1994)

  13. Bayesian neural networks → Gaussian process
      Consider a one-hidden-layer BNN with a mean-field prior and bounded non-linearity:
      f(x) = Σ_{m=1}^M v_m φ(w_m^T x + b_m)
      The same prior for each connection weight/bias: p(w_i) = p(w_j),  p(b_i) = p(b_j),  ∀ i, j
      ⇒ the same distribution for the hidden unit outputs h_i(x) = φ(w_i^T x + b_i):
      h_i(x) ⊥ h_j(x),  h_i(x) =_d h_j(x)
      i.e. h_1(x), ..., h_M(x) are i.i.d. samples from some implicitly defined distribution
      (Radford Neal's derivation in his PhD thesis, 1994)

  14. Bayesian neural networks → Gaussian process
      Consider a one-hidden-layer BNN with a mean-field prior and bounded non-linearity:
      f(x) = Σ_{m=1}^M v_m φ(w_m^T x + b_m)
      Mean-field prior with the same distribution for the second-layer connection weights:
      v_i ⊥ W_1, b,  p(v_i) = p(v_j),  ∀ i, j
      ⇒ v_i h_i(x) ⊥ v_j h_j(x),  v_i h_i(x) =_d v_j h_j(x)
      so f(x) is a sum of i.i.d. random variables
      (Radford Neal's derivation in his PhD thesis, 1994)

  15. Bayesian neural networks → Gaussian process
      Consider a one-hidden-layer BNN with a mean-field prior and bounded non-linearity:
      f(x) = Σ_{m=1}^M v_m φ(w_m^T x + b_m)
      If we make E[v_m] = 0 and let V[v_m] = σ_v²/M scale as O(1/M):
      E[f(x)] = Σ_{m=1}^M E[v_m] E[h_m(x)] = 0
      V[f(x)] = Σ_{m=1}^M V[v_m h_m(x)] = Σ_{m=1}^M (σ_v²/M) E[h_m(x)²] → σ_v² E[h(x)²]
      (Radford Neal's derivation in his PhD thesis, 1994)

  16. Bayesian neural networks → Gaussian process
      Consider a one-hidden-layer BNN with a mean-field prior and bounded non-linearity:
      f(x) = Σ_{m=1}^M v_m φ(w_m^T x + b_m)
      If we make E[v_m] = 0 and let V[v_m] = σ_v²/M scale as O(1/M):
      Cov[f(x), f(x')] = Σ_{m=1}^M (σ_v²/M) E[h_m(x) h_m(x')] → σ_v² E[h(x) h(x')]
      (Radford Neal's derivation in his PhD thesis, 1994)

  17. Bayesian neural networks → Gaussian process
      Consider a one-hidden-layer BNN with a mean-field prior and bounded non-linearity:
      f(x) = Σ_{m=1}^M v_m φ(w_m^T x + b_m)
      If we make E[v_m] = 0 and let V[v_m] = σ_v²/M scale as O(1/M), then by the CLT:
      (f(x), f(x')) →_d N(0, K),  K(x, x') = σ_v² E[h(x) h(x')]
      This holds for any x, x'  ⇒  f ∼ GP(0, K(x, x'))
      (Radford Neal's derivation in his PhD thesis, 1994)
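A sketch that checks Neal's limit empirically under assumed standard Gaussian priors and a tanh nonlinearity (a bounded non-linearity, but neither choice is fixed by the slides): across prior draws of a wide one-hidden-layer network, the empirical covariance of (f(x), f(x')) should approach σ_v² E[h(x) h(x')].

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, draws = 3, 500, 4000       # input dim, hidden width, number of prior samples
sigma_v = 1.0                    # E[v_m] = 0, V[v_m] = sigma_v^2 / M
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

# Draw f(x) = sum_m v_m phi(w_m^T x + b_m) repeatedly under the mean-field prior
W = rng.standard_normal((draws, M, d))                        # w_m ~ N(0, I)
b = rng.standard_normal((draws, M))                           # b_m ~ N(0, 1)
v = (sigma_v / np.sqrt(M)) * rng.standard_normal((draws, M))  # v_m ~ N(0, sigma_v^2/M)
h1, h2 = np.tanh(W @ x1 + b), np.tanh(W @ x2 + b)
f1, f2 = (v * h1).sum(axis=1), (v * h2).sum(axis=1)

print(np.mean(f1 * f2))               # empirical Cov[f(x), f(x')] (both means are ~0)
print(sigma_v**2 * np.mean(h1 * h2))  # MC estimate of the kernel sigma_v^2 E[h(x) h(x')]
```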

  18. Bayesian neural networks → Gaussian process
      Recent extensions of Radford Neal's result:
      • deep and wide BNNs have GP limits
      • mean-field prior over weights
      • the activation function satisfies |φ(x)| ≤ c + A|x|
      • hidden layer widths strictly increasing to infinity
      Matthews et al. 2018, Lee et al. 2018

  19. Bayesian neural networks → Gaussian process
      Recent extensions of Radford Neal's result:
      • Bayesian CNNs have GP limits
      • convolution in a CNN = a fully connected layer applied to different locations in the image
      • # channels in a CNN = # hidden units in a fully connected NN
      Garriga-Alonso et al. 2019, Novak et al. 2019

  20. GP → BNN

  21. Gaussian process → Bayesian neural networks
      Exact GP inference can be very expensive. Predictive inference for GP regression:
      p(f* | X*, X, y) = N(f*; K_{*n}(K_{nn} + σ²I)^{-1} y,  K_{**} − K_{*n}(K_{nn} + σ²I)^{-1} K_{n*})
      K_{nn} ∈ R^{N×N},  (K_{nn})_{ij} = K(x_i, x_j)
      Inverting K_{nn} + σ²I has O(N³) cost!
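A small numpy sketch of the exact GP regression predictive above; the squared-exponential kernel, its hyperparameters, and the toy data are arbitrary choices for the illustration. The np.linalg.solve calls against K_nn + σ²I are the O(N³) bottleneck the slide points at.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel; an arbitrary choice of K for this illustration.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_star, noise=0.1):
    """Exact GP regression predictive mean and covariance at test inputs X_star."""
    Knn = rbf(X, X) + noise**2 * np.eye(len(X))
    Ksn = rbf(X_star, X)
    Kss = rbf(X_star, X_star)
    mean = Ksn @ np.linalg.solve(Knn, y)                  # O(N^3) solve
    cov = Kss - Ksn @ np.linalg.solve(Knn, Ksn.T)         # another O(N^3) solve
    return mean, cov

# Toy usage: noisy sine data
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mean, cov = gp_predict(X, y, np.linspace(-3, 3, 100)[:, None])
```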

  22. Gaussian process → Bayesian neural networks
      Quick refresher: Fourier (inverse) transform
      S(w) = ∫ s(t) e^{-itw} dt,  s(t) = ∫ S(w) e^{itw} dw

  23. Gaussian process → Bayesian neural networks
      Bochner's theorem (Fourier inverse transform):
      Theorem. A (properly scaled) translation-invariant kernel K(x, x') = K(x − x') can be represented as
      K(x, x') = E_{p(w)}[σ² e^{i w^T (x − x')}]
      for some distribution p(w).
      • Real-valued kernel ⇒ E_{p(w)}[σ² e^{i w^T (x − x')}] = E_{p(w)}[σ² cos(w^T (x − x'))]
      • cos(x − x') = 2 E_{p(b)}[cos(x + b) cos(x' + b)],  p(b) = Uniform[0, 2π]
      Rahimi and Recht 2007

  24. Gaussian process → Bayesian neural networks
      Bochner's theorem (Fourier inverse transform):
      Theorem. A (properly scaled) translation-invariant kernel K(x, x') = K(x − x') can be represented as
      K(x, x') = E_{p(w) p(b)}[σ² cos(w^T x + b) cos(w^T x' + b)]
      for some distribution p(w) and p(b) = Uniform[0, 2π].
      • Real-valued kernel ⇒ E_{p(w)}[σ² e^{i w^T (x − x')}] = E_{p(w)}[σ² cos(w^T (x − x'))]
      • cos(x − x') = 2 E_{p(b)}[cos(x + b) cos(x' + b)],  p(b) = Uniform[0, 2π]
      Rahimi and Recht 2007
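A quick numerical check of the cosine identity in the second bullet, with arbitrarily chosen test points x and x':

```python
import numpy as np

# cos(x - x') = 2 E_b[cos(x + b) cos(x' + b)],  b ~ Uniform[0, 2*pi]
rng = np.random.default_rng(0)
x, x2 = 0.7, -1.3
b = rng.uniform(0.0, 2 * np.pi, size=1_000_000)
print(np.cos(x - x2))                               # exact value
print(2 * np.mean(np.cos(x + b) * np.cos(x2 + b)))  # Monte Carlo estimate
```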

  25. Gaussian process → Bayesian neural networks
      Bochner's theorem (Fourier inverse transform):
      Theorem. A (properly scaled) translation-invariant kernel K(x, x') = K(x − x') can be represented as
      K(x, x') = E_{p(w) p(b)}[σ² cos(w^T x + b) cos(w^T x' + b)]
      for some distribution p(w) and p(b) = Uniform[0, 2π].
      • Monte Carlo approximation:
      K(x, x') ≈ K̃(x, x') = (σ²/M) Σ_{m=1}^M cos(w_m^T x + b_m) cos(w_m^T x' + b_m),  w_m ∼ p(w),  b_m ∼ p(b)
      Rahimi and Recht 2007

  26. Gaussian process → Bayesian neural networks
      Bochner's theorem (Fourier inverse transform):
      Theorem. A (properly scaled) translation-invariant kernel K(x, x') = K(x − x') can be represented as
      K(x, x') = E_{p(w) p(b)}[σ² cos(w^T x + b) cos(w^T x' + b)]
      for some distribution p(w) and p(b) = Uniform[0, 2π].
      • Monte Carlo approximation: define
      h_m(x) = cos(w_m^T x + b_m),  h(x) = [h_1(x), ..., h_M(x)],  w_m ∼ p(w),  b_m ∼ p(b)
      ⇒ K̃(x, x') = (σ²/M) h(x)^T h(x')
      Rahimi and Recht 2007
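A sketch of the random-feature construction on slides 25-26, instantiated for the RBF kernel K(x, x') = σ² exp(-‖x − x'‖²/2), whose spectral density p(w) is a standard Gaussian. That kernel choice and the √2 normalisation (which the slides absorb into the "properly scaled" kernel) are assumptions made here, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, sigma = 4, 2000, 1.0
W = rng.standard_normal((M, d))              # w_m ~ p(w) = N(0, I) for this kernel
b = rng.uniform(0.0, 2 * np.pi, size=M)      # b_m ~ Uniform[0, 2*pi]

def h(x):
    # Random Fourier feature map; the sqrt(2) makes (sigma^2/M) h(x)^T h(x') an unbiased
    # estimate of the exact kernel (the slides fold this constant into the scaling).
    return np.sqrt(2.0) * np.cos(W @ x + b)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
print(sigma**2 * np.exp(-0.5 * np.sum((x1 - x2) ** 2)))   # exact RBF kernel K(x, x')
print(sigma**2 / M * h(x1) @ h(x2))                       # K_tilde(x, x') via the features
```

With M random features, fitting a Bayesian linear model on h(x) costs O(NM² + M³) instead of the O(N³) exact GP solve, which is the point of the GP → BNN direction.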
