Some Bayesian extensions of neural network-based graphon approximations


  1. Some Bayesian extensions of neural network-based graphon approximations. Creighton Heaukulani. Joint work with Onno Kampman (Hong Kong). EcoSta 2018, Hong Kong, June 2018.

  2. Overview 1. Review neural network graphon approximation and its gradient-based inference. When are nnets useful?

  3. Overview 1. Review neural network graphon approximation and its gradient-based inference. When are nnets useful? 2. Consider variational inference in such a model and why.

  4. Overview 1. Review neural network graphon approximation and its gradient-based inference. When are nnets useful? 2. Consider variational inference in such a model and why. 3. Implement an infinite stochastic blockmodel, with good reason.

  5. Overview 1. Review neural network graphon approximation and its gradient-based inference. When are nnets useful? 2. Consider variational inference in such a model and why. 3. Implement an infinite stochastic blockmodel, with good reason. 4. Review the pros and cons of being Bayesian here and other lessons learned along the way.

  6. Relational data modeling Figure from: https://buildingrecommenders.wordpress.com/2015/11/18/overview-of-recommender-algorithms-part-2/

  7. Relational data modeling “Minibatch learning” with these two data structures... ◮ What’s the appropriate minibatch? ◮ Which entries are missing? Lee et al. [2017] Figure from: https://buildingrecommenders.wordpress.com/2015/11/18/overview-of-recommender-algorithms-part-2/

  8. Matrix factorization... linear models Figure from: https://buildingrecommenders.wordpress.com/2015/11/18/overview-of-recommender-algorithms-part-2/

  9. Matrix factorization... linear models. The $(n, m)$-th entry of the matrix is modeled as $X_{n,m} \approx U_n^T V_m = \sum_{d=1}^{D} U_{n,d} V_{m,d}$, for some $U_n \in \mathbb{R}^D$ and $V_m \in \mathbb{R}^D$, with $D$ small. A linear model. Figure from: https://buildingrecommenders.wordpress.com/2015/11/18/overview-of-recommender-algorithms-part-2/
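A minimal NumPy sketch of the linear factorization above; the sizes and random features are illustrative, not taken from the slides:

```python
import numpy as np

N, M, D = 100, 80, 5          # users, items, latent dimension (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(size=(N, D))   # per-user features U_n
V = rng.normal(size=(M, D))   # per-item features V_m

# Linear matrix factorization: X_hat[n, m] = U_n^T V_m = sum_d U[n, d] * V[m, d]
X_hat = U @ V.T

n, m = 3, 7
assert np.isclose(X_hat[n, m], U[n] @ V[m])
```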

  10. Neural network matrix factorization (Dziugaite and Roy [2015]). $f(\cdot\,; \theta)$ is a neural network with parameters $\theta$.

  11. Neural network matrix factorization (Dziugaite and Roy [2015]). $f(\cdot\,; \theta)$ is a neural network with parameters $\theta$. The $(n, m)$-th entry of the matrix is modeled as $X_{n,m} \approx f(U_n, V_m; \theta)$, replacing the linear form $U_n^T V_m = \sum_{d=1}^{D} U_{n,d} V_{m,d}$. Generalized to a nonlinear model.

  12. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$. E.g., $X_{n,m} \approx W_o\, \sigma(W_h [U_n, V_m] + b_h) + b_o$.
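A small Python sketch of the one-hidden-layer form quoted on the slide, with the logistic sigmoid standing in for the nonlinearity $\sigma$; all sizes and parameter values are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, H = 5, 16                                           # latent and hidden sizes (illustrative)
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(H, 2 * D)), np.zeros(H)    # hidden layer
W_o, b_o = rng.normal(size=(1, H)), np.zeros(1)        # output layer

def f(U_n, V_m):
    """One-hidden-layer f(U_n, V_m; theta) = W_o sigma(W_h [U_n, V_m] + b_h) + b_o."""
    h = sigmoid(W_h @ np.concatenate([U_n, V_m]) + b_h)
    return (W_o @ h + b_o).item()

U_n, V_m = rng.normal(size=D), rng.normal(size=D)
x_hat = f(U_n, V_m)                # matrix-factorization version: X_nm ~ f(U_n, V_m; theta)
p_edge = sigmoid(f(U_n, V_m))      # network version: P{X_nm = 1} ~ sigma(f(U_n, V_m; theta))
```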

  13. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$. E.g., $X_{n,m} \approx W_o\, \sigma(W_h [U_n, V_m] + b_h) + b_o$. Within the graphon modeling/approximation framework (Lloyd et al. [2012], Orbanz and Roy [2015]).

  14. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$. E.g., $X_{n,m} \approx W_o\, \sigma(W_h [U_n, V_m] + b_h) + b_o$. Within the graphon modeling/approximation framework (Lloyd et al. [2012], Orbanz and Roy [2015]). Note: the inputs of the nnet are now parameters. (A Bayesian habit?)

  15. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$.

  16. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$. Gradient-based inference targeting, for example, $\text{Loss} = \sum_{(n,m)} (X_{n,m} - f(U_n, V_m; \theta))^2 + \lambda_1 (\|U\|_F^2 + \|V\|_F^2) + \lambda_2 \|\theta\|_F^2$, where the $\lambda_1$ term regularizes the inputs (?) and the $\lambda_2$ term is L1/L2 regularization.

  17. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$. Gradient-based inference targeting, for example, $\text{Loss} = \sum_{(n,m)} (X_{n,m} - f(U_n, V_m; \theta))^2 + \lambda_1 (\|U\|_F^2 + \|V\|_F^2) + \lambda_2 \|\theta\|_F^2$, where the $\lambda_1$ term regularizes the inputs (?) and the $\lambda_2$ term is L1/L2 regularization. Competitive performance; dominates linear baselines.
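A hedged PyTorch sketch of this objective on a fully observed toy matrix, treating the inputs U and V as trainable parameters alongside theta; the sizes, the squared-error/L2 choices, and the optimizer are assumptions for illustration:

```python
import torch

N, M, D, H = 100, 80, 5, 16                      # illustrative sizes
U = torch.randn(N, D, requires_grad=True)        # the nnet *inputs* are parameters here
V = torch.randn(M, D, requires_grad=True)
f = torch.nn.Sequential(                         # f(.; theta), one hidden layer
    torch.nn.Linear(2 * D, H), torch.nn.Sigmoid(), torch.nn.Linear(H, 1))

X = torch.randn(N, M)                            # toy observed matrix
lam1, lam2 = 1e-3, 1e-3
opt = torch.optim.Adam([U, V, *f.parameters()], lr=1e-2)

for step in range(200):
    inputs = torch.cat([U.unsqueeze(1).expand(N, M, D),
                        V.unsqueeze(0).expand(N, M, D)], dim=-1)
    pred = f(inputs).squeeze(-1)                                      # f(U_n, V_m; theta) for all (n, m)
    loss = ((X - pred) ** 2).sum()                                    # squared error over entries
    loss = loss + lam1 * (U.pow(2).sum() + V.pow(2).sum())            # regularize the inputs
    loss = loss + lam2 * sum(p.pow(2).sum() for p in f.parameters())  # L2 on theta
    opt.zero_grad(); loss.backward(); opt.step()
```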

  18. When are deep learning architectures useful? ... for this matrix factorization problem anyway ...

  19. When are deep learning architectures useful? ... for this matrix factorization problem anyway ... Pros: ◮ Black-box for incorporating side information

  20. When are deep learning architectures useful? ... for this matrix factorization problem anyway ... Pros: ◮ Black-box for incorporating side information ◮ Gradient-based learning tools (e.g., Tensorflow/Torch/etc.)

  21. When are deep learning architectures useful? ... for this matrix factorization problem anyway ... Cons: ◮ Lack of interpretability ◮ (What does that really mean? Why is this a problem?)

  22. When are deep learning architectures useful? ... for this matrix factorization problem anyway ... Cons: ◮ Lack of interpretability ◮ (What does that really mean? Why is this a problem?) Motivates things like a “stochastic blockmodel”... ◮ In some (most?) cases, consumers don’t necessarily need to interpret the inferred nnet... ◮ Will often settle for some interpretable (inferred) components ◮ like convincing clusterings of the users.

  23. A stochastic blockmodeling extension ◮ Let $Z_n \in \{1, \ldots, K\}$ denote to which of $K$ clusters/components user $n$ is assigned.

  24. A stochastic blockmodeling extension ◮ Let $Z_n \in \{1, \ldots, K\}$ denote to which of $K$ clusters/components user $n$ is assigned. ◮ Let $U_k \in \mathbb{R}^D$ be the features for cluster $k$.

  25. A stochastic blockmodeling extension ◮ Let $Z_n \in \{1, \ldots, K\}$ denote to which of $K$ clusters/components user $n$ is assigned. ◮ Let $U_k \in \mathbb{R}^D$ be the features for cluster $k$. ◮ Construct entries like: Matrix factorization: $X_{n,m} \approx f(U_{Z_n}, V_m; \theta)$. Network modeling: $P\{X_{i,j} = 1\} \approx \sigma(f(U_{Z_i}, U_{Z_j}; \theta))$.

  26. A stochastic blockmodeling extension ◮ Let $Z_n \in \{1, \ldots, K\}$ denote to which of $K$ clusters/components user $n$ is assigned. ◮ Let $U_k \in \mathbb{R}^D$ be the features for cluster $k$. ◮ Construct entries like: Matrix factorization: $X_{n,m} \approx f(U_{Z_n}, V_m; \theta)$. Network modeling: $P\{X_{i,j} = 1\} \approx \sigma(f(U_{Z_i}, U_{Z_j}; \theta))$. ◮ So, we have reduced from $N$ sets of parameters to just $K$ ◮ ... like clustering the users (rows of the matrix). A short sketch of the construction follows.
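A short sketch of the construction, assuming cluster assignments Z and cluster features U are given; the linear f here is just a stand-in for the learned network:

```python
import numpy as np

N, K, D = 100, 8, 5                      # users, clusters, latent dim (illustrative)
rng = np.random.default_rng(0)
Z = rng.integers(K, size=N)              # Z_n: cluster assignment of user n
U = rng.normal(size=(K, D))              # U_k: features for cluster k (only K sets, not N)
W = rng.normal(size=2 * D)               # stand-in for the learned nnet parameters theta

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f(u, v):                             # placeholder for f(u, v; theta)
    return W @ np.concatenate([u, v])

i, j = 3, 7
p_edge = sigmoid(f(U[Z[i]], U[Z[j]]))    # P{X_ij = 1} ~ sigma(f(U_{Z_i}, U_{Z_j}; theta))
```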

  27. A stochastic blockmodeling extension ◮ Without knowledge of $Z_n$ ⇒ infer from data.

  28. A stochastic blockmodeling extension ◮ Without knowledge of $Z_n$ ⇒ infer from data. ◮ Requires (IMO) a Bayesian approach... Variational inference.

  29. A stochastic blockmodeling extension ◮ Without knowledge of $Z_n$ ⇒ infer from data. ◮ Requires (IMO) a Bayesian approach... Variational inference. ◮ Straightforward application: “Variational inference for Dirichlet process mixtures” (Blei and Jordan [2006]).

  30. A stochastic blockmodeling extension ◮ Without knowledge of $Z_n$ ⇒ infer from data. ◮ Requires (IMO) a Bayesian approach... Variational inference. ◮ Straightforward application: “Variational inference for Dirichlet process mixtures” (Blei and Jordan [2006]). ◮ Informally, prediction looks like $P\{X^*_{i,j} = 1\} \approx \mathbb{E}_{q(Z)}[\sigma(f(U_{Z_i}, U_{Z_j}; \theta))]$, where $q(Z) \approx p(Z \mid X)$ is an approximation to the posterior.
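Under a mean-field $q(Z) = \prod_n q(Z_n)$ with $K$ clusters, this expectation is a finite sum over cluster pairs, so the predictive probability can be computed exactly. A sketch with placeholder features and variational probabilities:

```python
import numpy as np

K, D, N = 8, 5, 100                           # illustrative sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(K, D))                   # cluster features
W = rng.normal(size=2 * D)                    # stand-in nnet parameters

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f(u, v):
    return W @ np.concatenate([u, v])

eta = rng.dirichlet(np.ones(K), size=N)       # eta[n, k] = q(Z_n = k); would come from inference

def predict_edge(i, j):
    """E_{q(Z)}[sigma(f(U_{Z_i}, U_{Z_j}; theta))] under mean-field q(Z_i) q(Z_j)."""
    return sum(eta[i, k] * eta[j, l] * sigmoid(f(U[k], U[l]))
               for k in range(K) for l in range(K))

p_star = predict_edge(3, 7)                   # approx. P{X*_{3,7} = 1}
```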

  31. Stick-breaking, mean-field variational inference. Stick-breaking construction: let $V_i \sim \text{Beta}(1, c)$, $i = 1, 2, \ldots$, and $\pi_k = V_k \prod_{\ell=1}^{k-1} (1 - V_\ell)$, $k = 1, 2, \ldots$, with $Z_n \mid \pi \sim \text{Discrete}(\pi)$, $n \le N$. The log likelihood is, for example, $\sum_{(i,j)} \log p(X_{i,j} \mid f(U_{Z_i}, U_{Z_j}; \theta)) + \log p(Z \mid V) + \log p(V)$.
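A quick simulation of the stick-breaking prior, truncated at a finite number of sticks; the truncation level and concentration are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
c, K_trunc, N = 2.0, 20, 100                 # concentration, truncation level, number of nodes

V = rng.beta(1.0, c, size=K_trunc)           # V_k ~ Beta(1, c)
pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))   # pi_k = V_k * prod_{l<k} (1 - V_l)
pi = pi / pi.sum()                           # renormalize the truncated weights
Z = rng.choice(K_trunc, size=N, p=pi)        # Z_n | pi ~ Discrete(pi)
```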

  32. Stick-breaking, mean-field variational inference. Let $q$ denote a “variational approximation” to the posterior: $q(V_k) = \text{Beta}(V_k; a_k, b_k)$, $q(Z_n) = \text{Discrete}(Z_n; \eta_n)$. Maximize the following lower bound on the log marginal likelihood: $\log p(X) \ge \mathbb{E}_{q(Z,V)}\big[\sum_{(i,j)} \log p(X_{i,j} \mid f(U_{Z_i}, U_{Z_j}; \theta))\big] - \mathrm{KL}[q(Z,V)\,\|\,p(Z,V)]$, with KL the Kullback–Leibler divergence.
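One way to see the bound concretely is a single-sample Monte Carlo estimate: draw $(Z, V)$ from $q$, then evaluate $\log p(X \mid Z, \theta) + \log p(Z, V) - \log q(Z, V)$. The sketch below does this on toy data with a stand-in likelihood; all sizes, parameters, and the toy adjacency matrix are assumptions:

```python
import numpy as np
from scipy.special import expit
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
K, D, N, c = 20, 5, 50, 2.0                      # truncation, latent dim, nodes, DP concentration
U, W = rng.normal(size=(K, D)), rng.normal(size=2 * D)   # stand-in network parameters
a, b = np.ones(K), np.full(K, c)                 # q(V_k) = Beta(a_k, b_k)
eta = rng.dirichlet(np.ones(K), size=N)          # q(Z_n) = Discrete(eta_n)
X = rng.integers(0, 2, size=(N, N))              # toy binary adjacency matrix

def log_lik(i, j, zi, zj):                       # log p(X_ij | f(U_{Z_i}, U_{Z_j}; theta))
    p = expit(W @ np.concatenate([U[zi], U[zj]]))
    return np.log(p) if X[i, j] else np.log1p(-p)

def elbo_estimate():
    """Single-sample Monte Carlo estimate of the lower bound on log p(X)."""
    V = rng.beta(a, b)                                           # V ~ q(V)
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    pi = pi / pi.sum()
    Z = np.array([rng.choice(K, p=eta[n]) for n in range(N)])    # Z ~ q(Z)
    ll = sum(log_lik(i, j, Z[i], Z[j]) for i in range(N) for j in range(N) if i != j)
    log_p = beta_dist.logpdf(V, 1.0, c).sum() + np.log(pi[Z]).sum()    # log p(V) + log p(Z | V)
    log_q = beta_dist.logpdf(V, a, b).sum() + np.log(eta[np.arange(N), Z]).sum()
    return ll + log_p - log_q                    # log p(X, Z, V) - log q(Z, V), one sample

print(elbo_estimate())
```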

  33. Stick-breaking, mean-field variational inference. Algorithm: ◮ Initialize $q$. ◮ Iterate: ◮ Update $q(Z_n = k) \propto \exp\big( \mathbb{E}_q[\log V_k] + \sum_{\ell < k} \mathbb{E}_q[\log(1 - V_\ell)] + \mathbb{E}_q\big[\sum_{(i,j)} \log p(X_{i,j} \mid Z, \{Z_n = k\})\big] \big)$; ◮ Take a gradient step $\Theta \leftarrow \Theta + \eta\, \nabla_\Theta \big( \mathbb{E}_q\big[\sum_{(i,j)} \log p(X_{i,j} \mid f(U_{Z_i}, U_{Z_j}; \theta))\big] - \mathrm{KL}[q(Z,V)\,\|\,p(Z,V)] \big)$, for some schedule $\eta$ and all parameters $\Theta$.
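A compact sketch of the $q(Z_n)$ coordinate update (the first bullet), using the standard Beta expectations $\mathbb{E}[\log V_k] = \psi(a_k) - \psi(a_k + b_k)$. The expected log-likelihood term is left as an input, since in practice it depends on the current $q$ of the other nodes and on the network $f$; the gradient step on $\Theta$ would be handled by an optimizer as in the earlier PyTorch sketch.

```python
import numpy as np
from scipy.special import digamma, logsumexp

def update_qZn(a, b, expected_loglik):
    """Mean-field update of q(Z_n = k), given q(V_k) = Beta(a_k, b_k).

    expected_loglik[k] stands for E_q[ sum_{(i,j)} log p(X_ij | Z, {Z_n = k}) ].
    """
    E_log_V = digamma(a) - digamma(a + b)              # E_q[log V_k]
    E_log_1mV = digamma(b) - digamma(a + b)            # E_q[log(1 - V_k)]
    prior_term = E_log_V + np.concatenate(([0.0], np.cumsum(E_log_1mV)[:-1]))   # sum over l < k
    log_eta = prior_term + expected_loglik
    return np.exp(log_eta - logsumexp(log_eta))        # normalized eta_n
```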

  34. Stick-breaking, mean-field variational inference ◮ Easily integrates with gradient-based learning (i.e., use Tensorflow/Torch/etc.)

  35. Stick-breaking, mean-field variational inference ◮ Easily integrates with gradient-based learning (i.e., use Tensorflow/Torch/etc.) ◮ Computing gradients requires stochastic approximation ◮ Stochastic reparameterizations (Salimans and Knowles [2013]; Kingma and Welling [2014]) ◮ Score function estimators with control variates (Ranganath et al. [2014]; Paisley et al. [2012])
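Since $Z$ is discrete it cannot be reparameterized directly, which is where score-function (REINFORCE) estimators with a control variate come in. A toy sketch of such an estimator for a categorical $q(Z; \phi)$ parameterized by softmax logits; the objective $g$ is a placeholder for the ELBO integrand:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

def score_function_grad(phi, g, num_samples=100):
    """Estimate d/d phi of E_{q(Z; phi)}[g(Z)], where q(Z = k) = softmax(phi)_k."""
    probs = softmax(phi)
    # Control variate: a baseline estimated from an independent batch of samples.
    baseline = np.mean([g(k) for k in rng.choice(len(phi), size=num_samples, p=probs)])
    grads = np.zeros_like(phi)
    for k in rng.choice(len(phi), size=num_samples, p=probs):
        grad_log_q = -probs.copy()
        grad_log_q[k] += 1.0                         # d log q(Z = k) / d phi
        grads += (g(k) - baseline) * grad_log_q
    return grads / num_samples

phi = np.zeros(5)
g = lambda k: float(k == 2)                          # toy objective: indicator of cluster 2
grad_est = score_function_grad(phi, g)
```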
