Some Bayesian extensions of neural network-based graphon approximations


  1. Some Bayesian extensions of neural network-based graphon approximations. Creighton Heaukulani. Joint work with Onno Kampman (Hong Kong). EcoSta 2018, Hong Kong, June 2018.

  2. Overview 1. Review neural network graphon approximation and its gradient-based inference. When are nnets useful?

  3. Overview 1. Review neural network graphon approximation and its gradient-based inference. When are nnets useful? 2. Consider variational inference in such a model and why.

  4. Overview 1. Review neural network graphon approximation and its gradient-based inference. When are nnets useful? 2. Consider variational inference in such a model and why. 3. Implement an infinite stochastic blockmodel, with good reason.

  5. Overview 1. Review neural network graphon approximation and its gradient-based inference. When are nnets useful? 2. Consider variational inference in such a model and why. 3. Implement an infinite stochastic blockmodel, with good reason. 4. Review the pros and cons of being Bayesian here and other lessons learned along the way.

  6. Relational data modeling Figure from: https://buildingrecommenders.wordpress.com/2015/11/18/overview-of-recommender-algorithms-part-2/

  7. Relational data modeling “Minibatch learning” with these two data structures... ◮ What’s the appropriate minibatch? ◮ Which entries are missing? Lee et al. [2017] Figure from: https://buildingrecommenders.wordpress.com/2015/11/18/overview-of-recommender-algorithms-part-2/

  8. Matrix factorization... linear models Figure from: https://buildingrecommenders.wordpress.com/2015/11/18/overview-of-recommender-algorithms-part-2/

  9. Matrix factorization... linear models. The $(n, m)$-th entry of the matrix is modeled as $X_{n,m} \approx U_n^T V_m = \sum_{d=1}^{D} U_{n,d} V_{m,d}$, for some $U_n \in \mathbb{R}^D$ and $V_m \in \mathbb{R}^D$, with $D$ small. A linear model. Figure from: https://buildingrecommenders.wordpress.com/2015/11/18/overview-of-recommender-algorithms-part-2/
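A minimal NumPy sketch of the linear factorization above; the sizes and random features are illustrative, not taken from the slides:

```python
import numpy as np

N, M, D = 100, 80, 5          # users, items, latent dimension (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(size=(N, D))   # per-user features U_n
V = rng.normal(size=(M, D))   # per-item features V_m

# Linear matrix factorization: X_hat[n, m] = U_n^T V_m = sum_d U[n, d] * V[m, d]
X_hat = U @ V.T

n, m = 3, 7
assert np.isclose(X_hat[n, m], U[n] @ V[m])
```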

  10. Neural network matrix factorization (Dziugaite and Roy [2015]). $f(\cdot\,; \theta)$ is a neural network with parameters $\theta$.

  11. Neural network matrix factorization (Dziugaite and Roy [2015]). $f(\cdot\,; \theta)$ is a neural network with parameters $\theta$. The $(n, m)$-th entry of the matrix is modeled as $X_{n,m} \approx f(U_n, V_m; \theta)$, replacing the linear form $U_n^T V_m = \sum_{d=1}^{D} U_{n,d} V_{m,d}$. Generalized to a nonlinear model.

  12. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$. E.g., $X_{n,m} \approx W_o\, \sigma(W_h [U_n, V_m] + b_h) + b_o$.
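A small Python sketch of the one-hidden-layer form quoted on the slide, with the logistic sigmoid standing in for the nonlinearity $\sigma$; all sizes and parameter values are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, H = 5, 16                                           # latent and hidden sizes (illustrative)
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(H, 2 * D)), np.zeros(H)    # hidden layer
W_o, b_o = rng.normal(size=(1, H)), np.zeros(1)        # output layer

def f(U_n, V_m):
    """One-hidden-layer f(U_n, V_m; theta) = W_o sigma(W_h [U_n, V_m] + b_h) + b_o."""
    h = sigmoid(W_h @ np.concatenate([U_n, V_m]) + b_h)
    return (W_o @ h + b_o).item()

U_n, V_m = rng.normal(size=D), rng.normal(size=D)
x_hat = f(U_n, V_m)                # matrix-factorization version: X_nm ~ f(U_n, V_m; theta)
p_edge = sigmoid(f(U_n, V_m))      # network version: P{X_nm = 1} ~ sigma(f(U_n, V_m; theta))
```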

  13. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$. E.g., $X_{n,m} \approx W_o\, \sigma(W_h [U_n, V_m] + b_h) + b_o$. Within the graphon modeling/approximation framework (Lloyd et al. [2012], Orbanz and Roy [2015]).

  14. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$. E.g., $X_{n,m} \approx W_o\, \sigma(W_h [U_n, V_m] + b_h) + b_o$. Within the graphon modeling/approximation framework (Lloyd et al. [2012], Orbanz and Roy [2015]). Note: the inputs of the nnet are now parameters. (A Bayesian habit?)

  15. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$.

  16. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$. Gradient-based inference targeting, for example, $\text{Loss} = \sum_{(n,m)} (X_{n,m} - f(U_n, V_m; \theta))^2 + \lambda_1 (\|U\|_F^2 + \|V\|_F^2) + \lambda_2 \|\theta\|_F^2$, where the $\lambda_1$ term regularizes the inputs (?) and the $\lambda_2$ term is L1/L2 regularization.

  17. Neural network matrix factorization (Dziugaite and Roy [2015]). Matrix factorization: $X_{n,m} \approx f(U_n, V_m; \theta)$. Network model: $P\{X_{n,m} = 1\} \approx \sigma(f(U_n, V_m; \theta))$. Gradient-based inference targeting, for example, $\text{Loss} = \sum_{(n,m)} (X_{n,m} - f(U_n, V_m; \theta))^2 + \lambda_1 (\|U\|_F^2 + \|V\|_F^2) + \lambda_2 \|\theta\|_F^2$, where the $\lambda_1$ term regularizes the inputs (?) and the $\lambda_2$ term is L1/L2 regularization. Competitive performance; dominates linear baselines.
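A hedged PyTorch sketch of this objective on a fully observed toy matrix, treating the inputs U and V as trainable parameters alongside theta; the sizes, the squared-error/L2 choices, and the optimizer are assumptions for illustration:

```python
import torch

N, M, D, H = 100, 80, 5, 16                      # illustrative sizes
U = torch.randn(N, D, requires_grad=True)        # the nnet *inputs* are parameters here
V = torch.randn(M, D, requires_grad=True)
f = torch.nn.Sequential(                         # f(.; theta), one hidden layer
    torch.nn.Linear(2 * D, H), torch.nn.Sigmoid(), torch.nn.Linear(H, 1))

X = torch.randn(N, M)                            # toy observed matrix
lam1, lam2 = 1e-3, 1e-3
opt = torch.optim.Adam([U, V, *f.parameters()], lr=1e-2)

for step in range(200):
    inputs = torch.cat([U.unsqueeze(1).expand(N, M, D),
                        V.unsqueeze(0).expand(N, M, D)], dim=-1)
    pred = f(inputs).squeeze(-1)                                      # f(U_n, V_m; theta) for all (n, m)
    loss = ((X - pred) ** 2).sum()                                    # squared error over entries
    loss = loss + lam1 * (U.pow(2).sum() + V.pow(2).sum())            # regularize the inputs
    loss = loss + lam2 * sum(p.pow(2).sum() for p in f.parameters())  # L2 on theta
    opt.zero_grad(); loss.backward(); opt.step()
```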

  18. When are deep learning architectures useful? ... for this matrix factorization problem anyway ...

  19. When are deep learning architectures useful? ... for this matrix factorization problem anyway ... Pros: ◮ Black-box for incorporating side information

  20. When are deep learning architectures useful? ... for this matrix factorization problem anyway ... Pros: ◮ Black-box for incorporating side information ◮ Gradient-based learning tools (e.g., Tensorflow/Torch/etc.)

  21. When are deep learning architectures useful? ... for this matrix factorization problem anyway ... Cons: ◮ Lack of interpretability ◮ (What does that really mean? Why is this a problem?)

  22. When are deep learning architectures useful? ... for this matrix factorization problem anyway ... Cons: ◮ Lack of interpretability ◮ (What does that really mean? Why is this a problem?) Motivates things like a “stochastic blockmodel”... ◮ In some (most?) cases, consumers don’t necessarily need to interpret the inferred nnet... ◮ Will often settle for some interpretable (inferred) components ◮ like convincing clusterings of the users.

  23. A stochastic blockmodeling extension ◮ Let $Z_n \in \{1, \ldots, K\}$ denote to which of $K$ clusters/components user $n$ is assigned.

  24. A stochastic blockmodeling extension ◮ Let $Z_n \in \{1, \ldots, K\}$ denote to which of $K$ clusters/components user $n$ is assigned. ◮ Let $U_k \in \mathbb{R}^D$ be the features for cluster $k$.

  25. A stochastic blockmodeling extension ◮ Let $Z_n \in \{1, \ldots, K\}$ denote to which of $K$ clusters/components user $n$ is assigned. ◮ Let $U_k \in \mathbb{R}^D$ be the features for cluster $k$. ◮ Construct entries like: Matrix factorization: $X_{n,m} \approx f(U_{Z_n}, V_m; \theta)$. Network modeling: $P\{X_{i,j} = 1\} \approx \sigma(f(U_{Z_i}, U_{Z_j}; \theta))$.

  26. A stochastic blockmodeling extension ◮ Let $Z_n \in \{1, \ldots, K\}$ denote to which of $K$ clusters/components user $n$ is assigned. ◮ Let $U_k \in \mathbb{R}^D$ be the features for cluster $k$. ◮ Construct entries like: Matrix factorization: $X_{n,m} \approx f(U_{Z_n}, V_m; \theta)$. Network modeling: $P\{X_{i,j} = 1\} \approx \sigma(f(U_{Z_i}, U_{Z_j}; \theta))$. ◮ So, we have reduced from $N$ sets of parameters to just $K$ ◮ ... like clustering the users (rows of the matrix). A short sketch of the construction follows.
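A short sketch of the construction, assuming cluster assignments Z and cluster features U are given; the linear f here is just a stand-in for the learned network:

```python
import numpy as np

N, K, D = 100, 8, 5                      # users, clusters, latent dim (illustrative)
rng = np.random.default_rng(0)
Z = rng.integers(K, size=N)              # Z_n: cluster assignment of user n
U = rng.normal(size=(K, D))              # U_k: features for cluster k (only K sets, not N)
W = rng.normal(size=2 * D)               # stand-in for the learned nnet parameters theta

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f(u, v):                             # placeholder for f(u, v; theta)
    return W @ np.concatenate([u, v])

i, j = 3, 7
p_edge = sigmoid(f(U[Z[i]], U[Z[j]]))    # P{X_ij = 1} ~ sigma(f(U_{Z_i}, U_{Z_j}; theta))
```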

  27. A stochastic blockmodeling extension ◮ Without knowledge of $Z_n$ ⇒ infer from data.

  28. A stochastic blockmodeling extension ◮ Without knowledge of $Z_n$ ⇒ infer from data. ◮ Requires (IMO) a Bayesian approach... Variational inference.

  29. A stochastic blockmodeling extension ◮ Without knowledge of $Z_n$ ⇒ infer from data. ◮ Requires (IMO) a Bayesian approach... Variational inference. ◮ Straightforward application: “Variational inference for Dirichlet process mixtures” (Blei and Jordan [2006]).

  30. A stochastic blockmodeling extension ◮ Without knowledge of $Z_n$ ⇒ infer from data. ◮ Requires (IMO) a Bayesian approach... Variational inference. ◮ Straightforward application: “Variational inference for Dirichlet process mixtures” (Blei and Jordan [2006]). ◮ Informally, prediction looks like $P\{X^*_{i,j} = 1\} \approx \mathbb{E}_{q(Z)}[\sigma(f(U_{Z_i}, U_{Z_j}; \theta))]$, where $q(Z) \approx p(Z \mid X)$ is an approximation to the posterior.
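Under a mean-field $q(Z) = \prod_n q(Z_n)$ with $K$ clusters, this expectation is a finite sum over cluster pairs, so the predictive probability can be computed exactly. A sketch with placeholder features and variational probabilities:

```python
import numpy as np

K, D, N = 8, 5, 100                           # illustrative sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(K, D))                   # cluster features
W = rng.normal(size=2 * D)                    # stand-in nnet parameters

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f(u, v):
    return W @ np.concatenate([u, v])

eta = rng.dirichlet(np.ones(K), size=N)       # eta[n, k] = q(Z_n = k); would come from inference

def predict_edge(i, j):
    """E_{q(Z)}[sigma(f(U_{Z_i}, U_{Z_j}; theta))] under mean-field q(Z_i) q(Z_j)."""
    return sum(eta[i, k] * eta[j, l] * sigmoid(f(U[k], U[l]))
               for k in range(K) for l in range(K))

p_star = predict_edge(3, 7)                   # approx. P{X*_{3,7} = 1}
```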

  31. Stick-breaking, mean-field variational inference. Stick-breaking construction: let $V_i \sim \text{Beta}(1, c)$, $i = 1, 2, \ldots$, and $\pi_k = V_k \prod_{\ell=1}^{k-1} (1 - V_\ell)$, $k = 1, 2, \ldots$, with $Z_n \mid \pi \sim \text{Discrete}(\pi)$, $n \le N$. The log likelihood is, for example, $\sum_{(i,j)} \log p(X_{i,j} \mid f(U_{Z_i}, U_{Z_j}; \theta)) + \log p(Z \mid V) + \log p(V)$.
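A quick simulation of the stick-breaking prior, truncated at a finite number of sticks; the truncation level and concentration are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
c, K_trunc, N = 2.0, 20, 100                 # concentration, truncation level, number of nodes

V = rng.beta(1.0, c, size=K_trunc)           # V_k ~ Beta(1, c)
pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))   # pi_k = V_k * prod_{l<k} (1 - V_l)
pi = pi / pi.sum()                           # renormalize the truncated weights
Z = rng.choice(K_trunc, size=N, p=pi)        # Z_n | pi ~ Discrete(pi)
```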

  32. Stick-breaking, mean-field variational inference. Let $q$ denote a “variational approximation” to the posterior: $q(V_k) = \text{Beta}(V_k; a_k, b_k)$, $q(Z_n) = \text{Discrete}(Z_n; \eta_n)$. Maximize the following lower bound on the log marginal likelihood: $\log p(X) \ge \mathbb{E}_{q(Z,V)}\big[\sum_{(i,j)} \log p(X_{i,j} \mid f(U_{Z_i}, U_{Z_j}; \theta))\big] - \mathrm{KL}[q(Z,V)\,\|\,p(Z,V)]$, with KL the Kullback–Leibler divergence.
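One way to see the bound concretely is a single-sample Monte Carlo estimate: draw $(Z, V)$ from $q$, then evaluate $\log p(X \mid Z, \theta) + \log p(Z, V) - \log q(Z, V)$. The sketch below does this on toy data with a stand-in likelihood; all sizes, parameters, and the toy adjacency matrix are assumptions:

```python
import numpy as np
from scipy.special import expit
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
K, D, N, c = 20, 5, 50, 2.0                      # truncation, latent dim, nodes, DP concentration
U, W = rng.normal(size=(K, D)), rng.normal(size=2 * D)   # stand-in network parameters
a, b = np.ones(K), np.full(K, c)                 # q(V_k) = Beta(a_k, b_k)
eta = rng.dirichlet(np.ones(K), size=N)          # q(Z_n) = Discrete(eta_n)
X = rng.integers(0, 2, size=(N, N))              # toy binary adjacency matrix

def log_lik(i, j, zi, zj):                       # log p(X_ij | f(U_{Z_i}, U_{Z_j}; theta))
    p = expit(W @ np.concatenate([U[zi], U[zj]]))
    return np.log(p) if X[i, j] else np.log1p(-p)

def elbo_estimate():
    """Single-sample Monte Carlo estimate of the lower bound on log p(X)."""
    V = rng.beta(a, b)                                           # V ~ q(V)
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    pi = pi / pi.sum()
    Z = np.array([rng.choice(K, p=eta[n]) for n in range(N)])    # Z ~ q(Z)
    ll = sum(log_lik(i, j, Z[i], Z[j]) for i in range(N) for j in range(N) if i != j)
    log_p = beta_dist.logpdf(V, 1.0, c).sum() + np.log(pi[Z]).sum()    # log p(V) + log p(Z | V)
    log_q = beta_dist.logpdf(V, a, b).sum() + np.log(eta[np.arange(N), Z]).sum()
    return ll + log_p - log_q                    # log p(X, Z, V) - log q(Z, V), one sample

print(elbo_estimate())
```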

  33. Stick-breaking, mean-field variational inference. Algorithm: ◮ Initialize $q$. ◮ Iterate: ◮ Update $q(Z_n = k) \propto \exp\big( \mathbb{E}_q[\log V_k] + \sum_{\ell < k} \mathbb{E}_q[\log(1 - V_\ell)] + \mathbb{E}_q\big[\sum_{(i,j)} \log p(X_{i,j} \mid Z, \{Z_n = k\})\big] \big)$; ◮ Take a gradient step $\Theta \leftarrow \Theta + \eta\, \nabla_\Theta \big( \mathbb{E}_q\big[\sum_{(i,j)} \log p(X_{i,j} \mid f(U_{Z_i}, U_{Z_j}; \theta))\big] - \mathrm{KL}[q(Z,V)\,\|\,p(Z,V)] \big)$, for some schedule $\eta$ and all parameters $\Theta$.
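A compact sketch of the $q(Z_n)$ coordinate update (the first bullet), using the standard Beta expectations $\mathbb{E}[\log V_k] = \psi(a_k) - \psi(a_k + b_k)$. The expected log-likelihood term is left as an input, since in practice it depends on the current $q$ of the other nodes and on the network $f$; the gradient step on $\Theta$ would be handled by an optimizer as in the earlier PyTorch sketch.

```python
import numpy as np
from scipy.special import digamma, logsumexp

def update_qZn(a, b, expected_loglik):
    """Mean-field update of q(Z_n = k), given q(V_k) = Beta(a_k, b_k).

    expected_loglik[k] stands for E_q[ sum_{(i,j)} log p(X_ij | Z, {Z_n = k}) ].
    """
    E_log_V = digamma(a) - digamma(a + b)              # E_q[log V_k]
    E_log_1mV = digamma(b) - digamma(a + b)            # E_q[log(1 - V_k)]
    prior_term = E_log_V + np.concatenate(([0.0], np.cumsum(E_log_1mV)[:-1]))   # sum over l < k
    log_eta = prior_term + expected_loglik
    return np.exp(log_eta - logsumexp(log_eta))        # normalized eta_n
```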

  34. Stick-breaking, mean-field variational inference ◮ Easily integrates with gradient-based learning (i.e., use Tensorflow/Torch/etc.)

  35. Stick-breaking, mean-field variational inference ◮ Easily integrates with gradient-based learning (i.e., use Tensorflow/Torch/etc.) ◮ Computing gradients requires stochastic approximation ◮ Stochastic reparameterizations (Salimans and Knowles [2013]; Kingma and Welling [2014]) ◮ Score function estimators with control variates (Ranganath et al. [2014]; Paisley et al. [2012])
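Since $Z$ is discrete it cannot be reparameterized directly, which is where score-function (REINFORCE) estimators with a control variate come in. A toy sketch of such an estimator for a categorical $q(Z; \phi)$ parameterized by softmax logits; the objective $g$ is a placeholder for the ELBO integrand:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

def score_function_grad(phi, g, num_samples=100):
    """Estimate d/d phi of E_{q(Z; phi)}[g(Z)], where q(Z = k) = softmax(phi)_k."""
    probs = softmax(phi)
    # Control variate: a baseline estimated from an independent batch of samples.
    baseline = np.mean([g(k) for k in rng.choice(len(phi), size=num_samples, p=probs)])
    grads = np.zeros_like(phi)
    for k in rng.choice(len(phi), size=num_samples, p=probs):
        grad_log_q = -probs.copy()
        grad_log_q[k] += 1.0                         # d log q(Z = k) / d phi
        grads += (g(k) - baseline) * grad_log_q
    return grads / num_samples

phi = np.zeros(5)
g = lambda k: float(k == 2)                          # toy objective: indicator of cluster 2
grad_est = score_function_grad(phi, g)
```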
