Doubly Stochastic Variational Inference for Neural Processes with Hierarchical Latent Variables



  1. Doubly Stochastic Variational Inference for Neural Processes with Hierarchical Latent Variables. Q. Wang & Herke van Hoof, Amsterdam Machine Learning Lab, ICML 2020

  2. Highlights in this Work
     A systematic revisit of SPs with an implicit latent variable model:
     ◮ conceptualization of latent SP models
     ◮ a clearer comprehension of SPs with LVMs
     A novel exchangeable SP within a hierarchical Bayesian framework:
     ◮ formalization of a hierarchical SP
     ◮ a plausible approximate inference method
     Competitive performance on extensive uncertainty-aware applications:
     ◮ high-dimensional regression on simulator and real-world datasets
     ◮ classification and o.o.d. detection on image datasets

  3. Outline of this Talk
     1) Motivation for SPs
     2) Study of SPs with LVMs
     3) NPs with Hierarchical Latent Variables
     4) Experiments and Applications

  4. Motivation for SPs

  5. Why Do We Need Stochastic Processes? A stochastic process (SP) is a mathematical tool for describing a distribution over functions (figure from [1]).
     ◮ Flexibility in handling correlations among samples: important for non-i.i.d. datasets;
     ◮ Uncertainty quantification in risk-sensitive applications: e.g., forecasting p(s_{t+1} | s_t, a_t) in autonomous driving [2];
     ◮ Modeling distributions instead of point estimates: serving as a generative model that yields further realizations [3].

  6. Two Consistencies in Exchangeable SPs
     Required properties for an exchangeable stochastic process \rho [4]:
     ◮ Marginalization consistency. For any finite collection of random variables {y_1, y_2, ..., y_{N+M}}, the probability is unchanged after marginalizing out a subset:
        \int \rho_{x_{1:N+M}}(y_{1:N+M}) \, dy_{N+1:N+M} = \rho_{x_{1:N}}(y_{1:N})    (1.1)
     ◮ Exchangeability consistency. A random permutation \pi of the set of variables does not change the joint probability:
        \rho_{x_{1:N}}(y_{1:N}) = \rho_{x_{\pi(1:N)}}(y_{\pi(1:N)})    (1.2)
     With these two conditions, an exchangeable SP can be induced (refer to the Kolmogorov Extension Theorem). A numerical check of both conditions is sketched below.
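As a concrete illustration of the two conditions (not part of the original slides), the sketch below checks them numerically for a Gaussian process, whose finite-dimensional marginals are multivariate Gaussians. The RBF kernel, the specific inputs, and the use of numpy/scipy are assumptions made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(xa, xb, lengthscale=0.5, variance=1.0):
    """RBF covariance between two sets of 1-D inputs."""
    d2 = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
N, M = 5, 3
x = rng.uniform(-2, 2, size=N + M)   # indices x_{1:N+M}
y = rng.normal(size=N + M)           # an arbitrary evaluation point y_{1:N+M}

jitter = 1e-8 * np.eye(N + M)
K_full = rbf_kernel(x, x) + jitter

# Marginalization consistency (Eq. 1.1): integrating y_{N+1:N+M} out of the
# (N+M)-dim Gaussian leaves the N-dim Gaussian built from the N-point sub-kernel.
p_marginalized = multivariate_normal(np.zeros(N), K_full[:N, :N]).pdf(y[:N])
p_direct = multivariate_normal(np.zeros(N), rbf_kernel(x[:N], x[:N]) + jitter[:N, :N]).pdf(y[:N])
print(np.allclose(p_marginalized, p_direct))   # True

# Exchangeability consistency (Eq. 1.2): jointly permuting indices and outputs
# leaves the joint density unchanged.
perm = rng.permutation(N + M)
K_perm = rbf_kernel(x[perm], x[perm]) + jitter
p_joint = multivariate_normal(np.zeros(N + M), K_full).pdf(y)
p_perm = multivariate_normal(np.zeros(N + M), K_perm).pdf(y[perm])
print(np.allclose(p_joint, p_perm))            # True
```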

  7. SPs in Progress and Primary Concerns
     Crucial properties for SPs:
     ◮ Scalability to large-scale datasets → optimization/computational bottleneck
     ◮ Flexibility in distributions → non-Gaussian or multi-modal properties
     ◮ Extension to high dimensions → correlations among or across inputs/outputs
     Analysis of GPs/NPs:
     ◮ Gaussian Processes (GPs): less scalable, with computational complexity O(N^3); less flexible, limited to Gaussian distributions
     ◮ Neural Processes (NPs): more scalable, with computational complexity O(N); more flexible, with no explicit distributional form
     (The scalability contrast is illustrated in the sketch below.)
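To make the scalability contrast concrete, here is a rough sketch (not from the slides): exact GP prediction solves an N x N linear system, which is O(N^3), while an NP-style model aggregates per-point encodings of the context in O(N). The random-feature encoder/decoder weights are purely hypothetical stand-ins for the trained NP networks.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
x_c = rng.uniform(-3, 3, size=(N, 1))                   # context inputs
y_c = np.sin(x_c).ravel() + 0.1 * rng.normal(size=N)    # context outputs
x_star = np.array([[0.5]])                               # one target input

# GP posterior mean: the O(N^3) cost comes from the solve against the N x N Gram matrix.
def rbf(a, b, ls=0.7):
    return np.exp(-0.5 * (a - b.T) ** 2 / ls ** 2)

K = rbf(x_c, x_c) + 1e-2 * np.eye(N)    # N x N Gram matrix
k_star = rbf(x_c, x_star)                # N x 1 cross-covariance
gp_mean = (k_star.T @ np.linalg.solve(K, y_c)).item()

# NP-style prediction: per-point encoding followed by mean pooling is O(N).
# W_enc / W_dec are random stand-ins for trained encoder/decoder weights.
dim = 8
W_enc = rng.normal(size=(2, dim)) / np.sqrt(2)
W_dec = rng.normal(size=(dim + 1, 1)) / np.sqrt(dim + 1)

h = np.tanh(np.concatenate([x_c, y_c[:, None]], axis=1) @ W_enc)   # (N, dim) encodings
r = h.mean(axis=0)                                                  # permutation-invariant summary
np_mean = (np.concatenate([r, x_star.ravel()]) @ W_dec).item()

print(f"GP mean at x*: {gp_mean:.3f}, NP-style mean: {np_mean:.3f}")
```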

  8. Study of SPs with LVMs

  9. Deep Latent Variable Model as SPs
     Here we present an implicit latent variable model for SPs.
     Generation paradigm with (potentially correlated) latent variables:
        \underbrace{z_i}_{\text{index-depend. l.v.}} = \underbrace{\phi(x_i)}_{\text{deter. term}} + \underbrace{\epsilon(x_i)}_{\text{stoch. term}}    (2.1)
        y_i = \underbrace{\varphi(x_i, z_i)}_{\text{obs. trans.}} + \underbrace{\zeta_i}_{\text{obs. noise}}    (2.2)
     Predictive distribution in SPs: let the context be C = {(x_i, y_i) | i = 1, 2, ..., N} and the target input be x_T; the computation
        p_\theta(z_T | x_C, y_C, x_T) = \frac{p(z_C, z_T)}{\int p(z_C, z_T) \, dz_C}    (2.3)
     is mostly intractable. (A sketch of the generative paradigm follows below.)
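A minimal numerical sketch of the generative paradigm in Eqs. (2.1)-(2.2). The tiny random-feature networks standing in for phi, epsilon and varphi, and the observation-noise scale, are hypothetical placeholders for the learned maps. Sampling forward is easy, but the posterior in Eq. (2.3) involves an intractable integral, which motivates the variational treatment in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def mlp(in_dim, out_dim, hidden=16):
    """A tiny fixed random two-layer network; stand-in for a learned map."""
    W1 = rng.normal(size=(in_dim, hidden)) / np.sqrt(in_dim)
    W2 = rng.normal(size=(hidden, out_dim)) / np.sqrt(hidden)
    return lambda x: np.tanh(x @ W1) @ W2

d_x, d_z, d_y = 1, 4, 1
phi = mlp(d_x, d_z)            # deterministic term  phi(x_i)
eps_scale = mlp(d_x, d_z)      # input-dependent scale of the stochastic term  eps(x_i)
varphi = mlp(d_x + d_z, d_y)   # observation transformation  varphi(x_i, z_i)
obs_noise = 0.05               # std of the observation noise  zeta_i (made up)

def generate(x):
    """Sample one realization at inputs x via Eqs. (2.1)-(2.2)."""
    # Eq. (2.1): index-dependent latent variable = deterministic + stochastic term
    z = phi(x) + np.abs(eps_scale(x)) * rng.normal(size=(x.shape[0], d_z))
    # Eq. (2.2): observation = transformation of (x, z) + observation noise
    y = varphi(np.concatenate([x, z], axis=1)) + obs_noise * rng.normal(size=(x.shape[0], d_y))
    return z, y

x = np.linspace(-2, 2, 50)[:, None]
samples = [generate(x)[1] for _ in range(5)]   # five draws from the implicit process
print(np.stack(samples).shape)                  # (5, 50, 1)
```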
