Doubly Stochastic Variational Inference for Neural Processes with Hierarchical Latent Variables
Q. Wang & Herke van Hoof
Amsterdam Machine Learning Lab
ICML 2020
Highlights in this Work

A systematic revisit of SPs with an Implicit Latent Variable Model
◮ conceptualization of latent SP models
◮ understanding of SPs with LVMs

A novel exchangeable SP within a Hierarchical Bayesian Framework
◮ formalization of a hierarchical SP
◮ a plausible approximate inference method

Competitive performance on extensive Uncertainty-aware Applications
◮ high-dimensional regression on simulators and real-world datasets
◮ classification and o.o.d. detection on image datasets
Outline of this Talk
1. Motivation for SPs
2. Study of SPs with LVMs
3. NPs with Hierarchical Latent Variables
4. Experiments and Applications
Motivation for SPs
Why Do We Need Stochastic Processes?

The stochastic process (SP) is a mathematical tool for describing distributions over functions. (Figure from [1])

◮ Flexible in handling correlations among samples: significant for non-i.i.d. datasets;
◮ Quantifies uncertainty in risk-sensitive applications: e.g. forecasting p(s_{t+1} | s_t, a_t) in autonomous driving [2];
◮ Models distributions instead of point estimates: serves as a generative model that yields further realizations [3].
Two Consistencies in Exchangeable SPs

Required properties for an exchangeable stochastic process ρ [4]:

Marginalization consistency. For any finite collection of random variables {y_1, y_2, ..., y_{N+M}}, the probability is unchanged after marginalizing over a subset:
\[
\int \rho_{x_{1:N+M}}(y_{1:N+M}) \, dy_{N+1:N+M} = \rho_{x_{1:N}}(y_{1:N}) \tag{1.1}
\]

Exchangeability consistency. Any permutation π of the set of variables leaves the joint probability unchanged:
\[
\rho_{x_{1:N}}(y_{1:N}) = \rho_{x_{\pi(1:N)}}(y_{\pi(1:N)}) \tag{1.2}
\]

With these two conditions, an exchangeable SP can be induced (refer to the Kolmogorov Extension Theorem).
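As an illustrative aside (not part of the original slides), the sketch below numerically checks both consistency conditions for the finite-dimensional marginals of a GP with an RBF kernel, a standard example of an exchangeable SP; the kernel, random seed, and tolerances are placeholder choices.

```python
# Minimal sketch, assuming a zero-mean GP with an RBF kernel: its joints
# rho_{x_{1:N}}(y_{1:N}) are multivariate Gaussians, so Eqs. (1.1)-(1.2)
# can be verified numerically on a small example.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.integrate import quad

def rbf_kernel(x, lengthscale=1.0, jitter=1e-6):
    """Squared-exponential kernel matrix with a small diagonal jitter."""
    sq = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * sq / lengthscale ** 2) + jitter * np.eye(len(x))

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=4)   # indices x_{1:N+M} with N = 3, M = 1
y = rng.normal(size=4)               # an evaluation point y_{1:N+M}
K = rbf_kernel(x)

joint = multivariate_normal(mean=np.zeros(4), cov=K)              # rho_{x_{1:4}}
marginal = multivariate_normal(mean=np.zeros(3), cov=K[:3, :3])   # rho_{x_{1:3}}

# Marginalization consistency (Eq. 1.1): integrating the joint density over
# y_4 recovers the density of y_{1:3} under the smaller marginal.
lhs, _ = quad(lambda y4: joint.pdf(np.append(y[:3], y4)), -np.inf, np.inf)
print(np.isclose(lhs, marginal.pdf(y[:3])))             # True

# Exchangeability consistency (Eq. 1.2): permuting the (x_i, y_i) pairs,
# i.e. the rows/columns of K and the entries of y, leaves the density equal.
perm = rng.permutation(4)
permuted = multivariate_normal(mean=np.zeros(4), cov=K[np.ix_(perm, perm)])
print(np.isclose(joint.pdf(y), permuted.pdf(y[perm])))  # True
```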
SPs in Progress and Primary Concerns

Crucial properties for SPs:
◮ Scalability to large-scale datasets → optimization/computational bottleneck
◮ Flexibility in distributions → non-Gaussian or multi-modal properties
◮ Extension to high dimensions → correlations among or across inputs/outputs

Analysis of GPs/NPs:
Gaussian Processes (GPs)
◮ less scalable, with computational complexity O(N^3)
◮ less flexible, restricted to Gaussian distributions
Neural Processes (NPs)
◮ more scalable, with computational complexity O(N)
◮ more flexible, with no explicit distributional form
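To make the scalability contrast concrete, here is a minimal sketch (not from the talk): exact GP prediction pays O(N^3) for factorizing the N x N context Gram matrix, whereas an NP-style model only mean-pools per-point encodings of the context in O(N). The encoder/decoder weights below are untrained placeholders, not the paper's architecture.

```python
# Minimal sketch of the complexity contrast; all network weights are
# placeholders for illustration only.
import numpy as np

def gp_predict(x_ctx, y_ctx, x_tgt, lengthscale=1.0, noise=0.1):
    """Exact GP regression: the Cholesky of the N x N Gram matrix is the O(N^3) step."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)
    K = k(x_ctx, x_ctx) + noise ** 2 * np.eye(len(x_ctx))
    L = np.linalg.cholesky(K)                                  # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_ctx))
    return k(x_tgt, x_ctx) @ alpha                             # predictive mean

def np_style_predict(x_ctx, y_ctx, x_tgt, dim=32, seed=1):
    """NP-style prediction: encode each context pair, mean-pool, decode targets; O(N)."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(size=(2, dim)) / np.sqrt(2)             # placeholder encoder
    W_dec = rng.normal(size=(dim + 1, 1)) / np.sqrt(dim + 1)   # placeholder decoder
    h = np.tanh(np.stack([x_ctx, y_ctx], axis=1) @ W_enc)      # per-point encodings
    r = h.mean(axis=0)                                         # O(N) permutation-invariant pooling
    inp = np.concatenate([np.tile(r, (len(x_tgt), 1)), x_tgt[:, None]], axis=1)
    return (inp @ W_dec).ravel()                               # predictive mean (untrained)

x_c = np.linspace(-2, 2, 50); y_c = np.sin(x_c)
x_t = np.linspace(-2, 2, 5)
print(gp_predict(x_c, y_c, x_t))
print(np_style_predict(x_c, y_c, x_t))
```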
Study of SPs with LVMs
Deep Latent Variable Model as SPs

Here we present an implicit latent variable model for SPs.

Generation paradigm with (potentially correlated) latent variables:
\[
\underbrace{z_i}_{\text{index-dependent latent variable}} = \underbrace{\phi(x_i)}_{\text{deterministic term}} + \underbrace{\epsilon(x_i)}_{\text{stochastic term}} \tag{2.1}
\]
\[
\underbrace{y_i}_{\text{observation}} = \underbrace{\varphi(x_i, z_i)}_{\text{transformation}} + \underbrace{\zeta_i}_{\text{observation noise}} \tag{2.2}
\]

Predictive distribution in SPs: with the context C = {(x_i, y_i) | i = 1, 2, ..., N} and a target input x_T, the predictive distribution over the target latent variable
\[
p_\theta(z_T \mid x_C, y_C, x_T) = \frac{p(z_C, z_T)}{\int p(z_C, z_T) \, dz_C} \tag{2.3}
\]
is mostly intractable.
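The toy sketch below (not the paper's implementation) samples one function realization from the generation paradigm of Eqs. (2.1)-(2.2); φ, ε and ϕ are hand-picked placeholder functions here, whereas in the actual model they are parameterized by neural networks and the stochastic terms may be correlated across indices.

```python
# Minimal sketch of sampling from Eqs. (2.1)-(2.2); phi, eps and varphi are
# illustrative placeholders rather than the paper's learned networks.
import numpy as np

rng = np.random.default_rng(0)

def phi(x):                 # deterministic term phi(x_i)
    return np.sin(x)

def eps(x):                 # stochastic term eps(x_i): input-dependent noise
    return 0.1 * (1.0 + np.abs(x)) * rng.normal(size=x.shape)

def varphi(x, z):           # observation transformation varphi(x_i, z_i)
    return np.tanh(z) + 0.2 * x

x = np.linspace(-3.0, 3.0, 200)                       # indices x_i
z = phi(x) + eps(x)                                   # Eq. (2.1): index-dependent latents z_i
y = varphi(x, z) + 0.05 * rng.normal(size=x.shape)    # Eq. (2.2): observations y_i with noise zeta_i
print(x[:3], z[:3], y[:3])
```

Note that in this toy version the noise is independent across indices; capturing correlations among the z_i, as the implicit model allows, would require drawing ε jointly over all indices.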