Fast approximate inference in hybrid Bayesian networks using dynamic discretisation

Helge Langseth (1), David Marquez (2), Martin Neil (2)
(1) Dept. of Computer and Information Science, The Norwegian University of Science and Technology, Norway
(2) School of Electronic Engineering and Computer Science, Queen Mary, University of London, UK

IWINAC 2013, June 2013
Exact inference and continuous variables

The exact calculation procedure for BNs only supports a restricted set of "classical" distributional families:
- Continuous variables must have Gaussian distributions.
- Discrete variables may only have discrete parents.
- Gaussian parents enter their Gaussian children's conditional distributions through partial regression coefficients.

[Figure: a BN with discrete nodes X_1, X_2, X_3 and continuous nodes Y_1, Y_2, Y_3, each continuous node distributed N(µ_x, σ²_x).]
Requirements for efficient inference in BNs

A distribution family F over X must be closed under three operations:
- Restriction: f(x) ∈ F ⇒ f(y, E = e) ∈ F for any subset of variables {Y_1, ..., Y_k} ⊆ {X_1, ..., X_n} and E = X \ Y.
- Combination: f_1(x) ∈ F, f_2(y) ∈ F ⇒ f(x ∪ y) = f_1(x) · f_2(y) ∈ F.
- Elimination: f(y, z) ∈ F ⇒ ∫_y f(y, z) dy ∈ F for every Y ⊆ X and Z = X \ Y.

This is very convenient from an operational point of view, as all the operations required during the inference process can be carried out using a single data structure with bounded complexity.

One example is discrete variables, which gives the idea to use discretisation. By discretisation, we mean translating a continuous variable X into a discrete one, with labels that partition X's domain into hypercubes. E.g., ℝ maps into the states ω_x^(1) = (−∞, 0], ω_x^(2) = (0, 1], ω_x^(3) = (1, ∞). Thus, f(x) is replaced by a probability distribution P(X ∈ ω_x^(ℓ)).
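For discrete variables the three closure operations reduce to simple table manipulations. A minimal sketch in Python (the `Factor` class and variable names are illustrative, not from the talk):

```python
import numpy as np

class Factor:
    """A table over discrete variables: one numpy axis per variable."""
    def __init__(self, variables, table):
        self.variables, self.table = list(variables), np.asarray(table, dtype=float)

    def restrict(self, var, value):
        # Restriction: f(x) -> f(y, E = e); fix one variable, drop its axis
        i = self.variables.index(var)
        return Factor(self.variables[:i] + self.variables[i+1:],
                      np.take(self.table, value, axis=i))

    def combine(self, other):
        # Combination: pointwise product f1(x) * f2(y) over the union of variables
        allv = self.variables + [v for v in other.variables if v not in self.variables]
        def aligned(f):
            order = [v for v in allv if v in f.variables]
            t = np.transpose(f.table, [f.variables.index(v) for v in order])
            for i, v in enumerate(allv):
                if v not in f.variables:
                    t = np.expand_dims(t, i)   # size-1 axis for absent variables
            return t
        return Factor(allv, aligned(self) * aligned(other))

    def eliminate(self, var):
        # Elimination: sum out one variable
        i = self.variables.index(var)
        return Factor(self.variables[:i] + self.variables[i+1:],
                      self.table.sum(axis=i))

# P(A) and P(B | A) for binary A, B
pA = Factor(['A'], [0.6, 0.4])
pB_A = Factor(['A', 'B'], [[0.9, 0.1], [0.2, 0.8]])
joint = pA.combine(pB_A)        # P(A, B)
pB = joint.eliminate('A')       # P(B)
```

All three results are again tables of the same kind, which is exactly the closure property exploited by exact inference.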
Common ideas for discretization

Complexity increases with the number of discretized states, so we want an accurate yet efficient representation.
- "Equal width": each bin has the same length.
- "Equal mass": each bin has the same probability mass.

[Figure: a Gaussian density on (−4, 4) overlaid with the two binning schemes.]

Different behavior, but which one is "better"?
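The two schemes are easy to compare numerically. A small Python sketch for a standard Gaussian (the helper names are ours, not from the talk):

```python
import numpy as np
from scipy.stats import norm

def equal_width_edges(a, b, k):
    # k bins of identical length on [a, b]
    return np.linspace(a, b, k + 1)

def equal_mass_edges(dist, k):
    # edges at the k-quantiles, so every bin carries mass 1/k
    return dist.ppf(np.linspace(0.0, 1.0, k + 1))

dist = norm(0, 1)
ew = equal_width_edges(-4.0, 4.0, 10)
em = equal_mass_edges(dist, 10)

mass_ew = np.diff(dist.cdf(ew))   # varies strongly: large in the middle, tiny at the tails
mass_em = np.diff(dist.cdf(em))   # exactly 0.1 per bin, but bins are wide in the tails
```

Equal width wastes bins in the tails; equal mass has coarse resolution exactly where the density is flat, which is one motivation for adaptive schemes.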
Distance measure: Kullback-Leibler divergence

The Kullback-Leibler divergence from f to g is defined as

  D(f ‖ g) = ∫_x f(x) log( f(x) / g(x) ) dx.

With f̄ a discretization of f with hypercubes ω_ℓ, note that

  D(f ‖ f̄) = Σ_ℓ ∫_{x ∈ ω_ℓ} f(x) log( f(x) / f̄_ℓ ) dx.

Each term can be bounded (Kozlov & Koller, 1997):

  ∫_{x ∈ ω_ℓ} f(x) log( f(x) / f̄_ℓ ) dx
    ≤ |ω_ℓ| [ ((f̄_ℓ − f_ℓ^↓) / (f_ℓ^↑ − f_ℓ^↓)) · f_ℓ^↑ log( f_ℓ^↑ / f̄_ℓ )
            + ((f_ℓ^↑ − f̄_ℓ) / (f_ℓ^↑ − f_ℓ^↓)) · f_ℓ^↓ log( f_ℓ^↓ / f̄_ℓ ) ],

where f_ℓ^↓ and f_ℓ^↑ denote the minimum and maximum of f on ω_ℓ.
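The bound is cheap to evaluate and can be checked against exact quadrature. A sketch assuming the bound as reconstructed above, with f_ℓ^↓ and f_ℓ^↑ approximated as the min/max of f on a grid over the interval (function names are ours):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def interval_kl(f, a, b):
    # exact contribution of [a, b] to D(f || f_bar), plus the average density f_bar
    f_bar = quad(f, a, b)[0] / (b - a)
    kl = quad(lambda x: f(x) * np.log(f(x) / f_bar), a, b)[0]
    return kl, f_bar

def interval_kl_bound(f, a, b, n=1001):
    # chord (convexity) upper bound on the same term, built from the interval's
    # extreme densities f_lo, f_hi
    xs = np.linspace(a, b, n)
    f_lo, f_hi = f(xs).min(), f(xs).max()
    _, f_bar = interval_kl(f, a, b)
    if f_hi - f_lo < 1e-15:
        return 0.0                      # f essentially constant: no error
    w = (f_bar - f_lo) / (f_hi - f_lo)
    return (b - a) * (w * f_hi * np.log(f_hi / f_bar)
                      + (1.0 - w) * f_lo * np.log(f_lo / f_bar))

f = norm(0, 1).pdf
exact, _ = interval_kl(f, -1.0, 0.0)
bound = interval_kl_bound(f, -1.0, 0.0)   # bound >= exact
```

The bound follows from convexity of t ↦ t log(t / f̄_ℓ): on [f_ℓ^↓, f_ℓ^↑] the chord lies above the function, so integrating the chord over the interval over-counts the true KL contribution.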
A KL-based strategy

Efficient calculation of f_ℓ^↓ and f_ℓ^↑ (Neil et al., 2007):

[Figure: a density on (0, 2) with per-interval lower and upper envelopes.]

Obvious approach to discretize a univariate:
1. Roughly initialize, then calculate the KL bound for each interval ω_ℓ.
2. Choose the "worst" interval wrt. the KL bound, and insert a new split-point in the middle of that interval.
3. Calculate KL bounds for the two new intervals and their neighbors. (The bounds for the other intervals are unchanged.)
4. Go to 2.
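The loop above can be sketched directly. This toy version recomputes every interval's score each round (the strategy above only updates the two new intervals and their neighbors), and for brevity scores intervals by their exact KL contribution rather than the bound:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def kl_score(f, a, b):
    # contribution of [a, b] to D(f || f_bar); a stand-in for the KL bound
    f_bar = quad(f, a, b)[0] / (b - a)
    return quad(lambda x: f(x) * np.log(f(x) / f_bar), a, b)[0]

def greedy_discretise(f, lo, hi, n_splits):
    edges = [lo, hi]
    for _ in range(n_splits):
        scores = [kl_score(f, edges[i], edges[i + 1]) for i in range(len(edges) - 1)]
        worst = int(np.argmax(scores))                       # step 2: worst interval...
        mid = 0.5 * (edges[worst] + edges[worst + 1])        # ...split at its midpoint
        edges.insert(worst + 1, mid)
    return edges

edges = greedy_discretise(norm(0, 1).pdf, -4.0, 4.0, 9)      # 10 intervals
```

Each split can only decrease the total divergence, so the greedy loop monotonically improves the discretization.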
A KL-based strategy – Results

Results:
- "Optimal" results (approximated through simulated annealing).
- Results from the proposed strategy.

[Figure: (a) discretisation with 10 intervals; (b) discontinuity points, 24 splits.]

The proposed method focuses too much on the steepest areas:
- The bound is looser there;
- The approximations of f_ℓ^↓ and f_ℓ^↑ are less accurate when |f′| is large.
Discretization of a Bayes net

Discretization of a full Bayesian net is difficult because...
1. When discretizing a variable, we are also determining how we can discretize its children in the model:
   - In the model X → Y, assume X is Uniform(0, 1) and Y | {X = x} ∼ N(x, σ²). Let X be discretized into the two intervals ω_x^(1) = (0, 1/2] and ω_x^(2) = (1/2, 1].
   - The conditional distribution for Y in the discretized model can only be defined through P(Y ∈ ω_y^(·) | X ∈ ω_x^(j)) for j = 1, 2.
   - Therefore, it can be impossible to capture the correlation between X and Y, in particular if σ is small. This is the case no matter how many intervals are used to discretize Y.
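The loss of correlation is easy to see by simulation. A sketch with σ = 0.01 (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.01
x = rng.uniform(0.0, 1.0, 200_000)     # X ~ Uniform(0, 1)
y = rng.normal(x, sigma)               # Y | X = x  ~  N(x, sigma^2)

# condition only on the discretized parent: X in ω_x^(1) = (0, 1/2]
y_given_bin = y[x <= 0.5]

# within the bin, Y is smeared over the whole half-interval:
# sd(Y | X in ω_x^(1)) ≈ sd of Uniform(0, 1/2) ≈ 0.144, although sigma = 0.01
spread = y_given_bin.std()
```

Knowing only the bin of X, the model cannot predict Y any better than the bin-wide spread, no matter how finely Y itself is discretized.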
Discretization of a Bayes net

Discretization of a full Bayesian net is difficult because...
1. When discretizing a variable, we are also determining how we can discretize its children in the model.
2. A discretization that is clever before evidence is observed can be useless afterwards.
   - Assume X → Y, X ∼ N(0, 1) and Y | {X = x} ∼ N(x, 0.1²).

[Figure: (a) f(x); (b) f(x | y = 2).]
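This effect can be quantified: with the model above and y = 2 observed, the posterior for X is a narrow Gaussian near 2, and a 10-bin equal-mass discretization of the prior puts essentially all posterior mass into a single bin (the bin count here is arbitrary):

```python
import numpy as np
from scipy.stats import norm

# conjugate update: prior X ~ N(0, 1), likelihood Y | X = x ~ N(x, 0.1^2), observe y = 2
tau0, tau, y = 1.0, 1.0 / 0.1**2, 2.0
post_var = 1.0 / (tau0 + tau)
post_mean = tau * y * post_var             # close to 2
posterior = norm(post_mean, np.sqrt(post_var))

# 10 equal-mass bins of the PRIOR N(0, 1)
edges = norm(0, 1).ppf(np.linspace(0.0, 1.0, 11))
mass = np.diff(posterior.cdf(edges))       # posterior mass per prior-based bin
```

Nearly all posterior mass lands in the last prior bin (1.28, ∞), so the prior-based discretization offers no resolution at all where it is now needed.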
An apparently naïve approach

This apparently naïve approach to dynamic discretization was proposed by Neil et al. (2007):
1. Initialize by discretizing each continuous variable "roughly" based on its marginal. Continuous evidence nodes are discretized so that there is one interval closely around the observation.
2. Do a belief update in the discretized model.
3. For each unobserved continuous variable: if applicable, add one new split-point where it helps that marginal the most.
4. If we are not finished: go to step 2.
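For the two-node model X → Y the loop can be sketched as follows. This is a toy, not the talk's implementation: the x-integral uses a midpoint approximation, only Y is refined, the split score is simply the bin's probability mass (a stand-in for the KL-based score on the marginal), and there is no evidence:

```python
import numpy as np
from scipy.stats import norm

def y_marginal(x_edges, y_edges, sx, sy):
    # P(Y in each ω_y), approximating the integral over x at bin midpoints
    px = np.diff(norm.cdf(x_edges, loc=0.0, scale=sx))
    mids = 0.5 * (x_edges[:-1] + x_edges[1:])
    py = np.zeros(len(y_edges) - 1)
    for p, m in zip(px, mids):
        py += p * np.diff(norm.cdf(y_edges, loc=m, scale=sy))
    return py

def dynamic_discretise(sx, sy, span, n_iter):
    x_edges = np.linspace(-span, span, 5)    # step 1: rough initialization
    y_edges = np.linspace(-span, span, 5)
    for _ in range(n_iter):
        py = y_marginal(x_edges, y_edges, sx, sy)    # step 2: belief update
        i = int(np.argmax(py))                        # step 3: split where it helps most
        y_edges = np.insert(y_edges, i + 1, 0.5 * (y_edges[i] + y_edges[i + 1]))
    return y_edges, y_marginal(x_edges, y_edges, sx, sy)

y_edges, py = dynamic_discretise(1.0, 0.5, 4.0, 20)
```

The key point is that the discretization of each variable adapts to its current marginal after each belief update, rather than being fixed in advance.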
An apparently naïve approach

Mathematical property: this algorithm minimizes Σ_i D( f(x_i) ‖ f̄(x_i) ) instead of

  D( f(x) ‖ f̄(x) ) = Σ_i D( f(x_i | pa(x_i)) ‖ f̄(x_i | pa(x_i)) ).
An apparently naïve approach

Stress-test: a worst-case scenario for the naïve algorithm is the model X → Y, X ∼ N(0, σ_x²) and Y | {X = x} ∼ N(x, σ_y²) with σ_x² ≫ σ_y².
Results of the stress-test

Model: X → Y, X ∼ N(0, 10^10) and Y | {X = x} ∼ N(x, 10^(−6)).
Task: calculate f(y), although we know it is N(0, 10^10 + 10^(−6)).

Vanilla version of the algorithm:

[Figure: the estimated f(y), which is very jagged.]

Unsatisfactory: the result is way too "bumpy".
Results of the stress-test

Examination of the error shows the problem is due to numerical instability when we calculate

  P(Y ∈ ω_y | X ∈ ω_x) ∝ ∫_{x ∈ ω_x} [ ∫_{y ∈ ω_y} f(y | x) dy ] f(x) dx:

the inner integral has small support as a function of x, which makes the outer integral difficult.

We propose a smoothing technique based on the tempering used in MCMC. It salvages the numerical problems without a significant increase in the computational burden.
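The tempering scheme itself is described in the paper; the instability it addresses can be reproduced in a few lines. With σ_y tiny, the inner integral over ω_y is, as a function of x, a spike only a few σ_y wide, and a fixed quadrature grid over ω_x simply steps over it (grid sizes and intervals below are illustrative, with f(x) taken flat for clarity):

```python
import numpy as np
from scipy.stats import norm

sy = 1e-3
y_lo, y_hi = 0.0, 1e-3          # a narrow y-interval ω_y
x_lo, x_hi = -0.5, 1.5          # the x-interval ω_x

def inner(x):
    # ∫_{y in ω_y} f(y | x) dy: nonzero only for x within a few sy of ω_y
    return norm.cdf(y_hi, loc=x, scale=sy) - norm.cdf(y_lo, loc=x, scale=sy)

def midpoint_rule(n):
    # n-point midpoint rule for ∫_{ω_x} inner(x) dx
    xs = x_lo + (np.arange(n) + 0.5) * (x_hi - x_lo) / n
    return inner(xs).mean() * (x_hi - x_lo)

# By Fubini the true value is (essentially) y_hi - y_lo = 1e-3,
# but a 100-point grid misses the spike entirely and returns ~0.
coarse = midpoint_rule(100)
fine = midpoint_rule(200_000)
```

Tempering smooths f(y | x) so that the effective support of the inner integral widens and a moderate grid captures it; the sketch above only shows why the untreated integral is hard.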
Results of the stress-test

Tempering/smoothing version of the algorithm:

[Figure: the estimated f(y), now smooth.]

Satisfactory: the result is close to the correct result.