Graphical Model

  x ∼ Normal(1, √5)
  y_i | x ∼ Normal(x, √2)
  y_1 = 9, y_2 = 8
  x | y ∼ Normal(7.25, 0.91)

(defquery gaussian-model [data]
  (let [x     (sample (normal 1 (sqrt 5)))
        sigma (sqrt 2)]
    (map (fn [y] (observe (normal x sigma) y)) data)
    x))

(def dataset [9 8])

(def posterior
  ((conditional gaussian-model :pgibbs :number-of-particles 1000) dataset))

(def posterior-samples (repeatedly 20000 #(sample posterior)))
Anglican: Syntax ≈ Clojure, Semantics ≠ Clojure

(Same gaussian-model query and code as the previous slide: Anglican queries are written in Clojure syntax and run on the JVM, but sample and observe give them different semantics.)
Bayes Net

(defquery sprinkler-bayes-net [sprinkler wet-grass]
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler-dist (cond (= is-cloudy true)  (flip 0.1)
                             (= is-cloudy false) (flip 0.5))
        wet-grass-dist (cond (and (= sprinkler true)  (= is-raining true))  (flip 0.99)
                             (and (= sprinkler false) (= is-raining false)) (flip 0.0)
                             (or  (= sprinkler true)  (= is-raining true))  (flip 0.9))]
    (observe sprinkler-dist sprinkler)
    (observe wet-grass-dist wet-grass)
    is-raining))
One Hidden Markov Model

(Graphical model: x_0 → x_1 → x_2 → x_3 → · · ·, with observations y_1, y_2, y_3.)

(defquery hmm
  (let [init-dist  (discrete [1 1 1])
        trans-dist (fn [s] (cond (= s 0) (discrete [0 1 1])
                                 (= s 1) (discrete [0 0 1])
                                 (= s 2) (dirac 2)))
        obs-dist   (fn [s] (normal s 1))
        y-1 1
        y-2 1
        x-0 (sample init-dist)
        x-1 (sample (trans-dist x-0))
        x-2 (sample (trans-dist x-1))]
    (observe (obs-dist x-1) y-1)
    (observe (obs-dist x-2) y-2)
    [x-0 x-1 x-2]))
All Hidden Markov Models

(Graphical model: x_0 → x_1 → x_2 → x_3 → · · ·, with observations y_1, y_2, y_3.)

(defquery hmm [ys init-dist trans-dists obs-dists]
  (reduce (fn [xs y]
            (let [x (sample (get trans-dists (peek xs)))]
              (observe (get obs-dists x) y)
              (conj xs x)))
          [(sample init-dist)]
          ys))
New Primitives

(State diagram: from each state n = 0, 1, 2, …, return n with probability p or move to n + 1 with probability 1 − p.)

(defquery geometric [p]
  "geometric distribution"
  (let [dist (flip p)
        samp (loop [n 0]
               (if (sample dist) n (recur (+ n 1))))]
    samp))
A Hard Inference Problem

(defquery md5-inverse [L md5str]
  "conditional distribution of strings that map to the same MD5 hashed string"
  (let [mesg (sample (string-generative-model L))]
    (observe (dirac md5str) (md5 mesg))
    mesg))
Evaluation-Based Inference for Higher-Order PPLs
The Gist

• Explore as many “traces” as possible, intelligently
  • Each trace contains all random choices made during the execution of a generative model
  • Compute trace “goodness” (probability) as a side effect
  • Combine weighted traces in a probabilistically coherent way
  • Report projection of posterior over traces
• If it’s going to be “hard,” let’s at least make it fast
  • First generation – interpreted
  • Second generation – compiled
Traces

(Trace tree: the draw of x-1 from (discrete (repeat 3 1)) branches the execution into x-1 = 0, 1, 2; on branches where x-1 ≠ 1, x-2 is then drawn from (poisson (+ x-1 7)), i.e. (poisson 7) or (poisson 9).)

(let [t-1 3
      x-1 (sample (discrete (repeat t-1 1)))]
  (if (not= x-1 1)
    (let [t-2 (+ x-1 7)
          x-2 (sample (poisson t-2))])))
Goodness of Trace

(Same trace tree, with each leaf x-2 = v now scored by its observe density (normpdf v 1 0.0001).)

(let [t-1 3
      x-1 (sample (discrete (repeat t-1 1)))]
  (if (not= x-1 1)
    (let [t-2 (+ x-1 7)
          x-2 (sample (poisson t-2))]
      (observe (gaussian x-2 0.0001) 1))))
Trace

• Sequence of N observe's encountered during execution: {(g_i, φ_i, y_i)}, i = 1, …, N
• Sequence of M sample's: {(f_j, θ_j)}, j = 1, …, M
• Sequence of M sampled values: {x_j}, j = 1, …, M
• Conditioned on these sampled values, the entire computation is deterministic
Trace Probability

• Defined as (up to a normalization constant)

    γ(x) ≜ p(x, y) = ∏_{i=1}^{N} g_i(y_i | φ_i) ∏_{j=1}^{M} f_j(x_j | θ_j)

• This hides the true dependency structure; writing x_{1:j} = x_1 × · · · × x_j for the sampled values up to the j-th sample,

    γ(x) = p(x, y) = ∏_{i=1}^{N} g_i(y_i | φ̃_i(x_{1:n_i})) ∏_{j=1}^{M} f_j(x_j | θ̃_j(x_{1:j−1}))

(Dependency diagram over x_1, …, x_6 and y_1, y_2 omitted.)
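As a concrete illustration (a sketch, not Anglican library code), log γ(x) can be computed from a recorded trace in plain Clojure if each sample and observe record carries its log-density function and value; the record shapes and the log-normal-pdf helper below are assumptions made only for this example.

(defn trace-log-prob
  "Log of gamma(x): sum of log f_j(x_j | theta_j) over samples plus
   log g_i(y_i | phi_i) over observes, for one recorded trace."
  [{:keys [samples observes]}]
  (+ (reduce + 0.0 (map (fn [{:keys [log-pdf value]}] (log-pdf value)) samples))
     (reduce + 0.0 (map (fn [{:keys [log-pdf value]}] (log-pdf value)) observes))))

;; hand-rolled log density used only for this example
(defn log-normal-pdf [mu sigma]
  (fn [x]
    (- (- (/ (Math/pow (- x mu) 2) (* 2 sigma sigma)))
       (Math/log (* sigma (Math/sqrt (* 2 Math/PI)))))))

;; a trace of the earlier gaussian-model with x = 6.0 and observations 9 and 8
(trace-log-prob
 {:samples  [{:log-pdf (log-normal-pdf 1 (Math/sqrt 5)) :value 6.0}]
  :observes [{:log-pdf (log-normal-pdf 6.0 (Math/sqrt 2)) :value 9}
             {:log-pdf (log-normal-pdf 6.0 (Math/sqrt 2)) :value 8}]})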
Inference Goal

• Posterior over traces

    π(x) ≜ p(x | y) = γ(x) / Z,    Z = p(y) = ∫ γ(x) dx

• Output

    E_π[Q(x)] = ∫ Q(x) π(x) dx = (1/Z) ∫ Q(x) (γ(x) / q(x)) q(x) dx
Three Base Algorithms

• Likelihood Weighting
• Sequential Monte Carlo
• Metropolis Hastings
Likelihood Weighting

• Run K independent copies of the program, simulating from the prior

    q(x^k) = ∏_{j=1}^{M^k} f_j(x_j^k | θ_j^k)

• Accumulate unnormalized weights (likelihoods)

    w(x^k) = γ(x^k) / q(x^k) = ∏_{i=1}^{N^k} g_i(y_i^k | φ_i^k)

• Use in approximate (Monte Carlo) integration

    E_π[Q(x)] ≈ ∑_{k=1}^{K} W^k Q(x^k),    W^k = w(x^k) / ∑_{ℓ=1}^{K} w(x^ℓ)

BLOG default inference engine: http://bayesianlogic.github.io/pages/users-manual.html
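A minimal sketch of the last step (the helper below is an assumption, not part of the Anglican API): given K traces paired with their unnormalized weights, the self-normalized estimate is a weighted average.

(defn likelihood-weighting-estimate
  "E_pi[Q(x)] ~= sum_k W^k Q(x^k), where W^k = w(x^k) / sum_l w(x^l)."
  [q-fn weighted-traces]                       ; seq of [x^k w(x^k)] pairs
  (let [total (reduce + (map second weighted-traces))]
    (reduce + (map (fn [[x w]] (* (/ w total) (q-fn x)))
                   weighted-traces))))

;; e.g. the posterior mean of a scalar returned by each trace:
(likelihood-weighting-estimate identity [[1.0 0.2] [2.0 0.5] [3.0 0.3]])
;; => 2.1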
Likelihood Weighting Schematic

(Schematic: K independent weighted samples (z_1, w_1), (z_2, w_2), …, (z_K, w_K).)
Sequential Monte Carlo

• Notation: x̃_{1:n} = x̃_1 × · · · × x̃_n, where the x̃_n are disjoint subspaces of x (the variables sampled between consecutive observes)

(Dependency diagram over x_1, …, x_6 and y_1, y_2 omitted.)

• Incrementalized joint

    γ_n(x̃_{1:n}) = ∏_{i=1}^{n} g(y_i | x̃_{1:i}) p(x̃_i | x̃_{1:i−1})

• Incrementalized target

    π_n(x̃_{1:n}) = (1/Z_n) γ_n(x̃_{1:n})
SMC for Probabilistic Programming

Want samples from

    π_n(x̃_{1:n}) ∝ p(y_n | x̃_{1:n}) p(x̃_n | x̃_{1:n−1}) π_{n−1}(x̃_{1:n−1})

Have a sample-based approximation

    π̂_{n−1}(x̃_{1:n−1}) = ∑_{k=1}^{K} W^k_{n−1} δ_{x̃^k_{1:n−1}}(x̃_{1:n−1})

Sample from

    x̃^{a^k_{n−1}}_{1:n−1} ∼ π̂_{n−1}(x̃_{1:n−1}),    x̃^k_n ∼ p(x̃_n | x̃^{a^k_{n−1}}_{1:n−1}),    x̃^k_{1:n} = x̃^{a^k_{n−1}}_{1:n−1} × x̃^k_n

Importance weight by

    w(x̃^k_{1:n}) = p(y_n | x̃^k_{1:n}) = g^k_n(y_n | x̃^k_{1:n}),    W^k_n = w(x̃^k_{1:n}) / ∑_{k′=1}^{K} w(x̃^{k′}_{1:n})

Wood, van de Meent, and Mansinghka “A New Approach to Probabilistic Programming Inference” AISTATS 2014
Paige and Wood “A Compilation Target for Probabilistic Programming Languages” ICML 2014
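A minimal sketch of the resampling step (not the Anglican smc implementation): draw K particles with replacement, with probability proportional to their incremental weights w(x̃^k_{1:n}); the resulting equally-weighted set is then extended to the next observe.

(defn resample
  "Multinomial resampling: returns K particles drawn with replacement
   in proportion to their weights."
  [particles weights]                          ; parallel collections, length K
  (let [total (reduce + weights)
        cdf   (vec (reductions + (map #(/ % total) weights)))
        pick  (fn [] (let [u (rand)]
                       (nth particles (count (take-while #(< % u) cdf)))))]
    (vec (repeatedly (count particles) pick))))

;; high-weight particles are typically duplicated several times:
(resample [:a :b :c] [0.5 9.0 0.5])
;; => e.g. [:b :b :c]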
SMC Schematic

Intuitively:
• run threads
• wait / weight at each observe
• continue via continuations
Metropolis Hastings = “Single Site” MCMC = LMH

Posterior distribution of execution traces is proportional to the trace score with observed values plugged in

    π(x) ≜ p(x | y) = γ(x) / Z,    γ(x) ≜ p(x, y) = ∏_{i=1}^{N} g_i(y_i | φ_i) ∏_{j=1}^{M} f_j(x_j | θ_j)

Metropolis-Hastings acceptance rule

    α = min( 1, π(x′) q(x | x′) / (π(x) q(x′ | x)) )

• Need a proposal

Milch and Russell “General-Purpose MCMC Inference over Relational Structures.” UAI 2006.
Goodman, Mansinghka, Roy, Bonawitz, and Tenenbaum “Church: a language for generative models.” UAI 2008.
Wingate, Stuhlmüller, Goodman “Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation” AISTATS 2011
LMH Proposal

    q(x′ | x) = (1/M) f′_ℓ(x′_ℓ | θ′_ℓ) ∏_{j=ℓ+1}^{M′} f′_j(x′_j | θ′_j)

where 1/M is the probability of choosing the ℓ-th of the M samples in the original trace, and the product over j > ℓ is the probability of the new part of the proposed execution trace.
LMH Acceptance Ratio

“Single site update” = sample from the prior = run program forward

    κ(x′_m | x_m) = f_m(x′_m | θ_m),    θ_m = θ′_m

MH acceptance ratio

    α = min( 1, [ γ(x′) · M · ∏_{j=m}^{M} f_j(x_j | θ_j) ] / [ γ(x) · M′ · ∏_{j=m}^{M′} f′_j(x′_j | θ′_j) ] )

where M and M′ are the numbers of sample statements in the original and new traces; the product in the numerator is the probability of the original-trace continuation (restarting the proposal trace at the m-th sample), and the product in the denominator is the probability of the proposal-trace continuation (restarting the original trace at the m-th sample).
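In code (a sketch with an assumed trace representation, not the Anglican lmh backend), the log of this ratio can be computed from the two traces if each carries its log joint and the per-sample log prior terms; m below is the 0-based index of the resampled choice.

(defn lmh-log-accept-ratio
  "log alpha for an LMH move that resamples the m-th choice from its prior
   and re-runs the rest of the program.  Traces are maps with
   :log-gamma (log joint) and :log-fs (per-sample log f_j(x_j | theta_j))."
  [old-trace new-trace m]
  (let [tail (fn [{:keys [log-fs]}] (reduce + 0.0 (drop m log-fs)))]
    (min 0.0
         (+ (- (:log-gamma new-trace) (:log-gamma old-trace))
            (- (Math/log (count (:log-fs old-trace)))
               (Math/log (count (:log-fs new-trace))))
            (- (tail old-trace) (tail new-trace))))))

;; accept the proposed trace when (Math/log (rand)) is below this value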
LMH Schematic

(Schematic: a Markov chain of traces z_1, z_1, z_3, …, z_K; rejected proposals repeat the previous trace.)
Implementation Strategy

• Interpreted
  • Interpreter tracks side effects and directs control flow for inference
• Compiled
  • Leverages existing compiler infrastructure
  • Can only exert control over flow from within function calls
    • e.g. sample, observe, predict

Wingate, Stuhlmüller, Goodman “Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation” AISTATS 2011
Paige and Wood “A Compilation Target for Probabilistic Programming Languages” ICML 2014
Probabilistic C

Standard C plus new directives: observe and predict
• observe constrains program execution
• predict emits sampled values
Probabilistic C Implementation

Actually:
• run processes
• wait / weight at each observe
• fork new processes / continuations

Paige and Wood “A Compilation Target for Probabilistic Programming Languages” ICML 2014
Continuations

• A continuation is a function that encapsulates the “rest of the computation”
• A Continuation Passing Style (CPS) transformation rewrites programs so that
  • no function ever returns
  • every function takes an extra argument, a function called the continuation
• Standard programming language technique
• No limitations

Friedman and Wand. “Essentials of programming languages.” MIT press, 2008.
Fischer, Kiselyov, and Shan “Purely functional lazy non-deterministic programming” ACM SIGPLAN 2009
Goodman and Stuhlmüller http://dippl.org/ 2014
Tolpin https://bitbucket.org/probprog/anglican/ 2014
Example CPS Transformation

;; Standard Clojure:
(println (+ (* 2 3) 4))

;; CPS transformed:
;; (fn [x] ...) is the first continuation (passed to *&);
;; println is the second continuation (passed to +&)
(*& 2 3 (fn [x] (+& x 4 println)))

;; CPS-transformed "primitives"
(defn +& [a b k] (k (+ a b)))
(defn *& [a b k] (k (* a b)))
CPS Explicitly Linearizes Execution

(defn pythag& "compute sqrt(x^2 + y^2)" [x y k]
  (square& x                          ; xx = x^2
    (fn [xx]
      (square& y                      ; yy = y^2
        (fn [yy]
          (+& xx yy                   ; xxyy = xx + yy
            (fn [xxyy]
              (sqrt& xxyy k))))))))   ; result = sqrt(xxyy)

• Compiling to a pure language with lexical scoping ensures
  A. variables needed in subsequent computation are bound in the environment
  B. they can’t be modified by multiple calls to the continuation function
Anglican Programs

;; Anglican
(defquery flip-example [outcome]
  (let [p (sample (uniform-continuous 0 1))]
    (observe (flip p) outcome)
    (predict :p p)))

(flip-example true)

;; Anglican “linearized” (same query body, with intermediate values named)
(let [u    (uniform-continuous 0 1)
      p    (sample u)
      dist (flip p)]
  (observe dist outcome)
  (predict :p p))
Are “Compiled” to Native CPS-Clojure

;; Anglican “linearized”
(let [u    (uniform-continuous 0 1)
      p    (sample u)
      dist (flip p)]
  (observe dist outcome)
  (predict :p p))

;; Clojure (CPS)
(defn flip-query& [outcome k1]
  (uniform-continuous& 0 1
    (fn [dist1]
      (sample& dist1
        (fn [p] ((fn [p k2]
                   (flip& p
                     (fn [dist2]
                       (observe& dist2 outcome
                         (fn []
                           (predict& :p p k2))))))
                 p k1))))))

;; CPS-ed distribution constructors
(defn uniform-continuous& [a b k] (k (uniform-continuous a b)))
(defn flip& [p k] (k (flip p)))
Explicit Functional Form for “Rest of Program”

;; the nested (fn ...) forms are the continuation functions
(defn flip-query& [outcome k1]
  (uniform-continuous& 0 1
    (fn [dist1]
      (sample& dist1
        (fn [p] ((fn [p k2]
                   (flip& p
                     (fn [dist2]
                       (observe& dist2 outcome
                         (fn []
                           (predict& :p p k2))))))
                 p k1))))))
Interruptible

(Same flip-query& code as the previous slide, with the Anglican primitives sample& and observe& highlighted: each receives the rest of the program as an explicit continuation, so execution can be suspended and resumed at these points.)
Controllable

(Same flip-query& code again: the sample&, observe&, and predict& calls form the inference “backend” interface, so different inference algorithms can be plugged in behind them.)

webPPL CPS-compiles to pure functional JavaScript in the same way.
Inference “Backend”

;; Implement a "backend"
(defn sample& [dist k]
  ;; [ ALGORITHM-SPECIFIC IMPLEMENTATION HERE ]
  ;; Pass the sampled value to the continuation
  (k (sample dist)))

(defn observe& [dist value k]
  (println "log-weight =" (observe dist value))
  ;; [ ALGORITHM-SPECIFIC IMPLEMENTATION HERE ]
  ;; Call continuation with no arguments
  (k))

(defn predict& [label value k]
  ;; [ ALGORITHM-SPECIFIC IMPLEMENTATION HERE ]
  (k label value))
Common Framework

Pure compiled deterministic computation segments (P_start, P_continue, …, P_terminate) alternate with calls into the inference “backend”:

• sample (f, θ, k) — the backend resumes the program by calling (k x) with a value x
• observe (g, φ, y, k) — the backend resumes the program by calling (k)
• predict (z, k) — the backend resumes the program by calling (k)
• terminate — the program ends

Here f and g are distribution constructors, θ and φ their parameter vectors, y an observed value, z a predicted value, and k the continuation.
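One way to realize this interface (a sketch under assumed conventions, not Anglican's actual runtime) is to have the compiled program return checkpoint maps instead of calling the backend directly; a driver then loops over checkpoints, accumulating the log weight and resuming via the stored continuation.

(defn run-trace
  "Drives one execution of a compiled program represented as checkpoint maps
   {:type :sample/:observe/:predict/:terminate, :cont k, ...}."
  [checkpoint]
  (loop [cp checkpoint, log-weight 0.0, predicts []]
    (case (:type cp)
      :sample    (recur ((:cont cp) ((:sampler cp)))              ; draw x, resume with (k x)
                        log-weight predicts)
      :observe   (recur ((:cont cp))                              ; resume with (k)
                        (+ log-weight ((:log-pdf cp) (:value cp)))
                        predicts)
      :predict   (recur ((:cont cp))
                        log-weight
                        (conj predicts [(:label cp) (:value cp)]))
      :terminate {:log-weight log-weight :predicts predicts})))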
Likelihood Weighting “Backend”

(defn sample& [dist k]
  ;; Call the continuation with a sampled value
  (k (sample dist)))

(defn observe& [dist value k]
  ;; Compute and record the log weight
  (add-log-weight! (observe dist value))
  ;; Call the continuation with no arguments
  (k))

(defn predict& [label value k]
  ;; Store predict, and call continuation
  (store! label value)
  (k))
Likelihood Weighting Example

(Schematic: the compiled pure deterministic computation P_start, P_continue, …, terminate calls the backend at sample&, observe&, and predict&; at the sample, p ∼ U(0, 1); at the observe, the weight is updated as w ← p^{I(outcome = true)} (1 − p)^{I(outcome = false)}.)

(defquery flip-example [outcome]
  (let [p (sample (uniform-continuous 0 1))]
    (observe (flip p) outcome)
    (predict :p p)))
SMC Backend

(defn sample& [dist k]
  ;; Call the continuation with a sampled value
  (k (sample dist)))

(defn observe& [dist value k]
  ;; Block and wait for K calls to reach observe&
  ;; Compute weights
  ;; Use weights to subselect continuations to call
  ;; Call K sampled continuations (often multiple times)
  )

(defn predict& [label value k]
  ;; Store predict, and call continuation
  (store! label value)
  (k))
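If each of the K particles is represented explicitly as a pair of a continuation and a log weight, one SMC step between consecutive observes can be sketched synchronously, without the blocking machinery. This is an illustrative formulation only, not the Anglican smc backend: advance is an assumed helper that runs a continuation to its next observe and returns [new-continuation incremental-log-weight], and resample is the multinomial resampler sketched after the SMC slides above.

(defn smc-step
  "Advance K particles to the next observe, then resample them by their
   incremental weights and reset the weights to zero (i.e. equal)."
  [advance particles]                                 ; particles: [[k logw] ...]
  (let [advanced (mapv advance (map first particles)) ; [[k' logw'] ...]
        weights  (map #(Math/exp (second %)) advanced)]
    (mapv (fn [[k' _]] [k' 0.0])
          (resample advanced weights))))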
LMH Backend

(defn sample& [a dist k]
  (let [;; reuse previous value,
        ;; or sample from prior
        x (or (get-cache a) (sample dist))]
    ;; add to log-weight when reused
    (when (get-cache a)
      (add-log-weight! (observe dist x)))
    ;; store value and its log prob in trace
    (store-in-trace! a x dist)
    ;; continue with value x
    (k x)))

(defn observe& [dist value k]
  ;; Compute and record the log weight
  (add-log-weight! (observe dist value))
  ;; Call the continuation with no arguments
  (k))
LMH Variants

WebPPL, Anglican

• Wingate, Stuhlmüller, and Goodman. “Lightweight implementations of probabilistic programming languages via transformational compilation.” AISTATS 2011.
• Ritchie, Stuhlmüller, and Goodman. “C3: Lightweight Incrementalized MCMC for Probabilistic Programs using Continuations and Callsite Caching.” arXiv:1509.02151, 2015.
• Mansinghka, Selsam, and Perov. “Venture: a higher-order probabilistic programming platform with programmable inference.” arXiv:1404.0099, 2014.
Inference Improvements Relevant to Higher-Order PPLs
Add Hill Climbing

• PMCMC = MH with SMC proposals, e.g.
  • PIMH: “particle independent Metropolis-Hastings”
  • PGIBBS: “iterated conditional SMC”

(Schematic: repeated SMC sweeps, one per MCMC iteration.)

Andrieu, Doucet, Holenstein “Particle Markov chain Monte Carlo methods.” JRSSB 2010
Blockwise Anytime Algorithm

• PIMH is MH that accepts entire new particle sets with probability

    α^s_PIMH = min( 1, Ẑ* / Ẑ^{s−1} )

• Each SMC sweep computes a marginal likelihood estimate

    Ẑ = ∏_{n=1}^{N} Ẑ_n = ∏_{n=1}^{N} (1/K) ∑_{k=1}^{K} w(x̃^k_{1:n})

• And all particles can be used

    Ê_PIMH[Q(x)] = (1/S) ∑_{s=1}^{S} ∑_{k=1}^{K} W^{s,k} Q(x^{s,k})

Paige and Wood “A Compilation Target for Probabilistic Programming Languages” ICML 2014
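A minimal sketch of one PIMH transition (run-smc-sweep is an assumed function returning a map with :log-Z and :particles; this is not the Anglican pimh backend):

(defn pimh-transition
  "Propose a whole new SMC sweep and accept it with probability
   min(1, Z-hat-new / Z-hat-old)."
  [run-smc-sweep current]
  (let [proposal (run-smc-sweep)]
    (if (< (Math/log (rand))
           (- (:log-Z proposal) (:log-Z current)))
      proposal
      current)))

;; as on the slide, particles from every sweep can still contribute to the
;; final estimator, whether or not the sweep was accepted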
PMCMC For Probabilistic Programming Inference

Wood, van de Meent, Mansinghka “A new approach to probabilistic programming inference” AISTATS 2014
Remove Synchronization

(Animation: SMC in an LDS, slowed down for clarity.)
Particle Cascade

Asynchronously:
• simulate
• weight
• branch

(Schematic: particles at n = 1, n = 2, …, branching asynchronously.)

Paige, Wood, Doucet, Teh “Asynchronous Anytime Sequential Monte Carlo” NIPS 2014
Particle Cascade
Shared Memory Scalability: Multiple Cores
Distributed SMC: iPMCMC

(Figure: which of the nodes run CSMC at each MCMC iteration r.)

For each MCMC iteration r = 1, 2, …
1. Nodes c_j ∈ {1, …, M}, j = 1, …, P run CSMC; the rest run SMC
2. Each node m returns a marginal likelihood estimate Ẑ_m and candidate retained particle x′_{1:T,m}
3. A loop of Gibbs updates is applied to the retained particle indices:

       P(c_j = m | c_{1:P∖j}) = Ẑ_m 1(m ∉ c_{1:P∖j}) / ∑_{n=1}^{M} Ẑ_n 1(n ∉ c_{1:P∖j})

4. The retained particles for the next iteration are set: x′_{1:T,j}[r] = x′_{1:T,c_j}

Rainforth, Naesseth, Lindsten, Paige, van de Meent, Doucet, Wood “Interacting Particle Markov Chain Monte Carlo” ICML 2016
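A minimal sketch of the Gibbs update in step 3 (not the Anglican ipmcmc backend): draw a new CSMC node index c_j in proportion to the Ẑ_m of the nodes not already held by the other CSMC indices; z-hats and other-cs are assumed inputs.

(defn gibbs-update-cj
  "Draw c_j = m with probability proportional to Z-hat_m, excluding node
   indices in other-cs (assumes at least one eligible node)."
  [z-hats other-cs]
  (let [probs (map-indexed (fn [m z] (if (other-cs m) 0.0 z)) z-hats)
        total (reduce + probs)
        u     (* (rand) total)]
    (loop [m 0, acc 0.0]                        ; inverse-CDF draw
      (let [acc (+ acc (nth probs m))]
        (if (< u acc) m (recur (inc m) acc))))))

;; e.g. (gibbs-update-cj [1.2 0.3 2.5 0.8] #{0 3})
;; draws 1 or 2 with probabilities proportional to 0.3 and 2.5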
CSMC Exploitation / SMC Exploration
Inference Backends in Anglican

• 14+ algorithms
• Average 165 lines of code each
• Can implement and use without touching the core code base

Algorithm   Type    Lines of Code   Citation                            Description
smc         IS      127             Wood et al., AISTATS 2014           Sequential Monte Carlo
importance  IS      21              —                                   Likelihood weighting
pcascade    IS      176             Paige et al., NIPS 2014             Particle cascade: anytime asynchronous sequential Monte Carlo
pgibbs      PMCMC   121             Wood et al., AISTATS 2014           Particle Gibbs (iterated conditional SMC)
pimh        PMCMC   68              Wood et al., AISTATS 2014           Particle independent Metropolis-Hastings
pgas        PMCMC   179             van de Meent et al., AISTATS 2015   Particle Gibbs with ancestor sampling
lmh         MCMC    177             Wingate et al., AISTATS 2011        Lightweight Metropolis-Hastings
ipmcmc      MCMC    193             Rainforth et al., ICML 2016         Interacting PMCMC
almh        MCMC    320             Tolpin et al., ECML PKDD 2015       Adaptive scheduling lightweight Metropolis-Hastings
rmh*        MCMC    319             —                                   Random-walk Metropolis-Hastings
palmh       MCMC    66              —                                   Parallelised adaptive scheduling lightweight Metropolis-Hastings
plmh        MCMC    62              —                                   Parallelised lightweight Metropolis-Hastings
bamc        MAP     318             Tolpin et al., SoCS 2015            Bayesian Ascent Monte Carlo
siman       MAP     193             Tolpin et al., SoCS 2015            MAP estimation via simulated annealing
What Next?
Commercial Impact

INVREA — Make Better Decisions
https://invrea.com/plugin/excel/v1/download/
Symbolic Inference via Program Transformations

• Automated program transformations that simplify or eliminate inference (moving observes up and out)

;; before
(defquery beta-bernoulli [observation]
  (let [dist  (beta 1 1)
        theta (sample dist)
        like  (flip theta)]
    (observe like observation)
    (predict :theta theta)))

;; after ("Automatic Rao-Blackwellization")
(defquery beta-bernoulli [observation]
  (let [dist  (beta (if observation 2 1)
                    (if observation 1 2))
        theta (sample dist)]
    (predict :theta theta)))

Carette and Shan. “Simplifying Probabilistic Programs Using Computer Algebra.” T.R. 719, Indiana University, 2015.
Yang – Keynote Lecture, APLAS 2015.
Exact Inference via Compilation

Anglican:
(defquery simple []
  (def y (sample (flip 0.5)))
  (def z (if y (dirac 0) (dirac 1)))
  (observe z 0)
  y)

Figaro, etc.

Compiled representation:
x_1 ∼ δ_0.5, x_2 ∼ δ_⟦flip x_1⟧, x_3 ∼ P_{x_2}, x_4 ∼ δ_0, x_5 ∼ δ_⟦dirac x_4⟧, x_6 ∼ δ_1, x_7 ∼ δ_⟦dirac x_6⟧, x_8 ∼ δ_{if(x_3, x_5, x_7)}, x_9 ∼ δ_0, x_10 ∼ δ_⟦= x_8 x_9⟧
Variable elimination is then used to compute p(y) and p(x | y) exactly.

Cornish, Wood, and Yang “Efficient exact inference in discrete Anglican programs” in prep. 2016
Inference Compilation – FOPPLs

(Diagram: a probabilistic model over latents z_n, t_n with parameters w_0, w_1, w_2, and an inverse model — can we learn how to sample latents from the inverse model?)

• Target density π(x) = p(x | y); approximating family q(x | λ)
• Single dataset: argmin_λ D_KL(π || q_λ) — learn an importance sampling proposal
• Averaging over all possible datasets: argmin_η E_{p(y)}[ D_KL(π || q_{φ(η,y)}) ], with λ = φ(η, y) — learn a mapping from arbitrary datasets to λ
• … compiles away runtime costs of inference!

Paige, Wood “Inference Networks for Sequential Monte Carlo in Graphical Models” ICML 2016
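At test time the learned network is used as an importance-sampling proposal; a minimal sketch follows (all four function arguments are assumptions standing in for the trained network and the model, not the actual system's API):

(defn compiled-is-sample
  "Draw one importance sample using a learned proposal: lambda = phi(eta, y),
   x ~ q(x | lambda), log weight = log gamma(x, y) - log q(x | lambda)."
  [{:keys [propose sample-q log-q log-gamma]} y]
  (let [lambda (propose y)
        x      (sample-q lambda)
        logw   (- (log-gamma x y) (log-q x lambda))]
    {:x x :log-weight logw}))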
Compiled Inference Results Paige, Wood “Inference Networks for Sequential Monte Carlo in Graphical Models” ICML (2016).
Wrap Up
Learning Dichotomy

Supervised (model + data):
• Needs lots of labeled data
• Training is slow
• Uninterpretable model
• Fast at test time

Unsupervised (model + inference):
• Needs only unlabeled data
• No training
• Interpretable model
• Slow at test time
Unified Learning

Combine the two: the generative model plus inference produces the data used to train a fast supervised-style mapping from y back to x.
• Needs only unlabeled data
• Slow training
• Interpretable model
• Fast at test time
HOPPL Compiled Inference

p(letters | captcha)

Compiled inference: 1) Compilation (1 day)  2) Inference (1 second)
Classical inference: 1) Inference (20 minutes)

(defquery captcha [baseline-image]
  (let [num-letters (sample (u-d 4 7))
        x-offset    (sample (u-d min-x max-x))
        y-offset    (sample (u-d min-y max-y))
        distort-x   (sample (u-d 8 15))
        distort-y   (sample (u-d 8 15))
        kerning     (sample (u-d -1 3))
        letter-ids  (repeatedly num-letters
                                #(sample (u-d 0 dict-size)))
        letters     (get-letters letter-ids)
        rendered-image (render letters
                               x-offset
                               y-offset
                               kerning
                               distort-x
                               distort-y)]
    ;; ABC-style observe
    (observe (abc-dist rendered-image abc-sigma)
             baseline-image)
    (predict :letters letters)))

Probabilistic program + training {x, y} data → dynamically assembled RNN → trained RNN weights → compiled sequential importance sampling (1 particle), compared against sequential Monte Carlo (10k particles) and lightweight Metropolis-Hastings (10k iterations).

Example outputs:
num-letters = 5, letters = “gtRai”
num-letters = 4, letters = “dF6D”
num-letters = 6, letters = “q5ihGt”

Le, Baydin, Wood “Inference Compilation and Universal Probabilistic Programming” in prep 2016
Compiled HOPPL Models

x (program source code)     y (program output)
scene description           image
policy and world            observations and rewards
neural net structures       input/output pairs
simulator                   constraints
Wrap Up