  1. Semantics of Probabilistic and Differential Programming. Workshop on program transformations at NeurIPS. Christine Tasson (tasson@irif.fr), December 2019. Institut de Recherche en Informatique Fondamentale.

  2. Every programmer can perform data analysis: models are described as programs, and the key computations (inference and gradients) are delegated to the compiler.

  Probabilistic programming languages: BUGS (Spiegelhalter et al. 1995), BLOG (Milch et al. 2005), Church (Goodman et al. 2008), WebPPL (Goodman et al. 2014), Venture (Mansinghka et al. 2014), Anglican (Wood et al. 2015), Stan (Stan Development Team 2014), Hakaru (Narayanan et al. 2016), BayesDB (Mansinghka et al. 2017), Edward (Tran et al. 2017), Birch (Murray et al. 2018), Turing (Ge et al. 2018), Gen (Cusumano-Towner et al. 2019), Pyro (Bingham et al. 2019), ...

  Differential programming languages: Theano (Bergstra et al. 2010), TensorFlow 1.0 (Abadi et al. 2016, Yu et al. 2018), Tangent (van Merrienboer et al. 2018), Autograd (Maclaurin et al. 2015), TensorFlow Eager Mode (Shankar and Dobson 2017), Chainer (Tokui 2018), PyTorch (PyTorch 2018), JAX (Frostig et al. 2018), ...

  3. Probabilistic Programming Bayesian Inference

  4. Sampling. Idea: model probability distributions by programs.

      def plinko(n):
          if n == 0:
              return 0
          else:
              if coin():
                  return plinko(n - 1) + 1
              else:
                  return plinko(n - 1) - 1

  (Illustration: Galton board, by Matemateca, IME USP)

  5. Sampling. Idea: model probability distributions by programs.

      sample(plinko(4))
      > 2

  (same plinko code as on the previous slide)

  6. Sampling. Idea: model probability distributions by programs.

      sample(plinko(4))
      > 2
      nSample(plinko(4), 1000)
      plot(gaussian(0, 1))

  (same plinko code as on slide 4)
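These sampling primitives can be sketched in ordinary Python. Note that coin, nSample, and the thunking of plinko are assumptions here: the slides do not fix an implementation, so everything below is a minimal model built on the standard random module.

```python
import random
from collections import Counter

random.seed(0)  # for reproducibility of this sketch

def coin():
    # fair coin
    return random.random() < 0.5

def plinko(n):
    # n-step random walk: each coin flip moves the result by +1 or -1
    if n == 0:
        return 0
    elif coin():
        return plinko(n - 1) + 1
    else:
        return plinko(n - 1) - 1

def nSample(model, k):
    # k independent samples from a thunked model
    return [model() for _ in range(k)]

counts = Counter(nSample(lambda: plinko(4), 10000))
# plinko(4) takes values in {-4, -2, 0, 2, 4}; 0 is the most likely,
# with probability C(4, 2) / 2^4 = 6/16
```

The empirical histogram of plinko(n) approaches the familiar bell shape as n grows, which is what the slide's plot call is gesturing at.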

  7. What is Bayesian Inference? Gender bias (Laplace): Paris, from 1745 to 1770, f0 = 241 945 females out of B0 = 493 472 births (49%).

  8. What is Bayesian Inference? Gender bias (Laplace): Paris, from 1745 to 1770, f0 = 241 945 females out of B0 = 493 472 births (49%).

  What is the probability of being born female?
  • female births are independent and follow the same law with bias θ
  • the probability of getting f females out of B births is P(f | θ, B) = C(B, f) θ^f (1 − θ)^(B − f)

  Novelty: the bias θ of being born female itself follows a probability distribution.

  9. What is Bayesian Inference? Gender bias (Laplace): Paris, from 1745 to 1770, f0 = 241 945 females out of B0 = 493 472 births (49%).

  What is the probability of being born female?
  • female births are independent and follow the same law with bias θ
  • the probability of getting f females out of B births is P(f | θ, B) = C(B, f) θ^f (1 − θ)^(B − f)

  Novelty: the bias θ of being born female itself follows a probability distribution.

  Inference paradigm: what is the law of θ conditioned on f and B?
  • Sample θ from a postulated distribution π (prior)
  • Simulate data f from the outcome θ (likelihood)
  • Infer the distribution of θ (posterior) by Bayes' law:
    P(θ | f, B) = P(f | θ, B) π(θ) / ∫_θ P(f | θ, B) π(θ) dθ = α · P(f | θ, B) π(θ)
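Because the uniform prior is Beta(1, 1) and the binomial likelihood is conjugate to the Beta family, this posterior has a closed form, Beta(f0 + 1, B0 - f0 + 1). A quick arithmetic check with Laplace's numbers (plain Python, no slide code assumed):

```python
# With a uniform prior (Beta(1, 1)) and a binomial likelihood, the
# posterior over the bias theta is Beta(f0 + 1, B0 - f0 + 1) by conjugacy.
f0 = 241945
B0 = 493472

alpha = f0 + 1
beta = B0 - f0 + 1

post_mean = alpha / (alpha + beta)  # (f0 + 1) / (B0 + 2), about 0.49029
post_var = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1))
post_sd = post_var ** 0.5  # under 0.001: theta < 1/2 with near certainty
```

The tiny posterior standard deviation is Laplace's conclusion: with half a million births, a bias of exactly 1/2 is all but ruled out.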

  10. Conditioning and inference

      # model
      def fBirth(theta, B):
          if B == 0:
              return 0
          else:
              f = flip(theta)
              return f + fBirth(theta, B - 1)

      # parameter (prior)
      theta = uniform(0, 1)

      # data 1747-1783
      f0 = 241945
      B0 = 493472

      # inference (posterior)
      infer(fBirth, theta, f0, B0)

  Idea: adjust the distribution of theta by comparing simulations to the data.

  11. Inference by rejection sampling

      # prior: Unit -> S
      def guesser():
          return sample(uniform(0, 1))

      # predicate: int x int -> (S -> Boolean)
      def checker(f0, B0):
          return lambda theta: gBirth(theta, B0) == f0

      # infer: (Unit -> S) -> (S -> Boolean) -> S
      def rejection(guesser, pred):
          theta = guesser()
          if pred(theta):
              return theta
          else:
              return rejection(guesser, pred)

      rejection(guesser, checker(f0, B0))

  Problem: inefficient, hence other, approximate methods.
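On data of Laplace's size the exact-match predicate essentially never succeeds, which is the inefficiency the slide points out. A runnable sketch on toy data (3 females out of 5 births) shows the scheme itself working; flip, gBirth, and the loop-based rejection are assumed implementations, not the slide's exact code.

```python
import random

random.seed(0)  # for reproducibility of this sketch

def flip(theta):
    # Bernoulli(theta) sample
    return 1 if random.random() < theta else 0

def gBirth(theta, B):
    # simulate B births with bias theta and count the females
    return sum(flip(theta) for _ in range(B))

def guesser():
    # prior: Unit -> S
    return random.uniform(0.0, 1.0)

def checker(f0, B0):
    # predicate: int x int -> (S -> Boolean)
    return lambda theta: gBirth(theta, B0) == f0

def rejection(guesser, pred):
    # draw from the prior until the simulated data matches the observation
    # (a loop instead of the slide's recursion, to avoid deep call stacks)
    while True:
        theta = guesser()
        if pred(theta):
            return theta

# toy data: 3 females out of 5 births; the posterior is Beta(4, 3)
samples = [rejection(guesser, checker(3, 5)) for _ in range(2000)]
mean = sum(samples) / len(samples)  # close to the Beta(4, 3) mean 4/7
```

Even here the acceptance rate is only about 1/6 per prior draw; it shrinks rapidly as B0 grows, which is why the exact scheme does not scale to Laplace's numbers.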

  12. Inference by Metropolis-Hastings

  Infer θ by Bayes' law: P(θ | f, B) = α · P(f | θ, B) π(θ)

      # proportion: S x S -> float
      def proportion(x, y):
          return P(f | x, B0) / P(f | y, B0)

      # Metropolis-Hastings: int * int * int -> S
      def metropolis(n, f0, B0):
          if n == 0:
              return f0 / B0
          else:
              x = metropolis(n - 1, f0, B0)
              y = gaussian(x, 1)
              z = bernoulli(min(1, proportion(y, x)))
              if z == 0:
                  return x
              else:
                  return y
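A runnable sketch of this chain on the same toy data, with two deviations flagged up front: likelihoods are computed in log space for numerical safety, and the acceptance ratio is clamped with min(1, .) so the Bernoulli draw gets a valid probability. The gaussian random-walk proposal is as on the slide, with a smaller step size chosen for the toy posterior.

```python
import math
import random

random.seed(0)  # for reproducibility of this sketch

def log_lik(theta, f, B):
    # log P(f | theta, B) up to the constant binomial coefficient;
    # -inf outside (0, 1) encodes the uniform prior's support
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return f * math.log(theta) + (B - f) * math.log(1.0 - theta)

def metropolis(n, f0, B0):
    # n steps of Metropolis-Hastings with a gaussian random-walk proposal,
    # started at the empirical frequency f0 / B0 as on the slide
    x = f0 / B0
    chain = []
    for _ in range(n):
        y = random.gauss(x, 0.1)  # symmetric proposal
        log_ratio = log_lik(y, f0, B0) - log_lik(x, f0, B0)
        if math.log(random.random()) < min(0.0, log_ratio):
            x = y  # accept with probability min(1, likelihood ratio)
        chain.append(x)
    return chain

# toy data: 3 females out of 5 births; the posterior is Beta(4, 3)
chain = metropolis(20000, 3, 5)
tail = chain[5000:]  # discard burn-in
mean = sum(tail) / len(tail)  # close to the Beta(4, 3) mean 4/7
```

Unlike rejection sampling, each step costs one likelihood evaluation regardless of B0, which is why chains of this kind scale to Laplace-sized data.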

  13. Probabilistic Programming Semantics

  14. Problems in semantics
  • Prove formally the correspondence between algorithms, implementations, and mathematics.
  • Prove that two programs have equivalent behavior.

  Operational semantics describes how probabilistic programs compute.
  Denotational semantics describes what probabilistic programs compute.

  15. Problems in semantics
  • Prove formally the correspondence between algorithms, implementations, and mathematics.
  • Prove that two programs have equivalent behavior.

  Operational semantics describes how probabilistic programs compute. Proba(M, N) is the probability p that M reduces to N in one step, written M →p N, defined by induction on the structure of M:
  • (λx.M) N → M[N/x]   (probability 1)
  • coin → 0            (probability 1/2)
  • coin → 1            (probability 1/2)
  • ...

  Denotational semantics describes what probabilistic programs compute.

  16. Problems in semantics
  • Prove formally the correspondence between algorithms, implementations, and mathematics.
  • Prove that two programs have equivalent behavior.

  Operational semantics describes how probabilistic programs compute. Proba(M, N) is the probability p that M reduces to N in one step, written M →p N, defined by induction on the structure of M:
  • (λx.M) N → M[N/x]   (probability 1)
  • coin → 0            (probability 1/2)
  • coin → 1            (probability 1/2)
  • ...

  Denotational semantics describes what probabilistic programs compute. ⟦M⟧ is a probability distribution when M is a closed program of ground type:
  • If M has type nat, then ⟦M⟧ is a discrete distribution over integers.
  • If M has type real, then ⟦M⟧ is a continuous distribution over reals.

  17. Operational semantics on an example (Borgström, Dal Lago, Gordon, Szymczak, ICFP'16)

      def addCoins():
          a = coin
          b = coin
          c = coin
          return (a + b + c)

  One reduction path (each coin flip has probability 1/2):

      addCoins() →1   (a = coin; b = coin; c = coin; a + b + c)
                 →1/2 (a = 0; b = coin; c = coin; a + b + c)
                 →1   (b = coin; c = coin; 0 + b + c)
                 →1/2 (b = 1; c = coin; 0 + b + c)
                 →1   (c = coin; 0 + 1 + c)
                 →1/2 (c = 1; 0 + 1 + c)
                 →1   (0 + 1 + 1)
                 →1   2

  18. Operational semantics on an example, continued (Borgström, Dal Lago, Gordon, Szymczak, ICFP'16)

  Aggregating reduction paths: addCoins() reduces to the value 2 along three paths, each of probability (1/2)^3 = 1/8:

      addCoins() →* 2 with a = 0, b = 1, c = 1   (probability 1/8)
      addCoins() →* 2 with a = 1, b = 0, c = 1   (probability 1/8)
      addCoins() →* 2 with a = 1, b = 1, c = 0   (probability 1/8)

  19. Operational semantics on an example, continued (Borgström, Dal Lago, Gordon, Szymczak, ICFP'16)

  Summing the three paths of probability 1/8 each:

      Proba∞(addCoins(), 2) = 3/8
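Since addCoins is finite, Proba∞ can be double-checked by brute-force enumeration of the eight coin outcomes (a plain Python sketch, not the paper's formal semantics):

```python
from fractions import Fraction
from itertools import product

def addCoins(a, b, c):
    # body of the slide's program with the three coin outcomes made explicit
    return a + b + c

# each of the 2^3 outcomes for (a, b, c) has probability (1/2)^3 = 1/8
dist = {}
for a, b, c in product([0, 1], repeat=3):
    v = addCoins(a, b, c)
    dist[v] = dist.get(v, Fraction(0)) + Fraction(1, 8)

# dist[2] is 3/8, from the three paths a=0,b=1,c=1 / a=1,b=0,c=1 / a=1,b=1,c=0
```

Using exact Fractions rather than floats keeps the check faithful to the operational semantics, where probabilities are exact dyadic rationals.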

  20. Operational semantics

  Proba∞(M, N) is the probability that M reduces to N in any number of steps.

  Behavioral equivalence: M1 ≃ M2 iff ∀C[ ], Proba∞(C[M1], 0) = Proba∞(C[M2], 0).

      def addCoins1():
          a = coin
          b = coin
          c = coin
          return (a + b + c)

      def addCoins2():
          b = coin
          a = coin
          c = coin
          return (a + b + c)

      def infer1(f0, B0):
          return rejection(guesser, checker(f0, B0))

      def infer2(f0, B0):
          return metropolis(1000, f0, B0)
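Full behavioral equivalence quantifies over all contexts C[ ], but the first example pair can at least be checked on output distributions by the same exhaustive enumeration (a necessary condition, not a proof of contextual equivalence):

```python
from fractions import Fraction
from itertools import product

def distribution(body):
    # exact output distribution of a program reading three fair coins in order
    dist = {}
    for coins in product([0, 1], repeat=3):
        v = body(*coins)
        dist[v] = dist.get(v, Fraction(0)) + Fraction(1, 8)
    return dist

# addCoins1 binds a to the first coin drawn and b to the second;
# addCoins2 swaps that binding order, exactly as on the slide
addCoins1 = lambda a, b, c: a + b + c
addCoins2 = lambda b, a, c: a + b + c

# since the coin draws are i.i.d., the two output distributions coincide
```

The enumeration makes the intuition concrete: reordering two independent samples permutes the paths but not the probability attached to each final value.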
