Paradoxes of probabilistic programming and how to condition on events of measure zero with infinitesimal probabilities (to appear at POPL’21)

Jules Jacobs
Radboud University Nijmegen
julesjacobs@gmail.com
November 23, 2020

Probabilistic programming

◮ Domain-specific language for statistical and machine learning models
◮ Normal programming language extended with rand, observe, and run

Probabilistic programming

Example:
◮ Men’s height is distributed according to Normal(1.8, 0.5) meters
◮ Women’s height is distributed according to Normal(1.7, 0.5) meters
◮ A scientist randomly samples a man and a woman and compares their heights
◮ The scientist tells us that the heights are equal

Question: What is the expected value of the height in this situation?

Meters:

    function meters() {
      h = rand(Normal(1.7, 0.5))
      observe(Normal(1.8, 0.5), h)
      return h
    }
    samples = run(meters, 1000)
    estimate = average(samples)

Answer: ≈ 1.75

Centimeters:

    function centimeters() {
      h = rand(Normal(170, 50))
      observe(Normal(180, 50), h)
      return h
    }
    samples = run(centimeters, 1000)
    estimate = average(samples)

Answer: ≈ 175
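
The height example can be sketched as a self-normalized importance sampler in Python. This is only an illustration under stated assumptions: the second parameter of Normal is read as a standard deviation, observe is modelled as multiplying the sample's weight by the density of the observed value, and the names `normal_pdf` and `run_heights` are hypothetical stand-ins for rand, observe, and run, not the API of any real probabilistic programming system.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of Normal(mean, std) at x, with std a standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def run_heights(prior_mean, obs_mean, std, n, seed=0):
    """Weighted average of h over n runs of the height program."""
    rng = random.Random(seed)
    weighted_sum = 0.0
    weight_total = 0.0
    for _ in range(n):
        h = rng.gauss(prior_mean, std)        # h = rand(Normal(prior_mean, std))
        w = normal_pdf(h, obs_mean, std)      # observe(Normal(obs_mean, std), h)
        weighted_sum += w * h
        weight_total += w
    return weighted_sum / weight_total


estimate_m = run_heights(1.7, 1.8, 0.5, 100_000)
estimate_cm = run_heights(170.0, 180.0, 50.0, 100_000)
print(estimate_m, estimate_cm)
```

With a single unconditional observe, changing units rescales every weight by the same constant, so the factor cancels in the ratio and the two estimates agree up to the factor of 100.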

Paradox 1

Suppose the scientist is lazy, and only does the measurement half of the time...

Meters:

    h = rand(Normal(1.7, 0.5))
    if (flip(0.5)) {
      observe(Normal(1.8, 0.5), h)
    }
    return h

Answer: ≈ 1.721

Centimeters:

    h = rand(Normal(170, 50))
    if (flip(0.5)) {
      observe(Normal(180, 50), h)
    }
    return h

Answer: ≈ 170.2

◮ The answer depends on whether the scientist uses meters or centimeters!
◮ This happens if we run the program with importance sampling in Anglican
◮ The issue is fundamental and not limited to Anglican
◮ It even happens in formal operational semantics (e.g. Commutative or Quasi-Borel)
◮ It is unclear what the answer should be, or whether this program should be disallowed
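
The unit dependence can be checked without sampling. The sketch below computes the large-sample limit of a weighted average for the lazy-scientist program, assuming that a run without observe has weight 1, a run with observe has weight equal to the density of h, and the second Normal parameter is a standard deviation. The function names are hypothetical. The limits come out close to, but not exactly equal to, the figures quoted above, which may be finite-sample estimates.

```python
import math


def normal_pdf(x, mean, std):
    """Density of Normal(mean, std) at x, with std a standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def lazy_scientist_mean(prior_mean, obs_mean, std, p_observe=0.5):
    """Limit of the weighted average when observe runs with probability p_observe."""
    # Expected weight of an observing run: E[pdf_obs(h)] for h ~ prior,
    # which is the Normal(0, sqrt(2) * std) density at the gap between the means.
    evidence = normal_pdf(prior_mean - obs_mean, 0.0, math.sqrt(2.0) * std)
    # With equal standard deviations, the conditioned mean is the midpoint.
    conditioned_mean = 0.5 * (prior_mean + obs_mean)
    numerator = (1.0 - p_observe) * prior_mean + p_observe * evidence * conditioned_mean
    denominator = (1.0 - p_observe) + p_observe * evidence
    return numerator / denominator


mean_m = lazy_scientist_mean(1.7, 1.8, 0.5)
mean_cm = lazy_scientist_mean(170.0, 180.0, 50.0)
print(mean_m, mean_cm / 100.0)
```

The density value `evidence` shrinks by a factor of 100 when the unit changes from meters to centimeters, while the weight 1 of the non-observing branch does not. The two branches are therefore mixed in different proportions, which is where the discrepancy comes from.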

Paradox 2

Objection: you shouldn’t call observe a variable number of times based on a coin flip

Suppose the scientist is drunk, and measures the weight instead of the height half of the time...

Meters:

    h = rand(Normal(1.7, 0.5))
    w = rand(Normal(60, 10))
    if (flip(0.5)) {
      observe(Normal(1.8, 0.5), h)
    } else {
      observe(Normal(70, 10), w)
    }
    return h

Answer: ≈ 1.75

Centimeters:

    h = rand(Normal(170, 50))
    w = rand(Normal(60, 10))
    if (flip(0.5)) {
      observe(Normal(180, 50), h)
    } else {
      observe(Normal(70, 10), w)
    }
    return h

Answer: ≈ 170

◮ The same number of observes regardless of the outcome of the coin flip
◮ The output still depends on whether we use meters or centimeters
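
The same kind of closed-form check applies to the drunk-scientist program. This is a sketch under the same assumptions as before (observe multiplies the weight by a density, the second Normal parameter is a standard deviation), with hypothetical function names. It gives the large-sample limit of the weighted average, so the results are near the quoted answers rather than identical to them.

```python
import math


def normal_pdf(x, mean, std):
    """Density of Normal(mean, std) at x, with std a standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def evidence(prior_mean, obs_mean, std):
    """E[pdf_obs(x)] for x ~ Normal(prior_mean, std) and pdf_obs = Normal(obs_mean, std)."""
    return normal_pdf(prior_mean - obs_mean, 0.0, math.sqrt(2.0) * std)


def drunk_scientist_mean(h_prior, h_obs, h_std, w_prior=60.0, w_obs=70.0, w_std=10.0):
    """Limit of the weighted average of h when height or weight is observed with p = 0.5."""
    z_h = evidence(h_prior, h_obs, h_std)   # expected weight of the height branch
    z_w = evidence(w_prior, w_obs, w_std)   # expected weight of the weight branch
    h_conditioned = 0.5 * (h_prior + h_obs)  # mean of h given the height observation
    # In the weight branch h is independent of the weight, so its mean stays h_prior.
    numerator = 0.5 * z_h * h_conditioned + 0.5 * z_w * h_prior
    denominator = 0.5 * z_h + 0.5 * z_w
    return numerator / denominator


mean_m = drunk_scientist_mean(1.7, 1.8, 0.5)
mean_cm = drunk_scientist_mean(170.0, 180.0, 50.0)
print(mean_m, mean_cm / 100.0)
```

Here both branches call observe exactly once, but `z_h` carries units of 1/length and `z_w` units of 1/weight. Rescaling the height unit changes `z_h` and leaves `z_w` alone, so the relative weight of the two branches changes.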

Paradox 3

Objection: you shouldn’t call observe inside a conditional

Suppose the scientist uses a ruler marked in log scale...

Original program:

    h = rand(Normal(1.7, 0.5))
    observe(Normal(1.8, 0.5), h)
    return h

Answer: 1.75

Logarithmic ruler program:

    H = rand(LogNormal(1.7, 0.5))
    observe(LogNormal(1.8, 0.5), H)
    return log(H)

Answer: 1.62

◮ Whether we use linear scale or log scale shouldn’t matter, just like meters or centimeters shouldn’t matter
◮ No conditionals at all, but the output still depends on the scale we use
◮ What do probabilistic programs really mean?
◮ What does probabilistic conditioning really mean?
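
The gap between the two rulers can be reproduced by numerical integration. The sketch below assumes that LogNormal(mu, sigma) denotes the distribution of exp(x) for x drawn from Normal(mu, sigma), that sigma is a standard deviation, and that observe weights a run by the density of the observed value. The helper names are hypothetical. Both programs are integrated over h = log(H), which has the same Normal(1.7, 0.5) prior in each case; only the weight differs.

```python
import math


def normal_pdf(x, mean, std):
    """Density of Normal(mean, std) at x, with std a standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def lognormal_pdf(x, mu, sigma):
    """Density of exp(Normal(mu, sigma)) at x > 0; note the Jacobian factor 1 / x."""
    return normal_pdf(math.log(x), mu, sigma) / x


def weighted_mean_of_h(weight, lo=-2.3, hi=5.7, steps=20_000):
    """Midpoint-rule estimate of E[h * weight(h)] / E[weight(h)] for h ~ Normal(1.7, 0.5)."""
    dh = (hi - lo) / steps
    numerator = 0.0
    denominator = 0.0
    for i in range(steps):
        h = lo + (i + 0.5) * dh
        mass = normal_pdf(h, 1.7, 0.5) * weight(h) * dh
        numerator += h * mass
        denominator += mass
    return numerator / denominator


linear_answer = weighted_mean_of_h(lambda h: normal_pdf(h, 1.8, 0.5))
log_answer = weighted_mean_of_h(lambda h: lognormal_pdf(math.exp(h), 1.8, 0.5))
print(linear_answer, log_answer)
```

Under these assumptions the log-scale weight equals the linear-scale weight times exp(-h), the Jacobian of the change of variable. That extra factor shifts the mean down from 1.75 to about 1.625, consistent with the 1.62 quoted above.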
