Tutorial on Probabilistic Programming in Machine Learning

Frank Wood
Play Along

1. Download and install Leiningen
   • http://leiningen.org/
2. Fork and clone the Anglican Examples repository
   • git@bitbucket.org:fwood/anglican-examples.git
   • https://fwood@bitbucket.org/fwood/anglican-examples.git
3. Enter the repository directory and type "lein gorilla"
4. Open the local URL in a browser
5. OR browse http://www.robots.ox.ac.uk/~fwood/anglican/examples/index.html
Motivation and Background: AI
Unsupervised Automata Induction: Probabilistic Deterministic Infinite Automata (PDIA)

Problem
• ~4000 lines of Java code
• New student re-implementation: 6-12 months
• World states learned
• Per-state emissions learned
• Per-state deterministic transition functions learned
• Unsupervised PDFA structure learning biases towards compact (few-state) world models that are fast approximate predictors

A prior over PDFA with a bounded number of states (the PDIA is its infinite-state limit):

    µ ∼ Dir(α_0 / |Q|)
    φ_j ∼ Dir(α µ),               j = 0, ..., |Σ| − 1
    δ(q_i, σ_j) = δ_ij ∼ φ_j,     i = 0, ..., |Q| − 1
    π_{q_i} ∼ Dir(β / |Σ|),       i = 0, ..., |Q| − 1
    ξ_0 = q_0,  ξ_t = δ(ξ_{t−1}, x_{t−1})
    x_t ∼ π_{ξ_t}

(A simulation sketch of these dynamics follows below.)

Notation
• Q - finite set of states
• Σ - finite alphabet
• δ : Q × Σ → Q - transition function
• π : Q × Σ → [0, 1] - emission distribution (iid probability vector per state)
• q_0 ∈ Q - initial state
• x_t ∈ Σ - data at time t
• ξ_t ∈ Q - state at time t
• α, α_0, β - hyperparameters

[Figures: example probabilistic deterministic finite automata, including the Reber Grammar, a trigram model as a DFA (without probabilities), the Even Process, and the Foulkes Grammar.]

Pfau, Bartlett, and Wood. Probabilistic Deterministic Infinite Automata. NIPS, 2011.
Doshi-Velez, Pfau, Wood, and Roy. Bayesian Nonparametric Methods for Partially-Observable Reinforcement Learning. TPAMI, 2013.
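To make the PDFA dynamics above concrete (ξ_t = δ(ξ_{t−1}, x_{t−1}), x_t ∼ π_{ξ_t}), here is a minimal Clojure sketch that forward-simulates a PDFA with fixed, hypothetical parameters; the PDIA replaces the fixed π and δ with the Dirichlet draws above and takes the |Q| → ∞ limit.

(defn sample-discrete
  "Draw an index i with probability proportional to (nth weights i)."
  [weights]
  (loop [i 0
         r (* (rand) (reduce + weights))]
    (let [w (nth weights i)]
      (if (or (< r w) (= i (dec (count weights))))
        i
        (recur (inc i) (- r w))))))

(defn simulate-pdfa
  "Forward-simulate T symbols: start in state 0 (q_0), emit x_t ~ pi_{xi_t},
   then move to the state delta(xi_t, x_t).  pi is a vector of per-state
   emission probability vectors; delta is a vector of per-state next-state
   vectors indexed by the emitted symbol."
  [pi delta T]
  (loop [state 0
         xs    []]
    (if (= (count xs) T)
      xs
      (let [x (sample-discrete (nth pi state))]
        (recur (get-in delta [state x]) (conj xs x))))))

;; Hypothetical two-state, two-symbol parameters, purely for illustration:
(simulate-pdfa [[0.5 0.5] [1.0 0.0]]   ; pi:    emission probabilities per state
               [[0 1] [0 0]]           ; delta: next state per (state, symbol)
               20)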
Unsupervised Tracking: Dependent Dirichlet Process Mixture of Objects

Problem
• ~5000 lines of Matlab code
• Implementation: ~1 year
• Generative model: ~1 page of LaTeX math
• Inference derivation: ~3 pages of LaTeX math

Model
• World state ≈ a dependent infinite mixture of motile objects that may appear, disappear, and occlude
• Per-state, per-object emission model learned
• Per-state, per-object complex transition functions learned

[Figures: detection-free tracking data and results, panels (a)-(j).]

Neiswanger, Wood, and Xing. The Dependent Dirichlet Process Mixture of Objects for Detection-free Tracking and Object Modeling. AISTATS, 2014.
Caron, Neiswanger, Wood, Doucet, and Davy. Generalized Pólya Urn for Time-Varying Pitman-Yor Processes. JMLR, to appear, 2015.
Outline
• Supervised modeling and maximum likelihood aren't the whole ML story
• Exact inference is great when possible, but too restrictive to be the sole focus
• Continuous variables are essential to machine learning
• Unsupervised modeling is a growing and important challenge
• We can write (and actually perform inference in) some pretty advanced models using existing probabilistic programming systems now
• Inference is key, and general-purpose inference for probabilistic programming is hard, but there is room for improvement and a big role for the PL community to play
Questions
• Is probabilistic programming really just about tool and library building for ML applications?
• Are gradients so important that we should just use HMC or stochastic gradients and neural nets for everything (re: neural Turing machine) and give up on anything resembling combinatorial search? As a corollary, for anything that does reduce to combinatorial search, should we just compile to SAT solvers?
• Is it more important for purposes of language aesthetics and programmer convenience to syntactically allow constraint/equality observations than to minimize the risk of programmers writing "impossible" inference problems by forcing them to specify a "noisy" observation likelihood?
• Is it possible to develop program transformations that succeed and fail in sufficiently intuitive ways to give programmers the facility of equality constraints when appropriate, while not allowing them to make measure-zero observations of random variables whose support is "dangerously" complex (uncountable, high-dimensional, etc.)?
• How possible and important is it to identify program fragments in which exact inference can be performed? Is it possible that the most important probabilistic programming output will be static analysis tools that identify, in arbitrary programs, fragments of models in which exact inference can be performed, and then also transform them into a form amenable to running exact inference?
• Are first-class distributions a requirement? Should they be hidden from the user?
• What is "query"? What should we name it? What should it mean? We've just called "query" "distribution" because rejection-query is, in Church, just an exact sampler from a conditional distribution, vs. mh-query, which is different. Can't we unify them all and simply call them "query" if you don't have first-class distributions, or "distribution" if you do and distributions support a sample interface?
• Should we expose the "internals" of the inference mechanism to the end user? Is the ability to draw a single unweighted approximate sample enough? Is a set of converging weighted samples enough? Do we need exact samples? Do we want an evidence approximation? Or converging expectation computations? Are compositions of the latter meaningful in any theoretically interpretable way?
• What is the "deliverable" from inference? How do we amortize inference across queries/tasks? Is probabilistic program compilation Rao-Blackwellization or transfer learning or both?
• Is probabilistic programming just an efficient mechanism or means for searching for models in which approximate inference is sufficiently
Outline
• Supervised modeling and maximum likelihood aren't the whole ML story
• Exact inference is great when possible, but too restrictive to be the sole focus
• Continuous variables are essential to machine learning
• Unsupervised modeling is a growing and important challenge
• We can write (and actually perform inference in) some pretty advanced models using existing probabilistic programming systems now
• Inference is key, and general-purpose inference for probabilistic programming is hard, but there is room for improvement and a big role for the PL community to play
The Maximum Likelihood Principle

    L(X, y, θ) = ∏_n p(y_n | x_n, θ)

    argmax_θ L(X, y, θ) = argmax_θ ∑_n log p(y_n | x_n, θ)
                        = θ*  s.t.  ∇_θ ∑_n log p(y_n | x_n, θ*) = 0

(defn sgd [X y f' theta-init num-iters stepsize]
  (loop [theta theta-init
         num-iters num-iters]
    (if (= num-iters 0)
      theta
      (let [n        (rand-int (count y))            ; pick a random example
            gradient (f' (get X n) (get y n) theta)  ; per-example gradient of -log p(y_n | x_n, theta)
            theta    (mapv - theta (map #(* stepsize %) gradient))] ; gradient step, realized eagerly
        (recur theta (dec num-iters))))))
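For intuition, and assuming f' returns the gradient of the negative log-likelihood of a single example, each iteration of sgd performs the stochastic update

    θ ← θ − η ∇_θ [ −log p(y_n | x_n, θ) ],   n drawn uniformly from {0, ..., N − 1}

which, in expectation over n, is a step along the full log-likelihood gradient (scaled by η/N); for a suitably small stepsize η the iterates climb towards the stationary condition above.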
Reasonable Analogy?

    Automatic Differentiation : Supervised Learning
 :: Probabilistic Programming : Unsupervised Learning
Example: Logistic Regression (≈ a shallow feed-forward neural net)

    p(y_n = 1 | x_n, w) = 1 / (1 + exp{−w_0 − ∑_d w_d x_{nd}})

Data (features x, label y):

    x                      y
    5.0  2.0  3.5  1.0     Iris-versicolor
    6.0  2.2  4.0  1.0     Iris-versicolor
    6.2  2.2  4.5  1.5     Iris-versicolor
    6.0  2.2  5.0  1.5     Iris-virginica
    4.5  2.3  1.3  0.3     Iris-setosa
    5.0  2.3  3.3  1.0     Iris-versicolor

Logistic Regression Maximum Likelihood Code (a sketch follows below)
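A minimal sketch of such code, not the code from the original slide: it assumes the labels are encoded as 0/1 (here Iris-virginica = 1) and reuses the sgd function from the previous slide.

(defn sigmoid [z]
  (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn logistic-grad
  "Per-example gradient of -log p(y_n | x_n, w), with w = [w_0 w_1 ... w_D]."
  [x-n y-n w]
  (let [x   (cons 1.0 x-n)                    ; prepend the bias feature
        p   (sigmoid (reduce + (map * w x)))  ; p(y_n = 1 | x_n, w)
        err (- p y-n)]
    (map #(* err %) x)))                      ; (p - y_n) * [1 x_n]

;; Fit the weights on the table above (hypothetical 0/1 encoding of y):
(def X [[5 2 3.5 1] [6 2.2 4 1] [6.2 2.2 4.5 1.5]
        [6 2.2 5 1.5] [4.5 2.3 1.3 0.3] [5 2.3 3.3 1]])
(def y [0 0 0 1 0 0])
(def w-hat (sgd X y logistic-grad (vec (repeat 5 0.0)) 10000 0.1))

Because logistic-grad returns the negative log-likelihood gradient (p − y_n)·(1, x_n), the subtraction of stepsize × gradient inside sgd ascends the log-likelihood.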
Pros vs. Cons

Cons
• Lack of convexity requires multiple restarts
• Brittle
• Lacking uncertainty quantification, point estimates aren't naturally composable

Pros
• Fast
• The learned parameter value is a "compiled deliverable"
Questions
• Is probabilistic programming really just tool and library building for ML applications?
• Are gradients so important that we should just use differentiable models with stochastic gradients for everything (re: neural Turing machine, OpenDR, Picture, Stan) and give up on anything resembling combinatorial search?