Probabilistic Programming for Bayesian Machine Learning
Luke Ong 翁之昊
University of Oxford
What is Machine Learning?
Many related terms: neural networks, pattern recognition, data mining, data science, statistical modelling, AI, machine learning, etc.
How to make machines learn from data? Many fields study this question:
• Computer Science: AI, computer vision, information retrieval
• Statistics: learning theory, learning and inference from data
• Cognitive Science / Psychology: perception, computational linguistics, mathematical psychology
• Neuroscience: neural networks, neural information processing
• Engineering: signal processing, adaptive and optimal control, information theory, robotics
• Economics: decision theory, game theory, operations research
Truly useful real-world applications of ML / AI
• Autonomous vehicles / robotics / drones
• Computer vision: facial recognition
• Financial prediction / automated trading
• Recommender systems
• Language / speech technologies
• Scientific modelling / data analysis
Intense global interest (and hype) in AI
China's "Next Generation AI Development Plan" (2017):
1. Join the first echelon by 2020 (big data, swarm AI, theory)
2. Breakthroughs by 2025 (medicine, AI laws, security & control)
3. World-leading by 2030, with a CNY 1 trillion (≈ USD 150 billion) domestic AI industry (social governance, defence, industry)
"Objective 1 (technologies & market applications) was already achieved in mid-2018." China is:
• #1 in AI funding*: 48% of global funding from China, 38% from the US
• #1 in total and in highly cited AI papers worldwide
• #1 in AI patents
(Tsinghua University Report, 2018)
"A 'Sputnik moment' was felt by the West." Stuart Russell, 2018
Much of the hype concerns Deep Learning.
* Total equity funding of AI start-ups
Allen, G. C.: Understanding China's AI Strategy. Center for a New American Security, 2019.
Ding, J.: Deciphering China's AI Dream. Future of Humanity Institute, University of Oxford, 2018.
Xue Lan: China AI Development Report, Tsinghua University, 2018.
How to situate Deep Learning in ML?
Discriminative ML
Directly learn to predict: given training data (input-output pairs), learn a parametrised (non-linear) function f_θ from inputs to outputs.
• Training uses the data to estimate the optimal value θ* of the parameter.
• Prediction: given an unseen input x′, return the output y′ := f_{θ*}(x′).
Examples: neural nets, support vector machines, decision tree ensembles (e.g. random forests). θ is typically uninterpretable.
Generative (probabilistic) ML
Build a probabilistic model that explains the observed data by generating them, i.e. a simulator. The model defines a joint probability p(X, Y) of inputs X (latent variables and parameters) and outputs Y (data).
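A minimal sketch of the contrast on toy 1-D data, not taken from the slides: the discriminative model fits p(y ∣ x) directly by logistic regression, while the generative model fits p(x ∣ y) and p(y) and applies Bayes' rule to predict. All names and the data-generating choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1)
n = 200
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

# --- Discriminative: fit f_theta(x) = sigmoid(a*x + b) by gradient descent ---
a, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(a * x + b)))   # predicted p(y=1 | x)
    grad_a = np.mean((p - y) * x)            # gradient of the mean log-loss
    grad_b = np.mean(p - y)
    a -= 0.1 * grad_a
    b -= 0.1 * grad_b

# --- Generative: fit p(x | y) as Gaussians and p(y) as class frequencies ---
mu0, mu1 = x[y == 0].mean(), x[y == 1].mean()
sd0, sd1 = x[y == 0].std(), x[y == 1].std()
prior1 = y.mean()

def normal_pdf(v, mu, sd):
    return np.exp(-0.5 * ((v - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def generative_posterior(v):
    # Bayes' rule: p(y=1 | x) ∝ p(x | y=1) p(y=1)
    w1 = normal_pdf(v, mu1, sd1) * prior1
    w0 = normal_pdf(v, mu0, sd0) * (1 - prior1)
    return w1 / (w0 + w1)

x_new = 0.3
print("discriminative p(y=1|x):", 1.0 / (1.0 + np.exp(-(a * x_new + b))))
print("generative     p(y=1|x):", generative_posterior(x_new))
```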
Deep Learning Limitations
1. Very data hungry
2. Compute-intensive to train and deploy; finicky to optimise
3. Easily fooled by adversarial examples
4. Poor at giving uncertainty estimates, leading to overconfidence, so unsuitable for safety-critical systems
5. Hard to use prior knowledge & symbolic representations
6. Uninterpretable black boxes: parameters have no real-world meaning
[ConvNet figure from Clarifai Technology]
Deep learning ad infinitum? Give up probability, logic, symbolic representation?
"Deep learning will plateau out: many things are needed to make further progress, such as reasoning, and programmable models."
"Many more applications are completely out of reach for current deep learning techniques — even given vast amounts of human-annotated data. … The main directions in which I see promise are models closer to general-purpose programs."
François Chollet (deep learning expert, Keras inventor)
Pace Bayesian deep learning / uncertainty in deep learning:
Neal, R. M.: Bayesian Learning for Neural Networks. Lecture Notes in Statistics, Vol. 118, Springer, 1996.
Gal, Y.: Uncertainty in Deep Learning. PhD Thesis, University of Cambridge, 2017.
Gal & Ghahramani: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML 2016.
In contrast to Deep Learning… Probabilistic Machine Learning
[Portrait: Thomas Bayes (1701-1761)]
Given a system with some data:
1. Build a model capable of generating data observable from the system.
2. Use probability to express belief / uncertainty (including noise) about the model.
3. Apply Bayes' Rule (= Bayesian Inversion) to learn from data:
   a. infer unknown quantities
   b. predict
   c. explore and adapt models
Two axioms "from which everything follows"*
Sum Rule: P(x) = Σ_y P(x, y)
Product Rule: P(x, y) = P(x) P(y ∣ x)
Bayes' Rule: given observed data 𝒟 = {d_1, …, d_N},
P(θ ∣ 𝒟) = P(𝒟 ∣ θ) P(θ) / P(𝒟)
where 𝒟 is the data (observed) and θ the parameter (latent); that is,
Posterior ∝ Likelihood × Prior:   P(θ ∣ 𝒟) ∝ P(𝒟 ∣ θ) × P(θ)
• Likelihood function: P(𝒟 ∣ θ), not a probability (w.r.t. θ)
• Model evidence: P(𝒟) = ∫ P(𝒟 ∣ θ) p(θ) dθ, the normalising constant (a computational challenge in ML)
• Significance of Bayes' Rule: it prescribes how our prior belief about θ is changed after observing the data 𝒟.
* Cox 1946; Jaynes 1996; Van Horn 2003
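A small numerical sketch of Bayes' Rule, not from the slides: a coin-flip model in which θ is the unknown probability of heads, the prior is an assumed Beta(2, 2), and the posterior is computed on a grid as likelihood × prior, normalised by the model evidence P(𝒟).

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # observed flips (1 = heads)

theta = np.linspace(0.001, 0.999, 999)              # grid over the parameter
prior = theta ** (2 - 1) * (1 - theta) ** (2 - 1)   # unnormalised Beta(2, 2)
prior /= np.trapz(prior, theta)

heads, tails = data.sum(), len(data) - data.sum()
likelihood = theta ** heads * (1 - theta) ** tails  # P(D | theta)

evidence = np.trapz(likelihood * prior, theta)      # P(D) = ∫ P(D|θ) p(θ) dθ
posterior = likelihood * prior / evidence           # Bayes' Rule

print("posterior mean (grid):", np.trapz(theta * posterior, theta))
# Conjugacy check: Beta(2 + heads, 2 + tails) has mean (2 + heads) / (4 + N)
print("posterior mean (exact):", (2 + heads) / (4 + len(data)))
```

The explicit `evidence` integral is what becomes intractable in realistic models, which is exactly the computational challenge the slide flags.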
What is Probabilistic Programming?
Problem: probabilistic model development, and the design & implementation of inference algorithms, are time-consuming, error-prone, and (unnecessarily) bespoke.
Probabilistic programming is a general-purpose means of
1. expressing probabilistic models as programs, &
2. automatically performing Bayesian inference.
Separation of concerns. Probabilistic programming systems
• enable data scientists / domain experts to focus on designing good models,
• leaving the development of efficient inference engines to experts in Bayesian statistics, machine learning & programming languages.
Key advantage: democratise access to machine learning.
* Wood, F.: Probabilistic Programming. NIPS 2015 tutorial.
* Tenenbaum & Mansinghka: Engineering and Reverse-Engineering Intelligence Using Probabilistic Programs, Program Induction, and Deep Learning. NIPS 2017 tutorial.
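A toy illustration of the idea, not the API of any real probabilistic programming system: the user writes the model as an ordinary program using hypothetical `sample` and `observe` operations, and a generic inference routine (here, simple likelihood weighting) produces a posterior estimate with no model-specific derivation.

```python
import math
import random

def model(trace, data):
    # Prior: unknown mean of the data, mu ~ Normal(0, 5)
    mu = trace.sample(lambda: random.gauss(0.0, 5.0))
    # Likelihood: each observation y ~ Normal(mu, 1)
    for y in data:
        trace.observe(log_normal_pdf(y, mu, 1.0))
    return mu

def log_normal_pdf(y, mu, sd):
    return -0.5 * ((y - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

class Trace:
    def __init__(self):
        self.log_weight = 0.0
    def sample(self, sampler):
        return sampler()                  # draw from the prior
    def observe(self, log_prob):
        self.log_weight += log_prob       # accumulate the log-likelihood

def likelihood_weighting(model, data, n_particles=5000):
    samples, weights = [], []
    for _ in range(n_particles):
        tr = Trace()
        samples.append(model(tr, data))
        weights.append(math.exp(tr.log_weight))
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, samples)) / total

data = [2.1, 1.7, 2.5, 1.9, 2.3]
print("posterior mean of mu ≈", likelihood_weighting(model, data))
```

The separation of concerns is visible in the code: `model` is all the domain expert writes, while `likelihood_weighting` stands in for the inference engine that a real system supplies.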
Bayesian / Probabilistic Pipeline
[Figure: pipeline from Knowledge & Questions and Data, via making assumptions (prior probability) and the joint probability (discover patterns), to the posterior probability (infer, predict, explore)]
The pipeline distinguishes the roles of
1. knowledge and questions (domain experts),
2. making assumptions (data scientists & ML experts),
3. building models and computing inferences (data scientists & ML experts), and
4. implementing applications (ML users and practitioners).
Bayesian / Probabilistic Pipeline Loop
[Figure: the same pipeline with a "criticise model" loop from the posterior back to the prior]
Probabilistic programming provides the means to iterate the Bayesian pipeline: the posterior probability of the n-th iterate becomes the prior of the (n+1)-th iterate.
Loop Robustness: asymptotic consensus of Bayesian posterior inference.
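A small sketch of the loop, using an assumed conjugate Beta-Bernoulli model rather than anything from the slides: after each batch of data, the posterior Beta(a, b) is carried forward as the prior for the next batch.

```python
import numpy as np

rng = np.random.default_rng(1)
true_theta = 0.7

a, b = 1.0, 1.0                                   # initial prior: Beta(1, 1)
for iterate in range(1, 4):
    batch = rng.random(50) < true_theta           # 50 new Bernoulli observations
    a += batch.sum()                              # posterior update ...
    b += (~batch).sum()                           # ... becomes the next prior
    print(f"after batch {iterate}: posterior mean = {a / (a + b):.3f}")
```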
Asymptotic certainty of posterior inference
Theorem (Bernstein-von Mises). Assume the data set 𝒟_n (comprising n data points) was generated from some true θ*. Under some regularity conditions, provided p(θ*) > 0,
lim_{n→∞} p(θ ∣ 𝒟_n) = δ(θ − θ*).
In the unrealisable case, where the data was generated from some p*(x) which cannot be modelled by any θ, the posterior will converge to
lim_{n→∞} p(θ ∣ 𝒟_n) = δ(θ − θ̂),
where θ̂ minimises KL( p*(x) ∣∣ p(x ∣ θ) ).
The posterior distribution for unknown quantities in any problem is effectively asymptotically independent of the prior distribution as the data sample grows large.
Doob, 1949; Freedman, 1963
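A quick simulation sketch of this effect (an assumed example, not from the slides): with Bernoulli data generated from a true θ* = 0.3 and a Beta(5, 1) prior that puts little mass near 0.3, the posterior still concentrates around θ* as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)
true_theta = 0.3
a0, b0 = 5.0, 1.0                        # a deliberately badly placed prior

for n in [10, 100, 1000, 10000]:
    heads = rng.binomial(n, true_theta)
    a, b = a0 + heads, b0 + (n - heads)  # conjugate Beta posterior
    mean = a / (a + b)
    sd = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(f"n={n:6d}  posterior mean={mean:.3f}  posterior sd={sd:.4f}")
```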
Asymptotic consensus of posterior inference
Theorem. Take two Bayesians with different priors, p_1(θ) and p_2(θ), observing the same data 𝒟_n. Assume p_1 and p_2 have the same support. Then, as n → ∞, the posteriors p_1(θ ∣ 𝒟_n) and p_2(θ ∣ 𝒟_n) converge in the uniform distance between distributions,
ρ(P_1, P_2) := sup_E | P_1(E) − P_2(E) |.
Tanner: Tools for Statistical Inference. Springer, 1996 (Ch. 2).
Kleijn, B. J. K., Van der Vaart, A. W., et al.: The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics, 6:354-381, 2012.
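A simulation sketch of posterior consensus, with assumed priors and data not taken from the slides: two analysts start from very different Beta priors over θ; as the shared data grow, the uniform distance between their posteriors, which for distributions over θ equals 0.5 ∫ |p_1 − p_2| dθ, shrinks towards zero.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)
true_theta = 0.6
theta = np.linspace(0.0005, 0.9995, 2000)

for n in [0, 10, 100, 1000, 10000]:
    heads = rng.binomial(n, true_theta) if n > 0 else 0
    p1 = beta.pdf(theta, 1 + heads, 1 + n - heads)    # prior 1: Beta(1, 1)
    p2 = beta.pdf(theta, 20 + heads, 2 + n - heads)   # prior 2: Beta(20, 2)
    dist = 0.5 * np.trapz(np.abs(p1 - p2), theta)     # uniform (TV) distance
    print(f"n={n:6d}  uniform distance ≈ {dist:.3f}")
```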