COSC343: Artificial Intelligence
Lecture 16: Introduction to probability theory
Alistair Knott, Dept. of Computer Science, University of Otago
Probabilistic learning algorithms

In the next two lectures, I'll introduce probabilistic learning algorithms. These algorithms take a set of training data and learn a probabilistic model of the data. The model can be used to assess the probabilities of events, including events not seen in the training data.

For instance:
- Training data: people with meningitis. What symptoms do they show?
- Model: takes symptoms, and estimates the probability of meningitis.
Defining a sample space

A sample space is a model of 'all possible ways the world can be'. Formally, it's the space of all possible values of the inputs and outputs of the function f(x_1, ..., x_n). Each of these defines one dimension of the sample space. Each possible combination of values is called a sample point.

Formally, a probability model assigns a probability to each sample point in a sample space:
- Each probability is between 0 and 1 inclusive.
- The probabilities of all points in the space sum to 1.
A simple probability model

Imagine we roll a single die. There's just one variable in our sample space (call it Roll), which has 6 possible values, each assigned some probability p:

Roll   1   2   3   4   5   6
       p   p   p   p   p   p

We can estimate the probability at each point by generating a training set of die rolls and using relative frequencies of events in this set:

p(Roll = n) = count(Roll = n) / size(training_set)

Terminology: note that variables are capitalised!
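As a minimal sketch of this estimation (not from the slides; the fair-die simulation and the variable names are my own assumptions):

    import random
    from collections import Counter

    # Simulate a training set of rolls of a fair die.
    training_set = [random.randint(1, 6) for _ in range(10_000)]

    # p(Roll = n) = count(Roll = n) / size(training_set)
    counts = Counter(training_set)
    p = {n: counts[n] / len(training_set) for n in range(1, 7)}
    print(p)  # each estimate should be close to 1/6 ≈ 0.167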
A two-dimensional probability model

If we roll two dice many times, we can build a probability model looking something like this:

                    Roll_1
Roll_2     1     2     3     4     5     6
   1     1/36  1/36  1/36  1/36  1/36  1/36
   2     1/36  1/36  1/36  1/36  1/36  1/36
   3     1/36  1/36  1/36  1/36  1/36  1/36
   4     1/36  1/36  1/36  1/36  1/36  1/36
   5     1/36  1/36  1/36  1/36  1/36  1/36
   6     1/36  1/36  1/36  1/36  1/36  1/36
Some terminology

An event is any subset of points in a sample space. The probability of an event E is the sum of the probabilities of each sample point it contains:

p(E) = Σ_{ω ∈ E} p(ω)
Events What’s P ( Roll _1 = 5 ) ? Roll _1 1 2 3 4 5 6 1 1 1 1 1 1 Roll _2 1 36 36 36 36 36 36 1 1 1 1 1 1 2 36 36 36 36 36 36 1 1 1 1 1 1 3 36 36 36 36 36 36 1 1 1 1 1 1 4 36 36 36 36 36 36 1 1 1 1 1 1 5 36 36 36 36 36 36 1 1 1 1 1 1 6 36 36 36 36 36 36 Alistair Knott (Otago) COSC343 Lecture 16 7 / 22
Events can also be partial descriptions of outcomes. What's P(Roll_1 ≥ 4)? This event covers the columns Roll_1 ∈ {4, 5, 6}, i.e. 18 sample points:

P(Roll_1 ≥ 4) = 18 × 1/36 = 1/2
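Here's a sketch of these computations, summing sample-point probabilities over an event; the dictionary representation of the sample space is my own choice:

    from fractions import Fraction

    # Joint model for two fair dice: each of the 36 sample points has probability 1/36.
    p = {(r1, r2): Fraction(1, 36) for r1 in range(1, 7) for r2 in range(1, 7)}

    def p_event(event):
        # p(E) = sum of p(ω) over the sample points ω in E
        return sum(prob for point, prob in p.items() if event(point))

    print(p_event(lambda pt: pt[0] == 5))  # P(Roll_1 = 5)  -> 1/6
    print(p_event(lambda pt: pt[0] >= 4))  # P(Roll_1 >= 4) -> 1/2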
Continuous and discrete variables

The sample spaces we've seen so far have been built from discrete random variables. But you can build probability models using continuous variables too. E.g. we can define a random variable Temperature, whose domain is the real numbers.

Terminology: for Boolean variables (e.g. Stiff_neck), lower-case is shorthand for 'true', and '¬' means 'not':

stiff_neck ≡ Stiff_neck = true
¬stiff_neck ≡ Stiff_neck = false
Probability distributions

A probability model induces a probability distribution for each random variable. This distribution is a function whose domain is all possible values of the random variable, and which returns a probability for each possible value. The area under its graph has to sum to 1.

E.g. P(Roll_1) = ⟨1/6, 1/6, 1/6, 1/6, 1/6, 1/6⟩

[Figure: bar chart of P(Roll_1), with a bar of height 1/6 over each of the values 1 to 6.]

Terminology: note capitalisation!
- p(E) is the probability of an event.
- P(V) is a probability distribution for the variable V.
Probability for continuous variables

For continuous variables, distributions are continuous. Here's a function which gives a uniform probability density for values between 18 and 26:

P(X = x) = U[18, 26](x) = 0.125 for 18 ≤ x ≤ 26

Here P is a density, which integrates to 1. So P(X = 20.5) = 0.125 means

lim_{dx → 0} P(20.5 ≤ X ≤ 20.5 + dx) / dx = 0.125
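A one-function sketch of this density (the function name and default bounds are my own):

    def uniform_density(x, a=18.0, b=26.0):
        # U[a, b](x): constant density 1/(b - a) inside [a, b], zero outside
        return 1.0 / (b - a) if a <= x <= b else 0.0

    print(uniform_density(20.5))  # 0.125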
Gaussian density

A particularly useful probability function for continuous variables is the Gaussian function:

P(x) = (1 / (√(2π) σ)) e^(−(x − µ)² / 2σ²)

Lots of real-world variables have this distribution.
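The formula translates directly into code; this sketch (with assumed default parameters µ = 0, σ = 1) just evaluates the density at a point:

    import math

    def gaussian_density(x, mu=0.0, sigma=1.0):
        # P(x) = (1 / (sqrt(2*pi) * sigma)) * exp(-(x - mu)^2 / (2 * sigma^2))
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    print(gaussian_density(0.0))  # peak of the standard normal, ≈ 0.3989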
A simple medical example

Consider a medical scenario, with 3 Boolean variables:
- Cavity (does the patient have a cavity or not?)
- Toothache (does the patient have a toothache or not?)
- Catch (does the dentist's probe catch on the patient's tooth?)

Here's an example probability model: the joint probability distribution P(Toothache, Catch, Cavity). (Note capital letters: we're enumerating all possible values for each variable.)

              toothache          ¬toothache
           catch   ¬catch      catch   ¬catch
cavity     .108    .012        .072    .008
¬cavity    .016    .064        .144    .576
Inference from a joint distribution

Given a full joint distribution, we can compute the probability of any event simply by summing the probabilities of the relevant sample points (using the joint table above).

E.g. how do we calculate p(toothache)? Sum the four cells where toothache is true:

p(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

E.g. how do we calculate p(cavity ∨ toothache)? Sum every cell where cavity is true or toothache is true:

p(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
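A sketch of this inference in Python; the tuple encoding of sample points (toothache, catch, cavity) is my own choice, but the probabilities are the lecture's:

    # Full joint distribution P(Toothache, Catch, Cavity), keyed by (toothache, catch, cavity).
    joint = {
        (True,  True,  True):  0.108, (True,  False, True):  0.012,
        (False, True,  True):  0.072, (False, False, True):  0.008,
        (True,  True,  False): 0.016, (True,  False, False): 0.064,
        (False, True,  False): 0.144, (False, False, False): 0.576,
    }

    def p_event(event):
        # Sum the probabilities of the sample points that satisfy the event.
        return sum(prob for point, prob in joint.items() if event(*point))

    print(p_event(lambda t, c, cav: t))           # p(toothache) = 0.2
    print(p_event(lambda t, c, cav: cav or t))    # p(cavity ∨ toothache) = 0.28
    # (printed values may show tiny floating-point rounding)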
Set-theoretic relationships in probability

Note that we can describe the probabilities of logically related events in set-theoretic terms. For instance:

p(a ∨ b) = p(a) + p(b) − p(a ∧ b)

[Figure: Venn diagram of two overlapping sets A and B.]
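As a quick check against the joint table above: p(cavity) = 0.2, p(toothache) = 0.2 and p(cavity ∧ toothache) = 0.108 + 0.012 = 0.12, so p(cavity ∨ toothache) = 0.2 + 0.2 − 0.12 = 0.28, matching the sum over sample points.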
Prior probabilities and conditional probabilities

Assume we have built a probability model from some training data, and we are now considering a test item. If we don't know anything about this item, all we can compute is prior probabilities: e.g. p(toothache). But if we know some of the item's properties, we can compute conditional probabilities based on these properties.

Terminology:
- p(cavity | toothache): the probability of cavity given that the patient has a toothache.
- P(Cavity | Toothache): a conditional probability distribution (a table of conditional probabilities for all combinations of values of Cavity and Toothache).
Computing conditional probabilities

Assume we begin with these prior probabilities...

              toothache          ¬toothache
           catch   ¬catch      catch   ¬catch
cavity     .108    .012        .072    .008
¬cavity    .016    .064        .144    .576
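The section breaks off here. As a sketch of the computation this table sets up, assuming the standard definition p(a | b) = p(a ∧ b) / p(b) (not stated on these slides):

    # Joint distribution from the table above, keyed by (toothache, catch, cavity).
    joint = {
        (True,  True,  True):  0.108, (True,  False, True):  0.012,
        (False, True,  True):  0.072, (False, False, True):  0.008,
        (True,  True,  False): 0.016, (True,  False, False): 0.064,
        (False, True,  False): 0.144, (False, False, False): 0.576,
    }

    def p_event(event):
        return sum(prob for point, prob in joint.items() if event(*point))

    def p_conditional(event, given):
        # p(a | b) = p(a ∧ b) / p(b)
        return p_event(lambda *pt: event(*pt) and given(*pt)) / p_event(given)

    # p(cavity | toothache) = (0.108 + 0.012) / 0.2 = 0.6
    print(p_conditional(lambda t, c, cav: cav, lambda t, c, cav: t))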