Machine Learning 10-701/15-781, Fall 2006
Tutorial on Basic Probability
Eric Xing
Lecture 2, September 15, 2006
Reading: Chap. 1 & 2, CB & Chap. 5 & 6, TM

What is this?
- [Figure: a Gaussian density curve f(x) centered at µ]
- Classical AI and ML research ignored this phenomenon.
- The problem (an example): you want to catch a flight at 10:00am from Pittsburgh to San Francisco. Can you make it if you leave at 7am and take a 28X bus at CMU?
  - partial observability (road state, other drivers' plans, etc.)
  - noisy sensors (radio traffic reports)
  - uncertainty in action outcomes (flat tire, etc.)
  - immense complexity of modeling and predicting traffic
- Reasoning under uncertainty!
Basic Probability Concepts
- A sample space S is the set of all possible outcomes of a conceptual or physical, repeatable experiment. (S can be finite or infinite.)
  - E.g., S may be the set of all possible outcomes of a die roll: S ≡ {1, 2, 3, 4, 5, 6}
  - E.g., S may be the set of all possible nucleotides of a DNA site: S ≡ {A, T, C, G}
  - E.g., S may be the set of all possible time-space positions of an aircraft on a radar screen: S ≡ {0, R_max} × {0, 360°} × {0, +∞}
- An event A is any subset of S:
  - seeing "1" or "6" in a roll; observing a "G" at a site; UA007 in space-time interval X
- An event space E is the space of possible worlds in which the outcomes can happen:
  - all die rolls, reading a genome, monitoring the radar signal

Visualizing Probability Space
- A probability space is a sample space S in which, for every outcome s ∈ S, there is an assignment P(s) such that:
  - 0 ≤ P(s) ≤ 1
  - Σ_{s ∈ S} P(s) = 1
- P(s) is called the probability (or probability mass) of s.
- [Figure: the event space of all possible worlds, drawn with total area 1; the event A is an oval whose area is P(a), dividing the space into worlds in which A is true and worlds in which A is false.]
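The finite probability space above can be sketched directly in code. This is our illustration, not from the slides: a fair die's sample space with a uniform mass assignment, and events as Python sets.

```python
# Sketch (assumed example): a finite probability space for a fair die,
# with events represented as subsets of the sample space S.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                      # sample space of a die roll
P = {s: Fraction(1, 6) for s in S}          # uniform mass assignment

def prob(event):
    """P(A) for an event A ⊆ S: sum the point masses of its outcomes."""
    return sum(P[s] for s in event)

assert sum(P.values()) == 1                 # the masses sum to 1
A = {1, 6}                                  # event: seeing "1" or "6"
assert prob(A) == Fraction(1, 3)
```

Using `Fraction` keeps the arithmetic exact, so the normalization check `Σ P(s) = 1` holds without floating-point tolerance.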
Kolmogorov Axioms
- All probabilities are between 0 and 1: 0 ≤ P(X) ≤ 1
- P(true) = 1: regardless of the event, my outcome is true
- P(false) = 0: no event makes my outcome true
- The probability of a disjunction is given by: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- [Figure: Venn diagram of events A and B, showing A ∨ B, A ∧ B, and ¬A ∧ ¬B]

Why use probability?
- There have been attempts to develop different methodologies for uncertainty:
  - fuzzy logic
  - qualitative reasoning (qualitative physics)
  - …
- "Probability theory is nothing but common sense reduced to calculation" — Pierre Laplace, 1812.
- In 1931, de Finetti proved that it is irrational to have beliefs that violate these axioms, in the following sense: if you bet in accordance with your beliefs, but your beliefs violate the axioms, then you can be guaranteed to lose money to an opponent whose beliefs more accurately reflect the true state of the world. (Here, "betting" and "money" are proxies for "decision making" and "utilities".)
- What if you refuse to bet? This is like refusing to allow time to pass: every action (including inaction) is a bet.
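The disjunction rule can be checked numerically on a small example. This sketch (our own, assuming a uniform fair-die space) verifies P(A ∨ B) = P(A) + P(B) − P(A ∧ B) using set union and intersection:

```python
# Sketch: checking the inclusion-exclusion axiom on a uniform die space.
S = {1, 2, 3, 4, 5, 6}

def P(event):
    # uniform probability: |A| / |S|
    return len(event) / len(S)

A = {1, 2, 3}                       # event "roll at most 3"
B = {3, 4}                          # event "roll a 3 or a 4"

lhs = P(A | B)                      # P(A ∨ B): union of the events
rhs = P(A) + P(B) - P(A & B)        # intersection corrects double counting
assert abs(lhs - rhs) < 1e-12
```

The intersection term exists precisely because the outcome 3 would otherwise be counted twice, once in each event.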
Random Variable
- A random variable is a function X(ω) that associates a unique numerical value (a token) with every outcome ω ∈ S of an experiment. (The value of the r.v. will vary from trial to trial as the experiment is repeated.)
- Discrete r.v.:
  - the outcome of a die roll
  - the outcome X_i of reading a nucleotide at site i
- Binary event and indicator variable:
  - Seeing an "A" at a site ⇒ X = 1, o/w X = 0.
  - This describes the true-or-false outcome of a random event.
  - Can we describe richer outcomes in the same way (i.e., X = 1, 2, 3, 4 for being A, C, G, T)? --- Think about what would happen if we take the expectation of X.
- Unit-base random vector:
  - X_i = [X_i^A, X_i^T, X_i^G, X_i^C]'; e.g., X_i = [0, 0, 1, 0]' ⇒ seeing a "G" at site i
- Continuous r.v.:
  - the outcome of recording the true location of an aircraft: X_true
  - the outcome of observing the measured location of an aircraft: X_obs

Discrete Prob. Distribution
- (In the discrete case,) a probability distribution P on S (and hence on the domain of X) is an assignment of a non-negative real number P(s) to each s ∈ S (or each valid value of x) such that Σ_{s ∈ S} P(s) = 1 and 0 ≤ P(s) ≤ 1.
  - Intuitively, P(s) corresponds to the frequency (or the likelihood) of getting s in the experiments, if repeated many times.
  - Call θ_s = P(s) the parameters of a discrete probability distribution.
- A probability distribution on a sample space is sometimes called a probability model, in particular if several different distributions are under consideration.
  - Write models as M_1, M_2, and probabilities as P(X|M_1), P(X|M_2).
  - E.g., M_1 may be the appropriate prob. dist. if X comes from a "fair die", M_2 if from a "loaded die".
  - M is usually a two-tuple of {dist. family, dist. parameters}.
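The unit-base random vector above avoids the problem with X = 1, 2, 3, 4: a one-hot encoding has no artificial ordering, so its expectation is the vector of per-symbol probabilities rather than a meaningless average of labels. A minimal sketch (the encoding order A, T, G, C follows the slide; the function name is ours):

```python
# Sketch: a unit-base (one-hot) random vector for a nucleotide at site i.
# X_i = [0, 0, 1, 0]' encodes seeing a "G", per the slide's ordering.
NUCLEOTIDES = ["A", "T", "G", "C"]

def one_hot(nt):
    """Indicator vector X with X_j = 1 for the observed nucleotide, else 0."""
    return [1 if nt == n else 0 for n in NUCLEOTIDES]

x = one_hot("G")
assert x == [0, 0, 1, 0]
assert sum(x) == 1          # exactly one component fires per outcome
```

Averaging many such vectors over repeated trials yields the empirical frequency of each nucleotide, which is exactly the parameter vector θ of the discrete distribution below.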
Discrete Distributions
- Bernoulli distribution: Ber(p)
  P(x) = { 1 − p for x = 0; p for x = 1 }  ⇒  P(x) = p^x (1 − p)^{1−x}
- Multinomial distribution: Mult(1, θ)
  - Multinomial (indicator) variable:
    X = [X_1, …, X_6]', where X_j ∈ {0, 1} and Σ_j X_j = 1; X_j = 1 w.p. θ_j, with Σ_{j ∈ [1,…,6]} θ_j = 1.
  - p(x) = P(X_j = 1), where j indexes the die face:
    p(x) = θ_A^{x_A} × θ_C^{x_C} × θ_G^{x_G} × θ_T^{x_T} = Π_k θ_k^{x_k} = θ_j

Discrete Distributions (cont.)
- Multinomial distribution: Mult(n, θ)
  - Count variable:
    x = [x_1, …, x_K]', where Σ_j x_j = n
  - p(x) = n! / (x_1! x_2! ⋯ x_K!) · θ_1^{x_1} θ_2^{x_2} ⋯ θ_K^{x_K}
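The Mult(n, θ) mass function can be computed directly from the formula above. A minimal sketch (function name and test values are our assumptions, not from the slides):

```python
# Sketch: p(x) = n!/(x_1!⋯x_K!) · Π_k θ_k^{x_k}, the Mult(n, θ) pmf.
from math import factorial, prod

def multinomial_pmf(counts, theta):
    """Probability of the count vector `counts` under parameters `theta`."""
    n = sum(counts)
    coef = factorial(n)                 # n! / (x_1! x_2! ... x_K!)
    for x in counts:
        coef //= factorial(x)
    return coef * prod(t ** x for t, x in zip(theta, counts))

# Fair four-sided "die" over {A, C, G, T}, n = 2 draws, one A and one C:
p = multinomial_pmf([1, 1, 0, 0], [0.25] * 4)
assert abs(p - 2 * 0.25 * 0.25) < 1e-12
```

With n = 1 and a one-hot count vector, the coefficient is 1 and the product reduces to a single θ_j, recovering the Mult(1, θ) case above.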
Continuous Prob. Distribution
- A continuous random variable X can assume any value in an interval on the real line or in a region in a high-dimensional space.
  - X usually corresponds to a real-valued measurement of some property, e.g., length, position, …
  - It is not possible to talk about the probability of the random variable assuming a particular value --- P(x) = 0.
  - Instead, we talk about the probability of the random variable assuming a value within a given interval or half-interval:
    P(X ∈ [x_1, x_2]),  P(X < x) = P(X ∈ (−∞, x]),
    or an arbitrary Boolean combination of basic propositions.

Continuous Prob. Distribution (cont.)
- The probability of the random variable assuming a value within some given interval from x_1 to x_2 is defined to be the area under the graph of the probability density function between x_1 and x_2.
  - Probability mass: P(X ∈ [x_1, x_2]) = ∫_{x_1}^{x_2} p(x) dx; note that ∫_{−∞}^{+∞} p(x) dx = 1.
  - Cumulative distribution function (CDF): P(x) = P(X < x) = ∫_{−∞}^{x} p(x') dx'
  - Probability density function (PDF): p(x) = (d/dx) P(x), with p(x) ≥ 0 ∀x and ∫_{−∞}^{+∞} p(x) dx = 1.
  - [Figure: car flow on Liberty Bridge (cooked up!) as an example density]
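The CDF-as-integral definition can be illustrated numerically. This sketch (our example; the slides do not use this density here) approximates P(X < x) = ∫ p(x')dx' by a Riemann sum for the density p(x) = e^{−x} on x ≥ 0, whose closed-form CDF is 1 − e^{−x}:

```python
# Sketch: approximating a CDF by numerically integrating its density.
from math import exp

def pdf(x):
    # assumed example density: p(x) = e^{-x} for x >= 0, else 0
    return exp(-x) if x >= 0 else 0.0

def cdf(x, dx=1e-4):
    """Left Riemann sum of the density from 0 (start of support) up to x."""
    n = int(x / dx)
    return sum(pdf(i * dx) * dx for i in range(n))

# The closed form for this density is 1 - e^{-x}; the sum should be close:
assert abs(cdf(2.0) - (1 - exp(-2.0))) < 1e-3
```

Differentiating the CDF recovers the density, which is the PDF definition on the slide read in the other direction.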
What is the intuitive meaning of p(x)?
- If p(x_1) = a and p(x_2) = b, then when a value X is sampled from the distribution with density p(x), you are a/b times as likely to find that X is "very close to" x_1 than that X is "very close to" x_2.
- That is:
  lim_{h→0} [ P(x_1 − h < X < x_1 + h) / P(x_2 − h < X < x_2 + h) ]
    = ∫_{x_1−h}^{x_1+h} p(x) dx / ∫_{x_2−h}^{x_2+h} p(x) dx
    = (p(x_1) × 2h) / (p(x_2) × 2h) = a/b

Continuous Distributions
- Uniform probability density function:
  p(x) = 1/(b − a) for a ≤ x ≤ b; 0 elsewhere
- Normal (Gaussian) probability density function:
  p(x) = (1 / √(2π) σ) e^{−(x−µ)² / 2σ²}
  - The distribution is symmetric, and is often illustrated as a bell-shaped curve.
  - Two parameters, µ (mean) and σ (standard deviation), determine the location and shape of the distribution.
  - The highest point on the normal curve is at the mean, which is also the median and mode.
  - The mean can be any numerical value: negative, zero, or positive.
- Exponential probability distribution:
  density: p(x) = (1/µ) e^{−x/µ};  CDF: P(x ≤ x_0) = 1 − e^{−x_0/µ}
  - [Figure: exponential density f(x) over "Time Between Successive Arrivals (mins.)", shading P(x < 2) = area = .4866]
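The a/b intuition can be checked on the exponential distribution from the slide. This sketch (µ = 2 and the test points are our choices) compares the ratio of small-interval probabilities, computed via the slide's closed-form CDF, against the ratio of densities:

```python
# Sketch: P(x-h < X < x+h) ≈ 2h·p(x), so the interval-probability ratio
# approaches p(x1)/p(x2) as h shrinks. Exponential density with µ = 2.
from math import exp

mu = 2.0
pdf = lambda x: exp(-x / mu) / mu            # p(x) = (1/µ) e^{-x/µ}
cdf = lambda x: 1 - exp(-x / mu)             # P(X ≤ x) = 1 - e^{-x/µ}

def interval_mass(x, h):
    """P(x - h < X < x + h), exactly, via the CDF."""
    return cdf(x + h) - cdf(x - h)

x1, x2, h = 0.5, 3.0, 1e-4
ratio = interval_mass(x1, h) / interval_mass(x2, h)
assert abs(ratio - pdf(x1) / pdf(x2)) < 1e-3
```

As h → 0 the approximation error vanishes, which is exactly the limit written on the slide.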