Learning in Bayes Nets
Machine Learning 10-701
Anna Goldenberg

Learning Parameters and Structure
1. Parameter Learning/Estimation: infer Θ from data, given G

   Parents   P(W|Pa)    P(~W|Pa)
   ~L,~R     θ1 = ?     1 − θ1
   ~L, R     θ2 = ?     1 − θ2
    L,~R     θ3 = ?     1 − θ3
    L, R     θ4 = ?     1 − θ4

2. Structure Learning: infer G and Θ from data (the graph is unknown, so the parent sets in each CPT are themselves question marks)

(a small code sketch of such a CPT appears below)

Parameter Learning Outline
- Frequentist Parameter Estimation
  - MLE
  - example of estimation with discrete data
  - MAP estimate for discrete data
- Bayesian Parameter Estimation
  - how it's different from Frequentist

Parameter Estimation
- G is a given DAG over N variables
- Goal: estimate Θ from iid data $D = (x^1, \ldots, x^M)$, where M is the number of records
- Each record $x^m = \{x^m_1, \ldots, x^m_N\}$
- Complete observability (no missing values)
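To make the table above concrete, here is a minimal Python sketch of a tabular CPD P(W | L, R); the variable names and the numeric θ values are invented placeholders (in parameter learning they are exactly what gets estimated).

```python
# Minimal sketch: a tabular CPD P(W | L, R) for a binary node W with binary parents.
# The theta values are made-up placeholders, not estimates from any data.
cpd_W = {
    # (L, R): P(W=1 | L, R); P(W=0 | L, R) is 1 - theta
    (0, 0): 0.05,   # theta_1
    (0, 1): 0.80,   # theta_2
    (1, 0): 0.70,   # theta_3
    (1, 1): 0.95,   # theta_4
}

def p_w_given_parents(w: int, l: int, r: int) -> float:
    """Look up P(W=w | L=l, R=r) from the table."""
    theta = cpd_W[(l, r)]
    return theta if w == 1 else 1.0 - theta

print(p_w_given_parents(1, 0, 1))  # 0.80
print(p_w_given_parents(0, 0, 1))  # 0.20
```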
Maximum Likelihood Estimator
- Likelihood (for iid data): $p(D \mid \theta) = \prod_m \prod_i p(x^m_i \mid x^m_{\pi_i}, \theta)$
- Log likelihood: $\ell(\theta; D) = \log p(D \mid \theta) = \sum_m \sum_i \log p(x^m_i \mid x^m_{\pi_i}, \theta)$
- MLE: $\hat{\theta}_{ML} = \arg\max_\theta \ell(\theta; D)$
- advantages: has nice statistical properties
- disadvantages: can overfit

Example: MLE for one variable
- Variable X ~ Multinomial with K values (a K-sided die)
- Observe M rolls: 1, 4, K, 2, ...
- Model: $p(X = k) = \theta_k$, with constraint $\sum_k \theta_k = 1$   (2)
- Log likelihood:
  $\ell(\theta; D) = \sum_m \log \theta_{x^m} = \sum_m \sum_k I(x^m = k)\log\theta_k = \sum_k N_k \log\theta_k$   (1)
- Maximizing (1) subject to constraint (2): $\hat{\theta}_{k,ML} = N_k / M$, the fraction of times k occurs

Discrete Bayes Nets
- Assume each CPD is represented as a table
- Log likelihood: decomposes into a sum of multinomial terms, one per (node, parent configuration) family, $\ell(\theta; D) = \sum_{ijk} N_{ijk}\log\theta_{ijk}$
- Parameter estimator: each table entry is its observed count normalized within its family, exactly as in the one-variable example above

Continuous Variables
- Example: Gaussian variables
- One variable: $X \sim N(\mu, \sigma)$
- ML estimates: $\hat{\mu}_{ML} = \frac{\sum_m x^m}{M}$,  $\hat{\sigma}^2_{ML} = \frac{\sum_m (x^m - \hat{\mu}_{ML})^2}{M}$
- Similarly for several continuous variables
- Another option to estimate parameters: $X_i \sim f(Pa_i, \theta)$
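A minimal sketch of both MLE computations above in Python (numpy only; the data arrays are invented for illustration):

```python
import numpy as np

# --- MLE for a K-sided die (multinomial) ---
# theta_k,ML = N_k / M: the fraction of rolls that came up k.
rolls = np.array([1, 4, 3, 2, 1, 1, 4, 2])          # hypothetical observed rolls
K = 4
counts = np.bincount(rolls, minlength=K + 1)[1:]     # N_k for k = 1..K
theta_ml = counts / counts.sum()                     # N_k / M
print("multinomial MLE:", theta_ml)

# --- MLE for a single Gaussian variable ---
# mu_ML is the sample mean; sigma^2_ML divides by M (not M-1).
x = np.array([2.1, 1.9, 2.4, 2.0, 1.7])              # hypothetical observations
mu_ml = x.mean()
sigma2_ml = ((x - mu_ml) ** 2).mean()
print("Gaussian MLE:", mu_ml, sigma2_ml)
```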
Maximum A Posteriori (MAP) estimate
- MLE is obtained by maximizing the log likelihood
  - sensitive to small sample sizes
- MAP comes from maximizing the posterior: $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$ = likelihood × prior
  - the prior acts as a smoothing factor
  - $\alpha$ can be thought of as virtual pseudo counts

Example: MAP for Multinomial
- Multinomial likelihood: $P(D \mid \theta) = \prod_{ijk} \theta_{ijk}^{N_{ijk}}$
- Dirichlet prior: $P(\theta \mid \alpha) = \frac{1}{Z(\alpha)} \prod_{ijk} \theta_{ijk}^{\alpha_{ijk} - 1}$
- Posterior: $P(\theta \mid D, \alpha) \propto \prod_{ijk} \theta_{ijk}^{N_{ijk} + \alpha_{ijk} - 1}$
- MAP estimate: $\hat{\theta}^{MAP}_{ijk} = \frac{N_{ijk} + \alpha_{ijk}}{\sum_{j'} (N_{ij'k} + \alpha_{ij'k})}$
(a small smoothing sketch appears at the end of this section)

Bayesian vs Frequentist
- Frequentist: the $\theta$ are unknown constants
  - MLE is a very common frequentist estimator
- Bayesian: the unknown $\theta$ are random variables
  - estimates differ based on a prior

Questions on Parameter Learning?
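A minimal sketch of the smoothing effect of Dirichlet pseudocounts for a single multinomial variable (Python, numpy; the counts and the uniform α are invented for illustration, and the single-variable case sidesteps the family indices of the slide):

```python
import numpy as np

# Observed counts N_k for a 4-sided die seen only a few times.
N = np.array([3, 0, 1, 0])

# MLE: zero counts give zero probability -- sensitive to small samples.
theta_mle = N / N.sum()

# MAP-style estimate with Dirichlet pseudocounts alpha_k
# (a uniform "virtual count" of 1 per outcome, i.e. add-one smoothing).
alpha = np.ones_like(N)
theta_map = (N + alpha) / (N + alpha).sum()

print("MLE :", theta_mle)   # [0.75, 0.0  , 0.25, 0.0  ]
print("MAP :", theta_map)   # [0.5 , 0.125, 0.25, 0.125] -- prior smooths the zeros
```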
Structural Learning
- When?
  - Scientific discovery (protein networks, data mining)
  - Need a good model for compression, prediction...
[Figure: example of a learned network over author names (j_miller, m_moore, j_kolojejchick, u_saranli, m_derthick, j_kozar, j_harrison, r_munos, m_riedmiller, j_boyan, m_meila, a_steinfeld, j_schneider, b_anderson, t_kanade, k_deng, a_moore, l_kramer, l_baird, a_ankolekar, m_nechyba, v_cicirello, j_kubica)]

What if G is not given?
- Constraint Based
  - test independencies
  - add edges according to the tests
- Search and Score
  - define a selection criterion that measures goodness of a model
  - search in the space of all models (or orders)
- Mix the two (recent)
  - test for almost all independencies
  - search and score over what remains possible

Constraint Based Learning
- Define a conditional independence test Ind(X_i; X_j | S)
  - e.g. $\chi^2$: $\sum_{x_i, x_j} \frac{(O_{x_i, x_j \mid s} - E_{x_i, x_j \mid s})^2}{E_{x_i, x_j \mid s}}$
  - $G^2$, conditional entropy, etc.
- If Ind(X_i; X_j | S) < p, then declare independence
  - choose p with care!
- Construct a model consistent with the set of independencies
(a code sketch of such a test appears below)

Constraint Based Learning: pros and cons
- Cons:
  - independence tests are less reliable on small samples
  - one incorrect independence test might propagate far (not robust to noise)
- Pros:
  - more global decisions => doesn't get stuck in local minima as much
  - works well on sparse nets (small Markov blankets, sufficient data)
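A minimal sketch of the marginal (unconditional) χ² independence test between two discrete variables, using scipy; a conditional test Ind(X_i; X_j | S) would repeat this within each configuration of S. The data and the 0.05 threshold are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical samples of two binary variables X_i and X_j.
xi = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
xj = np.array([0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

# Build the observed contingency table O; chi2_contingency computes the
# expected counts E internally and returns sum((O - E)^2 / E) with a p-value.
table = np.zeros((2, 2))
for a, b in zip(xi, xj):
    table[a, b] += 1

chi2, p_value, dof, expected = chi2_contingency(table)
independent = p_value > 0.05   # fail to reject independence at this threshold
print(chi2, p_value, independent)
```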
Score Based Search: Outline
- Select the highest scoring model!
- What should the score be?
- Specialized structures (trees, TANs)
- Selection operators - how to navigate the space of models?
- General case:
  - same space as constraint-based search!
  - Theorem: maximizing the Bayesian score for d ≥ 2 parents (i.e. not a tree) is NP-hard (Chickering, 2002)
- Special case (trees):
  - have to consider only all pairs (tree => only one parent): O(N^2)

Maximum likelihood in Information Theoretic terms
- $\log P(D \mid \hat{\theta}_G, G) = M \sum_i \hat{I}(X_i; Pa_{X_i}) - M \sum_i \hat{H}(X_i)$
- The entropy term does not depend on the current model
- Thus, it's enough to maximize the mutual information!

Chow-Liu tree algorithm
- Compute the empirical distribution $\hat{P}(X_i, X_j)$ from counts
- Mutual information: $\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{P}(x_i, x_j) \log \frac{\hat{P}(x_i, x_j)}{\hat{P}(x_i)\hat{P}(x_j)}$
- Set $\hat{I}(X_i, X_j)$ as the weight of the edge between X_i and X_j
- Find the optimal tree BN by taking the maximum spanning tree
- For direction: pick a random node as root, direct edges in BFS order
(a sketch of the algorithm appears below)

Tree Augmented Naive Bayes
- TAN (Friedman et al, 1997) is an extension of Chow-Liu
[Figure: Naive Bayes (class C is the sole parent of X1 ... XM) vs. TAN (C plus one feature parent per X_j)]
- Score(TAN): $\sum_i \hat{I}(X_i, C) + \sum_j \hat{I}(X_j, \{Pa_{X_j}, C\})$
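A minimal sketch of the Chow-Liu procedure for discrete data (Python with numpy and networkx; the tiny data matrix is invented, and the empirical mutual information is computed by brute force over value pairs):

```python
import numpy as np
import networkx as nx

def empirical_mi(a, b):
    """Empirical mutual information I(A;B) of two discrete columns."""
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# Hypothetical data: M records over N discrete variables (columns).
data = np.array([[0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 1, 1]])
N = data.shape[1]

# Weight every pair of variables by empirical MI, take the maximum spanning tree.
g = nx.Graph()
for i in range(N):
    for j in range(i + 1, N):
        g.add_edge(i, j, weight=empirical_mi(data[:, i], data[:, j]))
tree = nx.maximum_spanning_tree(g)

# Direct edges away from an arbitrary root in BFS order.
root = 0
directed = list(nx.bfs_edges(tree, root))
print(directed)   # parent -> child pairs of the learned tree BN
```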
MI Problem
- Mutual information doesn't penalize complexity: $I(A; B) \le I(A; \{B, C\})$
- Adding a parent always increases the score!
- The model will overfit, since the completely connected graph would be favored

Penalized Likelihood Score
- BIC (Bayesian Information Criterion):
  $\log P(D) \approx \log P(D \mid \hat{\theta}_{ML}) - \frac{d}{2}\log(N)$, where d is the number of free parameters
- AIC (Akaike Information Criterion):
  $\log P(D) \approx \log P(D \mid \hat{\theta}_{ML}) - d$
- BIC penalizes complexity more than AIC
(a sketch computing both appears below)

Minimum Description Length
- The total number of bits needed to describe data x is $-\log_2 P(x)$
- Instead, send the model and then the residuals:
  $-L(D, H) = -\log P(H) - \log P(D \mid H) = -\log P(H \mid D) + \text{const}$
- The best model is the one with the shortest message!

What should the score be?
- Consistent: for all G' I-equivalent to the true G and all G* not equivalent to G,
  Score(G) = Score(G') and Score(G*) < Score(G')
- Decomposable: can be computed locally (for efficiency):
  $Score(G; D) = \sum_i FamScore(X_i \mid Pa_{X_i}; D)$
- Example: BIC and AIC are consistent and decomposable
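A minimal sketch of the BIC and AIC penalties for a single multinomial family (Python, numpy; the counts are invented, and d is taken as K−1 free parameters for a K-valued variable):

```python
import numpy as np

def log_likelihood_ml(counts):
    """Maximized multinomial log likelihood: sum_k N_k * log(N_k / M)."""
    counts = np.asarray(counts, dtype=float)
    M = counts.sum()
    nz = counts > 0
    return float(np.sum(counts[nz] * np.log(counts[nz] / M)))

def bic_score(counts):
    d = len(counts) - 1                 # free parameters of a K-valued multinomial
    N = np.sum(counts)                  # number of records
    return log_likelihood_ml(counts) - 0.5 * d * np.log(N)

def aic_score(counts):
    d = len(counts) - 1
    return log_likelihood_ml(counts) - d

counts = [30, 12, 8]                    # hypothetical observed counts
print(bic_score(counts), aic_score(counts))  # BIC penalty grows with log(N)
```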
Bayesian Scoring: Parameter Prior
- Parameter prior - important for small datasets!
- Dirichlet parameters (from a few slides before)
- For each possible family define a prior distribution
  - can encode it as a Bayes Net
  - (usually independent - a product of marginals)

Bayesian Scoring: Parameter Prior (BDe)
- Bayes Dirichlet equivalent scoring (BDe): $\alpha_{X_i \mid Pa_{X_i}} = M'\, P'(X_i, Pa(X_i))$, where M' is the equivalent sample size and P' a prior distribution
- Is consistent (and decomposable)
- Theorem: if P(G) assigns the same prior to I-equivalent structures and the parameter prior is Dirichlet, then the Bayesian score satisfies score equivalence if and only if the prior is of BDe form!
(a pseudocount sketch appears at the end of this section)

Bayesian Scoring: Structure Prior
- Structure prior - should satisfy prior modularity
- Parameter modularity: if X has the same set of parents in two different structures, then the parameters should be the same
- Typically set to uniform
- Can be a function of prior counts: $\frac{1}{\alpha + 1}$

Structure search algorithms
- Order is known
- Order is unknown:
  - search in the space of orderings
  - search in the space of DAGs
  - search in the space of equivalence classes
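A minimal sketch of how BDe-style pseudocounts are typically built from an equivalent sample size and a prior distribution P' (Python; the cardinalities and equivalent sample size are invented, and P' is taken to be uniform, the common special case often called BDeu):

```python
import numpy as np

def bdeu_pseudocounts(card_x, card_parents, equiv_sample_size):
    """
    Pseudocounts alpha_{x, pa} = M' * P'(x, pa) with P' uniform over the joint
    states of X and its parents: each cell gets M' / (|X| * prod |Pa_j|).
    Returns an array of shape (number of parent configurations, |X|).
    """
    n_parent_cfg = int(np.prod(card_parents)) if card_parents else 1
    alpha = equiv_sample_size / (card_x * n_parent_cfg)
    return np.full((n_parent_cfg, card_x), alpha)

# Hypothetical family: binary X with two ternary parents, equivalent sample size 10.
alpha = bdeu_pseudocounts(card_x=2, card_parents=[3, 3], equiv_sample_size=10.0)
print(alpha.shape, alpha[0, 0])   # (9, 2), each cell 10 / 18
```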
Order is known
- Suppose the total ordering of the variables is given
- Then for each node X_i we can find an optimal set of parents among the variables that precede it
- The choice of parents for X_j doesn't depend on the choice made for a previous X_i
- Need to search among all choices of parent sets (where d is the maximum number of parents) for the highest local score
- Greedy search with a known order is the K2 algorithm (a sketch appears at the end of this section)

Order is unknown: search the space of orderings
- Select an order according to some heuristic
- Use K2 to learn a BN corresponding to the ordering and score it
- Maybe do multiple restarts
- Most recent research: Teyssier and Koller (2005)

Order is unknown: search the space of DAGs
- Typical search operators:
  - add an edge
  - remove an edge
  - reverse an edge
- At most O(n^2) steps to get from any graph to any graph
- Moves are reversible
- Simplest search is greedy hillclimbing
  - move to the proposed new graph if it satisfies the constraints

Exploiting the Decomposable Score
- If the operator for edge (X, Y) is valid, we only need to look at the families of X and Y
  - e.g. for the edge-addition operator, only the family of the child node changes
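A minimal sketch of K2-style greedy parent selection for a single node given an ordering (Python; `local_score` is a hypothetical stand-in for any decomposable family score such as BIC or BDe, and `max_parents` caps the search as on the slide):

```python
def k2_parents(node, predecessors, local_score, max_parents):
    """
    Greedily grow the parent set of `node` from its predecessors in the order,
    keeping each addition only if it improves the decomposable family score.
    `local_score(node, parents)` must be supplied by the caller.
    """
    parents = set()
    best = local_score(node, parents)
    improved = True
    while improved and len(parents) < max_parents:
        improved = False
        # Try every predecessor not yet a parent; keep the single best addition.
        candidates = [p for p in predecessors if p not in parents]
        scored = [(local_score(node, parents | {p}), p) for p in candidates]
        if scored:
            score, p = max(scored)
            if score > best:
                parents.add(p)
                best = score
                improved = True
    return parents, best
```

Running this for each node in the given order yields a DAG, since parents are always drawn from earlier variables in the ordering.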