Structure Learning


  1. Learning Probabilistic Graphical Models: BN Structure. Structure Learning. Daphne Koller

  2. Why Structure Learning
     • To learn model for new queries, when domain expertise is not perfect
     • For structure discovery, when inferring network structure is goal in itself

  3. Importance of Accurate Structure
     [Figure: true network over A, B, C, D, alongside a variant missing an arc and a variant adding an arc]
     Missing an arc:
     • Incorrect independencies
     • Correct distribution P* cannot be learned
     • But could generalize better
     Adding an arc:
     • Spurious dependencies
     • Can correctly learn P*
     • Increases # of parameters
     • Worse generalization

  4. Score-Based Learning
     Define a scoring function that evaluates how well a structure matches the data.
     [Figure: dataset of samples over A, B, C, e.g. <1,0,0>, <1,1,1>, <0,0,1>, <0,1,1>, <0,1,0>, ..., and candidate network structures over A, B, C]
     Search for a structure that maximizes the score.

  5. Learning Probabilistic Graphical Models: BN Structure. Likelihood Score. Daphne Koller

  6. Likelihood Score
     • Find (G, θ) that maximize the likelihood:
       $$\text{score}_L(G : \mathcal{D}) = \ell(\hat{\theta}_G, G : \mathcal{D}) = \max_\theta \ell(\theta, G : \mathcal{D})$$
       where $\hat{\theta}_G$ are the MLE parameters for G

  7. Example
     [Figure: two candidate networks over X and Y, one with no edge and one with the edge X → Y]
     • The likelihood score of X → Y exceeds that of the empty network by $M \cdot I_{\hat{P}}(X; Y)$, the empirical mutual information scaled by the number of samples M

  8. General Decomposition
     • The likelihood score decomposes as:
       $$\text{score}_L(G : \mathcal{D}) = M \sum_{i=1}^n I_{\hat{P}}(X_i ; \mathrm{Pa}_{X_i}^G) - M \sum_{i=1}^n H_{\hat{P}}(X_i)$$
     • The second term does not depend on G, so the score measures the strength of the dependencies between each variable and its parents
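A minimal sketch of computing this score from data, assuming samples arrive as an integer NumPy matrix; `parents` and all function names here are illustrative, not from the lecture:

```python
import numpy as np
from collections import Counter

def entropy(counts, M):
    """Empirical entropy from a table of (joint) counts."""
    p = np.array(list(counts.values())) / M
    return -np.sum(p * np.log(p))

def likelihood_score(data, parents):
    """score_L(G : D) = sum_i [ M * I(X_i; Pa_i) - M * H(X_i) ].

    data    : (M, n) integer array, one row per training instance
    parents : dict mapping variable index i -> list of parent indices
    """
    M, n = data.shape
    score = 0.0
    for i in range(n):
        pa = parents.get(i, [])
        h_i = entropy(Counter(data[:, i]), M)            # H(X_i)
        mi = 0.0
        if pa:
            h_pa = entropy(Counter(map(tuple, data[:, pa])), M)
            h_joint = entropy(Counter(map(tuple, data[:, [i] + pa])), M)
            mi = h_i + h_pa - h_joint                    # I(X_i; Pa_i)
        score += M * (mi - h_i)
    return score
```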

  9. Limitations of Likelihood Score
     [Figure: networks over X and Y, with and without the edge X → Y]
     • Mutual information is always ≥ 0
     • Equals 0 iff X, Y are independent in the empirical distribution
     • Adding edges can’t hurt, and almost always helps
     • Score is maximized for the fully connected network

  10. Avoiding Overfitting
     • Restricting the hypothesis space
       – restrict # of parents or # of parameters
     • Scores that penalize complexity:
       – Explicitly (as in the BIC score)
       – Bayesian score averages over all possible parameter values

  11. Summary
     • Likelihood score computes log-likelihood of D relative to G, using MLE parameters
       – Parameters optimized for D
     • Nice information-theoretic interpretation in terms of (in)dependencies in G
     • Guaranteed to overfit the training data (if we don’t impose constraints)

  12. Learning Probabilistic Graphical Models: BN Structure. BIC Score and Asymptotic Consistency. Daphne Koller

  13. Penalizing Complexity
     • Tradeoff between fit to data and model complexity:
       $$\text{score}_{BIC}(G : \mathcal{D}) = \ell(\hat{\theta}_G : \mathcal{D}) - \frac{\log M}{2}\,\mathrm{Dim}[G]$$
       where M is the number of training instances and Dim[G] the number of independent parameters
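Continuing the sketch above, the BIC tradeoff is one line on top of the likelihood score; `cards` (the variable cardinalities) is an assumed input:

```python
import numpy as np

def bic_score(data, parents, cards):
    """score_BIC(G : D) = score_L(G : D) - (log M / 2) * Dim[G].

    cards : list of cardinalities, cards[i] = |Val(X_i)|
    """
    M, n = data.shape
    # Dim[G]: (|Val(X_i)| - 1) * |Val(Pa_i)| independent parameters per family
    dim_g = sum((cards[i] - 1) * int(np.prod([cards[j] for j in parents.get(i, [])]))
                for i in range(n))
    return likelihood_score(data, parents) - 0.5 * np.log(M) * dim_g
```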

  14. Asymptotic Behavior
     • The mutual-information (likelihood) term grows linearly with M, while the complexity penalty grows only logarithmically with M
       – As M grows, more emphasis is given to fit to the data

  15. Consistency
     • As M → ∞, the true structure G* (or any I-equivalent structure) maximizes the score
       – Asymptotically, spurious edges will not contribute to the likelihood and will be penalized
       – Required edges will be added, since the likelihood term grows linearly in M while the model-complexity penalty grows only logarithmically

  16. Summary
     • BIC score explicitly penalizes model complexity (# of independent parameters)
       – Its negation is often called MDL (minimum description length)
     • BIC is asymptotically consistent:
       – If the data are generated by G*, networks I-equivalent to G* will have the highest score as M grows to ∞

  17. Learning Probabilistic Graphical Models: BN Structure. Bayesian Score. Daphne Koller

  18. Bayesian Score
     $$P(G \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid G)\,P(G)}{P(\mathcal{D})}$$
     • P(D | G): marginal likelihood
     • P(G): prior over structures
     • P(D): marginal probability of the data (the same for all structures, so it can be ignored)
     $$\text{score}_B(G : \mathcal{D}) = \log P(\mathcal{D} \mid G) + \log P(G)$$

  19. Marginal Likelihood of Data Given G
     $$P(\mathcal{D} \mid G) = \int P(\mathcal{D} \mid G, \theta_G)\,P(\theta_G \mid G)\,d\theta_G$$
     • P(θ_G | G): prior over parameters
     • P(D | G, θ_G): likelihood

  20. Marginal Likelihood Intuition
     • The marginal likelihood effectively evaluates each instance using parameters learned from the preceding instances, so it measures generalization rather than raw fit

  21. Marginal Likelihood: BayesNets
     • The closed form uses the Gamma function:
       $$\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\,dt, \qquad \Gamma(x+1) = x \cdot \Gamma(x)$$
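As a sketch of how these Gamma terms enter the computation, here is the log marginal likelihood of a single multinomial variable under a symmetric Dirichlet prior (function and argument names are illustrative):

```python
from scipy.special import gammaln  # log Gamma, numerically stable

def log_marginal_multinomial(counts, alpha):
    """log P(D) for one multinomial variable with Dirichlet prior:
    log Gamma(a)/Gamma(a+M) + sum_k log Gamma(a_k + M_k)/Gamma(a_k),
    where a = sum_k a_k, M = sum_k M_k, and counts[k] = M_k.
    """
    K = len(counts)
    a_k = alpha / K                      # symmetric hyperparameters
    M = sum(counts)
    return (gammaln(alpha) - gammaln(alpha + M)
            + sum(gammaln(a_k + m) - gammaln(a_k) for m in counts))
```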

  22. Marginal Likelihood Decomposition
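A sketch of the decomposition, assuming multinomial CPDs with Dirichlet priors; M[·] denotes sufficient-statistic counts in D:

$$P(\mathcal{D} \mid G) = \prod_{i=1}^{n} \prod_{u \in \mathrm{Val}(\mathrm{Pa}_{X_i}^G)} \frac{\Gamma(\alpha_u)}{\Gamma(\alpha_u + M[u])} \prod_{x_i \in \mathrm{Val}(X_i)} \frac{\Gamma(\alpha_{x_i \mid u} + M[x_i, u])}{\Gamma(\alpha_{x_i \mid u})}, \qquad \alpha_u = \sum_{x_i} \alpha_{x_i \mid u}$$

Each factor depends only on one family, so the Bayesian score decomposes over families just like the likelihood score.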

  23. Structure Priors
     • Structure prior P(G)
       – Uniform prior: P(G) ∝ constant
       – Prior penalizing # of edges: P(G) ∝ c^|G| (0 < c < 1)
       – Prior penalizing # of parameters
     • Normalizing constant across networks is similar and can thus be ignored

  24. Parameter Priors
     • Parameter prior P(θ | G) is usually the BDe prior
       – α: equivalent sample size
       – B0: network representing the prior probability of events
       – Set α(x_i, pa_i^G) = α · P(x_i, pa_i^G | B0)
     • Note: pa_i^G are not the same as the parents of X_i in B0
     • A single network provides priors for all candidate networks
     • Unique prior with the property that I-equivalent networks have the same Bayesian score
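A minimal sketch of the pseudocount construction for one family; `joint_prob`, which answers joint-probability queries against the prior network B0, is an assumed helper, not part of the lecture:

```python
from itertools import product

def bde_pseudocounts(alpha, i, pa, cards, joint_prob):
    """Build alpha(x_i, pa_i) = alpha * P(x_i, pa_i | B0).

    joint_prob({var: value, ...}) -> prior probability of the joint
    assignment under the prior network B0 (assumed helper).
    """
    counts = {}
    for x_i in range(cards[i]):
        for u in product(*(range(cards[j]) for j in pa)):
            assignment = {i: x_i, **dict(zip(pa, u))}
            counts[(x_i, u)] = alpha * joint_prob(assignment)
    return counts
```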

  25. BDe and BIC
     • As M → ∞, a network G with Dirichlet priors satisfies:
       $$\log P(\mathcal{D} \mid G) = \ell(\hat{\theta}_G : \mathcal{D}) - \frac{\log M}{2}\,\mathrm{Dim}[G] + O(1)$$

  26. Summary
     • Bayesian score averages over parameters to avoid overfitting
     • Most often instantiated as BDe
       – BDe requires assessing a prior network
       – Can naturally incorporate prior knowledge
       – I-equivalent networks have the same score
     • Bayesian score is:
       – Asymptotically equivalent to BIC
       – Asymptotically consistent
       – But for small M, BIC tends to underfit

  27. Learning Probabilistic Graphical Models: BN Structure. Structure Learning in Trees. Daphne Koller

  28. Score-Based Learning
     Define a scoring function that evaluates how well a structure matches the data.
     [Figure: dataset of samples over A, B, C and candidate network structures over A, B, C]
     Search for a structure that maximizes the score.

  29. Optimization Problem
     Input:
     – Training data
     – Scoring function (including priors, if needed)
     – Set of possible structures
     Output: a network that maximizes the score
     Key property: decomposability
       $$\text{score}(G) = \sum_i \text{score}(X_i \mid \mathrm{Pa}_{X_i}^G)$$

  30. Learning Trees/Forests
     • Forests
       – At most one parent per variable
     • Why trees?
       – Elegant math
       – Efficient optimization
       – Sparse parameterization

  31. Learning Forests
     • p(i) = parent of X_i, or 0 if X_i has no parent
     • Improvement over the score of the empty network:
       $$\text{score}(G : \mathcal{D}) = \sum_{i : p(i) > 0} \big[\text{score}(X_i \mid X_{p(i)}) - \text{score}(X_i)\big] + \sum_i \text{score}(X_i)$$
     • Score = sum of edge scores + constant

  32. Learning Forests I
     • Set w(i → j) = score(X_j | X_i) − score(X_j)
     • For the likelihood score, $w(i \to j) = M \cdot I_{\hat{P}}(X_i ; X_j)$, and all edge weights are nonnegative
       ⇒ Optimal structure is always a tree
     • For BIC or BDe, weights can be negative
       ⇒ Optimal structure might be a forest

  33. Learning Forests II
     • A score satisfies score equivalence if I-equivalent structures have the same score
       – Such scores include likelihood, BIC, and BDe
     • For such a score, we can show w(i → j) = w(j → i), and use an undirected graph

  34. Learning Forests III (for score-equivalent scores)
     • Define an undirected graph with nodes {1, …, n}
     • Set w(i, j) = max[score(X_j | X_i) − score(X_j), 0]
     • Find the forest with maximal weight
       – Standard algorithms for max-weight spanning trees (e.g., Prim’s or Kruskal’s) in O(n²) time
       – Remove all edges of weight 0 to produce a forest
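A sketch of this procedure using Kruskal's algorithm; `family_score(j, parents)` stands in for the decomposed score term and is an assumed helper:

```python
def max_weight_forest(n, family_score):
    """Learn a forest: w(i, j) = max(Score(X_j | X_i) - Score(X_j), 0),
    then run Kruskal's algorithm, keeping only positive-weight edges.
    """
    # Symmetric weights, valid for score-equivalent scores (slide 33).
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            w = family_score(j, [i]) - family_score(j, [])
            if w > 0:                      # weight-0 edges are dropped
                edges.append((w, i, j))
    edges.sort(reverse=True)               # heaviest edges first

    parent = list(range(n))                # union-find over nodes
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    forest = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # adding (i, j) keeps acyclicity
            parent[ri] = rj
            forest.append((i, j))
    return forest
```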

  35. Learning Forests: Example
     [Figure: tree learned from data generated by the Alarm network, overlaid on the true network; correct edges and spurious edges are marked]
     • Not every edge in the tree is in the original network
     • Inferred edges are undirected – can’t determine direction

  36. Summary
     • Structure learning is an optimization over the combinatorial space of graph structures
     • Decomposability ⇒ network score is a sum of terms for different families
     • Optimal tree-structured network can be found using standard MST algorithms
     • Computation takes quadratic time

  37. Learning Probabilistic Graphical Models: BN Structure. General Graphs: Search. Daphne Koller

  38. Optimization Problem
     Input:
     – Training data
     – Scoring function
     – Set of possible structures
     Output: a network that maximizes the score

  39. Beyond Trees
     • Problem is not obvious for general networks
       – Example: allowing two parents, a greedy algorithm is no longer guaranteed to find the optimal network
     • Theorem: finding the maximal-scoring network structure with at most k parents per variable is NP-hard for k > 1

  40. Heuristic Search
     [Figure: search space of candidate networks over A, B, C, D, connected by local moves]

  41. Heuristic Search
     • Search operators:
       – Local steps: edge addition, deletion, reversal
       – Global steps
     • Search techniques:
       – Greedy hill-climbing
       – Best-first search
       – Simulated annealing
       – ...

  42. Search: Greedy Hill Climbing
     • Start with a given network
       – empty network
       – best tree
       – a random network
       – prior knowledge
     • At each iteration
       – Consider the score for all possible changes
       – Apply the change that most improves the score
     • Stop when no modification improves the score
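A hedged sketch of this loop; `score` (a structure score over parent sets) and `is_legal` (an acyclicity check) are assumed inputs, not part of the lecture:

```python
def greedy_hill_climb(n, score, is_legal, parents=None):
    """Local search over edge additions, deletions, and reversals."""
    if parents is None:
        parents = {i: set() for i in range(n)}   # start: empty network
    current = score(parents)
    while True:
        best_delta, best_move = 0.0, None
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                moves = ([("delete", i, j), ("reverse", i, j)]
                         if i in parents[j] else [("add", i, j)])
                for move in moves:
                    candidate = apply_move(parents, move)
                    if not is_legal(candidate):   # must stay acyclic
                        continue
                    delta = score(candidate) - current
                    if delta > best_delta:
                        best_delta, best_move = delta, move
        if best_move is None:       # no modification improves the score
            return parents
        parents = apply_move(parents, best_move)
        current += best_delta

def apply_move(parents, move):
    """Copy the structure and apply one local change."""
    op, i, j = move
    new = {k: set(v) for k, v in parents.items()}
    if op == "add":
        new[j].add(i)               # add edge i -> j
    elif op == "delete":
        new[j].discard(i)           # delete edge i -> j
    else:                           # reverse edge i -> j to j -> i
        new[j].discard(i)
        new[i].add(j)
    return new
```

In practice, decomposability means only the families touched by a move need rescoring; the full re-score here just keeps the sketch short.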

  43. Greedy Hill Climbing Pitfalls
     • Greedy hill-climbing can get stuck in:
       – Local maxima
       – Plateaux
     • Typically because equivalent networks are often neighbors in the search space

  44. Why Edge Reversal
     [Figure: networks over A, B, C illustrating reversal of an edge]
     • Reversing an edge in a single step avoids passing through the lower-scoring intermediate network that a deletion followed by an addition would require
