Machine Learning 10-701 Tom M. Mitchell Machine Learning Department - PDF document

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 22, 2011 Today: Readings: Recommended: • Clustering • Jordan “Graphical Models” • Mixture model clustering • Muphy “Intro to Graphical • Learning Bayes Net Models” structure • Chow-Liu for trees Bayes Network Definition A Bayes network represents the joint probability distribution over a collection of random variables A Bayes network is a directed acyclic graph and a set of CPD’s • Each node denotes a random variable • Edges denote dependencies • CPD for each node X i defines P(X i | Pa(X i )) • The joint distribution over all variables is defined as Pa(X) = immediate parents of X in the graph 1

Usupervised clustering Just extreme case for EM with zero labeled examples… Clustering • Given set of data points, group them • Unsupervised learning • Which patients are similar? (or which earthquakes, customers, faces, web pages, …) 2

Mixture Distributions Model joint as mixture of multiple distributions. Use discrete-valued random variable Z to indicate which distribution is being use for each random draw So Mixture of Gaussians : • Assume each data point X=<X1, … Xn> is generated by one of several Gaussians, as follows: 1. randomly choose Gaussian i, according to P(Z=i) 2. randomly generate a data point <x1,x2 .. xn> according to N(µ i , Σ i ) EM for Mixture of Gaussian Clustering Let’s simplify to make this easier: 1. assume X=<X 1 ... X n >, and the X i are conditionally independent given Z . 2. assume only 2 clusters (values of Z), and Z 3. Assume σ known, π 1 … π K, µ 1i … µ Ki unknown Observed: X=<X 1 ... X n > Unobserved: Z X 1 X 2 X 3 X 4 3

Z EM Given observed variables X, unobserved Z Define where X 1 X 2 X 3 X 4 Iterate until convergence: • E Step: Calculate P(Z(n)|X(n), θ ) for each example X(n). Use this to construct • M Step: Replace current θ by Z EM – E Step Calculate P(Z(n)|X(n), θ ) for each observed example X(n) X(n)=<x 1 (n), x 2 (n), … x T (n)>. X 1 X 2 X 3 X 4 4

EM – M Step Z First consider update for π π ’ has no influence X 1 X 2 X 3 X 4 EM – M Step Z Now consider update for µ ji µ ji ’ has no influence X 1 X 2 X 3 X 4 … … … Compare above to MLE if Z were observable: 5

Z EM – putting it together Given observed variables X, unobserved Z Define where X 1 X 2 X 3 X 4 Iterate until convergence: • E Step: For each observed example X(n), calculate P(Z(n)|X(n), θ ) • M Step: Update Mixture of Gaussians applet Go to: http://www.socr.ucla.edu/htmls/SOCR_Charts.html then go to Go to “Line Charts”  SOCR EM Mixture Chart • try it with 2 Gaussian mixture components (“kernels”) • try it with 4 6

What you should know about EM • For learning from partly unobserved data • MLEst of θ = • EM estimate: θ = Where X is observed part of data, Z is unobserved • EM for training Bayes networks • Can also develop MAP version of EM • Can also derive your own EM algorithm for your own problem – write out expression for – E step: for each training example X k , calculate P(Z k | X k , θ ) – M step: chose new θ to maximize Learning Bayes Net Structure 7

How can we learn Bayes Net graph structure? In general case, open problem • can require lots of data (else high risk of overfitting) • can use Bayesian methods to constrain search One key result: • Chow-Liu algorithm: finds “best” tree-structured network • What’s best? – suppose P( X ) is true distribution, T( X ) is our tree-structured network, where X = <X 1 , … X n > – Chou-Liu minimizes Kullback-Leibler divergence: 8

Chow-Liu Algorithm Key result: To minimize KL(P || T), it suffices to find the tree network T that maximizes the sum of mutual informations over its edges Mutual information for an edge between variable A and B: This works because for tree networks with nodes Chow-Liu Algorithm 1. for each pair of vars A,B, use data to estimate P(A,B), P(A), P(B) 2. for each pair of vars A,B calculate mutual information 3. calculate the maximum spanning tree over the set of variables, using edge weights I(A,B) (given N vars, this costs only O(N 2 ) time) 4. add arrows to edges to form a directed-acyclic graph 5. learn the CPD’s for this graph 9

1/ 1/ 1/ 1/ 1/ 1/ 1/ 1/ 1/ 1/ 1/ [courtesy A. Singh, C. Guestrin] Bayes Nets – What You Should Know • Representation – Bayes nets represent joint distribution as a DAG + Conditional Distributions – D-separation lets us decode conditional independence assumptions • Inference – NP-hard in general – For some graphs, closed form inference is feasible – Approximate methods too, e.g., Monte Carlo methods, … • Learning – Easy for known graph, fully observed data (MLE’s, MAP est.) – EM for partly observed data – Learning graph structure: Chow-Liu for tree-structured networks 10

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department - PDF document

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 22, 2011 Today: Readings: Recommended: Clustering Jordan Graphical Models Mixture model clustering Muphy Intro

Machine Learning Machine Learning 10 10- -701/15 701/15- -781, Fall 2006 781, Fall 2006

701 HARRISON Planning Commission Hearing April 30th, 2020 701 HARRISON PROJECT SITE ASSESSOR'S

Machine Learning - Intro Aarti Singh Machine Learning 10-701/15-781 Sept 8, 2010 You tell me

10-701 Machine Learning (Spring 2012) Principal Component Analysis Yang Xu This note is partly

9.1 Overview 9 Deep Learning Alexander Smola Introduction to Machine Learning 10-701

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

Quantum Machine Learning Adam Brown, HEP-AI Quantum Computing Machine Learning Quantum

MICROSOFT AZURE MACHINE LEARNING Oscar Naim Microsoft Microsoft Azure Machine Learning What is

MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 APPLIED MACHINE LEARNING

MACHINE LEARNING kernels 1 MACHINE LEARNING 2012 MACHINE LEARNING Kernels: Intuition How

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University

Tropical cycles and Chow polytopes Alex Fink Department of Mathematics University of California,

Stanley @ 70 June 23, 2014 4 n Theorem 1 . There are binary sequences of length 4 n + 1 2

Summary statistics DATA MAN IP ULATION W ITH PAN DAS Maggie Matsui Content Developer at

The Chow form of a reciprocal linear space Cynthia Vinzant North Carolina State University joint

Geometric control and applications Ludovic Rifford Universit e Nice Sophia Antipolis Fields

The Bayesian Chow-Liu Algorithm Joe Suzuki Osaka University September 19, 2012 Granada, Spain

Real Options Switching Strategies in Dynamic Transport Service Operations Qian-wen Guo a , Joseph

A nonsmooth Chow-Rashevskis theorem Ermal Feleqi University of Vlora, Albania Optimization,

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us