CSE574 - Administrivia • No class on Fri 01/25 (Ski Day)
Last Wednesday
• HMMs
  – Most likely individual state at time t (forward)
  – Most likely sequence of states (Viterbi)
  – Learning using EM
• Generative vs. Discriminative Learning
  – Model p(y,x) vs. p(y|x)
  – p(y|x): don't bother about p(x) if we only want to do classification
Today
• Markov Networks
  – Inference: Gibbs sampling / MCMC, belief propagation
  – Learning weights and structure
• CRFs
  – Linear-chain CRFs: forward/backward and Viterbi inference
  – Learning by maximizing conditional likelihood
[Figure by Sutton & McCallum: generative directed models and their conditional counterparts — Naïve Bayes vs. Logistic Regression (single class), HMMs vs. linear-chain CRFs (sequence), general directed models vs. general CRFs (general graphs).]
Graphical Models
• Family of probability distributions that factorize in a certain way
• Directed (Bayes Nets): a node is independent of its non-descendants given its parents
  p(x) = ∏_{i=1}^{K} p(x_i | Parents(x_i)),   x = (x_1, x_2, ..., x_K)
• Undirected (Markov Random Field): a node is independent of all other nodes given its neighbors
  p(x) = (1/Z) ∏_C Ψ_C(x_C),   C ⊆ {x_1, ..., x_K} a clique,   Ψ_C a potential function
• Factor Graphs
  p(x) = (1/Z) ∏_A Ψ_A(x_A),   A ⊆ {x_1, ..., x_K},   Ψ_A a factor function
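A minimal sketch (not from the slides; all numbers are invented) contrasting the two factorizations on three binary variables: a directed chain whose conditional probability tables already make the joint sum to 1, and an undirected chain of arbitrary positive potentials that must be divided by the partition function Z.

```python
import itertools

# Directed chain x1 -> x2 -> x3 with conditional probability tables.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p(x2 | x1)
p_x3_given_x2 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}   # p(x3 | x2)

def p_directed(x1, x2, x3):
    # Product of conditionals: p(x) = prod_i p(x_i | Parents(x_i))
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Undirected chain: arbitrary positive potentials on cliques {x1,x2} and {x2,x3}.
psi_12 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 3.0}
psi_23 = {(0, 0): 1.5, (0, 1): 0.2, (1, 0): 0.7, (1, 1): 4.0}

def unnormalized(x1, x2, x3):
    # Product of clique potentials: p(x) proportional to prod_C Psi_C(x_C)
    return psi_12[(x1, x2)] * psi_23[(x2, x3)]

states = list(itertools.product([0, 1], repeat=3))
print(sum(p_directed(*x) for x in states))         # 1.0 -- no normalization needed
Z = sum(unnormalized(*x) for x in states)          # partition function
print(sum(unnormalized(*x) / Z for x in states))   # 1.0 -- only after dividing by Z
```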
Markov Networks
• Undirected graphical models (example: nodes A, B, C, D with cliques {A,B} and {B,C,D})
• Potential functions defined over cliques
  P(X) = (1/Z) ∏_c Φ_c(X_c),   Z = ∑_X ∏_c Φ_c(X_c)
  Φ(A,B) = 3.7 if A and B; 2.1 if A and ¬B; 0.7 otherwise
  Φ(B,C,D) = 2.3 if B and C and D; 5.1 otherwise
Slide by Domingos
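A small sketch (not from the slides) that enumerates all assignments of A, B, C, D, multiplies the two clique potentials above, and normalizes by Z; the "A and ¬B" reading of the second case of Φ(A,B) is our assumption about the original slide.

```python
import itertools

def phi_AB(a, b):
    # Clique potential over {A, B}; the "A and not B" branch is our reading of the slide.
    if a and b:
        return 3.7
    if a and not b:
        return 2.1
    return 0.7

def phi_BCD(b, c, d):
    # Clique potential over {B, C, D}.
    return 2.3 if (b and c and d) else 5.1

assignments = list(itertools.product([False, True], repeat=4))
Z = sum(phi_AB(a, b) * phi_BCD(b, c, d) for a, b, c, d in assignments)

def P(a, b, c, d):
    return phi_AB(a, b) * phi_BCD(b, c, d) / Z

print(Z)
print(P(True, True, True, True))
print(sum(P(*x) for x in assignments))   # sanity check: sums to 1
```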
Markov Networks
• Undirected graphical models (same example: A, B, C, D)
• Potential functions rewritten as weighted features
  P(X) = (1/Z) exp( ∑_i w_i f_i(X) ),   Z = ∑_X exp( ∑_i w_i f_i(X) )
  w_i: weight of feature i,   f_i: feature i
  f(A,B) = 1 if A and B, 0 otherwise
  f(B,C,D) = 1 if B and C and D, 0 otherwise
Slide by Domingos
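A short sketch of the log-linear form (the weights below are arbitrary, invented for illustration): with indicator features, each weight w_i multiplies the unnormalized score by e^{w_i} exactly when its feature fires.

```python
import itertools, math

features = [
    lambda a, b, c, d: 1.0 if (a and b) else 0.0,          # f(A, B)
    lambda a, b, c, d: 1.0 if (b and c and d) else 0.0,    # f(B, C, D)
]
weights = [1.3, -0.8]   # example weights, one per feature

def score(x):
    # Unnormalized log-linear score: exp(sum_i w_i f_i(X))
    return math.exp(sum(w * f(*x) for w, f in zip(weights, features)))

assignments = list(itertools.product([False, True], repeat=4))
Z = sum(score(x) for x in assignments)
P = {x: score(x) / Z for x in assignments}
print(P[(True, True, True, True)])
```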
Hammersley-Clifford Theorem
If the distribution is strictly positive (P(x) > 0)
and the graph encodes its conditional independences,
then the distribution is a product of potentials over the cliques of the graph.
The converse also holds.
Slide by Domingos
Markov Nets vs. Bayes Nets
Property          Markov Nets           Bayes Nets
Form              Prod. of potentials   Prod. of potentials
Potentials        Arbitrary             Conditional probabilities
Cycles            Allowed               Forbidden
Partition func.   Z = ?                 Z = 1
Indep. check      Graph separation      D-separation
Indep. props.     Some                  Some
Inference         MCMC, BP, etc.        Convert to Markov net
Slide by Domingos
Inference in Markov Networks
• Goal: compute marginals & conditionals of
  P(X) = (1/Z) exp( ∑_i w_i f_i(X) ),   Z = ∑_X exp( ∑_i w_i f_i(X) )
• Exact inference is #P-complete. E.g.: what is P(x_i)?
• Conditioning on the Markov blanket is easy: what is P(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_N)?
  P(x_i | MB(x_i)) = exp( ∑_j w_j f_j(x) ) / [ exp( ∑_j w_j f_j(x)|_{x_i=0} ) + exp( ∑_j w_j f_j(x)|_{x_i=1} ) ]
  (only the features that mention x_i matter; the rest cancel)
• Gibbs sampling exploits this
Slide by Domingos
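A minimal sketch (reusing the A, B, C, D potentials from the earlier slide, with our "¬B" reading) showing that the conditional of one node given its Markov blanket needs only the potentials that mention that node, and agrees with conditioning on the full joint.

```python
def phi_AB(a, b):
    return 3.7 if (a and b) else (2.1 if (a and not b) else 0.7)

def phi_BCD(b, c, d):
    return 2.3 if (b and c and d) else 5.1

def p_B_given_blanket(a, c, d):
    # Only the potentials containing B are needed; everything else cancels in the ratio.
    s1 = phi_AB(a, True) * phi_BCD(True, c, d)    # unnormalized term with B = 1
    s0 = phi_AB(a, False) * phi_BCD(False, c, d)  # unnormalized term with B = 0
    return s1 / (s0 + s1)

# Sanity check against brute-force conditioning on the full joint.
def joint(a, b, c, d):
    return phi_AB(a, b) * phi_BCD(b, c, d)

a, c, d = True, True, False
num = joint(a, True, c, d)
den = joint(a, False, c, d) + joint(a, True, c, d)
print(p_B_given_blanket(a, c, d), num / den)   # the two values agree
```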
Markov Chain Monte Carlo
• Idea:
  – create a chain of samples x^(1), x^(2), ..., where x^(i+1) depends on x^(i)
  – the set of samples x^(1), x^(2), ... is used to approximate p(x)
• Each sample is a joint assignment to all variables, e.g. for X_1, ..., X_5:
  x^(1) = (X_1 = x_1^(1), X_2 = x_2^(1), ..., X_5 = x_5^(1))
  x^(2) = (X_1 = x_1^(2), X_2 = x_2^(2), ..., X_5 = x_5^(2))
  x^(3) = (X_1 = x_1^(3), X_2 = x_2^(3), ..., X_5 = x_5^(3))
Slide by Domingos
Markov Chain Monte Carlo
• Gibbs Sampler
  1. Start with an initial assignment to nodes
  2. One node at a time, sample that node given the others
  3. Repeat
  4. Use the samples to compute P(X)
• Convergence: burn-in + mixing time
  – Burn-in: iterations required to move away from the particular initial condition
  – Mixing time: iterations required to be close to the stationary distribution
• Many modes ⇒ multiple chains
Slide by Domingos
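A Gibbs-sampler sketch (not from the slides) for the A, B, C, D network above: each node is resampled from its conditional given the others, and the post-burn-in samples estimate a marginal, here P(B = 1), which is compared to exact enumeration.

```python
import itertools, random

def phi_AB(a, b):
    return 3.7 if (a and b) else (2.1 if (a and not b) else 0.7)

def phi_BCD(b, c, d):
    return 2.3 if (b and c and d) else 5.1

def joint(x):
    a, b, c, d = x
    return phi_AB(a, b) * phi_BCD(b, c, d)

def gibbs(num_samples=50000, burn_in=1000, seed=0):
    rng = random.Random(seed)
    x = [rng.random() < 0.5 for _ in range(4)]    # step 1: random initial assignment
    count_b = 0
    for it in range(burn_in + num_samples):
        for i in range(4):                        # step 2: resample one node at a time
            x[i] = True
            p1 = joint(x)
            x[i] = False
            p0 = joint(x)
            x[i] = rng.random() < p1 / (p0 + p1)  # sample node i given all the others
        if it >= burn_in:                         # step 4: use post-burn-in samples
            count_b += x[1]
    return count_b / num_samples

# Compare the Gibbs estimate of P(B = 1) with exact enumeration.
states = list(itertools.product([False, True], repeat=4))
Z = sum(joint(x) for x in states)
exact = sum(joint(x) for x in states if x[1]) / Z
print(gibbs(), exact)
```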
Other Inference Methods • Belief propagation (sum-product) • Mean field / Variational approximations Slide by Domingos
Learning
• Learning weights
  – Maximize likelihood
  – Convex optimization: gradient ascent, quasi-Newton methods, etc.
  – Requires inference at each step (slow!)
• Learning structure
  – Feature search
  – Evaluation using likelihood, ...
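To make the "requires inference at each step" point concrete: for the log-linear form P(X) = (1/Z) exp(∑_i w_i f_i(X)), the gradient of the log-likelihood of a dataset D with respect to each weight is (a standard result, not spelled out on the slide)

  ∂/∂w_i  ∑_{x ∈ D} log P(x)  =  ∑_{x ∈ D} f_i(x)  −  |D| · E_P[f_i(X)]

i.e. observed feature counts minus expected feature counts under the current model. The expectation E_P[f_i] is exactly the inference step that must be (approximately) recomputed after every weight update.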
Back to CRFs • CRFs are conditionally trained Markov Networks
Linear-Chain Conditional Random Fields
• From HMMs to CRFs
  p(y, x) = ∏_{t=1}^{T} p(y_t | y_{t-1}) p(x_t | y_t)
  can also be written as
  p(y, x) = (1/Z) exp( ∑_t ∑_{i,j ∈ S} λ_ij 1{y_t = i} 1{y_{t-1} = j} + ∑_t ∑_{i ∈ S} ∑_{o ∈ O} μ_oi 1{y_t = i} 1{x_t = o} )
  (set λ_ij := log p(y_t = i | y_{t-1} = j), ...)
• We let the new parameters vary freely, so we need a normalization constant Z.
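A small numerical check (not from the slides; the transition/emission numbers are invented): with λ_ij = log p(y_t = i | y_{t-1} = j) and μ_oi = log p(x_t = o | y_t = i), plus an initial-state term not shown on the slide, the exponentiated sum of indicator features reproduces the HMM joint exactly, with Z = 1.

```python
import math
from itertools import product

S = ["A", "B"]           # hidden states
O = ["u", "v"]           # observations
init  = {"A": 0.6, "B": 0.4}                                      # p(y_1)
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.2, "B": 0.8}}    # p(y_t | y_{t-1})
emit  = {"A": {"u": 0.9, "v": 0.1}, "B": {"u": 0.3, "v": 0.7}}    # p(x_t | y_t)

def hmm_joint(y, x):
    p = init[y[0]] * emit[y[0]][x[0]]
    for t in range(1, len(y)):
        p *= trans[y[t - 1]][y[t]] * emit[y[t]][x[t]]
    return p

# Log-linear rewrite: the weights are log-probabilities, the features are indicators.
lam = {(i, j): math.log(trans[j][i]) for i in S for j in S}   # lambda_ij = log p(y_t=i | y_{t-1}=j)
mu  = {(o, i): math.log(emit[i][o]) for i in S for o in O}    # mu_oi = log p(x_t=o | y_t=i)
lam0 = {i: math.log(init[i]) for i in S}                      # initial-state term

def loglinear_joint(y, x):
    score = lam0[y[0]] + mu[(x[0], y[0])]
    for t in range(1, len(y)):
        score += lam[(y[t], y[t - 1])] + mu[(x[t], y[t])]
    return math.exp(score)   # here Z = 1, since the weights come from a real HMM

x = ["u", "v", "u"]
for y in product(S, repeat=3):
    assert abs(hmm_joint(y, x) - loglinear_joint(y, x)) < 1e-12
print("HMM joint and log-linear rewrite agree")
```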
Linear-Chain Conditional Random Fields
  p(y, x) = (1/Z) exp( ∑_t ∑_{i,j ∈ S} λ_ij 1{y_t = i} 1{y_{t-1} = j} + ∑_t ∑_{i ∈ S} ∑_{o ∈ O} μ_oi 1{y_t = i} 1{x_t = o} )
  (This is a linear-chain CRF, but it includes only the current word's identity as a feature.)
• Introduce feature functions f_k(y_t, y_{t-1}, x_t):
  – one feature per transition: f_ij(y, y', x_t) := 1{y = i} 1{y' = j}
  – one feature per state-observation pair: f_io(y, y', x_t) := 1{y = i} 1{x_t = o}
  p(y, x) = (1/Z) exp( ∑_t ∑_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x_t) )
• Then the conditional distribution is
  p(y | x) = p(y, x) / ∑_{y'} p(y', x) = exp( ∑_t ∑_k λ_k f_k(y_t, y_{t-1}, x_t) ) / ∑_{y'} exp( ∑_t ∑_k λ_k f_k(y'_t, y'_{t-1}, x_t) )
Linear-Chain Conditional Random Fields • The conditional p(y|x) that follows from the joint p(y,x) of an HMM is a linear-chain CRF with particular feature functions!
Linear-Chain Conditional Random Fields
• Definition: a linear-chain CRF is a distribution of the form
  p(y | x) = (1/Z(x)) exp( ∑_{t=1}^{T} ∑_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x_t) )
  with parameters λ_k and feature functions f_k, where Z(x) is a normalization function
  Z(x) = ∑_y exp( ∑_{t=1}^{T} ∑_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x_t) )
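A brute-force sketch of this definition (not from the slides; the label set, observations, feature functions, and weights are all invented): it scores a label sequence with ∑_t ∑_k λ_k f_k, normalizes by summing over every possible labeling, and picks the most probable one by enumeration.

```python
import math
from itertools import product

S = ["N", "V"]                      # toy label set
x = ["dogs", "bark", "loudly"]      # toy observation sequence

def features(y_t, y_prev, x_t):
    # A few hand-made feature functions f_k(y_t, y_{t-1}, x_t); entirely illustrative.
    return [
        1.0 if (y_prev == "N" and y_t == "V") else 0.0,     # transition N -> V
        1.0 if (y_t == "N" and x_t.endswith("s")) else 0.0,
        1.0 if (y_t == "V" and x_t == "bark") else 0.0,
    ]

weights = [1.5, 0.8, 2.0]   # lambda_k, arbitrary

def score(y, x):
    # sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t); y_0 is a dummy START label
    total, prev = 0.0, "START"
    for y_t, x_t in zip(y, x):
        total += sum(w * f for w, f in zip(weights, features(y_t, prev, x_t)))
        prev = y_t
    return total

# Brute-force normalization Z(x): sum over all |S|^T label sequences (exponential!).
Z = sum(math.exp(score(y, x)) for y in product(S, repeat=len(x)))

def p(y, x):
    return math.exp(score(y, x)) / Z

best = max(product(S, repeat=len(x)), key=lambda y: p(y, x))
print(best, p(best, x))
```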
Linear-Chain Conditional Random Fields
• [Figure: an HMM-like linear-chain CRF — factors link (y_{t-1}, y_t) and (y_t, x_t) separately.]
• [Figure: a linear-chain CRF in which the transition score depends on the current observation — each factor links (y_{t-1}, y_t, x_t) jointly.]
Questions
• #1 – Inference: Given observations x_1 ... x_N and a CRF θ, what is P(y_t, y_{t-1} | x), and what is Z(x)? (needed for learning)
• #2 – Inference: Given observations x_1 ... x_N and a CRF θ, what is the most likely (Viterbi) labeling y* = argmax_y p(y | x)?
• #3 – Learning: Given i.i.d. training data D = {x^(i), y^(i)}, i = 1..N, how do we estimate the parameters θ = {λ_k} of a linear-chain CRF?
Solutions to #1 and #2
• Forward/backward and Viterbi algorithms, similar to the versions for HMMs
• HMM definition:
  p(y, x) = ∏_{t=1}^{T} p(y_t | y_{t-1}) p(x_t | y_t)
• HMM as a factor graph:
  p(y, x) = ∏_{t=1}^{T} Ψ_t(y_t, y_{t-1}, x_t),   Ψ_t(j, i, x) := p(y_t = j | y_{t-1} = i) p(x_t = x | y_t = j)
• Then (see the sketch below)
  forward recursion:   α_t(j) = ∑_{i ∈ S} Ψ_t(j, i, x_t) α_{t-1}(i)
  backward recursion:  β_t(i) = ∑_{j ∈ S} Ψ_{t+1}(j, i, x_{t+1}) β_{t+1}(j)
  Viterbi recursion:   δ_t(j) = max_{i ∈ S} Ψ_t(j, i, x_t) δ_{t-1}(i)
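A minimal Viterbi sketch for the factor form above (toy HMM numbers invented here; the first-step factor uses the initial distribution, an assumption since the slide leaves the t = 1 case implicit).

```python
S = ["A", "B"]
init  = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.2, "B": 0.8}}
emit  = {"A": {"u": 0.9, "v": 0.1}, "B": {"u": 0.3, "v": 0.7}}

def psi(t, j, i, x_t):
    # Psi_t(j, i, x) = p(y_t = j | y_{t-1} = i) * p(x_t = x | y_t = j); t = 1 uses the initial dist.
    return (init[j] if t == 1 else trans[i][j]) * emit[j][x_t]

def viterbi(x):
    # delta_t(j) = max_i Psi_t(j, i, x_t) * delta_{t-1}(i), with backpointers to recover y*.
    delta = {j: psi(1, j, None, x[0]) for j in S}
    back = []
    for t in range(2, len(x) + 1):
        new_delta, ptr = {}, {}
        for j in S:
            best_i = max(S, key=lambda i: psi(t, j, i, x[t - 1]) * delta[i])
            new_delta[j] = psi(t, j, best_i, x[t - 1]) * delta[best_i]
            ptr[j] = best_i
        delta, back = new_delta, back + [ptr]
    # Trace back the most likely state sequence.
    y = [max(S, key=lambda j: delta[j])]
    for ptr in reversed(back):
        y.append(ptr[y[-1]])
    return list(reversed(y))

print(viterbi(["u", "v", "u"]))
```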
Forward/Backward for linear-chain CRFs
• ... identical to the HMM version except for the factor functions Ψ_t(j, i, x_t)
• CRF definition:
  p(y | x) = (1/Z(x)) exp( ∑_t ∑_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x_t) )
• The CRF can be written as
  p(y | x) = (1/Z(x)) ∏_{t=1}^{T} Ψ_t(y_t, y_{t-1}, x_t),   Ψ_t(y_t, y_{t-1}, x_t) := exp( ∑_k λ_k f_k(y_t, y_{t-1}, x_t) )
• Same recursions as before (see the sketch below):
  forward:   α_t(j) = ∑_{i ∈ S} Ψ_t(j, i, x_t) α_{t-1}(i)
  backward:  β_t(i) = ∑_{j ∈ S} Ψ_{t+1}(j, i, x_{t+1}) β_{t+1}(j)
  Viterbi:   δ_t(j) = max_{i ∈ S} Ψ_t(j, i, x_t) δ_{t-1}(i)
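A sketch of the forward recursion for the CRF factors (reusing the invented toy features and weights from the earlier sketch, with a dummy START label at t = 1): it computes Z(x) in O(K²T) and checks the result against exponential-time enumeration.

```python
import math
from itertools import product

S = ["N", "V"]
x = ["dogs", "bark", "loudly"]
weights = [1.5, 0.8, 2.0]   # lambda_k for the toy features below

def features(y_t, y_prev, x_t):
    return [
        1.0 if (y_prev == "N" and y_t == "V") else 0.0,
        1.0 if (y_t == "N" and x_t.endswith("s")) else 0.0,
        1.0 if (y_t == "V" and x_t == "bark") else 0.0,
    ]

def psi(t, j, i, x):
    # Psi_t(j, i, x_t) = exp(sum_k lambda_k f_k(j, i, x_t)); at t = 1 the previous label is START.
    prev = "START" if t == 1 else i
    return math.exp(sum(w * f for w, f in zip(weights, features(j, prev, x[t - 1]))))

def forward_Z(x):
    # alpha_t(j) = sum_i Psi_t(j, i, x_t) alpha_{t-1}(i);  Z(x) = sum_j alpha_T(j).
    alpha = {j: psi(1, j, None, x) for j in S}
    for t in range(2, len(x) + 1):
        alpha = {j: sum(psi(t, j, i, x) * alpha[i] for i in S) for j in S}
    return sum(alpha.values())

def brute_Z(x):
    # Exponential-time check: sum over all label sequences.
    total = 0.0
    for y in product(S, repeat=len(x)):
        prev, s = "START", 0.0
        for y_t, x_t in zip(y, x):
            s += sum(w * f for w, f in zip(weights, features(y_t, prev, x_t)))
            prev = y_t
        total += math.exp(s)
    return total

print(forward_Z(x), brute_Z(x))   # the two values agree
```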
Forward/Backward for linear-chain CRFs
• Complexity is the same as for HMMs (K = |S| states, N = length of sequence)
  Time: O(K²N)   Space: O(KN)
• Linear in the length of the sequence!
Solution to #3 - Learning
• Want to maximize the conditional log likelihood
  l(θ) = ∑_{i=1}^{N} log p(y^(i) | x^(i))
• Substituting in the CRF model (and adding a regularizer, last term):
  l(θ) = ∑_{i=1}^{N} ∑_{t=1}^{T} ∑_{k=1}^{K} λ_k f_k(y_t^(i), y_{t-1}^(i), x_t^(i)) − ∑_{i=1}^{N} log Z(x^(i)) − ∑_{k=1}^{K} λ_k² / (2σ²)
• CRFs are typically learned using numerical optimization of this likelihood. (This is also possible for HMMs, but we only discussed EM.)
• Regularizer: there is often a large number of parameters, so we need to avoid overfitting. A sketch of the objective follows below.
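A sketch of the regularized objective for a single toy training pair (same invented label set, features, and weights as above; σ is an arbitrary choice): in practice log Z(x) comes from the forward recursion and the objective is fed to a gradient-based optimizer, while here it is evaluated by brute force.

```python
import math
from itertools import product

S = ["N", "V"]
sigma = 10.0
# One toy training pair (x^(i), y^(i)); everything here is invented for illustration.
train = [(["dogs", "bark", "loudly"], ["N", "V", "V"])]
weights = [1.5, 0.8, 2.0]

def features(y_t, y_prev, x_t):
    return [
        1.0 if (y_prev == "N" and y_t == "V") else 0.0,
        1.0 if (y_t == "N" and x_t.endswith("s")) else 0.0,
        1.0 if (y_t == "V" and x_t == "bark") else 0.0,
    ]

def score(y, x):
    prev, s = "START", 0.0
    for y_t, x_t in zip(y, x):
        s += sum(w * f for w, f in zip(weights, features(y_t, prev, x_t)))
        prev = y_t
    return s

def log_Z(x):
    # Brute force here; in practice computed by the forward recursion.
    return math.log(sum(math.exp(score(y, x)) for y in product(S, repeat=len(x))))

def objective():
    # l(theta) = sum_i [ score(y^(i), x^(i)) - log Z(x^(i)) ] - sum_k lambda_k^2 / (2 sigma^2)
    ll = sum(score(y, x) - log_Z(x) for x, y in train)
    return ll - sum(w * w for w in weights) / (2 * sigma ** 2)

print(objective())
```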
Regularization
• Commonly the l2-norm (Euclidean) is used
  – corresponds to a Gaussian prior over the parameters
  penalty: − ∑_{k=1}^{K} λ_k² / (2σ²)
• An alternative is the l1-norm
  – corresponds to an exponential prior over the parameters
  – encourages sparsity
  penalty: − ∑_{k=1}^{K} |λ_k| / σ
• The accuracy of the final model is not sensitive to σ