
Graphical Models 10-715 Fall 2015 Alexander Smola alex@smola.org

Graphical Models 10-715 Fall 2015, Alexander Smola, alex@smola.org. Office hours: after class in my office, Marianas Labs. Opening example (Directed Graphical Models, "Brain & Brawn"): p(brain) = 0.1, p(sports) = 0.2, binary nodes smart and strong with a conditional probability table.



  4. Example - PCA/ICA: latent factors → observed effects
 • Observed effects: click behavior, queries, watched news, emails
 • Model: $x \sim \mathcal{N}\!\left(\sum_{i=1}^{d} y_i v_i,\ \sigma^2 \mathbf{1}\right)$ and $p(y) = \prod_{i=1}^{d} p(y_i)$
 • p(y) is Gaussian for PCA, general for ICA
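A minimal Octave sketch of sampling from this latent-factor model; the dimensions, loadings, and noise level below are made up for illustration, and the Gaussian choice for y corresponds to the PCA case:
  % x ~ N(sum_i y_i v_i, sigma^2 I) with factorized p(y); columns of V play the role of the v_i
  d = 3;  n = 5;  sigma = 0.1;      % illustrative sizes and noise level
  V = randn(n, d);                  % illustrative factor loadings v_1 ... v_d
  y = randn(d, 1);                  % p(y_i) standard Gaussian (PCA case; ICA would use another factorized p(y))
  x = V * y + sigma * randn(n, 1);  % observed effects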

  5. Cocktail party problem

  6. Recommender Systems u m r

  7. Recommender Systems u m r • Users u • Movies m • Ratings r (but only for a subset of users)

  8. Recommender Systems • Users u • Movies m • Ratings r (but only for a subset of users) • Intersecting plates (like nested FOR loops)

  9. Recommender Systems (application examples: news, SearchMonkey, Answers, social ranking, OMG, Personals) • Users u • Movies m • Ratings r (but only for a subset of users) • Intersecting plates (like nested FOR loops)
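One plausible way to write the joint encoded by the intersecting plates (a sketch only; the per-user and per-movie factors and the observed index set are notation introduced here, not taken from the deck): $p(u, m, r) = \prod_i p(u_i) \prod_j p(m_j) \prod_{(i,j) \in \text{obs}} p(r_{ij} \mid u_i, m_j)$, i.e. one rating factor per observed user-movie pair.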

  10. Challenges (engineering / machine learning)

  11. Challenges (engineering / machine learning) • How to design models • Common (engineering) sense • Computational tractability

  12. Challenges (engineering / machine learning) • How to design models • Common (engineering) sense • Computational tractability • Dependency analysis

  13. Challenges (engineering / machine learning) • How to design models • Common (engineering) sense • Computational tractability • Dependency analysis • Inference • Easy for fully observed situations • Many algorithms if not fully observed • Dynamic programming / message passing

  14. Summary • Repeated structure - encode with a plate • Chains, bipartite graphs, etc. (more later) • Plates can intersect • Not all variables are observed. Plate model: $p(X, \theta) = p(\theta) \prod_i p(x_i \mid \theta)$ (shared $\theta$, repeated $x_1, x_2, x_3, x_4, \dots, x_i$)
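As a toy illustration of the plate (one factor per repeated $x_i$), a short Octave sketch of the log joint for an assumed model in which $\theta$ has a standard normal prior and each $x_i \sim \mathcal{N}(\theta, 1)$; the model and numbers are illustrative only:
  % log p(X, theta) = log p(theta) + sum_i log p(x_i | theta)   (plate = product over repeats)
  gauss = @(z, mu, s) exp(-(z - mu).^2 ./ (2 * s^2)) / sqrt(2 * pi * s^2);
  theta = 0.3;                 % illustrative parameter value
  x = [1.1 -0.4 0.7 0.2];      % illustrative observations x_1 ... x_4
  logp = log(gauss(theta, 0, 1)) + sum(log(gauss(x, theta, 1)))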

  15. Markov Chains ($x_0 \to x_1 \to x_2 \to x_3$): $p(x;\theta) = p(x_0;\theta) \prod_{i=0}^{n-1} p(x_{i+1} \mid x_i;\theta)$
 Transition matrices: $\pi_0 = (0.4,\ 0.6)$, $\Pi_{0\to1} = \begin{pmatrix} 0.2 & 0.1 \\ 0.8 & 0.9 \end{pmatrix}$, $\Pi_{1\to2} = \begin{pmatrix} 0.8 & 0.5 \\ 0.2 & 0.5 \end{pmatrix}$, $\Pi_{2\to3} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$
 Unraveling the chain: $p(x_1) = \sum_{x_0} p(x_1 \mid x_0)\, p(x_0) \iff \pi_1 = \Pi_{0\to1} \pi_0$ and $p(x_2) = \sum_{x_1} p(x_2 \mid x_1)\, p(x_1) \iff \pi_2 = \Pi_{1\to2} \pi_1 = \Pi_{1\to2} \Pi_{0\to1} \pi_0$

  16. Markov Chains: $p(x;\theta) = p(x_0;\theta) \prod_{i=0}^{n-1} p(x_{i+1} \mid x_i;\theta)$
 • From the start - sum sequentially:
 $p(x_i \mid x_1) = \sum_{x_j:\,1<j<i} \prod_{l=2}^{i-1} p(x_{l+1} \mid x_l) \cdot \underbrace{p(x_2 \mid x_1)}_{=:\, l_2(x_2)}$
 $= \sum_{x_j:\,2<j<i} \prod_{l=3}^{i-1} p(x_{l+1} \mid x_l) \cdot \underbrace{\sum_{x_2} p(x_3 \mid x_2)\, l_2(x_2)}_{=:\, l_3(x_3)}$
 $= \sum_{x_j:\,3<j<i} \prod_{l=4}^{i-1} p(x_{l+1} \mid x_l) \cdot \underbrace{\sum_{x_3} p(x_4 \mid x_3)\, l_3(x_3)}_{=:\, l_4(x_4)}$

  17. Markov Chains: $p(x;\theta) = p(x_0;\theta) \prod_{i=0}^{n-1} p(x_{i+1} \mid x_i;\theta)$, with the transition matrices from slide 15. Unraveling the chain only needs matrix-vector products:
  x0  = [0.4; 0.6];          % initial distribution pi_0
  Pi1 = [0.2 0.1; 0.8 0.9];  % Pi_{0->1}
  Pi2 = [0.8 0.5; 0.2 0.5];  % Pi_{1->2}
  Pi3 = [0 1; 1 0];          % Pi_{2->3}
  x3  = Pi3 * Pi2 * Pi1 * x0 % prints [0.45800; 0.54200]

  18. Markov Chains: $p(x;\theta) = p(x_0;\theta) \prod_{i=0}^{n-1} p(x_{i+1} \mid x_i;\theta)$
 • From the end - sum sequentially (normalize in the end):
 $p(x_1 \mid x_n) \propto \sum_{x_j:\,1<j<n} \prod_{l=1}^{n-1} p(x_{l+1} \mid x_l) \cdot \underbrace{1}_{=:\, r_n(x_n)}$
 $= \sum_{x_j:\,1<j<n-1} \prod_{l=1}^{n-2} p(x_{l+1} \mid x_l) \cdot \underbrace{\sum_{x_n} p(x_n \mid x_{n-1})\, r_n(x_n)}_{=:\, r_{n-1}(x_{n-1})}$
 $= \sum_{x_j:\,1<j<n-2} \prod_{l=1}^{n-3} p(x_{l+1} \mid x_l) \cdot \underbrace{\sum_{x_{n-1}} p(x_{n-1} \mid x_{n-2})\, r_{n-1}(x_{n-1})}_{=:\, r_{n-2}(x_{n-2})}$
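To mirror the forward computation on slide 17, a minimal Octave sketch of the backward recursion $r_i = \Pi_{i \to i+1}^\top r_{i+1}$ with the same matrices, assuming, purely for illustration, that $x_3$ is observed in its first state:
  % Backward messages for the chain of slide 15, then combine with the prior on x_0.
  x0  = [0.4; 0.6];
  Pi1 = [0.2 0.1; 0.8 0.9];
  Pi2 = [0.8 0.5; 0.2 0.5];
  Pi3 = [0 1; 1 0];
  r3 = [1; 0];                          % assumed evidence: x_3 in its first state
  r0 = Pi1' * Pi2' * Pi3' * r3;         % = [0.44; 0.47]
  post = (x0 .* r0) / sum(x0 .* r0)     % p(x_0 | x_3), approx [0.384; 0.616]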

  19. Example - inferring lunch
 • Initial probability: $p(x_0 = t) = p(x_0 = b) = 0.5$
 • Stationary transition matrix $\Pi = \begin{pmatrix} 0.9 & 0.2 \\ 0.1 & 0.8 \end{pmatrix}$
 • On the fifth day observed at Tazza d'oro: $p(x_5 = t) = 1$
 • Distribution on day 3: left messages to 3, right messages to 3, renormalize

  20. Example - inferring lunch (Octave session, transition matrix as on slide 19)
  > Pi = [0.9 0.2; 0.1 0.8];
  > l1 = [0.5; 0.5];
  > l3 = Pi * Pi * l1              % l3 = [0.58500; 0.41500]
  > r5 = [1; 0];
  > r3 = Pi' * Pi' * r5            % r3 = [0.83000; 0.34000]
  > (l3 .* r3) / sum(l3 .* r3)     % ans = [0.77483; 0.22517]

  21. Message Passing (chain $x_0\ x_1\ x_2\ x_3\ x_4\ x_5$): $l_i = \Pi_i\, l_{i-1}$, $r_i = \Pi_i^\top r_{i+1}$
 • Send forward messages starting from the left node: $m_{i-1\to i}(x_i) = \sum_{x_{i-1}} m_{i-2\to i-1}(x_{i-1})\, f(x_{i-1}, x_i)$
 • Send backward messages starting from the right node: $m_{i+1\to i}(x_i) = \sum_{x_{i+1}} m_{i+2\to i+1}(x_{i+1})\, f(x_i, x_{i+1})$
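A compact Octave sketch of both passes for a chain with pairwise factors stored as matrices, F{i}(a, b) = f(x_{i-1} = a, x_i = b); the function name and calling convention are illustrative, not from the deck:
  % Forward and backward messages on a chain of n nodes with pairwise factors F{1..n-1}.
  function [l, r] = chain_messages(F, prior)
    n = numel(F) + 1;
    l = cell(n, 1);  r = cell(n, 1);
    l{1} = prior;                        % message into the first node
    for i = 2:n
      l{i} = F{i-1}' * l{i-1};           % sum over x_{i-1}
    end
    r{n} = ones(size(F{n-1}, 2), 1);     % uninformative message at the right end
    for i = n-1:-1:1
      r{i} = F{i} * r{i+1};              % sum over x_{i+1}
    end
  end
  % e.g. with the slide-15 matrices: [l, r] = chain_messages({Pi1', Pi2', Pi3'}, x0);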

  22. Higher Order Markov Chains
 • First order chain ($x_0 \to x_1 \to x_2 \to x_3$): $p(X) = p(x_0) \prod_i p(x_{i+1} \mid x_i)$
 • Second order: $p(X) = p(x_0, x_1) \prod_i p(x_{i+1} \mid x_i, x_{i-1})$

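To make the second-order factorization concrete, a small Octave sketch that evaluates p(X) for a binary second-order chain; the tables P0 and T and the sequence are made up for illustration:
  % P0(a, b) = p(x_0 = a, x_1 = b);  T(a, b, c) = p(x_{i+1} = c | x_{i-1} = a, x_i = b).
  P0 = [0.3 0.2; 0.1 0.4];               % illustrative initial pair distribution
  T  = cat(3, [0.7 0.4; 0.5 0.2], ...    % slice c = 1
              [0.3 0.6; 0.5 0.8]);       % slice c = 2 (slices sum to one elementwise)
  xs = [1 2 2 1 2];                      % an illustrative state sequence x_0 ... x_4
  p  = P0(xs(1), xs(2));                 % p(x_0, x_1)
  for i = 2:numel(xs)-1
    p = p * T(xs(i-1), xs(i), xs(i+1));  % times p(x_{i+1} | x_i, x_{i-1})
  end
  p                                      % joint probability of the sequence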

  24. Trees (nodes $x_0, \dots, x_8$) • Forward/backward messages as normal for a chain • When we have more edges for a vertex use ...

  25. Trees (chain $x_0 - x_1 - x_2$; from $x_2$, one branch $x_3 - x_4 - x_5$ and one branch $x_6 - x_7 - x_8$)
 $l_1(x_1) = \sum_{x_0} p(x_0)\, p(x_1 \mid x_0)$    $r_7(x_7) = \sum_{x_8} p(x_8 \mid x_7)$
 $l_2(x_2) = \sum_{x_1} l_1(x_1)\, p(x_2 \mid x_1)$    $r_6(x_6) = \sum_{x_7} r_7(x_7)\, p(x_7 \mid x_6)$
 $r_2(x_2) = \sum_{x_6} r_6(x_6)\, p(x_6 \mid x_2)$
 $l_3(x_3) = \sum_{x_2} l_2(x_2)\, p(x_3 \mid x_2)\, r_2(x_2)$
 ...


  32. Junction Template
 • Order of computation
 • Direction of the dependence does not matter (only matters for parametrization)
 • Example (node 2 with incoming messages from 1 and 4, outgoing to 3): $m_{2\to3}(x_3) = \sum_{x_2} m_{1\to2}(x_2)\, m_{4\to2}(x_2)\, f(x_2, x_3)$

  33. Trees (chain $x_0 - x_1 - x_2$; from $x_2$, one branch $x_3 - x_4 - x_5$ and one branch $x_6 - x_7 - x_8$)
 • Forward/backward messages as normal for a chain
 • When we have more edges for a vertex use:
 $m_{2\to3}(x_3) = \sum_{x_2} m_{1\to2}(x_2)\, m_{6\to2}(x_2)\, f(x_2, x_3)$
 $m_{2\to6}(x_6) = \sum_{x_2} m_{1\to2}(x_2)\, m_{3\to2}(x_2)\, f(x_2, x_6)$
 $m_{2\to1}(x_1) = \sum_{x_2} m_{3\to2}(x_2)\, m_{6\to2}(x_2)\, f(x_1, x_2)$
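A minimal Octave sketch of these three updates for binary states, assuming the pairwise factors are stored as matrices (F12(a, b) = f(x_1 = a, x_2 = b), and so on) and the incoming messages have already been computed; all numbers are illustrative:
  F12 = [0.9 0.1; 0.2 0.8];    % f(x_1, x_2)
  F23 = [0.6 0.4; 0.3 0.7];    % f(x_2, x_3)
  F26 = [0.5 0.5; 0.1 0.9];    % f(x_2, x_6)
  m12 = [0.7; 0.3];            % incoming m_{1->2}(x_2)
  m32 = [0.4; 0.6];            % incoming m_{3->2}(x_2)
  m62 = [0.55; 0.45];          % incoming m_{6->2}(x_2)
  m23 = F23' * (m12 .* m62);   % m_{2->3}(x_3): multiply incoming messages, sum out x_2
  m26 = F26' * (m12 .* m32);   % m_{2->6}(x_6)
  m21 = F12  * (m32 .* m62);   % m_{2->1}(x_1): x_2 is the second argument of f(x_1, x_2)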


  45. Summary • Markov chains • Present only depends on recent past • Higher order - longer history • Dynamic programming • Exponential if brute force • Linear in chain if we iterate • For junctions treat like chains but integrate signals from all sources • Exponential in the history size

  46. Hidden Markov Models

  47. Clustering and Hidden Markov Models (two graphs over states $x_1, \dots, x_m$ and observations $y_1, \dots, y_m$, plate $i = 1..m$)
 • Clustering - no dependence between observations
 • Hidden Markov Model - dependence between states
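Written out, the two factorizations the graphs encode ($x_i$ latent states, $y_i$ observations):
 $p_{\text{clustering}}(X, Y) = \prod_{i=1}^{m} p(x_i)\, p(y_i \mid x_i)$
 $p_{\text{HMM}}(X, Y) = p(x_1) \prod_{i=1}^{m-1} p(x_{i+1} \mid x_i)\ \prod_{i=1}^{m} p(y_i \mid x_i)$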

  48. Applications (HMM with states $x_1, \dots, x_m$ and observations $y_1, \dots, y_m$) • Speech recognition (sound | text) • Optical character recognition (writing | text) • Gene finding (DNA sequence | genes) • Activity recognition (accelerometer | activity)
