Monotonicity
• monotonicity: Let K = (A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A. We say K is monotonic if for all a, b, c ∈ A:
  (a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c)
  (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)
• optimal substructure in dynamic programming
  [figure: item A is built from subitems B and C; if B's value improves from b to b′ ≤ b, then A's value improves from b ⊗ c to b′ ⊗ c ≤ b ⊗ c]
• idempotent ⇒ monotone (from distributivity)
  • (a ⊕ b) ⊗ c = (a ⊗ c) ⊕ (b ⊗ c); if a ≤ b, then a = a ⊕ b, so a ⊗ c = (a ⊗ c) ⊕ (b ⊗ c)
  • by def. of the comparison, a ⊗ c ≤ b ⊗ c
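Spelled out, the idempotent ⇒ monotone argument is a one-line chain (a sketch, assuming the natural order of an idempotent semiring, a ≤ b iff a ⊕ b = a, which is the comparison the slide refers to):

```latex
a \le b \;\Longleftrightarrow\; a \oplus b = a
\;\Longrightarrow\;
a \otimes c = (a \oplus b) \otimes c = (a \otimes c) \oplus (b \otimes c)
\quad\text{(distributivity)}
\;\Longrightarrow\;
a \otimes c \le b \otimes c .
```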
DP on Graphs
• optimization problems on graphs ⇒ the generic shortest-path problem
• weighted directed graph G = (V, E) with a function w that assigns each edge a weight from a semiring
• compute the best weight of the target vertex t
• generic update along edge (u, v):  d(v) ⊕= d(u) ⊗ w(u, v), i.e., d(v) ← d(v) ⊕ (d(u) ⊗ w(u, v))
  [figure: edge (u, v) with weight w(u, v)]
• how to avoid cyclic updates? only update when d(u) is fixed
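For reference, here is the generic update d(v) ⊕= d(u) ⊗ w(u, v) written as code (a minimal sketch, not from the slides; the `Semiring` container and the tropical instance are illustrative choices):

```python
# Generic edge relaxation over a pluggable semiring (illustrative sketch).
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # ⊕: summarize alternative derivations
    times: Callable[[Any, Any], Any]  # ⊗: extend a derivation along an edge
    zero: Any                         # identity of ⊕ (initial d value)
    one: Any                          # identity of ⊗ (d value of the source)

# tropical semiring (R ∪ {+∞}, min, +, +∞, 0): shortest paths
TROPICAL = Semiring(min, lambda a, b: a + b, float("inf"), 0.0)

def relax(d, u, v, w_uv, K):
    """d(v) ⊕= d(u) ⊗ w(u, v)"""
    d[v] = K.plus(d[v], K.times(d[u], w_uv))
```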
Two Dimensional Survey
                                          traversing order
  search space                    topological (acyclic)    best-first (superior)
  graphs with semirings           Viterbi                  Dijkstra
    (e.g., FSMs)
  hypergraphs with weight         Generalized Viterbi      Knuth
    functions (e.g., CFGs)
Viterbi Algorithm for DAGs
1. topological sort
2. visit each vertex v in sorted order and do updates
  • for each incoming edge (u, v) in E, use d(u) to update d(v):  d(v) ⊕= d(u) ⊗ w(u, v)
  • key observation: d(u) is already fixed to its optimal value at this time
  [figure: edge (u, v) with weight w(u, v)]
• time complexity: O(V + E)
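A compact sketch of this algorithm (my own rendering, not the slides' code; the graph is assumed to be given as incoming-edge lists `pred[v] = [(u, w_uv), ...]`, and `plus`/`times`/`zero`/`one` are the semiring operations):

```python
from collections import deque

def viterbi_dag(vertices, pred, source, plus, times, zero, one):
    # build successor lists and in-degrees for a Kahn-style topological sort
    succ = {v: [] for v in vertices}
    indeg = {v: len(pred.get(v, [])) for v in vertices}
    for v in vertices:
        for u, _ in pred.get(v, []):
            succ[u].append(v)
    queue = deque(v for v in vertices if indeg[v] == 0)
    d = {v: zero for v in vertices}
    d[source] = one
    while queue:
        v = queue.popleft()
        # every predecessor of v has already been popped, so d(u) is fixed
        for u, w_uv in pred.get(v, []):
            d[v] = plus(d[v], times(d[u], w_uv))   # d(v) ⊕= d(u) ⊗ w(u, v)
        for x in succ[v]:
            indeg[x] -= 1
            if indeg[x] == 0:
                queue.append(x)
    return d

# e.g., shortest path with the tropical semiring:
# viterbi_dag(["s", "a", "t"], {"a": [("s", 1)], "t": [("a", 2), ("s", 5)]},
#             "s", min, lambda a, b: a + b, float("inf"), 0)   ->  d["t"] == 3
```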
Variant 1: forward-update
1. topological sort
2. visit each vertex v in sorted order and do updates
  • for each outgoing edge (v, u) in E, use d(v) to update d(u):  d(u) ⊕= d(v) ⊗ w(v, u)
  • key observation: d(v) is already fixed to its optimal value at this time
  [figure: edge (v, u) with weight w(v, u)]
• time complexity: O(V + E)
Examples
• [Number of Paths in a DAG]
  • just use the counting semiring (N, +, ×, 0, 1)
  • note: this is not an optimization problem!
• [Longest Path in a DAG]
  • just use the semiring (R ∪ {−∞}, max, +, −∞, 0)
• [Part-of-Speech Tagging with a Hidden Markov Model]
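To make the "just swap the semiring" point concrete, here is a toy sketch (the three-vertex DAG and its weights are hypothetical; the HMM tagging example would plug in the Viterbi semiring in the same way):

```python
# Same DAG, same recurrence, two different semirings (toy example).
edges_into = {"a": [("s", 1)], "t": [("a", 2), ("s", 5)]}   # paths s→a→t and s→t
topo = ["s", "a", "t"]                                       # topological order

def solve(plus, times, zero, one, edge_value):
    d = {"s": one, "a": zero, "t": zero}
    for v in topo[1:]:
        for u, w in edges_into[v]:
            d[v] = plus(d[v], times(d[u], edge_value(w)))
    return d["t"]

# counting semiring (N, +, ×, 0, 1): every edge contributes 1 => number of paths
print(solve(lambda a, b: a + b, lambda a, b: a * b, 0, 1, lambda w: 1))    # 2
# longest-path semiring (R ∪ {−∞}, max, +, −∞, 0)
print(solve(max, lambda a, b: a + b, float("-inf"), 0, lambda w: w))       # 5
```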
Example: Speech Alignment
• time complexity: O(n²)
• also used in: edit distance, biological sequence alignment
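Since edit distance is named here as the same O(n²) dynamic program, a minimal sketch of that instance (standard Levenshtein recurrence; unit insert/delete/substitute costs are an assumption):

```python
# Edit distance as a DAG shortest path: vertex (i, j) = "first i chars of x
# aligned with first j chars of y"; edges are insert / delete / substitute.
def edit_distance(x, y):
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete x[i-1]
                          d[i][j - 1] + 1,          # insert y[j-1]
                          d[i - 1][j - 1] + sub)    # match / substitute
    return d[n][m]
```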
Example: Word Alignment
• key difference: reorderings in translation!
• sequence/speech alignment is always monotonic
• complexity under an HMM alignment model:
  • word alignment is O(n³): for every (i, j), enumerate all (i−1, k)
  • sequence alignment is O(n²)
[figure: alignment grid between “I love you .” and “Je t’aime .”]
Chinese Word Segmentation
下 雨 天 地 面 积 水
xia yu tian di mian ji shui
• 民主  min-zhu  (people-dominate)  “democracy”
• 江泽民 主席  jiang-ze-min zhu-xi  (...-...-people dominate-podium)  “President Jiang Zemin”
• (this was 5 years ago; now Google is good at segmentation!)
• segmentation as graph search
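A sketch of the graph-search view (my own illustration: vertices are the gaps between characters, and there is an edge (i, j) for every dictionary word chars[i:j]; the `lexicon` set and `score` function are hypothetical):

```python
def segment(chars, lexicon, score):
    """Best segmentation of chars into lexicon words (higher score = better);
    assumes at least one segmentation exists."""
    n = len(chars)
    best = [float("-inf")] * (n + 1)   # best[j]: best score of segmenting chars[:j]
    back = [0] * (n + 1)               # backpointer: start position of the last word
    best[0] = 0.0
    for j in range(1, n + 1):          # left-to-right = topological order
        for i in range(j):
            word = chars[i:j]
            if word in lexicon and best[i] + score(word) > best[j]:
                best[j] = best[i] + score(word)
                back[j] = i
    words, j = [], n
    while j > 0:                       # follow backpointers to recover the words
        words.append(chars[back[j]:j])
        j = back[j]
    return list(reversed(words))

# e.g., segment("下雨天地面积水",
#               {"下雨", "下雨天", "天", "地", "地面", "面积", "积水", "水"},
#               lambda w: len(w))      # toy score favoring longer words
```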
Phrase-based Decoding
与 沙龙 举行 了 会谈
yu Shalong juxing le huitan
[figure: phrase pairs “held a talk with Sharon”, “Sharon held talks with”, “with Sharon”, “held a talk”, and partial hypotheses marked with coverage vectors such as _ _ ● ● ●]
• source-side: coverage vector
• target-side: grow hypotheses strictly left-to-right
• space: O(2^n), time: O(2^n n²) — cf. the traveling salesman problem
Traveling Salesman Problem & MT
• a classical NP-hard problem
• goal: visit each city once and only once
• exponential-time dynamic programming
  • state: cities visited so far (bit-vector)
  • search in this O(2^n) transformed graph
• MT: each city is a source-language word
• restrictions in reordering can reduce complexity
  • ⇒ distortion limit
  • ⇒ syntax-based MT
(Held and Karp, 1962; Knight, 1999)
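A sketch of the bit-vector dynamic program (the classical Held–Karp recurrence for a TSP tour; `dist` is a hypothetical distance matrix). The decoder's state is analogous: a coverage bit-vector plus the most recent target words instead of "set of cities plus last city".

```python
def held_karp(dist):
    """Shortest tour visiting all cities exactly once; O(2^n · n^2) time."""
    n = len(dist)
    INF = float("inf")
    # d[S][j]: best cost of a path that starts at city 0, visits exactly the
    # cities in bit-set S (which must contain 0 and j), and ends at city j
    d = [[INF] * n for _ in range(1 << n)]
    d[1][0] = 0.0
    for S in range(1 << n):
        if not (S & 1):
            continue                      # every state includes the start city 0
        for j in range(n):
            if d[S][j] == INF:
                continue
            for k in range(n):
                if (S >> k) & 1:
                    continue              # k already visited
                T = S | (1 << k)
                if d[S][j] + dist[j][k] < d[T][k]:
                    d[T][k] = d[S][j] + dist[j][k]
    full = (1 << n) - 1
    return min(d[full][j] + dist[j][0] for j in range(1, n))
```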
Adding a Bigram Model
• “refined” graph: states annotated with language-model words
• still dynamic programming, just a larger search space
[figure: hypothesis graph whose states pair a coverage vector with the last English words, e.g., (_ _ ● ● ●, ... Shalong), (_ _ ● ● ●, ... Sharon), (_ _ ● ● ●, ... talks), (● ● ● ● ●, ... meeting), (● ● ● ● ●, ... talk); extending a hypothesis with “with Sharon” adds a bigram score]
• space: O(2^n) ⇒ O(2^n V^{m−1}); time: O(2^n n²) ⇒ O(2^n V^{m−1} n²) for m-gram language models
Two Dimensional Survey (roadmap, revisited)
• same table as above; turning to best-first (superior) traversal of graphs with semirings: Dijkstra
Dijkstra Algorithm
• Dijkstra does not require acyclicity
• instead of topological order, we use best-first order
• but this requires superiority of the semiring:
  Let K = (A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A. We say K is superior if for all a, b ∈ A: a ≤ a ⊗ b and b ≤ a ⊗ b.
• intuition: combination always gets worse
• contrast with monotonicity: combination preserves order
• examples: ({0, 1}, ∨, ∧, 0, 1), ([0, 1], max, ×, 0, 1), (R⁺ ∪ {+∞}, min, +, +∞, 0); note that (R ∪ {+∞}, min, +, +∞, 0) is not superior once negative weights are allowed, which is why Dijkstra cannot handle negative edges
[figure: extending d(u) along edge e gives d(u) ⊗ w(e)]
Dijkstra Algorithm (continued)
• keep a cut (S : V − S) where the S vertices are fixed
• maintain a priority queue Q of the V − S vertices
• each iteration: choose the best vertex v from Q
• move v to S, and use d(v) to forward-update the others:  d(u) ⊕= d(v) ⊗ w(v, u)
[figure: source s, fixed set S, frontier vertex v with outgoing edge w(v, u) to u in V − S]
• time complexity: O((V + E) log V) with a binary heap, O(V log V + E) with a Fibonacci heap
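The same loop as code (a sketch in the concrete tropical case, i.e., textbook Dijkstra with non-negative weights; `succ[u]` lists outgoing edges `(v, w_uv)`):

```python
import heapq

def dijkstra(succ, source):
    d = {source: 0.0}
    fixed = set()                   # S: vertices whose d() value is final
    queue = [(0.0, source)]         # priority queue over V − S
    while queue:
        du, u = heapq.heappop(queue)
        if u in fixed:
            continue                # stale entry left over from an earlier update
        fixed.add(u)                # move u across the cut
        for v, w in succ.get(u, []):
            cand = du + w           # d(u) ⊗ w(u, v) in the tropical semiring
            if cand < d.get(v, float("inf")):
                d[v] = cand         # d(v) ⊕= d(u) ⊗ w(u, v)
                heapq.heappush(queue, (cand, v))
    return d
```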
Viterbi vs. Dijkstra
• structural vs. algebraic constraints
• Dijkstra is only applicable to (monotonic) optimization problems
[figure: Venn diagram — within monotonic optimization problems, the acyclic cases are handled by Viterbi and the superior cases by Dijkstra; many NLP problems fall in the intersection; forward–backward (the Inside semiring) is not an optimization problem, non-probabilistic models need not be superior, and cyclic FSMs/grammars are not acyclic]
What if both fail?
[figure: the same Venn diagram — the case that is neither acyclic nor superior]
• generalized Bellman–Ford (CLR, 1990; Mohri, 2002)
• or, first decompose into strongly connected components (SCCs), which gives a DAG; use Viterbi globally on this SCC-DAG, and Bellman–Ford locally within each SCC
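For the cyclic, non-superior case, here is a sketch of plain (tropical) Bellman–Ford — relax every edge repeatedly until nothing changes, at most |V| − 1 rounds when there is no negative cycle; the generalized and SCC-based variants cited on the slide refine this idea:

```python
def bellman_ford(vertices, edges, source):
    """edges: iterable of (u, v, w); allows negative weights but no negative cycles."""
    INF = float("inf")
    d = {v: INF for v in vertices}
    d[source] = 0.0
    for _ in range(len(vertices) - 1):
        changed = False
        for u, v, w in edges:
            if d[u] + w < d[v]:            # d(v) ⊕= d(u) ⊗ w(u, v), tropical case
                d[v] = d[u] + w
                changed = True
        if not changed:
            break                          # fixpoint reached early
    return d
```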
What if both work?
[figure: the same Venn diagram — the intersection: both acyclic and superior]
• full Dijkstra is slower than Viterbi: O((V + E) log V) vs. O(V + E)
• but Dijkstra can finish as early as the target vertex is popped: a(V + E) log V vs. V + E, where a is the fraction of the graph explored before the target pops
• Q: how to (magically) reduce a?
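In the Dijkstra sketch above, "finishing early" is just returning as soon as the target leaves the queue — e.g., inserting `if u == target: return du` right after `fixed.add(u)` (with a hypothetical `target` argument). A*, next, is a way to make that happen after fewer pops.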
A* Search: Intuition
• Dijkstra is “blind” to how far away the target is
• it may get “trapped” by obstacles
• can we be more intelligent about the future?
• idea: prioritize by (s-to-v distance) + (v-to-t estimate)
[figure: source s, target t, and explored vertices u, v around an obstacle]
A* Heuristic
[figure: path s → v → t, with d(v) the s-to-v cost, h(v) the true v-to-t cost, and ĥ(v) the estimate]
• h(v): the distance from v to the target t
• ĥ(v) must be an optimistic estimate of h(v): ĥ(v) ≤ h(v)
• Dijkstra is the special case where ĥ(v) = 1̄ (i.e., 0 for distances)
• now, prioritize the queue by d(v) ⊗ ĥ(v)
• we can stop when the target gets popped — why?
  • optimal subpaths pop earlier than non-optimal ones:
    d(v) ⊗ ĥ(v) ≤ d(v) ⊗ h(v) ≤ d(t) ≤ (weight of any non-optimal path to t)
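A sketch of the resulting search (again the tropical case; `h_hat` is assumed optimistic, ĥ(v) ≤ h(v), which is what makes returning at the target pop safe). Setting `h_hat = lambda v: 0.0` recovers the Dijkstra sketch above.

```python
import heapq

def astar(succ, source, target, h_hat):
    INF = float("inf")
    d = {source: 0.0}
    queue = [(h_hat(source), source)]       # prioritize by d(v) + ĥ(v)
    while queue:
        prio, u = heapq.heappop(queue)
        if prio > d.get(u, INF) + h_hat(u):
            continue                        # stale entry: d(u) has since improved
        if u == target:
            return d[u]                     # optimal, because ĥ never overestimates
        for v, w in succ.get(u, []):
            cand = d[u] + w
            if cand < d.get(v, INF):
                d[v] = cand
                heapq.heappush(queue, (cand + h_hat(v), v))
    return INF                              # target unreachable
```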
How to design a heuristic?
• more of an art than a science
• basic idea: projection into a coarser space
• cluster vertices: w′(U, V) = min { w(u, v) | u ∈ U, v ∈ V }
• the exact cost in the coarser graph is an estimate for the finer graph
[figure: a fine graph and its coarse projection with clusters U and V]
(Raphael, 2001)
Viterbi or A*?
• A* intuition: d(t) ⊗ ĥ(t) ranks higher among the d(v) ⊗ ĥ(v) values, so it can finish early if lucky
• actually, d(t) ⊗ ĥ(t) = d(t) ⊗ h(t) = d(t) ⊗ 1̄ = d(t)
• but we pay the price of maintaining a priority queue: O(log V) per operation
• Q: how early? is it worth the price?
• if the target's rank is r, then A* is better when (r/V) log V < 1, i.e., r < V / log V
[figure: the pool of d(v) values (Dijkstra) vs. the pool of d(v) ⊗ ĥ(v) values (A*), with d(t) at rank r out of V]
Two Dimensional Survey (roadmap, revisited)
• same table as above; turning to the second row: hypergraphs with weight functions (e.g., CFGs) — Generalized Viterbi and Knuth
Background: CFG and Parsing
[figure: parsing the input w0 w1 ... wn−1 with a context-free grammar; the goal item is (S, 0, n), i.e., an S spanning the whole sentence]
(Directed) Hypergraphs
• a generalization of graphs: an edge becomes a hyperedge, connecting several tail vertices to one head vertex
• e = (T(e), h(e), f_e), with arity |e| = |T(e)|
• a totally-ordered weight set R; we borrow the ⊕ operator as the comparison
• weight function f_e: R^|e| → R, generalizing the ⊗ operator of semirings
  • simple case: f_e(a, b) = a ⊗ b ⊗ w(e)
• generic update: d(v) ⊕= f_e(d(u1), d(u2))
[figure: a hyperedge e with two tails (u1, u2) and one head v; e.g., tails X over span (i, k) and Z over span (k, j), head Y over span (i, j)]
Hypergraphs and Deduction
• a hyperedge corresponds to a deduction step:
  (B, i, k): a    (C, k, j): b
  ─────────────────────────────  A → B C
  (A, i, j): f_e(a, b) = a × b × Pr(A → B C)
• tails = antecedents, head = consequent
(Nederhof, 2003)
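As a concrete instance of this deduction rule, here is a small Viterbi-CKY sketch for a PCFG in Chomsky normal form (the grammar encoding — `lex_rules[word] = [(A, p)]` and `bin_rules = [(A, B, C, p)]` — is my own illustrative format):

```python
from collections import defaultdict

def viterbi_cky(words, lex_rules, bin_rules, start="S"):
    n = len(words)
    best = defaultdict(float)                  # item (A, i, j) -> best probability
    for i, w in enumerate(words):              # axioms: (A, i, i+1) from A -> w
        for A, p in lex_rules.get(w, []):
            best[A, i, i + 1] = max(best[A, i, i + 1], p)
    for span in range(2, n + 1):               # topological order: small spans first
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for A, B, C, p in bin_rules:   # hyperedge (B,i,k), (C,k,j) -> (A,i,j)
                    a, b = best[B, i, k], best[C, k, j]
                    if a > 0.0 and b > 0.0:
                        best[A, i, j] = max(best[A, i, j], a * b * p)
    return best[start, 0, n]                   # goal item (S, 0, n)
```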
Related Formalisms
[figure: a hypergraph drawn as an AND/OR graph — the head v is an OR-node, the hyperedge e an AND-node, and the tails u1, u2 are OR-nodes]
Packed Forests
• a compact representation of many parses, by sharing common sub-derivations
• a polynomial-space encoding of an exponentially large set
• a packed forest is a hypergraph: nodes and hyperedges
[figure: packed forest over “0 I 1 saw 2 him 3 with 4 a 5 mirror 6”]
(Klein and Manning, 2001; Huang and Chiang, 2005)
Weight Functions and Semirings
[figure: a hyperedge with tails u1, u2, ..., uk and head v, whose value is computed by the weight function f_e(a1, ..., ak)]