Human-Inspired Structured Prediction for Language and Biology

Liang Huang
Principal Scientist, Baidu Research
Assistant Professor, Oregon State University
(incremental & linear-time)


  1. Background: Consecutive vs. Simultaneous Interpretation
     • consecutive interpretation: multiplicative latency (x2)
     • simultaneous interpretation: additive latency (+3 secs)
     simultaneous interpretation is extremely difficult, and is one of the holy grails of AI:
     • only ~3,000 qualified simultaneous interpreters world-wide
     • each interpreter can only sustain for at most 10-30 minutes
     • the best interpreters can only cover ~60% of the source material
     we can't just use standard full-sentence translation (e.g., seq-to-seq); we need fundamentally different ideas!

  2. Our Breakthrough
     • full-sentence (non-simultaneous) translation: latency of one sentence (10+ secs); Baidu World Conference, November 2017, and many other companies
     • simultaneous translation, achieved for the first time (our work): latency ~3 secs; Baidu World Conference, November 2018

  3. Challenge: Word Order Difference
     • e.g., translating from Subj-Obj-Verb (Japanese, German) to Subj-Verb-Obj (English)
     • German is underlyingly SOV, and Chinese is a mix of SVO and SOV
     • human simultaneous interpreters routinely "anticipate" (e.g., predicting the German verb; Grissom et al., 2014)
     target: President Bush meets with Russian President Putin in Moscow
     non-anticipative: President Bush ( …… waiting …… ) meets with Russian …
     anticipative: President Bush meets with Russian President Putin in Moscow

  4. Our Solution: Prefix-to-Prefix
     • standard seq-to-seq is only suitable for conventional full-sentence MT: the target side waits for the whole source sentence, i.e., p(y_i | x_1 … x_n, y_1 … y_{i-1})
     • we propose prefix-to-prefix, tailored to simultaneous MT
     • special case: the wait-k policy, where the translation is always k words behind the source sentence: p(y_i | x_1 … x_{i+k-1}, y_1 … y_{i-1})
     • training in this way enables anticipation
     wait-2 example (Chinese => English):
     source:  布什 总统 在 莫斯科 与 俄罗斯 总统 普京 会晤
     pinyin:  Bùshí zǒngtǒng zài Mòsīkē yǔ Éluósī zǒngtǒng Pǔjīng huìwù
     gloss:   Bush President in Moscow with Russian President Putin meet
     output:  President Bush meets with Russian President Putin in Moscow
     note that the model emits "meets" before seeing the sentence-final Chinese verb 会晤 (huìwù): anticipation learned from prefix-to-prefix training
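
To make the wait-k policy concrete, here is a minimal decoding-loop sketch in Python. The names wait_k_decode and model.predict_next are hypothetical stand-ins for any neural MT model trained prefix-to-prefix; a real system would decode with beam search rather than the greedy choice assumed here.

    # Minimal sketch of wait-k decoding (hypothetical model interface).
    # `source_stream` is an iterator over incoming source words; a real
    # system plugs in a neural MT model trained prefix-to-prefix, i.e.,
    # on p(y_i | x_1 .. x_{i+k-1}, y_1 .. y_{i-1}).

    def wait_k_decode(source_stream, model, k, eos="</s>"):
        """Read k source words first, then alternate write-one / read-one."""
        src, tgt = [], []
        source_done = False
        while True:
            # read until the source prefix is k words ahead of the target
            while not source_done and len(src) < len(tgt) + k:
                word = next(source_stream, None)
                if word is None:
                    source_done = True
                else:
                    src.append(word)
            # write one target word conditioned on the current source prefix
            y = model.predict_next(src, tgt)  # argmax over p(y_i | src, tgt)
            if y == eos:
                return tgt
            tgt.append(y)

For the wait-2 example above, the loop reads 布什 and 总统, writes "President", reads 在, writes "Bush", and so on, staying exactly two source words ahead until the source runs out.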

  5. Research Demo
     (This is just our research demo; our production system is better, with shorter ASR latency.)
     source: 江泽民 对 法国 总统 的 来华 访问 表示 感谢 。
     pinyin: jiāng zémín duì fǎguó zǒngtǒng de láihuá fǎngwèn biǎoshì gǎnxiè
     gloss:  Jiang Zemin to French President 's to-China visit express gratitude
     output: jiang zemin expressed his appreciation for the visit by french president .

  6. Latency-Accuracy Tradeoff [figure]

  7. Deployment Demo
     This is a live recording from the Baidu World Conference on Nov 1, 2018.


  29. Experimental Results (German=>English) German source: doch während man sich im kongress nicht auf ein vorgehen einigen kann , warten mehrere bundesstaaten nicht länger . but while they self in congress not on one action agree can wait several states not longer English translation (simultaneous, wait 3): but , while congress does not agree on a course of action , several states no longer wait . English translation (full-sentence baseline): but , while congressional action can not be agreed , several states are no longer waiting . full-sentence 
 baselines wait 8 words I traveled to Ulm by train full-sentence baseline: CW = 8 wait 2 wait 6 words I traveled to Ulm by train Gu et al. (2017) Gu et al. (2017): CW = (2+6)/2 = 4 wait 4 1 1 to Ulm I took a 1 train 1 our wait 4 model: CW = (4+1+1+1+1)/5 = 1.6 17
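
As a sanity check on the CW arithmetic above, here is a small Python sketch (an illustration, not the paper's evaluation code) that computes CW from a read/write action sequence, where 'R' reads one source word and 'W' writes one target word:

    def consecutive_wait(actions):
        """CW = average length of the consecutive-read runs that precede writes.

        `actions` is a string over {'R', 'W'}, e.g. 'RRRRWRWRWRWRWW'.
        """
        waits, run = [], 0
        for a in actions:
            if a == "R":
                run += 1
            else:  # 'W'
                if run > 0:
                    waits.append(run)
                run = 0
        return sum(waits) / len(waits) if waits else 0.0

    assert consecutive_wait("RRRRRRRR" + "WWWWWW") == 8.0      # full-sentence
    assert consecutive_wait("RRW" + "RRRRRRW") == 4.0          # Gu et al. (2017)
    assert consecutive_wait("RRRRW" + "RW" * 4 + "W") == 1.6   # our wait-4 model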

  9. Summary of Innovations and Impact
     • first simultaneous translation approach with integrated anticipation
     • inspired by human simultaneous interpreters, who routinely anticipate
     • first simultaneous translation approach with arbitrary, controllable latency
     • previous RL-based approaches can encourage, but can't enforce, a latency limit
     • very easy to train and scalable: minor changes to any neural MT codebase
     • prefix-to-prefix is very general; it can be used in other tasks with simultaneity

  10. Next: Integrate Incremental Predictive Parsing
      • how to be smarter about when to wait and when to translate?
      mandatory reordering, i.e., must wait: (Chinese) PP VP => (English) VP PP
        习近平 于 2012 年 在 北京 当选 (Xí Jìnpíng yú 2012 nián zài Běijīng dāngxuǎn)
        gloss: Xi Jinping in 2012 in Beijing elected
        reference: "Xi Jinping was elected in Beijing in 2012"
        ideal simultaneous output: Xi Jinping ….. was elected …
      optional reordering: (Chinese) PP S => (English) PP S or S PP
        关于 克林顿主义 , 没有 准确 的 定义 (guānyú Kèlíndùn zhǔyì , méiyǒu zhǔnquè de dìngyì)
        gloss: about Clintonism , no accurate definition
        reference: "There is no accurate definition of Clintonism."
        ideal simultaneous output: About Clintonism, there is no accurate definition.

  11. Part II: Linear-Time Incremental Parsing
      example: "the man bit the dog"
      constituency parse: (S (NP (DT the) (NN man)) (VP (VB bit) (NP (DT the) (NN dog))))
      dependency parse (head -> dependent): man -> the, bit -> man, bit -> dog, dog -> the
      (Huang & Sagae, ACL 2010*; Goldberg, Zhao & Huang, ACL 2013; Zhao, Cross & Huang, EMNLP 2013; Mi & Huang, ACL 2015; Cross & Huang, ACL 2016; Cross & Huang, EMNLP 2016**; Hong & Huang, ACL 2018)
      * best paper nominee   ** best paper honorable mention
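
To make the constituency-vs-dependency contrast concrete, the two analyses of "the man bit the dog" can be written as plain data structures (an illustrative encoding of the trees above, not code from the talk):

    # Constituency tree as nested tuples: (label, children...).
    constituency = ("S",
                    ("NP", ("DT", "the"), ("NN", "man")),
                    ("VP", ("VB", "bit"),
                           ("NP", ("DT", "the"), ("NN", "dog"))))

    # Dependency tree as a head map: heads[i] is the index of word i's head,
    # with 0 standing for the root.  Words: the(1) man(2) bit(3) the(4) dog(5).
    heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3}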

  12. Motivations for Incremental Parsing
      • simultaneous translation
      • auto-completion (search suggestions)
      • question answering
      • dialog
      • speech recognition
      • input method editors
      • …

  13. Human Parsing vs. Compilers vs. NL Parsing
      • human sentence processing is incremental and linear-time, O(n): "I eat sushi with tuna from Japan"
      • compilers parse programming languages incrementally in O(n): "x = y + 3;" becomes the tree := (id x) (+ (id y) (const 3))
      • standard NL parsing is O(n^3)
      • can we design NL parsing algorithms that are both fast and accurate, inspired by human sentence processing and compilers?
      • our idea: generalize PL parsing (the LR algorithm) to NL parsing, but keep it O(n)
      • challenge: how to deal with the ambiguity explosion in NL?
      • solution: linear-time dynamic programming, which is both fast and accurate!

  14. Solution: Linear-Time, DP, and Accurate!
      • very fast: a linear-time dynamic programming parser
      • explores exponentially many trees (and outputs a forest)
      • accurate: high parsing accuracy on English & Chinese
      [figure: empirical runtime, with competing parsers scaling as O(n^2.5), O(n^2.4), and O(n^2) vs. this work at O(n); number of trees explored: DP exponential vs. non-DP beam search]

  15. Incremental Parsing (Shift-Reduce)
      sentence: I eat sushi with tuna from Japan

      step  action     stack                      queue
      0     -          (empty)                    I eat sushi ...
      1     shift      I                          eat sushi with ...
      2     shift      I eat                      sushi with tuna ...
      3     l-reduce   eat (I attached)           sushi with tuna ...
      4     shift      eat sushi                  with tuna from ...
      5a    r-reduce   eat (sushi attached)       with tuna from ...
      5b    shift      eat sushi with             tuna from Japan ...

      steps 5a vs. 5b illustrate a shift-reduce conflict: after step 4 the parser can either r-reduce or shift.

  16. Greedy Search
      • each state => three new states (shift, l-reduce, r-reduce)
      • greedy search: always pick the best next state
      • "best" is defined by a score learned from data
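
Putting the trace and the greedy rule together, here is a minimal greedy shift-reduce dependency parser in Python. The score function is a hypothetical stand-in for the model learned from data; everything else follows the three-action scheme above.

    # Minimal greedy shift-reduce dependency parser (sketch).
    # `score(action, stack, queue)` stands in for the learned scoring model.

    def greedy_parse(words, score):
        stack, queue = [], list(words)
        arcs = []  # (head, dependent) pairs
        while queue or len(stack) > 1:
            actions = []
            if queue:
                actions.append("shift")
            if len(stack) >= 2:
                actions.append("l-reduce")  # stack[-2] becomes dependent of stack[-1]
                actions.append("r-reduce")  # stack[-1] becomes dependent of stack[-2]
            best = max(actions, key=lambda a: score(a, stack, queue))
            if best == "shift":
                stack.append(queue.pop(0))
            elif best == "l-reduce":
                dep = stack.pop(-2)
                arcs.append((stack[-1], dep))
            else:  # r-reduce
                dep = stack.pop()
                arcs.append((stack[-1], dep))
        return arcs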

  17. Beam Search
      • each state => three new states (shift, l-reduce, r-reduce)
      • beam search: always keep the top-b states
      • still explores just a tiny fraction of the whole search space
      • psycholinguistic evidence for such parallelism (Fodor et al., 1974; Gibson, 1991)
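
A beam-search version of the same loop keeps the top-b states at each step instead of one. A minimal sketch, assuming a hypothetical expand(state) that yields (new_state, action_score) pairs for the legal actions and an is_final(state) test:

    import heapq

    def beam_search(init_state, expand, is_final, b=8):
        beam = [(0.0, init_state)]  # (cumulative score, state)
        while not all(is_final(s) for _, s in beam):
            candidates = []
            for total, state in beam:
                if is_final(state):
                    candidates.append((total, state))  # carry finished states along
                    continue
                for new_state, action_score in expand(state):
                    candidates.append((total + action_score, new_state))
            # keep only the b highest-scoring states (greedy search is b = 1)
            beam = heapq.nlargest(b, candidates, key=lambda c: c[0])
        return max(beam, key=lambda c: c[0])[1]

With b = 1 this reduces exactly to the greedy parser above; larger b explores more of the search space, though still a tiny fraction of it without dynamic programming.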

  18. Dynamic Programming (Huang and Sagae, 2010)
      • each state => three new states (shift, l-reduce, r-reduce)
      • key idea of DP: share common subproblems
      • merge equivalent states => polynomial space
      • each DP state corresponds to exponentially many non-DP states, so DP explores exponentially many trees where non-DP beam search explores only a tiny fraction
      • implemented with a graph-structured stack (Tomita, 1986)

  19. Merging (Ambiguity Packing) (Huang and Sagae, 2010)
      • two states are equivalent if they agree on their features
      • because the same features guarantee the same cost
      • example: if we only care about the last 2 words on the stack, then the stacks "… I sushi" and "I sushi" fall into one equivalence class, while "I eat sushi", "… eat sushi", and "eat sushi" fall into another: two equivalence classes in total
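
Here is a toy sketch of the merging step, under the simplifying assumption from the example above that a state's feature signature is just the last two words on its stack. The real features in Huang and Sagae (2010) are richer, but the packing logic is the same:

    # Toy ambiguity packing: states with equal feature signatures are merged,
    # keeping the best score per signature (backpointers omitted for brevity).

    def signature(stack):
        return tuple(stack[-2:])  # assumed feature set: last 2 words on stack

    def merge(states):
        """states: list of (stack, score); returns the best state per signature."""
        packed = {}
        for stack, score in states:
            sig = signature(stack)
            if sig not in packed or score > packed[sig][1]:
                packed[sig] = (stack, score)
        return list(packed.values())

    states = [(["I", "sushi"], -1.2),
              (["I", "eat", "sushi"], -0.7),
              (["eat", "sushi"], -0.9)]
    print(merge(states))  # two classes: (I, sushi) and (eat, sushi)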
