Human-Inspired Structured Prediction for Language and Biology

Liang Huang
Principal Scientist, Baidu Research
Assistant Professor, Oregon State University
(incremental & linear-time)


  1. Background: Consecutive vs. Simultaneous Interpretation
     • consecutive interpretation: multiplicative latency (x2)
     • simultaneous interpretation: additive latency (+3 secs)
     simultaneous interpretation is extremely difficult, and is one of the holy grails of AI:
     • only ~3,000 qualified simultaneous interpreters world-wide
     • each interpreter can only sustain for at most 10-30 minutes
     • the best interpreters can only cover ~60% of the source material
     we can't just use standard full-sentence translation (e.g., seq-to-seq); we need fundamentally different ideas!

  2. Our Breakthrough
     • full-sentence (non-simultaneous) translation: latency of one sentence (10+ secs); Baidu World Conference, November 2017, and many other companies
     • simultaneous translation, achieved for the first time (our work): latency ~3 secs; Baidu World Conference, November 2018

  3. Challenge: Word Order Difference
     • e.g., translating from Subj-Obj-Verb (Japanese, German) to Subj-Verb-Obj (English)
     • German is underlyingly SOV, and Chinese is a mix of SVO and SOV
     • human simultaneous interpreters routinely "anticipate" (e.g., predicting the German verb; Grissom et al., 2014)
     target: President Bush meets with Russian President Putin in Moscow
     non-anticipative: President Bush ( …… waiting …… ) meets with Russian …
     anticipative: President Bush meets with Russian President Putin in Moscow

  4. Our Solution: Prefix-to-Prefix
     • standard seq-to-seq is only suitable for conventional full-sentence MT: the target side waits for the whole source sentence, i.e., p(y_i | x_1 … x_n, y_1 … y_{i-1})
     • we propose prefix-to-prefix, tailored to simultaneous MT
     • special case: the wait-k policy, where the translation is always k words behind the source sentence: p(y_i | x_1 … x_{i+k-1}, y_1 … y_{i-1})
     • training in this way enables anticipation
     wait-2 example (Chinese => English):
     source:  布什 总统 在 莫斯科 与 俄罗斯 总统 普京 会晤
     pinyin:  Bùshí zǒngtǒng zài Mòsīkē yǔ Éluósī zǒngtǒng Pǔjīng huìwù
     gloss:   Bush President in Moscow with Russian President Putin meet
     output:  President Bush meets with Russian President Putin in Moscow
     note that the model emits "meets" before seeing the sentence-final Chinese verb 会晤 (huìwù): anticipation learned from prefix-to-prefix training
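
To make the wait-k policy concrete, here is a minimal decoding-loop sketch in Python. The names wait_k_decode and model.predict_next are hypothetical stand-ins for any neural MT model trained prefix-to-prefix; a real system would decode with beam search rather than the greedy choice assumed here.

    # Minimal sketch of wait-k decoding (hypothetical model interface).
    # `source_stream` is an iterator over incoming source words; a real
    # system plugs in a neural MT model trained prefix-to-prefix, i.e.,
    # on p(y_i | x_1 .. x_{i+k-1}, y_1 .. y_{i-1}).

    def wait_k_decode(source_stream, model, k, eos="</s>"):
        """Read k source words first, then alternate write-one / read-one."""
        src, tgt = [], []
        source_done = False
        while True:
            # read until the source prefix is k words ahead of the target
            while not source_done and len(src) < len(tgt) + k:
                word = next(source_stream, None)
                if word is None:
                    source_done = True
                else:
                    src.append(word)
            # write one target word conditioned on the current source prefix
            y = model.predict_next(src, tgt)  # argmax over p(y_i | src, tgt)
            if y == eos:
                return tgt
            tgt.append(y)

For the wait-2 example above, the loop reads 布什 and 总统, writes "President", reads 在, writes "Bush", and so on, staying exactly two source words ahead until the source runs out.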

  5. Research Demo
     (This is just our research demo; our production system is better, with shorter ASR latency.)
     source: 江泽民 对 法国 总统 的 来华 访问 表示 感谢 。
     pinyin: jiāng zémín duì fǎguó zǒngtǒng de láihuá fǎngwèn biǎoshì gǎnxiè
     gloss:  Jiang Zemin to French President 's to-China visit express gratitude
     output: jiang zemin expressed his appreciation for the visit by french president .

  6. Latency-Accuracy Tradeoff [figure]

  7. Deployment Demo
     This is a live recording from the Baidu World Conference on Nov 1, 2018.


  29. Experimental Results (German=>English) German source: doch während man sich im kongress nicht auf ein vorgehen einigen kann , warten mehrere bundesstaaten nicht länger . but while they self in congress not on one action agree can wait several states not longer English translation (simultaneous, wait 3): but , while congress does not agree on a course of action , several states no longer wait . English translation (full-sentence baseline): but , while congressional action can not be agreed , several states are no longer waiting . full-sentence 
 baselines wait 8 words I traveled to Ulm by train full-sentence baseline: CW = 8 wait 2 wait 6 words I traveled to Ulm by train Gu et al. (2017) Gu et al. (2017): CW = (2+6)/2 = 4 wait 4 1 1 to Ulm I took a 1 train 1 our wait 4 model: CW = (4+1+1+1+1)/5 = 1.6 17
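
As a sanity check on the CW arithmetic above, here is a small Python sketch (an illustration, not the paper's evaluation code) that computes CW from a read/write action sequence, where 'R' reads one source word and 'W' writes one target word:

    def consecutive_wait(actions):
        """CW = average length of the consecutive-read runs that precede writes.

        `actions` is a string over {'R', 'W'}, e.g. 'RRRRWRWRWRWRWW'.
        """
        waits, run = [], 0
        for a in actions:
            if a == "R":
                run += 1
            else:  # 'W'
                if run > 0:
                    waits.append(run)
                run = 0
        return sum(waits) / len(waits) if waits else 0.0

    assert consecutive_wait("RRRRRRRR" + "WWWWWW") == 8.0      # full-sentence
    assert consecutive_wait("RRW" + "RRRRRRW") == 4.0          # Gu et al. (2017)
    assert consecutive_wait("RRRRW" + "RW" * 4 + "W") == 1.6   # our wait-4 model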

  9. Summary of Innovations and Impact
     • first simultaneous translation approach with integrated anticipation
     • inspired by human simultaneous interpreters, who routinely anticipate
     • first simultaneous translation approach with arbitrary, controllable latency
     • previous RL-based approaches can encourage, but can't enforce, a latency limit
     • very easy to train and scalable: minor changes to any neural MT codebase
     • prefix-to-prefix is very general; it can be used in other tasks with simultaneity

  10. Next: Integrate Incremental Predictive Parsing
      • how to be smarter about when to wait and when to translate?
      mandatory reordering, i.e., must wait: (Chinese) PP VP => (English) VP PP
        习近平 于 2012 年 在 北京 当选 (Xí Jìnpíng yú 2012 nián zài Běijīng dāngxuǎn)
        gloss: Xi Jinping in 2012 in Beijing elected
        reference: "Xi Jinping was elected in Beijing in 2012"
        ideal simultaneous output: Xi Jinping ….. was elected …
      optional reordering: (Chinese) PP S => (English) PP S or S PP
        关于 克林顿主义 , 没有 准确 的 定义 (guānyú Kèlíndùn zhǔyì , méiyǒu zhǔnquè de dìngyì)
        gloss: about Clintonism , no accurate definition
        reference: "There is no accurate definition of Clintonism."
        ideal simultaneous output: About Clintonism, there is no accurate definition.

  11. Part II: Linear-Time Incremental Parsing
      example: "the man bit the dog"
      constituency parse: (S (NP (DT the) (NN man)) (VP (VB bit) (NP (DT the) (NN dog))))
      dependency parse (head -> dependent): man -> the, bit -> man, bit -> dog, dog -> the
      (Huang & Sagae, ACL 2010*; Goldberg, Zhao & Huang, ACL 2013; Zhao, Cross & Huang, EMNLP 2013; Mi & Huang, ACL 2015; Cross & Huang, ACL 2016; Cross & Huang, EMNLP 2016**; Hong & Huang, ACL 2018)
      * best paper nominee   ** best paper honorable mention
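
To make the constituency-vs-dependency contrast concrete, the two analyses of "the man bit the dog" can be written as plain data structures (an illustrative encoding of the trees above, not code from the talk):

    # Constituency tree as nested tuples: (label, children...).
    constituency = ("S",
                    ("NP", ("DT", "the"), ("NN", "man")),
                    ("VP", ("VB", "bit"),
                           ("NP", ("DT", "the"), ("NN", "dog"))))

    # Dependency tree as a head map: heads[i] is the index of word i's head,
    # with 0 standing for the root.  Words: the(1) man(2) bit(3) the(4) dog(5).
    heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3}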

  12. Motivations for Incremental Parsing
      • simultaneous translation
      • auto-completion (search suggestions)
      • question answering
      • dialog
      • speech recognition
      • input method editors
      • …

  13. Human Parsing vs. Compilers vs. NL Parsing
      • human sentence processing is incremental and linear-time, O(n): "I eat sushi with tuna from Japan"
      • compilers parse programming languages incrementally in O(n): "x = y + 3;" becomes the tree := (id x) (+ (id y) (const 3))
      • standard NL parsing is O(n^3)
      • can we design NL parsing algorithms that are both fast and accurate, inspired by human sentence processing and compilers?
      • our idea: generalize PL parsing (the LR algorithm) to NL parsing, but keep it O(n)
      • challenge: how to deal with the ambiguity explosion in NL?
      • solution: linear-time dynamic programming, which is both fast and accurate!

  14. Solution: Linear-Time, DP, and Accurate!
      • very fast: a linear-time dynamic programming parser
      • explores exponentially many trees (and outputs a forest)
      • accurate: high parsing accuracy on English & Chinese
      [figure: empirical runtime, with competing parsers scaling as O(n^2.5), O(n^2.4), and O(n^2) vs. this work at O(n); number of trees explored: DP exponential vs. non-DP beam search]

  15. Incremental Parsing (Shift-Reduce)
      sentence: I eat sushi with tuna from Japan

      step  action     stack                      queue
      0     -          (empty)                    I eat sushi ...
      1     shift      I                          eat sushi with ...
      2     shift      I eat                      sushi with tuna ...
      3     l-reduce   eat (I attached)           sushi with tuna ...
      4     shift      eat sushi                  with tuna from ...
      5a    r-reduce   eat (sushi attached)       with tuna from ...
      5b    shift      eat sushi with             tuna from Japan ...

      steps 5a vs. 5b illustrate a shift-reduce conflict: after step 4 the parser can either r-reduce or shift.

  16. Greedy Search
      • each state => three new states (shift, l-reduce, r-reduce)
      • greedy search: always pick the best next state
      • "best" is defined by a score learned from data
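
Putting the trace and the greedy rule together, here is a minimal greedy shift-reduce dependency parser in Python. The score function is a hypothetical stand-in for the model learned from data; everything else follows the three-action scheme above.

    # Minimal greedy shift-reduce dependency parser (sketch).
    # `score(action, stack, queue)` stands in for the learned scoring model.

    def greedy_parse(words, score):
        stack, queue = [], list(words)
        arcs = []  # (head, dependent) pairs
        while queue or len(stack) > 1:
            actions = []
            if queue:
                actions.append("shift")
            if len(stack) >= 2:
                actions.append("l-reduce")  # stack[-2] becomes dependent of stack[-1]
                actions.append("r-reduce")  # stack[-1] becomes dependent of stack[-2]
            best = max(actions, key=lambda a: score(a, stack, queue))
            if best == "shift":
                stack.append(queue.pop(0))
            elif best == "l-reduce":
                dep = stack.pop(-2)
                arcs.append((stack[-1], dep))
            else:  # r-reduce
                dep = stack.pop()
                arcs.append((stack[-1], dep))
        return arcs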

  17. Beam Search
      • each state => three new states (shift, l-reduce, r-reduce)
      • beam search: always keep the top-b states
      • still explores just a tiny fraction of the whole search space
      • psycholinguistic evidence for such parallelism (Fodor et al., 1974; Gibson, 1991)
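
A beam-search version of the same loop keeps the top-b states at each step instead of one. A minimal sketch, assuming a hypothetical expand(state) that yields (new_state, action_score) pairs for the legal actions and an is_final(state) test:

    import heapq

    def beam_search(init_state, expand, is_final, b=8):
        beam = [(0.0, init_state)]  # (cumulative score, state)
        while not all(is_final(s) for _, s in beam):
            candidates = []
            for total, state in beam:
                if is_final(state):
                    candidates.append((total, state))  # carry finished states along
                    continue
                for new_state, action_score in expand(state):
                    candidates.append((total + action_score, new_state))
            # keep only the b highest-scoring states (greedy search is b = 1)
            beam = heapq.nlargest(b, candidates, key=lambda c: c[0])
        return max(beam, key=lambda c: c[0])[1]

With b = 1 this reduces exactly to the greedy parser above; larger b explores more of the search space, though still a tiny fraction of it without dynamic programming.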

  18. Dynamic Programming (Huang and Sagae, 2010)
      • each state => three new states (shift, l-reduce, r-reduce)
      • key idea of DP: share common subproblems
      • merge equivalent states => polynomial space
      • each DP state corresponds to exponentially many non-DP states, so DP explores exponentially many trees where non-DP beam search explores only a tiny fraction
      • implemented with a graph-structured stack (Tomita, 1986)

  19. Merging (Ambiguity Packing) (Huang and Sagae, 2010)
      • two states are equivalent if they agree on their features
      • because the same features guarantee the same cost
      • example: if we only care about the last 2 words on the stack, then the stacks "… I sushi" and "I sushi" fall into one equivalence class, while "I eat sushi", "… eat sushi", and "eat sushi" fall into another: two equivalence classes in total
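
Here is a toy sketch of the merging step, under the simplifying assumption from the example above that a state's feature signature is just the last two words on its stack. The real features in Huang and Sagae (2010) are richer, but the packing logic is the same:

    # Toy ambiguity packing: states with equal feature signatures are merged,
    # keeping the best score per signature (backpointers omitted for brevity).

    def signature(stack):
        return tuple(stack[-2:])  # assumed feature set: last 2 words on stack

    def merge(states):
        """states: list of (stack, score); returns the best state per signature."""
        packed = {}
        for stack, score in states:
            sig = signature(stack)
            if sig not in packed or score > packed[sig][1]:
                packed[sig] = (stack, score)
        return list(packed.values())

    states = [(["I", "sushi"], -1.2),
              (["I", "eat", "sushi"], -0.7),
              (["eat", "sushi"], -0.9)]
    print(merge(states))  # two classes: (I, sushi) and (eat, sushi)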
