

SLIDE 1

Introduction Discriminative SMT Japanese-to-English Chinese-to-English References

The HDU Discriminative SMT System for Constrained Data PatentMT at NTCIR10

Patrick Simianer, Gesa Stupperich, Laura Jehl, Katharina Wäschle, Artem Sokolov, Stefan Riezler

Institute for Computational Linguistics, Heidelberg University, Germany

SLIDE 2

Outline

1 Introduction
2 Discriminative SMT

  • Online pairwise-ranking optimization
  • Multi-Task learning
  • Feature sets

3 Japanese-to-English system description
4 Chinese-to-English system description
5 Conclusion

SLIDE 3

The HDU discriminative SMT system

Intuition: Patents have a twofold nature. They are . . .

1 easy to translate: repetitive and formulaic text
2 hard to translate: long sentences and unusual jargon

Method: Discriminative SMT

1 Training: multi-task learning with large, sparse feature sets via ℓ1/ℓ2 regularization
2 Syntax features: soft-syntactic constraints for complex word order differences in long sentences

SLIDE 4

Subtasks/results

Participation in Chinese-to-English (ZH-EN) and Japanese-to-English (JP-EN) PatentMT subtasks

  • Constrained data situation where only the parallel corpus provided by the organizers was used
  • Results:

JP-EN Rank 5 (constrained: 2) with regard to BLEU on the Intrinsic Evaluation (IE) test set; IE adequacy 8th, IE acceptability 6th
ZH-EN Rank 9 (constrained: 3) on this subtask's IE test set; IE adequacy 4th, IE acceptability 4th

SLIDE 5

Hierarchical phrase-based translation

(1) X → X1 要件 の X2 | X2 of X1 requirements
(2) X → この とき 、 X1 は | this time , the X1 is
(3) X → テキスト メモリ 41 に X1 | X1 in the text memory 41

  • Synchronous CFG with rules encoding hierarchical phrases (Chiang, 2007; Lopez, 2007)

  • cdec decoder (Dyer et al., 2010)
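The synchronous rules above pair a source side and a target side that share nonterminal slots. A minimal sketch of how the target side of rule (1) is assembled once its slots have been translated (the function name and inputs are illustrative, not part of the cdec implementation):

```python
def apply_rule(target_side, subs):
    """Apply one synchronous rule's target side.

    target_side: list of tokens where 'X1', 'X2' mark nonterminal slots;
    subs: dict mapping each slot name to its already-translated token list.
    Terminals are copied through; slots are expanded in target order.
    """
    out = []
    for tok in target_side:
        out.extend(subs.get(tok, [tok]))
    return out

# Rule (1): X → X1 要件 の X2 | X2 of X1 requirements
result = apply_rule(["X2", "of", "X1", "requirements"],
                    {"X1": ["the", "system"], "X2": ["a", "list"]})
# → "a list of the system requirements"
```

Note how the rule reorders the two slots: X2 precedes X1 on the target side, which is exactly how hierarchical rules capture word-order differences.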

SLIDE 6

Online pairwise-ranking optimization

Ranking by BLEU should agree with the model score of the decoder:

  • g(x1) > g(x2)  (BLEU)
  • f(x1) > f(x2)  (model score)

f(x1) − f(x2) > 0
w · x1 − w · x2 > 0
w · (x1 − x2) > 0

  • This can be reformulated as a binary classification problem
  • For large feature sets we train a pairwise ranking model using algorithms for stochastic gradient descent
  • Gold-standard training data is obtained by calculating per-sentence BLEU scores of translations in k-best lists
  • Simplest case: several runs of the perceptron algorithm over a single development set
  • (Data-)parallelized by sharding (multi-task learning), incorporating ℓ1/ℓ2 regularization
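The simplest case above, a perceptron over misranked pairs, can be sketched as follows. This is a minimal illustration: the function name, pair-sampling scheme, and hyperparameters are assumptions, not the system's actual implementation.

```python
import random

def pairwise_perceptron(kbest_lists, num_features, epochs=5, eta=0.1, seed=0):
    """Train weights so the model ranking agrees with the BLEU ranking.

    kbest_lists: one k-best list per source sentence; each entry is a
    (feature_vector, per_sentence_bleu) pair.
    """
    rng = random.Random(seed)
    w = [0.0] * num_features
    for _ in range(epochs):
        for kbest in kbest_lists:
            # Candidate pairs where BLEU prefers the first translation.
            pairs = [(a, b) for a in kbest for b in kbest if a[1] > b[1]]
            rng.shuffle(pairs)
            for (x1, _), (x2, _) in pairs:
                diff = [u - v for u, v in zip(x1, x2)]
                score = sum(wi * di for wi, di in zip(w, diff))
                if score <= 0:  # misranked: update w toward w·(x1 − x2) > 0
                    w = [wi + eta * di for wi, di in zip(w, diff)]
    return w
```

Each update moves w so that the better translation (by BLEU) also gets the higher model score, which is exactly the binary classification view of w · (x1 − x2) > 0.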

SLIDE 9

Algorithm for Multi-Task Learning

  • Randomly split data into Z shards
  • Run optimization on each shard separately for one iteration
  • Collect and stack the resulting weight vectors
  • Select the top K feature columns with the highest ℓ2 norm over shards (or equivalently, by setting a threshold λ)
  • Average the weights of the selected features k = 1 . . . K over shards:

v[k] = (1/Z) · Σ_{z=1}^{Z} W[z][k]

  • Resend the reduced weight vector v to the shards for a new iteration
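The selection-and-mixing step can be sketched as below. This is an illustrative reconstruction: the dictionary-based weight representation and the function name are assumptions, not the authors' code.

```python
def multi_task_select(W, K):
    """One mixing step of the sharded multi-task learner.

    W: list of Z per-shard weight vectors (feature name -> weight).
    Returns the reduced weight vector v over the top-K features,
    ranked by the l2 norm of each feature column across shards.
    """
    Z = len(W)
    features = set().union(*(w.keys() for w in W))
    # l2 norm of each feature's column in the stacked weight matrix
    norms = {k: sum(w.get(k, 0.0) ** 2 for w in W) ** 0.5 for k in features}
    top = sorted(features, key=lambda k: norms[k], reverse=True)[:K]
    # Average the selected columns: v[k] = (1/Z) * sum_z W[z][k]
    return {k: sum(w.get(k, 0.0) for w in W) / Z for k in top}
```

Features that are weighted consistently across shards survive the ℓ2-norm ranking; shard-specific noise features are discarded, which is the ℓ1/ℓ2 regularization effect.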

SLIDE 10

[Figure: training data is split into shards; after each iteration, features are selected and the shard models are mixed]

SLIDE 11

Feature sets

12 dense features of the default translation model

  • Sparse lexicalized features, defined locally on SCFG rules:
    Rule identifiers: a unique identifier per rule
    Rule n-grams: bigrams on the source and target side of a rule, e.g. of X1, X1 requirements
    Rule shape: 39 patterns identifying the location of sequences of terminal and non-terminal symbols, e.g. NT, term*, NT -- NT, term*, NT, term*

(1) X → X1 要件 の X2 | X2 of X1 requirements

  • Soft-syntactic constraints on the source side: 20 features for matching/non-matching of the 10 most common constituents (Marton and Resnik, 2008)
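A toy extractor for the three sparse feature templates might look like this, using rule (1) from above. The feature-key formats and the shape collapsing are illustrative assumptions; the actual system defines 39 shape patterns.

```python
def rule_features(src, tgt):
    """Sparse features for one SCFG rule, given source/target token lists.

    Nonterminals are written as 'X1', 'X2'; everything else is a terminal.
    """
    feats = {}
    nt = lambda tok: tok in ("X1", "X2")
    # Rule identifier: a unique feature that fires once per distinct rule
    feats["id:" + " ".join(src) + "|" + " ".join(tgt)] = 1.0
    # Rule n-grams: bigrams on the source and target side
    for side, toks in (("src", src), ("tgt", tgt)):
        for a, b in zip(toks, toks[1:]):
            feats["%s_bi:%s_%s" % (side, a, b)] = 1.0
    # Rule shape: collapse runs of terminals to 'term*', keep NT positions
    def shape(toks):
        out = []
        for t in toks:
            sym = "NT" if nt(t) else "term*"
            if not out or out[-1] != sym or sym == "NT":
                out.append(sym)
        return ",".join(out)
    feats["shape:%s--%s" % (shape(src), shape(tgt))] = 1.0
    return feats
```

For rule (1), X → X1 要件 の X2 | X2 of X1 requirements, this yields target bigrams like of_X1 and X1_requirements and the shape NT,term*,NT--NT,term*,NT,term* quoted on the slide.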

SLIDE 12

Marton & Resnik’s soft-syntactic constraints

{ADJP, ADVP, CP, DNP, IP, LCP, NP, PP, QP, VP} × {=, +}

  • These features indicate whether spans in the decoder's parses match (=) or cross (+) constituents in syntactic trees
  • We compute these features on the source side of the data; syntactic trees are pre-computed; lookup is done online
  • In contrast to (Chiang, 2005), these features include the actual phrase labels
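The match/cross check can be sketched as below. Span conventions and names are assumptions; the real system restricts the labels to the 10 most common constituents listed above.

```python
def syntax_features(span, constituents):
    """Marton & Resnik-style soft-syntactic constraint features.

    span: (i, j) source span covered by a decoder rule, token indices [i, j).
    constituents: list of (label, i, j) spans from the source-side parse.
    Emits 'LABEL=' when the span exactly matches a constituent and
    'LABEL+' when it crosses one (overlaps exactly one of its boundaries).
    """
    i, j = span
    feats = {}
    for label, a, b in constituents:
        if (i, j) == (a, b):
            feats[label + "="] = 1.0
        elif (i < a < j < b) or (a < i < b < j):
            feats[label + "+"] = 1.0  # span straddles a constituent boundary
    return feats
```

With tuned negative weights on the '+' features, the decoder is softly discouraged from hypotheses that cut across constituents, without hard syntactic filtering.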

SLIDE 13

JP-EN: System Setup

Training data: three million parallel sentences of NTCIR10, constrained data

Standard SMT pipeline: GIZA word alignments; MeCab for Japanese segmentation; Moses tools for English; lowercased models; 5-gram SRILM language model; grammars with max. two non-terminals

Extensive preprocessing
HDU-1 Multi-task training with sparse rule features, combining all four available development sets
HDU-2 Identical to HDU-1 but training stopped early

SLIDE 14

JP-EN: Preprocessing

  • English tokenization: we slightly extended the non-breaking prefixes list (e.g. including FIG., PAT., . . . )
  • Consistent tokenization (Ma and Matsoukas, 2011)
  • Training data was aligned using regular expressions
  • In test and development data we use the most common variant observed in training data
  • Applied a compound splitter to split Katakana terms (Feng et al., 2011) to decrease the number of OOVs

SLIDE 15

JP-EN: Development tuning

tuning method        dev1   dev2   dev3   dev1,2,3
MERT baseline (avg)  27.85  27.63  27.60  27.76
single dev, dense    27.83  –      –      –
single dev, +sparse  28.84  28.08  28.71  29.03
multi-task, +sparse  –      –      –      28.92

SLIDE 16

ZH-EN: System Setup

Training and development data of NTCIR10 (one million / 2000 parallel sentences), constrained setup
Standard SMT pipeline; segmentation of Chinese with the Stanford Segmenter; no additional preprocessing
HDU-1 Marton & Resnik's soft-syntactic features, 20 additional weights tuned with MERT
HDU-2 System as for JP-EN with sparse rule features, but unregularized training on a single development set

SLIDE 17

Effects of soft-syntactic constraints I

baseline    Another option is coupled to both ends of the fiber . . . , thereby allowing . . .
soft-syntax Another alternative is to couple the ends of the fiber . . . , thereby allowing . . .
reference   A further option is to optically couple both ends 10 of the optical fiber 5 . . . , thus allowing . . .

SLIDE 18

Effects of soft-syntactic constraints II

S X 。 . S X 更 高 higher S X 的 光 出 the light

  • ut-

put X 1 device 1 照明 illu- mi- na- tion S X , 由 此 允 there- by al- low- ing S X 不同 LED differ- ent LED S X 4 的

  • f

the 4 X 固 光源 solid- state light source S X X 耦合 到 coup- led to 10 10 X 的 两 端 both ends

  • f

the X 光 5 fiber 5 将 S X 另 一 是 Another

  • p-

tion is 16 / 20

SLIDE 19

Effects of soft-syntactic constraints III

[Figure: Hiero derivation tree of the soft-syntax translation from the previous slide, with source (Chinese) and target words at the leaves]

SLIDE 20

The HDU discriminative SMT system: Conclusion

  • We achieved solid results for both subtasks, with good automatic and manual evaluation scores
  • Training a model with sparse features is a very good approach for patent translation, with improvements of about 1 BLEU point from simply enabling them
  • Multi-task learning enables the use of more training data; newer experiments even point to further possibilities of improvement with this technique
  • Soft-syntactic constraints show the desired effect, incorporating proper syntax into Hiero models and leading to better translations (and prettier derivations!)

SLIDE 21

References I

Adam Lopez. Hierarchical phrase-based translation with suffix arrays. Technical report, University of Maryland, College Park, 2007.

David Chiang. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 2005.

David Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2), June 2007.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of ACL 2010, 2010.

SLIDE 22

References II

Minwei Feng, Christoph Schmidt, Joern Wuebker, Stephan Peitz, Markus Freitag, and Hermann Ney. The RWTH Aachen system for NTCIR-9 PatentMT, 2011.

Jeff Ma and Spyros Matsoukas. BBN's systems for the Chinese-English sub-task of NTCIR-9 PatentMT evaluation, 2011.

Yuval Marton and Philip Resnik. Soft syntactic constraints for hierarchical phrase-based translation. In Proceedings of ACL 2008, 2008.
