  1. Neural Machine Translation In Sogou, Inc. Feifei Zhai and Dongxu Yang

  2. Sogou Company. Strong R&D capabilities: 2,100 employees, of which 76% are technology staff, the highest proportion in China's Internet industry; 38% of employees hold graduate or doctoral degrees. No. 2 Chinese Internet company in terms of user base: PC MAU of 520MM and mobile MAU of 560MM, covering 96% of the Internet users in China. Robust revenue growth: revenue CAGR of 126% from 2011 to 2015; in 2015, revenue reached $592 million with a profit of $110 million.

  3. Rich Product Line. Sogou search, including Web Search and 24 vertical search products. UGC platform: Sogou Wenwen, Sogou Encyclopedia, Sogou Guidance. Sogou exclusive: WeChat search, Zhihu search, English search.

  4. Outline: 1. Neural Machine Translation; 2. Related application scenarios

  5. Machine Translation: automatically translate a sentence in the source language into the target language, e.g. 布什 与 沙龙 举行 了 会谈 → Bush held talks with Sharon. Methods: rule-based machine translation (RBMT), example-based machine translation (EBMT), statistical machine translation (SMT), …

  6. Neural Machine Translation – A New Era. Model the direct mapping between source and target language with a neural network: 布什 与 沙龙 举行 了 会谈 → Bush held talks with Sharon. Really amazing translation quality. [Chart: Edinburgh's WMT results over the years – BLEU scores of phrase-based SMT, syntax-based SMT, and neural MT from 2013 to 2016, with neural MT giving the best score in 2016. From Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]

  7. Neural Machine Translation – A New Era. Encoder-Decoder framework: the encoder represents the source sentence as a vector with a neural network, and the decoder generates target words one by one based on the vector from the encoder (布什 与 沙龙 举行 了 会谈 </s> → Bush held talks with Sharon </s>). What do we actually have in the encoded vector? (Sutskever et al., 2014)
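
To make the encoder-decoder idea concrete, here is a minimal PyTorch-style sketch (my own illustration; the layer sizes are borrowed from the GRU example on slide 11, not from Sogou's production system):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=80_000, tgt_vocab=80_000, emb=620, hidden=1000):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: compress the whole source sentence into a single vector (final state).
        _, enc_state = self.encoder(self.src_emb(src_ids))
        # Decoder: generate target words one by one, conditioned on that vector.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), enc_state)
        return self.out(dec_out)  # logits over the target vocabulary at each position
```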

  8. Neural Machine Translation – A New Era. Attention mechanism: for each target word to be generated, dynamically compute the source-language information related to it as a weighted average over the encoder hidden states of the source sentence 布什 与 沙龙 举行 了 会谈 </s> while generating "Bush held talks …".
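
A small sketch of the weighted-average computation behind attention; additive (Bahdanau-style) scoring is assumed here, since the slides do not say which scoring function Sogou uses:

```python
import torch
import torch.nn.functional as F

def attention_context(dec_state, enc_states, W_q, W_k, v):
    """dec_state: (hidden,); enc_states: (src_len, hidden); returns a (hidden,) context."""
    # Score each source position against the current decoder state.
    scores = torch.tanh(enc_states @ W_k.T + W_q @ dec_state) @ v   # (src_len,)
    weights = F.softmax(scores, dim=0)        # how much each source word matters right now
    return weights @ enc_states               # weighted average of the encoder states

# Shape check with random tensors (hidden=1000, attention dim and length are arbitrary).
H, A, L = 1000, 512, 7
ctx = attention_context(torch.randn(H), torch.randn(L, H),
                        torch.randn(A, H), torch.randn(A, H), torch.randn(A))
```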

  9. Sogou Neural Machine Translation Engine. A pure neural-based commercial machine translation engine: stacked encoders and decoders, residual networks, length normalization, dual learning, zero-shot learning, domain adaptation, … [Diagram: stacked encoder and decoder hidden states with an attention mechanism and softmax output layer, translating 布什 与 沙龙 举行 了 会谈 into "Bush held talks with Sharon".]

  10. Sogou Neural Machine Translation Engine. We keep optimizing our translation engine in terms of the translation model, bilingual data mining, and distributed training and decoding. We focus on Chinese-English and English-Chinese translation now, with good performance in both directions. [Charts: human evaluation scores for Chinese-English and English-Chinese translation, initial vs. current performance, showing a clear improvement for Sogou.]

  11. Challenges in Real Application. Training is too slow (Sutskever et al., 2014; Wu et al., 2016). Decoding is also slow: we need less than 200ms per translation request on average to meet the real-time standard. Take a one-layer GRU NMT system as an example (vocabulary size 80,000; word embedding 620; hidden state 1,000). Encoder (bidirectional): ~16M MACs per word (forward only), i.e. 2*3*2000*1000 + 2*3*620*1000. Decoder: ~70M MACs per word (forward only); for training, 3*3620*1000 + 3*2000*1000 + 80000*620; for beam-search inference, the decoder computation is BeamSize times larger! We need fast training and decoding.
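
The per-word MAC figures above can be checked directly by evaluating the expressions from the slide; a tiny script (the beam size of 10 is my own illustrative choice, the slide only says "BeamSize times larger"):

```python
# Evaluating the per-word MAC expressions given on this slide (forward pass only).
encoder_macs = 2 * 3 * 2000 * 1000 + 2 * 3 * 620 * 1000          # bidirectional encoder
decoder_macs = 3 * 3620 * 1000 + 3 * 2000 * 1000 + 80000 * 620   # decoder + output layer
print(f"encoder: {encoder_macs / 1e6:.1f}M MACs per word")        # ~15.7M (slide: ~16M)
print(f"decoder: {decoder_macs / 1e6:.1f}M MACs per word")        # ~66.5M (slide: ~70M)

beam_size = 10  # illustrative value
print(f"decoder with beam search: ~{beam_size * decoder_macs / 1e6:.0f}M MACs per word")
```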

  12. Distributed Training. Parameter server: keeps the current model parameters, receives gradients from workers, and updates the parameters accordingly. Workers: make use of GPUs for model training and communicate with the parameter server to update parameters.

  13. Distributed Training. Asynchronous: each worker sends its locally updated parameters to the parameter server; the parameter server averages the parameters from the worker with its own version and returns the updated parameters to the worker. Synchronous: each worker sends its gradients to the parameter server; the parameter server updates the parameters only after it has received the gradients from all workers. A toy sketch of both schemes follows.
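
A toy, single-process sketch of the two update schemes described on slides 12 and 13 (illustrative only; a real parameter server runs workers as separate processes or machines, and the averaging weights here are assumptions):

```python
import numpy as np

class ParameterServer:
    def __init__(self, params, lr=0.1):
        self.params, self.lr = params, lr

    def sync_update(self, worker_grads):
        # Synchronous: wait for gradients from all workers, then apply them at once.
        self.params -= self.lr * np.mean(worker_grads, axis=0)

    def async_update(self, worker_params):
        # Asynchronous: average one worker's locally updated parameters with our own,
        # then the result is returned to that worker.
        self.params = 0.5 * (self.params + worker_params)
        return self.params

ps = ParameterServer(np.zeros(4))
ps.sync_update([np.ones(4), 2 * np.ones(4)])   # gradients from two workers
ps.async_update(np.full(4, -1.0))              # parameters pushed by one worker
```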

  14. Distributed Training. Acceleration ratio. Asynchronous: around 3x acceleration with 10 GPU cards. Synchronous: acceleration ratio vs. number of GPUs (with the same batch size * number of GPUs):

     Number of GPUs | Acceleration ratio | Acceleration efficiency
     1              | 1                  | 1
     4              | 3.904              | 0.976
     8              | 7.408              | 0.926
     16             | 13.232             | 0.827

  15. Training acceleration. Acceleration on a single card. Corpus shuffle: global random shuffle, then local sort – sort by sentence length inside each window of 20 mini-batches, so that within each mini-batch sentence lengths are similar (a bucketing sketch follows below). Optimization function selection: Adadelta, Momentum, Adam – Adam is about 2 times faster than the two above.
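
A minimal sketch of the global-shuffle-plus-local-sort bucketing described above; the window of 20 mini-batches follows the slide, the batch size of 80 is taken from slide 20, and the rest is my own illustration:

```python
import random

def local_sort_batches(sentences, batch_size=80, window=20):
    random.shuffle(sentences)                      # global random shuffle
    chunk = batch_size * window                    # "inside each 20 mini-batches"
    batches = []
    for start in range(0, len(sentences), chunk):
        block = sorted(sentences[start:start + chunk], key=len)  # local sort by length
        batches += [block[i:i + batch_size]
                    for i in range(0, len(block), batch_size)]
    return batches   # sentences inside each mini-batch now have similar lengths
```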

  16. Training acceleration. Acceleration on a single card: use a better GPU or newer CUDA if possible ☺ [Chart: batch time (s) and speed-up (x) across GPU/CUDA configurations, with speed-ups of roughly 1x, 1.33x, 1.59x, 1.97x, and 2.26x.]

  17. Decoding acceleration. Compute acceleration – fusion of computations: fuse element-wise operations together; fuse matrix multiplications into larger ones (also fuse parameter matrices ahead of time); fuse the input embedding projections together instead of projecting at each step (see the sketch below). CUDA function selection: for batchsize=1, use level-2 cuBLAS functions instead of level-3.
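
A small illustration of fusing matrix multiplications into larger ones: the three GRU gate weight matrices are concatenated once, ahead of time, so a single large product (a matrix-vector product when batchsize=1) replaces three small ones. Sizes reuse the slide-11 example; this is a sketch of the idea, not Sogou's actual kernels:

```python
import numpy as np

hidden, emb = 1000, 620
W_r, W_z, W_h = (np.random.randn(hidden, emb) for _ in range(3))  # GRU gate weights
W_fused = np.concatenate([W_r, W_z, W_h], axis=0)                 # fused once, offline

x = np.random.randn(emb)               # one word embedding (batchsize = 1)
gates = W_fused @ x                    # a single matrix-vector product (level-2 BLAS)
r_in, z_in, h_in = np.split(gates, 3)  # recover the three gate pre-activations
```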

  18. Decoding acceleration. Batch processing is about 3x faster than single-sentence decoding, so use batch mode if possible. Sentence reordering: sentence lengths may vary greatly. Encoder: reorder sentences by length and scale the batch size at each step. Decoder: rearrange the beams at each step and also scale the batch size according to the beams that are still alive. A rough sketch of the encoder-side bookkeeping follows.
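
A rough sketch of the length-reordering bookkeeping on the encoder side; encode_step is a hypothetical placeholder, and a real implementation would make one batched GPU call per time step rather than looping in Python:

```python
def encode_batched(sentences, encode_step, init_state=None):
    # Sort longest-first so short sentences simply drop out of the batch over time.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True)
    batch = [sentences[i] for i in order]
    states = [init_state] * len(batch)
    for t in range(len(batch[0])):                 # time steps of the longest sentence
        active = [i for i, s in enumerate(batch) if t < len(s)]   # effective batch shrinks
        for i in active:                           # in practice: one batched GPU call
            states[i] = encode_step(batch[i][t], states[i])
    # Undo the sort so results line up with the caller's original order.
    return [states[order.index(i)] for i in range(len(sentences))]
```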

  19. Decoding acceleration. Other acceleration methods: use a better GPU or newer CUDA if possible ☺ [Chart: batch time (s) and speed-up (x) across GPU/CUDA configurations, with speed-ups of roughly 1x, 1.35x, 1.81x, 2.21x, and 2.67x.]

  20. Comparison with training: P40 vs. P100.

                          P40        P100
     TFLOPS               12T        9.3T
     Memory bandwidth     346 GB/s   732 GB/s

     Batch size: for training, 80 or more, so computation dominates; for inference, 10 or less, so memory bandwidth also plays an important role.

  21. Outline: 1. Neural Machine Translation; 2. Related application scenarios

  22. Sogou translate related products: translation box in search results, translation vertical channel, and translation with OCR.

  23. Sogou translate related products. Oversea search: [Diagram: a Chinese query is machine-translated into an English query; the English search results are then machine-translated back into Chinese abstracts and Chinese webpages.]

