 
Neural Machine Translation in Sogou, Inc.
Feifei Zhai and Dongxu Yang
Sogou Company
• Strong R&D capabilities
– 2,100 employees, of which 76% are technology staff, the highest in China’s Internet industry
– 38% of employees hold graduate or doctoral degrees
• No. 2 Chinese Internet company in terms of user base
– PC MAU 520MM, mobile MAU 560MM, covering 96% of the Internet users in China
• Robust revenue growth
– Revenue CAGR of 126% from 2011 to 2015
– In 2015, revenue reached $592 million and profit $110 million
Rich Product Line
• Sogou search: Web Search and 24 Vertical Search Products
• UGC Platform: Sogou Wenwen, Sogou Encyclopedia, Sogou Guidance
• Sogou Exclusive: WeChat search, Zhihu search, English search
Outline
1. Neural Machine Translation
2. Related application scenarios
Machine Translation
• Automatically translate a sentence in the source language into the target language
– Source: 布什 与 沙龙 举行 了 会谈
– Target: Bush held talks with Sharon
• Methods
– Rule-based machine translation (RBMT)
– Example-based machine translation (EBMT)
– Statistical machine translation (SMT)
– …
Neural Machine Translation – A New Era
• Model the direct mapping between source and target language with a neural network
– Source: 布什 与 沙龙 举行 了 会谈
– Target: Bush held talks with Sharon
• Really amazing translation quality
[Figure: Edinburgh’s WMT Results Over the Years, 2013 to 2016, comparing phrase-based SMT, syntax-based SMT, and neural MT; plotted scores range from 19.4 to 24.7]
From (Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf)
Neural Machine Translation – A New Era
• Encoder-Decoder Framework (Sutskever et al., 2014)
– Encoder: represents the source sentence as a vector with a neural network
– Decoder: generates target words one by one based on the vector from the Encoder
– Source: 布什 与 沙龙 举行 了 会谈 <\s>
– Target: Bush held talks with Sharon <\s>
• What do we actually have in the encoded vector?
Neural Machine Translation – A New Era
• Attention Mechanism
– For each target word to be generated, dynamically calculate the source language information related to it
– Source: 布什 与 沙龙 举行 了 会谈 <\s>
– A weighted average over the source feeds each target word: Bush held talks …
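
The weighted average on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the engine's implementation: the dot-product scoring function, the name `attention_context`, and the toy dimensions are all assumptions, since the slide does not say how the relevance of each source word is scored.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one target step.

    decoder_state:  shape (d,), the current decoder hidden state.
    encoder_states: shape (n, d), one hidden state per source word.
    Dot-product scoring is an assumption made for illustration.
    """
    scores = encoder_states @ decoder_state          # (n,) relevance of each source word
    scores = scores - scores.max()                   # stabilise the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()  # (n,) non-negative, sums to 1
    context = weights @ encoder_states               # (d,) weighted average of source states
    return weights, context

rng = np.random.default_rng(0)
enc = rng.normal(size=(7, 4))   # toy input: 6 source words plus the end-of-sentence token
dec = rng.normal(size=4)
weights, context = attention_context(dec, enc)
```

The context vector is recomputed for every target word, so each generated word can draw on a different part of the source sentence.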
Sogou Neural Machine Translation Engine
• A pure neural-based commercial machine translation engine
– Stacked encoders and decoders
– Dual Learning
– Residual network
– Zero-shot Learning
– Length normalization
– Domain adaptation
– …
[Figure: engine architecture; the source 布什 与 沙龙 举行 了 会谈 passes through encoder hidden states and the attention mechanism to a softmax that outputs "Bush held talks with Sharon"]
Sogou Neural Machine Translation Engine
• Keep optimizing our translation engine on translation model, bilingual data mining, distributed training and decoding
• Focus on Chinese-English and English-Chinese translation now
• Good performance on Chinese-English and English-Chinese translation
[Figures: Human Evaluation on Chinese-English Translation; Human Evaluation on English-Chinese Translation. Each panel compares initial performance with current performance, with Sogou labeled]
Challenges in Real Application
• Training is too slow! (Sutskever et al., 2014) (Wu et al., 2016)
• Decoding is slow
– Real-time standard: less than 200 ms per translation request on average
• Take a one-layer GRU NMT system as an example
– Vocabulary size: 80000; word embedding: 620; hidden state: 1000
– Encoder (bidirectional): ~16M MACs per word (forward only)
  2*3*2000*1000 + 2*3*620*1000
– Decoder: ~70M MACs per word (forward only)
  For training: 3*3620*1000 + 3*2000*1000 + 80000*620
  For beam search inference: decoder computation is BeamSize times larger!
• We need fast training and decoding
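
The MAC counts on this slide can be checked with plain arithmetic. The sketch below copies the slide's expressions verbatim; the beam size of 10 is an assumed value for illustration only. The expressions evaluate to 15.72M and 66.46M, which the slide rounds to ~16M and ~70M.

```python
VOCAB = 80_000   # vocabulary size
EMB = 620        # word embedding size
HID = 1_000      # hidden state size

# Encoder (bidirectional), forward pass only, per source word.
encoder_macs = 2 * 3 * 2000 * HID + 2 * 3 * EMB * HID

# Decoder, forward pass only, per target word.
# The last term is the slide's 80000*620 term over the vocabulary.
decoder_macs = 3 * 3620 * HID + 3 * 2000 * HID + VOCAB * EMB

def beam_decoder_macs(beam_size):
    """Decoder cost per step when beam search keeps beam_size hypotheses."""
    return beam_size * decoder_macs

print(f"encoder: {encoder_macs / 1e6:.2f}M MACs per word")
print(f"decoder: {decoder_macs / 1e6:.2f}M MACs per word")
# Beam size 10 is an assumption, not a figure from the slide.
print(f"decoder, beam 10: {beam_decoder_macs(10) / 1e6:.1f}M MACs per step")
```

The vocabulary term accounts for roughly three quarters of the decoder cost, which is why the decoder is several times more expensive per word than the encoder.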
Distributed Training
• Parameter server
– Keeps the current model parameters
– Receives gradients from workers and updates parameters accordingly
• Workers
– Use GPUs for model training
– Communicate with the parameter server to update parameters
Distributed Training
• Asynchronous
– Each worker sends its locally updated parameters to the parameter server
– The parameter server averages the worker's parameters with its own version
– The parameter server returns the updated parameters to the worker
• Synchronous
– Each worker sends its gradients to the parameter server
– The parameter server updates the parameters after it receives the gradients from all workers
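
The two update rules can be contrasted with a small single-process simulation. This is a sketch under stated assumptions rather than the production system: the `ParameterServer` class, plain SGD with a fixed learning rate, averaging the workers' gradients, and the equal-weight 0.5 average in the asynchronous case are all hypothetical choices; the slide only says that gradients are collected (synchronous) or parameters are averaged (asynchronous).

```python
import numpy as np

class ParameterServer:
    """Toy parameter server; no networking, no real GPUs."""

    def __init__(self, params, lr=0.1):
        self.params = np.asarray(params, dtype=float)
        self.lr = lr  # assumed fixed learning rate

    def sync_update(self, worker_grads):
        """Synchronous: wait for every worker's gradient, then update once."""
        mean_grad = np.mean(worker_grads, axis=0)      # assumed: average the gradients
        self.params = self.params - self.lr * mean_grad
        return self.params.copy()                      # sent back to all workers

    def async_update(self, worker_params):
        """Asynchronous: average one worker's parameters with the server copy."""
        self.params = 0.5 * (self.params + worker_params)  # assumed equal weights
        return self.params.copy()                          # sent back to that worker

# Synchronous step with two workers.
server = ParameterServer(np.zeros(3))
grads = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0])]
sync_params = server.sync_update(grads)

# Asynchronous step: one worker reports whenever it is ready.
server2 = ParameterServer(np.zeros(3))
async_params = server2.async_update(np.array([1.0, 1.0, 1.0]))
```

In the synchronous variant every worker sees the same parameters after each step, at the cost of waiting for the slowest worker; the asynchronous variant never waits, but workers train on slightly different parameter versions.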
Distributed Training
• Acceleration ratio
– Asynchronous
  • around 3x acceleration with 10 GPU cards
– Synchronous
  • Acceleration ratio vs. number of GPUs (same batchsize * number of GPU)

Number of GPUs | Acceleration ratio | Acceleration efficiency
1              | 1                  | 1
4              | 3.904              | 0.976
8              | 7.408              | 0.926
16             | 13.232             | 0.827
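
The efficiency figures for synchronous training are consistent with dividing the acceleration ratio by the number of GPUs. The short check below uses only the values reported on the slide; the variable names are illustrative.

```python
# Acceleration ratio per GPU count, as reported for synchronous training.
speedup = {1: 1.0, 4: 3.904, 8: 7.408, 16: 13.232}

# Efficiency = acceleration ratio / number of GPUs.
efficiency = {n: round(ratio / n, 3) for n, ratio in speedup.items()}

for n, eff in efficiency.items():
    print(f"{n:>2} GPUs: ratio {speedup[n]:.3f}, efficiency {eff:.3f}")
```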
Training acceleration
• Acceleration on single card
– Corpus shuffle
  • Global random shuffle
  • Local sort: sort by sentence length inside each group of 20 mini-batches, so that sentence lengths within a mini-batch are similar
– Optimization function selection
  • Adadelta
  • Momentum
  • Adam: about 2 times faster than the above
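
The shuffle-then-local-sort idea can be sketched as follows. The function name `local_sort_batches`, its parameters, and the toy corpus are assumptions for illustration; only the "global shuffle, then sort by length inside each group of mini-batches" scheme comes from the slide (which uses groups of 20).

```python
import random

def local_sort_batches(sentences, batch_size, window=20, seed=0):
    """Shuffle globally, then sort by length inside each window of mini-batches."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)                       # global random shuffle
    chunk = batch_size * window             # sentences covered by `window` mini-batches
    batches = []
    for start in range(0, len(data), chunk):
        # Local sort: order only this chunk by sentence length.
        block = sorted(data[start:start + chunk], key=len)
        for i in range(0, len(block), batch_size):
            batches.append(block[i:i + batch_size])
    return batches

# Toy corpus: 100 "sentences" with lengths 1..100.
sentences = [["w"] * n for n in range(1, 101)]
batches = local_sort_batches(sentences, batch_size=5, window=4)
```

Because sentences in the same mini-batch have similar lengths, less computation is wasted on padding, while the global shuffle keeps the composition of each window random.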
Training acceleration
• Acceleration on single card
– Use better GPU or newer CUDA if possible ☺
[Figure: training batch time (s) and speed-up (X) across GPU and CUDA configurations; labeled speed-ups: 1, 1.33, 1.59, 1.97, 2.26]
Decoding acceleration
• Compute acceleration
– Fusion of computations
  • fuse element-wise operations together
  • fuse matrix multiplications into larger ones (also fuse parameter matrices ahead of time)
  • compute the input embedding projections together, instead of at each step
– CUDA function selection
  • for batchsize=1, use level 2 cuBLAS functions instead of level 3
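
The matrix-multiplication fusion can be illustrated with NumPy on the CPU, standing in for the GPU kernels. This is a sketch under assumptions: the three gate matrices `W_z`, `W_r`, `W_h` follow the usual GRU layout, and the names and random data are hypothetical. It shows that concatenating the parameter matrices ahead of time and projecting all time steps at once gives the same numbers as many small products.

```python
import numpy as np

rng = np.random.default_rng(0)
emb, hid, steps = 620, 1000, 5

# Three separate input projections of a GRU (update gate, reset gate, candidate).
W_z, W_r, W_h = (rng.normal(size=(emb, hid)) for _ in range(3))
x_seq = rng.normal(size=(steps, emb))        # embeddings of one source sentence

# Unfused: three small products at every time step.
unfused = [(x @ W_z, x @ W_r, x @ W_h) for x in x_seq]

# Fused: concatenate the parameter matrices ahead of time, then project
# all time steps with a single larger matrix multiplication.
W_fused = np.concatenate([W_z, W_r, W_h], axis=1)    # (emb, 3 * hid)
fused = x_seq @ W_fused                              # (steps, 3 * hid)
z_all, r_all, h_all = np.split(fused, 3, axis=1)     # each (steps, hid)
```

Only the input projections can be batched over time this way, because they do not depend on the previous hidden state; the recurrent part still runs step by step.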
Decoding acceleration
• Batch processing
– about 3x faster than single sentence
  • use batch mode if possible
– Sentence reordering
  • sentence length may vary greatly
  • Encoder: reorder sentences by length; scale batchsize at each step
  • Decoder: rearrange beams at each step; also scale batchsize according to the remaining beams
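
The encoder-side reordering can be sketched as below. The helper names and the longest-first order are assumptions for illustration; the slide only states that sentences are reordered by length and that the batch size is scaled at each step.

```python
def reorder_by_length(sentences):
    """Sort sentences longest first and remember how to undo the sort."""
    order = sorted(range(len(sentences)),
                   key=lambda i: len(sentences[i]), reverse=True)
    return [sentences[i] for i in order], order

def active_batch_sizes(sorted_sentences):
    """Number of sentences still being processed at each encoder step."""
    max_len = len(sorted_sentences[0]) if sorted_sentences else 0
    return [sum(1 for s in sorted_sentences if len(s) > t) for t in range(max_len)]

def restore_order(results, order):
    """Put per-sentence results back into the original request order."""
    restored = [None] * len(results)
    for pos, original_index in enumerate(order):
        restored[original_index] = results[pos]
    return restored

batch = [["a"], ["b", "c", "d"], ["e", "f"]]
sorted_batch, order = reorder_by_length(batch)
sizes = active_batch_sizes(sorted_batch)
```

With the longest sentence first, the sentences still active at any step form a prefix of the batch, so shrinking the batch is a simple slice and no computation is spent on sentences that have already ended.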
Decoding acceleration
• Other acceleration methods
– Use better GPU or newer CUDA if possible ☺
[Figure: decoding batch time (s) and speed-up (X) across GPU and CUDA configurations; labeled speed-ups: 1, 1.35, 1.81, 2.21, 2.67]
Comparison with training
• P40 vs. P100

                 | P40     | P100
TFLOPS           | 12T     | 9.3T
Memory Bandwidth | 346GB/s | 732GB/s

• batchsize
– training: 80 or more
  • computation dominates
– inference: 10 or less
  • memory bandwidth also plays an important role
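
A rough back-of-the-envelope model shows why batch size changes which resource matters. This is a simplified sketch, not a measurement: it uses the peak figures from the table, and it assumes a single 1000 x 1000 weight matrix stored as 32-bit floats, read from memory once per product, with activation traffic and kernel overheads ignored. The function name `bound` is hypothetical.

```python
# Peak figures taken from the slide's table.
GPUS = {
    "P40":  {"flops": 12e12,  "bandwidth": 346e9},
    "P100": {"flops": 9.3e12, "bandwidth": 732e9},
}

def bound(batch_size, gpu, rows=1000, cols=1000, bytes_per_weight=4):
    """Classify one (batch x rows) @ (rows x cols) product as compute- or memory-bound.

    Assumptions: 32-bit weights read once; activations and overheads ignored.
    """
    spec = GPUS[gpu]
    compute_time = 2 * batch_size * rows * cols / spec["flops"]       # multiply + add
    memory_time = rows * cols * bytes_per_weight / spec["bandwidth"]  # read the weights
    return "compute" if compute_time > memory_time else "memory"

for gpu in GPUS:
    print(gpu, "batch 80:", bound(80, gpu), "| batch 10:", bound(10, gpu))
```

Under these assumptions, a batch of 80 is limited by arithmetic throughput on both cards, while a batch of 10 is limited by how fast the weights can be read, which matches the slide's observation about training and inference.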
Outline
1. Neural Machine Translation
2. Related application scenarios
Sogou translate related products
• Translation box in search results
• Translation vertical channel
• Translation with OCR
Sogou translate related products
• Oversea search
– Chinese query → machine translation → English query → English results
– English results → machine translation → Chinese abstract
– English results → machine translation → Chinese webpages