Machine Translation at Booking.com: Journey and Lessons Learned
User Track
May 30, 2017, Prague
Pavel Levin, Nishikant Dhanuka, Maxim Khalilov
Who am I?
• Master in Computer Science (NLP) from IIT Mumbai
• 8 years of work experience in analytics and consulting
• Data Scientist at Booking.com for the last 2 years, Partner Services Department (Scaled Content)
• linkedin.com/in/nishikantdhanuka/

About Booking.com
• World's #1 website for booking hotels and other accommodations
• Founded in 1996 in Amsterdam; part of The Priceline Group since 2005
• 1,200,000+ properties in 220+ countries; 25 million rooms
• Over 1,400,000 room nights are reserved on Booking.com every 24 hours
• Employing more than 15,000 people in 199 offices worldwide
• Website available in 43 languages
Agenda
• Motivation: MT is critical for Booking.com's localization process
• MT Journey and Lessons Learned
  ◦ MT Model & Experiments
  ◦ Evaluation Results: Automatic & Human, Sentence Length Analysis, A/B Tests
  ◦ Interesting Examples
• Conclusion and Future Work
• Q & A
Motivation
Mission: Empower people to book any hotel in the world, while browsing high quality content in their own language.

A large share of daily bookings on Booking.com is made in a language other than English, so it is important to have locally relevant content at scale.

How Locally Relevant?
• Present hotel descriptions in the language of the user
• Allow partners and guests to consume and produce content in their own language
• Customer reviews
• Customer service support

Why At Scale?
• One million+ properties, growing very fast
• Frequent change requests to update the content
• 43 languages and more
• New customer reviews / tickets every second
Currently, hotel descriptions are translated by humans into 43 languages based on visitor demand: roughly 50% translation coverage, which covers roughly 90% of demand (approximate numbers, averaged over several languages).
Example of a lost business opportunity caused by a highly manual and slow process:
• A new hotel in China joins Booking.com, with initial content in English and Chinese only.
• Its profile is visited by a German customer, who sees the description in English.
• Either the customer drops off (lost business) or still makes the booking (success).
• Only if drop-offs happen often is the hotel put into the human translation pipeline: a chicken-and-egg problem.

How do we balance quality, speed and cost effectiveness? Machine translation.
MT Journey and Lessons Learned
Our Journey to discover the awesomeness of NMT

We compared four systems: SMT and NMT trained on general purpose data (General Purpose), and SMT and NMT trained on Booking.com data (In-domain).
• Phase 1: In-domain SMT > General Purpose SMT
• Phase 2: General Purpose NMT > In-domain SMT
• Phase 3: In-domain NMT > General Purpose NMT
Lots of in-domain data to train the MT system

English -> German (10.5 M parallel sentences)
• German: 171 M words, 845 K vocabulary, avg. sentence length 16.3
• English: 174 M words, 583 K vocabulary, avg. sentence length 16.5

English -> French (11.3 M parallel sentences)
• French: 193 M words, 588 K vocabulary, avg. sentence length 17.7
• English: 188 M words, 581 K vocabulary, avg. sentence length 16.7
Our NMT Model Configuration Details

Data Preparation
• Split data: train / validation / test
• Input text unit: word level
• Tokenization: aggressive
• Max sentence length: 50
• Vocabulary size: 50,000

Model (approx. 220 million parameters)
• Model type: seq2seq
• Input embedding dimension: 1,000
• RNN type: LSTM
• Number of hidden layers: 4
• Hidden layer dimension: 1,000
• Attention mechanism: global attention

Training
• Optimization method: stochastic gradient descent
• Initial learning rate: 1
• Decay rate: 0.5
• Decay strategy: decay when the decrease in validation perplexity is <= 0
• Number of epochs: 5 - 13
• Stopping criteria: BLEU + sensitive sentences + constraints
• Dropout: 0.3
• Batch size: 250
• 1 epoch takes approx. 2 days on a single NVIDIA Tesla K80 GPU

Translate
• Beam size: 30
• Unknown word handling: replace with the source word with the highest attention

Evaluate
• Automatic: BLEU, WER
• Human: Adequacy/Fluency (A/F)
• Other: sentence length analysis, A/B tests

The MT pipeline is based on the Harvard implementation of OpenNMT.
1. Data Preparation: Tokenization and Vocabulary

Original:
• EN: The rooms at the Prague Mandarin Oriental feature underfloor heating, and guests can choose from various bed linen and pillows.
• DE: Die Zimmer im Prague Mandarin Oriental bieten eine Fußbodenheizung und eine Auswahl an Bettwäsche und Kissen.

Tokenized:
• EN: The rooms at the Prague Mandarin Oriental feature underfloor heating , and guests can choose from various bed linen and pillows .
• DE: Die Zimmer im Prague Mandarin Oriental bieten eine Fußbodenheizung und eine Auswahl an Bettwäsche und Kissen .

Vocabulary (top ids):
• EN: <blank> 1, <unk> 2, <s> 3, </s> 4, a 5, and 6, the 7, is 8, with 9, in 10
• DE: <blank> 1, <unk> 2, <s> 3, </s> 4, und 5, sie 6, mit 7, einen 8, der 9, ein 10

• Aggressive tokenization only keeps sequences of letters, i.e. it does not allow mixed alphanumeric tokens such as "E65" or "soft-landing".
• Tokenized text is represented as a sequence of vocabulary ids.
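To make this step concrete, here is a minimal Python sketch of word-level tokenization and vocabulary building. The regex is our assumption about what "aggressive" means here (split off punctuation and digit runs so mixed alphanumerics like "E65" break apart); the special tokens and 1-based ids follow the vocabulary shown above, loosely mirroring OpenNMT conventions.

```python
# Sketch only: "aggressive" word-level tokenization plus vocab building.
import re
from collections import Counter

SPECIALS = ["<blank>", "<unk>", "<s>", "</s>"]

def aggressive_tokenize(text):
    # Keep letter runs, digit runs and punctuation as separate tokens,
    # so "E65" becomes "E", "65" and "soft-landing" becomes three tokens.
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]", text)

def build_vocab(sentences, max_size=50_000):
    counts = Counter(t for s in sentences for t in aggressive_tokenize(s))
    words = [w for w, _ in counts.most_common(max_size - len(SPECIALS))]
    return {w: i + 1 for i, w in enumerate(SPECIALS + words)}  # ids start at 1

sents = ["The rooms at the Prague Mandarin Oriental feature underfloor "
         "heating, and guests can choose from various bed linen and pillows."]
vocab = build_vocab(sents)
ids = [vocab.get(t, vocab["<unk>"]) for t in aggressive_tokenize(sents[0])]
print(ids)  # the sentence as a sequence of vocabulary ids
```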
2. Model Architecture: Approx. 220 Million Parameters

• Model type: seq2seq
• Input embedding dimension: 1,000
• RNN type: LSTM
• Number of hidden layers: 4
• Hidden layer dimension: 1,000
• Attention mechanism: global attention

[Figure: encoder-decoder network with attention translating "Includes Wifi ." into "Umfasst wifi ."]
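As a sanity check on the quoted figure, a back-of-the-envelope parameter count from the configuration above lands near 220 million. This is a sketch under stated assumptions (Luong-style global attention, no weight tying, input feeding ignored), not the exact OpenNMT count.

```python
# Rough parameter count for a 4-layer, 1000-dim LSTM seq2seq model
# with 50k source/target vocabularies and global attention.
V, E, H, L = 50_000, 1_000, 1_000, 4  # vocab, embedding, hidden, layers

def lstm_layer(in_dim, hidden):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (in_dim * hidden + hidden * hidden + hidden)

src_emb = V * E                          # source embeddings, ~50M
tgt_emb = V * E                          # target embeddings, ~50M
encoder = sum(lstm_layer(E if l == 0 else H, H) for l in range(L))
decoder = sum(lstm_layer(E if l == 0 else H, H) for l in range(L))
attention = H * H + 2 * H * H            # score matrix + output projection
generator = H * V + V                    # projection to target vocab + bias

total = src_emb + tgt_emb + encoder + decoder + attention + generator
print(f"{total / 1e6:.0f}M parameters")  # ~217M, close to the quoted 220M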
3. Training: 1 Epoch takes approx. 2 days on a single NVIDIA Tesla K80 GPU

• Optimization method: stochastic gradient descent
• Initial learning rate: 1
• Decay rate: 0.5
• Decay strategy: decay when the decrease in validation perplexity is <= 0
• Number of epochs: 5 - 13
• Stopping criteria: BLEU + sensitive sentences + constraints
• Dropout: 0.3
• Batch size: 250

[Charts: over epochs 1-11, model perplexity on the development set decreases while the development BLEU score increases.]

Stopping criteria: sensitive sentence example
• "The neighborhood is very nice and safe"
• vs. "There is a safe installed in this very nice neighborhood"
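The decay strategy in the table fits in a few lines of Python. This is a sketch only: the perplexity values below are made up for illustration, and the real pipeline relies on OpenNMT's built-in schedule rather than this hypothetical helper.

```python
# Sketch of the schedule above: start at lr = 1.0 and halve the rate
# whenever validation perplexity stops improving between epochs.
def update_learning_rate(lr, prev_ppl, curr_ppl, decay_rate=0.5):
    # Decay when the decrease in validation perplexity is <= 0.
    if prev_ppl is not None and prev_ppl - curr_ppl <= 0:
        lr *= decay_rate
    return lr

lr, prev = 1.0, None
for epoch, ppl in enumerate([2.2, 2.0, 1.9, 1.92, 1.85, 1.86], start=1):
    # Perplexities here are invented, just to show the halving behavior.
    lr = update_learning_rate(lr, prev, ppl)
    prev = ppl
    print(f"epoch {epoch}: val ppl {ppl:.2f}, next lr {lr}")
```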
4. Translate: Unknown Word Handling

• Beam size: 30
• Each <unk> in the output is replaced with the source word that received the highest attention.

Good example:
• Source: Offering a restaurant, Hodor Eco-lodge is located in Winterfell.
• Human translation: Das Hodor Eco-Lodge begrüßt Sie in Winterfell mit einem Restaurant.
• Raw output: Das <unk> <unk> in <unk> bietet ein Restaurant.
• Output with <unk> replaced: Das Hodor Eco-lodge in Winterfell bietet ein Restaurant.

Bad example:
• Source: Free access to The Game entertainment Centre
• Human translation: Kostenfreier Zugang zum Unterhaltungszentrum The Game
• Raw output: Kostenfreier Zugang zum <unk>
• Output with <unk> replaced: Kostenfreier Zugang zum Centre
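A minimal sketch of how attention-based <unk> replacement works on the good example above. The attention matrix here is hand-crafted for illustration, not real model output, and the function is a hypothetical stand-in for what OpenNMT does internally.

```python
# Replace each generated <unk> with the source token that received
# the highest attention weight at that decoding step.
def replace_unk(target_tokens, source_tokens, attention):
    out = []
    for t, tok in enumerate(target_tokens):
        if tok == "<unk>":
            best = max(range(len(source_tokens)), key=lambda s: attention[t][s])
            tok = source_tokens[best]
        out.append(tok)
    return out

src = "Offering a restaurant , Hodor Eco-lodge is located in Winterfell .".split()
hyp = "Das <unk> <unk> in <unk> bietet ein Restaurant .".split()

# Toy attention: one row per target step, one column per source token.
attn = [[0.0] * len(src) for _ in hyp]
attn[1][src.index("Hodor")] = 1.0       # first <unk> attends to "Hodor"
attn[2][src.index("Eco-lodge")] = 1.0   # second <unk> attends to "Eco-lodge"
attn[4][src.index("Winterfell")] = 1.0  # third <unk> attends to "Winterfell"

print(" ".join(replace_unk(hyp, src, attn)))
# -> Das Hodor Eco-lodge in Winterfell bietet ein Restaurant .
```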
5. Evaluate: Auto, Human, Length Analysis & A/B Tests

BLEU
• Based on the number of words shared between the MT output and a human reference
• Benefits sequential words
• Penalizes short translations

WER
• A variation of the word-level Levenshtein distance
• Measures the distance by counting insertions, deletions and substitutions

A/F (Adequacy/Fluency) framework
• 3 evaluators per language
• Provided with the original text and MT hypotheses, including the human reference
• Not aware which system produced which hypothesis
• Asked to assess the quality of 150 random sentences from the test corpus
• 4-level scale for both adequacy and fluency

Example:
• Minor mistake: EN "there is a parking area available" -> DE "es ist eine Garage verfügbar"
• Major mistake: EN "there is a parking area available" -> DE "es ist eine Aufbewahrungsstelle verfügbar"
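WER as described on this slide is easy to sketch: a word-level Levenshtein distance, normalized by reference length. The German reference string below is hypothetical, chosen only so the example shows two substitutions.

```python
# Word error rate via dynamic-programming edit distance over tokens.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference / j hypothesis tokens
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

ref = "es ist ein Parkplatz verfügbar"   # hypothetical human reference
hyp = "es ist eine Garage verfügbar"     # the "minor mistake" hypothesis
print(f"WER: {wer(ref, hyp):.2f}")       # 0.40: two substitutions out of five
```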
Evaluation Results 1/5: Our in-domain NMT system significantly outperforms all other MT engines

[Chart: BLEU scores for English->German: In-domain SMT 35, In-domain NMT 46, GP-SMT 28, GP-NMT 31]
[Chart: BLEU scores for English->French: In-domain SMT 36, In-domain NMT 53, GP-SMT 30, GP-NMT 32]

• Both neural systems consistently outperform their statistical counterparts
• In-domain SMT beats General Purpose NMT
• Compared to German, French improved much more from SMT to NMT
Evaluation Results 2/5: Our in-domain NMT system still outperforms all other MT engines

[Chart: adequacy scores for German: our NMT reaches 3.9, close to the human reference at 3.96; both NMT systems lead their SMT counterparts]
[Chart: fluency scores for German: our NMT reaches 3.78 vs. 3.82 for the human reference]

• Both neural systems still consistently outperform their statistical counterparts
• However, General Purpose NMT now beats In-domain SMT
• In particular, the fluency score of our NMT engine is close to human level
Evaluation Results 3/5: The General Purpose NMT system outperforms the others; this conflicts with BLEU

[Chart: adequacy scores for French: GP-NMT at 3.78 vs. the human reference at 3.70; our NMT at 3.67]
[Chart: fluency scores for French: the human reference at 3.75 is well above all systems, which cluster between 3.28 and 3.41]

• Apparently General Purpose NMT even outperforms the human level on adequacy
• Adequacy of both neural engines is almost at human level; fluency is still far off
• Compared to German, A/F scores are relatively lower for French, which conflicts with BLEU