
Large-scale Product Categorization with Deep Models in Rakuten



  1. Large-scale Product Categorization with Deep Models in Rakuten. May 8, 2017. Ali Cevahir / Denis Miller, Rakuten Institute of Technology / Rakuten, Inc. https://rit.rakuten.co.jp / https://global.rakuten.com

  2. About Rakuten. https://global.rakuten.com/corp/about/strength/data.html

  3. Rakuten Group Services: E-Commerce, FinTech, Digital Content, Travel & Reservation, Pro Sports, Others. https://global.rakuten.com/corp/about/business/internet.html

  4. Rakuten Ichiba: an online marketplace connecting shoppers and merchants, with EC branding, consulting, and marketing services. Over 230,000,000 items in 30,000+ categories.

  5. Problem and Solution

  6. Introduction • Problem: given product information, automatically classify it into its correct category. Example: MACPHEE(マカフィー) 切り替え V ネックニット (MACPHEE color-block V-neck knit) > Ladies Fashion > Tops > Knit Sweaters > Long Sleeves > V Neck

  7. Proposed Solutions • 2 different models: Deep Belief Nets; Deep Autoencoders + kNN • 2 different data sources: titles; descriptions • Overall results aggregated • GPU implementation

  8. Proposed Solutions • 2-step classification: first classify to Level-1 categories, then to leaf levels • 81% match with merchants ('others' excluded); merchants are not always correct. Example: MACPHEE(マカフィー) 切り替え V ネックニット > Ladies Fashion > Tops > Knit Sweaters > Long Sleeves > V Neck

  9. CUDeep: A CUDA-based Deep Learning Framework • In-house command-line tool for training DBNs and DAEs • Written in CUDA, using cuBLAS and cuSPARSE

  10. CUDeep: A CUDA-based Deep Learning Framework • Deep Belief Nets (supervised): input features X (~1 million dimensions) are mapped to class probabilities Y′ and compared against labels Y; billions of connections • Deep Autoencoders: X is reconstructed as X′, and the bottleneck code serves as a semantic hash (see the sketch below)
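The DAE branch is the semantic-hashing idea: after training, the low-dimensional bottleneck code is binarized into a compact key for nearest-neighbor lookup. A minimal numpy sketch of that step, assuming a trained stack of sigmoid encoder layers (`encode`, the weights, and the 0.5 threshold are illustrative assumptions, not CUDeep's actual code):

```python
import numpy as np

def encode(x, weights, biases):
    """Feed a feature vector through sigmoid encoder layers
    down to the low-dimensional bottleneck."""
    h = x
    for W, b in zip(weights, biases):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    return h

def semantic_hash(x, weights, biases, threshold=0.5):
    """Binarize the bottleneck activations into a compact binary
    code usable as a key for (approximate) nearest-neighbor search."""
    return (encode(x, weights, biases) > threshold).astype(np.uint8)
```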

  11. CUDeep: A CUDA-based Deep Learning Framework • Selective reconstruction (Dauphin et al., 2011), applied for both layer-wise training and backpropagation (a sketch follows)
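Reconstruction sampling is what makes the ~1M-dimensional output layer affordable: the reconstruction loss is computed only over the nonzero inputs plus a small random sample of the zeros, with the sampled zeros reweighted so the expected gradient matches full reconstruction. A sketch of the index selection under those assumptions (the function name and sampling ratio are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_sample(x, sample_ratio=0.01):
    """Choose which input dimensions to reconstruct for one example:
    every nonzero entry, plus a random sample of the zero entries.
    Returns the chosen indices and per-index loss weights."""
    nonzero = np.flatnonzero(x)
    zeros = np.flatnonzero(x == 0)
    n_sampled = max(1, int(sample_ratio * zeros.size))
    sampled = rng.choice(zeros, size=n_sampled, replace=False)
    idx = np.concatenate([nonzero, sampled])
    # Upweight the sampled zeros to keep the gradient unbiased.
    w = np.ones(idx.size)
    w[nonzero.size:] = 1.0 / sample_ratio
    return idx, w
```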

  12. CUDeep: Some Design Decisions • Keep neural-net weights on GPU: with a 1M-dimensional input layer and 1000 hidden units, W[vis, hid1] alone is 1M × 1000 × 4 bytes = 4 GB. Faster: no need to communicate weights between CPU and GPU. Alternative: store weights in main memory and copy the weights to be updated to the GPU for each minibatch • Sparse input feature vectors are stored in main memory: device memory is limited; disk streaming is possible, but slower

  13. CUDeep: Some Design Decisions • During layer-wise pre-training, do not store the intermediate outputs of hidden layers; redo the feedforward computations instead • Intermediate outputs are dense, so storing them is impractical: for 200 million inputs, 2000-d outputs would take 1.6 TB, 1000-d outputs 800 GB, and even 64-d outputs 51.2 GB, while the sparse inputs themselves (~10 nonzeros per feature vector) take only 8 GB

  14. CUDA-kNN • Vector search engine

  15. CUDA-kNN • Preprocessing: multi-level k-means clustering • 2-step search: (1) closest-cluster search, (2) kNN within the closest cluster (sketch below)
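A minimal sketch of the two-step search with a single clustering level (the slide's multi-level variant applies the same idea recursively); `centroids` and `cluster_ids` are assumed to come from the offline k-means preprocessing:

```python
import numpy as np

def two_step_knn(query, centroids, cluster_ids, database, k=10):
    """Step 1: pick the closest k-means centroid.
    Step 2: run exact kNN only over that cluster's members."""
    c = np.argmin(((centroids - query) ** 2).sum(axis=1))
    members = np.flatnonzero(cluster_ids == c)
    dists = ((database[members] - query) ** 2).sum(axis=1)
    return members[np.argsort(dists)[:k]]
```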

  16. 2-Step Classification • Step 1: 2 DBNs and kNN classify into the 35 Level-1 categories • Step 2: 2 × 35 DBNs and kNN classify into the ~30,000 Level-5 leaf categories • 2 DAE models: the same encoding is used for step 1 and step 2 (see the routing sketch below)
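In terms of routing, this is just a top-level prediction followed by a dispatch to a genre-specific model. A sketch, where `level1_model` and `leaf_models` are hypothetical stand-ins for the trained DBN/kNN ensembles:

```python
def classify_two_step(features, level1_model, leaf_models):
    """Predict one of the 35 Level-1 genres, then hand the same
    features to that genre's own model to pick the leaf category."""
    genre = level1_model.predict(features)
    leaf = leaf_models[genre].predict(features)
    return genre, leaf
```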

  17. Feature Extraction • Features: 0-1 word vectors • Mostly Japanese text • Normalize letters: アイフォン 4 S → アイフォン4s • Clean all HTML tags: <a href> link </a> → link • Regular expressions for: product codes (iPhone-4S → iphone4s); Japanese counters (4枚, do not tokenize); sizes and dimensions (12Cm x 3 Cm → 12cmx3cm). A sketch follows.
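A sketch of these normalization rules in Python (the production regexes are not public; the patterns below are rough approximations of the rules listed above):

```python
import re
import unicodedata

def normalize_title(text: str) -> str:
    # Letter normalization: fold widths and case, e.g. ４Ｓ -> 4s.
    text = unicodedata.normalize("NFKC", text).lower()
    # Strip HTML tags: <a href="...">link</a> -> link.
    text = re.sub(r"<[^>]+>", " ", text)
    # Product codes: iphone-4s -> iphone4s (rough approximation).
    text = re.sub(r"\b(\w+)-(\w+)\b", r"\1\2", text)
    # Sizes and dimensions: 12cm x 3 cm -> 12cmx3cm.
    text = re.sub(r"(\d+)\s*cm\s*x\s*(\d+)\s*cm", r"\1cmx\2cm", text)
    return text
```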

  18. Feature Extraction • Titles: 26M tokens; descriptions: 47M tokens; total dictionary size: 26M • Use only the 1M most frequent tokens: good enough for L1 classification; fewer tokens occur within subcategories, so the L2 dictionaries are smaller (total dictionary size: 800K). A counting sketch follows.
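The pruning itself is plain frequency counting; a minimal sketch, assuming token streams have already been produced by the tokenizer:

```python
from collections import Counter

def build_vocab(token_streams, max_size=1_000_000):
    """Keep only the `max_size` most frequent tokens (1M on the
    slide) out of the full dictionary, mapping token -> feature id."""
    counts = Counter()
    for tokens in token_streams:
        counts.update(tokens)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}
```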

  19. Dataset Properties and Hardware Setup • 280 million (active and inactive) products, from the Rakuten Data Release (https://rit.rakuten.co.jp/opendata.html) • Deduplicated by titles: 280 million → 172 million (merchants may sell the same items) • 28,338 active categories; ~40% of products are assigned to leaf categories named "others" • 90% of randomly selected products used for training • A Linux server with 4 Titan X GPUs, 2 × 12-core Intel CPUs, and 96 GB main memory

  20. Level-1 Genre Prediction Results (Step 1). [Two charts: L1 prediction recall@N (%) against top-N predictions (N = 1..10), one including and one excluding "others" categories; series: Title-DBN, Description-DBN, Title-KNN, Description-KNN, Combined.]

  21. Overall Taxonomy Matching (Step 2). [Two charts: L5 prediction recall@N (%) against top-N predictions (N = 1..10), one including and one excluding "others" categories; series: Title-DBN, Description-DBN, Title-KNN, Description-KNN, Combined, DBNs combined.]

  22. Sample Results: Merchant Correct / Algorithm Incorrect. "Sweet Mother", Isaac Andrews. Merchant category: Books, Magazines & Comics > Western Books > Books For Kids. Predicted category: Books, Magazines & Comics > Western Books > Fiction & Literature

  23. Sample Results: Merchant Incorrect / Algorithm Correct. トヨトミ [KS-67H] 電子点火式対流型石油ストーブ KS67H (Toyotomi [KS-67H] electronic-ignition convection kerosene stove). Merchant category: Flowers, Garden & DIY > DIY & Tools > Others. Predicted category: Consumer Electronics > Seasonal Home Appliances > Heating > Oil Stove > 14+ tatami (wooden) / 19+ tatami (rebar)

  24. Sample Results: Merchant and Algorithm Both Correct. レンタル【RG87】袴 フルセット / 大学生 / 小学生 / 高校生 / 中学生 (Rental [RG87] full hakama set, for university / elementary / high school / junior high school students). Merchant category: Women's Fashion > Japanese > Kimono > Hakama. Predicted category: Women's Fashion > Japanese > Rental

  25. Summary • Large-scale product categorization • A multi-modal deep learning approach • CUDA-based tools: CUDeep, CUDA-kNN • Noisy data, yet high agreement with manual labeling • Engineering challenges: large data; dynamic data (products and categories keep changing); research results are not easy to replicate under these settings

  26. Engineering Work • Architecture • Tuning for different GPU cards • Dealing with a large data set • Improving prediction accuracy • Future work

  27. System Architecture • Designed for high scalability and availability • Supports requests with both single and multiple input items • Based on Docker; nvidia-docker is used for GPU-based components (https://github.com/NVIDIA/nvidia-docker)

  28. Classification data flow diagram

  29. PROBLEMS & SOLUTIONS

  30. GPU memory size difference • Research environment: Titan X, 12,287 MiB • Production environment: Tesla K80, 11,519 MiB • 768 MiB less memory in production

  31. GPU memory size difference • The smaller memory requires a series of experiments to find a new model configuration • Reduce the input layer size, e.g. from 1M to 900K, sacrificing some information • Future work: use the latest GPUs with more memory to recover this information loss

  32. Extra-large data amount • 230 million items • 200 GB of raw data • 260 GB of tokenized items • 200+ GB across 70+ model files • 4 days to prepare training data • More than one week to train the models on a single server with 2 Tesla K80 cards • Extremely large memory usage during training and classification

  33. Extra-large data amount • Issue: file operations and data processing are very time-consuming • Solution: multiprocessing everywhere (a sketch follows) and high-speed storage
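"Multiprocessing everywhere" here means fanning file-level work out across cores. A minimal sketch with Python's standard multiprocessing module; the shard file names and the per-file work are hypothetical:

```python
from multiprocessing import Pool

def process_shard(path):
    """Stand-in for one unit of per-file preprocessing work,
    e.g. tokenizing one shard of the item dump."""
    with open(path, encoding="utf-8") as f:
        return sum(len(line.split()) for line in f)

if __name__ == "__main__":
    shards = [f"items-part-{i:05d}.tsv" for i in range(64)]  # assumed names
    with Pool() as pool:  # one worker per CPU core by default
        results = pool.map(process_shard, shards)
    print(sum(results))
```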

  34. Accuracy worse than in experiments • The research reported 74% accuracy overall and up to 88% in some categories • After first building the models from the latest data, accuracy was only 51% • Further investigation revealed a few significant defects

  35. Shuffling input data • Issue: consecutive training samples are highly correlated, which results in biased gradients and poor convergence • Solution: add a shuffling step to the data-preparation pipeline (sketch below)
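A minimal sketch of such a shuffling step; it assumes one shard fits in memory (at the full 172M-item scale the same idea would have to run per shard, or via a disk-based external shuffle):

```python
import random

def shuffle_shard(src, dst, seed=0):
    """Permute the training examples once during data preparation
    so that minibatches are decorrelated at training time."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(dst, "w", encoding="utf-8") as f:
        f.writelines(lines)
```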

  36. Tuning training parameters • Issue: models trained on the latest data had low accuracy (smaller input layer size; unbalanced item distribution across categories) • Solution: increase the number of backpropagation epochs 2.5× and decrease the bias multiplier 10×

  37. Grouping of categories • Issue: low prediction accuracy for similar categories when they are split across separate models • Solution: group similar categories

  38. Accuracy improvement result • Recovered the expected result: 80% overall accuracy and 98% in popular categories, up from the initial 51% • Cost several months of work

  39. Most successful categories

  40. FUTURE WORK

  41. Next steps • 80% is not enough; the accuracy needs to improve as much as possible • Data analysis • New experiments
