Large-scale Product Categorization with Deep Models in Rakuten. May 8, 2017. Ali Cevahir / Denis Miller. Rakuten Institute of Technology / Rakuten, Inc. https://rit.rakuten.co.jp / https://global.rakuten.com
About Rakuten: https://global.rakuten.com/corp/about/strength/data.html
Rakuten Group Services: E-Commerce, FinTech, Digital Content, Travel & Reservation, Pro Sports, Others. https://global.rakuten.com/corp/about/business/internet.html
Rakuten Ichiba: an online marketplace connecting shoppers and merchants (EC, branding, consulting, marketing), with over 230,000,000 items in 30,000+ categories.
Problem and Solution
Introduction
• Problem: Given product information, automatically classify it into its correct category.
Example: the title "MACPHEE (マカフィー) 切り替え V ネックニット" (a paneled V-neck knit) maps to Ladies Fashion > Tops > Knit Sweaters > Long Sleeves > V Neck.
Proposed Solutions
• 2 different models
  – Deep Belief Nets
  – Deep Autoencoders + kNN
• 2 different data sources
  – Titles
  – Descriptions
• Overall results aggregated
• GPU implementation
Proposed Solutions
• 2-step classification (see the sketch below)
  – First classify into Level-1 categories
  – Then into leaf-level categories
• 81% match with merchant-assigned categories ('others' excluded)
  – Merchants are not always correct
Example: "MACPHEE (マカフィー) 切り替え V ネックニット" → Ladies Fashion > Tops > Knit Sweaters > Long Sleeves > V Neck
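A minimal sketch of this two-step inference in Python; the model objects and method names (`predict_proba`, `leaf_models`) are illustrative stand-ins, not CUDeep's actual interface:

```python
import numpy as np

def two_step_classify(features, l1_model, leaf_models, top_n=5):
    """Step 1: pick a Level-1 category (35 classes).
    Step 2: classify into leaves with the model trained only
    on products of that Level-1 category."""
    l1_probs = l1_model.predict_proba(features)        # shape (35,)
    l1_id = int(np.argmax(l1_probs))
    leaf_probs = leaf_models[l1_id].predict_proba(features)
    top = np.argsort(leaf_probs)[::-1][:top_n]
    return l1_id, [(int(i), float(leaf_probs[i])) for i in top]
```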
CUDeep: A CUDA-based Deep Learning Framework
• In-house command-line tool for training DBNs and DAEs
• Written in CUDA, using cuBLAS and cuSPARSE
CUDeep: A CUDA-based Deep Learning Framework
• Deep Belief Nets (supervised): map input features (~1 million dimensions) to class probabilities; the network has billions of connections
• Deep Autoencoders: reconstruct the input X as X'; the bottleneck code serves as a semantic hash
CUDeep: A CUDA-based Deep Learning Framework
• Selective reconstruction (Dauphin et al., 2011), sketched below
• Applied for both
  – Layer-wise training
  – Backpropagation
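Reconstruction sampling reconstructs all nonzero input units plus a small random fraction of the zero units, reweighting the sampled zeros so the expected gradient matches full reconstruction. A minimal Python sketch under that reading of Dauphin et al. (2011); the function names are mine:

```python
import numpy as np

def reconstruction_sample(x_batch, zero_sample_rate=0.01, rng=np.random):
    """Pick which input units to reconstruct for a sparse 0-1 batch:
    all nonzero units, plus a random fraction of the zero units.
    Returns a mask and importance weights that keep the expected
    gradient equal to that of full reconstruction."""
    nonzero = x_batch > 0
    sampled_zeros = (~nonzero) & (rng.random(x_batch.shape) < zero_sample_rate)
    mask = nonzero | sampled_zeros
    # Re-weight the sampled zeros by 1/rate so their contribution
    # is unbiased despite being subsampled.
    weights = np.where(sampled_zeros, 1.0 / zero_sample_rate, 1.0) * mask
    return mask, weights

# Usage: multiply the per-unit reconstruction error by `weights` and
# skip units where `mask` is False, in both layer-wise pre-training
# and backpropagation.
```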
CUDeep: Some Design Decisions
• Keep neural-net weights on the GPU: W[vis, hid1] alone is 4 GB (1M × 1000 floats)
  – Faster: no need to communicate weights between CPU and GPU
  – Alternative: store weights in main memory and copy the weights to be updated to the GPU for each minibatch
• Sparse input feature vectors are stored in main memory
  – Device memory is limited
  – Disk streaming is possible, but slower
CUDeep: Some Design Decisions
• During layer-wise pre-training, do not store the intermediate outputs of hidden layers; run feedforward computations instead (sketched below)
• Intermediate outputs are dense, so storing them is impractical: for 200 million products, a 2000-d layer would take 1.6 TB, a 1000-d layer 800 GB, and even a 64-d layer 51.2 GB
• The 200 million sparse inputs (~10 nonzeros per feature vector) fit in 8 GB
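A sketch of how these decisions fit together during layer-wise pre-training, with CuPy standing in for CUDeep's CUDA/cuBLAS/cuSPARSE kernels (the library choice and all names are my assumption, not the actual implementation):

```python
import cupy as cp
import cupyx.scipy.sparse as cpsp

def pretrain_layer(W, X_host, batch_size=1024):
    """W (e.g. 1M x 1000 floats, ~4 GB) stays resident on the GPU.
    X_host is a scipy.sparse CSR matrix kept in host memory; only
    one minibatch is copied to the device at a time."""
    n = X_host.shape[0]
    for start in range(0, n, batch_size):
        batch = X_host[start:start + batch_size]   # host-side CSR slice
        x = cpsp.csr_matrix(batch)                 # copy minibatch to GPU
        h = 1.0 / (1.0 + cp.exp(-x.dot(W)))        # dense hidden activations
        # ...contrastive-divergence update of W happens here, on GPU...
        # h is discarded rather than stored: for 200M products even a
        # 64-d layer output would need 51.2 GB, so it is recomputed
        # on demand by repeating the feedforward pass.
```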
CUDA-kNN
• Vector search engine
CUDA-kNN
• Preprocessing: multi-level k-means clustering
• 2-step search (sketched below)
  1. Closest-cluster search
  2. kNN within the closest cluster
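An illustrative single-level version of the cluster-pruned search (CUDA-kNN itself clusters at multiple levels and runs on the GPU; this Python sketch only shows the search logic):

```python
import numpy as np

def build_index(vectors, n_clusters, iters=10, rng=np.random):
    """One level of k-means for cluster-pruned kNN search."""
    centroids = vectors[rng.choice(len(vectors), n_clusters,
                                   replace=False)].astype(float)
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        assign = np.argmin(((vectors[:, None] - centroids) ** 2).sum(-1),
                           axis=1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def search(query, vectors, centroids, assign, k=5):
    # Step 1: find the closest cluster.
    c = np.argmin(((centroids - query) ** 2).sum(-1))
    # Step 2: exact kNN restricted to that cluster's members.
    ids = np.where(assign == c)[0]
    d = ((vectors[ids] - query) ** 2).sum(-1)
    return ids[np.argsort(d)[:k]]
```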
2-Step Classification
• Step 1 (Level 1: 35 categories): 2 DBNs & kNN
• Step 2 (Level 5: ~30,000 categories): 2 × 35 DBNs & kNN
• 2 DAE models
  – The same encoding is used for step 1 and step 2
Feature Extraction
• Features: 0-1 word vectors
• Mostly Japanese text
• Normalize letters: アイフォン 4 S → アイフォン 4s
• Strip all HTML tags: <a href> link </a> → link
• Regular expressions for (see the sketch below):
  – Product codes: iPhone-4S → iphone4s
  – Japanese counters: 4 枚 (do not tokenize)
  – Sizes and dimensions: 12Cm x 3 Cm → 12cmx3cm
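A hedged Python approximation of these normalization rules; the production pipeline's exact regular expressions are not public, so the patterns below are illustrative:

```python
import re

def normalize(text: str) -> str:
    """Illustrative version of the normalization rules above."""
    text = text.lower()                                  # 4 S -> 4 s
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML tags
    # Sizes and dimensions: 12cm x 3 cm -> 12cmx3cm
    text = re.sub(r"(\d+)\s*cm\s*x\s*(\d+)\s*cm", r"\1cmx\2cm", text)
    # Product codes: join letter/number runs split by '-' or spaces.
    text = re.sub(r"\b([a-z]+)[\s\-]+(\d+[a-z]*)\b", r"\1\2", text)
    # Trailing single letters after digits: 4 s -> 4s
    text = re.sub(r"(\d+)\s+([a-z])\b", r"\1\2", text)
    # Japanese counters such as 4枚: keep number and counter together.
    text = re.sub(r"(\d+)\s*(枚|個|本|台)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("iPhone-4S <a href='x'> link </a> 12Cm x 3 Cm"))
# -> "iphone4s link 12cmx3cm"
```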
Feature Extraction
• Title dictionary: 26M tokens; description dictionary: 47M tokens
• Use only the 1M most frequent tokens (see the vocabulary sketch below)
  – Good enough for L1 classification
  – Fewer tokens occur within subcategories for L2 classification (total dictionary size within a subcategory: ~800K)
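A minimal sketch of building the 1M-token vocabulary and the 0-1 feature vectors from it (function names are mine):

```python
from collections import Counter

def build_vocab(token_streams, max_size=1_000_000):
    """Keep only the most frequent tokens; everything else is
    dropped from the 0-1 feature vectors."""
    counts = Counter()
    for tokens in token_streams:       # one token list per product
        counts.update(tokens)
    return {tok: i
            for i, (tok, _) in enumerate(counts.most_common(max_size))}

def to_feature_indices(tokens, vocab):
    """0-1 bag of words: the sparse vector is just the sorted set
    of vocabulary indices present in the text."""
    return sorted({vocab[t] for t in tokens if t in vocab})
```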
Dataset Properties and Hardware Setup
• 280 million (active and inactive) products
  – Rakuten Data Release (https://rit.rakuten.co.jp/opendata.html)
• Deduplicated by title: 280 million → 172 million
  – Multiple merchants may sell the same items
• 28,338 active categories
  – ~40% of products are assigned to leaf categories named "others"
• 90% of products, selected at random, used for training
• A Linux server with 4 Titan X GPUs, 2 × 12-core Intel CPUs, and 96 GB main memory
Level-1 Genre Prediction Results (Step 1)
[Two charts: Recall@N (%) vs. top-N predictions (N = 1-10), one including and one excluding the "others" categories, for Title-DBN, Description-DBN, Title-KNN, Description-KNN, and the combined model]
Overall Taxonomy Matching (Step 2)
[Two charts: L5 Recall@N (%) vs. top-N predictions (N = 1-10), one including and one excluding the "others" categories, for Title-DBN, Description-DBN, Title-KNN, Description-KNN, the combined model, and DBNs combined]
Sample Results: Merchant Correct / Algorithm Incorrect
Product: Sweet Mother - Isaac Andrews
Merchant Category: Books, Magazines & Comics > Western Books > Books For Kids
Predicted Category: Books, Magazines & Comics > Western Books > Fiction & Literature
Sample Results: Merchant Incorrect / Algorithm Correct
Product: トヨトミ [KS-67H] 電子点火式対流型石油ストーブ KS67H (a Toyotomi electronic-ignition convection oil stove)
Merchant Category: Flowers, Garden & DIY > DIY & Tools > Others
Predicted Category: Consumer Electronics > Seasonal Home Appliances > Heating > Oil Stove > 14+ tatami (wooden) / 19+ tatami (rebar)
Sample Results: Merchant and Algorithm Both Correct
Product: レンタル【RG87】袴 フルセット / 大学生 / 小学生 / 高校生 / 中学生 (hakama full-set rental for university, elementary, high-school, and junior-high students)
Merchant Category: Women's Fashion > Japanese > Kimono > Hakama
Predicted Category: Women's Fashion > Japanese > Rental
Summary
• Large-scale product categorization
• A multi-modal deep learning approach
• CUDA-based tools: CUDeep, CUDA-kNN
• Despite noisy data, high agreement with manual labeling
• Engineering challenges
  – Large data
  – Dynamic data: products and categories keep changing
  – Not easy to replicate research results in these settings
Engineering Work
• Architecture
• Tuning for different GPU cards
• Dealing with a large data set
• Improving prediction accuracy
• Future work
System architecture
• Designed for high scalability and availability
• Supports requests with both single and multiple input items
• Based on Docker; nvidia-docker is used for GPU-based components (https://github.com/NVIDIA/nvidia-docker)
Classification data flow diagram
PROBLEMS & SOLUTIONS
GPU memory size difference
• Research environment: Titan X, 12,287 MiB
• Production environment: Tesla K80, 11,519 MiB
• 768 MiB less memory in production
GPU memory size difference
• The smaller memory size required a series of experiments to find a new model configuration
• Reduce the input layer size, e.g. from 1M to 900K (layers: 900K → 1K → 2K → … → N output classes), sacrificing some information; a back-of-the-envelope check follows
• Future work: use the latest GPUs with more memory to recover this information loss
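A back-of-the-envelope check, with my own arithmetic and assumptions about buffer counts, of why trimming the input layer from 1M to 900K roughly absorbs the 768 MiB gap:

```python
BYTES_PER_FLOAT = 4
MIB = 2 ** 20

def weight_mib(visible, hidden):
    """Size of one dense weight matrix in MiB."""
    return visible * hidden * BYTES_PER_FLOAT / MIB

titan = weight_mib(1_000_000, 1000)   # ~3814.7 MiB on Titan X
k80   = weight_mib(  900_000, 1000)   # ~3433.2 MiB on Tesla K80
saved = titan - k80                    # ~381.5 MiB per copy
# Assuming the first-layer weights are held twice (weights plus a
# gradient/update buffer), the saving doubles to ~763 MiB, close to
# the 768 MiB difference between the two cards.
print(titan, k80, 2 * saved)
```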
Extra large data amount
• 200 GB of raw data (~230 million items)
• 260 GB of tokenized items
• 200+ GB of 70+ model files
• 4 days to prepare the training data
• More than one week to train the models on a single server with Tesla K80 cards
• Extremely large memory usage during training and classification
Extra large data amount
• Issue
  – File operations and data processing are very time-consuming
• Solution
  – Multiprocessing everywhere
  – High-speed storage
Accuracy worse than in experiments
• Research reported 74% accuracy, up to 88% in some categories
• After first building the models from the latest data, accuracy was only 51%
• Further investigation revealed a few significant defects
Shuffling input data
• Issue
  – Consecutive samples are highly correlated, which biases the gradient and leads to poor convergence
• Solution
  – Add a shuffling step to the data-preparation pipeline (sketched below)
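Since the data set is far too large to shuffle in memory, a two-pass external shuffle is one common approach; this sketch is my illustration, not necessarily the pipeline's exact method:

```python
import random

def external_shuffle(records, n_buckets=64, seed=0):
    """Two-pass shuffle for data that does not fit in memory:
    scatter records into random buckets, then shuffle each bucket
    locally. In production each bucket would be a temporary file
    rather than an in-memory list."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(n_buckets)]
    for rec in records:                 # pass 1: random scatter
        buckets[rng.randrange(n_buckets)].append(rec)
    for bucket in buckets:              # pass 2: local shuffle
        rng.shuffle(bucket)
        yield from bucket
```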
Tuning training parameters
• Issue
  – Models trained on the latest data had low accuracy
    • Smaller input layer size
    • Unbalanced item distribution across categories
• Solution
  – Increase the number of backpropagation epochs 2.5× and decrease the bias multiplier 10×
Grouping of categories
• Issue
  – Low prediction accuracy for similar categories when they are split across separate models
• Solution
  – Group similar categories together
Accuracy improvement result
• Recovered the expected results: from 51% up to 80-98%
  – 80% overall accuracy
  – 98% in popular categories
• Cost several months of work
Most successful categories
FUTURE WORK
Next steps
• 80% is not enough: need to improve the accuracy as much as possible
  – Data analysis
  – New experiments