  1. Training deep Autoencoders for collaborative filtering Oleksii Kuchaiev & Boris Ginsburg

  2. Motivation: personalized recommendations

  3. Key points (spoiler alert):
     1. Deep autoencoder for collaborative filtering improves generalization.
     2. The right activation function (SELU, ELU, LeakyReLU) enables deep architectures, with no layer-wise pre-training or skip connections.
     3. Heavy use of dropout.
     4. Dense re-feeding for faster and better training.
     5. Beats other models on time-split Netflix data (RMSE of 0.9099 vs 0.9224).
     6. PyTorch-based: https://github.com/NVIDIA/DeepRecommender
     Oleksii Kuchaiev and Boris Ginsburg, "Training Deep AutoEncoders for Collaborative Filtering", arXiv preprint arXiv:1708.01715 (2017).

  4. Outline: Autoencoders & collaborative filtering; Effects of the activation types; Overfitting the data; Going deeper; Dropout; Dense re-feeding; Conclusions. Oleksii Kuchaiev and Boris Ginsburg, "Training Deep AutoEncoders for Collaborative Filtering", arXiv preprint arXiv:1708.01715 (2017).

  5. Collaborative filtering: rating prediction. One of the most popular approaches is Alternating Least Squares (ALS). R(i, j) = k iff user i gave item j rating k. [Figure: the m x n rating matrix R (users x items, mostly unobserved) is approximated by the product of two low-rank factor matrices with r hidden factors.]
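
  Aside (not from the slides): the slide names ALS, but a minimal illustrative sketch of the same low-rank factorization, fit with plain SGD on the observed entries only, could look like this in PyTorch; all sizes and the sample ratings below are made-up assumptions.

```python
import torch

n_users, n_items, n_factors = 1000, 500, 32                      # illustrative sizes
U = torch.nn.Parameter(0.1 * torch.randn(n_users, n_factors))    # user factors
V = torch.nn.Parameter(0.1 * torch.randn(n_items, n_factors))    # item factors
opt = torch.optim.SGD([U, V], lr=0.05, weight_decay=1e-4)

# Observed ratings as (user, item, rating) triples: R(i, j) = k iff user i gave item j rating k.
users = torch.tensor([0, 0, 1, 2])
items = torch.tensor([3, 7, 3, 9])
ratings = torch.tensor([5.0, 3.0, 4.0, 1.0])

for _ in range(100):
    opt.zero_grad()
    pred = (U[users] * V[items]).sum(dim=1)    # dot product of the factor rows
    loss = torch.mean((pred - ratings) ** 2)   # MSE over the observed entries only
    loss.backward()
    opt.step()
```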

  6. Autoencoder: the deep learning tool of choice for dimensionality reduction. $z = \mathrm{encoder}(x)$ is the encoding, $y = \mathrm{decoder}(z)$ is the reconstruction of x, so $y = \mathrm{decoder}(\mathrm{encoder}(x))$. With a 2-layer encoder and 2-layer decoder: $e_1 = f(W_{e1} x + b_1)$, $e_2 = f(W_{e2} e_1 + b_2)$, $z = e_2$, $d_1 = f(W_{d1} z + b_3)$, $y = W_{d2} d_1 + b_4$. An autoencoder can be thought of as a generalization of PCA. It is "constrained" if the decoder weights are transposes of the encoder weights, and "de-noising" if noise is added to x. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
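
  Aside (not from the slides): a minimal PyTorch sketch of the 2-layer-encoder / 2-layer-decoder autoencoder written out above. The hidden size of 128 matches the later activation-function experiment; everything else is an illustrative assumption (the authors' actual code is at https://github.com/NVIDIA/DeepRecommender).

```python
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_items, hidden=128):
        super().__init__()
        act = nn.SELU()  # SELU/ELU/LeakyReLU worked best on this task (see the activation slide)
        self.encoder = nn.Sequential(
            nn.Linear(n_items, hidden), act,
            nn.Linear(hidden, hidden), act,
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), act,
            nn.Linear(hidden, n_items),  # linear output: reconstruct raw ratings
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```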

  7. Autoencoders for recommendations: user-based (or item-based). The input rating vector is (very) sparse, the reconstruction y is dense, and training uses a Masked Mean Squared Error that only scores entries the user actually rated. Sedhain, Suvash, et al. "AutoRec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.
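
  Aside (not from the slides): a sketch of a masked MSE loss, assuming unrated entries are stored as zeros (consistent with the r_i != 0 masking in the RMSE formula on the benchmark slide).

```python
import torch

def masked_mse(pred, target):
    mask = (target != 0).float()           # 1 where the user actually rated
    num_rated = mask.sum().clamp(min=1.0)  # avoid division by zero on empty batches
    return ((pred - target) ** 2 * mask).sum() / num_rated
```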

  8. Dataset: Netflix Prize training data set, time-split to predict future ratings. Wu, Chao-Yuan, et al. "Recurrent recommender networks." Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.

  9. Benchmark: Netflix Prize training data set, evaluated with RMSE over rated entries, $\mathrm{RMSE} = \sqrt{\frac{\sum_{r_i \neq 0} (r_i - y_i)^2}{\sum_{r_i \neq 0} 1}}$. PMF: Mnih, Andriy, and Ruslan R. Salakhutdinov. "Probabilistic matrix factorization." Advances in Neural Information Processing Systems. 2008. I-AR, U-AR: Sedhain, Suvash, et al. "AutoRec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015. RRN: Wu, Chao-Yuan, et al. "Recurrent recommender networks." Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.

  10. Outline (section divider, same items as slide 4). Kuchaiev & Ginsburg, arXiv:1708.01715 (2017).

  11. Activation function matters: on this task ELU, SELU and LeakyReLU perform much better than sigmoid, ReLU, ReLU6, tanh and Swish. Apparently important: (a) a non-zero negative part, (b) an unbounded positive part. [Figure: training RMSE per mini-batch for a 4-layer autoencoder (2-layer encoder, 2-layer decoder) with hidden dimension 128; line colors correspond to different activation functions.]
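
  Aside (not from the slides): the compared activations as PyTorch modules, any of which could be passed in for `act` in the autoencoder sketch above; the grouping comments simply restate the slide's observations.

```python
import torch.nn as nn

activations = {
    "selu": nn.SELU(),         # non-zero negative part, unbounded positive part: worked well
    "elu": nn.ELU(),           # likewise
    "lrelu": nn.LeakyReLU(),   # likewise
    "relu": nn.ReLU(),         # zero negative part: worked poorly on this task
    "relu6": nn.ReLU6(),       # bounded positive part: worked poorly
    "sigmoid": nn.Sigmoid(),   # bounded both sides: worked poorly
    "tanh": nn.Tanh(),         # bounded both sides: worked poorly
}
```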

  12. Outline (section divider, same items as slide 4). Kuchaiev & Ginsburg, arXiv:1708.01715 (2017).

  13. Overfit your data: wider layers fit the training data better but generalize poorly (evaluation RMSE > 1.1 on Netflix Full). [Figure: training and evaluation RMSE over 100 epochs for hidden sizes d = 128, 256, 512, 1024.]

  14. Outline (section divider, same items as slide 4). Kuchaiev & Ginsburg, arXiv:1708.01715 (2017).

  15. Deeper models generalize better. No layer-wise pre-training necessary! [Figure: a deeper stack of encoder and decoder layers between the input x and the reconstruction y.]
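
  Aside (not from the slides): one way to sketch "going deeper" is to build the encoder and decoder stacks from a list of layer sizes. The default sizes below mirror the diagram on the dropout slide but are otherwise an assumption, not the paper's final configuration.

```python
import torch.nn as nn

def build_deep_autoencoder(n_items, hidden_sizes=(512, 512, 1024), act=nn.SELU):
    sizes = [n_items, *hidden_sizes]
    enc, dec = [], []
    for i in range(len(sizes) - 1):                      # encoder: n_items -> ... -> code
        enc += [nn.Linear(sizes[i], sizes[i + 1]), act()]
    for i in reversed(range(len(sizes) - 1)):            # decoder mirrors the encoder
        dec += [nn.Linear(sizes[i + 1], sizes[i]), act()]
    dec = dec[:-1]                                       # drop final activation: linear rating output
    return nn.Sequential(*enc), nn.Sequential(*dec)
```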

  16. Outline (section divider, same items as slide 4). Kuchaiev & Ginsburg, arXiv:1708.01715 (2017).

  17. Dropout helps wider models generalize. In the diagrammed model (x -> 512 -> 512 -> 1024 -> 512 -> 512 -> y), dropout is applied to the 1024-dimensional code. [Figure: evaluation RMSE (roughly 0.91-1.09) over 100 epochs for drop probabilities 0.0, 0.5, 0.65 and 0.8.]
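
  Aside (not from the slides): a sketch of the diagrammed model with heavy dropout applied only to the 1024-dimensional code. The drop probability is one of the values the slide compares; the activation choice is an assumption.

```python
import torch.nn as nn

class DropoutAutoEncoder(nn.Module):
    def __init__(self, n_items, drop_prob=0.8):
        super().__init__()
        act = nn.SELU()
        self.encoder = nn.Sequential(
            nn.Linear(n_items, 512), act,
            nn.Linear(512, 512), act,
            nn.Linear(512, 1024), act,
        )
        self.drop = nn.Dropout(p=drop_prob)   # heavy dropout on the code only
        self.decoder = nn.Sequential(
            nn.Linear(1024, 512), act,
            nn.Linear(512, 512), act,
            nn.Linear(512, n_items),
        )

    def forward(self, x):
        return self.decoder(self.drop(self.encoder(x)))
```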

  18. Outline (section divider, same items as slide 4). Kuchaiev & Ginsburg, arXiv:1708.01715 (2017).

  19. Dense re-feeding, intuition (idealized scenario). Note that x is sparse but f(x) is dense, and for x most of the loss is masked. Imagine a perfect f: $\forall i,\; x_i \neq 0 \Rightarrow f(x)_i = x_i$. If the user later rates a new item k with rating r, then $f(x)_k = r$. By induction, $f(f(x)) = f(x)$: thus f(x) should be a fixed point of f for every valid x.

  20. Dense re-feeding: an attempt to enforce the fixed-point constraint. Given a (very) sparse x, compute the dense f(x) and update the weights with the real data x; then treat the dense f(x) as a synthetic example, compute the dense f(f(x)), and update again with f(x) as the target.
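
  Aside (not from the slides): a sketch of one training step with dense re-feeding, reusing the hypothetical `masked_mse` and autoencoder sketches above; the second update treats the detached dense output as both input and target.

```python
def train_step(model, opt, x):
    # 1) Update with the real (sparse) data x.
    opt.zero_grad()
    y = model(x)
    masked_mse(y, x).backward()
    opt.step()

    # 2) Re-feed: treat the dense output as a synthetic example and update again.
    x_dense = y.detach()          # both input and target are now dense
    opt.zero_grad()
    y2 = model(x_dense)
    masked_mse(y2, x_dense).backward()
    opt.step()
```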

  21. Dense re-feeding together with a bigger learning rate improves generalization. [Figure: evaluation RMSE (roughly 0.905-0.995) over 100 epochs for Baseline, Baseline LR 0.005, Baseline RF, and Baseline LR 0.005 RF.]

  22. Results on Netflix time-split data. DeepRec is our 6-layer model. I-AR, U-AR: Sedhain, Suvash, et al. "AutoRec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015. RRN: Wu, Chao-Yuan, et al. "Recurrent recommender networks." Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.

  23. Conclusions:
     1. Autoencoders can replace ALS and are competitive with other methods.
     2. Deeper models generalize better; no layer-wise pre-training is necessary.
     3. The right activation function enables deep architectures: a non-zero negative part and an unbounded positive part are important.
     4. Heavy use of dropout is needed for wider models.
     5. Dense re-feeding further improves generalization.

  24. Oleksii Kuchaiev and Boris Ginsburg, "Training Deep Autoencoders for Collaborative Filtering", arXiv preprint arXiv:1708.01715 (2017). Code, docs and tutorial: https://github.com/NVIDIA/DeepRecommender
