cuMF_sgd: Fast and Scalable Matrix Factorization on GPUs
Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com
http://github.com/cumf
http://researcher.ibm.com/person/us-wtan
GTC 2017
Agenda
- Why: accelerate matrix factorization? Why use GPUs?
- How to accelerate: cuMF, with alternating least squares (ALS) and stochastic gradient descent (SGD)
- What is the result: cuMF outperforms all competitors
MF Explained Using Recommender Systems
- Input: users' ratings on some items
- Want: predict the missing ratings
- How: derive user features x_u and item features θ_v
- MF: factorize the rating matrix R into X and Θ^T, minimizing the empirical loss:
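The loss itself is not spelled out on the slide; one standard formulation, assuming squared error with L2 regularization:

    \min_{X,\Theta} \sum_{(u,v) \in R} (r_{uv} - x_u^\top \theta_v)^2
        + \lambda \Big( \sum_u \|x_u\|_2^2 + \sum_v \|\theta_v\|_2^2 \Big)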
Other Applications of MF
- Topic modeling: factorize a document × word matrix
- Word embedding: factorize a word × word co-occurrence matrix
- Network compression / link prediction: factorize a user × item adjacency matrix
To Solve MF: SGD
Stochastic gradient descent (SGD)
- Each update takes one rating at a time
- Core operation is a vector inner product: memory bound
- Needs many light epochs
- Parallelization: non-trivial
- Handles dense (implicit) ratings: no
(The per-rating update rule is given below.)
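For reference, the standard per-rating SGD update, assuming learning rate η and regularization λ (a detail the slide leaves implicit):

    e_{uv} = r_{uv} - x_u^\top \theta_v
    x_u \leftarrow x_u + \eta ( e_{uv} \theta_v - \lambda x_u )
    \theta_v \leftarrow \theta_v + \eta ( e_{uv} x_u - \lambda \theta_v )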
To Solve MF: ALS
Alternating least squares (ALS)
- Each update takes all ratings of a user (or item) at a time
- Core operations are vector outer products plus a solve: compute bound
- Needs few heavy epochs
- Parallelization: straightforward
- Handles dense (implicit) ratings: yes
(The per-user normal equation is given below.)
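In the standard formulation (again left implicit on the slide), each x_u is obtained by solving a small regularized least-squares problem, where Θ_u stacks the feature vectors of the items user u rated and r_u holds the corresponding ratings:

    x_u = ( \Theta_u^\top \Theta_u + \lambda I )^{-1} \Theta_u^\top r_u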
Challenge: Compute and Memory Capacity of CPUs
A CPU offers roughly 1 Tflops and 80 GB/s. With f = 100, per epoch:
- ALS floating-point operations: Netflix 1.5 T, Hugewiki 80 T, Facebook 2000 T
- SGD memory transfer: Netflix 80 GB, Hugewiki 2.4 TB, Facebook 80 TB
These far exceed the flops and bandwidth capacity of a CPU. (A back-of-the-envelope check follows.)
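One way to reconstruct the Netflix numbers (my arithmetic, assuming ~10^8 ratings and counting only the reads of the two feature vectors per update; writes roughly double the traffic):

    SGD traffic per epoch ≈ 10^8 ratings × 2 vectors × 100 floats × 4 B ≈ 80 GB
    ALS flops per epoch ≈ |R| · f^2 = 10^8 × 100^2 = 10^{12} = 1 T for forming
    the normal equations, plus the per-user/item solves, consistent with 1.5 T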
GPU vs. CPU: Compute FLOPS and Memory Bandwidth
- Raw performance: 1 GPU ≈ 10× CPU
- Practical performance (multi-CPU solutions are limited by slow interconnects): 1 GPU > 10× CPU, 4 GPUs >> 40× CPU
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
Goal: a CUDA Library for MF
Stack: applications (collaborative filtering, word embedding, ...) on top of cuMF (kernels for ALS and SGD) on top of CUDA.
- Fast: train and update models quickly
- Scalable: deal with big data; exploit fast interconnects
- Cost efficient: fully utilize flops and bandwidth; cheaper than CPU solutions
Challenges of SGD
1. The update kernel: iterate over all ratings; memory bound. Techniques: cache, half precision, memory coalescing, warp shuffle, ILP, register reuse.
2. How to parallelize, since updates are nominally sequential. Approaches: (a) Hogwild, (b) matrix blocking.
Challenges on GPUs
The update kernel, in pseudocode:
    float a = p[] * q[];      // dot product
    p[] = p[] - gradient[];   // update p
    q[] = q[] - gradient[];   // update q
- Vectorization: the dot product and updates must be vectorized across threads.
- Memory access: cache and memory efficiency matter, a common problem for all sparse applications.
- Code complexity: GPUs are not good at processing complex control logic (if/else branches); keeping the code structure simple is very important for performance.
- Scalability: existing CPU implementations do not scale beyond ~30 threads; the main reason is that scheduling and context-switch overheads increase with the number of threads.
Computation Optimization - 1: In-warp Vectorization
- Use one thread block to process one update; for any feature-vector length k, block size = warp size = 32.
- Instruction-level parallelism (ILP): since the Kepler architecture, GPUs exploit ILP by employing VLIW-like techniques; consecutive instructions with no data dependency execute in parallel. For k = 128, each thread keeps 4 independent partial products in flight.
- In-warp dot product via warp shuffle: no shared memory, no synchronization overhead, plus extra hardware acceleration. The fastest dot product on GPUs.
- Fully utilize registers: try to keep all variables in the register file, the fastest storage on a GPU.
(A sketch of the resulting dot product follows.)
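A minimal sketch of the in-warp dot product described above, assuming k = 128 and fp32 storage; note the 2017 code base predates CUDA 9 and would have used __shfl_down rather than __shfl_down_sync:

    __device__ float warp_dot_k128(const float* p, const float* q) {
        int lane = threadIdx.x & 31;
        // Four independent multiplies per lane: no data dependency,
        // so the hardware can overlap them (ILP).
        float s0 = p[lane]      * q[lane];
        float s1 = p[lane + 32] * q[lane + 32];
        float s2 = p[lane + 64] * q[lane + 64];
        float s3 = p[lane + 96] * q[lane + 96];
        float sum = (s0 + s1) + (s2 + s3);
        // Warp-shuffle reduction: no shared memory, no __syncthreads().
        for (int offset = 16; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xffffffffu, sum, offset);
        return sum;  // full dot product valid in lane 0
    }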
Computation Optimization - 2: Optimized Memory Access
- Memory coalescing: we ensure perfect coalescing at programming time, so every byte of off-chip memory access is utilized, achieving 100% bandwidth efficiency.
- Cache: use the L1 cache to capture data localities in the rating matrix.
- Register file: keep all reusable data in the register file and avoid register spilling.
- Half precision: Maxwell supports half-precision storage; using it cuts bandwidth consumption by 50% with no accuracy loss.
(A sketch of a half-precision, coalesced load follows.)
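A minimal sketch of half-precision feature loads, assuming k = 128 and vectors stored as __half2 (the function name and layout are mine, not the cuMF_sgd API):

    #include <cuda_fp16.h>

    __device__ float dot_half_k128(const __half2* p, const __half2* q) {
        int lane = threadIdx.x & 31;
        float sum = 0.f;
        // k = 128 halves = 64 half2 values; two half2 loads per lane, and
        // consecutive lanes read consecutive addresses (coalesced).
        for (int i = lane; i < 64; i += 32) {
            float2 a = __half22float2(p[i]);  // fp16 storage, fp32 math
            float2 b = __half22float2(q[i]);
            sum += a.x * b.x + a.y * b.y;
        }
        for (int offset = 16; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xffffffffu, sum, offset);
        return sum;  // valid in lane 0
    }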
Single GPU Scheduling
Centralized scheduling: a global scheduler owns a global scheduling table over the blocked rating matrix; when a parallel worker finishes its job, it asks the scheduler for a new one.
Scalability problem:
1. On CPUs, it does not scale beyond 30 threads.
2. On GPUs, it does not scale beyond 240 thread blocks.
(Figure: #updates/s vs. #parallel workers, 0-300, for libmf and libmf-GPU.)
Single GPU Scheduling – 1: Wave-like Update
- Basic observation: GPUs do not like complex block scheduling.
- Instead, update blocks in fixed waves: workers in the same wave take blocks that share no rows or columns, so their updates never conflict and no runtime scheduler is needed. (See the sketch below.)
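A possible host-side loop for the wave order (my reconstruction; the slide gives no details, and launch_update_kernel is a hypothetical helper). With the rating matrix split into B×B blocks, worker i in wave t takes block (i, (i + t) mod B):

    // Hypothetical sketch: B waves, one kernel launch per wave.
    for (int t = 0; t < B; ++t) {
        // Worker i updates block (i, (i + t) % B); no two workers in a
        // wave share a block row or column, so no locking is needed.
        launch_update_kernel(t, B);   // hypothetical helper
        cudaDeviceSynchronize();      // barrier between waves
    }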
Single GPU Scheduling – 2: Batch-Hogwild
Key idea:
- Borrow the idea of Hogwild!: lock-free updates, tolerating occasional conflicts.
- Optimization: each worker fetches a batch of samples at a time to exploit data locality. (A kernel sketch follows.)
(Figure: comparison of wave-like update and batch-Hogwild.)
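A self-contained sketch of a batch-Hogwild SGD kernel, assuming one warp per batch, pre-shuffled ratings, and fp32 storage (names and layout are mine, not the cuMF_sgd API):

    // Each warp processes one contiguous batch of pre-shuffled ratings;
    // writes are lock-free, so Hogwild-style races are tolerated.
    struct Rating { int u, v; float r; };

    __global__ void sgd_batch_hogwild(const Rating* ratings, long n,
                                      float* x, float* theta, int k,
                                      float lr, float lambda, long batch) {
        long warp_id = ((long)blockIdx.x * blockDim.x + threadIdx.x) >> 5;
        int  lane    = threadIdx.x & 31;
        long start = warp_id * batch;
        long end   = (start + batch < n) ? start + batch : n;
        for (long i = start; i < end; ++i) {
            Rating rt = ratings[i];             // batched fetch: data locality
            float* xu = x     + (long)rt.u * k;
            float* tv = theta + (long)rt.v * k;
            float dot = 0.f;                    // in-warp dot product
            for (int j = lane; j < k; j += 32) dot += xu[j] * tv[j];
            for (int o = 16; o > 0; o >>= 1)
                dot += __shfl_down_sync(0xffffffffu, dot, o);
            float e = rt.r - __shfl_sync(0xffffffffu, dot, 0);  // broadcast
            for (int j = lane; j < k; j += 32) {                // lock-free
                float xj = xu[j], tj = tv[j];
                xu[j] = xj + lr * (e * tj - lambda * xj);
                tv[j] = tj + lr * (e * xj - lambda * tj);
            }
        }
    }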
Performance
- Faster than all state-of-the-art approaches, with only one GPU card
Cross-Architecture Scalability
Performance and Achieved Bandwidth
Conclusion
- SGD-based MF is memory bound: try to increase memory bandwidth rather than FLOPS.
- GPUs do not favor complex scheduling policies or control logic.
- Half precision (16-bit floating point) is accurate enough for matrix factorization.
- Understanding the architectural details of GPUs helps a lot when writing high-performance GPU applications.
- cuMF_sgd is the fastest SGD-based MF.
Thank you, questions?
Acknowledgement: Xiaolong Xie, Liangliang Cao
- Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. HPDC 2016.
- CuMF_SGD: Fast and Scalable Matrix Factorization. HPDC 2017.
Code: http://github.com/cuMF/
Blog: http://ibm.biz/cumf-blog
Contact: Wei Tan, wtan@us.ibm.com