Multiclass Classification using SVMs on GPUs
Sergio Herrero
6.338J Applied Parallel Computing
Large Scale SVMs
• Serial SVMs: Osuna 1997, Joachims 1999, Platt 1999, Keerthi 2001, Fan 2005, …
• Parallel/Multiprocessor SVMs: Cao 2006, Zanni 2006
• GPU SVMs: Catanzaro 2008
• Distributed/Cluster SVMs: Graf 2005 (Cascade SVM), Lu 2008 (Yahoo), Chang 2006 (Google)
Multiclass SVM
[Figure: output-coding matrix relating training samples and classes to binary tasks — the output code decomposes the multiclass problem into a set of binary SVM tasks]
GPUs: CUDA (I)
• CUDA programming model
• Three key abstractions:
  – Hierarchy of thread groups
  – Shared memory
  – Barrier synchronization
• Advantages:
  – High floating-point throughput (~1 TFLOP)
  – Large device memory (4 GB)
  – High memory bandwidth (102 GB/s)
GPUs: CUDA (II)
[Figure: CUDA execution model — the host launches kernels on the device; Kernel 1 executes as Grid 1, a 2×5 array of thread blocks, and Kernel 2 as Grid 2, a 3×2 array; each block is itself a 3-D array of threads indexed (x, y, z)]
GPUs: CUDA (III)
[Figure: CUDA memory hierarchy — each thread has private registers, each block has its own shared memory, and all blocks in a grid (and the host) access device-wide global and constant memory]
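To make these abstractions concrete, here is a minimal CUDA sketch (not from the original slides; the kernel name, DIM, and block size are illustrative) that evaluates one row of an RBF kernel matrix: the query point is staged in shared memory, per-thread accumulators live in registers, and the data and result sit in global memory.

```cuda
// Sketch: evaluate one row of an RBF kernel matrix,
// K(query, x_i) = exp(-beta * ||query - x_i||^2), for i = 0..n-1.
// Memory placement mirrors the hierarchy above: registers (acc, diff),
// shared memory (q, one copy per block), global memory (points, query, out).
#define DIM 128  // padded feature dimension (illustrative)

__global__ void rbf_row(const float *points,  // n x DIM, row-major, global
                        const float *query,   // 1 x DIM, global
                        float *out, int n, float beta) {
    __shared__ float q[DIM];                   // shared memory, per block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;     // this thread's point index

    // Threads of the block cooperatively load the query point, then
    // synchronize so every thread sees the complete copy.
    for (int d = tid; d < DIM; d += blockDim.x)
        q[d] = query[d];
    __syncthreads();                           // barrier synchronization

    if (i < n) {
        float acc = 0.0f;                      // lives in a register
        for (int d = 0; d < DIM; ++d) {
            float diff = points[i * DIM + d] - q[d];
            acc += diff * diff;
        }
        out[i] = expf(-beta * acc);            // one kernel-matrix entry
    }
}

// Launch as a 1-D grid of 1-D blocks, e.g.:
//   rbf_row<<<(n + 255) / 256, 256>>>(d_points, d_query, d_out, n, beta);
```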
Parallel SMO
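For reference, these are the first-order SMO quantities the next figure manipulates (standard Keerthi-style definitions; the slide itself shows only the diagram):

```latex
f_i = \sum_{j=1}^{n} \alpha_j \, y_j \, K(x_j, x_i) - y_i, \qquad
b_{\mathrm{up}} = \min_{i \in I_{\mathrm{up}}} f_i, \qquad
b_{\mathrm{low}} = \max_{i \in I_{\mathrm{low}}} f_i
```

Optimization proceeds while b_low > b_up + 2τ, which is exactly the continuation test shown in the figure; the parallel version distributes the min/max searches over thread blocks.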
[Figure: Parallel SMO — the training set is split across P thread blocks; each block filters its slice of f_i into I_up/I_low candidates and reduces them to local (f_Iup, I_up) and (f_Ilow, I_low) pairs; the host reduces the P candidates to the global b_up and b_low, updates α_Iup and α_Ilow, has the device recompute f_i, and iterates while b_low > b_up + 2τ]
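The per-block "Filter → Min" stage in the figure can be sketched as a shared-memory tree reduction. The kernel below is a hedged reconstruction, not the talk's actual code: it assumes 256-thread blocks and finds each block's candidate for b_up = min{f_i : i ∈ I_up}; a symmetric max over I_low yields b_low.

```cuda
// Sketch of the per-block filter + min stage (256-thread blocks assumed).
#include <float.h>

__global__ void local_bup(const float *f, const float *alpha, const int *y,
                          float C, int n, float *blk_val, int *blk_idx) {
    __shared__ float sval[256];
    __shared__ int   sidx[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Filter: i is in I_up iff (y_i = +1 and alpha_i < C) or
    // (y_i = -1 and alpha_i > 0); other points cannot define b_up.
    bool in_Iup = (i < n) &&
        ((y[i] == 1 && alpha[i] < C) || (y[i] == -1 && alpha[i] > 0));
    sval[tid] = in_Iup ? f[i] : FLT_MAX;
    sidx[tid] = i;
    __syncthreads();

    // Tree reduction in shared memory: block-local min of f and its index.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sval[tid + s] < sval[tid]) {
            sval[tid] = sval[tid + s];
            sidx[tid] = sidx[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) {                  // one (f_Iup, I_up) candidate per block
        blk_val[blockIdx.x] = sval[0];
        blk_idx[blockIdx.x] = sidx[0];
    }
}
```

The host copies the P per-block candidates back, takes the global minimum to obtain (b_up, I_up), runs the symmetric max over I_low for (b_low, I_low), updates the two α's, and refreshes f on the device before the next iteration, stopping once b_low ≤ b_up + 2τ.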
Parallel Tasks (I)
• Kernel caching (Joachims 1999)
• Multiclass decompositions: AVA (all-vs-all) and OVA (one-vs-all)
[Figure: how AVA and OVA binary tasks are laid out over the shared kernel cache]
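A Joachims-style kernel cache keeps recently used rows of the kernel matrix resident in GPU global memory so that concurrent binary tasks touching the same training point can reuse them. A minimal host-side sketch, assuming an LRU policy (all names — KernelCache, compute_kernel_row — are illustrative, not the talk's code):

```cuda
// Host-side LRU cache over kernel-matrix rows held in GPU global memory.
// A hit returns the cached device row; a miss evicts the least recently
// used slot and recomputes the row on the GPU. Illustrative sketch only.
#include <list>
#include <unordered_map>
#include <utility>

struct KernelCache {
    float *d_rows;       // device buffer of size slots * n floats
    int n, slots;
    std::list<int> lru;  // front = most recently used training-point index
    std::unordered_map<int, std::pair<int, std::list<int>::iterator>> where;

    float *row(int i) {
        auto it = where.find(i);
        if (it != where.end()) {             // cache hit: refresh recency
            lru.erase(it->second.second);
            lru.push_front(i);
            it->second.second = lru.begin();
            return d_rows + (size_t)it->second.first * n;
        }
        int slot;
        if ((int)where.size() < slots) {     // cold slot still available
            slot = (int)where.size();
        } else {                             // evict least recently used row
            int victim = lru.back(); lru.pop_back();
            slot = where[victim].first;
            where.erase(victim);
        }
        lru.push_front(i);
        where[i] = { slot, lru.begin() };
        compute_kernel_row(i, d_rows + (size_t)slot * n);
        return d_rows + (size_t)slot * n;
    }

    void compute_kernel_row(int i, float *d_row) {
        // Launch a GPU kernel (e.g., an RBF-row kernel) to fill d_row
        // with K(x_i, x_j) for all j. Omitted in this sketch.
    }
};
```

Running several binary tasks concurrently raises the hit rate of such a cache, since different tasks repeatedly request rows for the same training points — which is what the measurements in Performance Results (II) and (III) track.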
Parallel Tasks (II)
[Figure: training subsets are assigned to concurrent binary tasks; Tasks #1–#4 converge after different numbers of iterations, and the grid is reduced as each task finishes, so fewer blocks are launched per iteration]
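The grid-reduction idea can be sketched as a host loop that rebuilds the launch grid each iteration from the still-active tasks. Everything below (Task, smo_iteration_kernel, the block→task table) is an assumed, simplified skeleton rather than the talk's implementation:

```cuda
// Host-side skeleton of grid reduction across concurrently trained tasks.
#include <vector>

struct Task { float b_up, b_low, tau; };     // per-task convergence state

// Placeholder for one parallel-SMO step over all active tasks; the real
// kernel would update f, alpha, b_up, b_low for its (task, chunk) block.
__global__ void smo_iteration_kernel(const int *task_of_block) {}

static bool converged(const Task &t) {
    return t.b_low <= t.b_up + 2.0f * t.tau; // duality gap closed
}

void train_all_tasks(std::vector<Task> &tasks, int blocks_per_task,
                     int *d_task_of_block /* device block->task table */) {
    std::vector<int> active(tasks.size());
    for (size_t t = 0; t < tasks.size(); ++t) active[t] = (int)t;

    while (!active.empty()) {
        // Grid reduction: launch only enough blocks for still-active tasks.
        int grid = (int)active.size() * blocks_per_task;
        smo_iteration_kernel<<<grid, 256>>>(d_task_of_block);
        cudaDeviceSynchronize();
        // (A real implementation copies each task's b_up/b_low back here
        // and rebuilds the block->task table when the active set shrinks.)

        std::vector<int> still;
        for (int t : active)
            if (!converged(tasks[t]))        // keep tasks still above 2*tau
                still.push_back(t);
        active.swap(still);
    }
}
```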
Performance Results (I)
Host-Device specifications:
  Host: Ubuntu 8.10 64-bit; CPU: Intel Core i7 920 @ 2.67 GHz; memory: 6 GB (3×2 GB DDR3)
  Device: Tesla C1060; 240 stream processors @ 1.3 GHz (933 GFLOPS); memory: 4 GB GDDR3; memory bandwidth: 102 GB/s
  Host <-> Device: PCIe x16 (8 GB/s)

Datasets:
  Dataset   # Training Points   # Testing Points   # Features   # Classes   C     β
  Adult     32,561              16,281             123          2           100   0.5
  MNIST     60,000              10,000             780          10          10    0.125
  (C is the soft-margin penalty; β is the kernel parameter.)
Performance Results (II)
[Figure: kernel cache hit rate (y-axis, 0–0.5) vs. number of iterations (x-axis, 0–70,000) on MNIST (OVA); one curve per number of concurrently trained tasks, from 1 to 10]
Performance Results (III)
[Figure: kernel cache hit rate (y-axis, 0–0.9) vs. number of iterations (x-axis, 0–20,000) on MNIST (AVA); one curve per number of concurrently trained tasks: 5, 15, 25, 35, 45]
Performance Results (IV)
Accuracy (binary tasks):
  Dataset   SVM      Accuracy (%)   # SVs    Iterations   Difference in b (%)
  Adult     GPU      82.697624      18,668   115,565      0.01
  Adult     LIBSVM   82.697624      19,058   43,735
  MNIST     GPU      96             43,730   69,535       0.04
  MNIST     LIBSVM   96             43,756   76,385

Training time (binary & multiclass):
  Dataset                   GPU (sec)             LIBSVM (sec)          Speedup
  Adult                     38.0542               479                   12.58731
  MNIST, OVA (10 tasks)     2272.71 (~38 min)     —                     —
  MNIST, AVA (45 tasks)     1217.333 (~20 min)    27,833 (~7.7 hours)   22.86392
Performance Results (V)
[Figure: training time (sec) vs. number of concurrent tasks. Left, MNIST (OVA): 1–10 tasks, times up to ~2,500 sec, 1172 blocks per iteration. Right, MNIST (AVA): 1–45 tasks, times up to ~1,400 sec, 5274 blocks per iteration]
Conclusions:
- Naïve implementation of multiclass SVM:
  - One order of magnitude speedup compared to LIBSVM
  - Room for improvement:
    - Second-order heuristics (Keerthi 2001)
    - Sparse matrices (Joachims 2006)
- Parallel programming experience (me)
- Future work:
  - Distributed SVM training in multi-GPU scenarios (Graf 2005, Lu 2008)