  1. Evaluating a Processing-in-Memory Architecture with the k-means Algorithm. Simon Bihel (simon.bihel@ens-rennes.fr), Lesly-Ann Daniel (lesly-ann.daniel@ens-rennes.fr), Florestan De Moor (florestan.de-moor@ens-rennes.fr), Bastien Thomas (bastien.thomas@ens-rennes.fr). May 4, 2017. University of Rennes I, École Normale Supérieure de Rennes.

  2. With Help From… Dominique Lavenier (dominique.lavenier@irisa.fr), CNRS, IRISA; David Furodet and the Upmem team (dfurodet@upmem.com).

  3. Context. Big Data workloads, the push towards exascale, the end of Dennard scaling and of Moore's Law, and the bandwidth and memory walls all drive a shift towards data-centric architectures.

  4. Table of contents: 1. The Upmem Architecture; 2. k-means Implementation for the Upmem Architecture; 3. Experimental Evaluation.

  5. The Upmem Architecture

  6. Upmem architecture overview. DPU: DRAM Processing Unit; DIMM: dual in-line memory module; MRAM: main memory; WRAM: execution memory for programs. [Figure: a CPU connected over the DDR bus to a DIMM hosting DPUs 0…255, each with its own MRAM and WRAM.]

  7. A massively parallel architecture. Characteristics: several DIMMs can be added to a CPU; a 16-GByte DIMM embeds 256 DPUs; each DPU can support up to 24 threads. The context is switched between DPU threads every clock cycle, so the programming approach has to consider this fine-grained parallelism.



  10. Upmem Architecture Overview. On the programming level, two programs must be specified: the host program on the CPU orchestrates the execution, while the tasklet programs on the DPUs perform the data-intensive operations. They communicate through the MRAM and through mailboxes.

  11. Drawbacks and advantages. Drawbacks (computation power): frequency around 750 MHz; no floating-point operations; significant multiplication overhead (no hardware multiplier); explicit memory management. Advantages (data access): massive parallelism; minimal latency; increased bandwidth; reduced power consumption.


  13. k-means Implementation for the Upmem Architecture

  14. k-means Clustering Problem. Partition data ∈ R^(n×m) into k clusters C_1 … C_k minimizing argmin_C Σ_{i=1}^{k} Σ_{p∈C_i} d(p, mean(C_i)), where d is the Euclidean distance and n (resp. m) is the number of points (resp. attributes). Examples of applications: gene sequence analysis, market research, communities in social networks, segmentation.

  15. k-means Standard Algorithm [6]
 1: function k-means(k, data, δ)
 2:   Choose C̃ := (c̃_1 … c̃_k) initial centroids
 3:   repeat
 4:     C = C̃
 5:     for all points p ∈ data do
 6:       j := argmin_i d(p, c_i)        ▷ Find nearest cluster
 7:       Assign p to cluster C_j
 8:     end for
 9:     for all i in {1 … k} do
10:       c̃_i = mean(p ∈ C_i)           ▷ Compute new centroids
11:     end for
12:   until ‖C̃ − C‖ ≤ δ                 ▷ Convergence criterion
13:   return C̃                          ▷ Return the final centroids
14: end function

  16. k-means algorithm on Upmem. The points are distributed across the DPUs. [Figure: the host reads the data, chooses the initial centroids, and distributes the points to the DPUs; each iteration, the host sends the centroids, the DPUs run the computations, the centroids are updated, and the loop repeats until convergence, after which the results are output.]

  17. Implementation & Memory Management. An int type is used to store distances (easy to overflow with distances). MRAM layout: global variables (e.g. number of points), centers, points, new centers.

  18. Experimental Evaluation

  19. Experimental Setup. Simulator: cycle-accurate simulator; the architecture is not yet manufactured. Datasets: randomly generated (not uniformly, with clusters), with int values; we could not find ready-to-use large integer datasets. [Figure: scatter plot of a generated clustered dataset.]


  21. Number of Threads. Datasets: high number of points (N=1000000, D=2, K=10); high number of dimensions (N=100000, D=34, K=3); high number of centroids (N=500000, D=10, K=5). Note: the runtime scales are not the same across datasets. [Figure: runtime versus number of threads (0 to 25) for the three datasets.]

  22. Number of DPUs. Always the same number of points; the time is divided by the number of DPUs. [Figure: runtime (seconds, 0 to 80) versus number of DPUs (0 to 35).]

  23. Comparison with sequential k-means. Dataset: Many Points. Runtime: 1.568 s with 16 DPUs vs. 0.268 s for 1-core sequential C. Faster than SeqC with 94 DPUs.

  24. Comparison with sequential k-means. Dataset: Many Dimensions. Runtime: 4.534 s with 16 DPUs vs. 0.119 s for 1-core sequential C. Faster than SeqC with 610 DPUs. A large number of dimensions provides a large amount of multiplications to compute distances.

  25. Comparison with sequential k-means. Dataset: Many Centers. Runtime: 0.4353 s with 16 DPUs vs. 0.0142 s for 1-core sequential C. Faster than SeqC with 491 DPUs. A large number of centers provides a large amount of computation per memory transfer [2].

  26. Conclusion

  27. Conclusion. The ideal use case is programs with very little computation (e.g. genomic text processing [4, 5]). Even when there is no gain in time, power consumption might be reduced. Distances overflow when computed with plain int; we implemented k-means++ [1] with the GMP library (arbitrary-precision numbers), but what was interesting here is the time per iteration.


  31. Going Further with the Hardware. Actual physical device: evaluate how the program behaves at large scale, and the impact on the DDR bus and communications. Hardware multiplication: currently, 40% of the executed instructions are multiplications, at about 30 instructions per multiplication.



  34. Going Further with the k-means. Keep the distance to the current nearest centroid [3]: easy to add in our implementation (keep the distance in the DPU); + avoids useless computations during the next iteration; − reduces the number of points per DPU. Define a border made of the points that can switch cluster [7]: harder to integrate; + reduces the number of distance computations; − might involve the CPU.

  35. Thank You

  36. References

  37. References
[1] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[2] M. A. Bender, J. Berry, S. D. Hammond, B. Moore, B. Moseley, and C. A. Phillips. k-means clustering on two-level memory systems. In Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS '15), pages 197–205, New York, NY, USA, 2015. ACM.
