  1. Evaluating a Processing-in-Memory Architecture with the k-means Algorithm. Simon Bihel (simon.bihel@ens-rennes.fr), Lesly-Ann Daniel (lesly-ann.daniel@ens-rennes.fr), Florestan De Moor (florestan.de-moor@ens-rennes.fr), Bastien Thomas (bastien.thomas@ens-rennes.fr). May 4, 2017. University of Rennes I, École Normale Supérieure de Rennes.

  2. With Help From… Dominique Lavenier (dominique.lavenier@irisa.fr), CNRS, IRISA; David Furodet and the Upmem team (dfurodet@upmem.com).

  3. Context. Big Data workloads, the push towards exascale, the end of Dennard scaling and of Moore's Law, and the bandwidth and memory walls all drive a shift towards data-centric architectures.

  4. Table of contents: 1. The Upmem Architecture; 2. k-means Implementation for the Upmem Architecture; 3. Experimental Evaluation.

  5. The Upmem Architecture

  6. Upmem architecture overview. DPU: DRAM Processing Unit; DIMM: dual in-line memory module; MRAM: main memory; WRAM: execution memory for programs. [Figure: a CPU connected over the DDR bus to a DIMM hosting DPUs 0…255, each with its own MRAM and WRAM.]

  7. A massively parallel architecture. Characteristics: several DIMMs can be added to a CPU; a 16-GByte DIMM embeds 256 DPUs; each DPU can support up to 24 threads. The context is switched between DPU threads every clock cycle, so the programming approach has to consider this fine-grained parallelism.



  10. Upmem Architecture Overview. On the programming level, two programs must be specified: the host program on the CPU orchestrates the execution, while the tasklet programs on the DPUs perform the data-intensive operations. They communicate through the MRAM and through mailboxes.

  11. Drawbacks and advantages. Drawbacks (computation power): frequency around 750 MHz; no floating-point operations; significant multiplication overhead (no hardware multiplier); explicit memory management. Advantages (data access): massive parallelism; minimal latency; increased bandwidth; reduced power consumption.


  13. k-means Implementation for the Upmem Architecture

  14. k-means Clustering Problem. Partition data ∈ R^(n×m) into k clusters C_1 … C_k minimizing argmin_C Σ_{i=1}^{k} Σ_{p∈C_i} d(p, mean(C_i)), where d is the Euclidean distance and n (resp. m) is the number of points (resp. attributes). Examples of applications: gene sequence analysis, market research, communities in social networks, segmentation.

  15. k-means Standard Algorithm [6]
 1: function k-means(k, data, δ)
 2:   Choose C̃ := (c̃_1 … c̃_k) initial centroids
 3:   repeat
 4:     C = C̃
 5:     for all points p ∈ data do
 6:       j := argmin_i d(p, c_i)        ▷ Find nearest cluster
 7:       Assign p to cluster C_j
 8:     end for
 9:     for all i in {1 … k} do
10:       c̃_i = mean(p ∈ C_i)           ▷ Compute new centroids
11:     end for
12:   until ‖C̃ − C‖ ≤ δ                 ▷ Convergence criterion
13:   return C̃                          ▷ Return the final centroids
14: end function

  16. k-means algorithm on Upmem. The points are distributed across the DPUs. [Figure: the host reads the data, chooses the initial centroids, and distributes the points to the DPUs; each iteration, the host sends the centroids, the DPUs run the computations, the centroids are updated, and the loop repeats until convergence, after which the results are output.]

  17. Implementation & Memory Management. An int type is used to store distances (easy to overflow with distances). MRAM layout: global variables (e.g. number of points), centers, points, new centers.

  18. Experimental Evaluation

  19. Experimental Setup. Simulator: cycle-accurate simulator; the architecture is not yet manufactured. Datasets: randomly generated (not uniformly, with clusters), with int values; we could not find ready-to-use large integer datasets. [Figure: scatter plot of a generated clustered dataset.]


  21. Number of Threads. Datasets: high number of points (N=1000000, D=2, K=10); high number of dimensions (N=100000, D=34, K=3); high number of centroids (N=500000, D=10, K=5). Note: the runtime scales are not the same across datasets. [Figure: runtime versus number of threads (0 to 25) for the three datasets.]

  22. Number of DPUs. Always the same number of points; the time is divided by the number of DPUs. [Figure: runtime (seconds, 0 to 80) versus number of DPUs (0 to 35).]

  23. Comparison with sequential k-means. Dataset: Many Points. Runtime: 1.568 s with 16 DPUs vs. 0.268 s for 1-core sequential C. Faster than SeqC with 94 DPUs.

  24. Comparison with sequential k-means. Dataset: Many Dimensions. Runtime: 4.534 s with 16 DPUs vs. 0.119 s for 1-core sequential C. Faster than SeqC with 610 DPUs. A large number of dimensions provides a large amount of multiplications to compute distances.

  25. Comparison with sequential k-means. Dataset: Many Centers. Runtime: 0.4353 s with 16 DPUs vs. 0.0142 s for 1-core sequential C. Faster than SeqC with 491 DPUs. A large number of centers provides a large amount of computation per memory transfer [2].

  26. Conclusion

  27. Conclusion. The ideal use case is programs with very little computation (e.g. genomic text processing [4, 5]). Even when there is no gain in time, power consumption might be reduced. Distances overflow when computed with plain int; we implemented k-means++ [1] with the GMP library (arbitrary-precision numbers), but what was interesting here is the time per iteration.


  31. Going Further with the Hardware. Actual physical device: evaluate how the program behaves at large scale, and the impact on the DDR bus and communications. Hardware multiplication: currently, 40% of the executed instructions are multiplications, at about 30 instructions per multiplication.



  34. Going Further with the k-means. Keep the distance to the current nearest centroid [3]: easy to add in our implementation (keep the distance in the DPU); + avoids useless computations during the next iteration; − reduces the number of points per DPU. Define a border made of the points that can switch cluster [7]: harder to integrate; + reduces the number of distance computations; − might involve the CPU.

  35. Thank You

  36. References

  37. References
[1] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[2] M. A. Bender, J. Berry, S. D. Hammond, B. Moore, B. Moseley, and C. A. Phillips. k-means clustering on two-level memory systems. In Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS '15), pages 197–205, New York, NY, USA, 2015. ACM.
