INSTITUT D’ÉLECTRONIQUE ET DE TÉLÉCOMMUNICATIONS DE RENNES
Porting the Spider Dataflow Runtime on the Kalray MPPA Manycore Architecture
Hugo MIOMANDRE, Julien HASCOËT, Karol DESNOS, Kevin MARTIN, Benoit DUPONT DE DINECHIN, Jean-François NEZAN
Dataflow Workshop - 2017.12.12
K. Desnos – SPIDER / MPPA

Context > Overview
Reconfigurable Dataflow for Manycore (GdR ISIS Project)
[Diagram labels: Modeling Framework, Runtime Adaptation Layer, Manycore Architecture, Archi model]

Context > Dataflow input
The PiSDF Model of Computation
[Figure: example PiSDF graph. Labels: Read Header, Read Image, SetNbSlices, Filter, Kernel, Send; parameters Size = 4 and N = 2; port rates Size and Size/N; interfaces in and out]
Desnos et al. "PiMM: Parameterized and interfaced dataflow meta-model for MPSoCs runtime reconfiguration." SAMOS, 2013

Context > Runtime
The SPIDER runtime manager
Master tasks:
1. Manage graphs
2. Map & Schedule
3. Send Jobs
4. Run jobs
5. Monitor & Trace
Slave task:
- Run jobs
[Diagram labels: Master, Slaves, Jobs, Params, Timings, Data, Pool of data FIFOs]
Heulot et al. "Spider: A synchronous parameterized and interfaced dataflow-based RTOS for multicore DSPs" EDERC, 2014

Context > The Beast
Manycore Architecture: Kalray’s MPPA256B
Challenges
• Distributed scratchpad memory
• Massive parallelism
• NoC-based communications

Contributions
• Distributed synchronization
• Lightweight manycore scheduling
• Dataflow-based distributed memory allocation
Miomandre et al. "Embedded Runtime for Reconfigurable Dataflow Graphs on Manycore Architectures" submitted to PARMA-DITAM 18

Contributions > Overview
Spider distribution on MPPA
[Diagram: Master; Slave x 256]

Contributions > Synchro
Previous Synchronization Mechanisms
• Shared-memory based (x86, big.LITTLE)
  [Diagram: Core 1 executes A; Core 2 polls a flag, then pops job B]
• Hardware-supported FIFO (TI Keystone 2)
  [Diagram: Core 1 executes A; Core 2 pops job B]

Contributions > Synchro
Synchronization Mechanisms
Before: Centralized polling
[Diagram: Master and Slaves connected through the NoC]
• Unbounded communications
• Contention
After: Distributed notifications
[Diagram: Master and Slaves connected through the NoC]
• Notifier/Observer pattern
• Bounded communications
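
The contrast between the two schemes can be illustrated with a toy message count. This is only a sketch under stated assumptions: the names `polling_messages` and `JobNotifier`, the notion of a "tick", and the one-message-per-request cost model are hypothetical and are not taken from Spider or from the MPPA libraries.

```python
def polling_messages(num_slaves, job_ready_ticks):
    """Count requests when every slave polls the master once per tick.

    Assumption: one poll costs one NoC message, whether or not a job is ready.
    """
    pending = sorted(job_ready_ticks)
    messages = 0
    tick = 0
    while pending:
        for _ in range(num_slaves):
            messages += 1                      # every poll is a message
            if pending and pending[0] <= tick:
                pending.pop(0)                 # a ready job is handed out
        tick += 1
    return messages


class JobNotifier:
    """Minimal notifier/observer: one message per job, sent only when it is ready."""

    def __init__(self):
        self.observers = {}
        self.messages = 0

    def subscribe(self, slave_id, callback):
        self.observers[slave_id] = callback

    def notify(self, slave_id, job):
        self.messages += 1                     # exactly one message per job
        self.observers[slave_id](job)


# Two jobs, the second one ready late: polling traffic grows with the wait.
short_wait = polling_messages(4, [0, 10])
long_wait = polling_messages(4, [0, 100])

received = []
notifier = JobNotifier()
notifier.subscribe(0, received.append)
notifier.notify(0, "A")
notifier.notify(0, "B")
```

In this toy model the polling count depends on how long the slaves wait, while the notification count depends only on the number of jobs, which is the sense in which the second scheme is bounded.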

Contributions > Scheduling
Complexity issue for manycore
Good ol' List Scheduling
1. Create a sorted list of all actors to map/schedule.
2. Schedule the first actor of the list on the first available core.
3. Go back to step 2 until the list is empty.
Complexity: O(A.log(A) + P.A)
A: # actors
P: # processors
Multicore architectures: P is smaller than log(A) => A.log(A) dominates
Manycore architectures: P is large and A ∝ P (chosen by the designer) => complexity becomes quadratic
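
The three steps of list scheduling can be sketched as follows, to show where the two terms of the complexity come from. This is a simplified illustration, not the Spider scheduler: it ignores data dependencies, and the `(name, priority, duration)` tuple format and the function name `list_schedule` are assumptions made for the example.

```python
def list_schedule(actors, num_cores):
    """Map each actor to the core that becomes available first.

    actors: list of (name, priority, duration) tuples (hypothetical format).
    Returns a dict: name -> (core index, start time).
    """
    # Step 1: sorted list of actors, O(A log A).
    ordered = sorted(actors, key=lambda a: a[1], reverse=True)

    core_free_at = [0] * num_cores
    schedule = {}
    # Steps 2 and 3: repeat until the list is empty.
    for name, _priority, duration in ordered:
        # Linear scan over the P cores for each actor: O(P.A) overall.
        core = min(range(num_cores), key=core_free_at.__getitem__)
        start = core_free_at[core]
        schedule[name] = (core, start)
        core_free_at[core] = start + duration
    return schedule


example = list_schedule([("a", 3, 2), ("b", 2, 1), ("c", 1, 1)], num_cores=2)
```

With A proportional to P, the per-actor scan over the cores is what makes the P.A term grow quadratically.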

Contributions > Scheduling
Lightweight Scheduling
Specialized Round Robin
1. Create a list of actors in topological order.
2. Schedule the first actor by systematically interleaving clusters and cores.
3. Go back to step 2 until the list is empty.
Complexity: O(A)
A: # actors
Advantages:
- Low complexity
- Interleaving clusters avoids communication contention between starting actors.
- The scheduler controls the number of jobs in the queue of a core.
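
One possible way to interleave clusters and cores in constant time per actor is sketched below. The exact interleaving order used in Spider is not given on the slide, so the index arithmetic, the function name `round_robin_schedule`, and its parameters are assumptions for illustration only.

```python
def round_robin_schedule(actors_in_topological_order, num_clusters, cores_per_cluster):
    """Assign actors to (cluster, core) pairs in O(A).

    Consecutive actors land on different clusters; the core index advances
    only after every cluster has received one actor.
    """
    mapping = []
    for i, actor in enumerate(actors_in_topological_order):
        cluster = i % num_clusters
        core = (i // num_clusters) % cores_per_cluster
        mapping.append((actor, cluster, core))
    return mapping


example = round_robin_schedule(["a", "b", "c", "d", "e"], num_clusters=2, cores_per_cluster=2)
```

Because the target of each actor is computed from its position in the list alone, no search over the cores is needed, and the number of jobs queued on each core follows directly from the list length.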

Contributions > Memory Alloc.
Classic Memory Allocation
[Flowchart: Malloc In & Out -> Success? -> OK: Execute actor / KO: Crash!]

Contributions > Memory Alloc.
Cluster-Level Dataflow-Aware Memory Allocation
[Flowchart: Malloc In & Out -> Success? -> OK: Execute actor / KO: Other actor? -> OK: Malloc In & Out for that actor / otherwise: Notif. Master]
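
The fallback idea of the flowchart, trying another ready actor instead of crashing when an allocation fails, can be sketched as below. This is a rough illustration under stated assumptions: `ClusterMemory`, `pick_runnable`, the `(name, in_size, out_size)` job format, and the use of a `None` return value to stand for "notify the master" are all hypothetical, and the real allocator may differ.

```python
class ClusterMemory:
    """Toy model of a cluster-local scratchpad with a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def try_alloc(self, size):
        # Fail without side effects when the buffers do not fit.
        if self.used + size > self.capacity:
            return False
        self.used += size
        return True


def pick_runnable(ready_jobs, memory):
    """Return the first ready job whose input and output buffers fit.

    ready_jobs: list of (name, in_size, out_size), modified in place.
    Returns None when no job fits; the caller would then notify the master.
    """
    for index, (name, in_size, out_size) in enumerate(ready_jobs):
        if memory.try_alloc(in_size + out_size):
            ready_jobs.pop(index)
            return name
    return None


memory = ClusterMemory(capacity=10)
jobs = [("big", 8, 8), ("small", 2, 2)]
first = pick_runnable(jobs, memory)    # "big" does not fit, "small" does
second = pick_runnable(jobs, memory)   # nothing fits any more
```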

Results > Application
+ Morphological operator

Results > Performance
Max speedup: 22 on 256 cores (on 4k video)

Results > Energy
2.5x more energy efficient (on 4k video)
Intel Xeon E5-1650: 6 hyper-threaded x86 cores, 11.40 fps, ~120 W
Kalray MPPA256 Bostan: 258 RISC cores, 2.81 fps, ~12 W
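
The efficiency ratio follows from the frame rates and approximate power figures on the slide, taking frames per second per watt as the efficiency metric (an assumption, since the slide does not name the metric):

```python
def fps_per_watt(fps, watts):
    """Frames per second delivered per watt of power drawn."""
    return fps / watts


xeon = fps_per_watt(11.40, 120.0)   # Intel Xeon E5-1650, ~120 W
mppa = fps_per_watt(2.81, 12.0)     # Kalray MPPA256 Bostan, ~12 W
ratio = mppa / xeon                 # roughly 2.5
```

The MPPA is about four times slower in absolute terms but draws about ten times less power, which gives the ratio of roughly 2.5.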

Summary
Successful porting of Spider on MPPA
• Feasible: Reconfigurable Dataflow Runtime for manycore
• Promising: Energy efficiency beats x86
• Open source(-ish): On GitHub!
Future Work
• Better
• Faster
• Stronger
Julien, Florian, Alexandre, Hamza, Claudio, …

Achievement unlocked: Thanks for your attention
Achievement unlocked: Question time?
http://preesm.sf.net
@PreesmProject