Computer Science » Computer Engineering » Computer Architecture
Performance Visualization of Hybrid Cell Applications
ScicomP 15, May 19th, Barcelona
holger.brunst@tu-dresden.de, daniel.hackenberg@tu-dresden.de
Outline
  • Introduction
  • Software Tracing on Cell Systems
  • Implementation and Functionality
  • Examples and Overhead
  • Summary
Holger Brunst, Daniel Hackenberg Slide 2
Cell Broadband Engine
  [Block diagram: eight SPEs, each consisting of an SPU and a local store (LS), attached to the Element Interconnect Bus together with the PowerPC Processor Element (PPE: PowerPC core with L1 and L2 caches), the Memory Interface Controller (MIC, Dual XDR), and the Bus Interface Controller (BIC, FlexIO)]
  SPE: Synergistic Processor Element
  LS: Local Store
Cell Broadband Engine
  Vast Resources
    • SPEs: SIMD cores for fast calculations, 256 KB local store (LS, software controlled), dedicated DMA engine (MFC)
    • PPE: very simple PowerPC core for the OS (Linux) and control tasks
  Sophisticated Architecture
    • Complex software development process
    • Different compilers and programs for PPE and SPEs
    • SPEs use DMA commands to access main memory or the LS of other SPEs; the MFC executes them asynchronously
    • Mailbox communication between PPE and SPEs
  Tool Support
Trace-based Analysis
  Why do we still need to analyze?
    • HPC: system complexity increases constantly
    • Parallelism is entering the mainstream market, and few people know how to deal with it
  Approaches
    • Profilers do not give detailed insight into the timing behavior of an application
    • Detailed online analysis is practically impossible because of intrusion and data volume
  Tracing
    • Records application behavior step by step
    • Captures the dynamic behavior of parallel applications
    • Performance analysis is done post mortem
Background
  What is Vampir?
    • Performance monitoring and analysis tool
    • Targets the visualization of dynamic processes on massively parallel (compute) systems
  History
    • Development started more than 15 years ago at Research Centre Jülich, ZAM
    • Since 1997, developed at TU Dresden (first in collaboration with Pallas GmbH; 2003-2005: Intel Software & Solutions Group; since January 2006: TU Dresden, ZIH / GWT-TUD)
  Availability
    • Unix, Windows, and Mac OS
    • Visualization components (Vampir) are commercial
    • Monitor components (VampirTrace) are Open Source
Components
  [Diagram, sequential case: an application CPU monitored by VampirTrace produces trace data (OTF), which Vampir displays over time]
  [Diagram, parallel case: application CPUs 1 to 10,000, each monitored by VampirTrace, produce OTF trace parts 1 to m; VampirServer processes the trace data with Task 1 … Task n, where n << m]
Flavors
  Vampir
    • Sequential event analysis
    • Rich set of graphical performance views
    • For desktops and small parallel production environments
    • Less scalable
  VampirServer
    • Distributed client/server approach
    • Parallel analysis
    • New features
  Vampir for Windows
    • Modern Qt-based GUI
    • Released at ISC 2009, Hamburg
    • Currently: beta release
Software Tracing on Cell Systems
  PPE
    • Conventional tools with PowerPC support run unmodified
    • Modifications are necessary to support SPE threads
  SPE
    • A new concept, suitable for this architecture, needs to be designed
    • A new monitor is necessary to generate events
    • The local store is too small and can hold events only temporarily
    • PPE and SPE timers must be synchronized
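The timer synchronization requirement can be illustrated with a minimal sketch. It assumes that the SPE timestamp comes from a 32-bit counter that counts down at the same rate as a 64-bit PPE counter that counts up, and that both values were sampled together at one synchronization point. The type sync_point_t and the function spe_to_ppe_time are hypothetical names for illustration, not part of VampirTrace or the Cell SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of one synchronization point: both clocks
 * sampled at (nearly) the same moment. */
typedef struct {
    uint64_t ppe_at_sync;  /* PPE counter value (counts up) */
    uint32_t spe_at_sync;  /* SPE counter value (assumed to count down) */
} sync_point_t;

/* Map an SPE timestamp onto the PPE time axis.
 * Assumption: both counters tick at the same frequency. */
uint64_t spe_to_ppe_time(const sync_point_t *s, uint32_t spe_now)
{
    assert(s != 0);
    /* Unsigned subtraction stays correct even if the down-counter
     * wrapped around once since the synchronization point. */
    uint32_t elapsed = s->spe_at_sync - spe_now;
    return s->ppe_at_sync + elapsed;
}
```

A single synchronization point ignores clock drift; a tool could record a second point at the end of the run and interpolate.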
Trace Monitor Concept
  [Diagram: SPE 0 … SPE n, each with an SPU running an instrumented SPE program and a Local Store holding two trace buffers (Buf 1, Buf 2); the Element Interconnect Bus; the PPE running an instrumented PPE program; Main Memory holding buffers Buf 1/0 … Buf m/0 through Buf 1/n … Buf m/n; the I/O system holding trace file PPE and trace file SPE 0 … trace file SPE n]
  • The SPE program writes trace events into a small trace buffer in the Local Store
  • The buffers switch each time the current trace buffer is full
  • A DMA transfer writes the full trace buffer to main memory in the background; the SPE program keeps running
  • The PPE processes the SPE trace buffers (post mortem) and writes the trace files to disk
  • PPE side: conventional monitoring tool with enhancements to cover e.g. mailbox communication with SPEs
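The buffer switching described on this slide can be sketched in portable C. This is a simulation under stated assumptions, not the actual monitor: the DMA transfer is replaced by a synchronous memcpy in the hypothetical helper fake_dma_put, main memory is a plain array, and all names (monitor_t, monitor_record, monitor_flush) are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUF_EVENTS  4    /* events per local buffer (tiny, for illustration) */
#define MAIN_EVENTS 64   /* capacity of the simulated main-memory area */

typedef struct { uint32_t id; uint32_t timestamp; } event_t;

typedef struct {
    event_t local[2][BUF_EVENTS];   /* the two small local-store buffers */
    int     active;                 /* buffer currently being filled */
    int     fill;                   /* events in the active buffer */
    event_t main_mem[MAIN_EVENTS];  /* stand-in for main memory */
    int     main_fill;
    int     switches;               /* number of buffer switches so far */
} monitor_t;

/* Stand-in for an asynchronous DMA put of one buffer to main memory. */
static void fake_dma_put(monitor_t *m, int buf, int count)
{
    assert(m->main_fill + count <= MAIN_EVENTS);
    memcpy(&m->main_mem[m->main_fill], m->local[buf],
           (size_t)count * sizeof(event_t));
    m->main_fill += count;
}

void monitor_init(monitor_t *m) { memset(m, 0, sizeof *m); }

/* Record one event; switch buffers when the active one is full. */
void monitor_record(monitor_t *m, uint32_t id, uint32_t timestamp)
{
    m->local[m->active][m->fill].id = id;
    m->local[m->active][m->fill].timestamp = timestamp;
    m->fill++;
    if (m->fill == BUF_EVENTS) {
        fake_dma_put(m, m->active, BUF_EVENTS);
        /* With a real asynchronous DMA, the monitor would have to make
         * sure the previous transfer of the other buffer has completed
         * before writing into it again. */
        m->active ^= 1;
        m->fill = 0;
        m->switches++;
    }
}

/* Transfer a partially filled buffer, e.g. at program end. */
void monitor_flush(monitor_t *m)
{
    if (m->fill > 0) {
        fake_dma_put(m, m->active, m->fill);
        m->fill = 0;
    }
}
```

Two buffers are the minimum that lets the program keep recording while a full buffer is still in flight.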
Trace Visualization for Cell (1)
  [Timeline diagram: locations Process 1 to Process 4 on the vertical axis, each passing through Region 1 and Region 2 over time]
  Illustration of parallel processes in a typical timeline display
Trace Visualization for Cell (2)
  [Timeline diagram: PPE Process 1 with SPE Threads 1 to 3 beneath it, each thread passing through Region 1 and Region 2 over time]
  Illustration of SPE threads as children of the PPE process
Trace Visualization for Cell (3)
  [Timeline diagram: PPE Process 1 and SPE Threads 1 to 3 (Region 1 each) over time]
  Illustration of mailbox messages
    • Classic two-sided communication (send/receive)
    • Illustrated by lines, similar to MPI messages
Trace Visualization for Cell (4)
  [Timeline diagram: PPE Process 1, a Main Memory bar with read and write phases, and SPE Threads 1 and 2 (Region 1 each) over time]
  Illustration of DMA transfers between SPEs and main memory
    • The PPE is not involved
    • Main memory is represented as an independent bar
    • Allows graphical representation of memory states (read/write)
Trace Visualization for Cell (5)
  [Timeline diagram: PPE Process 1, Main Memory, SPE Thread 1, and SPE Thread 2 over time, with a DMA put and a DMA get between the SPE threads]
  DMA transfers between SPEs
    • Challenge: communication is one-sided
    • Peer-to-peer send/receive representation is unsuitable
  Distinction of active and passive partner?
    • Additional lines
    • Additional bullets (active partner)
    • Even more bullets? (passive partner)
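One way to keep the active and passive partners apart in the trace data is to store the initiator and the direction explicitly and derive the data flow from them. The record layout below is a hypothetical sketch; dma_event_t and its fields are not taken from VampirTrace or OTF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { DMA_PUT, DMA_GET } dma_dir_t;

/* Hypothetical trace record for a one-sided DMA transfer. */
typedef struct {
    uint64_t  timestamp;
    int       active;   /* location that issued the DMA command */
    int       passive;  /* remote location; runs no code for the transfer */
    dma_dir_t dir;      /* direction as seen from the active partner */
    uint32_t  bytes;
} dma_event_t;

/* Location the data is read from: a put reads from the active partner. */
int dma_data_source(const dma_event_t *e)
{
    assert(e != NULL);
    return e->dir == DMA_PUT ? e->active : e->passive;
}

/* Location the data is written to: a get writes to the active partner. */
int dma_data_sink(const dma_event_t *e)
{
    assert(e != NULL);
    return e->dir == DMA_PUT ? e->passive : e->active;
}
```

A display could then draw the line from source to sink and mark only the active end with a bullet.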
Trace Visualization for Cell (6)
  [Timeline diagram: PPE Process 1, Main Memory, SPE Thread 1, and SPE Thread 2 over time, with a DMA wait phase and the time stamps t_0, t_1, t_2 marked]
    • The DMA wait operation creates two events (at t_1 and t_2)
    • Allows illustration of the DMA wait time
    • Similar for mailbox messages
  Code:
    t_0 = get_timestamp();
    mfc_get();
    [...]
    t_1 = get_timestamp();
    wait_for_dma_tag();
    t_2 = get_timestamp();
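Given the three time stamps from this slide, the wait time and the time during which the transfer overlapped with computation follow by subtraction. The sketch below assumes that all three stamps come from one monotonically increasing clock; dma_timing_t and both functions are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Time stamps of one DMA transfer, as on the slide. */
typedef struct {
    uint64_t t0;  /* DMA command issued */
    uint64_t t1;  /* program starts waiting for the DMA tag */
    uint64_t t2;  /* wait returns, transfer complete */
} dma_timing_t;

/* Time the program was blocked waiting for the transfer. */
uint64_t dma_wait_time(const dma_timing_t *d)
{
    assert(d->t1 <= d->t2);
    return d->t2 - d->t1;
}

/* Time the program kept working while the transfer was in flight. */
uint64_t dma_overlap_time(const dma_timing_t *d)
{
    assert(d->t0 <= d->t1);
    return d->t1 - d->t0;
}
```

A long wait time relative to the overlap time suggests that the transfer was issued too late to be hidden behind computation.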
Implementation
  Prototype implementation based on VampirTrace (VT)
    • Open Source
    • http://www.tu-dresden.de/zih/vampirtrace
  Additional tool: CellTrace
    • Header files for PPE and SPE programs: instrumentation of inline functions provided by the Cell SDK
    • Library for PPE programs + library for SPE programs
  [Build flow diagram: SPE sources spu_code_1.c … spu_code_n.c include celltrace_spu.h and are compiled by the SPU compiler with -DCTRACE into spu_code_1.o … spu_code_n.o; PPE sources ppu_code_1.c … ppu_code_m.c include celltrace_ppu.h and are compiled by vtcc with -DCTRACE into ppu_code_1.o … ppu_code_m.o. The diagram further shows celltrace_spu.a, celltrace_ppu.a, spu_object.o, an Embedder, ppu_object.o, an Archiver, spu_lib.a, and the resulting trace-enabled cell_binary]
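Header-based instrumentation of inline functions can be sketched as follows. This is an illustration of the general technique, not CellTrace's actual header: sdk_write_mailbox is a hypothetical stand-in for an inline function from an SDK header, the event codes are invented, and CTRACE is defined inside the file only so that the sketch is self-contained (in a real build it would come from -DCTRACE).

```c
#include <assert.h>
#include <stdint.h>

/* Normally supplied on the compiler command line as -DCTRACE. */
#ifndef CTRACE
#define CTRACE 1
#endif

enum { EV_MBOX_ENTER = 1, EV_MBOX_LEAVE = 2, MAX_EVENTS = 16 };

static uint32_t event_log[MAX_EVENTS];
static int      event_count = 0;
static uint32_t last_mailbox_value = 0;

/* Hypothetical stand-in for an inline function from an SDK header. */
static inline void sdk_write_mailbox(uint32_t value)
{
    last_mailbox_value = value;
}

#ifdef CTRACE
static inline void record_event(uint32_t code)
{
    assert(event_count < MAX_EVENTS);
    event_log[event_count++] = code;
}

/* Wrapper that brackets the original call with two trace events. */
static inline void traced_write_mailbox(uint32_t value)
{
    record_event(EV_MBOX_ENTER);
    sdk_write_mailbox(value);  /* still the original: macro defined below */
    record_event(EV_MBOX_LEAVE);
}

/* From here on, application calls are redirected to the wrapper. */
#define sdk_write_mailbox(value) traced_write_mailbox(value)
#endif

/* Application code stays unchanged, whether traced or not. */
void notify_ppe(uint32_t status)
{
    sdk_write_mailbox(status);
}
```

Because the redirection happens in the preprocessor, compiling without CTRACE leaves the application with the plain SDK call and no tracing overhead.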
Trace Visualization with Vampir (1)
  Visualization of a Cell trace using Vampir
    • Simple demo program
    • 4 SPEs only
Trace Visualization with Vampir (2)
Trace Visualization with Vampir (3)
  Complex DMA transfers of SPE 3
Example Cell Applications: FFT (1)
  FFT at a synchronization point
  8 SPEs, 64 KByte page size, 11.9 GFLOPS
Example Cell Applications: FFT (2)
  FFT at a synchronization point
  8 SPEs, 16 MByte page size, 42.9 GFLOPS
Example Applications: Cholesky (1)
  Cholesky transformation with 8 SPEs
  Overview with DMA communication of SPE 3
Example Applications: Cholesky (2)
  Cholesky transformation with 8 SPEs
  Enlargement with DMA communication of SPE 3
Example Cell Applications: RAxML (1)
  RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase