Adding Low-Cost Hardware Barrier Support to Small Commodity Clusters
Torsten Höfler
Department of Computer Science, TU Chemnitz
June 24, 2006
Outline
1 History: Parallel Machines with Barrier Support
2 Our Design: Hardware, State Machine
3 MPI Implementation: Parallel Port Access, Open MPI
4 Performance: Microbenchmark, Application Benchmark
5 Conclusions and Future Work
Earth Simulator
- Global Barrier Counter (GBC)
- Flag registers within a processor node (Global Barrier Flag - GBF)
Earth Simulator
Barrier working principle:
1 Master node sets the number of nodes into the GBC
2 Control unit resets all GBFs of the nodes
3 A completed node decrements the GBC and loops on its GBF
4 When GBC = 0, the control unit sets all GBFs
5 All nodes continue
⇒ constant barrier latency of 3.5 µs between 2 and 512 nodes
BlueGene/L
- Independent barrier network
- Four independent channels
BlueGene/L
Barrier working principle:
1 Global OR
2 Global AND by inverted logic
3 Signal is propagated to the top of a binomial tree and back down
4 OR is used for interrupts (halt machine)
5 AND is used for the barrier
6 Can be partitioned at specific borders
⇒ constant barrier latency of 1.5 µs between 2 and 65536 nodes
Cray T3D
- Two Fetch&Increment registers per processor
- Global AND / OR barrier
Other Hardware Barriers
... many more with the same principles:
- Cray T3D
- Fujitsu VPP500
- Thinking Machines CM-5
- Purdue's Adapter
- ...
⇒ our approach is to support commodity clusters without changes to the machine itself
FPGA-Based Prototype
- Simple and cheap design
- Prototype supports 1 barrier per node
Parallel Port
[Figure: parallel-port register-to-pin assignment]
- Data Port (BASE + 0): bits 0-7, outgoing, wired to pins 2-9
- Status Port (BASE + 1): incoming, bits wired to pins 15, 13, 12, 10, 11
- Control Port (BASE + 2): outgoing, bits wired to pins 1, 14, 16, 17; one bit is the IRQ enable
Three cables per node (IN, OUT, GND)
Prototype supports 1 barrier per node
Two-State Machine
[State diagram: output o with two states]
- o = '0' → o = '1' when i1 and i2 and i3 and i4 = '1'
- o = '1' → o = '0' when i1 or i2 or i3 or i4 = '0'
- Two states (2 FFs + ⌈log2 P⌉ 2-port ANDs/ORs)
- Very fast state transition
- OUT ↔ iP, IN ↔ o
Working Principle
Goal: minimize read/write operations!
1 init only: read status (IN)
2 toggle status
3 write new status (OUT)
4 read status (IN) until toggled
→ no "packets", constant voltage-level based
Scalability
Goal: connect more than a thousand nodes!
- Similar principle as for BlueGene/L
- AND / OR tree
- Propagating state up and down
- Two-state principle
Accessing the Parallel Port

    #include <stdio.h>
    #include <sys/io.h>   /* inb(), outb(), ioperm() */

    #define BASEPORT 0x378

    int main() {
        /* Get permission to access the three port registers */
        if (ioperm(BASEPORT, 3, 1)) { perror("ioperm"); return 1; }
        /* Set the data signals (D0-7) of the port to '0' */
        outb(0, BASEPORT);
        /* Read from the status port (BASE+1) */
        printf("status: %d\n", inb(BASEPORT + 1));
        return 0;
    }

- Prototype uses INB, OUTB
- Requires root access, and the OS adds overhead
- Kernel module with mmapped registers easily possible
Collective Module in Open MPI
[Figure: Open MPI layering — Application, MPI, PML (OB1), COLL (HWBARR, ...), BML (R2), BTLs (IB, TCP)]
- Implemented as a collective (COLL) module in Open MPI
- Prototype supports only MPI_COMM_WORLD
- Requires running as root
Performance Model
Variables:
1 t_b: barrier latency
2 o_w: CPU overhead of a write to the parallel port
3 o_r: CPU overhead of a read from the parallel port
4 o_p(P): processing overhead of a state change
5 P: number of processors
→ toggle - write - read schema: t_b = o_w + o_p(P) + o_r