2010 Blue Waters Performance Modeling Workshop – Opening and Introduction
Torsten Hoefler
With slides from: William Kramer, Marc Snir, William Gropp, IBM, and the Blue Waters team
Introduction and Overview
• My slides contain only public information and will be available online after the workshop
• No need to take pictures or notes!
• Parts of tomorrow's sessions will contain IBM confidential information
• You may only attend the NDA session if your institution signed and cleared all NDAs for you!
• You are responsible for maintaining the confidentiality of the information!
Blue Waters in a Nutshell
• >300,000 compute cores
• based on Power7
• 10 PF/s peak
• 1 PF/s sustained
• >1 PiB RAM
• >10 PiB disk storage
• >0.5 EiB archival storage
Performance Modeling for Blue Waters
• Most users only have experience at comparatively “small” scale (<8000 cores)
• Applications should be ready to run on the full system
• This requires a clear understanding before the system is deployed (a run, tweak, rerun loop is not possible)
Programmers need to develop a deep understanding of application scaling and bottlenecks at scale through performance modeling!
From Chip to Entire Integrated System
[Figure: building blocks, from smallest to largest]
• P7 chip (8 cores)
• SMP node (32 cores)
• Drawer (256 cores)
• SuperNode (1024 cores; 32 nodes / 4 CEC; L-Link cables)
• Blue Waters system at NPCF, with on-line and near-line storage
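The core counts in this hierarchy multiply up level by level. A minimal sketch of that arithmetic follows. The chips-per-node and nodes-per-drawer factors are taken from the other slides in this deck, and the constant and function names are illustrative only. The supernode estimate uses the ">300,000 cores" lower bound, so it is a rough estimate and not the actual system configuration.

```python
# Building-block sizes as listed on the slides.
CORES_PER_CHIP = 8
CHIPS_PER_NODE = 4         # quad-chip module = one 32-core SMP node
NODES_PER_DRAWER = 8       # 256 cores per drawer
DRAWERS_PER_SUPERNODE = 4  # 32 nodes / 4 CEC = 1024 cores


def cores_per_level():
    """Return the number of cores at each level of the hierarchy."""
    chip = CORES_PER_CHIP
    node = chip * CHIPS_PER_NODE
    drawer = node * NODES_PER_DRAWER
    supernode = drawer * DRAWERS_PER_SUPERNODE
    return {"chip": chip, "node": node, "drawer": drawer, "supernode": supernode}


def supernodes_needed(total_cores):
    """Smallest number of supernodes providing at least total_cores."""
    per_supernode = cores_per_level()["supernode"]
    return -(-total_cores // per_supernode)  # ceiling division


levels = cores_per_level()
print(levels)
# Assumption: 300,000 is only the lower bound quoted for the full system.
print(supernodes_needed(300_000))
```

Under these assumptions, reaching 300,000 cores takes on the order of 300 supernodes.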
Power7 Chip (8 cores)
• Base Technology
  • 45 nm, 576 mm²
  • 1.2 B transistors
• Chip
  • 8 cores
  • 4 FMAs/cycle/core
  • 32 MB L3 (private/shared)
  • Dual DDR3 memory
  • 128 GiB/s peak bandwidth (1/2 byte/flop)
  • Clock range of 3.5 – 4 GHz
[Figure: quad-chip MCM]
L3 Cache/On-Chip Communication
• L1 32 KB Instruction / core
• L1 32 KB Data / core
• L2 = 256 KB / core
• L3 = 4 MB eDRAM / core
• Fast private and shared region
Quad Chip Module (4 chips)
• 32 cores!
• 32 cores * 8 F/cycle/core * 4 GHz = 1 TF
• 4 threads per core (max)
• 4x32 MiB L3 cache
• 512 GB/s RAM BW (0.5 B/F)
• 800 W (0.8 W/GF)
• Flat shared memory!
[Figure: quad chip module with four 8-core P7 chips (P7-0 to P7-3), each with two memory controllers (MC0, MC1) serving clock groups A to D, and chip-to-chip links labeled A, B, C, W, X, Y, Z]
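The headline QCM numbers follow from the per-chip figures. The sketch below redoes that arithmetic, counting each FMA as two floating-point operations and using the 4 GHz upper end of the quoted clock range. The names are illustrative, and the inputs are the peak values from the slides, not measurements.

```python
CORES_PER_CHIP = 8
FMAS_PER_CYCLE_PER_CORE = 4
FLOPS_PER_FMA = 2        # one multiply plus one add
CLOCK_GHZ = 4.0          # upper end of the 3.5-4 GHz range
CHIPS_PER_QCM = 4
QCM_MEM_BW_GBS = 512.0   # RAM bandwidth of the QCM in GB/s
QCM_POWER_W = 800.0      # power of the QCM in watts


def chip_peak_gflops(clock_ghz=CLOCK_GHZ):
    """Peak GF/s of one Power7 chip at the given clock."""
    return CORES_PER_CHIP * FMAS_PER_CYCLE_PER_CORE * FLOPS_PER_FMA * clock_ghz


def qcm_summary(clock_ghz=CLOCK_GHZ):
    """Peak rate, memory balance, and power efficiency of one QCM."""
    peak = CHIPS_PER_QCM * chip_peak_gflops(clock_ghz)
    return {
        "peak_gflops": peak,
        "bytes_per_flop": QCM_MEM_BW_GBS / peak,
        "watts_per_gflop": QCM_POWER_W / peak,
    }


for name, value in qcm_summary().items():
    print(f"{name}: {value:.3f}")
```

At 4 GHz this gives 1024 GF/s (about 1 TF), 0.5 bytes per flop, and roughly 0.8 W per GF/s. At the lower 3.5 GHz clock the peak drops and the bytes-per-flop ratio rises accordingly.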
Adding a Network Interface (Torrent)
• Connects QCM to PCI-e (two 16x and one 8x PCI-e slot)
• Connects 8 QCMs via low latency, high bandwidth, copper fabric
• Provides a message passing mechanism with very high bandwidth
• Provides the lowest possible latency between 8 QCMs
[Figure: QCM with DIMMs 0-15 attached to a Hub Chip Module. Recoverable labels: 28x XMIT/RCV pairs @ 10 Gb/s; 7 inter-hub board-level L-buses (Ll0-Ll6), 3.0 Gb/s @ 8B+8B, 90% sustained peak, 22+22 GB/s each; D0-D15 at 10+10 GB/s (12x = 10+2); Lr0-Lr23 at 5+5 GB/s (6x = 5+1); three links at 7+7 GB/s; PCIe 16x and 8x; aggregate labels 320 GB/s and 240 GB/s]
1.1 TB/s HUB
• 192 GB/s Host Connection
• 336 GB/s to 7 other local nodes
• 240 GB/s to local-remote nodes
• 320 GB/s to remote nodes
• 40 GB/s to general purpose I/O
[Figure: Torrent hub block diagram. Recoverable labels: HUB to QCM connections over 8B W, X, Y, Z buses (address/data); 7 L-local buses (LL0-LL6), 8B copper, HUB to HUB copper board wiring; 24 L-remote buses (LR0-LR23), 6x optical, 4-drawer interconnect to create a supernode; 16 D buses (D0-D15), 12x optical, interconnect of supernodes; PCI-E 16x, 16x, 8x (PX0-PX2 buses); 28 I2C links (I2C_0 to I2C_27) to optical modules]
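The 1.1 TB/s headline is the sum of the five link classes listed on this slide. A small sketch that checks the total and shows how the bandwidth is split follows. The dictionary keys paraphrase the bullets, and the function names are illustrative.

```python
# Aggregate hub bandwidth per link class, in GB/s, as listed on the slide.
HUB_LINKS_GBS = {
    "host connection": 192,
    "7 other local nodes": 336,
    "local-remote nodes": 240,
    "remote nodes": 320,
    "general purpose I/O": 40,
}


def hub_total_gbs(links=HUB_LINKS_GBS):
    """Total hub bandwidth in GB/s."""
    return sum(links.values())


def share_by_link(links=HUB_LINKS_GBS):
    """Fraction of the total bandwidth contributed by each link class."""
    total = hub_total_gbs(links)
    return {name: bw / total for name, bw in links.items()}


total = hub_total_gbs()
print(f"{total} GB/s = {total / 1000:.1f} TB/s")  # 1128 GB/s = 1.1 TB/s
for name, share in share_by_link().items():
    print(f"{name}: {share:.0%}")
```

For a communication model, the split matters more than the total: the bandwidth available to a message depends on which link class it has to cross.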
Drawer (256 Cores): First Level Interconnect (L-Local)
• 8 nodes
• 32 chips
• 256 cores
[Figure: drawer layout with HUB to HUB copper wiring. Recoverable labels: 8 QCMs (QCM 0 to QCM 7), each with four P7 chips (P7-0 to P7-3), 16 DIMMs (DIMM00 to DIMM15), and one HUB module (U-P1-M1 to U-P1-M8); 17 PCIe slots (P1-C1-C1 to P1-C17-C1); DCA-0 (top) and DCA-1 (bottom) connectors; optical fan-out from HUB modules with 2,304 fiber 'L-Link' and 64/40 optical 'D-Link']