BIPS, Computational Research Division
Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms
Samuel Williams (1,2), Jonathan Carter (2), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2)
(1) University of California, Berkeley
(2) Lawrence Berkeley National Laboratory
samw@eecs.berkeley.edu
Motivation
Multicore is the de facto solution for improving peak performance for the next decade.
How do we ensure this applies to sustained performance as well?
Processor architectures are extremely diverse, and compilers can rarely fully exploit them.
We require a HW/SW solution that guarantees performance without completely sacrificing productivity.
Overview
Examined the Lattice-Boltzmann Magneto-hydrodynamic (LBMHD) application.
Present and analyze two threaded and auto-tuned implementations.
Benchmarked performance across 5 diverse multicore microarchitectures:
  Intel Xeon (Clovertown)
  AMD Opteron (rev.F)
  Sun Niagara2 (Huron)
  IBM QS20 Cell Blade (PPEs)
  IBM QS20 Cell Blade (SPEs)
We show:
  Auto-tuning can significantly improve application performance.
  Cell consistently delivers good performance and efficiency.
  Niagara2 delivers good performance and productivity.
Multicore SMPs used
Multicore SMP Systems
[Block diagrams of the four systems. The labels recoverable from the figure are listed below.]
Intel Xeon (Clovertown): 8 Core2 cores; 4 x 4MB shared L2; 2 FSBs at 10.6 GB/s each; chipset (4x64b controllers); 21.3 GB/s (read), 10.6 GB/s (write); 667MHz FBDIMMs.
AMD Opteron (rev.F): 4 Opteron cores; 1MB victim cache per core; SRI / crossbar; HT links at 4GB/s (each direction); 2 x 128b memory controllers at 10.66 GB/s each; 667MHz DDR2 DIMMs.
Sun Niagara2 (Huron): 8 MT Sparc cores; 8K L1 per core; crossbar switch at 90 GB/s (writethru) and 179 GB/s (fill); 4MB shared L2 (16 way, address interleaving via 8x64B banks); 4x128b memory controllers (2 banks each); 21.33 GB/s (write), 42.66 GB/s (read); 667MHz FBDIMMs.
IBM QS20 Cell Blade: 2 PPEs with 512KB L2 each; 16 SPEs, each with 256K and an MFC; 2 EIBs (ring network); BIF at <20GB/s (each direction); XDR at 25.6GB/s each; 2 x 512MB XDR DRAM.
Multicore SMP Systems (memory hierarchy)
[Same block diagrams of the four systems, annotated: Conventional Cache-based Memory Hierarchy.]
Multicore SMP Systems (memory hierarchy)
[Same block diagrams of the four systems, annotated: Conventional Cache-based Memory Hierarchy; Disjoint Local Store Memory Hierarchy.]
Multicore SMP Systems (memory hierarchy)
[Same block diagrams of the four systems, annotated: Cache + Pthreads implementations; Local Store + libspe implementations.]
Multicore SMP Systems (peak flops)
[Same block diagrams of the four systems, annotated with peak flop rates.]
Intel Xeon (Clovertown): 75 Gflop/s
AMD Opteron (rev.F): 17 Gflop/s
Sun Niagara2 (Huron): 11 Gflop/s
IBM QS20 Cell Blade: PPEs 13 Gflop/s, SPEs 29 Gflop/s
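One way to read the peak flop rates against the memory bandwidths on these slides is as a flop:byte ratio per system. The sketch below is an illustration, not part of the original deck. The names `SYSTEMS`, `machine_balance`, and `balance_table` are hypothetical. The figures are the rounded values shown on the slides. Summing the two Opteron memory controllers and the two Cell XDR interfaces into one aggregate number, and using read bandwidth where the slides separate read from write, are assumptions.

```python
# Peak figures as labelled on the slides (rounded as shown there).
# ASSUMPTION: per-controller bandwidths are summed for the Opteron
# (2 x 10.66 GB/s) and the Cell blade (2 x 25.6 GB/s); read bandwidth
# is used where the slides give separate read and write numbers.
SYSTEMS = {
    "Intel Xeon (Clovertown)": {"gflops": 75.0, "dram_gbs": 21.3},
    "AMD Opteron (rev.F)": {"gflops": 17.0, "dram_gbs": 2 * 10.66},
    "Sun Niagara2 (Huron)": {"gflops": 11.0, "dram_gbs": 42.66},
    "IBM QS20 Cell Blade (SPEs)": {"gflops": 29.0, "dram_gbs": 2 * 25.6},
}


def machine_balance(gflops, dram_gbs):
    """Return peak flops per byte of peak DRAM bandwidth."""
    return gflops / dram_gbs


def balance_table(systems=SYSTEMS):
    """Map each system name to its flop:byte ratio, rounded to 2 places."""
    return {name: round(machine_balance(**s), 2) for name, s in systems.items()}


if __name__ == "__main__":
    for name, ratio in balance_table().items():
        print(f"{name:28s} {ratio:5.2f} flops/byte")
```

A higher ratio means a kernel needs more arithmetic per byte moved to approach peak flops. These are peak-to-peak ratios only and say nothing about the sustained bandwidth or flop rate an application achieves.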