TSUBAME2.0, 2.5 towards 3.0 for Convergence of Extreme Computing and Big Data
Satoshi Matsuoka, Professor, Global Scientific Information and Computing (GSIC) Center, Tokyo Institute of Technology; Fellow, Association for Computing Machinery (ACM)
HP-CAST SC2014 Presentation, New Orleans, USA, Nov. 14, 2014
TSUBAME2.0, Nov. 1, 2010: “The Greenest Production Supercomputer in the World”
• GPU-centric (>4000 GPUs), high performance & low power
• Small footprint (~200 m2 or 2000 sq. ft.), low TCO
• High-bandwidth memory, optical network, SSD storage…
[TSUBAME2.0 new-development scaling diagram: ~1 kW max, >400 GB/s memory BW and 80 Gbps network BW per node (40 nm / 32 nm parts); 35 kW max per rack; 1.4 MW max, >600 TB/s aggregate memory BW and 220 Tbps network bisection BW for the full system; intermediate memory BW figures of >1.6 TB/s and >12 TB/s]
TSUBAME2.0 ⇒ 2.5 Thin Node Upgrade (Fall 2013)
HP SL390G7 thin node (developed for TSUBAME2.0, modified for TSUBAME2.5), productized as the HP ProLiant SL390s:
• Peak perf.: 4.08 Tflops per node; ~800 GB/s memory BW; ~1 kW max
• GPU: NVIDIA Kepler K20X x3, 1310 GFlops and 6 GByte memory per GPU (upgraded from NVIDIA Fermi M2050: 3950/1310 GFlops vs. 1039/515 GFlops per GPU)
• CPU: Intel Westmere-EP 2.93 GHz x2
• I/O: multiple I/O chips, 72 PCIe lanes (16 x 4 + 4 x 2) for 3 GPUs + 2x Infiniband QDR (80 Gbps network BW)
• Memory: 54 or 96 GB DDR3-1333
• SSD: 60 GB x2 or 120 GB x2
Phase-field simulation for dendritic solidification [Shimokawabe, Aoki et al.], Gordon Bell 2011 Winner
Weak scaling on TSUBAME (single precision), mesh size per 1 GPU + 4 CPU cores: 4096 x 162 x 130
• TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores), mesh 4,096 x 5,022 x 16,640
• TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores), mesh 4,096 x 6,480 x 13,000
Application: developing lightweight strengthened materials by controlling microstructure, toward a low-carbon society.
• Peta-scale phase-field simulations can simulate the multiple dendritic growth during solidification required for the evaluation of new materials.
• 2011 ACM Gordon Bell Prize, Special Achievements in Scalability and Time-to-Solution
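A rough back-of-envelope reading of these weak-scaling numbers, as a Python sketch. It uses only the figures on this slide plus the K20X single-precision peak quoted on the node slide, and it neglects the CPU contribution, so it is an estimate rather than a reported result:

```python
# Per-GPU sustained throughput implied by the weak-scaling results above.
# Assumption: CPU contribution neglected; K20X single-precision peak
# (3.95 Tflops) taken from the TSUBAME2.5 node slide.

runs = {
    "TSUBAME 2.5": {"pflops": 3.444, "gpus": 3968},
    "TSUBAME 2.0": {"pflops": 2.000, "gpus": 4000},
}
K20X_SP_PEAK_TFLOPS = 3.95

for name, r in runs.items():
    per_gpu = r["pflops"] * 1000 / r["gpus"]   # Tflops per GPU
    print(f"{name}: ~{per_gpu:.2f} Tflops/GPU")

# The TSUBAME 2.5 run comes out at ~0.87 Tflops/GPU, i.e. roughly 22% of the
# K20X single-precision peak, a plausible fraction for a memory-bound stencil.
t25 = runs["TSUBAME 2.5"]
print(f"Fraction of K20X SP peak: "
      f"{t25['pflops'] * 1000 / t25['gpus'] / K20X_SP_PEAK_TFLOPS:.0%}")
```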
Application (configuration, unit)                                      TSUBAME2.0   TSUBAME2.5   Boost Ratio
Top500/Linpack, 4131 GPUs (PFlops)                                     1.192        2.843        2.39
Green500/Linpack, 4131 GPUs (GFlops/W)                                 0.958        3.068        3.20
Semi-Definite Programming Nonlinear Optimization, 4080 GPUs (PFlops)   1.019        1.713        1.68
Gordon Bell Dendrite Stencil, 3968 GPUs (PFlops)                       2.000        3.444        1.72
LBM LES Whole City Airflow, 3968 GPUs (PFlops)                         0.592        1.142        1.93
Amber 12 pmemd, 4 nodes / 8 GPUs (nsec/day)                            3.44         11.39        3.31
GHOSTM Genome Homology Search, 1 GPU (sec)                             19361        10785        1.80
MEGADOC Protein Docking, 1 node / 3 GPUs (vs. 1 CPU core)              37.11        83.49        2.25
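The boost ratios can be reproduced from the two performance columns; note that GHOSTM reports elapsed time, so lower is better and the ratio is inverted. A minimal Python check (application names abbreviated):

```python
# Recompute the boost ratios from the TSUBAME2.0 / TSUBAME2.5 columns above.
# GHOSTM is elapsed time (lower is better), so its boost = old / new;
# for all other rows boost = new / old.

rows = [
    # (application, tsubame2_0, tsubame2_5, lower_is_better)
    ("Top500/Linpack (PFlops)",                1.192,   2.843, False),
    ("Green500/Linpack (GFlops/W)",            0.958,   3.068, False),
    ("SDP Nonlinear Optimization (PFlops)",    1.019,   1.713, False),
    ("Gordon Bell Dendrite Stencil (PFlops)",  2.000,   3.444, False),
    ("LBM LES Whole City Airflow (PFlops)",    0.592,   1.142, False),
    ("Amber 12 pmemd (nsec/day)",              3.44,   11.39,  False),
    ("GHOSTM Homology Search (sec)",           19361,  10785,  True),
    ("MEGADOC Protein Docking (x 1 CPU core)", 37.11,  83.49,  False),
]

for name, t20, t25, lower_is_better in rows:
    boost = t20 / t25 if lower_is_better else t25 / t20
    print(f"{name}: boost = {boost:.2f}")
```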
TSUBAME2.0 ⇒ 2.5 Power Improvement
• 18% power reduction including cooling (Dec. 2012 vs. Dec. 2013 operation)
• Green500: #6 in the world on the Nov. 2013 list, alongside TSUBAME-KFC at #1; #9 on the June 2014 list
Comparing K Computer to TSUBAME2.5: performance roughly equal (≒), cost far lower (<<)
• K Computer (2011): 11.4 Petaflops SFP/DFP; $1,400M over 6 years (incl. power)
• TSUBAME2.0 (2010) → TSUBAME2.5 (2013): 17.1 Petaflops SFP, 5.76 Petaflops DFP; $45M over 6 years (incl. power), roughly 1/30 the cost of K
TSUBAME2 vs. K: Technological Comparison (TSUBAME2 deploying state-of-the-art technology)

                                   TSUBAME2.5                                 BG/Q Sequoia                  K Computer
Single Precision FP                17.1 Petaflops                             20.1 Petaflops                11.3 Petaflops
Green500 (MFLOPS/W), Nov. 2013     3,068.71 (6th)                             2,176.58 (26th)               830.18 (123rd)
Operational Power (incl. Cooling)  ~0.8 MW                                    5~6 MW?                       10~11 MW
Hardware Architecture              Many-Core (GPU) + Multi-Core Hetero        Homo Multi-Core               Homo Multi-Core
Maximum HW Threads                 > 1 Billion                                ~6 million                    ~700,000
Memory Technology                  GDDR5 + DDR3                               DDR3                          DDR3
Network Technology                 Luxtera Silicon Photonics                  Standard Optics               Copper
Non-Volatile Memory / SSD          SSD Flash in all nodes, ~250 TBytes        None                          None
Power Management                   Node/System Active Power Cap               Rack-level measurement only   Rack-level measurement only
Virtualization                     KVM (G & V queues, Resource segregation)   None                          None
TSUBAME3.0: Leadership “Template” Machine
• Under design: deployment 2016H2~H3
• High computational power: ~20 Petaflops, ~5 Petabyte/s memory BW
• Ultra high density: ~0.6 Petaflops DFP/rack (x10 TSUBAME2.0)
• Ultra power efficient: 10 Gigaflops/W (x10 TSUBAME2.0, TSUBAME-KFC)
  – Latest power control, efficient liquid cooling, energy recovery
• Ultra high-bandwidth network: over 1 Petabit/s bisection, new topology?
  – Bigger capacity than the entire global Internet (several 100 Tbps)
• Deep memory hierarchy and ultra high-bandwidth I/O with NVM
  – Petabytes of NVM, several Terabytes/s BW, several 100 million IOPS
  – Next-generation “scientific big data” support
• Advanced power-aware resource management, high-resiliency SW/HW co-design, VM & container-based dynamic deployment…
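The headline targets above imply a rough power and rack budget. A back-of-envelope Python sketch using only the figures on this slide (it assumes the ~20 Petaflops peak and the 0.6 Petaflops/rack density refer to the same precision, which the slide does not state):

```python
# Implications of the TSUBAME3.0 design targets listed above
# (a back-of-envelope sketch, not a design specification).

peak_pflops = 20.0              # ~20 Petaflops target peak
efficiency_gflops_per_w = 10.0  # 10 Gigaflops/W target efficiency
density_pflops_per_rack = 0.6   # ~0.6 Petaflops per rack target density

power_watts = (peak_pflops * 1e6) / efficiency_gflops_per_w  # Gflops / (Gflops/W) = W
racks = peak_pflops / density_pflops_per_rack

print(f"Implied compute power draw: ~{power_watts / 1e6:.1f} MW")  # ~2.0 MW
print(f"Implied rack count:         ~{racks:.0f} racks")           # ~33 racks
```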
Focused Research Towards TSUBAME3.0 and Beyond, Towards Exa
• Software and algorithms for the new memory hierarchy
  – Pushing the envelope of low power vs. capacity; Communication and Synchronization Reducing Algorithms (CSRA)
• Post-petascale networks
  – Topology, routing algorithms, placement algorithms… (SC14 paper, Tue 14:00-14:30, “Fail in Place Network…”)
• Green computing: power-aware APIs, fine-grained resource scheduling
• Scientific “Extreme” Big Data
  – GPU Hadoop acceleration, large graphs, search/sort, deep learning
• Fault tolerance
  – Group-based hierarchical checkpointing, fault prediction, hybrid algorithms
• Post-petascale programming
  – OpenACC extensions and other many-core programming substrates
• Performance analysis and modeling
  – For CSRA algorithms, for big data, for deep memory hierarchy, for fault tolerance, …
TSUBAME-KFC: Towards TSUBAME3.0 and Beyond
• Oil-immersion cooling
• #1 on the Green500 at SC13, ISC14, … (paper @ ICPADS14)
Extreme Big Data Examples: rates and volumes are extremely immense. NOT simply mining Tbytes of silo data: Peta~Zettabytes of data, ultra high-BW data streams, highly unstructured and irregular large data, complex correlations between data from multiple sources. Extreme capacity, bandwidth, and compute are all required.
• Social NW – large graph processing
  – Facebook: ~1 billion users, average 130 friends, 30 billion pieces of content shared per month
  – Twitter: 500 million active users, 340 million tweets per day
  – Internet: 300 million new websites per year, 48 hours of video uploaded to YouTube per minute, 30,000 YouTube videos played per second
• Social simulation – target area: the planet (OpenStreetMap), 7 billion people
  – Input data: road network for the planet, 300 GB (XML); trip data for 7 billion people, 10 KB (1 trip) x 7 billion = 70 TB
  – Real-time streaming data (e.g., social sensors, physical data)
  – Simulated output for 1 iteration: 700 TB
• Weather – real-time advanced data assimilation
  – Inputs: Phased Array Radar, 1 GB / 30 sec / 2 radars; Himawari satellite, 500 MB / 2.5 min
  – Workflow: (A-1, B-1) quality control; (A-2, B-2) data processing; ① 30-sec ensemble data assimilation (2 GB analysis data, 2 PFLOP); ② ensemble forecast simulations (2 PFLOP); ③ 30-min ensemble forecast simulation (200 GB data, 1.2 PFLOP); 30-min forecast output 2 GB; repeat every 30 sec
• Genomics – sequence matching
  – Impact of new-generation sequencers: sequencing data (bp)/$ grows x4000 per 5 years, c.f. HPC x33 in 5 years (Lincoln Stein, Genome Biology, vol. 11(5), 2010)
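As a quick sanity check of the rates quoted above, a small Python sketch (it assumes the radar and satellite streams run continuously, an assumption not stated on the slide):

```python
# Quick check of the data volumes and rates on the Extreme Big Data slide.

KB, MB, GB, TB = 1e3, 1e6, 1e9, 1e12
SECONDS_PER_DAY = 86400

# Social simulation: one 10 KB trip record per person on the planet
trip_bytes = 10 * KB * 7e9
print(f"Trip data: {trip_bytes / TB:.0f} TB")                  # 70 TB, as on the slide

# Weather: phased array radar, 1 GB every 30 s (for 2 radars)
radar_per_day = 1 * GB * SECONDS_PER_DAY / 30
print(f"Radar stream: {radar_per_day / TB:.1f} TB/day")        # ~2.9 TB/day

# Himawari satellite, 500 MB every 2.5 minutes
himawari_per_day = 500 * MB * SECONDS_PER_DAY / 150
print(f"Himawari stream: {himawari_per_day / GB:.0f} GB/day")  # ~288 GB/day
```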
Graph500 “Big Data” Benchmark: BSP problem on a Kronecker graph (generator parameters A: 0.57, B: 0.19, C: 0.19, D: 0.05)
• November 15, 2010, “Graph 500 Takes Aim at a New Kind of HPC”, Richard Murphy (Sandia NL => Micron): “I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of the list.”
• Reality: Top500 supercomputers dominate the list; no cloud IDCs at all.
• The 8th Graph500 list (June 2014): K Computer #1, TSUBAME2 #12 (Koji Ueno, Tokyo Institute of Technology / RIKEN AICS)
  – #1: RIKEN Advanced Institute for Computational Science (AICS)’s K computer, 17977.1 GE/s at Scale 40
  – #12: Global Scientific Information and Computing Center, Tokyo Institute of Technology’s TSUBAME 2.5, 1280.43 GE/s at Scale 36
  (Both results published at the International Supercomputing Conference, June 22, 2014; congratulations from the Graph500 Executive Committee.)
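The “Kronecker graph” above refers to the synthetic R-MAT-style graphs the benchmark builds from the initiator probabilities A, B, C, D shown on the slide. A minimal Python sketch of such an edge generator; the edge factor of 16 is the standard Graph500 default and an assumption here, and the real generator additionally permutes vertex labels and runs in parallel:

```python
# Minimal Kronecker (R-MAT-style) edge generator using the initiator
# probabilities from the slide. Sketch only: no vertex permutation,
# no parallelism, edge factor 16 assumed (Graph500 default).

import random

A, B, C, D = 0.57, 0.19, 0.19, 0.05

def kronecker_edge(scale, rng=random):
    """Sample one edge of a 2^scale-vertex Kronecker graph."""
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        src <<= 1
        dst <<= 1
        if r < A:              # upper-left quadrant
            pass
        elif r < A + B:        # upper-right quadrant
            dst |= 1
        elif r < A + B + C:    # lower-left quadrant
            src |= 1
        else:                  # lower-right quadrant
            src |= 1
            dst |= 1
    return src, dst

def generate_edges(scale, edge_factor=16):
    n_edges = edge_factor * (2 ** scale)
    return [kronecker_edge(scale) for _ in range(n_edges)]

# Tiny toy instance; Scale 36/40 runs as on the slide have on the order of
# 10^12 edges and need a distributed generator.
edges = generate_edges(scale=10)
print(len(edges), "edges; first few:", edges[:3])
```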