
Capacity vs. Capability: Increase in Computing Capabilities --- Can and How do they Coexist?



  1. The TSUBAME Now and Future --- Running a 100-TeraFlops-Scale Supercomputer for Everyone as a NAREGI Resource and Its Future. Satoshi Matsuoka, Professor / Dr.Sci., Global Scientific Information and Computing Center, Tokyo Inst. of Technology & NAREGI Project, National Inst. of Informatics.
     Capacity vs. Capability: Increase in Computing Capabilities --- Can and How do they Coexist? Proliferation of e-Science via VO support; massive capacity required.

  2. TSUBAME "Grid" Cluster Supercomputer: Tokyo-tech Supercomputer and UBiquitously Accessible Mass-storage Environment. TSUBAME (燕) means "a swallow" in Japanese, Tokyo-tech (Titech)'s symbol bird and its logo (but we are home to a massive # of parakeets).
     The TSUBAME production "Supercomputing Grid Cluster", Spring 2006-2010 – "Fastest Supercomputer in Japan", 7th on the 28th Top500 @ 38.18 TF:
     • Compute: Sun Galaxy 4 nodes (Opteron dual-core, 8-socket), 10480 cores / 655 nodes, 21.4 Terabytes memory, 50.4 TeraFlops peak (back-of-envelope check below); OS Linux (SuSE 9, 10); NAREGI Grid MW
     • Network: Voltaire ISR9288 Infiniband, 10 Gbps x2 (xDDR), ~1310+50 ports, ~13.5 Terabits/s, unified IB network, plus 10 Gbps external connectivity
     • Storage: 1 Petabyte (Sun "Thumper", 48 x 500 GB disks per unit) + 0.1 Petabyte (NEC iStore); Lustre FS, NFS, CIFS, WebDAV; ~50 GB/s aggregate I/O BW
     • Acceleration: ClearSpeed CSX600 SIMD accelerator, 360 boards, 35 TeraFlops (current)
     • NEC SX-8 small vector nodes (under plan)
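
A back-of-envelope check of the compute figures above. This is a minimal sketch assuming 2.4 GHz dual-core Opterons issuing 2 double-precision FLOPs per core per clock; the clock rate and issue width are assumptions, not stated in the slides:

    /* Sanity check of the headline compute figures.
     * Assumptions (not in the slides): 2.4 GHz Opteron cores,
     * 2 double-precision FLOPs per core per clock.
     */
    #include <stdio.h>

    int main(void) {
        const int    nodes           = 655;
        const int    cores_per_node  = 16;    /* 8 sockets x 2 cores        */
        const double ghz             = 2.4;   /* assumed clock frequency    */
        const double flops_per_clock = 2.0;   /* assumed FP issue per cycle */

        double gf_per_core = ghz * flops_per_clock;              /* 4.8 GF */
        double peak_tf = nodes * cores_per_node * gf_per_core / 1000.0;

        printf("cores: %d\n", nodes * cores_per_node);           /* 10480 */
        printf("peak : %.1f TFLOPS\n", peak_tf);                 /* ~50.3 TF vs. 50.4 quoted */
        return 0;
    }

The result lands close to the quoted 50.4 TF; the small gap is within the uncertainty of the assumed clock rate, and the 10480-core and 655-node counts match exactly.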

  3. TSUBAME Global Partnership: NEC: main integrator, storage, operations; Sun: Galaxy compute nodes, storage; AMD: Opteron CPU (Fab36); Voltaire: Infiniband network; ClearSpeed: CSX600 accelerator; CFS: parallel FS (Lustre); NAREGI: Grid MW; Titech GSIC: us. Partner sites span Japan, the USA, the UK, Germany, and Israel.
     Titech TSUBAME: ~76 racks, ~350 m2 floor area, 1.2 MW (peak).

  4. Machine-room photo callouts: local Infiniband switches (288 ports each); node rear – currently 2 GB/s per node, easily scalable to 8 GB/s per node; cooling towers (~32 units); ~500 TB out of the 1.1 PB storage.
     TSUBAME Architecture = Commodity PC Cluster + Traditional FAT-node Supercomputer + The Internet & Grid + (Modern) Acceleration.

  5. Design Principles of TSUBAME (1)
     • Capability and capacity: have the cake and eat it, too!
       – High-performance, low-power x86 multi-core CPU: high INT/FP performance, high cost-performance, highly reliable; latest process technology – high performance and low power; best application & software availability: OS (Linux/Solaris/Windows), languages/compilers/tools, libraries, Grid tools, all ISV applications
       – FAT node architecture (later): multicore SMP – the most flexible parallel programming (see the hybrid sketch below); high memory capacity per node (32/64 GB); large total memory – 21.4 Terabytes; low node count – improved fault tolerance, eases network design
       – High-bandwidth Infiniband network, IP-based (over RDMA): (restricted) two-staged fat tree; high bandwidth (10-20 Gbps/link), multi-lane, low latency (<10 microsec), reliable/redundant (dual-lane); very large switches (288 ports) => low switch count, low latency; resilient to all types of communication – nearest neighbor, scatter/gather collectives, embedding multi-dimensional networks; IP-based for flexibility, robustness, synergy with Grid & Internet
     Design Principles of TSUBAME (2)
     • PetaByte-scale, high-performance, reliable storage
       – All-disk storage architecture (no tapes), 1.1 Petabytes: ultra-reliable SAN/NFS storage for /home (NEC iStore, 100 TB); fast NAS/Lustre PFS for /work (Sun Thumper, 1 PB)
       – Low-cost / high-performance SATA2 (500 GB/unit); high-density packaging (Sun Thumper), 24 TeraBytes/4U; reliability through RAID6, disk rotation, SAN redundancy (iStore) – overall HW data loss: once / 1000 years
       – High-bandwidth NAS I/O: ~50 GBytes/s on the Livermore benchmark; unified storage and cluster interconnect: low cost, high bandwidth, unified storage view from all nodes w/o special I/O nodes or SW
     • Hybrid architecture: general-purpose scalar + SIMD vector acceleration w/ ClearSpeed CSX600 – 35 TeraFlops peak @ 90 kW (~1 rack of TSUBAME); general-purpose programmable SIMD vector architecture
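
The "multicore SMP – most flexible parallel programming" point above is usually exploited in the hybrid style: one MPI rank per fat node, OpenMP threads across the cores inside it. Below is a minimal sketch of that pattern, not code from the TSUBAME software stack; the per-thread "work" is a placeholder, and the 16-thread count is only what a TSUBAME node would suggest:

    /* Minimal hybrid MPI + OpenMP sketch for a fat-node cluster:
     * one MPI rank per node, OpenMP threads across that node's cores.
     * Illustrative only; not taken from the TSUBAME project.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, nranks;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;
        /* Shared-memory parallelism inside one node (e.g. 16 cores on TSUBAME). */
        #pragma omp parallel reduction(+:local)
        {
            local += 1.0;   /* stand-in for real per-thread work */
        }

        /* Message passing only between the (relatively few) fat nodes. */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks = %d, threads counted = %.0f\n", nranks, global);

        MPI_Finalize();
        return 0;
    }

Launched with one rank per node and OMP_NUM_THREADS set to the node's core count, the same binary also runs as flat MPI or on a single SMP node, which illustrates the flexibility claimed above.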

  6. TSUBAME Timeline
     • 2005, Oct. 31: TSUBAME contract; Nov. 14th: announced @ SC2005
     • 2006, Feb. 28: stopped services of the old SCs – SX-5, Origin2000, HP GS320
     • Mar 1~Mar 7: moved the old machines out; Mar 8~Mar 31: TSUBAME installation
     • Apr 3~May 31: experimental production phase 1 – 32 nodes (512 CPUs), 97 Terabytes storage, free usage; May 1~8: whole-system Linpack, achieving 38.18 TeraFlops on May 8th, #7 on the 28th Top500
     • June 1~Sep. 30: experimental production phase 2 – 299 nodes (4748 CPUs), still free usage; Sep. 25-29: Linpack w/ ClearSpeed, 47.38 TF
     • Oct. 1: full production phase – ~10,000 CPUs, several hundred Terabytes for SC; innovative accounting: Internet-like best effort & SLA
     TSUBAME as No. 1 in Japan: >85 TeraFlops and 1.1 Petabytes, far exceeding all the other university national centers combined (45 TeraFlops and 350 Terabytes in total, on 4-year procurement cycles). Has beaten the Earth Simulator; has beaten all the other university centers combined.

  7. TSUBAME Physical Installation
     • 3 rooms (600 m2) shared with the Titech Grid Cluster service area; TSUBAME occupies 350 m2 across 2nd floor A, 2nd floor B, and the 1st floor (TSUBAME & storage)
     • 76 racks incl. network & storage (10 storage racks), 46.3 tons
     • 32 AC units, 12.2 tons; total 58.5 tons (excl. rooftop AC heat exchangers)
     • Max 1.2 MWatts; ~3 weeks construction time
     TSUBAME Network: (restricted) fat tree, IB-RDMA & TCP-IP. External Ether uplinks; bisection BW = 2.88 Tbps x 2; IB 4x 10 Gbps links, single-mode fiber for the x24 cross-floor connections; Voltaire ISR9288 switches, each serving X4600 x 120 nodes (240 ports, IB 4x 10 Gbps x2) and X4500 x 42 nodes (42 ports), with 42 ports / 420 Gbps per switch => 600 + 55 nodes, 1310 ports, 13.5 Tbps in total (arithmetic check below).
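
A quick arithmetic check of the network figures above, assuming 10 Gbps per IB 4x link as stated:

    /* Check of the port/bandwidth figures: 288-port Voltaire ISR9288
     * core switches, 10 Gbps per IB 4x link.
     */
    #include <stdio.h>

    int main(void) {
        const double gbps_per_port = 10.0;
        const int switch_ports = 288;    /* one ISR9288    */
        const int total_ports  = 1310;   /* all node ports */

        printf("one ISR9288, all ports: %.2f Tbps\n",
               switch_ports * gbps_per_port / 1000.0);  /* 2.88 Tbps, the per-switch term in "2.88 Tbps x 2" */
        printf("all 1310 ports        : %.1f Tbps\n",
               total_ports * gbps_per_port / 1000.0);   /* ~13.1 Tbps vs. ~13.5 quoted */
        return 0;
    }

The 2.88 Tbps bisection term is exactly one fully populated 288-port switch worth of 10 Gbps links, and the 1310-port aggregate comes out within a few percent of the quoted 13.5 Tbps.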

  8. The Benefits of Being a "Fat Node"
     • Many HPC apps favor large SMPs
     • Flexible programming models – MPI, OpenMP, Java, ...
     • Lower node count – higher reliability/manageability
     • Full interconnect possible – less cabling & smaller switches, multi-link parallelism, no "mesh" topologies
     Per-node comparison (the TSUBAME peak figures are checked in the sketch after this item):
     System                                   CPUs/Node  Peak/Node        Memory/Node
     IBM eServer (SDSC DataStar)              8, 32      48GF~217.6GF     16~128GB
     Hitachi SR11000 (U-Tokyo, Hokkaido-U)    8, 16      60.8GF~135GF     32~64GB
     Fujitsu PrimePower (Kyoto-U, Nagoya-U)   64~128     532.48GF~799GF   512GB
     The Earth Simulator                      16         128GF            16GB
     TSUBAME (Tokyo Tech)                     16         76.8GF + 96GF    32~64GB
     IBM BG/L                                 2          5.6GF            0.5~1GB
     Typical PC Cluster                       2~4        10~40GF          1~8GB
     Sun TSUBAME technical experiences to be published as Sun Blueprints: coming RSN; about 100 pages; principally authored by Sun's on-site engineers.
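
A quick check of the TSUBAME row above ("76.8 GF + 96 GF"), again assuming 4.8 GF per Opteron core; the 96 GF term is read off the table as the per-node ClearSpeed contribution, and one CSX600 board per accelerated node is an assumption:

    /* Check of the per-node peak in the table and the 35 TF accelerator total.
     * Assumptions: 4.8 GF per Opteron core; 96 GF per ClearSpeed board,
     * taken from the "+ 96GF" entry in the TSUBAME row.
     */
    #include <stdio.h>

    int main(void) {
        const double gf_per_core  = 4.8;
        const double gf_per_board = 96.0;

        printf("Opteron peak per node : %.1f GF\n", 16 * gf_per_core);   /* 76.8 GF  */
        printf("with one accel board  : %.1f GF\n",
               16 * gf_per_core + gf_per_board);                         /* 172.8 GF */
        printf("360 boards in total   : %.1f TF\n",
               360 * gf_per_board / 1000.0);                             /* ~34.6 TF vs. 35 TF quoted */
        return 0;
    }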

  9. TSUBAME in production, Oct. 1, 2006 (phase 3): ~10400 CPUs (plot of # available CPUs over time).
     TSUBAME Reliability
     • Very high availability (over 99%)
     • Faults are frequent but have localized effect only – jobs are automatically restarted by SGE
     • Most faults are NOT HW, mostly SW – fixed with reboots & patches
     TSUBAME Fault Overview, 8/15/2006 - 9/8/2006 (655 compute nodes, 2016 Thumper HDDs; MTBF derivation reproduced below):
                          Total   Compute node  Possible HW       HW faults         Thumper HDD   HW
                          faults  faults        (incl. unknowns)  (excl. unknowns)  faults        breakage
     Total (24 days)      39      34            12                3                 4             7
     Per day              1.63    1.42          0.50              0.13              0.17          0.29
     Over a year          593.1   517.1         182.5             45.6              60.8          106.5
     Unit MTBF (years)    1.1043  1.26672       3.589041          14.356164         33.13973      -
     Unit MTBF (hours)    9,674   11,096        31,440            125,760           290,304       -
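
The MTBF rows of the fault table follow mechanically from the raw counts: faults per day = faults / 24 days, faults per year = per-day x 365, unit MTBF in years = number of units / faults per year, and hours = years x 8760. A minimal sketch reproducing three of the table's columns (the pairing of the 2016-HDD unit count with the 4-fault column is inferred from this arithmetic):

    /* Reproduce the MTBF figures in the fault table above. */
    #include <stdio.h>

    static void mtbf(const char *label, double faults, double units) {
        double per_day  = faults / 24.0;          /* 24-day observation window  */
        double per_year = per_day * 365.0;
        double years    = units / per_year;       /* mean time between failures */
        printf("%-26s %.2f/day  %5.1f/yr  MTBF %6.2f yr (%7.0f h)\n",
               label, per_day, per_year, years, years * 8760.0);
    }

    int main(void) {
        mtbf("total faults",            39.0,  655.0);  /* 1.63/day, 593.1/yr,  1.10 yr,   ~9,674 h */
        mtbf("HW faults (excl. unk.)",   3.0,  655.0);  /* 0.13/day,  45.6/yr, 14.36 yr, ~125,760 h */
        mtbf("Thumper HDD faults",       4.0, 2016.0);  /* 0.17/day,  60.8/yr, 33.14 yr, ~290,304 h */
        return 0;
    }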

  10. TSUBAME Applications --- Massively Complex Turbulent Flow and its Visualization (by the Tanahashi Lab and Aoki Lab, Tokyo Tech): turbulent flow from an airplane; Taylor-Couette flow.
     TSUBAME Turbulent Flow Visualization (Prof. Tanahashi and Prof. Aoki, Tokyo Tech)
     • Used TSUBAME for both computing and visualization
     • 2000 CPUs for visualization (parallel AVS)
     • 20 billion polygons
     • 20,000 x 10,000 pixels
