Computer simulations create the future
Operation of the K computer and the facility
Fumiyoshi Shoji (Division Director)
Operations and Computer Technologies Div.
RIKEN Center for Computational Science
An announcement of the K computer’s shutdown
2019/1/31 https://www.r-ccs.riken.jp/en/topics/20190131.html
[Figure: timeline, 2006-2019]
• I moved to RIKEN and joined the early phase of the K project
• design & construction (facility)
• design & construction (K computer)
• early access
• official operation
over 8 years!
K computer and achievements
The K computer:
• developed through collaboration between RIKEN and FUJITSU in a Japanese national project.
• designed for general-purpose computing.
  – no accelerators
  – broad memory/interconnect bandwidth
Achievements:
• TOP500 list: No.1 in Jun. and Nov. 2011 (#18 in the latest list)
  – The world’s first supercomputer to achieve over 10PF HPL performance.
• Graph500 list: No.1 in Jun. 2014, Jul. 2015 – Nov. 2018.
• HPCG results: No.1 in Nov. 2016 – Nov. 2017 (#3 in the latest list)
• Gordon Bell prize: Winner in 2011 and 2012
• Other remarkable results for science and engineering
  – See http://www.r-ccs.riken.jp/en/
System overview
[Figure: system diagram of the K computer]
• Compute nodes: # of CPUs 82,944; memory capacity 1.27PiB; 6D mesh/torus network (Tofu)
• Local File System (LFS): 11PB
• Global File System (GFS): 40PB
• I/O nodes, frontend servers, pre/post servers, control servers, management servers
• Networks: control & management network, global I/O network
• Users connect via the Internet
Facility overview (power supply)
Total power consumption: 14-16MW
[Figure: power supply diagram]
• Power supply company -> substation (30MW): 11-12MW
• Gas supply company -> two gas turbine power generators (5MW each), active/standby: 3-5MW
• Loads: K computer, storages & servers, air handlers, chillers, etc.
Facility overview (Co-generation system)
[Figure: energy flow. Gas fed to the gas turbine power generators (5MW x 2) is split into electricity, steam heat for air conditioning (absorption chillers), and unusable waste heat; the shares labelled in the figure are ~25%, ~30% and ~45%.]

Chiller type   Quantity   Cooling capability (USRt)   Cooling capability (MW)   Power consumption for cooling (kW)
Absorption     4          1,700                       5.98                      273
Centrifugal    2          1,400                       4.93                      901
Centrifugal    1          700                         2.46                      389

• The co-generation system enables higher energy efficiency.
• On the other hand, because the power generators and chillers are tightly coupled, facility operation is much more complicated.
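
The MW column of the chiller table can be cross-checked against the USRt column with the standard conversion 1 USRt = 12,000 BTU/h, about 3.517 kW. The following is a minimal Python sketch using only the figures printed in the table; the function name `usrt_to_mw` is illustrative and not from the slides.

```python
# 1 US refrigeration ton (USRt) = 12,000 BTU/h, about 3.517 kW of cooling.
KW_PER_USRT = 3.517


def usrt_to_mw(usrt: float) -> float:
    """Convert a cooling capability from USRt to MW."""
    return usrt * KW_PER_USRT / 1000.0


# (chiller type, quantity, USRt as printed, MW as printed on the slide)
chillers = [
    ("Absorption", 4, 1700, 5.98),
    ("Centrifugal", 2, 1400, 4.93),
    ("Centrifugal", 1, 700, 2.46),
]

for kind, qty, usrt, slide_mw in chillers:
    print(f"{kind} x{qty}: {usrt_to_mw(usrt):.2f} MW (slide: {slide_mw} MW)")
```

The converted values agree with the slide to within about 0.01 MW, which is consistent with rounding.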
Statistics
2012/9/28 – 2019/2/3 (6 years and 4 months)
• # of projects: 649
• # of (real) users: 3,570
• # of processed jobs: 3,491,472
• Total used node hours (*): 3,389,123,489
(*) 73.5% for 6 years and 4 months
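
The 73.5% figure can be approximately reproduced from the other numbers in the talk. The sketch below is an assumption about how the ratio is formed, not the author's stated method: it divides the used node hours by the node hours of all 82,944 compute nodes over every wall-clock hour of the period, treating one CPU as one node.

```python
from datetime import date

NODES = 82_944                   # compute nodes (assumes one CPU per node)
USED_NODE_HOURS = 3_389_123_489  # total used node hours from the slide


def node_hour_utilization(start: date, end: date) -> float:
    """Used node hours divided by all node hours between start and end."""
    wall_hours = (end - start).days * 24
    return USED_NODE_HOURS / (NODES * wall_hours)


rate = node_hour_utilization(date(2012, 9, 28), date(2019, 2, 3))
print(f"{rate:.1%}")
```

Under these assumptions the result lands within about 0.1 percentage point of the slide's 73.5%; the small difference presumably comes from how the period boundaries or the node count were taken.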
Yearly availability & job filling rate
[Figure: bar chart per fiscal year, FY2012 (Sep.-Mar.) to FY2018 (Apr.-Jan.); series: availability, scheduled maintenance, irregular system down, job filling rate. Availability ranges from 91.9% to 95.0%, job filling rate from 61.2% to 78.9%, and scheduled maintenance and irregular system down each lie between 1.2% and 6.3%.]

Availability = (365d × 24h − scheduled maintenance − irregular system down) / (365d × 24h)
Job filling rate = (Node time used by job) / (Available time)

• Availability is kept at a high level (~95%)
• Irregular system down has been suppressed to less than 2% in the last 3 years
• Considering the direct interconnection between nodes and the block-wise job allocation, the job filling rate is at a sufficiently high level.
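
The two definitions on this slide can be written as a short Python sketch. The input values in the example are hypothetical and chosen only to illustrate the arithmetic; they are not read from the chart. The sketch also assumes that "available time" is counted in node hours, that is, the number of nodes times the available wall-clock hours.

```python
HOURS_PER_YEAR = 365 * 24


def availability(scheduled_maintenance_h: float,
                 irregular_down_h: float,
                 period_h: float = HOURS_PER_YEAR) -> float:
    """(period - scheduled maintenance - irregular down) / period."""
    return (period_h - scheduled_maintenance_h - irregular_down_h) / period_h


def job_filling_rate(node_hours_used: float,
                     nodes: int,
                     available_h: float) -> float:
    """Node time used by jobs divided by available node time.

    Assumption: available time is nodes x available wall-clock hours.
    """
    return node_hours_used / (nodes * available_h)


# Hypothetical year: 350.4 h of maintenance, 131.4 h of irregular down time.
a = availability(350.4, 131.4)
available_h = HOURS_PER_YEAR * a
# Hypothetical usage: 76% of the available node hours on 82,944 nodes.
f = job_filling_rate(0.76 * 82_944 * available_h, 82_944, available_h)
print(f"availability={a:.1%}, job filling rate={f:.1%}")
```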
Irregular system down
[Figure: irregular system down time (hours, axis 0-350) per fiscal year, FY2012 (Sep.-Mar.) to FY2018 (Apr.-Jan.), broken down by cause: LFS, GFS, job scheduler, MPI, misc. FY2018 annotations: thunder: 43.0h, typhoon: 46.4h. The availability chart from the previous slide is shown as an inset.]
• File system failures (GFS & LFS) are the dominant cause of irregular system down.
• Since FY2015, we have given priority to resuming service quickly over investigating the cause of failures.
• Misc. in FY2018 includes a failure of the power supply facility due to heavy rain and wind from a typhoon (8/20) and a power outage caused by thunder (6/8).
Improvements (PUE)
Power consumption & PUE (Power Usage Effectiveness)
[Figure: quarterly series from Sep-12 to Dec-18; left axis 0-100%, right axis PUE 1.32-1.50; series: computing resources, cooling (chillers, etc.), cooling (air handlers, etc.), PUE(#49)]
• Optimization of air cooling operation (2012-2013)
• Optimization of power generator and chillers (2018-)
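
PUE is the ratio of total facility power to the power drawn by the computing equipment. As a minimal Python sketch, assuming the three categories in the chart (computing resources, chillers, air handlers) make up the whole facility load, it can be computed as below. The function name and the example figures are hypothetical and are not measurements from the K computer facility.

```python
def pue(computing_mw: float, chillers_mw: float, air_handlers_mw: float) -> float:
    """Power Usage Effectiveness: total facility power / computing power.

    Assumption: the three inputs together account for the total
    facility power; other loads are ignored.
    """
    if computing_mw <= 0:
        raise ValueError("computing power must be positive")
    total_mw = computing_mw + chillers_mw + air_handlers_mw
    return total_mw / computing_mw


# Hypothetical example: 12 MW computing, 3 MW chillers, 2 MW air handlers.
example = pue(12.0, 3.0, 2.0)
print(f"PUE = {example:.2f}")
```

A lower PUE means a smaller share of the site's power goes to cooling, which is why both optimizations on the slide target the cooling side.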
Improvements (Power capping)
To avoid a penalty when power consumption exceeds the upper limit
[Figure: typical power consumption history (4/14 9:00-14:00); y-axis: power consumption (MW), 0-20; series: site total, K computer only; phases: idle, full node job running (elapsed time ~2h30m), idle; annotations: 4MW up, 4MW down]
Large and rapid change of power consumption:
• 14 -> 18MW and 18 -> 14MW very quickly.
• There is concern about the impact on the power and cooling facilities.
Preview process for large scale jobs (more than 40% of full system):
• Users who want to execute a large scale job must execute a small version (10% of full system) of the job before the large scale mode period.
• We evaluate the power consumption profile of the job, estimate the upper power consumption, and decide whether to admit the job.
Prepare for large power consumption:
• If the estimated power consumption exceeds the limit, we also consider activating the 2nd power generator while the job is running.
Safety valve:
• If power consumption nevertheless exceeds the limit, the monitoring system detects it and the job is killed automatically.
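
The preview process, the second-generator option and the safety valve can be sketched as a simple admission check. This is an illustrative Python sketch only: the slides do not say how the upper power consumption is estimated, so the linear scaling from the 10% preview run is an assumption, as are all names, the limit value, and the idea of modelling the second generator as extra headroom.

```python
from dataclasses import dataclass


@dataclass
class PreviewProfile:
    peak_rise_mw: float        # peak power rise above idle in the preview run
    fraction_of_system: float  # e.g. 0.10 for a 10% preview run


def estimate_peak_site_mw(profile: PreviewProfile, idle_site_mw: float,
                          target_fraction: float = 1.0) -> float:
    """Estimate site peak power for the large job.

    ASSUMPTION: the power rise scales linearly with the node count.
    """
    scale = target_fraction / profile.fraction_of_system
    return idle_site_mw + profile.peak_rise_mw * scale


def admit(profile: PreviewProfile, idle_site_mw: float, limit_mw: float,
          second_generator_mw: float = 0.0, target_fraction: float = 1.0) -> str:
    """Decide whether the large job may run (hypothetical policy)."""
    estimate = estimate_peak_site_mw(profile, idle_site_mw, target_fraction)
    if estimate <= limit_mw:
        return "admit"
    # ASSUMPTION: the 2nd generator is modelled as extra headroom.
    if estimate <= limit_mw + second_generator_mw:
        return "admit_with_second_generator"
    return "reject"


def should_kill(site_mw: float, limit_mw: float) -> bool:
    """Safety valve: kill the job once the measured power exceeds the limit."""
    return site_mw > limit_mw


# A 0.4 MW rise at 10% scale suggests about 14 -> 18 MW at full scale.
preview = PreviewProfile(peak_rise_mw=0.4, fraction_of_system=0.10)
print(admit(preview, idle_site_mw=14.0, limit_mw=17.0, second_generator_mw=5.0))
```

The example numbers are chosen so that the estimate matches the 14 -> 18MW step shown on the slide; the 17 MW limit is made up for illustration.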
Improvements (others)
• for active user support
  – based on analysis of automatically collected job profiling data, the user support team can identify and approach users who have potential for performance improvement.
• “micro” queue
  – job queue for small jobs to fill spatial and temporal scheduling gaps.
• “Waiting for K”
  – command which provides the estimated waiting time from submission to run.
• “ksub”
  – command which allows users to submit a number of jobs larger than the system limit.
• “K pre-post cloud”
  – An OpenStack-based pre-post environment for various user needs
• “R-CCS software center”
  – An activity to support development and promotion of outstanding software made in R-CCS.
• etc.
Towards operation/services of Post-K
• Increase the effective usage rate
  – to increase the job filling rate by +10%, we should consider rational node allocation and “charge” rules
  – to increase availability and decrease PUE, we have to improve the efficiency and quality of operation, including automation based on data analysis
• Improve service quality
  – commit to constructing a software eco-system
  – collaborate with service providers
We are now discussing the operation of Post-K.
Numerous users/projects from various fields of science and engineering will come to Post-K.
Thank you for your attention