
Operation of the K computer and the facility, Fumiyoshi Shoji



  1. Operation of the K computer and the facility
     Fumiyoshi Shoji (Division Director)
     Operations and Computer Technologies Div., RIKEN Center for Computational Science
     Computer simulations create the future

  2. An announcement of the K computer’s shutdown
     2019/1/31 https://www.r-ccs.riken.jp/en/topics/20190131.html
     [Timeline, 2006-2019: I moved to RIKEN and joined the early phase of the K project; design & construction (facility); design & construction (K computer); early access; official operation.]
     Over 8 years!

  3. K computer and achievements
     The K computer:
     • developed in a collaboration between RIKEN and FUJITSU as a Japanese national project.
     • designed for general-purpose computing.
       – no accelerators
       – broad memory/interconnect bandwidth
     Achievements:
     • TOP500 list: No.1 in Jun. and Nov. 2011 (#18 in the latest list)
       – The world’s first supercomputer to achieve over 10PF HPL performance.
     • Graph500 list: No.1 in Jun. 2014 and Jul. 2015 – Nov. 2018.
     • HPCG results: No.1 in Nov. 2016 – Nov. 2017 (#3 in the latest list)
     • Gordon Bell prize: Winner in 2011 and 2012
     • Other remarkable results for science and engineering
       – See http://www.r-ccs.riken.jp/en/

  4. System overview
     [Diagram: the K computer]
     • Compute nodes
       – # of CPU: 82,944
       – Memory capacity: 1.27PiB
       – 6D mesh/torus network (Tofu)
     • I/O nodes
     • Local File System (LFS) (11PB)
     • Global File System (GFS) (40PB)
     • Pre/Post Servers, Frontend Servers (reached by users over the Internet)
     • Management Servers, Control Servers
     • Networks: Control & Management network, Global I/O network

  5. Facility overview (power supply)
     • Total power consumption: 14-16MW
       – K computer: 11-12MW
       – Storages & servers, air handlers, chillers, etc.
     • Power supply company → substation (30MW)
     • Gas supply company → gas turbine power generators (5MW x 2), labelled "5MW active / 3-5MW standby"

  6. Facility overview (Co-generation system)
     [Diagram: gas feeds the gas turbine power generators (5MW x 2). The output is split into electricity, heat (steam for air conditioning, used by the absorption chillers) and unusable waste heat; the shares shown are ~25%, ~30% and ~45%.]

     | Chiller type | Quantity | Cooling capability (USRt) | Cooling capability (MW) | Power consumption for cooling (kW) |
     | Absorption   | 4        | 1,700                     | 5.98                    | 273                                |
     | Centrifugal  | 2        | 1,400                     | 4.93                    | 901                                |
     | Centrifugal  | 1        | 700                       | 2.46                    | 389                                |

     • The co-generation system achieves higher energy efficiency.
     • On the other hand, because the power generators and the chillers are tightly coupled, facility operation is much more complicated.
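The two cooling-capability columns of the chiller table give the same quantity in different units. As a quick consistency check, the sketch below converts US refrigeration tons to megawatts. It assumes the standard definition 1 USRt = 12,000 BTU/h ≈ 3.51685 kW; that constant and the function name are not taken from the slides.

```python
# Assumption (not from the slides): 1 US refrigeration ton = 12,000 BTU/h ≈ 3.51685 kW.
KW_PER_USRT = 3.51685


def usrt_to_mw(usrt: float) -> float:
    """Convert a cooling capability from US refrigeration tons to megawatts."""
    return usrt * KW_PER_USRT / 1000.0


# (chiller type, quantity, capability in USRt, capability in MW) as listed on the slide.
CHILLERS = [
    ("Absorption", 4, 1700, 5.98),
    ("Centrifugal", 2, 1400, 4.93),
    ("Centrifugal", 1, 700, 2.46),
]

for name, quantity, usrt, mw_on_slide in CHILLERS:
    print(f"{name} x{quantity}: {usrt} USRt = {usrt_to_mw(usrt):.2f} MW "
          f"(slide: {mw_on_slide} MW)")
```

The converted values agree with the MW column to within rounding.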

  7. Statistics
     2012/9/28 – 2019/2/3 (6 years and 4 months)
     • # of projects: 649
     • # of (real) users: 3,570
     • # of processed jobs: 3,491,472
     • Total used node hours: 3,389,123,489 (*)
     (*) 73.5% for 6 years and 4 months
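The 73.5% footnote can be roughly cross-checked from the other numbers in the deck. The sketch below assumes the percentage means used node hours divided by (number of nodes x wall-clock hours in the period), and that each of the 82,944 CPUs on the system overview slide is one node. Both are assumptions; the slides do not state how the figure was derived.

```python
from datetime import date

NODES = 82_944                   # "# of CPU" on the system overview slide; one CPU per node is assumed
USED_NODE_HOURS = 3_389_123_489  # total used node hours from the statistics slide
START, END = date(2012, 9, 28), date(2019, 2, 3)


def utilization(used_node_hours: int, nodes: int, start: date, end: date) -> float:
    """Used node hours divided by nodes x wall-clock hours between start and end."""
    wall_clock_hours = (end - start).days * 24
    return used_node_hours / (nodes * wall_clock_hours)


print(f"{utilization(USED_NODE_HOURS, NODES, START, END):.1%}")
```

Under these assumptions the result lands within a fraction of a percentage point of the figure on the slide; the small gap presumably comes from how the period boundaries were counted.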

  8. Yearly availability & job filling rate
     Availability = (365d × 24h − scheduled maintenance − irregular system down) / (365d × 24h)
     Job filling rate = Node time used by job / Available time
     [Chart: availability, scheduled maintenance, irregular system down and job filling rate per fiscal year, FY2012 (Sep.-Mar.) to FY2018 (Apr.-Jan.). Plotted values range over 91.9%-95.0% (availability), 3.2%-6.3% (scheduled maintenance), 1.2%-3.5% (irregular system down) and 61.2%-78.9% (job filling rate).]
     • The availability rate is kept at a high level (~95%).
     • Irregular system down has been suppressed to less than 2% in the last 3 years.
     • Considering the direct interconnection between nodes and the block-wise job allocation, the job filling rate is at a sufficiently high level.
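The two metrics on this slide are simple ratios: availability is the share of the year left after scheduled maintenance and irregular system down, and job filling rate is node time used by jobs over available time. A minimal sketch follows. The input hours are hypothetical, and treating "available time" as available hours multiplied by the node count is an assumption made so that the second ratio is dimensionless.

```python
HOURS_PER_YEAR = 365 * 24


def availability(scheduled_maintenance_h: float, irregular_down_h: float,
                 period_h: float = HOURS_PER_YEAR) -> float:
    """(period - scheduled maintenance - irregular system down) / period."""
    return (period_h - scheduled_maintenance_h - irregular_down_h) / period_h


def job_filling_rate(node_hours_used_by_jobs: float, available_node_hours: float) -> float:
    """Node time used by jobs divided by the node time that was available."""
    return node_hours_used_by_jobs / available_node_hours


# Hypothetical year: 330 h of scheduled maintenance and 108 h of irregular down time.
maintenance_h, down_h = 330.0, 108.0
nodes = 82_944
available_node_hours = nodes * (HOURS_PER_YEAR - maintenance_h - down_h)
used_node_hours = 0.75 * available_node_hours  # hypothetical usage

print(f"availability:     {availability(maintenance_h, down_h):.1%}")
print(f"job filling rate: {job_filling_rate(used_node_hours, available_node_hours):.1%}")
```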

  9. Irregular system down
     [Chart: irregular system down time in hours (axis 0-350) per fiscal year, FY2012 (Sep.-Mar.) to FY2018 (Apr.-Jan.), broken down by LFS, GFS, job scheduler, MPI and misc. Annotations: lightning 43.0h, typhoon 46.4h.]
     • File system failures (GFS & LFS) are the dominant cause of irregular system down.
     • Since FY2015, we have given priority to resuming service quickly over investigating the cause of failures.
     • Misc. in FY2018 includes a failure of the power supply facility due to severe rain and wind from a typhoon (8/20) and a power outage caused by lightning (6/8).

  10. Improvements (PUE)
      Power consumption & PUE (Power Usage Effectiveness)
      [Chart: share of power consumption (0-100%) by computing resources, cooling (chillers, etc.) and cooling (air handlers, etc.), with PUE(#49) on a second axis (1.32-1.50), quarterly from Sep-12 to Dec-18.]
      • Optimization of air cooling operation (2012-2013)
      • Optimization of power generator and chillers (2018-)
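For readers unfamiliar with the metric, PUE is conventionally defined as total facility power divided by the power drawn by the computing (IT) equipment, so 1.0 is the ideal. The sketch below uses that standard definition with hypothetical power figures; the slides do not say exactly which loads were counted on each side.

```python
def pue(total_facility_power_mw: float, it_power_mw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_power_mw <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_power_mw / it_power_mw


# Hypothetical figures, chosen only to illustrate the ratio.
before = pue(total_facility_power_mw=15.0, it_power_mw=10.2)
after = pue(total_facility_power_mw=14.2, it_power_mw=10.2)
print(f"PUE before: {before:.2f}, after: {after:.2f}")
```

With the computing load unchanged, every megawatt saved on cooling lowers PUE directly, which is why both improvements listed on the slide target the cooling side.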

  11. Improvements (Power capping)
      To avoid a penalty when power consumption exceeds the upper limit.
      [Chart: typical power consumption history (4/14 9:00-14:00), power consumption in MW (axis 0-20) for "Site total" and "K computer only". Phases: idle, full node job running (elapsed time ~2h30m), idle. Annotations: 4MW up, 4MW down.]
      • Large and rapid changes of power consumption: 14 -> 18MW and 18 -> 14MW very quickly.
      • There is concern about the impact on the power and cooling facilities.
      Preview process for large scale jobs (more than 40% of the full system):
      • Users who want to execute a large scale job must execute a small version of it (10% of the full system) before the large scale mode period.
      • We evaluate the power consumption profile of the job, estimate its peak power consumption and decide whether to admit the job.
      Preparing for large power consumption:
      • If the estimated power consumption exceeds the limit, we also consider activating the 2nd power generator while the job is running.
      Safety valve:
      • If power consumption nevertheless exceeds the limit, the monitoring system detects it and the job is killed automatically.
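The preview process described on this slide can be sketched as a small admission check. The slides do not say how the peak is estimated from the 10% run, so the linear extrapolation below is an assumption, as are all names, the three-way decision and the numeric limits.

```python
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    ADMIT_WITH_SECOND_GENERATOR = "admit with 2nd power generator"
    REJECT = "reject"


def estimate_peak_mw(idle_site_mw: float, preview_increase_mw: float,
                     preview_fraction: float = 0.10,
                     target_fraction: float = 1.0) -> float:
    """Linearly scale the power increase seen in the small preview run (assumption)."""
    if not (0 < preview_fraction <= 1 and 0 < target_fraction <= 1):
        raise ValueError("fractions must be in (0, 1]")
    return idle_site_mw + preview_increase_mw * (target_fraction / preview_fraction)


def decide(estimated_peak_mw: float, limit_mw: float,
           second_generator_mw: float) -> Decision:
    """Hypothetical policy: admit, admit with extra generation, or reject."""
    if estimated_peak_mw <= limit_mw:
        return Decision.ADMIT
    if estimated_peak_mw <= limit_mw + second_generator_mw:
        return Decision.ADMIT_WITH_SECOND_GENERATOR
    return Decision.REJECT


# Hypothetical preview: the 10% run raised site power by 0.4 MW over a 14 MW idle level.
peak = estimate_peak_mw(idle_site_mw=14.0, preview_increase_mw=0.4)
print(f"estimated peak: {peak:.1f} MW ->",
      decide(peak, limit_mw=17.0, second_generator_mw=5.0).value)
```

The automatic job kill mentioned as the safety valve would sit outside this check, in the monitoring system, and is not modelled here.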

  12. Improvements (others)
      • Active user support
        – Based on data analysis of automatically collected job profiling data, the user support team can identify and approach users whose jobs have potential for performance improvement.
      • “micro” queue
        – A job queue for small jobs that fill spatial and temporal scheduling gaps.
      • “Waiting for K”
        – A command which provides the estimated waiting time between submission and run.
      • “ksub”
        – A command which allows users to submit a number of jobs larger than the system limit.
      • “K pre-post cloud”
        – An OpenStack based pre-post environment for various user needs.
      • “R-CCS software center”
        – An activity to support the development and promotion of outstanding software made in R-CCS.
      • etc.

  13. Towards operation/services of Post-K
      • Increase the effective usage rate
        – To increase the job filling rate by +10%, we should consider rational node allocation and “charge” roles.
        – To increase availability and decrease PUE, we have to improve the efficiency and quality of operation, including automation based on data analysis.
      • Improve service quality
        – Commit to constructing a software eco-system.
        – Collaborate with service providers.
      We are now discussing the operation of Post-K.
      Numerous users/projects from various fields of science and engineering will come to Post-K.

  14. Thank you for your attention
