HPC @ SAO
S.G. Korzennik - SAO HPC Analyst (hpc@cfa)
February 2013


Outline
◮ Results of the survey
◮ What is Hydra
◮ How to use Hydra
◮ Answers to some survey questions
◮ Discussion: h/w, s/w, other

Introduction: Results of the Survey

Introduction: Results of the Survey - cont'd

What is Hydra
◮ Hydra is a Linux-based Beowulf cluster.
◮ Started at SAO a while back, managed by the CF.
◮ Moved to the Smithsonian's Data Center, in Herndon, VA.
◮ Managed by SI's Office of Information Technology Operations (OITO/OCIO).
◮ Has grown from a 200+ to a 3000+ core machine.
◮ Has become an SI-wide resource.
◮ The cluster is managed by DJ Ding, sys-admin (in Herndon, VA).
◮ Additional support for SAO: HPC analyst (0.25 FTE).

What is Hydra: Hardware
◮ 296 compute nodes, distributed over 10 racks.
◮ Total of 3,116 compute cores (CPUs).
◮ All the nodes are interconnected on regular Ethernet (1 Gbps).
◮ Some nodes are on an InfiniBand (40 Gbps) fabric (856 cores on IB).
◮ Some 40 TB of public disk space (56% full).
◮ Comparable user-specific disk space (individual purchases).
◮ A parallel file system (60 TB), leveraging the IB fabric, is in the works.

CF/HPC web page, incl. how to request an account:
http://www.cfa.harvard.edu/cf/services/cluster
The hardware configuration is described in the HPC Wiki:
https://www.cfa.harvard.edu/twiki/bin/view/HPC/WebHome

What is Hydra: Hardware - cont'd

What is Hydra: Software
◮ The cluster is a Linux-based distributed cluster, running Rocks.
◮ Uses the Grid Engine queuing system (aka SGE, OGE or GE).
◮ Access to the cluster is via 2 login nodes:
   ◮ hydra.si.edu, or
   ◮ hydra-login.si.edu
◮ From one of the login nodes:
   ◮ you submit jobs via the queuing system: qsub
   ◮ all jobs run in batch mode
◮ You do not start jobs interactively on any of the compute nodes,
   ◮ instead, you submit a script and request resources
   ◮ the GE selects the compute nodes and starts your job on that/these node(s).
◮ The login nodes are for normal interactive use like editing, compiling, script writing, short debugging, etc.

What is Hydra: Software - cont'd
◮ Compilers (3)
   ◮ GNU: gcc, f77
   ◮ Intel: icc, icpc, ifort, including the Cluster Studio
   ◮ PGI: pgcc, pgc++, pgf77, pgf90 (Portland Group), incl. the Cluster Development Kit (CDK)
◮ Libraries
   ◮ MPI for all compilers, including InfiniBand support
   ◮ Math libraries that come w/ compilers
   ◮ AMD math libraries
◮ Packages
   ◮ IDL, including 128 run-time licenses, GDL
   ◮ IRAF (v1.7)

If you need some specific software, or believe that the user community would benefit from some additional software, let me know (hpc@cfa).

What is Hydra: Documentation
At the HPC Wiki:
◮ Primers
   ◮ How to compile
   ◮ How to submit jobs
   ◮ How to monitor your jobs
   ◮ How to use IDL & GDL on the cluster
   ◮ How to copy files to/from the cluster and what disk(s) to use
◮ FAQs
   ◮ Queues
   ◮ Error Messages
   ◮ Compilers
   ◮ Disk Use
   ◮ S/W packages
◮ Man Pages for Grid Engine Commands

https://www.cfa.harvard.edu/twiki/bin/view/HPC/WebHome

Hydra: Connect and Support
◮ Connect to Hydra
   ◮ Must ssh to one of the Hydra login nodes from a trusted host: login.cfa.harvard.edu, CF/HEA managed hosts, VPN
   ◮ Accounts and directories are separate from CF/HEA
   ◮ SD 931 enforced (incl. password expiration)
◮ Support for Hydra
   ◮ DJ Ding - sys-admin (OITO/OCIO)
   ◮ SGK - HPC analyst (SAO)
   ◮ HPCC-L on si-listserv.si.edu (mailing list on SI's list-serv)
   ◮ Do not contact CF/HEA support (except for requesting an account)
   ◮ Contact the OCIO Help Desk (OCIOHelpDesk@si.edu) for password reset or other access-related issues.
◮ Configuration is different from CF/HEA systems
   ◮ Must customize ~/.cshrc, ~/.bash_profile & ~/.bashrc
   ◮ Look under ~hpc

Hydra: Disks and Copy
◮ Disks on Hydra
   ◮ /home - is not for data storage
   ◮ Public space, first come first served basis
      ◮ /pool/cluster* (34 TB over 7 filesys, half used, no scrubber)
      ◮ /pool/temp* (8 TB over 2 filesys, 10% used, 14 day scrubber)
      ◮ /pool/cluster* is not for long term storage
   ◮ Parallel file system: under development (PVFS: 60 TB)
   ◮ User-specific storage: possible
   ◮ Local disk space (on compute nodes): uneven & discouraged
   ◮ Not cross-mounted to CfA.
◮ Copy to/from Hydra
   ◮ Use scp, sftp or rsync
   ◮ Use rsync --bwlimit=1000 for large transfers (> 10 GB):
         rsync --bwlimit=1000 -azv * hydra:/pool/cluster2/user/mydata/.
   ◮ Serialize or limit the number of heavy I/O operations: cp, mv, rm, scp, rsync ...

Public key (ssh-keygen) OK: no password needed w/ ssh or rsync
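The advice to throttle and serialize large transfers can be sketched as a small wrapper. This is only a minimal sketch, not a site-provided tool: the function name throttled_copy, the DRY_RUN switch and the directory names in the usage comment are hypothetical; the rsync options and the destination path are the ones quoted on the slide.

```shell
#!/bin/sh
# Copy several directories to Hydra one after the other (serialized),
# each with the bandwidth limit suggested on the slide.
# DRY_RUN=1 only prints the commands instead of running them.

DEST="hydra:/pool/cluster2/user/mydata/."   # destination quoted on the slide

throttled_copy() {
    for src in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo rsync --bwlimit=1000 -azv "$src" "$DEST"
        else
            # stop at the first failed transfer
            rsync --bwlimit=1000 -azv "$src" "$DEST" || return 1
        fi
    done
}

# Example (hypothetical directory names):
#   throttled_copy run_a run_b run_c
```

Running the transfers in a single loop, rather than starting several rsync processes at once, keeps only one heavy I/O stream active at a time.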

Hydra: Compile and Submit
◮ Compilers (3): GNU, PGI, Intel
   ◮ Cannot mix and match compilers/libraries
   ◮ Same 3 compilers available on CF-managed machines
   ◮ MPI and IB support avail. for all 3, can be tricky
   ◮ OpenMP avail. for all 3 (de facto h/w limit)
◮ Submit jobs
   ◮ qsub: submit job(s)
   ◮ qstat: monitor job(s)
   ◮ qalter: change job resource(s)
   ◮ qdel: kill queued or running job(s)
   ◮ qconf: query queue(s) configuration
   ◮ qacct: query queue(s) accounting (used resources)
◮ MPI, IB & OpenMP: must use
   ◮ appropriate compiler flags and libs
   ◮ corresponding execs, scripts and queues

Hydra: Trivial example

    hydra% pwd
    /home/user/test
    hydra% cat hello.c
    #include <stdio.h>
    int main() {
      printf("hello world!\n");
    }
    hydra% pgcc -o hello hello.c
    hydra% cat hello.job
    # using csh syntax (default)
    echo hello.job started `date` in queue $QUEUE \
      with jobid=$JOB_ID on `hostname`
    uptime
    pwd
    ./hello
    echo hello.job done `date`
    hydra% ls
    hello  hello.c  hello.job

Hydra: Trivial example - cont'd

    hydra% qsub -cwd -j y -o hello.log -N hello hello.job
    Your job 4539322 ("hello") has been submitted
    hydra% qstat -u user
    job-ID  prior   name  user state submit/start at     queue                    slots ja-task-ID
    ----------------------------------------------------------------------------------------------
    4539322 0.00000 hello user qw    01/10/2013 18:01:40                          1
    hydra% qstat -u user
    job-ID  prior   name  user state submit/start at     queue                    slots ja-task-ID
    ----------------------------------------------------------------------------------------------
    4539322 0.50500 hello user r     01/10/2013 18:01:53 sTz.q@compute-1-29.local 1
    hydra% ls
    hello  hello.c  hello.job  hello.log
    hydra% cat hello.log
    hello.job started Thu Jan 10 18:01:53 EST 2013 in queue sTz.q with jobid=4539322 on compute-1-29.local
     18:01:53 up 211 days, 29 min, 0 users, load average: 0.00, 0.00, 0.00
    /home/user/test
    hello world!
    hello.job done Thu Jan 10 18:01:54 EST 2013
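For scripted workflows it can help to capture the job ID that qsub prints. The sketch below parses the confirmation message shown in the example; the helper name job_id_from_qsub is hypothetical, and it assumes the message format of a regular (non-array) job exactly as it appears above.

```shell
#!/bin/sh
# Extract the numeric job ID from qsub's confirmation line, e.g.
#   Your job 4539322 ("hello") has been submitted
# Assumes the regular-job message format shown in the example.
job_id_from_qsub() {
    awk '$1 == "Your" && $2 == "job" { print $3 }'
}

# Typical use on a login node (not run here):
#   jobid=$(qsub -cwd -j y -o hello.log -N hello hello.job | job_id_from_qsub)
#   qstat -j "$jobid"     # detailed status of that job
#   qdel "$jobid"         # kill it if needed
```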

Hydra: Too Many Choices
◮ Serial (-q [sml]Tz.q): regular job, or job array (-t & -tc)
◮ Examples on the Wiki and in ~hpc/tests
◮ Typical job:
      qsub -N crunch_1 -o crunch_1.log crunch.job -set 1
◮ Job array: slew of cases using the same script
      qsub -t 1-1999:2 -tc 50 crunch.job
◮ Optimization, namespace, I/O load, throughput & scalability
◮ Monitor progress ⇒ trade off
◮ Best to request resource(s) rather than to hard-wire name(s)
◮ Consider checkpointing
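To illustrate the job-array form: with -t 1-1999:2 Grid Engine runs the same script once per task ID 1, 3, 5, ..., 1999, and -tc 50 caps how many of those tasks run at the same time. Each task finds its ID in the SGE_TASK_ID environment variable. The sketch below is a hypothetical sh-syntax body for crunch.job; the ./crunch executable and the way the task ID is passed to it via -set are assumptions, not taken from the slides.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd -j y
# Hypothetical body for crunch.job when submitted as a job array:
#   qsub -t 1-1999:2 -tc 50 crunch.job

# Number of tasks in a range first-last:step
task_count() {
    echo $(( ($2 - $1) / $3 + 1 ))
}

# Grid Engine sets SGE_TASK_ID for array tasks ("undefined" for a
# regular job); outside the queuing system it is unset, so the
# work below is skipped.
if [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
    echo "task $SGE_TASK_ID of $(task_count 1 1999 2) on $(hostname)"
    ./crunch -set "$SGE_TASK_ID"      # assumed executable and option
fi
```

Using the task ID to select the case keeps a single script and a single qsub call for the whole slew of cases.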

Hydra: Resources & Embedded Directives
◮ Request resources with -l (repeat as many times as needed)

    To get                      Request                      With
    memory use limit            2GB of memory use            -l s_data=2G
    virtual memory use limit    2GB of virtual memory use    -l s_vmem=2G
    host with free memory       2GB of free memory           -l mem_free=2G
    cpu time limit              two-hour cpu limit           -l s_cpu=2:00:00
    elapsed time limit          two-hour real-time limit     -l s_rt=2:00:00

◮ Embedded directives

    hydra% qsub -cwd -j y -o hello.log -N hello \
             -l s_cpu=48:00:00 -l s_data=2G hello.job

  can be simplified by embedding the flags near the top of the job file

    # using csh syntax (default)
    #
    #$ -cwd -j y -o hello.log
    #$ -N hello
    #$ -l s_cpu=48:00:00 -l s_data=2G
    #
    echo hello.job started `date` in queue $QUEUE \
      with jobid=$JOB_ID on `hostname`
    ...

  and using, for example

    hydra% qsub -l s_cpu=24:00:00 -N short_hello hello.job

Flags given on the qsub command line override the embedded value(s).
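The soft elapsed-time limit can be combined with the earlier suggestion to consider checkpointing. According to the Grid Engine queue configuration documentation, a job that exceeds s_rt is first warned with SIGUSR1 before it is eventually killed. The sketch below shows only the shell side of catching that warning, in sh syntax rather than the default csh; the handler name on_soft_limit and the LIMIT_HIT flag are hypothetical, and a real program would usually need to handle the signal itself in order to save its state.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd -j y -o hello.log
#$ -N hello
#$ -l s_rt=2:00:00
# Sketch: react to the soft elapsed-time limit (s_rt) warning.

LIMIT_HIT=0
on_soft_limit() {
    # Placeholder: a real job would write its checkpoint here.
    LIMIT_HIT=1
    echo "soft time limit reached, saving state"
}
trap on_soft_limit USR1

# ... the actual work would go here, e.g. ./hello ...
```

Note that a shell runs a trap handler only after the command it is currently waiting on has returned, which is one reason the handling usually belongs in the program itself.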
