HPC @ SAO S.G. Korzennik - SAO HPC Analyst hpc@cfa February 2013 SGK ( hpc@cfa ) HPC @ SAO February 2013 1 / 33
Outline
◮ Results of the survey
◮ What is Hydra
◮ How to use Hydra
◮ Answer to some survey questions
◮ Discussion: h/w, s/w, other
Introduction
Results of the Survey
What is Hydra
◮ Hydra is a Linux-based Beowulf cluster.
◮ Started at SAO a while back, where it was managed by the CF.
◮ Moved to the Smithsonian's Data Center, in Herndon, VA.
◮ Managed by SI's Office of Information Technology Operations (OITO/OCIO).
◮ Has grown from a 200+ to a 3000+ core machine.
◮ Has become an SI-wide resource.
◮ The cluster is managed by DJ Ding, sys-admin (in Herndon, VA).
◮ Additional support for SAO: HPC analyst (0.25 FTE).
What is Hydra: Hardware
◮ 296 compute nodes, distributed over 10 racks.
◮ Total of 3,116 compute cores (CPUs).
◮ All the nodes are interconnected on regular Ethernet (1 Gbps).
◮ Some nodes are on an InfiniBand (40 Gbps) fabric (856 cores on IB).
◮ Some 40 TB of public disk space (56% full).
◮ Comparable user-specific disk space (individual purchases).
◮ A parallel file system (60 TB), leveraging the IB fabric, is in the works.
CF/HPC web page, incl. how to request an account: http://www.cfa.harvard.edu/cf/services/cluster
The hardware configuration is described in the HPC Wiki: https://www.cfa.harvard.edu/twiki/bin/view/HPC/WebHome
What is Hydra: Software
◮ The cluster is a Linux-based distributed cluster, running Rocks.
◮ Uses the Grid Engine queuing system (aka SGE, OGE or GE).
◮ Access to the cluster is via 2 login nodes:
  ◮ hydra.si.edu, or
  ◮ hydra-login.si.edu
◮ From one of the login nodes:
  ◮ you submit jobs via the queuing system: qsub
  ◮ all jobs run in batch mode
◮ You do not start jobs interactively on any of the compute nodes,
  ◮ instead, you submit a script and request resources
  ◮ the GE selects the compute nodes and starts your job on that/these node(s).
◮ The login nodes are for normal interactive use like editing, compiling, script writing, short debugging, etc.
What is Hydra: Software - cont'd
◮ Compilers (3)
  ◮ GNU: gcc, f77
  ◮ Intel: icc, icpc, ifort, including the Cluster Studio
  ◮ PGI: pgcc, phgc+, pgf77, pgf90 (Portland Group), incl. the Cluster Development Kit (CDK)
◮ Libraries
  ◮ MPI for all compilers, including InfiniBand support
  ◮ Math libraries that come w/ compilers
  ◮ AMD math libraries
◮ Packages
  ◮ IDL, including 128 run-time licenses, GDL
  ◮ IRAF (v1.7)
If you need some specific software, or believe that additional software would benefit the user community, let me know (hpc@cfa).
What is Hydra: Documentation
At the HPC Wiki
◮ Primers
  ◮ How to compile
  ◮ How to submit jobs
  ◮ How to monitor your jobs
  ◮ How to use IDL & GDL on the cluster
  ◮ How to copy files to/from cluster and what disk(s) to use
◮ FAQs
  ◮ Queues
  ◮ Error Messages
  ◮ Compilers
  ◮ Disk Use
  ◮ S/W packages
◮ Man Pages for Grid Engine Commands
https://www.cfa.harvard.edu/twiki/bin/view/HPC/WebHome
Hydra: Connect, Support, Disks, Copy, Compile and Submit
Hydra: Connect and Support
◮ Connect to Hydra
  ◮ Must ssh to one of the Hydra login nodes from a trusted host: login.cfa.harvard.edu, CF/HEA managed hosts, VPN
  ◮ Accounts and directories are separate from CF/HEA
  ◮ SD 931 enforced (incl. password expiration)
◮ Support for Hydra
  ◮ DJ Ding - sys-admin (OITO/OCIO)
  ◮ SGK - HPC analyst (SAO)
  ◮ HPCC-L on si-listserv.si.edu (mailing list on SI's list-serv)
  ◮ Do not contact CF/HEA support (except for requesting an account)
  ◮ Contact the OCIO Help Desk (OCIOHelpDesk@si.edu) for password reset or other access related issues.
◮ Configuration is different from CF/HEA systems
  ◮ Must customize ~/.cshrc, ~/.bash_profile & ~/.bashrc
  ◮ Look under ~hpc
Hydra: Disks and Copy
◮ Disks on Hydra
  ◮ /home - is not for data storage
  ◮ Public space, first come first served basis
    ◮ /pool/cluster* (34 TB over 7 filesys, half used, no scrubber)
    ◮ /pool/temp* (8 TB over 2 filesys, 10% used, 14 day scrubber)
    ◮ /pool/cluster* is not for long term storage
  ◮ Parallel file system: under development (PVFS: 60 TB)
  ◮ User specific storage: possible
  ◮ Local disk space (on compute nodes): uneven & discouraged
  ◮ Not cross-mounted to CfA.
◮ Copy to/from Hydra
  ◮ Use scp, sftp or rsync
  ◮ Use rsync --bwlimit=1000 for large transfers (> 10 GB):
      rsync --bwlimit=1000 -azv * hydra:/pool/cluster2/user/mydata/.
  ◮ Serialize or limit the number of heavy I/O operations: cp, mv, rm, scp, rsync, ...
Public key authentication (ssh-keygen) is OK: no password needed with ssh or rsync
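The copy advice above (throttle with --bwlimit, run heavy transfers one at a time) can be sketched as a small sh script. This is only an illustration: the function names, the run01/run02 directories and the destination path are placeholders, and the script prints the commands unless DRY_RUN=0 is set. rsync interprets --bwlimit in units of 1024 bytes per second, so 1000 is roughly 1 MB/s.

```shell
#!/bin/sh
# Sketch: copy several directories to Hydra one at a time (serialized),
# each transfer throttled with --bwlimit.
# DEST and the directory names below are placeholders, not real paths.
DEST="hydra:/pool/cluster2/user/mydata"
BWLIMIT=1000    # rsync unit: 1024 bytes per second

# Print the rsync command that would be used for one source directory.
rsync_cmd() {
    printf 'rsync --bwlimit=%s -az %s %s/\n' "$BWLIMIT" "$1" "$DEST"
}

# Process the directories sequentially, never in parallel.
# By default only print the commands; set DRY_RUN=0 to actually copy.
copy_all() {
    for dir in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            rsync_cmd "$dir"
        else
            rsync --bwlimit="$BWLIMIT" -az "$dir" "$DEST/" || return 1
        fi
    done
}

copy_all run01 run02
```

Stopping at the first failed transfer (the `|| return 1`) is a design choice for the sketch; rerunning rsync afterwards only copies what is still missing.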
Hydra: Compile and Submit
◮ Compilers (3): GNU, PGI, Intel
  ◮ Cannot mix and match compilers/libraries
  ◮ Same 3 compilers available on CF-managed machines
  ◮ MPI and IB support avail. for all 3, can be tricky
  ◮ OpenMP avail. for all 3 (de facto h/w limit)
◮ Submit jobs
  ◮ qsub: submit job(s)
  ◮ qstat: monitor job(s)
  ◮ qalter: change job resource(s)
  ◮ qdel: kill queued or running job(s)
  ◮ qconf: query queue(s) configuration
  ◮ qacct: query queue(s) accounting (used resources)
◮ MPI, IB & OpenMP: must use
  ◮ appropriate compiler flags and libs
  ◮ corresponding execs, scripts and queues
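As an illustration of "corresponding scripts" for OpenMP, here is a minimal sketch of a job file that sets the thread count from the number of slots Grid Engine granted (NSLOTS). It is written in sh syntax, selected with -S, rather than the csh default. The file name omp.job is hypothetical, and the parallel environment to request is site-specific, so it is left as a comment; check the Wiki for the names used on Hydra.

```shell
#!/bin/sh
# Write a hypothetical OpenMP job file, then run it by hand to check it.
cat > omp.job <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd -j y -o omp.log -N omp
# To request, say, 8 slots, add a directive of the form  -pe PE_NAME 8
# PE_NAME is site-specific; "qconf -spl" lists the parallel environments.
#
# The executable must be built with the compiler's OpenMP flag,
# e.g. gcc -fopenmp or pgcc -mp (Intel compilers of this period: -openmp).
#
# Grid Engine sets NSLOTS to the number of granted slots;
# fall back to 1 thread when the script is run outside the queue.
OMP_NUM_THREADS=${NSLOTS:-1}
export OMP_NUM_THREADS
echo "running with $OMP_NUM_THREADS thread(s)"
EOF

# Outside the queue NSLOTS is unset; fake it here for a quick check.
NSLOTS=8 sh omp.job
```

Tying OMP_NUM_THREADS to NSLOTS keeps the number of threads equal to the number of slots requested, so the job does not oversubscribe the node it lands on.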
Hydra: Trivial example
    hydra% pwd
    /home/user/test
    hydra% cat hello.c
    #include <stdio.h>
    int main() {
      printf("hello world!\n");
      return 0;
    }
    hydra% pgcc -o hello hello.c
    hydra% cat hello.job
    # using csh syntax (default)
    echo hello.job started `date` in queue $QUEUE \
         with jobid=$JOB_ID on `hostname`
    uptime
    pwd
    ./hello
    echo hello.job done `date`
    hydra% ls
    hello  hello.c  hello.job
Hydra: Trivial example - cont'd
    hydra% qsub -cwd -j y -o hello.log -N hello hello.job
    Your job 4539322 ("hello") has been submitted
    hydra% qstat -u user
    job-ID  prior   name  user state submit/start at     queue                    slots ja-task-ID
    -----------------------------------------------------------------------------------------------
    4539322 0.00000 hello user qw    01/10/2013 18:01:40                          1
    hydra% qstat -u user
    job-ID  prior   name  user state submit/start at     queue                    slots ja-task-ID
    -----------------------------------------------------------------------------------------------
    4539322 0.50500 hello user r     01/10/2013 18:01:53 sTz.q@compute-1-29.local 1
    hydra% ls
    hello  hello.c  hello.job  hello.log
    hydra% cat hello.log
    hello.job started Thu Jan 10 18:01:53 EST 2013 in queue sTz.q with jobid=4539322 on compute-1-29.local
     18:01:53 up 211 days, 29 min, 0 users, load average: 0.00, 0.00, 0.00
    /home/user/test
    hello world!
    hello.job done Thu Jan 10 18:01:54 EST 2013
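When many jobs are in the system, the qstat listing gets long. A minimal sketch of a helper that counts jobs per state (qw = queued and waiting, r = running) is shown here. It assumes the default qstat layout seen in this example: two header lines and the state in the fifth column. The function name count_states and the second job in the stand-in input are made up for illustration.

```shell
#!/bin/sh
# Summarize jobs per state from qstat output read on stdin.
# Assumes two header lines and the job state in column 5.
count_states() {
    awk 'NR > 2 && NF { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# On a login node one would run:  qstat -u "$USER" | count_states
# Stand-in input (job 4539323 is invented) so the sketch runs anywhere:
count_states <<'EOF'
job-ID  prior   name  user state submit/start at     queue                    slots ja-task-ID
-----------------------------------------------------------------------------------------------
4539322 0.50500 hello user r     01/10/2013 18:01:53 sTz.q@compute-1-29.local 1
4539323 0.00000 hello user qw    01/10/2013 18:02:10                          1
EOF
```

With the stand-in input this prints one line per state, `qw 1` and `r 1`.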
Hydra: Too Many Choices, Queues and Limits
Hydra: Too Many Choices
◮ Serial jobs: -q [sml]Tz.q
  ◮ regular job
  ◮ job array: -t & -tc
◮ Examples on the Wiki and in ~hpc/tests
◮ Typical job:
    qsub -N crunch_1 -o crunch_1.log crunch.job -set 1
◮ Job array: slew of cases using the same script
    qsub -t 1-1999:2 -tc 50 crunch.job
◮ Optimization, namespace, I/O load, throughput & scalability
◮ Monitor progress ⇒ trade off
◮ Best to request resource(s) rather than to hard-wire name(s)
◮ Consider checkpointing
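In the job array above, -t 1-1999:2 creates one task for each odd index from 1 to 1999 (1000 tasks), and -tc 50 lets at most 50 of them run at the same time. Each task finds its own index in the SGE_TASK_ID environment variable. A minimal sketch of a job file that uses it follows, in sh syntax rather than the csh default. The file name crunch_array.job and the ./crunch program are placeholders, and the per-task log name relies on the $TASK_ID pseudo-variable that Grid Engine expands in -o paths.

```shell
#!/bin/sh
# Write a hypothetical array job file, then run one task by hand.
cat > crunch_array.job <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd -j y -o crunch.$TASK_ID.log
#$ -N crunch
# Grid Engine sets SGE_TASK_ID for each task of a job array;
# default to 1 so the script can also be run by hand.
set_id=${SGE_TASK_ID:-1}
echo "task $set_id started on $(uname -n)"
# ./crunch is a placeholder for the real program.
if [ -x ./crunch ]; then
    ./crunch -set "$set_id"
else
    echo "would run: ./crunch -set $set_id"
fi
EOF

# Emulate task 7 outside the queue; on the cluster one would submit with
#   qsub -t 1-1999:2 -tc 50 crunch_array.job
SGE_TASK_ID=7 sh crunch_array.job
```

Writing one log file per task keeps the output of 1000 tasks from interleaving in a single file, at the cost of many small files.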
Hydra: Resources & Embedded Directives
◮ Request resources with -l (repeat as many times as needed)

    request                    to get                       with
    memory use limit           2GB of memory use            -l s_data=2G
    virtual memory use limit   2GB of virtual memory use    -l s_vmem=2G
    host with free memory      2GB of free memory           -l mem_free=2G
    cpu time limit             two hour cpu limit           -l s_cpu=2:00:00
    elapsed time limit         two hour real-time limit     -l s_rt=2:00:00

◮ Embedded directives: the command
    hydra% qsub -cwd -j y -o hello.log -N hello \
                -l s_cpu=48:00:00 -l s_data=2G hello.job
  can be simplified by embedding the options near the top of the job file
    # using csh syntax (default)
    #
    #$ -cwd -j y -o hello.log
    #$ -N hello
    #$ -l s_cpu=48:00:00 -l s_data=2G
    #
    echo hello.job started `date` in queue $QUEUE \
         with jobid=$JOB_ID on `hostname`
    ...
  and using, for example
    hydra% qsub -l s_cpu=24:00:00 -N short_hello hello.job
  Flags added on the qsub command line override the embedded value(s).
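The job file above uses csh syntax, the default here. For users who prefer sh, a minimal sketch of the same job with embedded directives is given below; it assumes the -S option is honored to select the interpreter. The file name hello_sh.job is hypothetical, and the fallbacks for QUEUE and JOB_ID only matter when the script is run by hand outside the queue.

```shell
#!/bin/sh
# Write a hypothetical sh-syntax version of hello.job, then run it by hand.
cat > hello_sh.job <<'EOF'
#!/bin/sh
# using sh syntax, selected with -S
#
#$ -S /bin/sh
#$ -cwd -j y -o hello.log
#$ -N hello
#$ -l s_cpu=48:00:00 -l s_data=2G
#
echo "hello.job started $(date) in queue ${QUEUE:-none}" \
     "with jobid=${JOB_ID:-none} on $(uname -n)"
pwd
# ./hello is the executable built in the trivial example.
if [ -x ./hello ]; then ./hello; else echo "./hello not found"; fi
echo "hello.job done $(date)"
EOF

# The "#$" lines are plain comments to the shell, so the file can be
# checked locally before submitting it with:  qsub hello_sh.job
sh hello_sh.job
```

Because the directives are comments to the shell, the same file works both as a local test and as the submitted job.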