High Performance Computing @ AUB GradEx Workshop Mher Kazandjian November 2018 American University of Beirut
How is this talk structured? • History of computing • Scientific computing workflows • Computer architecture overview • Do's and Don'ts • Demos and walk-throughs
Goals • Demonstrate how you (as users) can benefit from AUB's HPC facilities • Attract users, because: • we want to boost scientific computing research • we want to help you • we have capacity This presentation is based on actual feedback and use cases collected from users over the past year
History of computing Alan Turing 1912-1954
Growth over time 12 orders of magnitude since 1960
Growth over time ~12 orders of magnitude since 1960 For $1000 today you can do ~10^12 times more calculations than $1000 bought you in 1970
What is HPC used for today? ● Solving scientific problems ● Data mining and deep learning ● Military research and security ● Cloud computing ● Blockchain (cryptocurrency)
 What is HPC used for today? ● https://blog.openai.com/ai-and-compute/
Growth over time Multicores hit the markets in ~2005 Users at home started benefiting from parallelism Prior to that, applications that scaled well were restricted to mainframes / datacenters and HPC clusters
HPC @ AUB 8 compute nodes in 2006 Specs per node - 4 cores - 8 GB RAM ~80 GFlops
HPC is all about scalability • The high-speed network is the "most" important component
But what is scalability? Performance improvement as the number of cores (resources) increases for the same problem size – strong scaling (also called hard scaling)
But what is scalability? This is a CPU under a microscope
But what is scalability? 2 sec Prog.exe Serial runtime = T_serial
But what is scalability? 1 sec Prog.exe Prog.exe parallel runtime = T_parallel
But what is scalability? 0.5 sec Prog.exe Prog.exe Prog.exe Prog.exe parallel runtime = T_parallel
But what is scalability? 0.5 sec Prog.exe Prog.exe Prog.exe Prog.exe Very nice! But this is almost never the case
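The toy timings on these slides (2 s serial, 1 s on 2 cores, 0.5 s on 4 cores) can be summarized with two standard numbers: speedup = T_serial / T_parallel, and parallel efficiency = speedup / cores. A minimal Python sketch using the slide's toy values (not real measurements):

```python
# Speedup and efficiency for the toy timings on the slides:
# 2 s serial, 1 s on 2 cores, 0.5 s on 4 cores (the ideal case).
t_serial = 2.0
timings = {1: 2.0, 2: 1.0, 4: 0.5}  # cores -> runtime in seconds

for cores, t_parallel in timings.items():
    speedup = t_serial / t_parallel      # how much faster than serial
    efficiency = speedup / cores         # 100% only in the ideal case
    print(f"{cores} core(s): speedup = {speedup:.1f}, efficiency = {efficiency:.0%}")
```

In the ideal case shown here every core is fully used (100% efficiency); the first demo below shows what a real scalability diagram looks like.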
First demo – First scalability diagram
But what is scalability? Repeat the same process across multiple processors Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe
But what is scalability? Wait! - how do these processors talk to each other? - how much data needs to be transferred for a certain task? - how fast do the processes communicate with each other? - how often should the processes communicate with each other? Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe
At the single chip level Through the cache memory of the CPU Typical latency ~ ns (or less) Typical bandwidth > 150 GB/s
At the single chip level Through the RAM Typical latency ~ a few to tens of ns Typical bandwidth ~ 10 to 50 GB/s (sometimes more) https://ark.intel.com/#@Processors Random Access Memory (aka RAM)
Second demo: bandwidth and some lingo - An array is just a bunch of bytes - Bandwidth is the speed with which information is transferred - A float (double precision) is 8 bytes - an array of one million elements is 1000 x 1000 x 8 bytes = 8 MB - if I measure the time to initialize this array I can measure how fast the CPU can access the RAM (since initializing the array means visiting each memory address and setting it to zero) - bandwidth = size of array / time to initialize it
Second demo: bandwidth and some lingo Intel i7-6700HQ - https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3-50-GHz- - Advertised bandwidth = 34 GB/s - measured bandwidth (single-thread quickie) = 22.8 GB/s
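The recipe above can be sketched in a few lines of Python (the actual demo likely uses compiled code; a pure-Python loop is dominated by interpreter overhead, so the number it reports comes out far below the hardware bandwidth, but the measurement logic is the same):

```python
import time

n = 1000 * 1000          # one million doubles = 1000 * 1000 * 8 bytes = 8 MB
arr = [1.0] * n          # allocate and touch the memory first

t0 = time.perf_counter()
for i in range(n):       # visit every element and set it to zero,
    arr[i] = 0.0         # exactly as described on the slide
dt = time.perf_counter() - t0

size_bytes = n * 8       # 8 bytes per double-precision float
bandwidth = size_bytes / dt / 1e9
print(f"initialized {size_bytes / 1e6:.0f} MB in {dt * 1e3:.1f} ms "
      f"-> {bandwidth:.3f} GB/s (interpreter overhead dominates here)")
```

In C the same loop runs close to the RAM bandwidth, which is how the 22.8 GB/s figure above was obtained.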
At the single motherboard level Through QPI (QuickPath Interconnect) - typical latency for small data ~ ns - typical bandwidth ~ 100 GB/s TIP: server = node = compute node (each CPU socket with its own RAM is a NUMA node) Through the RAM Typical latency ~ a few to tens of ns Typical bandwidth ~ 10 to 100 GB/s (sometimes more) Random Access Memory (aka RAM)
Second demo: bandwidth multi-threaded - https://github.com/jeffhammond/STREAM - https://ark.intel.com/products/64597/Intel-Xeon-Processor-E5-2665-20M-Cache-2_40-GHz-8_00-GTs-Intel-QPI Another benchmark: 2-socket Intel Xeon server - 2 x sockets, expected bandwidth ~102 GB/s - measured ~75 GB/s - on a completely idle node ~95 GB/s is possible
At the cluster level (multiple nodes) Through the network (ethernet) Typical latency ~ 10 to 100 microseconds Typical bandwidth ~ 100 MB/s to a few hundred MB/s
At the cluster level (multiple nodes) Through the network (InfiniBand – high-speed network) Typical latency ~ a few microseconds down to < 1 microsecond Typical bandwidth > 3 GB/s Benefits over ethernet: - Remote Direct Memory Access (RDMA) - higher bandwidth - much lower latency https://en.wikipedia.org/wiki/InfiniBand
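A useful mental model for the numbers on the last two slides is: transfer time = latency + size / bandwidth. A small Python sketch comparing ethernet and InfiniBand for an 8 MB array, using the rough figures quoted above (exact values are hardware dependent):

```python
# Back-of-the-envelope transfer time: t = latency + size / bandwidth,
# using the approximate numbers from these slides.
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    return latency_s + size_bytes / bandwidth_bytes_per_s

size = 8 * 1000 * 1000                            # an 8 MB array of doubles
ethernet = transfer_time(size, 50e-6, 100e6)      # ~50 us latency, ~100 MB/s
infiniband = transfer_time(size, 1e-6, 3e9)       # ~1 us latency,  ~3 GB/s

print(f"ethernet:   {ethernet * 1e3:.1f} ms")
print(f"infiniband: {infiniband * 1e3:.2f} ms")
```

For large messages bandwidth dominates; for many small messages (frequent communication between processes) latency dominates, which is why the low latency of InfiniBand matters so much for tightly coupled parallel codes.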
What hardware do we have at AUB? - Arza: - 256-core, 1 TB RAM IBM cluster - production simulations, benchmarking - http://website.aub.edu.lb/it/hpc/Pages/home.aspx - vLabs - see Vassili's slide - very flexible, easy to manage, Windows support - public cloud - practically infinite resources – limited by $$$ - two pilot projects being tested – will be open soon for testing
Parallelization libraries / software SMP parallelism - OpenMP - CUDA - Matlab - Spark (recently deployed and tested) distributed parallelism (cluster wide) - MPI - Spark - MPI + OpenMP (hybrid) - MPI + CUDA - MPI + CUDA + OpenMP - Spark + CUDA (not tested – any volunteers?)
Linux/Unix culture > 99% of HPC clusters worldwide use some kind of Linux / Unix - Clicking your way through a software install is easy for you (on Windows or Mac), but a nightmare for power users. - Linux is: - open-source - free - secure (at least much more secure than Windows et al.) - no need for an antivirus that slows down your system - respects your privacy - huge community support in scientific computing - 99.8% of all HPC systems worldwide since 1996 are non-Windows machines https://github.com/mherkazandjian/top500parser
Software stack on the HPC cluster - Matlab - C, C++, Java, Fortran - Python 2 and Python 3 - Jupyter notebooks - TensorFlow (deep learning) - Scala - Spark - R - RStudio, R server (new)
Cluster usage: Demo - The scheduler: resource manager (LSF) - bjobs - bqueues - bhosts - lsload - important places - /gpfs1/my_username - /gpfs1/apps/sw - basic Linux knowledge - sample job script
Cluster usage: Documentation https://hpc-aub-users-guide.readthedocs.io/en/latest/ https://github.com/hpcaubuserguide/hpcaub_userguide The guide is for you - we want you to contribute to it directly - please send us pull requests
Cluster usage: Job scripts https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html
Cluster usage: Job scripts https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html In the user guide, there are samples and templates for many use cases: - we will help you write your own if your use case is not covered - this is 90% of the getting started task - recent success story: - spark server job template
How to benefit from the HPC hardware? - run many serial jobs that do not need to communicate - aka embarrassingly parallel jobs (nothing embarrassing about it though, as long as you get your job done) - e.g. - train several neural networks with different numbers of layers - do a parameter sweep for a certain model ./my_prog.exe --param 1 & ./my_prog.exe --param 2 & ./my_prog.exe --param 3 & wait These execute simultaneously (the wait keeps the script alive until all three finish) - difficulty: very easy
How to benefit from the HPC hardware? - run many serial jobs that do not need to communicate Demo
How to benefit from the HPC hardware? - run a SMP parallel program (i.e. on one node, using threads) - e.g. - Matlab - C/C++/Python/Java Difficulty: very easy to medium (problem dependent)
How to benefit from the HPC hardware? - run a SMP parallel program (i.e. on one node, using threads) - C
How to benefit from the HPC hardware? - run a SMP parallel program (i.e. on one node, using threads) - Demo: Matlab parfor