Python Best Practices in HPC
Roland Haas (NCSA)
Email: rhaas@illinois.edu
Why use Python in HPC?
● everybody else is already using it
  – including your students, whether you like it or not...
  – large body of documentation available on the web
● Python's design principles make for code well suited to scientific projects:
  – Beautiful is better than ugly.
  – Explicit is better than implicit.
  – Simple is better than complex.
  – Readability counts.
● Python was originally designed to be usable as a glue language
  – highly extensible
  – can bind to many compiled languages: C, C++, Fortran
Pros and cons of using Python in your science project

Pros:
● very low learning curve
  – for you
  – for your students
● quick turnaround while developing
● fully open source
  – no licensing costs
  – encourages sharing code
● large number of scientific packages:
  – numpy, scipy
  – PyTrilinos, petsc4py, Elemental, SLEPc
  – mpi4py, h5py, netcdf

Cons:
● very low learning curve
  – low-quality code possible
● not initially designed for HPC
  – most developers aren't scientists
  – Python itself is not very fast
● large startup costs, hard on the cluster IO subsystem
● not always backwards compatible, even between minor versions
● duck typing makes code validation hard; errors are only detected at runtime
Usage cases of Python for HPC by task
● preparing your input deck
  – create input files based on physical parameters
  – create directory structures
  – submit simulations
  – mostly string handling and scripting (see the sketch below)
● process simulation results
  – combine data from checkpoints
  – interactively explore data
  – distill scientific results from data
  – produce plots and other representations of results
  – mostly serial, but bag-of-tasks parallelism is possible
● orchestrate simulations
  – set up data for multi-stage simulations
  – check success of each step
  – start MPI parallel simulation code
● glue code in simulation binary
  – Python handles simulation infrastructure tasks
  – most lines of code are Python
  – most execution time is in compiled code
● Python for science code
  – no custom compiled code
  – Python code or public packages do the actual science calculations
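To illustrate the "mostly string handling and scripting" part of preparing an input deck, here is a minimal sketch that generates per-run directories and input files for a small parameter sweep. The parameter names (resolution, final_time) and file names (run_res*/input.par) are made up for illustration and not taken from any particular simulation code.

    from string import Template
    import os

    # hypothetical input-deck template; $resolution and $final_time are
    # filled in separately for each run
    deck = Template("resolution = $resolution\nfinal_time = $final_time\n")

    # one directory and input file per parameter choice
    for resolution in (64, 128, 256):
        rundir = "run_res%d" % resolution
        os.makedirs(rundir, exist_ok=True)
        with open(os.path.join(rundir, "input.par"), "w") as fh:
            fh.write(deck.substitute(resolution=resolution, final_time=10.0))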
Python startup time issues
● Python startup and the import statement are very metadata intensive
  – python3 -c 'import numpy' has 1600 open & stat calls per MPI rank, all hitting a single metadata server
  – e.g. at a 1 ms response time, 1024 ranks → 1,600 s startup time
  – makes the shared file system slow for every user on the system
[figure: import time of 60 modules, 1 rank per node, Lustre vs. bwpy – bwpy is 10x faster]
● solved in BWPY for the provided modules
● for your own modules (see the sketch below):
  – install to /dev/shm/$USER on the login node
  – tar up /dev/shm/$USER
  – extract the tarball to /dev/shm/$USER on the compute nodes, put it first in $PYTHONPATH
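A minimal sketch of that staging workflow, assuming your own modules were installed into /dev/shm/$USER/site-packages on the login node; the tarball location on the shared file system (/scratch/$USER/pymodules.tar) is a made-up example path.

    import os
    import tarfile

    user = os.environ["USER"]
    staged = "/dev/shm/%s/site-packages" % user    # install target on the login node
    tarball = "/scratch/%s/pymodules.tar" % user   # hypothetical shared-file-system path

    # on the login node: pack the staged modules into a single file so the
    # metadata server sees one large file instead of thousands of small ones
    with tarfile.open(tarball, "w") as tar:
        tar.add(staged, arcname="site-packages")

    # on each compute node (e.g. at job start): unpack into node-local
    # /dev/shm and put it first on PYTHONPATH before launching Python ranks
    dest = "/dev/shm/%s" % user
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(tarball, "r") as tar:
        tar.extractall(dest)
    os.environ["PYTHONPATH"] = os.path.join(dest, "site-packages") + ":" + \
        os.environ.get("PYTHONPATH", "")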
Workflows in Python
● for simple bag-of-tasks workflows, use mpi4py's MPICommExecutor (see BWPY presentation)
  – do not use 1000 × aprun -n1 python

    MPICommExecutor:
    from mpi4py import MPI
    from mpi4py.futures import MPICommExecutor

    def sqr(x):
        return x*x

    data = range(21)
    with MPICommExecutor(root=0) as executor:
        if executor is not None:  # on root
            squared = executor.map(sqr, data)
            print(list(squared))  # collect the results from the iterator

● Python workflows in the Blue Waters webinar series:
  – Parsl, modern, pure Python, standalone
  – Pegasus, very mature, builds on HTCondor

    Parsl:
    from parsl import App, DataFlowKernel
    import parsl.configs.local as lc

    dfk = DataFlowKernel(lc.localThreads)

    @App('python', dfk)
    def sqr(x):
        return x*x

    data = range(21)
    squared = map(sqr, data)
    print([i.result() for i in squared])

● IO challenge
  – no file system likes millions of tiny files; Lustre is no exception
  – store temporary files in /dev/shm on compute nodes
  – pre-stage files in the background using Globus, which has a Python interface
Numerical computations using Python
● numpy is the de facto standard way to handle numerical arrays in Python
  – N-dimensional arrays of integer, real and complex numbers
  – linear algebra (BLAS, LAPACK), FFT, random numbers
  – linkages to C/C++/Fortran
● scipy provides higher-level functions
  – optimization
  – integration
  – interpolation
  – signal and image processing
  – ODE solvers
● both numpy and scipy leverage BLAS, LAPACK, FFT, FITPACK
  – sub-optimal performance if those are incorrectly built
  – BWPY does "the right thing"
  – pip does not (usually)
● PyTrilinos, petsc4py, Elemental, SLEPc build on these

    import numpy as np
    A = np.random.random((1000,1000))
    b = np.random.random((1000,))
    c = A*b

    pip:  0.02s
    BWPY: 0.004s (5x faster)
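To check whether the numpy and scipy you are using were actually linked against an optimized BLAS/LAPACK, both packages can print their build configuration. This is only an inspection aid, not a benchmark:

    import numpy as np
    import scipy

    # prints the BLAS/LAPACK libraries the packages were built against;
    # a plain reference BLAS here usually explains slow linear algebra
    np.show_config()
    scipy.show_config()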
Computing in Python code
● How CPython works
  – compile the script to bytecode
  – execute one bytecode instruction after the other
● CPython is designed for maintainability, not speed
  – no look-ahead
  – no parallelism (threads, vectorization)
  – hard to change this due to duck typing
● Alternatives
  – pypy
  – numba
  – Cython
● Not all are equally well suited for all tasks
  – pypy does not deal well with numpy

    import numpy as np
    a = np.zeros(10000)
    for i in range(10000):
        a[i] = np.sqrt(i)

    is 2x slower in pypy than CPython (uses numpy-pypy)

    a = list()
    for i in range(1000):
        a.append(str(i))

    is 10x faster in pypy than CPython
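For reference, the loop above can be timed with the standard-library timeit module; the vectorized variant below is an addition to the slide, included only to make the interpreter overhead of the per-element loop visible:

    import timeit
    import numpy as np

    def loop_version():
        # one scalar np.sqrt call per element: interpreter overhead dominates
        a = np.zeros(10000)
        for i in range(10000):
            a[i] = np.sqrt(i)
        return a

    def vectorized_version():
        # a single call: the loop runs inside compiled numpy code
        return np.sqrt(np.arange(10000, dtype=np.float64))

    print("loop      :", timeit.timeit(loop_version, number=100))
    print("vectorized:", timeit.timeit(vectorized_version, number=100))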
Numba and Cython
● Numba is a just-in-time compiler for numerical operations in CPython
  – needs (simple) annotations
  – deals well with numpy

    import numpy as np
    from numba import jit

    @jit
    def my_sqrt():
        a = np.zeros(10000)
        for i in range(10000):
            a[i] = np.sqrt(i)

    12x faster than plain CPython

● Cython compiles Python-like code to C, designed to link C extensions to Python
  – load the result as a module
  – do threading and parallelization in C code

    from libc.math cimport sqrt

    def my_sqrt():
        cdef int i
        cdef double a[10000]
        for i in range(10000):
            a[i] = sqrt(i)

    481x faster than plain CPython
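A minimal build sketch for the Cython example, assuming the code above is saved as my_sqrt.pyx (the file and module names are arbitrary choices, not from the slides):

    # setup.py -- build the extension with
    #   python setup.py build_ext --inplace
    # and then use it from Python as
    #   import my_sqrt; my_sqrt.my_sqrt()
    from setuptools import setup
    from Cython.Build import cythonize

    setup(ext_modules=cythonize("my_sqrt.pyx"))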
Calling compiled code (the easy way)
● numpy has convenience code to link to Fortran code
  – very easy to use (much easier than C)
  – for C code, you may even want to write a Fortran wrapper

    SUBROUTINE FIB(A,N)
    INTEGER N
    REAL*8 A(N)
    DO I=1,N
      IF (I.EQ.1) THEN
        A(I) = 0.0D0
      ELSEIF (I.EQ.2) THEN
        A(I) = 1.0D0
      ELSE
        A(I) = A(I-1) + A(I-2)
      ENDIF
    ENDDO
    END SUBROUTINE

    $ python -m numpy.f2py -m myfib -c fib.f90

    import numpy
    import myfib

    a = numpy.zeros(8, 'float64')
    myfib.fib(a)
    print(a)

from http://scipy-lectures.org
More on using compiled modules
● Cython: https://scipy-lectures.org/advanced/interfacing_with_c/interfacing_with_c.html#id13
● f2py (very easy!): https://docs.scipy.org/doc/numpy/user/c-info.python-as-glue.html#f2py
● SWIG: http://swig.org/Doc1.3/Python.html
● Boost – interferes with HDF5 on BW
● ctypes: https://scipy-lectures.org/advanced/interfacing_with_c/interfacing_with_c.html#id6
● numpy bindings in C/C++: https://dfm.io/posts/python-c-extensions/
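Since ctypes is listed above but has no example in the slides, here is a minimal sketch that calls sqrt from the C math library; library name resolution is platform dependent and the example assumes a standard Linux libm:

    import ctypes
    import ctypes.util

    # locate and load the C math library (resolves to e.g. "libm.so.6" on Linux)
    libm = ctypes.CDLL(ctypes.util.find_library("m"))

    # declare argument and return types so ctypes converts values correctly;
    # without this the result would be mangled
    libm.sqrt.argtypes = [ctypes.c_double]
    libm.sqrt.restype = ctypes.c_double

    print(libm.sqrt(2.0))   # 1.4142135623730951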
Code profiling
● Profile your code to find out where it spends most of its time. Assuming that it must be your innermost loop is dangerous...
● object-code profilers like CrayPat profile the Python interpreter, but not your Python code
● Python comes with a built-in profiler in the cProfile module
  – included in BWPY
  – default is function-level granularity
  – add extra profiling modules and analysis tools in a virtualenv
● can be as simple as

    python -m cProfile loop.py

● output the profile using the -o switch for in-depth analysis
  – the pstats module lets you read it

    python -m cProfile -o prof.dat loop.py

    import pstats
    p = pstats.Stats('prof.dat')
    p.sort_stats('cumulative').print_stats(5)

● install line_profiler for line-by-line usage
  – annotate functions to profile using @profile
  – run kernprof -l script.py
Code profiling example

$ python -m cProfile loop.py
     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          1    0.019    0.019    0.477    0.477 test-profile.py:1(<module>)
          1    0.334    0.334    0.457    0.457 test-profile.py:1(loop)
          1    0.000    0.000    0.477    0.477 {built-in method builtins.exec}
    1000000    0.124    0.000    0.124    0.000 {method 'append' of 'list' objects}
          1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profil...

loop.py:
    @profile
    def loop():
        a = []
        for i in range(1000000):
            a.append(i)

$ virtualenv --system-site-packages $PWD
$ pip install line_profiler
$ kernprof -l loop.py
$ python -m line_profiler loop.py.lprof
    Line #      Hits         Time  Per Hit   % Time  Line Contents
         1                                           @profile
         2                                           def loop():
         3         1          5.0      5.0      0.0      a = []
         4   1000001     957889.0      1.0     44.3      for i in range(1000000):
         5   1000000    1206173.0      1.2     55.7          a.append(i)
Questions? This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.
References and extra material
● This presentation is heavily based on William Scullin's presentations: https://www.alcf.anl.gov/files/Scullin-Pavlyk_SDL2018_Python.pdf
● https://github.com/bccp/nbodykit, https://wiki.fysik.dtu.dk/gpaw/
● https://bluewaters.ncsa.illinois.edu/webinars/workflows
● https://cython.org/, https://www.pypy.org/, https://numba.pydata.org/
● https://bluewaters.ncsa.illinois.edu/python, https://bluewaters.ncsa.illinois.edu/Python-profiling