  1. Python Best Practices in HPC Roland Haas (NCSA) Email: rhaas@illinois.edu

  2. Why use Python in HPC?
     ● everybody else is already using it
       – including your students, whether you like it or not...
       – large body of documentation available on the web
     ● Python's design principles make for code well suited to scientific projects:
       – Beautiful is better than ugly.
       – Explicit is better than implicit.
       – Simple is better than complex.
       – Readability counts.
     ● Python was originally designed to be usable as a glue language
       – highly extensible
       – can bind to many compiled languages: C, C++, Fortran

  3. Pros and cons of using Python in your science project
     Pros:
     ● Very low learning curve
       – for you
       – for your students
     ● Quick turnaround while developing
     ● fully open source
       – no licensing costs
       – encourages sharing code
     ● large number of scientific packages:
       – numpy, scipy
       – PyTrilinos, petsc4py, Elemental, SLEPc
       – mpi4py, h5py, netcdf
     Cons:
     ● Very low learning curve
       – low quality code possible
     ● not initially designed for HPC
       – most developers aren't scientists
       – Python itself is not very fast
     ● Large startup costs, hard on cluster IO subsystem
     ● not always backwards compatible, even between minor versions
     ● duck-typing makes code validation hard, errors only detected at runtime

  4. Usage cases of Python for HPC by task
     ● preparing your input deck (see the sketch below)
       – create input files based on physical parameters
       – create directory structures
       – submit simulations
       – mostly string handling and scripting
     ● process simulation results
       – combine data from checkpoints
       – interactively explore data
       – distill scientific results from data
       – produce plots and other representations of results
       – mostly serial, but bag-of-task parallelism is possible
     ● orchestrate simulations
       – set up data for multi-stage simulations
       – check success of each step
       – start MPI parallel simulation code
     ● glue code in simulation binary
       – Python handles simulation infrastructure tasks
       – most lines of code are Python
       – most execution time is in compiled code
     ● Python for science code
       – no custom compiled code
       – Python code or public packages do actual science calculations
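
     As an illustration of the input-deck and orchestration use cases, here is a minimal
     sketch using only the standard library; the parameter names, input-file keys and the
     qsub submission command are placeholders for whatever your simulation code and
     scheduler actually expect.

           import os
           import subprocess

           # hypothetical physical parameters for a small resolution sweep
           resolutions = [64, 128, 256]

           for res in resolutions:
               # one run directory per parameter choice
               run_dir = os.path.join("runs", "res%d" % res)
               os.makedirs(run_dir, exist_ok=True)

               # write the input file from the physical parameters
               # (the keys are placeholders for your code's input format)
               with open(os.path.join(run_dir, "input.par"), "w") as fh:
                   fh.write("resolution = %d\n" % res)
                   fh.write("final_time = 10.0\n")

               # submit the simulation; the submit script and scheduler
               # command are assumptions, adjust for your cluster
               subprocess.run(["qsub", "submit.pbs"], cwd=run_dir, check=True)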

  5. Python startup time issues
     [Figure: time to import 60 modules, 1 rank per node – Lustre vs. bwpy; bwpy is 10x faster]
     ● Python startup and the import statement are very metadata intensive
       – python3 -c 'import numpy' has 1600 open & stat calls
       – per MPI rank, hitting a single metadata server
       – e.g. at a 1ms response time, 1024 ranks → 1,600s startup time
       – makes the shared file system slow for every user on the system
     ● solved in BWPY for provided modules
     ● for your own modules (see the sketch below):
       – install to /dev/shm/$USER on the login node
       – tar up /dev/shm/$USER
       – extract the tarball to /dev/shm/$USER on the compute nodes, put it first in $PYTHONPATH
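
     A minimal sketch of the tarball approach, assuming the tarball of your
     /dev/shm/$USER install has already been staged somewhere reachable from the
     compute nodes; the path names are placeholders, and in an MPI job only one
     rank per node should do the extraction.

           import os
           import sys
           import tarfile

           user = os.environ["USER"]
           tarball = "/path/to/staged/%s-pymodules.tar" % user   # hypothetical location
           target = "/dev/shm/%s" % user

           # unpack the modules into the RAM-backed file system on this node
           os.makedirs(target, exist_ok=True)
           with tarfile.open(tarball) as tf:
               tf.extractall(target)

           # make the node-local copy take precedence over the shared file system
           sys.path.insert(0, target)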

  6. Workflows in python
     ● for simple bag-of-tasks workflows, use mpi4py's MPICommExecutor (see BWPY presentation)
       – do not use 1000x aprun -n1 python

           from mpi4py import MPI
           from mpi4py.futures import MPICommExecutor

           def sqr(x):
               return x*x

           data = range(21)
           with MPICommExecutor(root=0) as executor:
               if executor is not None:  # on root
                   squared = executor.map(sqr, data)
                   print(list(squared))

     ● Python workflows in Blue Waters webinars series:
       – Parsl, modern, pure python, standalone
       – Pegasus, very mature, builds on HTCondor

           from parsl import App, DataFlowKernel
           import parsl.configs.local as lc

           dfk = DataFlowKernel(lc.localThreads)

           @App('python', dfk)
           def sqr(x):
               return x*x

           data = range(21)
           squared = map(sqr, data)
           print([i.result() for i in squared])

     ● IO challenge
       – no file system likes millions of tiny files; Lustre is no exception
       – store temporary files in /dev/shm on compute nodes
       – pre-stage files in the background using Globus, which has a python interface

  7. Numerical computations using python
     ● numpy, the de-facto standard way to handle numerical arrays in python
       – N-dimensional arrays of integer, real and complex numbers
       – linear algebra (BLAS, LAPACK), FFT, random numbers
       – linkages to C/C++/Fortran
     ● scipy provides higher level functions (see the sketch below)
       – optimization
       – integration
       – interpolation
       – signal and image processing
       – ODE solvers
     ● both numpy and scipy leverage BLAS, LAPACK, FFT, FITPACK
       – sub-optimal performance if those are incorrectly built
       – BWPY does "the right thing"
       – pip does not (usually)
     ● PyTrilinos, petsc4py, Elemental, SLEPc build on these

           import numpy as np
           A = np.random.random((1000,1000))
           b = np.random.random((1000,))
           c = np.dot(A, b)   # matrix-vector product, handled by BLAS

       pip:  0.02s
       BWPY: 0.004s (5x faster)
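
     A small taste of the scipy layer, using its optimization and integration
     routines; the quadratic objective below is just a placeholder.

           import numpy as np
           from scipy import integrate, optimize

           # minimize a simple quadratic (placeholder objective)
           res = optimize.minimize(lambda x: (x[0] - 3.0)**2 + (x[1] + 1.0)**2,
                                   x0=np.zeros(2))
           print(res.x)   # approximately [3, -1]

           # integrate sin(x) from 0 to pi
           val, err = integrate.quad(np.sin, 0.0, np.pi)
           print(val)     # approximately 2.0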

  8. Computing in python code
     ● How CPython works
       – compile script to bytecode (see the sketch below)
       – execute one line of byte code after the other
     ● CPython is designed for maintainability, not speed
       – no look ahead
       – no parallelism (threads, vectorization)
       – hard to change this due to duck typing
     ● Alternatives
       – pypy
       – numba
       – Cython
     ● Not all are equally well suited for all tasks
       – pypy does not deal well with numpy

           import numpy as np
           a = np.zeros(10000)
           for i in range(10000):
               a[i] = np.sqrt(i)

         is 2x slower in pypy than CPython (uses numpy-pypy)

           a = list()
           for i in range(1000):
               a.append(str(i))

         is 10x faster in pypy than CPython
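
     To see the bytecode that CPython then executes instruction by instruction, the
     standard library's dis module can disassemble a function; the loop below is
     just an illustrative example.

           import dis

           def loop(n):
               # a trivial function whose bytecode we want to inspect
               total = 0
               for i in range(n):
                   total += i
               return total

           # print the bytecode instructions CPython interprets one by one
           dis.dis(loop)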

  9. Numba and Cython
     ● Numba is a just-in-time compiler for numerical operations in CPython
       – needs (simple) annotations
       – deals well with numpy

           import numpy as np
           from numba import jit

           @jit
           def my_sqrt():
               a = np.zeros(10000)
               for i in range(10000):
                   a[i] = np.sqrt(i)

       12x faster than plain CPython

     ● Cython compiles python-like code to C, designed to link C extensions to python
       – load result as module (see the build sketch below)
       – do threading and parallelization in C code

           from libc.math cimport sqrt

           def my_sqrt():
               cdef int i
               cdef double a[10000]
               for i in range(10000):
                   a[i] = sqrt(i)

       481x faster than plain CPython
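
     The Cython code above has to be compiled into an extension module before it can
     be imported; a minimal setup.py sketch is shown below, assuming the code was
     saved as my_sqrt.pyx (the file name is an assumption for illustration). Build
     with "python setup.py build_ext --inplace", then "import my_sqrt" loads the
     compiled module like any other.

           # setup.py -- builds the my_sqrt.pyx Cython source into a C extension
           from setuptools import setup
           from Cython.Build import cythonize

           setup(ext_modules=cythonize("my_sqrt.pyx"))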

  10. Calling compiled code (the easy way)
      ● numpy has convenience code to link to Fortran code
        – very easy to use (much easier than C)
      ● For C code, you may even want to write a Fortran wrapper

      fib.f90 (from http://scipy-lectures.org):

            SUBROUTINE FIB(A,N)
            INTEGER N
            REAL*8 A(N)
            DO I=1,N
               IF (I.EQ.1) THEN
                  A(I) = 0.0D0
               ELSEIF (I.EQ.2) THEN
                  A(I) = 1.0D0
               ELSE
                  A(I) = A(I-1) + A(I-2)
               ENDIF
            ENDDO
            END SUBROUTINE

      Compile and use from python:

            $ python -m numpy.f2py -m myfib -c fib.f90

            import numpy
            import myfib
            a = numpy.zeros(8, 'float64')
            myfib.fib(a)
            print(a)

  11. More on using compiled modules
      ● Cython: https://scipy-lectures.org/advanced/interfacing_with_c/interfacing_with_c.html#id13
      ● f2py (very easy!): https://docs.scipy.org/doc/numpy/user/c-info.python-as-glue.html#f2py
      ● SWIG: http://swig.org/Doc1.3/Python.html
      ● Boost – interferes with HDF5 on BW
      ● Ctypes (see the sketch below): https://scipy-lectures.org/advanced/interfacing_with_c/interfacing_with_c.html#id6
      ● Numpy bindings in C/C++: https://dfm.io/posts/python-c-extensions/
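
      As a small taste of the ctypes route, which needs no compilation step at all on
      the Python side, the sketch below calls sqrt from the system's C math library;
      the library name libm.so.6 assumes a typical Linux system.

            import ctypes

            # load the C math library (use ctypes.util.find_library("m") to be portable)
            libm = ctypes.CDLL("libm.so.6")

            # declare the C signature: double sqrt(double)
            libm.sqrt.argtypes = [ctypes.c_double]
            libm.sqrt.restype = ctypes.c_double

            print(libm.sqrt(2.0))  # 1.4142...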

  12. Code profiling
      ● Profile your code to find out where it spends most time. Assuming that it must be
        your innermost loop is dangerous...
      ● object code profilers like CrayPat profile the python interpreter, but not your
        python code
      ● Python comes with a built-in profiler in the cProfile module
        – included in BWPY
        – default is function level granularity
        – add extra profiling modules and analysis tools in a virtualenv
      ● can be as simple as

            python -m cProfile loop.py

      ● output a profile using the -o switch for in-depth analysis
        – the pstats module lets you read it

            python -m cProfile -o prof.dat loop.py

            import pstats
            p = pstats.Stats('prof.dat')
            p.sort_stats('cumulative').print_stats(5)

      ● install line_profiler for line-by-line usage
        – annotate functions to profile using @profile
        – run kernprof -l script.py

  13. Code profiling example

      $ python -m cProfile loop.py
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              1    0.019    0.019    0.477    0.477 test-profile.py:1(<module>)
              1    0.334    0.334    0.457    0.457 test-profile.py:1(loop)
              1    0.000    0.000    0.477    0.477 {built-in method builtins.exec}
        1000000    0.124    0.000    0.124    0.000 {method 'append' of 'list' objects}
              1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profil...

      The profiled script:

            @profile
            def loop():
                a = []
                for i in range(1000000):
                    a.append(i)

      Line-by-line profiling with line_profiler:

      $ virtualenv --system-site-packages $PWD
      $ pip install line_profiler
      $ kernprof -l loop.py
      $ python -m line_profiler loop.py.lprof
      Line #      Hits       Time  Per Hit  % Time  Line Contents
           1                                        @profile
           2                                        def loop():
           3         1        5.0      5.0     0.0      a = []
           4   1000001   957889.0      1.0    44.3      for i in range(1000000):
           5   1000000  1206173.0      1.2    55.7          a.append(i)

  14. Questions? This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.

  15. References and extra material
      ● This presentation is heavily based on William Scullin's presentations:
        https://www.alcf.anl.gov/files/Scullin-Pavlyk_SDL2018_Python.pdf
      ● https://github.com/bccp/nbodykit, https://wiki.fysik.dtu.dk/gpaw/
      ● https://bluewaters.ncsa.illinois.edu/webinars/workflows
      ● https://cython.org/, https://www.pypy.org/, https://numba.pydata.org/
      ● https://bluewaters.ncsa.illinois.edu/python, https://bluewaters.ncsa.illinois.edu/Python-profiling
