
ACCELERATING SANJEEVINI: A DRUG DISCOVERY SOFTWARE SUITE



  1. ACCELERATING SANJEEVINI: A DRUG DISCOVERY SOFTWARE SUITE
     Abhilash Jayaraj (IIT Delhi), Bharatkumar Sharma (Nvidia), Shashank Shekhar (IIT Delhi), Nagavijayalakshmi (Nvidia)

  2. AGENDA: What to expect and what not to
     • Quick introduction to the computer-aided drug discovery software Sanjeevini
     • Challenges
       • Code documentation is in the process of being improved
       • Code is maintained by non-computer-scientists
       • Code is designed to suit distributed programming
     • Constraints
       • Code modification should be minimal, for ease of maintenance
       • The current cluster has a mix of CPUs and GPUs; the code should run on both (portable)
     • Learnings

  3. COMPUTER AIDED DRUG DISCOVERY: Introduction
     • Target Discovery: 2.5 yrs, 4%
     • Lead Generation, Lead Optimization: 3.0 yrs, 15%
     • Preclinical Development: 1.0 yrs, 10%
     • Phase I, II & III Clinical Trials: 6.0 yrs, 68%
     • FDA Review & Approval: 1.5 yrs, 3%
     • Drug to the Market: 14 yrs, $1.4 billion

  4. SANJEEVINI FOR COMPUTER AIDED DRUG DESIGN: Overview (workflow)
     • Inputs: NRDBSM / Million molecule library / Natural products database; Self drawn ligand molecule; Protein-ligand Complex / Protein / DNA sequence
     • Check Lipinski compliance
     • Predict all possible binding sites
     • Generate and store top ten sites
     • Generate canonical A/B DNA or MD averaged structure of B DNA
     • Generate rapid binding energy estimates by RASPD protocol
     • Optimize geometry / Assign TPACM4 / derive quantum mechanical charges
     • Assign force field parameters
     • Dock and Score
     • Perform molecular dynamics simulations and post facto free energy component analyses (Optional)

  5. SANJEEVINI: GPU acceleration
     ▪ OpenACC acceleration of ParDOCK module
     ▪ All atom energy based Monte Carlo docking for protein-ligand complexes

  6. PERFORMANCE OPTIMIZATION Strategy: Analyze, Optimize, Parallelize

  7. PERFORMANCE OPTIMIZATION Strategy: Analyze, Optimize, Parallelize

  8. SANJEEVINI: PARDOCK Hotspots
     Flat profile:
       % time   cumulative   self                     self     total    name
                seconds      seconds   calls          s/call   s/call
       69.78    557.90       557.90    1188000        0.00     0.00     PDB::EnergyCalculator()
       12.92    661.19       103.29    8              12.91    20.26    PDB::clashCombination()
        7.35    719.96        58.77    26051422500    0.00     0.00     getRadius1()
        5.49    763.85        43.89    885075         0.00     0.00     PDB::energyAtom()
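One way to read the flat profile above is through Amdahl's law, which the slides do not mention; it is added here only as a rough estimate. The sketch below computes the overall speedup when a fraction `p` of the runtime is accelerated by a factor `s`. The function names `amdahlSpeedup` and `amdahlLimit` are hypothetical helpers, not part of Sanjeevini.

```cpp
// Overall speedup when a fraction p of the runtime (0 <= p < 1) is
// accelerated by a factor s > 0 and the remainder is left unchanged.
// This is Amdahl's law, an assumption added for illustration.
double amdahlSpeedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

// Upper bound on the overall speedup as s grows without limit.
double amdahlLimit(double p) {
    return 1.0 / (1.0 - p);
}
```

Taking the 69.78% self time of `PDB::EnergyCalculator()` at face value, `amdahlLimit(0.6978)` is about 3.3, so accelerating that function alone cannot speed the profiled run up by much more than 3.3x. That is one reason the other entries in the profile also matter.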

  9. PERFORMANCE OPTIMIZATION Strategy: Analyze, Optimize, Parallelize

  10. SANJEEVINI: PARDOCK CPU code: EnergyCalculator
      double PDB::EnergyCalculator(float **&energyGrid, const vector <points> &vDrugGrid,
                                   points coords[], const unsigned &totalDockAtoms, … ){
        for( int atomcount = 0 ; atomcount < totalDockAtoms ; atomcount ++ ){
          for( int counter = 0 ; counter < vDrugGrid.size() ; counter ++ ){
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            // minDis = minimum of 'distance', minCounter = counter corresponding to minDis
          }
          ene += EnergyGrid[minCounter][atomcount];
        }
        return ene;
      }
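The loop structure on this slide can be sketched as a small self-contained function. This is a minimal illustration, not the ParDOCK source: `Point3` is a hypothetical stand-in for the suite's `points` type, the energy table is assumed to be indexed `[gridPoint][atom]` as on the slide, and squared Euclidean distance is assumed as the 'distance'.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for the suite's `points` type.
struct Point3 { float x, y, z; };

// For each atom, find the nearest grid point and add the matching entry
// of a precomputed energy table indexed as energyGrid[gridPoint][atom].
double nearestGridEnergy(const std::vector<std::vector<float>>& energyGrid,
                         const std::vector<Point3>& grid,
                         const std::vector<Point3>& atoms) {
    if (grid.empty()) return 0.0;  // nothing to look up
    double ene = 0.0;
    for (std::size_t a = 0; a < atoms.size(); ++a) {
        float minDis = std::numeric_limits<float>::max();
        std::size_t minCounter = 0;
        for (std::size_t g = 0; g < grid.size(); ++g) {
            const float dx = atoms[a].x - grid[g].x;
            const float dy = atoms[a].y - grid[g].y;
            const float dz = atoms[a].z - grid[g].z;
            // Squared distance keeps the same ordering as the distance.
            const float d2 = dx * dx + dy * dy + dz * dz;
            if (d2 < minDis) { minDis = d2; minCounter = g; }
        }
        ene += energyGrid[minCounter][a];
    }
    return ene;
}
```

The inner loop is a combined "minimum and its index" search, which is the part that has to be restructured before it maps onto OpenACC reductions.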

  11. OpenACC: Simple | Powerful | Portable. Fueling the Next Wave of Scientific Discoveries in HPC
      main() {
        <serial code>
        #pragma acc kernels   // automatically runs on GPU
        {
          <parallel code>
        }
      }
      • University of Illinois, PowerGrid (MRI Reconstruction): 70x Speed-Up, 2 Days of Effort
      • RIKEN Japan, NICAM (Climate Modeling): 7-8x Speed-Up, 5% of Code Modified
      http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
      http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
      http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
      http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7

  12. OPENACC DIRECTIVES
      Incremental | Single source | Interoperable | Performance portable
      #pragma acc data copyin(x,y) copyout(z)      // Manage Data Movement
      {
        ...
        #pragma acc parallel                       // Initiate Parallel Execution
        {
          #pragma acc loop gang vector             // Optimize Loop Mappings
          for (i = 0; i < n; ++i) {
            z[i] = x[i] + y[i];
            ...
          }
        }
        ...
      }
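The directive skeleton on this slide can be completed into a small runnable function. This is a sketch under stated assumptions: the wrapper `addVectors` and the explicit array shapes `[0:n]` are additions for illustration, and `ys` is assumed to be at least as long as `xs`. A compiler without OpenACC support ignores the pragmas and runs the loop serially, which is the "single source" property the slide refers to.

```cpp
#include <vector>

// Element-wise sum z = x + y, offloaded with the three directive roles
// from the slide: data movement, parallel execution, loop mapping.
// Assumes ys.size() >= xs.size().
std::vector<float> addVectors(const std::vector<float>& xs,
                              const std::vector<float>& ys) {
    const int n = static_cast<int>(xs.size());
    std::vector<float> zs(n);
    const float* x = xs.data();
    const float* y = ys.data();
    float* z = zs.data();

    #pragma acc data copyin(x[0:n], y[0:n]) copyout(z[0:n])
    {
        #pragma acc parallel
        {
            #pragma acc loop gang vector
            for (int i = 0; i < n; ++i) {
                z[i] = x[i] + y[i];
            }
        }
    }
    return zs;
}
```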

  13. SANJEEVINI: PARDOCK OpenACC parallelization: EnergyCalculator (1)
      double PDB::EnergyCalculator(float **&energyGrid, const vector <points> &vDrugGrid,
                                   points coords[], const unsigned &totalDockAtoms, … ){
        #pragma acc parallel loop reduction(+:ene) private(minDis,minCounter) present() copyin() firstprivate()
        for( int atomcount = 0 ; atomcount < totalDockAtoms ; atomcount ++ ){
          #pragma acc loop reduction(min:minDis)
          for( int counter = 0 ; counter < vDrugGrid.size() ; counter ++ ){
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            minDis = (minDis > distance) ? distance : minDis;
          }

  14. SANJEEVINI: PARDOCK OpenACC parallelization: EnergyCalculator (2)
          #pragma acc loop reduction(min:minCounter)
          for( int counter = 0 ; counter < vDrugGrid.size() ; counter ++ ){
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            if ( distance == minDis ){
              minCounter = (minCounter > counter) ? counter : minCounter;
            }
          }
          ene += EnergyGrid[minCounter][atomcount];
        }
        return ene;
      }
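The two slides above split the "minimum and its index" search into two reductions, since OpenACC reductions work on a single scalar. A minimal standalone sketch of that two-pass idea follows. The function `argminTwoPass` and its flat `dist` input are hypothetical; in ParDOCK the distances are computed inside the loops rather than stored.

```cpp
#include <limits>
#include <vector>

// Returns the smallest index holding the minimum of dist, found with
// two reductions: first the minimum value, then the minimum matching
// index. Returns dist.size() when dist is empty.
int argminTwoPass(const std::vector<float>& dist) {
    const int n = static_cast<int>(dist.size());
    const float* d = dist.data();

    // Pass 1: min-reduction over the values.
    float minDis = std::numeric_limits<float>::max();
    #pragma acc parallel loop reduction(min:minDis) copyin(d[0:n])
    for (int i = 0; i < n; ++i) {
        minDis = (minDis > d[i]) ? d[i] : minDis;
    }

    // Pass 2: min-reduction over the indices whose value equals minDis.
    // The exact comparison is safe because minDis is one of the d[i].
    int minCounter = n;
    #pragma acc parallel loop reduction(min:minCounter) copyin(d[0:n])
    for (int i = 0; i < n; ++i) {
        if (d[i] == minDis) {
            minCounter = (minCounter > i) ? i : minCounter;
        }
    }
    return minCounter;
}
```

The second pass recomputes or rereads every distance, so the approach trades some extra work for loops that the compiler can parallelize.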

  15. SANJEEVINI: PARDOCK OpenACC parallelization: EnergyCalculator (3)
      ▪ Use 'raw data pointer' to access vectors
      const points *vDrugGridData = vDrugGrid.data();
      // compute 'distance' between coords[atomcount] and vDrugGridData[counter]

  16. SANJEEVINI: PARDOCK OpenACC parallelization: EnergyCalculator (4)
      ▪ Use 'raw data pointer' to access vectors
      ▪ Avoid using C++ references in OpenACC pragmas
      unsigned totDockAtoms = totalDockAtoms ;
      float **eneGrid = EnergyGrid ;
      #pragma acc parallel loop reduction(+:ene) … copyin(coords[0:totDockAtoms]) present(eneGrid)
      ene += eneGrid[minCounter][atomcount];

  17. SANJEEVINI: PARDOCK OpenACC parallelization: EnergyCalculator (4)
      ▪ Use 'raw data pointer' to access vectors
      ▪ Avoid using C++ references in OpenACC pragmas
      unsigned totDockAtoms = totalDockAtoms ;
      float **eneGrid = EnergyGrid ;
      #pragma acc parallel loop reduction(+:ene) … copyin(coords[0:totDockAtoms]) present(eneGrid)
      ene += eneGrid[minCounter][atomcount];
      Compiler output:
        PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points *, const unsigned int &, energy &, int):
          22, Generating present(vDrugGridData[:])
              Generating copyin(coords[:totalDockAtoms->])
              Generating present(EnergyGrid[:][:][:])
      Runtime: memory access violation
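The two recommendations above (raw data pointers for vectors, local copies instead of C++ references in pragmas) can be shown together in a small sketch. Everything here is illustrative: `Vec3` and `sumSquaredNorms` are hypothetical names, and `count` is assumed not to exceed `pts.size()`.

```cpp
#include <vector>

// Hypothetical stand-in for the suite's `points` type.
struct Vec3 { float x, y, z; };

// Sums |p|^2 over the first `count` points. The reference parameter and
// the std::vector are both kept out of the OpenACC clauses.
double sumSquaredNorms(const std::vector<Vec3>& pts, const unsigned& count) {
    const Vec3* p = pts.data();  // raw data pointer instead of the vector
    const unsigned n = count;    // local copy instead of the reference
    double total = 0.0;

    #pragma acc parallel loop reduction(+:total) copyin(p[0:n])
    for (unsigned i = 0; i < n; ++i) {
        total += static_cast<double>(p[i].x) * p[i].x
               + static_cast<double>(p[i].y) * p[i].y
               + static_cast<double>(p[i].z) * p[i].z;
    }
    return total;
}
```

Only `p` and `n` appear in the clause, so the compiler sees a plain pointer and a plain scalar rather than a container or a reference.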

  18. OPENACC: 3 LEVELS OF PARALLELISM
      • Vector threads work in lockstep (SIMD/SIMT parallelism)
      • Workers compute a vector
      • Gangs have 1 or more workers and share resources (such as cache, the streaming multiprocessor, etc.)
      • Multiple gangs work independently of each other
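A common way to use these levels is to map an outer loop to gangs and an inner loop to vector lanes. The sketch below does this for row sums of a dense matrix. The function `rowSums` and the row-major layout are assumptions for illustration, and the worker level is left to the compiler.

```cpp
#include <vector>

// Row sums of a rows x cols matrix stored row-major in m.
// Outer loop: one gang per row (gangs run independently).
// Inner loop: vector lanes cooperate on one row through a reduction.
std::vector<float> rowSums(const std::vector<float>& m, int rows, int cols) {
    std::vector<float> out(rows);
    const float* a = m.data();
    float* s = out.data();

    #pragma acc parallel loop gang copyin(a[0:rows*cols]) copyout(s[0:rows])
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        #pragma acc loop vector reduction(+:acc)
        for (int c = 0; c < cols; ++c) {
            acc += a[r * cols + c];
        }
        s[r] = acc;
    }
    return out;
}
```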

  19. SANJEEVINI: PARDOCK OpenACC compiler output: EnergyCalculator
      PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points *, const unsigned int &, energy &, int):
        22, Generating present(vDrugGridData[:],eneGrid[:][:])
            Generating copyin(coords[:totDockAtoms])
        22, Accelerator kernel generated
            Generating Tesla code
            22, Generating reduction(+:ene)
            24, #pragma acc loop gang /* blockIdx.x */
            31, #pragma acc loop vector(256) /* threadIdx.x */
                Generating reduction(min:minDis)
            45, #pragma acc loop vector(256) /* threadIdx.x */
                Generating reduction(min:minIdx)
        31, Loop is parallelizable
        45, Loop is parallelizable

  20. MANAGE DATA HIGHER IN THE PROGRAM
      • Currently, data is moved at the beginning and end of each function, in case the data is needed on the CPU.
      • We know that the data is only needed on the CPU after convergence.
      • We should inform the compiler when data movement is really needed, to improve performance.

  21. STRUCTURED DATA REGIONS
      The data directive defines a region of code in which GPU arrays remain on the GPU and are shared among all kernels in that region. Arrays used within the data region will remain on the GPU until the end of the data region.
      #pragma acc data
      {                              // start of data region
        #pragma acc parallel loop
        ...
        #pragma acc parallel loop
        ...
      }                              // end of data region
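A minimal sketch of a structured data region with two kernels sharing one array follows. The function `scaleThenSum` is hypothetical and not taken from ParDOCK. The `copy` clause moves the array to the device once at region entry and back once at region exit, so the second loop reuses the data already on the device.

```cpp
#include <vector>

// Scales v in place, then returns the sum of the scaled values.
// Both loops run inside one data region, so the array is transferred
// once in each direction instead of once per loop.
double scaleThenSum(std::vector<float>& v, float factor) {
    const int n = static_cast<int>(v.size());
    float* a = v.data();
    double total = 0.0;

    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            a[i] *= factor;
        }

        #pragma acc parallel loop reduction(+:total)
        for (int i = 0; i < n; ++i) {
            total += a[i];
        }
    }
    return total;
}
```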

  22. UNSTRUCTURED DATA DIRECTIVES
      Used to define data regions when scoping doesn't allow the use of normal data regions (e.g. the constructor/destructor of a class).
      • enter data: Defines the start of an unstructured data lifetime
        • clauses: copyin(list), create(list)
      • exit data: Defines the end of an unstructured data lifetime
        • clauses: copyout(list), delete(list), finalize
      #pragma acc enter data copyin(a)
      ...
      #pragma acc exit data delete(a)
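The constructor/destructor case mentioned above can be sketched with a small class. `DeviceBuffer` is a hypothetical example, not a ParDOCK type. The `update self` directive used to bring results back to the host is not on the slide and is included here as an assumption about how one might read the data between `enter data` and `exit data`.

```cpp
// Owns a host array whose device copy lives as long as the object.
class DeviceBuffer {
public:
    explicit DeviceBuffer(int n) : n_(n), data_(new float[n]) {
        for (int i = 0; i < n_; ++i) data_[i] = 0.0f;
        float* d = data_;  // local pointer for use in the clause
        #pragma acc enter data copyin(d[0:n])
    }

    ~DeviceBuffer() {
        float* d = data_;
        const int n = n_;
        #pragma acc exit data delete(d[0:n])
        delete[] data_;
    }

    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    // Fills the device copy, then refreshes the host copy.
    void fill(float value) {
        float* d = data_;
        const int n = n_;
        #pragma acc parallel loop present(d[0:n])
        for (int i = 0; i < n; ++i) {
            d[i] = value;
        }
        #pragma acc update self(d[0:n])
    }

    float at(int i) const { return data_[i]; }
    int size() const { return n_; }

private:
    int n_;
    float* data_;
};
```

Because the data lifetime follows the object rather than a lexical block, any member function called between construction and destruction can rely on `present`.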

  23. SANJEEVINI: PARDOCK OpenACC parallelization: EnergyAtom (3)
      ▪ Use 'raw data pointer' to access vectors
      ▪ How will you access 'vector of vector (jagged arrays)'?
      Creation and copy of jagged arrays:
      int **vProteinListData = new int *[ vProteinList.size() ] ;
      n = vProteinList.size();
      #pragma acc enter data create(vProteinListData[0:n][0:1])
      for( int count = 0 ; count < n ; count ++ ){
        int numPro = vProteinList[count].size();
        vProteinListData[count] = vProteinList[count].data();
        #pragma acc enter data copyin(vProteinListData[count:1][0:numPro])
      }
