Parallel scripting with Swift for applications at the petascale and beyond
VecPar PEEPS Workshop, Berkeley, CA, June 22, 2010
Michael Wilde, wilde@mcs.anl.gov
Computation Institute, University of Chicago and Argonne National Laboratory
www.ci.uchicago.edu/swift
Problems addressed by Swift
• Many applications need loosely coupled scripting
• Swift harnesses parallel & distributed resources through a simple scripting language
• Productivity gains by enabling use of more powerful systems with less concern for the mechanics
Modeling uncertainty for CIM-EARTH
Parallel AMPL workflow by Joshua Elliott, Meredith Franklin, Todd Munson, Allan Espinosa.
Fast Ocean Atmosphere Model (MPI)
[Figure labels: NCAR; manual config, execution, bookkeeping; VDS on TeraGrid; automated]
Visualization courtesy Pat Behling and Yun Liu, UW Madison
Work of Veronica Nefedova and Rob Jacob, Argonne
Problem: Drug screening at APS
• O(millions) of drug candidates (2M+ ligands)
• O(tens) of fruitful candidates for wetlab & APS
(Mike Kubal, Benoit Roux, and others)
[Figure: drug-screening workflow, from start to end]
Inputs:
• ZINC 3-D structures: 2M structures (6 GB)
• PDB protein descriptions: 1 protein (1 MB)
• Manually prepped DOCK6 rec file and FRED rec file, giving the DOCK6 Receptor and FRED Receptor (1 per protein: defines pocket to bind to)
• NAB script parameters (defines flexible residues, #MDsteps) and NAB Script Template, combined by BuildNABScript into the NAB Script
Stages:
• FRED and DOCK6: ~4M x 60s x 1 cpu, ~60K cpu-hrs; select best ~5K from each
• Amber Score: ~10K x 20m x 1 cpu, ~3K cpu-hrs; select best ~500
  – Amber prep: 1. AmberizeLigand, 2. AmberizeReceptor, 3. AmberizeComplex, 4. perl: gen nabscript, 5. RunNABScript
• GCMC: ~500 x 10hr x 100 cpu, ~500K cpu-hrs
Outputs: report, ligands, complexes
Work of Andrew Binkowski and Michael Kubal
Problem: preprocessing and analysis of neuroscience experiments
[Figure: workflow graph with many data files and many application programs]
• Inputs: 3a, 4a, 5a, 6a and ref (each a .h/.i file pair)
• align_warp/1, /3, /5, /7 produce 3a.w, 4a.w, 5a.w, 6a.w
• reslice/2, /4, /6, /8 produce 3a.s, 4a.s, 5a.s, 6a.s (each a .h/.i file pair)
• softmean/9 produces atlas.h and atlas.i
• slicer/10, /12, /14 produce atlas_x.ppm, atlas_y.ppm, atlas_z.ppm
• convert/11, /13, /15 produce atlas_x.jpg, atlas_y.jpg, atlas_z.jpg
Automated image registration for spatial normalization
[Figure: the AIRSN workflow and the same workflow expanded into individual tasks. Stages shown: reorientRun, reorient, random_select, alignlinearRun, alignlinear, resliceRun, reslice, softmean, combine_warp (combinewarp), reslice_warpRun, reslice_warp, strictmean, binarize, gsmoothRun, gsmooth. The expanded graph numbers the individual task instances, for example reorient/01 and gsmooth/50.]
Swift programs
• A Swift script is a set of functions
  – Atomic functions wrap & invoke application programs (on parallel compute nodes)
  – Composite functions invoke other functions (run in Swift engine)
• Data is typed as composable arrays and structures of files and simple scalar types (int, float, string)
• Collections of persistent file structures are mapped into this data model as arrays and structures
• Variables are single assignment
• Expressions and statements are executed concurrently, in data-flow dependency order
• Members of datasets can be processed in parallel
• Provenance is gathered as scripts execute
A simple Swift script
To run the ImageMagick app "convert":

    convert -rotate 180 $in $out

    type imagefile { }   // Declare a "file" type.

    app (imagefile output) rotate (imagefile input)
    {
      convert "-rotate" 180 @input @output;
    }

    imagefile image <"m101.2010.0601.jpg">;

    imagefile newimage <"output.jpg">;

    newimage = rotate(image);
Execution is driven by data flow

    (int result) myproc (int input)
    {
      int j = f(input);
      int k = g(input);
      result = j + k;
    }

• j=f() and k=g() are computed in parallel.
• This parallelism is automatic, based on futures.
• It works recursively down the script's call graph.
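The futures idea on this slide can be illustrated outside Swift. The following is a minimal Python analogy, not Swift code and not how the Swift engine is implemented: `f` and `g` are hypothetical stand-ins for the slide's functions, and the thread pool has to be requested explicitly, whereas Swift derives the parallelism from the data dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


def f(x: int) -> int:
    # Hypothetical stand-in for the slide's f().
    return x * 2


def g(x: int) -> int:
    # Hypothetical stand-in for the slide's g().
    return x + 3


def myproc(value: int) -> int:
    # Submit both calls at once; each submit() returns a future immediately.
    with ThreadPoolExecutor(max_workers=2) as pool:
        j = pool.submit(f, value)
        k = pool.submit(g, value)
        # The sum depends on both futures, so it waits for both results.
        return j.result() + k.result()


if __name__ == "__main__":
    print(myproc(5))
```

Neither call waits for the other; only the final addition blocks, which mirrors how `result = j + k` runs once both inputs have been assigned.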
Parallelism via foreach { }

    type imagefile;   // Declare a "file" type.

    app (imagefile output) rotate (imagefile input) {
      convert "-rotate" "180" @input @output;
    }

    // Map inputs from local directory
    imagefile observations[] <simple_mapper; prefix="m101-raw">;
    // Name outputs based on index
    imagefile flipped[] <simple_mapper; prefix="m101-flipped">;

    // Process all dataset members in parallel
    foreach obs,i in observations {
      flipped[i] = rotate(obs);
    }
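As a rough analogy for the foreach pattern, the Python sketch below applies one function to every member of a dataset in parallel and keeps outputs aligned with inputs by index. It is not Swift, and `rotate` here is a hypothetical stand-in that only derives an output name from the "raw" and "flipped" prefixes on the slide; it does not invoke ImageMagick or read any image.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def rotate(name: str) -> str:
    # Hypothetical stand-in for the app call: derive the output file name only.
    return name.replace("raw", "flipped")


def flip_all(observations: List[str]) -> List[str]:
    # pool.map runs the calls concurrently and yields results in input order,
    # so flipped[i] always corresponds to observations[i].
    with ThreadPoolExecutor() as pool:
        return list(pool.map(rotate, observations))


if __name__ == "__main__":
    for name in flip_all(["m101-raw-0.jpg", "m101-raw-1.jpg"]):
        print(name)
```

The file names in the usage example are made up for illustration; the slide gives only the mapper prefixes.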
Many domains process structured datasets
[Figure: workflow graph with many data files and many application programs. Input pairs 3a, 4a, 5a, 6a and ref (.h/.i) pass through align_warp/1, /3, /5, /7, then reslice/2, /4, /6, /8, then softmean/9 to build atlas.h and atlas.i; slicer/10, /12, /14 produce atlas_x.ppm, atlas_y.ppm, atlas_z.ppm, and convert/11, /13, /15 produce atlas_x.jpg, atlas_y.jpg, atlas_z.jpg.]
Swift Data Mapping

    type Study {
      Group g[];
    }
    type Group {
      Subject s[];
    }
    type Subject {
      Volume anat;
      Run run[];
    }
    type Run {
      Volume v[];
    }
    type Volume {
      Image img;
      Header hdr;
    }

[Figure: a mapping function or script connects the on-disk data layout to Swift's in-memory data model]
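To make the mapping idea concrete, here is a small Python sketch of what a mapping function might do for the `Volume` type: group flat file names into structures with an `img` and an `hdr` member. It is an illustration only, not Swift's mapper interface. The function name `map_volumes` is hypothetical, and the convention that `.i` files are images and `.h` files are headers is an assumption read off the file names in the workflow figures.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Dict, List


@dataclass(frozen=True)
class Volume:
    img: str  # path of the ".i" file (assumed to be the image)
    hdr: str  # path of the ".h" file (assumed to be the header)


def map_volumes(paths: List[str]) -> List[Volume]:
    # Group files by name without extension, e.g. "3a.h" and "3a.i" -> "3a".
    by_stem: Dict[str, Dict[str, str]] = {}
    for p in paths:
        path = PurePosixPath(p)
        by_stem.setdefault(path.stem, {})[path.suffix] = p
    # Keep only complete pairs, ordered by name for a stable array index.
    return [
        Volume(img=parts[".i"], hdr=parts[".h"])
        for _, parts in sorted(by_stem.items())
        if ".i" in parts and ".h" in parts
    ]


if __name__ == "__main__":
    for vol in map_volumes(["3a.h", "3a.i", "4a.i", "4a.h"]):
        print(vol)
```

The sketch groups by file name only, so it assumes all files sit in one directory; a real mapper would also have to decide how directories map onto the outer `Study`, `Group`, `Subject` and `Run` levels.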
Application: Protein structure prediction

To run:

    psim -s 1ubq.fas -pdb p -temp 100.0 -inc 25.0 >log

In Swift code:

    app (PDB pg, File log) predict (Protein pseq, Float temp, Float dt) {
      psim "-s" @pseq.fasta "-pdb" @pg "-temp" temp "-inc" dt;
    }

    Protein p <ext; exec="Pmap", id="1ubq">;
    ProtGeo structure;
    TextFile log;
    (structure, log) = predict(p, 100., 25.);

[Figure: the Fasta file (seq), t and dt feed the PSim application, which is wrapped by the Swift app function "predict()" and produces pg and log]

Encapsulation is the key to transparent distribution, parallelization, and provenance
Parallelism via foreach { }

    foreach sim in [1:1000] {
      (structure[sim], log[sim]) = predict(p, 100., 25.);
    }
    result = analyze(structure);

[Figure: 1000 predict() calls feed a single Analyze() step]
Application: 3D Protein structure prediction

    type Fasta;       // Primary protein sequence file in FASTA format
    type SecSeq;      // Secondary structure file
    type RamaMap;     // "Ramachandran" mapping info files
    type RamaIndex;
    type ProtGeo;     // PDB-format file: protein geometry, 3D atom coords
    type SimLog;

    type Protein {    // Input file struct to protein simulator
      Fasta fasta;    // Sequence to predict structure of
      SecSeq secseq;  // Initial secondary structure to use
      ProtGeo native; // 3D structure from experimental data when known
      RamaMap map;
      RamaIndex index;
    }

    type PSimCf {     // Science configuration parameters to simulator
      float st;
      float tui;
      float coeff;
    }

    type ProtSim {    // Output file struct from protein simulator
      ProtGeo pgeo;
      SimLog log;
    }
Protein structure prediction

    app (ProtGeo pgeo) predict (Protein pseq)
    {
      PSim @pseq.fasta @pgeo;
    }

    (ProtGeo pg[]) doRound (Protein p, int n) {
      foreach sim in [0:n-1] {
        pg[sim] = predict(p);
      }
    }

    Protein p <ext; exec="Pmap", id="1af7">;
    ProtGeo structure[];
    int nsim = 10000;
    structure = doRound(p, nsim);
Protein structure prediction

    (ProtSim psim[]) doRoundCf (Protein p, int n, PSimCf cf) {
      foreach sim in [0:n-1] {
        psim[sim] = predictCf(p, cf.st, cf.tui, cf.coeff);
      }
    }

    (boolean converged) analyze (ProtSim prediction[], int r, int numRounds)
    {
      if (r == (numRounds-1)) {
        converged = true;
      }
      else {
        converged = test_convergence(prediction);
      }
    }
Protein structure prediction

    ItFix (Protein p, int nsim, int maxr, float temp, float dt)
    {
      ProtSim prediction[][];
      boolean converged[];
      PSimCf config;

      config.st = temp;
      config.tui = dt;
      config.coeff = 0.1;

      iterate r {
        prediction[r] = doRoundCf(p, nsim, config);
        converged[r] = analyze(prediction[r], r, maxr);
      } until (converged[r]);
    }
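The control flow of ItFix (run a round of parallel simulations, test for convergence, stop at convergence or after the last allowed round) can be sketched in Python as follows. This is an analogy under stated assumptions, not the Swift implementation: `it_fix`, `run_sim` and `test_convergence` are hypothetical names, and the simulation and convergence test in the usage example are toy functions invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Callable, List


def it_fix(
    run_sim: Callable[[int, int], float],
    test_convergence: Callable[[List[float]], bool],
    nsim: int,
    max_rounds: int,
) -> List[List[float]]:
    """Run rounds of nsim parallel simulations until converged or out of rounds."""
    rounds: List[List[float]] = []
    with ThreadPoolExecutor() as pool:
        for r in range(max_rounds):
            # One round: nsim independent simulations, run concurrently.
            rounds.append(list(pool.map(partial(run_sim, r), range(nsim))))
            # Like analyze(): the last allowed round always ends the loop.
            if r == max_rounds - 1 or test_convergence(rounds[r]):
                break
    return rounds


if __name__ == "__main__":
    # Toy stand-ins: the "score" shrinks each round; converge below 0.3.
    toy_sim = lambda r, sim: 1.0 / (r + 1)
    toy_test = lambda scores: max(scores) < 0.3
    print(len(it_fix(toy_sim, toy_test, nsim=4, max_rounds=10)))
```

As in the Swift version, each round's results are kept (here as one list per round), so later analysis can inspect every round rather than only the final one.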