Compute and data management strategies for grid deployment of high throughput protein structure studies
Ian Stokes-Rees, Piotr Sliz
Harvard Medical School
Many Task Computing on Grids and Supercomputers (MTAGS) 2010
Overview
- Context: structural biology computing (think proteins)
- Infrastructure: Open Science Grid
- Computational model: application, data, workflow
- Identity management and security
- Perspectives & conclusions
SBGrid Consortium
(Slide shows a photo grid of member PIs and their institutions: Harvard and affiliates, Cornell, Washington U. School of Medicine, Rosalind Franklin, NIH, UMass Medical, U. Washington, Maryland, Brandeis, UC Davis, Tufts, UCSF, Columbia, Rockefeller, Stanford, Yale, CalTech, Rice, Vanderbilt, WesternU, UCSD, Thomas Jefferson, NE-CAT, and the Center for Structural Biology. Not pictured: University of Toronto: L. Howell, E. Pai, F. Sicheri; NHRI (Taiwan): G. Liou; Trinity College, Dublin: Amir Khan.)
Northeast BioGrid Virtual Organization
- Biomedical researchers, life sciences
- Universities (e.g. Tufts University School of Medicine), hospitals, government agencies
- Currently Boston-focused
Protein Structure Studies
- Pipeline: sample, imaging, data, fragments, structure
- Techniques: X-ray crystallography, cryo-electron microscopy
- O(1e5) fragments processed using grid infrastructure
Single Structure Study
Broad Structure Study
550 structures x 4,000 iterations = 2.2 million iterations in the broad study
Single Structure Wide Search
- 100,000 iterations
- 20,000 core-hours
- 12 hours wall-clock (typical), i.e. roughly 1,700 cores running concurrently on average
Open Science Grid
- US national cyberinfrastructure, primarily used for high energy physics computing
- 80 sites, O(1e5) job slots, O(1e6) core-hours per day
- PB-scale aggregate storage
- Virtual organizations include LIGO, SBGrid, and Engage
Typical Layered Environment
- Command line application (e.g. Fortran binary)
- Friendly application API wrapper (Python API), sketched below
- Batch execution wrapper for N iterations: multi-exec wrapper (MAP)
- Results extraction and aggregation: result aggregator (REDUCE)
- Grid job management wrapper (grid management)
- Web interface: forms, views, static HTML results
GOAL: eliminate the shell scripts often found as the "glue" language between layers
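A minimal sketch of the API-wrapper layer, assuming a generic command-line binary; APP_BIN, run_search, and the positional-argument convention are illustrative, not the actual SBGrid API:

```python
import os
import subprocess
import tempfile

# Hypothetical install path; real deployments resolve this via xconfig
APP_BIN = os.environ.get("APP_BIN", "/opt/apps/search/bin/search")

def run_search(model_pdb, data_mtz, workdir=None):
    """Run one invocation of the CLI binary; return (exitcode, output path)."""
    workdir = workdir or tempfile.mkdtemp(prefix="mr_")
    outpath = os.path.join(workdir, "stdout.txt")
    with open(outpath, "w") as out:
        # capture STDOUT and STDERR together, as the workflow retains both
        rc = subprocess.call([APP_BIN, model_pdb, data_mtz],
                             cwd=workdir, stdout=out, stderr=subprocess.STDOUT)
    return rc, outpath
```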
Shell Scripting vs. Structured Language

Shell scripting:
✓ Rich set of easy-to-use file system operations
✓ Quick to translate "experimental" operations from the command line into a reusable script
- Limited error handling
- Limited data structures
- Portability
- Difficult to build larger systems
- Poor web integration

Structured language:
✓ Good modularization
✓ Good Web/RPC integration
✓ Good error handling
✓ Rich data structures
✓ Portable
✓ GUI interfaces possible
- File system interaction difficult
- Configuration and parameter processing
- Translating CLI operations laborious
Developing MTC Workflows
- Single CLI execution
- Job submission
- Configuration API for invocation
- Results suitable for aggregation
- Multi-exec format, important for short invocations (see the sketch below)
- Meta-data suitable for MTC management and metrics
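A sketch of the multi-exec idea: one grid job consumes a task list and runs many short invocations, amortizing per-job scheduling overhead. The one-task-per-line file format and the ./app binary are assumptions:

```python
import subprocess
import sys
import time

def run_batch(task_file, app="./app"):
    """Execute every task listed in task_file inside a single grid job."""
    for line in open(task_file):
        if not line.strip():
            continue
        task_id, *args = line.split()  # assumed format: "<task_id> <arg> ..."
        t0 = time.time()
        with open(task_id + ".out", "w") as out:
            rc = subprocess.call([app] + args,
                                 stdout=out, stderr=subprocess.STDOUT)
        # one metadata line per task: id, exit code, runtime in seconds
        print(task_id, rc, int(time.time() - t0))

if __name__ == "__main__":
    run_batch(sys.argv[1])
```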
Application Model
- Application binary
- API wrapper (single invocation)
  - shex: Python module for shell-like operations
  - xconfig: Python module for environment and module configuration
- Grid wrapper: grid job description for a single invocation
- Workflow generator: create DAG and job descriptors (sketch below)
- Standard results format
- Standard meta-data format
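A sketch of the workflow-generator step, emitting one Condor submit descriptor per task plus a DAGMan file that fans all tasks into a single aggregation node; grid_wrapper.sh and aggregate.sub are hypothetical file names:

```python
def write_workflow(task_ids, outdir="."):
    """Write per-task Condor submit files plus a DAGMan workflow file."""
    dag_lines = []
    for tid in task_ids:
        with open(f"{outdir}/{tid}.sub", "w") as fh:
            fh.write("universe = vanilla\n"
                     "executable = grid_wrapper.sh\n"
                     f"arguments = {tid}\n"
                     f"output = {tid}.out\n"
                     f"error = {tid}.err\n"
                     "queue\n")
        dag_lines.append(f"JOB {tid} {tid}.sub")
    # REDUCE step: a single aggregation node that runs after every task
    dag_lines.append("JOB aggregate aggregate.sub")
    dag_lines += [f"PARENT {tid} CHILD aggregate" for tid in task_ids]
    with open(f"{outdir}/workflow.dag", "w") as fh:
        fh.write("\n".join(dag_lines) + "\n")
```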
Modules
- shex: http://portal.nebiogrid.org/devel/projects/shex/shex
- xconfig: http://portal.nebiogrid.org/devel/projects/xconfig/xconfig

Results (one record per job):

jobset  job     status  start       runtime  exitcode  score
ba9     1scza_  OK      1287230825  635      0         614

Job meta-data (JOB_MARKER-delimited entry):

JOB_MARKER WQCG-Harvard-OSG tuscany.med.harvard.edu 1287198043 ba9-1c5pa_ sbgrid@tuscany01.med.harvard.edu:/scratch/condor/execute/dir_16947/glide_e16995/execute/dir_27129 Sat Oct 16 03:00:43 UTC 2010

Application deployment
- Locally host "gold standard"
- Replicate to a predictable location at all sites: $OSG_APP/sbgrid

System configuration
- Sanity check basic pre-requisites (memory, disk space, applications, common data sets, directory existence and permissions, network)
- Environment: PATH, LD_LIBRARY_PATH, PYTHONPATH, etc.
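A sketch of parsing the per-job results records, assuming the whitespace-delimited layout of the example row above; real result files may carry more fields:

```python
FIELDS = ["jobset", "job", "status", "start", "runtime", "exitcode", "score"]

def parse_results(path):
    """Return one dict per job record, skipping headers and malformed lines."""
    rows = []
    for line in open(path):
        parts = line.split()
        if len(parts) != len(FIELDS) or parts[0] == "jobset":
            continue
        rows.append(dict(zip(FIELDS, parts)))
    return rows
```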
Data Model (I)

Per-job data
- Need to minimize this to the smallest unique set
- Even then, may need to pre-stage data to a remote file server
- Staged using the job manager, or pulled by rsync, curl (HTTP), or scp (see the sketch below)
- Removed on job completion

Per-jobset (workflow instance) data
- Pre-staged to each site at jobset creation time: $OSG_DATA/users/$USERNAME/workflows/$WORKFLOWNAME
- Fetched by each job to worker node local disk (or read from NFS)
- Removed on jobset cleanup or by a weekly tmpwatch sweep
- NEW: large data sets for a workflow instance are pre-staged to UCSD and pulled on a per-job basis (insufficient quota, but big pipes)
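A sketch of the pull-based staging fallback chain named above (rsync, then curl over HTTP, then scp); the file server name and paths are placeholders:

```python
import subprocess

def stage_in(relpath, dest="."):
    """Try each transfer tool in turn; return True on first success."""
    attempts = [
        ["rsync", "-a", f"fileserver.example.org::data/{relpath}", "."],
        ["curl", "-sSfO", f"http://fileserver.example.org/data/{relpath}"],
        ["scp", f"fileserver.example.org:/data/{relpath}", "."],
    ]
    for cmd in attempts:
        # run in dest so curl's -O (write to remote file name) lands there too
        if subprocess.call(cmd, cwd=dest) == 0:
            return True
    return False
```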
Data Model (II)

User project data
- Pre-staged and manually managed at each site by the user: $OSG_DATA/users/$USERNAME/projects/$PROJECTNAME
- Fetched by each job to worker node local disk (or read from NFS)
- Removed by the user, or manually by administrators on a quota basis

Static data
- Maintain a "gold standard" and rsync or bulk update as required (sketch below)
- 20 GB of protein models pre-staged to $OSG_DATA/sbgrid/biodb
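A sketch of the "gold standard" replication step, mirroring the protein model database into $OSG_DATA/sbgrid/biodb at a site; the rsync source URL is a placeholder:

```python
import os
import subprocess

def sync_biodb():
    """Mirror the centrally hosted gold standard into the site's $OSG_DATA."""
    dest = os.path.join(os.environ["OSG_DATA"], "sbgrid", "biodb")
    os.makedirs(dest, exist_ok=True)
    # --delete keeps the replica an exact mirror of the gold standard
    return subprocess.call(
        ["rsync", "-a", "--delete",
         "rsync://goldstandard.example.org/biodb/", dest])
```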
Workflow Model

Continuous aggregation
- "In progress" view of data: accept the possibility of corruption
- Track errors sorted by execution site, the key predictor of network, disk, library, and config problems (see the sketch below)
- Retain only key output: STDOUT, STDERR, and a single per-job "results" file
- Enough to easily retry arbitrary subsets of the overall jobset (timeout, error, etc.)

On-demand updates
- User-driven "expensive" status updates on queued, running, complete, and failed jobs, plus aggregated results and report generation

Finalized results
- Cleaned results
- Augmented results (inclusion of static per-job information)
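A sketch of tracking errors by execution site during continuous aggregation; it assumes result rows (as parsed in the earlier sketch) have been joined with a site field taken from the JOB_MARKER meta-data:

```python
from collections import Counter

def errors_by_site(rows):
    """Tally failed jobs per execution site, worst sites first."""
    tally = Counter()
    for row in rows:
        if row["status"] in ("ERROR", "TIMEOUT"):
            tally[row.get("site", "unknown")] += 1
    return tally.most_common()
```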
Application Exit States
- OK: job executed the application correctly and usable results are returned (done)
- NO_SOLUTION: job executed, but no usable results (failed, don't rerun)
- ERROR: job failed to execute properly (failed, rerun up to retry limit)
- SHORT: job executed and produced output, but runtime is suspicious (complete, but don't trust: rerun)
- TIMEOUT: job was aborted before completing, no results available (cancelled, don't rerun)
A disposition sketch follows.
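A sketch of turning these exit states into rerun decisions when rebuilding a jobset for resubmission; the retry cap, and applying it to SHORT as well as ERROR, are assumptions:

```python
MAX_RETRIES = 3  # illustrative cap, not a documented SBGrid setting

def should_rerun(status, attempts):
    """Decide whether a job with the given exit state goes back in the jobset."""
    if status in ("ERROR", "SHORT"):
        return attempts < MAX_RETRIES
    # OK is done; NO_SOLUTION and TIMEOUT are never rerun
    return False
```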
Integrated View & Debug
- Web portal access to files: X.509 access control (full file management), .htpasswd read-only sharing
- CLI or web interaction with running jobs
- Web view of data (files, tables, reports, AJAX)
- Web "file browsing" of all results, with augmented hyperlinking to details or static information
- ssh/CLI access to files
- Users need to be able to drill down into the 1 million files and 5 GB of data generated by the execution of their workflow
Access, IdM and Security
- Relying heavily on OSG facilities for a federated environment: X.509 DNs, proxy certs, MyProxy (proxy check sketched below)
- LDAP for local accounts
- Access control: mod_gridsite and GACL policies
- Data access: Apache and mod_gridsite
- Service access: web portal and GSI-enabled ssh
- Challenge: making facilities available to the user community; alternatives to the web portal and gsi-ssh that run local to the user would be nice
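A sketch of a pre-submission sanity check on the user's X.509 proxy certificate using the standard Globus grid-proxy-info tool; the 4-hour minimum lifetime is an arbitrary example threshold:

```python
import subprocess

def proxy_ok(min_seconds=4 * 3600):
    """Return True if a valid proxy exists with enough lifetime left."""
    try:
        # grid-proxy-info -timeleft prints the remaining lifetime in seconds
        out = subprocess.check_output(["grid-proxy-info", "-timeleft"])
        return int(out.strip()) >= min_seconds
    except (subprocess.CalledProcessError, OSError, ValueError):
        return False
```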