Pegasus: Introducing Integrity to Scientific Workflows
Karan Vahi – vahi@isi.edu
https://pegasus.isi.edu
HTCondor DAGMan
Compute Pipelines – Building Blocks
• DAGMan is a reliable and scalable workflow executor
  • Sits on top of the HTCondor Schedd
  • Can handle very large workflows
  • Has useful reliability features built in
    • Automatic job retries and rescue DAGs (recover from where you left off in case of failures)
    • Throttling for jobs in a workflow
• However, it is still up to the user to figure out
  • Data management
    • How do you ship in the small or large amounts of data required by your pipeline, and which protocols do you use?
  • Data placement
  • How best to leverage different infrastructure setups
    • OSG has no shared filesystem, while XSEDE and your local campus cluster have one!
  • Debugging and monitoring computations
    • Correlate data across lots of log files
    • Need to know what host a job ran on and how it was invoked
  • Restructuring workflows for improved performance
    • Short running tasks?
Why Pegasus?
• Automates complex, multi-stage processing pipelines
• Enables parallel, distributed computations
• Portable: describe once, execute multiple times
• Automatically executes data transfers
• Reusable, aids reproducibility
• Records how data was produced (provenance)
• Provides tools to handle and debug failures
• Keeps track of data and files
Automate – Recover – Debug
NSF-funded project since 2001, in close collaboration with the HTCondor team.
DAG – directed acyclic graph
DAG in XML: a portable description – users don't worry about low-level execution details.
Pegasus adds the following jobs to the executable workflow:
• stage-in job – transfers the workflow input data
• clustered job – groups small jobs together to improve performance
• cleanup job – removes unused data
• stage-out job – transfers the workflow output data
• registration job – registers the workflow output data
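For concreteness, here is a minimal sketch of such a portable description written with the Pegasus DAX3 Python API (the Pegasus 4.x era API that produces the XML DAX above). The executable and file names are invented for illustration; the stage-in, stage-out, cleanup, clustered and registration jobs are added by Pegasus at planning time, not written by the user.

    # Minimal DAX sketch (Pegasus DAX3 API); names are illustrative only.
    from Pegasus.DAX3 import ADAG, Job, File, Link

    dax = ADAG("example-pipeline")

    raw = File("input.dat")            # raw input, resolved via the replica catalog
    intermediate = File("stage1.dat")  # intermediate file, managed by Pegasus
    result = File("result.dat")        # final output, staged out and registered

    preprocess = Job(name="preprocess")
    preprocess.addArguments("-i", raw, "-o", intermediate)
    preprocess.uses(raw, link=Link.INPUT)
    preprocess.uses(intermediate, link=Link.OUTPUT)
    dax.addJob(preprocess)

    analyze = Job(name="analyze")
    analyze.addArguments("-i", intermediate, "-o", result)
    analyze.uses(intermediate, link=Link.INPUT)
    analyze.uses(result, link=Link.OUTPUT, transfer=True, register=True)
    dax.addJob(analyze)

    dax.depends(parent=preprocess, child=analyze)

    with open("pipeline.dax", "w") as f:
        dax.writeXML(f)                # the planner turns this abstract DAG
                                       # into an executable HTCondor DAG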
Data Staging Configurations
Condor I/O (HTCondor pools, OSG, …)
• Worker nodes do not share a file system
• Data is pulled from / pushed to the submit host via HTCondor file transfers
• Staging site is the submit host
Non-shared File System (clouds, OSG, …)
• Worker nodes do not share a file system
• Data is pulled from / pushed to a staging site, possibly not co-located with the computation (e.g. Amazon EC2 with S3)
Shared File System (HPC sites, XSEDE, campus clusters, …)
• I/O is directly against the shared file system
Pegasus guarantee – wherever and whenever a job runs, its inputs will be in the directory where it is launched.
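Which of these modes Pegasus uses is driven by configuration. As a rough sketch, assuming the standard Pegasus properties file and the property name/values as recalled from the Pegasus 4.x documentation (treat them as an assumption and check the docs for your version):

    # pegasus.properties (sketch; property values as recalled from the docs)
    pegasus.data.configuration = condorio      # HTCondor file transfers; staging site is the submit host
    # pegasus.data.configuration = nonsharedfs # separate staging site, e.g. S3
    # pegasus.data.configuration = sharedfs    # run directly against the compute site's shared file system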
pegasus-transfer
• Pegasus' internal data transfer tool, with support for a number of different protocols
• Directory creation, file removal
  • If the protocol supports it, used for cleanup
• Two-stage transfers
  • e.g. GridFTP to S3 = GridFTP to local file, then local file to S3
• Parallel transfers
• Automatic retries
• Credential management
  • Uses the appropriate credential for each site and each protocol (even 3rd party transfers)
Supported protocols: HTTP, SCP, GridFTP, Globus Online, iRods, Amazon S3, Google Storage, SRM, FDT, stashcp, cp, ln -s
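The two-stage transfer and automatic retry behaviour can be pictured with a small Python sketch. This is not the actual pegasus-transfer code; gridftp_get() and s3_put() are hypothetical placeholders for protocol-specific clients.

    # Sketch of a two-stage transfer (e.g. GridFTP -> S3) with automatic retries.
    # gridftp_get() and s3_put() are hypothetical placeholders, not Pegasus internals.
    import os
    import tempfile
    import time

    def gridftp_get(url, local_path):
        raise NotImplementedError("placeholder for a GridFTP download")

    def s3_put(local_path, url):
        raise NotImplementedError("placeholder for an S3 upload")

    def two_stage_transfer(src_url, dst_url, attempts=3):
        for attempt in range(1, attempts + 1):
            fd, local_tmp = tempfile.mkstemp()
            os.close(fd)
            try:
                gridftp_get(src_url, local_tmp)   # stage 1: remote source -> local file
                s3_put(local_tmp, dst_url)        # stage 2: local file -> remote destination
                return
            except Exception:
                time.sleep(2 ** attempt)          # back off, then retry
            finally:
                if os.path.exists(local_tmp):
                    os.remove(local_tmp)
        raise RuntimeError("transfer failed after %d attempts" % attempts)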
Scientific Workflow Integrity with Pegasus
NSF CICI Awards 1642070, 1642053, and 1642090
GOALS
• Provide additional assurances that a scientific workflow is not accidentally or maliciously tampered with during its execution
• Allow for detection of modification to its data or executables at later dates, to facilitate reproducibility
• Integrate cryptographic support for data integrity into the Pegasus Workflow Management System
PIs: Von Welch, Ilya Baldin, Ewa Deelman, Steve Myers
Team: Omkar Bhide, Rafael Ferreira da Silva, Randy Heiland, Anirban Mandal, Rajiv Mayani, Mats Rynge, Karan Vahi
cacr.iu.edu/projects/swip/
Challenges to Scientific Data Integrity
• Modern IT systems are not perfect – errors creep in.
• At modern "Big Data" sizes we are starting to see checksums breaking down.
• Plus there is the threat of intentional changes: malicious attackers, insider threats, etc.
Motivation: CERN Study of Disk Errors
• Examined disk, memory, and RAID 5 errors.
• "The error rates are at the 10^-7 level, but with complicated patterns." E.g. 80% of disk errors were 64k regions of corruption.
• Explored many fixes and their often significant performance trade-offs.
https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf
Motivation: Network Corruption
• Network router software inadvertently corrupts TCP data and checksum!
• XSEDE and Internet2 example from 2013 (Brocade TSB 2013-162-A).
• Second similar case in 2017 with the FreeSurfer/Fsurf project.
https://www.xsede.org/news/-/news/item/6390
Motivation: Software Failure
• A bug in the StashCache data transfer software would occasionally cause silent failure (it failed but returned zero).
• Internal to the workflow, this was detected when the input to a stage of the workflow was found to be corrupted and a retry was invoked (60k retries and an extra 2 years of CPU hours!).
• However, failures in the final staging out of data were not detected, because there was no next workflow stage to catch the errors.
• The workflow management system, believing the workflow was complete, cleaned up, so the final data was incomplete and all intermediary data was lost. Ten CPU-years of computing came to naught.
Enter Application-Level Checksums
• Application-level checksums address these and other issues (e.g. malicious changes).
• In use by many data transfer applications: scp, Globus/GridFTP, some parts of HTCondor, etc.
• Including all aspects of the application workflow requires either manual application by a researcher or integration into the application(s).
Automatic Integrity Checking – Goals
• Capture data corruption in a workflow by performing integrity checks on data (see the sketch below)
• Come up with a way to query, record and enforce checksums for different types of files
  • Raw input files – input files fetched from the input data server
  • Intermediate files – files created by jobs in the workflow
  • Output files – final output files a user is actually interested in, transferred to the output site
• Modify Pegasus to perform integrity checks at appropriate places in the workflow
• Provide users a dial on the scope of integrity checking
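A minimal sketch of what such a per-file check amounts to (plain Python with hashlib; an illustration, not the Pegasus implementation):

    # Compute and verify a sha256 checksum for a file; illustration only.
    import hashlib

    def sha256_of(path, chunk_size=4 * 1024 * 1024):
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify(path, expected_sha256):
        # In Pegasus, a mismatch triggers a job failure, which can then be retried
        if sha256_of(path) != expected_sha256:
            raise RuntimeError("integrity check failed for %s" % path)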
Automatic Integrity Checking
Pegasus performs integrity checks on input files before a job starts on the remote node.
• For raw inputs, checksums are specified in the input replica catalog along with the file locations; Pegasus can compute checksums while transferring if they are not specified (see the sketch below).
• Checksums for all intermediate and output files are generated and tracked within the system.
• Support for sha256 checksums.
A failure is triggered if a checksum does not match.
Introduced in Pegasus 4.9.
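For raw inputs, one way to supply checksums is in a file-based replica catalog entry; the snippet below computes the digest and prints such an entry. The checksum.type / checksum.value attribute names follow the Pegasus 4.9 integrity-checking documentation as recalled here, and the paths are invented, so treat the exact syntax as an assumption to verify against the docs for your version.

    # Sketch: emit a file-based replica catalog entry carrying a sha256 checksum.
    # Attribute names (checksum.type, checksum.value) and paths are assumptions.
    import hashlib

    lfn = "input.dat"                    # logical file name used in the workflow
    path = "/data/inputs/input.dat"      # hypothetical physical location

    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)

    print('%s file://%s site="local" checksum.type="sha256" checksum.value="%s"'
          % (lfn, path, digest.hexdigest()))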
Initial Results with Integrity Checking on OSG
• An OSG-KINC workflow (50,606 jobs) encountered 60 integrity errors in the wild (production OSG). The problematic jobs were automatically retried and the workflow finished successfully.
• The 60 errors took place on 3 different hosts: the first at UColorado, and groups 2 and 3 on UNL hosts.
Error analysis
• Host 2 had 3 errors, all the same bad checksum for the "kinc" executable, with only a few seconds between the jobs.
• Host 3 had 56 errors, all the same bad checksum for the same data file, over a timespan of 64 minutes. The site-level cache still had a copy of this file and it was the correct file. Thus we suspect that the node-level cache got corrupted.