Aug 2019: Integrity Protection for Scientific Workflow Data: Motivation and Initial Experiences
With the continued rise of scientific computing and the enormous increases in the size of data being processed, scientists must consider whether the processes for transmitting and storing data sufficiently assure the integrity of the scientific data. When integrity is not preserved, computations can fail and incur additional computational cost due to reruns, or worse, results can be corrupted in a manner not apparent to the scientist and produce invalid scientific results. Technologies such as TCP checksums, encrypted transfers, checksum validation, RAID, and erasure coding provide integrity assurances at different levels, but they may not scale to large data sizes and may not cover a workflow from end to end, leaving gaps in which data corruption can occur undetected.
In this talk, we will present our findings from the “Scientific Workflow Integrity with Pegasus” (SWIP) project by describing an approach to assuring data integrity, against both malicious and accidental corruption, for workflow executions orchestrated by the Pegasus Workflow Management System (WMS). A key goal of SWIP is to provide assurance that any changes to input data, executables, and output data associated with a given workflow can be efficiently and automatically detected. Toward this goal, SWIP has integrated data integrity protection into a newly released version of Pegasus WMS by automatically generating and tracking checksums both for input files when they are introduced and for files generated during execution. We will describe how we validate our integrity protection approach by leveraging Chaos Jungle, a toolkit that provides an environment for validating integrity verification mechanisms by allowing researchers to introduce a variety of integrity errors during data transfers and storage. We will also provide an analysis of the integrity errors and associated overheads that we encountered when running production workflows using Pegasus.
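The general idea of generating checksums when files are introduced and checking them again later can be illustrated with a minimal sketch. This is not the Pegasus WMS implementation: the function names (`sha256_of`, `record_checksums`, `find_corrupted`), the in-memory manifest, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(paths):
    """Build a manifest (hypothetical format) mapping each file path to its checksum."""
    return {str(p): sha256_of(p) for p in paths}


def find_corrupted(manifest):
    """Return the paths whose current checksum no longer matches the recorded one."""
    return [name for name, expected in manifest.items() if sha256_of(name) != expected]
```

In this sketch, a manifest would be recorded when input files enter the workflow and checked again before each job consumes them; a non-empty result from `find_corrupted` signals that a file changed in transit or in storage.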
Speaker Bios:
Anirban Mandal serves as the Assistant Director for the network research and infrastructure group at the Renaissance Computing Institute (RENCI), UNC-Chapel Hill. He leads efforts in science cyberinfrastructures. His research interests include resource provisioning, scheduling, performance analysis, and anomaly detection for distributed computing systems, cloud computing, and scientific workflows. Prior to joining RENCI, he earned his PhD degree in Computer Science from Rice University in 2006 and a Bachelor’s degree in Computer Science & Engineering from IIT Mumbai, India in 2000.
Mats Rynge is a computer scientist in the Science Automation Technologies group at the USC Information Sciences Institute. He is a developer on the Pegasus Workflow Management System and related projects. He is also involved in several national cyberinfrastructure deployments such as the Open Science Grid and XSEDE, for which he provides user support, software engineering, and system administration. Previously, he was at the Renaissance Computing Institute, where he was the technical lead on the RENCI Science TeraGrid Gateway and the Open Science Grid Engagement activities. Before that, he was a release manager on the NPACI NPACKage and NSF Middleware Initiative projects, where he planned, created, and tested software middleware stacks for larger science communities. He also worked on improving grid software as part of the Community Driven Improvement of Globus Software (CDIGS) and Coordinated TeraGrid Software and Services (CTSS) efforts.