
Data lineage model for Taverna workflows with lightweight annotation requirements
Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble
School of Computer Science, The University of Manchester, UK
IPAW'08 – Salt Lake City, Utah, June 2008

Context and scope
Ongoing work on a new provenance component for Taverna
• myGrid consortium
Scope:
• capture raw provenance events
  – data transformations, data transfers
• store one lineage graph for each dataflow execution
• query over single or multiple lineage graphs

Example (Taverna) dataflow
QTL -> genes -> Kegg pathways

Some user questions on lineage
• on a single workflow run:
  – find all genes that participate in some pathway p
  – find all pathways derived from Uniprot genes
  – describe the complete derivation of each pathway in which gene g is involved
• on a collection of runs:
  – find all distinct pathways produced by runs of a dataflow [over a period of time, produced by a member of my group, ...]

Shortcomings of lineage data
• Granularity
  – risk of returning trivial answers
  – “all outputs depend on all inputs”
• Semantics
  – results not expressed in the language of the designer
• Abstraction level, noise
  – the “latent data model”
  – many processors are irrelevant
  – shims, mundane tasks

The need for selective annotations
• As long as processors are black boxes, these remain difficult problems
• Adding annotations to processors is tempting
Scope of this work: to explore the “gray box” region
• simple annotations with minimal semantics
• driving principle: justified by technical benefits
  – precision of query results
  – efficiency of query processing

Test dataflow model
[Figure: test dataflow. Input labels: configuration, documents. Output labels: number of duplicates, merged results. Processors and ports:
• P1 extract query terms: inputs P1VI1, P1VI2; output P1VO1
• P2 query prep: input P2VI1; output P2VO1
• P3 query1: input P3VI1; output P3VO1
• P4 query 2: input P4VI1; output P4VO1
• P5 post-proc: input P5VI1; output P5VO1
• P6 merge results: inputs P6VI1, P6VI2; outputs P6VO1, P6VO2
• P7 sort: input P7VI; output P7VO]

Two main annotation types
Focusing: processor selection
• some processors are more interesting than others
• “boring” annotations
• query-time user selection of interesting processors
Precision: fine-grained lineage tracing
• goal: trace lineage of individual items within a collection

Abstraction by modularization
[Figure: dataflow fragment; labels: Lucene_query, NERecognize, extract diseases from OMIM, shims]

Abstraction by selection
[Figure; label: select]

Focusing – processor selection
[Figure: the test dataflow annotated with values P1VI1 = a1, P1VI2 = a2, P1VO1 = b, P2VI1 = b, P4VI1 = b, P7VO = g]
• P4 is the only interesting processor
• Assume all values atomic
Query: lineage(P7VO, {P4})
Goal:
• avoid recursive queries on instance tables
Idea:
• use recursion on the static model to generate a targeted query
• execute the query only once
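The idea on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the port wiring is an assumption read off the test dataflow figure, and the `port_binding` table, its columns, the run identifier `run1`, and the function names are all hypothetical. The recursion touches only the static model; the instance data is queried once, with a flat query.

```python
import sqlite3
from typing import Dict, List, Set, Tuple

# Static model. The wiring below is an assumption based on the figure.
# ARCS maps each input port to the upstream output port that feeds it.
ARCS: Dict[str, str] = {
    "P2VI1": "P1VO1", "P4VI1": "P1VO1",
    "P3VI1": "P2VO1", "P5VI1": "P4VO1",
    "P6VI1": "P3VO1", "P6VI2": "P5VO1",
    "P7VI": "P6VO1",
}
# INPUTS lists the input ports of each processor.
INPUTS: Dict[str, List[str]] = {
    "P1": ["P1VI1", "P1VI2"], "P2": ["P2VI1"], "P3": ["P3VI1"],
    "P4": ["P4VI1"], "P5": ["P5VI1"], "P6": ["P6VI1", "P6VI2"],
    "P7": ["P7VI"],
}
# OWNER maps each output port to the processor that produces it.
OWNER: Dict[str, str] = {
    "P1VO1": "P1", "P2VO1": "P2", "P3VO1": "P3", "P4VO1": "P4",
    "P5VO1": "P5", "P6VO1": "P6", "P6VO2": "P6", "P7VO": "P7",
}


def focused_ports(target: str, focus: Set[str]) -> List[str]:
    """Walk the static model upstream from `target` and collect the
    input ports of the processors named in `focus`."""
    found: List[str] = []
    seen: Set[str] = set()

    def visit(out_port: str) -> None:
        proc = OWNER[out_port]
        # Black-box assumption: an output depends on all inputs.
        for in_port in INPUTS[proc]:
            if in_port in seen:
                continue
            seen.add(in_port)
            if proc in focus:
                found.append(in_port)
            if in_port in ARCS:
                visit(ARCS[in_port])

    visit(target)
    return found


def lineage(db: sqlite3.Connection, run_id: str, target: str,
            focus: Set[str]) -> List[Tuple[str, str]]:
    """Generate one targeted query from the static traversal and run it once."""
    ports = focused_ports(target, focus)
    if not ports:
        return []
    marks = ", ".join("?" for _ in ports)
    sql = (f"SELECT port, value FROM port_binding "
           f"WHERE run_id = ? AND port IN ({marks}) ORDER BY port")
    return db.execute(sql, [run_id, *ports]).fetchall()


# Hypothetical instance table holding the values shown on the slide.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE port_binding (run_id TEXT, port TEXT, value TEXT)")
db.executemany(
    "INSERT INTO port_binding VALUES ('run1', ?, ?)",
    [("P1VI1", "a1"), ("P1VI2", "a2"), ("P1VO1", "b"),
     ("P2VI1", "b"), ("P4VI1", "b"), ("P7VO", "g")],
)
result = lineage(db, "run1", "P7VO", {"P4"})
```

Under these assumptions, the traversal for lineage(P7VO, {P4}) selects only port P4VI1, so the generated query needs no recursion or joins over the instance data.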

Precision: elements within collections
Problem: xform() also applies to list values
• It may be impossible to trace individual elements
  – “which pathways (out) depend on which genes (in)”?
Goal: extend the query generation idea just sketched to trace element-level lineage within collections
Approach: exploit static typing of Taverna processors
• matching nesting levels: P1VO : l(s) = [a, b, c] feeds P2VI : l(s) = [a, b, c]
• mismatched nesting levels: P1VO : l(s) = [a, b, c] feeds P2VI : s
• Taverna resolves mismatches on nesting levels: (map P2 [a, b, c])
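The (map P2 [a, b, c]) behaviour can be illustrated with a small sketch. It is a simplification that handles a single input port only, and the names `depth` and `invoke` are hypothetical rather than part of Taverna. When the actual nesting level of a value exceeds the declared level of the port, the processor is applied once per element and each output element is linked to the input element at the same index path.

```python
from typing import Any, Callable, List, Tuple

IndexPath = Tuple[int, ...]


def depth(value: Any) -> int:
    """Nesting level: 0 for an atom, 1 for a list of atoms, and so on."""
    if not isinstance(value, list):
        return 0
    return 1 + max((depth(v) for v in value), default=0)


def invoke(proc: Callable[[Any], Any], declared_depth: int, value: Any,
           path: IndexPath = ()) -> Tuple[Any, List[Tuple[IndexPath, IndexPath]]]:
    """Apply `proc`, iterating implicitly over any extra nesting levels.

    Returns the output and a list of (output index path, input index path)
    pairs that records element-level lineage.
    """
    if depth(value) <= declared_depth:
        # Types match: one invocation, lineage recorded at this path.
        return proc(value), [(path, path)]
    outputs: List[Any] = []
    pairs: List[Tuple[IndexPath, IndexPath]] = []
    for i, item in enumerate(value):
        out, sub = invoke(proc, declared_depth, item, path + (i,))
        outputs.append(out)
        pairs.extend(sub)
    return outputs, pairs


# P2 declares an input of type s but receives l(s): it is mapped over the list.
mapped_out, mapped_lineage = invoke(str.upper, 0, ["a", "b", "c"])
# P2 declares an input of type l(s): it consumes the whole list at once.
whole_out, whole_lineage = invoke(len, 1, ["a", "b", "c"])
```

In the mapped case the lineage pairs are per element, which is the precision the slide is after. In the whole-list case only one coarse pair is available.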

Loss of precision in transformations
“Lossless” transformations:
• PVI : s = a, PVO : s = a'
• PVI : s = a, PVO : l(s) = [x, y, z]
Lossy transformations:
• PVI : l(s) = [a, b, c], PVO : s = x, with x → [a, b, c]
  – possible behaviours: selection of an element, aggregation function f()
  – useful annotation: lineage(PVO) = f(PVI)
• PVI : l(s) = [a, b, c], PVO : l(s) = [x, y], with x → [a, b, c] and y → [a, b, c]
  – only useful annotation: P is index-preserving: PVO[i] = PVI[i]
  – PVO : l(s) = [a', b', c'], lineage(PVO[i]) = PVI[i]
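The effect of the index-preserving annotation on a list-to-list processor can be sketched as below. The function name and the length check are assumptions made for illustration. Without the annotation, each output element must conservatively be linked to every input element; with it, each output element is linked to exactly one.

```python
from typing import Dict, List


def list_lineage(n_in: int, n_out: int,
                 index_preserving: bool) -> Dict[int, List[int]]:
    """Map each output index to the input indices it may depend on."""
    if index_preserving:
        # Assumption: an index-preserving processor keeps the list length.
        if n_in != n_out:
            raise ValueError("index-preserving requires equal list lengths")
        return {i: [i] for i in range(n_out)}
    # Black box: every output element may depend on every input element.
    return {i: list(range(n_in)) for i in range(n_out)}


coarse = list_lineage(3, 2, index_preserving=False)
precise = list_lineage(3, 3, index_preserving=True)
```

Here `coarse` corresponds to the lossy case x → [a, b, c], y → [a, b, c], and `precise` to lineage(PVO[i]) = PVI[i].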

Cooperative processors
– Passive processors do not contribute explicit provenance info
– Cooperative processors actively feed metadata to the lineage service
[Figure: two processors P, one with PVI : l(s) = [a, b, c] and PVO : s = x, the other with PVI : l(s) = [a, b, c] and PVO : l(s) = [x, y]]
Static annotations:
• aggregation f()
• PVO[i] = PVI[i]
Dynamic annotations:
• selection: x = PVI[i]
• sorting: PVO = Π(PVI)
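A cooperative processor with dynamic annotations might look like the sketch below. The `LineageService` class and its `report` method are hypothetical stand-ins, not the interface of the actual lineage service. The sorting processor reports the permutation Π it applied, and the selection processor reports the index i it chose, because neither is known before execution.

```python
from typing import Dict, List


class LineageService:
    """Hypothetical stand-in that only stores reported index mappings."""

    def __init__(self) -> None:
        self.mappings: Dict[str, Dict[int, List[int]]] = {}

    def report(self, processor: str, mapping: Dict[int, List[int]]) -> None:
        # mapping: output index -> input indices it was derived from
        self.mappings[processor] = mapping


def cooperative_sort(name: str, values: List[str],
                     service: LineageService) -> List[str]:
    """Sort the list and report the permutation (PVO = Π(PVI))."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    service.report(name, {out_i: [in_i] for out_i, in_i in enumerate(order)})
    return [values[i] for i in order]


def cooperative_select(name: str, values: List[str],
                       service: LineageService) -> str:
    """Select the smallest element and report its index (x = PVI[i])."""
    i = min(range(len(values)), key=lambda k: values[k])
    service.report(name, {0: [i]})
    return values[i]


service = LineageService()
sorted_out = cooperative_sort("sort", ["c", "a", "b"], service)
selected = cooperative_select("select", ["c", "a", "b"], service)
```

A passive processor would return the same outputs without calling `report`, leaving the lineage service with only the coarse "all outputs depend on all inputs" information.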

Other annotations
[Figure: processor P with inputs PVI1, PVI2, PVI3 and output PVO]
• Distinction between configuration and input data
  – PVI3 is a configuration parameter
  – compare effect of different config. across multiple runs
• specific functional dependencies: [PVI1, PVI2] → PVO
• stateless processor
  – execute process ↔ retrieve provenance
More evaluation needed on these

Towards a 2 tier provenance model
“describe the derivation of each pathway in which gene g is involved”
[Figure: two-tier architecture.
• semantic tier: semantic overlays, semantic annotations, reference ontologies, resource query through Kegg
• structural tier: lineage service, lineage database (RDB), dataflow topology + annotations, raw lineage events
• Taverna runtime executing processors P1 to P6]

Conclusions
A data lineage model for Taverna workflows
• Raw lineage data has shortcomings
• A few selected lightweight annotations, added in a principled way, are a win-win:
  – helpful to users
  – and enable query optimization
• Form the base layer in a broader approach to efficient querying of semantic provenance for e-science
• Ongoing implementation
