Data lineage model for Taverna workflows with lightweight annotation requirements
Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble
School of Computer Science, The University of Manchester, UK
IPAW'08 – Salt Lake City, Utah, June 2008

Context and scope
Ongoing work on a new provenance component for Taverna
• myGrid consortium
Scope:
• capture raw provenance events
  – data transformations, data transfers
• store one lineage graph for each dataflow execution
• query over single or multiple lineage graphs

Example (Taverna) dataflow
QTL -> genes -> Kegg pathways

Some user questions on lineage
• on a single workflow run:
  – find all genes that participate in some pathway p
  – find all pathways derived from Uniprot genes
  – describe the complete derivation of each pathway in which gene g is involved
• on a collection of runs:
  – find all distinct pathways produced by runs of a dataflow [over a period of time, produced by a member of my group, ...]

Shortcomings of lineage data
• Granularity
  – risk of returning trivial answers
  – “all outputs depend on all inputs”
• Semantics
  – results not expressed in the language of the designer
• Abstraction level, noise
  – the “latent data model”
  – many processors are irrelevant
  – shims, mundane tasks

The need for selective annotations
• As long as processors are black boxes, these remain difficult problems
• Adding annotations to processors is tempting
Scope of this work: to explore the “gray box” region
• simple annotations with minimal semantics
• driving principle: justified by technical benefits
  – precision of query results
  – efficiency of query processing

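Test dataflow model configuration
[Figure: test dataflow with seven processors and their input (VI) and output (VO) ports]
• P1 extract query terms: inputs P1VI1, P1VI2; output P1VO1
• P2 query prep: input P2VI1; output P2VO1
• P3 query1: input P3VI1; output P3VO1
• P4 query 2: input P4VI1; output P4VO1
• P5 post-proc: input P5VI1; output P5VO1
• P6 merge results: inputs P6VI1, P6VI2; outputs P6VO1, P6VO2
• P7 sort results: input P7VI; output P7VO
Data labels in the figure: documents (at the P1 inputs), number of duplicates (at the P6 outputs), merged (at the P7 input)
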
Two main annotation types
Focusing: processor selection
• some processors are more interesting than others
• “boring” annotations
• query-time user selection of interesting processors
Precision: fine-grained lineage tracing
• goal: trace lineage of individual items within a collection

Abstraction by modularization
[Figure: workflow fragment with labels Lucene_query, extract diseases from OMIM, NERecognize, shims]

Abstraction by selection
[Figure: workflow view with the label “select”]

Focusing – processor selection
[Figure: the test dataflow annotated with values: P1VI1 = a1, P1VI2 = a2, P1VO1 = b, P4VI1 = b, P2VI1 = b, P7VO = g]
P4 is the only interesting processor
Assume all values atomic
Query: lineage(P7VO, {P4})
Goal:
• avoid recursive queries on instance tables
Idea:
• use recursion on static model to generate a targeted query
• execute query only once

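The focusing idea (recurse over the static model, then run one targeted query) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the wiring in `UPSTREAM` is a guess at how the test dataflow is connected, and the `var_binding` table, its columns, and the function names `focus_ports` and `build_query` are hypothetical, not taken from the Taverna provenance component.

```python
from typing import Dict, List, Set, Tuple

# ASSUMPTION: hypothetical wiring of the test dataflow. Each port maps to the
# ports it directly depends on: an output port depends on the inputs of its
# own processor, an input port depends on the upstream output feeding it.
UPSTREAM: Dict[str, List[str]] = {
    "P7:VO": ["P7:VI"],
    "P7:VI": ["P6:VO1"],
    "P6:VO1": ["P6:VI1", "P6:VI2"],
    "P6:VI1": ["P3:VO1"],
    "P6:VI2": ["P5:VO1"],
    "P3:VO1": ["P3:VI1"],
    "P3:VI1": ["P2:VO1"],
    "P2:VO1": ["P2:VI1"],
    "P2:VI1": ["P1:VO1"],
    "P5:VO1": ["P5:VI1"],
    "P5:VI1": ["P4:VO1"],
    "P4:VO1": ["P4:VI1"],
    "P4:VI1": ["P1:VO1"],
    "P1:VO1": ["P1:VI1", "P1:VI2"],
}


def focus_ports(target: str, selected: Set[str]) -> List[str]:
    """Traverse the static model upstream from `target` and return the input
    ports of the selected processors that `target` depends on."""
    found: List[str] = []
    seen: Set[str] = set()
    stack = [target]
    while stack:
        port = stack.pop()
        if port in seen:
            continue
        seen.add(port)
        processor, variable = port.split(":")
        if processor in selected and variable.startswith("VI"):
            found.append(port)
        stack.extend(UPSTREAM.get(port, []))
    return sorted(found)


def build_query(run_id: str, ports: List[str]) -> Tuple[str, List[str]]:
    """Build one non-recursive, parameterized SQL query over a hypothetical
    instance table var_binding(run_id, processor, port, value)."""
    if not ports:
        raise ValueError("no selected processor is upstream of the target")
    clauses = " OR ".join("(processor = ? AND port = ?)" for _ in ports)
    sql = (
        "SELECT processor, port, value FROM var_binding "
        f"WHERE run_id = ? AND ({clauses})"
    )
    params = [run_id]
    for port in ports:
        params.extend(port.split(":"))
    return sql, params


if __name__ == "__main__":
    ports = focus_ports("P7:VO", {"P4"})
    print(build_query("run-1", ports))
```

In this sketch the graph traversal touches only the workflow definition, so its cost does not grow with the number of recorded values; the instance table is read by a single flat query.
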
Precision: elements within collections
Problem: xform() also applies to list values
• It may be impossible to trace individual elements
  – “which pathways (out) depend on which genes (in)”?
Goal: extend the query generation idea just sketched to trace element-level lineage within collections
Approach: exploit static typing of Taverna processors
Taverna resolves mismatches on nesting levels: (map P2 [a,b,c])
[Figure: two cases. Matching types: P1VO : l(s) = [a, b, c] feeds P2VI : l(s) = [a, b, c]. Mismatched types: P1VO : l(s) = [a, b, c] feeds P2VI : s, so P2 is mapped over [a, b, c]]

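The nesting-level mismatch on this slide can be sketched in a few lines of Python. This is a simplified, single-input illustration rather than Taverna's actual iteration machinery: `depth` and `run_processor` are hypothetical names, and the lineage records are plain index paths.

```python
from typing import Any, Callable, List, Tuple

IndexPath = Tuple[int, ...]


def depth(value: Any) -> int:
    """Nesting level of a value: 0 for an atom, 1 for a list of atoms, etc.
    An empty list counts as depth 1."""
    if not isinstance(value, list):
        return 0
    return 1 + (depth(value[0]) if value else 0)


def run_processor(
    proc: Callable[[Any], Any],
    value: Any,
    declared_depth: int,
    path: IndexPath = (),
) -> Tuple[Any, List[Tuple[IndexPath, IndexPath]]]:
    """Apply `proc` to `value`. If the value is nested more deeply than the
    port's declared depth, map `proc` over the elements instead.

    Returns the output together with (output index path, input index path)
    pairs, one per invocation of `proc`.
    """
    if depth(value) <= declared_depth:
        return proc(value), [(path, path)]
    outputs: List[Any] = []
    lineage: List[Tuple[IndexPath, IndexPath]] = []
    for i, item in enumerate(value):
        out, pairs = run_processor(proc, item, declared_depth, path + (i,))
        outputs.append(out)
        lineage.extend(pairs)
    return outputs, lineage


if __name__ == "__main__":
    # Port declared as s (depth 0) but fed l(s): proc is mapped over the list.
    print(run_processor(str.upper, ["a", "b", "c"], declared_depth=0))
    # Port declared as l(s) (depth 1): proc consumes the whole list at once.
    print(run_processor(len, ["a", "b", "c"], declared_depth=1))
```

Because the declared depths are known statically, the index pairs produced by the mapped case are predictable from the workflow definition alone, which is what makes element-level tracing possible without inspecting the processor.
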
Loss of precision in transformations
“lossless” transformations:
• PVI : s = a → PVO : s = a'
• PVI : s = a → PVO : l(s) = [x, y, z]
lossy transformations:
• list to single value: PVI : l(s) = [a, b, c] → PVO : s = x, with x → [a, b, c]
  – possible behaviours: selection of an element, aggregation function f()
  – useful annotation: lineage(PVO) = f(PVI)
• list to list: PVI : l(s) = [a, b, c] → PVO : l(s) = [x, y], with x → [a, b, c] and y → [a, b, c]
  – only useful annotation: P is index-preserving, PVO[i] = PVI[i]
  – then for PVO : l(s) = [a', b', c'], lineage(PVO[i]) = PVI[i]

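The effect of the index-preserving annotation on query precision can be shown with a small Python sketch. The function name `element_lineage` and the annotation string `"index-preserving"` are hypothetical; the slide does not specify how annotations are encoded.

```python
from typing import List, Optional


def element_lineage(
    n_inputs: int, out_index: int, annotation: Optional[str]
) -> List[int]:
    """Return the input indices that output element `out_index` depends on.

    With the (hypothetical) "index-preserving" annotation the answer is the
    single matching input element. Without it, the only safe answer is the
    coarse one: every input element.
    """
    if annotation == "index-preserving":
        if not 0 <= out_index < n_inputs:
            raise IndexError("output index has no matching input element")
        return [out_index]
    return list(range(n_inputs))


if __name__ == "__main__":
    # Annotated processor: [a, b, c] -> [a', b', c'], b' depends on b only.
    print(element_lineage(3, 1, "index-preserving"))
    # Black-box processor: [a, b, c] -> [x, y], x depends on all inputs.
    print(element_lineage(3, 0, None))
```
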
Cooperative processors
– Passive processors do not contribute explicit provenance info
– Cooperative processors actively feed metadata to the lineage service
Static annotations:
• aggregation f()
• PVO[i] = PVI[i]
Dynamic annotations:
• sorting: PVO = Π(PVI)
• selection: x = PVI[i]
[Figure: two processors P, one with PVI : l(s) = [a, b, c] and PVO : s = x, the other with PVI : l(s) = [a, b, c] and PVO : l(s) = [x, y]]

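The two dynamic annotations can be illustrated with a short Python sketch of processors that return lineage metadata alongside their result. The function names and the return shapes are assumptions for illustration; the slide does not describe the interface to the lineage service.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def cooperative_sort(values: Sequence[T]) -> Tuple[List[T], List[int]]:
    """Sort `values` and report the permutation that was applied.

    The second result is `order`, where output[j] == values[order[j]],
    so lineage(output[j]) is the input element at index order[j].
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [values[i] for i in order], order


def cooperative_select(
    values: Sequence[T], score: Callable[[T], float]
) -> Tuple[T, int]:
    """Select the highest-scoring element and report its input index."""
    if not values:
        raise ValueError("cannot select from an empty collection")
    chosen = max(range(len(values)), key=lambda i: score(values[i]))
    return values[chosen], chosen


if __name__ == "__main__":
    print(cooperative_sort(["c", "a", "b"]))
    print(cooperative_select([3, 9, 4], score=float))
```

Unlike a static annotation, the permutation and the selected index depend on the data of each run, so in this sketch they have to be emitted at execution time rather than declared once on the processor.
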
Other annotations
• Distinction between configuration and input data
  – PVI3 is a configuration parameter
  – compare effect of different config. across multiple runs
• specific functional dependencies: [PVI1, PVI2] → PVO
• stateless processor
  – execute process ↔ retrieve provenance
More evaluation needed on these
[Figure: processor P with inputs PVI1, PVI2, PVI3 and output PVO]

Towards a 2-tier provenance model
[Figure: two-tier architecture]
• example query: “describe the derivation of each pathway through Kegg, in which gene g is involved”
• semantic tier: semantic overlays, reference ontologies, semantic resource annotations
• structural tier: lineage service, lineage database (RDB), dataflow topology + annotations
• raw lineage events from the Taverna runtime (dataflow with processors P1 to P6)

Conclusions
A data lineage model for Taverna workflows
• Raw lineage data has shortcomings
• A few, selected lightweight annotations added in a principled way
  – win-win:
  – helpful to users
  – and enable query optimization
• Form the base layer in a broader approach to efficient querying of semantic provenance for e-science
• Ongoing implementation