Data lineage model for Taverna workflows with lightweight annotation requirements
Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble
School of Computer Science, The University of Manchester, UK
IPAW'08 – Salt Lake City, Utah, June 2008
Context and scope
Ongoing work on a new provenance component for Taverna
• myGrid consortium
Scope:
• capture raw provenance events
  – data transformations, data transfers
• store one lineage graph for each dataflow execution
• query over single or multiple lineage graphs
Example (Taverna) dataflow
QTL -> genes -> Kegg pathways
Some user questions on lineage
• on a single workflow run:
  – find all genes that participate in some pathway p
  – find all pathways derived from Uniprot genes
  – describe the complete derivation of each pathway in which gene g is involved
• on a collection of runs:
  – find all distinct pathways produced by runs of a dataflow [over a period of time, produced by a member of my group, ...]
Shortcomings of lineage data
• Granularity
  – risk of returning trivial answers
  – “all outputs depend on all inputs”
• Semantics
  – results not expressed in the language of the designer
• Abstraction level, noise
  – the “latent data model”
  – many processors are irrelevant: shims, mundane tasks
The need for selective annotations
• As long as processors are black boxes, these remain difficult problems
• Adding annotations to processors is tempting
Scope of this work: to explore the “gray box” region
• simple annotations with minimal semantics
• driving principle: justified by technical benefits
  – precision of query results
  – efficiency of query processing
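Test dataflow model configuration
[Figure: dataflow with seven processors and their variables]
• P1 (extract query terms): inputs P1VI1, P1VI2 (documents); output P1VO1, feeding P2VI1 and P4VI1
• P2 (query prep): input P2VI1; output P2VO1, feeding P3VI1
• P3 (query1): input P3VI1; output P3VO1, feeding P6VI1
• P4 (query 2): input P4VI1; output P4VO1, feeding P5VI1
• P5 (post-proc): input P5VI1; output P5VO1, feeding P6VI2
• P6 (merge results): inputs P6VI1, P6VI2; outputs P6VO1 (merged, feeding P7VI) and P6VO2 (number of duplicates)
• P7 (sort results): input P7VI; output P7VO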
Two main annotation types
Focusing: processor selection
• some processors are more interesting than others
• “boring” annotations
• query-time user selection of interesting processors
Precision: fine-grained lineage tracing
• goal: trace lineage of individual items within a collection
Abstraction by modularization
[Figure: example dataflow; visible labels: Lucene_query, extract diseases from OMIM, NERecognize, shims]
Abstraction by selection
[Figure: example dataflow; visible label: select]
Focusing – processor selection
[Figure: the test dataflow P1..P7 with example values: P1VI1 = a1, P1VI2 = a2, P1VO1 = b, P2VI1 = b, P4VI1 = b, P7VO = g]
• P4 is the only interesting processor
• Assume all values atomic
Query: lineage(P7VO, {P4})
Goal:
• avoid recursive queries on instance tables
Idea:
• use recursion on the static model to generate a targeted query
• execute the query only once
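The query-generation idea can be sketched as follows. This is a minimal illustration, not the Taverna provenance component's implementation: the topology is the one read off the test dataflow, the target is named by its processor (P7) rather than by the variable P7VO, and the instance table `var_binding` with its columns is a hypothetical schema.

```python
from typing import Dict, List, Optional, Set, Tuple

# Static dataflow model: processor -> processors that feed it.
# Assumption: topology as read off the test dataflow (P1..P7).
UPSTREAM: Dict[str, List[str]] = {
    "P1": [],
    "P2": ["P1"],
    "P3": ["P2"],
    "P4": ["P1"],
    "P5": ["P4"],
    "P6": ["P3", "P5"],
    "P7": ["P6"],
}


def upstream_closure(proc: str, graph: Dict[str, List[str]],
                     seen: Optional[Set[str]] = None) -> Set[str]:
    """Recursively collect `proc` and every processor upstream of it."""
    if seen is None:
        seen = set()
    if proc not in seen:
        seen.add(proc)
        for parent in graph[proc]:
            upstream_closure(parent, graph, seen)
    return seen


def generate_query(run_id: str, target: str,
                   selected: Set[str]) -> Optional[Tuple[str, List[str]]]:
    """Build one flat SQL query for lineage(target, selected).

    The recursion runs on the static model only; the generated query
    reads the (hypothetical) instance table `var_binding` once.
    """
    focus = sorted(upstream_closure(target, UPSTREAM) & selected)
    if not focus:
        return None  # no selected processor lies upstream of the target
    marks = ", ".join("?" for _ in focus)
    sql = ("SELECT processor, port, value FROM var_binding "
           "WHERE run_id = ? AND direction = 'in' "
           f"AND processor IN ({marks})")
    return sql, [run_id, *focus]
```

For lineage(P7VO, {P4}) the generated query has no recursive part: it selects the input bindings of P4 for the given run, which in the example above would be P4VI1 = b.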
Precision: elements within collections
Problem: xform() also applies to list values
• It may be impossible to trace individual elements
  – “which pathways (out) depend on which genes (in)”?
Goal: extend the query generation idea just sketched to trace element-level lineage within collections
Approach: exploit static typing of Taverna processors
• matching nesting levels: P1VO: l(s) = [a, b, c] feeds P2VI: l(s) = [a, b, c]
• mismatched nesting levels: P1VO: l(s) = [a, b, c] feeds P2VI: s
  – Taverna resolves mismatches on nesting levels: (map P2 [a, b, c])
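The effect of the implicit (map P2 [a, b, c]) on element-level lineage can be illustrated with a small sketch. It assumes a single input port and a declared nesting depth per port; `invoke` and `depth` are illustrative names, not Taverna API.

```python
from typing import Any, Callable, List, Tuple

IndexPath = Tuple[int, ...]


def depth(value: Any) -> int:
    """Nesting depth: 0 for an atomic value, 1 for a list of atoms, and so on."""
    if isinstance(value, list):
        return 1 + (depth(value[0]) if value else 0)
    return 0


def invoke(proc: Callable[[Any], Any], value: Any,
           declared_depth: int) -> Tuple[Any, List[Tuple[IndexPath, IndexPath]]]:
    """Apply `proc`, iterating implicitly when `value` is nested deeper
    than the port declares.

    Returns the output and a list of (output index path, input index path)
    pairs recording which input element each output element came from.
    """
    if depth(value) <= declared_depth:
        return proc(value), [((), ())]
    outputs, lineage = [], []
    for i, item in enumerate(value):
        out, sub = invoke(proc, item, declared_depth)
        outputs.append(out)
        lineage.extend(((i, *o), (i, *p)) for o, p in sub)
    return outputs, lineage
```

When the port expects a string and receives a list, each output element is paired with exactly one input element. When the port expects the whole list, the only recorded dependency is between the whole output and the whole input.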
Loss of precision in transformations
“Lossless” transformations:
• PVI: s = a → P → PVO: s = a'
• PVI: s = a → P → PVO: l(s) = [x, y, z]
Lossy:
• PVI: l(s) = [a, b, c] → P → PVO: s = x, with x → [a, b, c]
  – possible behaviours: selection of an element; aggregation function f()
  – useful annotation: lineage(PVO) = f(PVI)
• PVI: l(s) = [a, b, c] → P → PVO: l(s) = [x, y], with x → [a, b, c] and y → [a, b, c]
• PVI: l(s) = [a, b, c] → P → PVO: l(s) = [a', b', c']
  – only useful annotation: P is index-preserving: PVO[i] = PVI[i]
  – lineage(PVO[i]) = PVI[i]
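As a hedged illustration of how a static annotation could sharpen the answer, the sketch below returns the input positions an output element depends on. The annotation string and the function name are assumptions made for this example only.

```python
from typing import List, Optional


def lineage_indices(n_inputs: int, out_index: Optional[int],
                    annotation: Optional[str]) -> List[int]:
    """Input positions that the output (or output element) depends on.

    With the index-preserving annotation, lineage(PVO[i]) = PVI[i].
    Without it, the only safe answer is the whole input list.
    """
    if annotation == "index-preserving" and out_index is not None:
        if not 0 <= out_index < n_inputs:
            raise IndexError("output index has no matching input element")
        return [out_index]
    return list(range(n_inputs))
```

For an input [a, b, c], an index-preserving processor yields lineage_indices(3, 1, "index-preserving") == [1], while an unannotated processor falls back to [0, 1, 2], the "all outputs depend on all inputs" answer.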
Cooperative processors
– Passive processors do not contribute explicit provenance info
– Cooperative processors actively feed metadata to the lineage service
[Figure: two processors P with input PVI: l(s) = [a, b, c]; one outputs PVO: s = x, the other PVO: l(s) = [x, y]]
Static annotations:
• aggregation f()
• PVO[i] = PVI[i]
Dynamic annotations:
• sorting: PVO = Π(PVI)
• selection: x = PVI[i]
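A cooperative processor might report its dynamic annotation alongside its result, as in the sketch below. The function names and the shape of the returned metadata are illustrative assumptions; how the metadata reaches the lineage service is not modelled.

```python
from typing import Callable, List, Sequence, Tuple


def cooperative_sort(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Sort `values` and report the permutation Π as provenance metadata.

    perm[i] is the input position of the element now at output position i,
    so lineage(PVO[i]) = PVI[perm[i]].
    """
    perm = sorted(range(len(values)), key=lambda i: values[i])
    return [values[i] for i in perm], perm


def cooperative_select(values: Sequence[str],
                       predicate: Callable[[str], bool]) -> Tuple[str, int]:
    """Return the first element satisfying `predicate` and its index i,
    so that the lineage service can record x = PVI[i]."""
    for i, value in enumerate(values):
        if predicate(value):
            return value, i
    raise ValueError("no element satisfies the predicate")
```

With input ["c", "a", "b"], the sort returns ["a", "b", "c"] together with the permutation [1, 2, 0], which is enough to trace each sorted element back to its original position.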
Other annotations
• Distinction between configuration and input data
  – inputs PVI1, PVI2, PVI3 → P → PVO, where PVI3 is a configuration parameter
  – compare the effect of different configurations across multiple runs
• Specific functional dependencies: [PVI1, PVI2] → PVO
• Stateless processor
  – execute process ↔ retrieve provenance
More evaluation needed on these
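One possible reading of the stateless annotation is that re-executing a processor on the same inputs is interchangeable with looking up its recorded output. The sketch below illustrates that reading only; the `ProvenanceCache` class is hypothetical and requires hashable input values.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class ProvenanceCache:
    """Toy record of (processor, inputs) -> output for past executions."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, Tuple[Hashable, ...]], Any] = {}

    def invoke(self, name: str, func: Callable[..., Any],
               *inputs: Hashable, stateless: bool = False) -> Tuple[Any, str]:
        """Run `func`, or reuse the recorded output if the processor is
        annotated as stateless and was already run on these inputs."""
        key = (name, inputs)
        if stateless and key in self._records:
            return self._records[key], "retrieved"
        output = func(*inputs)
        self._records[key] = output
        return output, "executed"
```

A processor without the annotation is always executed, since its output may depend on hidden state.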
Towards a 2-tier provenance model
Example query: “describe the derivation of each pathway through Kegg, in which gene g is involved”
[Figure: two-tier architecture]
• semantic tier: reference ontologies, semantic resource annotations, semantic overlays
• structural tier: lineage service with a lineage database (RDB), holding dataflow topology + annotations
• raw lineage events flow from the Taverna runtime (executions of processors P1..P6) to the lineage service
Conclusions
A data lineage model for Taverna workflows
• Raw lineage data has shortcomings
• A few selected lightweight annotations, added in a principled way
  – win-win:
  – helpful to users
  – and enable query optimization
• Form the base layer in a broader approach to efficient querying of semantic provenance for e-science
• Ongoing implementation