Towards a Model of Provenance and User Views in Scientific Workflows
Shirley Cohen, Sarah Cohen-Boulakia, Susan Davidson
University of Pennsylvania
DILS'06, July 22nd

Need for provenance!
[Figure: DNA sequences from public sources pass through bioinformatics protocols (ClustalW, PAUPS, Phillips, Bootstrap, …) to produce alignments and a tree in the biologist's workspace.]
- How has this tree been generated?
- Which sequences have been used to produce this tree?
- Can I throw away some of these data? Which ones are really important to keep?
CIPRES project: Cyberinfrastructure for Phylogenetic RESearch

Scientific Analysis
- Explosion of biological data, which must be analyzed to create knowledge
- Scientific analysis is complex
- Reproducing and interpreting results depends on the provenance of the data (how, where, who…)
- Workflow systems
  - Support scientists in their analysis
  - Trace the data used / generated at each step
  - Are heterogeneous
    - Different graph-based models
    - Different technologies
- Need a generic model of provenance

Provenance
- Provenance is an increasingly important topic
  - Specialized workshops, survey papers…
- Models for data provenance exist in the database community
  - E.g. [Buneman et al., 01], [Bhagwat et al., 04], [Widom et al., 06]
- However, several features of scientific workflows are not addressed
  - Data are derived by chaining and composing analytical tools
  - Steps are black boxes
  - Different views of a given workflow (sub-steps) may be considered
- A model of provenance for scientific workflows must incorporate these features

Outline
- Motivation
- Case study: Tree Inference (current section)
- Model for provenance and user views
- Querying provenance
- Conclusion

Tree Inference Workflow
[Figure: GenBank (G) -> (S1) Download Sequences -> (O1) raw sequences -> (S2) Create Alignment -> (O2) alignment -> (S3) Refine Alignment -> (O3) edited alignment -> (S4) Infer Tree -> (O4) rooted tree -> Tree Repository. Loop: if the rooted tree (O4) is unsatisfactory, repeat the process.]
- Designed in the context of the CIPRES project
- Represents how phylogeneticists analyze data
- Terminology
  - Nodes are step-classes (static)
  - Edges capture the flow of data between step-classes
  - Loops are possible
  - An execution of a workflow generates a partial order of steps (dynamic)
    - Steps are instances of step-classes
    - Each step has input and output data

Tree Inference Workflow, cont.
[Figure: the Tree Inference workflow with (S4) Infer Tree expanded into sub-steps: (O3) edited alignment -> (S4a) Compute Trees -> (O4a) unrooted trees -> (S4b) Create Consensus Tree -> (O4b) consensus tree -> (S4c) Bootstrap Tree -> (O4c) bootstrap tree -> (S4d) Root Tree -> (O4) rooted tree.]
- A step-class may itself be a workflow
- Users may zoom in to the boxes
  - Kepler, myGrid…
- Different user views can be considered
  - Am I allowed to zoom in to S4?

Querying Provenance
[Figure: the Tree Inference workflow with (S4) Infer Tree expanded into sub-steps S4a to S4d.]
- From what immediate data products did this tree originate?
- What are all the data products which have been used to produce this tree?
- What step produced this tree?
- What sequence of steps produced this tree?
- Data vs. step provenance
- Immediate vs. deep provenance

Outline
- Motivation
- Case study: Tree Inference
- Model for provenance and user views (current section)
- Querying provenance
- Conclusion

Model of Provenance: Logs
- A log is a sequence of entries
  - Input(sid, iid, ts): step sid takes iid as input at time ts
  - Output(sid, did, ts): step sid produces did at time ts
- Immediate provenance
  - All the data and steps directly used to produce did
    ImmProv(did, sid, iid) :- Input(sid, iid, tsi) ∧ Output(sid, did, tso) ∧ tsi ≤ tso
  - ImmDProv and ImmSProv are also defined
- Each input/output data item is stored!

Example: I1, I2 -> S1 -> D -> S2 -> O1

Input
SID  IID  TSI
S1   I1   1
S1   I2   1
S2   D    3

Output
SID  DID  TSO
S1   D    2
S2   O1   4

Immediate provenance of O1
  ImmDProv: D
  ImmSProv: S2

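The ImmProv rule can be read as a join of the two log tables on the step identifier. The following Python sketch illustrates that reading on the example log. The slide only states that ImmDProv and ImmSProv "are also defined", so treating them as projections of ImmProv is an assumption, and all function and type names here are hypothetical.

```python
from typing import NamedTuple


class Input(NamedTuple):
    sid: str  # step identifier
    iid: str  # input data identifier
    ts: int   # timestamp at which the input was read


class Output(NamedTuple):
    sid: str  # step identifier
    did: str  # produced data identifier
    ts: int   # timestamp at which the output was written


# Example log from the slide: I1, I2 -> S1 -> D -> S2 -> O1
INPUTS = [Input("S1", "I1", 1), Input("S1", "I2", 1), Input("S2", "D", 3)]
OUTPUTS = [Output("S1", "D", 2), Output("S2", "O1", 4)]


def imm_prov(inputs, outputs):
    """ImmProv(did, sid, iid): join Input and Output on sid with tsi <= tso."""
    return {
        (o.did, o.sid, i.iid)
        for o in outputs
        for i in inputs
        if i.sid == o.sid and i.ts <= o.ts
    }


def imm_d_prov(did, inputs, outputs):
    """Assumed projection: data items directly used to produce did."""
    return {iid for d, _, iid in imm_prov(inputs, outputs) if d == did}


def imm_s_prov(did, inputs, outputs):
    """Assumed projection: steps that directly produced did."""
    return {sid for d, sid, _ in imm_prov(inputs, outputs) if d == did}
```

Calling `imm_d_prov("O1", INPUTS, OUTPUTS)` and `imm_s_prov("O1", INPUTS, OUTPUTS)` on this log reproduces the slide's answer for O1.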
Deep Provenance
- Recursive definition
- Deep Data provenance (D):
    DProv(did, iid) :- ImmProv(did, _, iid)
    DProv(did, iid) :- ImmProv(did, _, x) ∧ DProv(x, iid)
- Deep Step provenance (S):
    SProv(did, sid) :- ImmProv(did, sid, _)
    SProv(did, sid) :- ImmProv(did, _, x) ∧ SProv(x, sid)

Example: I1, I2 -> S1 -> D -> S2 -> O1
  DProv for O1: [{D}, {I1, I2}]
  SProv for O1: [{S2}, {S1}]

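The two recursive rules amount to a transitive closure over ImmProv facts. The Python sketch below computes that closure with a worklist rather than a Datalog engine. The ImmProv facts are written out by hand for the example I1, I2 -> S1 -> D -> S2 -> O1, the function name is hypothetical, and the result is returned as flat sets rather than the per-level lists shown on the slide.

```python
# ImmProv facts (did, sid, iid) for the example I1, I2 -> S1 -> D -> S2 -> O1
IMM_PROV = {("D", "S1", "I1"), ("D", "S1", "I2"), ("O1", "S2", "D")}


def deep_prov(did, imm_prov):
    """Return (DProv, SProv) for did as flat sets.

    DProv: every data item reachable backwards from did through ImmProv.
    SProv: every step that produced did or any item in its DProv.
    """
    data, steps = set(), set()
    seen = {did}        # guards against revisiting items (e.g. in loops)
    frontier = [did]
    while frontier:
        current = frontier.pop()
        for d, sid, iid in imm_prov:
            if d != current:
                continue
            steps.add(sid)
            data.add(iid)
            if iid not in seen:
                seen.add(iid)
                frontier.append(iid)
    return data, steps
```

For O1 this yields the data items D, I1, I2 and the steps S2, S1, which match the slide's answer once the levels are flattened.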
Composition and User Views
[Figure: the Tree Inference workflow with (S4) Infer Tree expanded into sub-steps S4a to S4d. User view U1 is marked at the level of S1 to S4, and user view U2 at the level of the S4 sub-steps.]
- What is the immediate data provenance of O4?
  - If I can zoom into S4: O4c
  - Otherwise: O3
- UserView(U): the set of the lowest-level step-classes that U is entitled to see
- Ordering on user views: U2 >u U1
  - U2 is finer than U1 (sees provenance in more detail)

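One way to picture how a user view changes the answer for O4 is to map every logged sub-step to the lowest step-class the view can see, then report only the inputs that cross that step's boundary. The Python sketch below does this for the S4 sub-workflow. It is a simplified illustration under several assumptions: the `PARENT` table, the `RUN` log layout, and all function names are hypothetical, each step-class runs exactly once (no loops), and the ordering check is only one plausible reading of "finer than".

```python
# Composition: each sub-step-class points to the step-class that contains it.
PARENT = {"S4a": "S4", "S4b": "S4", "S4c": "S4", "S4d": "S4"}

# Finest-grained record of one run of S4: (step, inputs, outputs).
RUN = [
    ("S4a", {"O3"}, {"O4a"}),
    ("S4b", {"O4a"}, {"O4b"}),
    ("S4c", {"O4b"}, {"O4c"}),
    ("S4d", {"O4c"}, {"O4"}),
]

U1 = {"S1", "S2", "S3", "S4"}                       # cannot zoom into S4
U2 = {"S1", "S2", "S3", "S4a", "S4b", "S4c", "S4d"}  # can zoom into S4


def visible_step(step, view):
    """Climb the composition hierarchy until reaching a step in the view."""
    while step not in view:
        step = PARENT[step]  # KeyError if the view does not cover the step
    return step


def imm_data_prov(did, run, view):
    """Immediate data provenance of did as seen through the given view."""
    producer = next(visible_step(s, view) for s, _, outs in run if did in outs)
    members = [(ins, outs) for s, ins, outs in run
               if visible_step(s, view) == producer]
    produced_inside = set().union(*(outs for _, outs in members))
    consumed = set().union(*(ins for ins, _ in members))
    # Keep only inputs that come from outside the visible step.
    return consumed - produced_inside


def is_finer_or_equal(fine, coarse):
    """Assumed ordering: every step of `fine` lies in or below `coarse`."""
    try:
        return all(visible_step(s, coarse) for s in fine)
    except KeyError:
        return False
```

With these definitions, `imm_data_prov("O4", RUN, U2)` gives O4c and `imm_data_prov("O4", RUN, U1)` gives O3, matching the two cases on the slide.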
User Views
- What are user views?
  - Level of detail the user wishes to track
  - Permissions given to the user
  - Ability of the user to see / know the sub-steps (distributed computation)
  - Similar to checkpoints in logs
- Why use user views?
  - Throw away unimportant intermediate results
  - Reduce the amount of work to be redone
  - Storage efficiency