A graph model for data and workflow provenance
Umut Acar, Peter Buneman, James Cheney, Natalia Kwasnikowska, Jan van den Bussche, & Stijn Vansummeren
TaPP 2010
Provenance in ...
• Databases
  • Mainly for (nested) relational model
  • Where-provenance ("source location")
  • Lineage, why ("witnesses")
  • How/semiring model
  • Relatively formal
• Workflows
  • Many different systems
  • Many different models (converging on OPM?)
  • Graphs/DAGs
  • Relatively informal
This talk
• Relate database & workflow "styles"
• Develop a common graph formalism
• Need a common, expressive language that
  • supports many database queries
  • describes some (simple) workflows
Previous work
• Dataflow calculus (DFL), based on nested relational calculus (NRC)
• Provenance "run" model by Kwasnikowska & Van den Bussche (DILS 07, IPAW 08)
• "Provenance trace" model for NRC by Acar, Ahmed & C. ('08)
• Open Provenance Model (bipartite graphs) (Moreau et al. 2008-9), used in many WF systems
NRC/DFL background
• A very simple, functional language:
  • basic functions +, *, ... & constants 0, 1, 2, 3, ...
  • variables x, y, z
  • pair/record types: (A:e, ..., B:e), π_A(e)
  • collection (set) types: {e, ...}, e ∪ e, {e | x in e'}, ∪e
An example
• Suppose R = {(1,2,3), (4,5,6), (9,8,7)}
sum { x * y | (x,y,z) in R, x < y}
= sum { x * y | (x,y,z) in {(1,2,3), (4,5,6)}}
= sum {1 * 2, 4 * 5}
= sum {2, 20}
= 22
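The same evaluation can be reproduced with a Python set comprehension. This is only an illustration of the slide's arithmetic; the names `R`, `products`, and `total` are chosen here and are not DFL syntax.

```python
# Relation R from the slide, as a set of triples.
R = {(1, 2, 3), (4, 5, 6), (9, 8, 7)}

# { x * y | (x, y, z) in R, x < y } written as a set comprehension.
# The tuple (9, 8, 7) is filtered out because 9 < 8 is false.
products = {x * y for (x, y, z) in R if x < y}

# sum { ... } adds up the elements of the resulting set.
total = sum(products)

print(sorted(products), total)
```

Because the comprehension builds a set, two tuples that produce the same product would contribute only once to the sum, in line with the set collection type on the background slide.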
Another example
• In DFL, built-in functions / constants can be whole programs & files, as in the Provenance Challenge 1 workflow:
let WarpParams := {align_warp(img,hdr) | (img,hdr) in Inputs} in
let Reslices := {reslice(wp) | wp in WarpParams} in
softmean(Reslices)
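A rough Python rendering of this workflow shape is sketched below. The three functions are hypothetical stand-ins that only tag their arguments (the real steps are external programs operating on image files), and the input file names are made up for illustration.

```python
# Hypothetical stand-ins for the external programs named on the slide.
# Each one just wraps its arguments so the data flow stays visible.
def align_warp(img, hdr):
    return ("warp", img, hdr)

def reslice(wp):
    return ("resliced", wp)

def softmean(reslices):
    return ("atlas", frozenset(reslices))

def workflow(inputs):
    # let WarpParams := {align_warp(img,hdr) | (img,hdr) in Inputs}
    warp_params = {align_warp(img, hdr) for (img, hdr) in inputs}
    # let Reslices := {reslice(wp) | wp in WarpParams}
    reslices = {reslice(wp) for wp in warp_params}
    # softmean(Reslices)
    return softmean(reslices)

# Made-up input pairs (image file, header file).
inputs = {("anatomy1.img", "anatomy1.hdr"), ("anatomy2.img", "anatomy2.hdr")}
result = workflow(inputs)
```

Each `let` becomes a local binding and each set comprehension maps one step over the previous collection.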
Goal: Define "provenance graphs" for DFL
let WarpParams := {align_warp(img,hdr) | (img,hdr) in Inputs} in
let Reslices := {reslice(wp) | wp in WarpParams} in
softmean(Reslices)
(Image: http://www.flickr.com/photos/schneertz/679692806/)
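First step: values
• A value node v is one of:
  • a constant c
  • a copy node with an edge to v
  • a record node <> with edges A_1, ..., A_n to v, ..., v
  • a set node {} with elem edges to v, ..., v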
Example value
[Figure: a set node {} with two elem edges to record nodes <>, whose A and B edges lead to the constants 1, 2, and 3]
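One possible encoding of value graphs as a data structure is sketched below. The `Node` class and helper names are assumptions made for illustration, and the concrete value {<A:1,B:2>, <A:2,B:3>} is only a guess at what the slide's figure depicts.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # nodes are compared by identity, as graph nodes
class Node:
    kind: str                                   # "const", "copy", "<>", or "{}"
    const: object = None                        # payload for constant nodes
    edges: list = field(default_factory=list)   # (edge label, target Node)

def const(c):
    return Node("const", const=c)

def copy_of(target):
    return Node("copy", edges=[("", target)])

def record(**fields):
    return Node("<>", edges=[(name, node) for name, node in fields.items()])

def set_of(*elems):
    return Node("{}", edges=[("elem", node) for node in elems])

def to_value(node):
    """Read back the plain value that a value node denotes."""
    if node.kind == "const":
        return node.const
    if node.kind == "copy":
        return to_value(node.edges[0][1])
    if node.kind == "<>":
        return tuple((label, to_value(t)) for label, t in node.edges)
    if node.kind == "{}":
        return frozenset(to_value(t) for _, t in node.edges)
    raise ValueError(f"unknown node kind: {node.kind}")

# Guessed example: two records sharing the constant node 2.
one, two, three = const(1), const(2), const(3)
value = set_of(record(A=one, B=two), record(A=two, B=three))
```

Sharing the node `two` between both records is what makes this a graph rather than a tree.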
Next step: evaluation nodes ("process")
• Constants, primitive functions: a node c, or a node f with edges 1, ..., n to e_1, ..., e_n
• Variables & temporary bindings: a node x, or a node let x with a head edge to e and a body edge to e
Pairing
• Record building: a node <> with edges A_1, ..., A_n to e, ..., e
• Field lookup: a node π_A with an edge to e
Conditionals
• A node if with a test edge to e and either a then edge or an else edge to e
• Note: Only the taken branch is recorded
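The note about recording only the taken branch can be illustrated with a toy evaluator that returns a value together with a record of what it evaluated. The tuple encoding of expressions and of the recorded structure is invented here and is much simpler than the graphs in the talk.

```python
def evaluate(expr, env):
    """Return (value, record) for a tiny expression language.

    Expression forms (hypothetical encodings):
      ("const", c) | ("var", x) | ("lt", e1, e2) | ("if", test, then, else)
    The record is a nested tuple covering only what was actually evaluated.
    """
    tag = expr[0]
    if tag == "const":
        return expr[1], ("const", expr[1])
    if tag == "var":
        return env[expr[1]], ("var", expr[1])
    if tag == "lt":
        v1, r1 = evaluate(expr[1], env)
        v2, r2 = evaluate(expr[2], env)
        return v1 < v2, ("lt", r1, r2)
    if tag == "if":
        test_value, test_record = evaluate(expr[1], env)
        if test_value:
            value, record = evaluate(expr[2], env)
            return value, ("if", ("test", test_record), ("then", record))
        value, record = evaluate(expr[3], env)
        return value, ("if", ("test", test_record), ("else", record))
    raise ValueError(f"unknown expression: {tag}")

# The else branch mentions an unbound variable, but it is never evaluated.
expr = ("if", ("lt", ("var", "x"), ("var", "y")),
        ("const", "small"),
        ("var", "unbound"))
value, record = evaluate(expr, {"x": 1, "y": 2})
```

The resulting record contains the test and the then branch only; nothing about the else branch appears in it.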
Sets: basic operations
• Empty set: a node ∅
• Singleton: a node {} with an edge to e
• Union: a node ∪ with edges 1 and 2 to e, e
Sets: complex operations
• Flattening: a node ∪ with an edge to e
• Iteration: a node for x with a head edge to e and body edges to e, ..., e
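As a rough guide to what these two operations compute, here is a small Python sketch. It assumes that iteration collects one body result per element of the head set, matching the comprehension form {e | x in e'}; the function names and the per-element dictionary are illustrative choices, not part of DFL.

```python
def flatten(sets):
    """Flattening: the union of all member sets of a set of sets."""
    return frozenset(x for s in sets for x in s)

def iterate(head, body):
    """Iteration: evaluate body once per element of head.

    Returns the resulting set plus a per-element map, which loosely
    mirrors the one-body-edge-per-element shape on the slide.
    """
    per_element = {x: body(x) for x in head}
    return frozenset(per_element.values()), per_element

nested = frozenset({frozenset({1, 2}), frozenset({2, 3})})
flat = flatten(nested)

result, bodies = iterate(frozenset({1, 2, 3}), lambda x: x % 2)
```

Here `bodies` keeps three entries even though `result` has only two elements, since equal body results collapse in the output set.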
Provenance graphs
• are graphs with "both value and evaluation structure"
[Figure: a small provenance graph; node labels not recoverable from the extraction]
A bigger example
[Figure: provenance graph of a larger expression; node labels not recoverable from the extraction]
Value structure
[Same figure with the value nodes highlighted: constants 1 and 2, T and F, record nodes <>, set nodes {}, and nodes labeled C]
Input values
[Same figure with the input values highlighted]
Return value
[Same figure with the return value highlighted]
Expression structure
[Same figure with the evaluation nodes highlighted: let R, let S, for x, for y, if, empty, =, +, ∪, {}, fst, snd, and variables x, y]
Building provenance graphs
• is complicated
• Here we'll use a high-level "graph rewrite rule" formalism
• Mostly because it is nicer to look at than the formal version
[Figure: rewrite rules for constants c, for primitive functions (f applied to values v_1, ..., v_n yields f(v_1,...,v_n)), and for variables x and let x with head and body edges; results are linked by copy edges]
[Figure: rewrite rules for record building <> over fields A_1, ..., A_n with values v_1, ..., v_n, and for field lookup π_Ai, whose result is a copy of v_i]
[Figure: rewrite rules for if: when the test is True the result is a copy of the then branch e_1; when the test is False the result is a copy of the else branch e_2]
[Figure: rewrite rules for empty?: False on a set node {} that has elem edges, True on a set node {} that has none]
[Figure: rewrite rules for ∅, singleton {}, and union ∪, each producing a set node {} with elem edges to its element values]
OK, take a deep breath!