Tracing Lineage Beyond Relational Operators



  1. Tracing Lineage Beyond Relational Operators
     Mingwu Zhang¹, Xiangyu Zhang¹, Xiang Zhang², Sunil Prabhakar¹
     ¹ Computer Science, ² Bindley Bioscience Center, Purdue University

  2. Introduction
     - Lineage (data provenance) is defined as a description of the origin of the data and the process by which the data is derived.
     - Lineage is
       - critical for determining data quality and reliability (e.g. biological data, data cleansing)
       - mandated by law (e.g. audit trails for the FDA)
       - essential for data dissemination and reproduction
       - informative (e.g. querying lineage)
     - Database support for tracing lineage is urgent.
     Mingwu Zhang et al.

  3. Lineage tracing
     [Figure: two views of Program X mapping input to output. Coarse-grained (e.g. workflow-level) lineage relates whole input files to output files. Fine-grained lineage relates individual input items (T1,1,2; T2,3,4) to individual output items (R1,14; R2,12).]

  4. Tracing fine-grained lineage
     State of the art is limited to three special cases.
     - Internal:
       - ✓ Relational operators (Cui et al.)
       - ✓ Reversible mathematical functions (Woodruff & Stonebraker)
       - ✓ Array Manipulation Language, AML (A. P. Marathe)
     - External:
       - ✗ ??? Currently no known techniques for general fine-grained lineage tracing.
     [Figure: example with inputs T1,1,2 and T2,3,4 and outputs R1,12 and R2,34.]

  5. Contributions
     - Enable fine-grained external lineage tracing for any arbitrary program without requiring
       - domain expertise
       - an understanding of the semantics of the operation
       - source code
     - Computed lineage is accurate (no false positives).
       - Lineage is derived directly from program execution.

  6. Outline
     - Introduction
     - Lineage tracing
     - Case study
     - Conclusion

  7. Our approach
     - Automatically trace lineage using only binary executables.
     - Monitor the data flow during program execution.
     - As the program is executed, the lineage of each variable is traced.
     - Each binary instruction is instrumented to keep track of the data and control dependencies it generates.

  8. Tracing Lineage
     - A statement S data depends on another statement t if and only if a variable is defined at t and then used at S.
     - A statement S control depends on a predicate statement t if and only if the execution of S is the result of the branch outcome of t.
     - Definition: Given a program execution, the data lineage of variable v at an execution point S_i, denoted $DL(v@S_i)$, is the set of input items that are directly or indirectly involved in the computation of the value of v at S_i.

  9. Tracing example
         1: y = 3;
         2: a1 = 3;
         3: b1 = 4;
         4: if (y > 2)
         5:     x = a1 + b1;
     $$DL(x@5) = (DL(a1@5) \cup DL(b1@5)) \cup DL(4)$$
     $$= DL(a1@2) \cup DL(b1@3) \cup DL(y@4)$$
     $$= DL(a1@2) \cup DL(b1@3) \cup DL(y@1)$$
     - At statement 5, x depends upon a1 and b1 (data dependence) and y (control dependence).
     - Thus, the lineage of x is the union of the lineages of a1, b1 and y.
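     The union on this slide can be checked with ordinary sets. The following is a minimal Python sketch, not the authors' implementation. It assumes (the slide does not say so) that the values assigned at statements 1 to 3 stand for three distinct input items, given here the hypothetical labels `in_y`, `in_a1` and `in_b1`.

```python
# Hypothetical input labels; the slide does not name the input items.
DL_y_at_1 = frozenset({"in_y"})
DL_a1_at_2 = frozenset({"in_a1"})
DL_b1_at_3 = frozenset({"in_b1"})

# Statement 4 is the predicate `if (y > 2)`; it uses only y,
# so its lineage is the lineage of y at its latest definition.
DL_4 = DL_y_at_1

# Statement 5: x = a1 + b1, control dependent on statement 4.
DL_x_at_5 = (DL_a1_at_2 | DL_b1_at_3) | DL_4

print(sorted(DL_x_at_5))  # ['in_a1', 'in_b1', 'in_y']
```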

  10. Deriving lineage
      Let $s_i:\ dest = \,?\,t_j : f(use_0, use_1, \ldots, use_n)$ be the executed statement instance $s_i$, which assigns a value to variable $dest$ by using variables $use_0, use_1, \ldots, use_n$, and let $s_i$ control depend on $t_j$. Let $DEF(x)$ be the latest statement instance that defines $x$. Then
      $$DL(dest@s_i) = \Big(\bigcup_{\forall x} DL(use_x@s_i)\Big) \cup DL(t_j)$$
      $$= DL(t_j) \cup \Big(\bigcup_{\forall x.\ DEF(use_x) \neq \phi} DL(use_x@DEF(use_x))\Big) \cup \Big(\bigcup_{\forall x.\ DEF(use_x) = \phi} \{use_x\}\Big)$$
      The last term covers uses with no earlier definition in the execution ($DEF(use_x) = \phi$): such a variable contributes itself, $\{use_x\}$, as an input item.
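      The rule above can be sketched as a small propagation routine. This is an illustrative Python model only: the class `LineageTracer`, its method names, and the input labels `in0`, `in1`, `in2` are hypothetical, and plain `frozenset` values stand in for whatever set representation a real tracer would use. A name with no recorded definition is treated as an input item, matching the $DEF(use_x) = \phi$ case.

```python
from typing import Dict, FrozenSet, Iterable, Optional

Lineage = FrozenSet[str]


class LineageTracer:
    """Toy model: keeps, per variable, the lineage stored at its latest definition."""

    def __init__(self) -> None:
        # Plays the role of DL(x@DEF(x)) for every variable defined so far.
        self.defs: Dict[str, Lineage] = {}

    def lineage_of_use(self, name: str) -> Lineage:
        # DEF(name) is empty: the name is itself an input item.
        if name not in self.defs:
            return frozenset({name})
        return self.defs[name]

    def execute(self, dest: Optional[str], uses: Iterable[str],
                control: Lineage = frozenset()) -> Lineage:
        # DL(dest@s_i) = DL(t_j) united with the lineage of every use.
        result = control
        for use in uses:
            result = result | self.lineage_of_use(use)
        if dest is not None:  # predicates define no variable
            self.defs[dest] = result
        return result


# Replaying the five-statement example, with hypothetical inputs in0..in2.
tracer = LineageTracer()
tracer.execute("y", ["in0"])
tracer.execute("a1", ["in1"])
tracer.execute("b1", ["in2"])
predicate = tracer.execute(None, ["y"])            # if (y > 2)
x_lineage = tracer.execute("x", ["a1", "b1"], control=predicate)
print(sorted(x_lineage))  # ['in0', 'in1', 'in2']
```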

  11. Instrumenting the code
      - We use an open-source instrumentation kernel called Valgrind.
      - Lineage is set-valued data and is stored in a structure called roBDD (reduced ordered binary decision diagram), which is optimized for set operations.
      - Shadow memory (SM) stores the lineage sets associated with variables on the stack and heap; a shadow register file (SRF) stores the lineage sets for variables in registers.

  12. Instrumenting the code
      Example instrumentation:
      - A = (int*) malloc(100)  →  SM(A) = malloc_in_shadow(100)
      - add (0x0884dc0), eax  →  mov SM[0x0884dc0] ∪ SRF(eax), SRF(eax)
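      The effect of the second rewrite can be simulated in a few lines. This is a hedged Python sketch, not Valgrind code: the dictionaries `shadow_memory` and `shadow_registers`, the helper `shadow_add_mem_to_reg`, and the labels `input_1` and `input_2` are all hypothetical, and plain `frozenset` values replace the roBDD-encoded sets mentioned on the previous slide.

```python
from typing import Dict, FrozenSet

Lineage = FrozenSet[str]
EMPTY: Lineage = frozenset()

shadow_memory: Dict[int, Lineage] = {}     # SM: address -> lineage set
shadow_registers: Dict[str, Lineage] = {}  # SRF: register name -> lineage set


def shadow_add_mem_to_reg(addr: int, reg: str) -> None:
    """Shadow step for `add (addr), reg`: SRF(reg) = SM[addr] U SRF(reg)."""
    shadow_registers[reg] = (
        shadow_memory.get(addr, EMPTY) | shadow_registers.get(reg, EMPTY)
    )


# Assume the memory cell and the register were each derived from one input.
shadow_memory[0x0884DC0] = frozenset({"input_1"})
shadow_registers["eax"] = frozenset({"input_2"})

shadow_add_mem_to_reg(0x0884DC0, "eax")
print(sorted(shadow_registers["eax"]))  # ['input_1', 'input_2']
```

      After the shadow step, the register's lineage holds both inputs, mirroring how the original `add` combines both values into `eax`.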

  13. Architecture
      - We have developed a prototype system based on the Valgrind engine to trace fine-grained lineage.
      [Figure: system architecture. Labeled components: x86 binary, code instrumenter, instrumented code, Valgrind kernel, roBDD runtime, lineage repository. Labeled flows: input, event, bdd, output, lineage.]

  14. Outline
      - Introduction
      - Lineage tracing
      - Case study
      - Conclusion

  15. Case study: Cancer biomarker discovery
      [Figure: experimental workflow. A cancer sample (Label1) and a normal sample (Label2) each go through digestion and isotope labeling, are combined in a 1:1 mix, analyzed by LC-MS, and then de-isotoped. The spectra (intensity vs. m/z) show doublets.]

  16. De-isotoping
      Seq1: ATLNELVEYVSTNR
      [Figure to show the challenge of de-isotoping: spectrum of intensity vs. m/z (400 to 1600) with peaks at charge states +1, +2 and +3.]

  17. De-isotoping
      [Figure: four panels of intensity vs. m/z. The full spectrum (m/z 400 to 1600) with peaks marked +1, +2 and +3, plus zoomed views around m/z 1606 to 1614 (+1), 803 to 808 (+2) and 535 to 540 (+3).]

  18. De-isotoping
      Seq1: ATLNELVEYVSTNR
      Seq2: ITCAELR
      [Figure: four panels of intensity vs. m/z over the range 803 to 808; panel labels include +2 and +1.]

  19. De-isotope algorithm
      - Complex, mostly heuristics.
      - Numerous parameters picked by experts.
      - Validity of results can be affected by the choice of parameters.
      - Identifying a reverse function is impossible, even for experts.
      - Using a state-of-the-art algorithm.

  20. De-isotope result
      [Figure: two panels of intensity vs. m/z (912 to 924). Input peaks are labeled with Greek letters (α, β, γ, ...); the de-isotoped result shows Pep 1 (H), Pep 1 (L), Pep 2 (H) and Pep 2 (L).]

  21. Fine-grained lineage
      [Figure: two panels of intensity vs. m/z (912 to 924) relating the Greek-letter input peaks to the output peaks Pep 1 (H), Pep 1 (L), Pep 2 (H) and Pep 2 (L).]

  22. Case study
      - External fine-grained lineage is crucial for our biomarker discovery application.
      - Our technique enabled experts to
        - detect errors
        - assess the reliability of data
        - identify false positives
        - identify program limitations

  23. Detect error
      [Figure: panels of intensity vs. m/z (1450 to 1457). Input peaks are labeled with Greek letters; output peaks are annotated with charge state +4 in one set of panels and +3 in the primed set.]

  24. Identifying false positives
      [Figure: two panels of intensity vs. m/z (969 to 977) with peaks labeled α through λ.]

  25. Program limitations
      [Figure: panels of intensity vs. m/z (1051 to 1061). Input peaks are labeled with Greek letters; output peaks are annotated with charge states +1 and +2.]
