Scalable Distributed Lineage Authentication (Ashish Gehani)


  1. Scalable Distributed Lineage Authentication Ashish Gehani Scalable Distributed Lineage Authentication – p. 1/59

  2. What is data lineage? [Figure: (a) a primitive operation that maps Input 1 through Input n to an Output; (b) a compound operation tree.]

  3. Why track lineage? GIS: data origins. Material science: component pedigree. Biology: experiment reproducibility. Grid: debugging.

  4. Why certify lineage? Reproduction is costly (PDB: $200,000 / protein; Fermilab Collision Detector: 1 month, multiple TB / datum). Also: reliability, accreditation, ownership, auditability.

  5. What's been done? LFS: inputs, outputs, options → SQL. PASS: runtime environs → Berkeley DB. Trio: tracks data accuracy using lineage. CMCS: chemistry toolkit → WebDAV. Chimera: workflow scripts. myGrid: biology Grid workflows. Vesta: incremental builds. ESSW: Earth Science data management.

  6. What's the problem? Single trust domain: Chimera, myGrid, Vesta, ESSW. Centralized service: LFS, PASS, Trio, CMCS. No assurance: unsigned, incomplete.

  7. What granularity? What to audit: processes, system calls, or the file system? Fine grain → high overhead. Coarse grain → false positives. File system: pro, intermediate complexity; pro, captures most persistent change; con, can't track data from the network, keyboard, pipes, or memory maps.

  8. Certification approach? No global TCB. Require commitments. Check agreement of producer output and consumer input. Trusted user in subtree / path → tampering detectable. [Figure: a producer's output is a consumer's input (Input = Output).]

  9. Metadata generation. Maintain process table entries for accessed and modified files. [Figure: timeline of a process execution, labeled Owner, Process, and Time, with open() / close() pairs around reads of File 1 and File 2 and a write of File 3.]
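
The bookkeeping on slide 9 can be sketched as follows. This is a minimal illustration, not the author's implementation: the class name `ProcessTable`, the `on_open` / `on_close` hooks, and the rule "emit a lineage entry when a written file is closed" are assumptions made for the example.

```python
from collections import defaultdict


class ProcessTable:
    """Hypothetical per-process record of accessed and modified files."""

    def __init__(self):
        self.read_files = defaultdict(set)     # pid -> files opened for reading
        self.written_files = defaultdict(set)  # pid -> files opened for writing

    def on_open(self, pid, path, writing):
        """Record the file in the process's read set or write set."""
        target = self.written_files if writing else self.read_files
        target[pid].add(path)

    def on_close(self, pid, path):
        """Return (output, inputs) if a written file is closed, else None.

        Assumption: every file the process has read so far counts as an
        input of the file it wrote.
        """
        if path in self.written_files[pid]:
            return path, sorted(self.read_files[pid] - {path})
        return None
```

With this sketch, a process that reads File 1 and File 2 and then writes File 3 yields `("File 3", ["File 1", "File 2"])` when File 3 is closed.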

  10. Minimal representation. Record layout: Executor | Signature | Output | Input 1 ... Input n. Executor: 32-bit IPv4 address, 32-bit user ID. Signature: 160 bits, $\mathrm{SIGN}_{K_E}(E, O, I_1, \ldots, I_n)$. Input / Output file: 32-bit IPv4 address (net address), 32-bit inode, 32-bit time (seconds since 1/1/70).
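
The field widths on slide 10 imply a record of 8 + 20 + 12(n + 1) bytes for n inputs. The sketch below packs such a record with Python's `struct` module. The function names, the network byte order, and the concatenation order are assumptions for illustration; the slide gives only the field widths, and the signature here is an opaque 20-byte placeholder rather than a real signature.

```python
import socket
import struct

SIGNATURE_BYTES = 20  # 160 bits, as stated on the slide


def pack_executor(ipv4, uid):
    """32-bit IPv4 address followed by a 32-bit user ID (8 bytes)."""
    return socket.inet_aton(ipv4) + struct.pack("!I", uid)


def pack_file(ipv4, inode, seconds):
    """32-bit IPv4 address, 32-bit inode, 32-bit time in seconds (12 bytes)."""
    return socket.inet_aton(ipv4) + struct.pack("!II", inode, seconds)


def pack_record(executor, signature, output, inputs):
    """Concatenate the fields in the order shown on the slide."""
    if len(signature) != SIGNATURE_BYTES:
        raise ValueError("signature must be 160 bits")
    return executor + signature + output + b"".join(inputs)


def record_size(n_inputs):
    """Size in bytes of a record with n_inputs input files."""
    return 8 + SIGNATURE_BYTES + 12 * (1 + n_inputs)
```

Under these assumptions a record with two inputs occupies 64 bytes, which shows why a single record stays small while a full tree of records does not.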

  11. Workload. Berkeley NOW file system traces: a month of activity, access patterns stable. Instruction: 20 workstations in a teaching lab. Research: 13 desktops of a research group. Web: 1 web server running Postgres. Windows: 8 Windows desktops.

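  12. Cumulative lineage. Current paradigm: the entire tree migrates with the data. Metadata grows rapidly with the number of steps:

      +-------------+--------+--------+-------+--------+--------+
      | Workload    | 1      | 2      | 3     | 4      | 5      |
      +-------------+--------+--------+-------+--------+--------+
      | Instruction | 0.4 KB | 3 KB   | 31 KB | 253 KB | 2 MB   |
      | Research    | 0.2 KB | 0.8 KB | 2 KB  | 8 KB   | 29 KB  |
      | Web         | 1 KB   | 39 KB  | 1 MB  | 29 MB  | 813 MB |
      | Windows     | 0.2 KB | 0.8 KB | 2 KB  | 9 KB   | 30 KB  |
      +-------------+--------+--------+-------+--------+--------+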
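
To make "grows rapidly" concrete, the per-step growth factor can be computed from the slide 12 figures. This is a small arithmetic sketch: the numbers are copied from the slide, while the conversion 1 MB = 1024 KB and the names `CUMULATIVE_KB` and `growth_ratios` are assumptions of the example.

```python
MB = 1024  # assumed number of KB per MB

# Metadata size in KB after 1..5 steps, copied from slide 12.
CUMULATIVE_KB = {
    "Instruction": [0.4, 3, 31, 253, 2 * MB],
    "Research": [0.2, 0.8, 2, 8, 29],
    "Web": [1, 39, 1 * MB, 29 * MB, 813 * MB],
    "Windows": [0.2, 0.8, 2, 9, 30],
}


def growth_ratios(sizes):
    """Ratio of each step's metadata size to the previous step's size."""
    return [later / earlier for earlier, later in zip(sizes, sizes[1:])]


if __name__ == "__main__":
    for workload, sizes in CUMULATIVE_KB.items():
        ratios = ", ".join(f"{ratio:.1f}x" for ratio in growth_ratios(sizes))
        print(f"{workload:12s} {ratios}")
```

A roughly constant ratio per step indicates geometric growth in the depth of the tree, which is the behavior the slide reports for all four workloads.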

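  13. Operational impact. Time (in ms) to read the tree in open(), by number of steps:

      +-------------+------+------+------+-------+
      | Workload    | 1    | 2    | 3    | 4     |
      +-------------+------+------+------+-------+
      | Instruction | 0.04 | 0.05 | 0.11 | 1.72  |
      | Research    | 0.05 | 0.05 | 0.04 | 0.04  |
      | Web         | 0.06 | 0.13 | 6.42 | 997.5 |
      | Windows     | 0.07 | 0.04 | 0.04 | 0.04  |
      +-------------+------+------+------+-------+

      Time (in ms) to write the tree in close(), by number of steps:

      +-------------+------+------+------+--------+
      | Workload    | 1    | 2    | 3    | 4      |
      +-------------+------+------+------+--------+
      | Instruction | 0.20 | 0.28 | 0.32 | 0.84   |
      | Research    | 0.16 | 0.19 | 2.39 | 3.1    |
      | Web         | 0.16 | 0.24 | 4.82 | 579.14 |
      | Windows     | 0.16 | 0.50 | 5.34 | 3.17   |
      +-------------+------+------+------+--------+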

  14. In actu: a larger representation is needed unless certification is available for DHCP bindings, inode mappings, and clock synchronization.

  15. Decentralized lineage. Proposed paradigm: remote pointers replace branches. Metadata remains small. Storage per workload: Instruction 0.4 KB, Research 0.2 KB, Web 1 KB, Windows 0.2 KB.

  16. Verifying lineage. Algorithm: CHECKLINEAGE(D)

        {E, S, O, I_1, ..., I_n} ← GETROOT(D)
        OUTPUT(E)
        P_E ← PKILOOKUP(E)
        if {I_1, ..., I_n} = {}
          then Result ← VERIFY(P_E, S, E, O)
               if Result = FALSE
                 then CheckFailed
          else Result ← VERIFY(P_E, S, E, O | I_1 | ... | I_n)
               if Result = TRUE
                 then for i ← 1 to n
                        do CHECKLINEAGE(I_i)    (reliability drops here)
                 else CheckFailed
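
The recursion in CHECKLINEAGE can be sketched in Python as follows. Several parts are stand-ins and should be read as assumptions: records live in an in-memory dict instead of being fetched from remote nodes, `keys` plays the role of the PKI lookup, and HMAC-SHA1 (a 160-bit MAC, not a public-key signature) replaces SIGN / VERIFY only so that the example runs with the standard library. The names `Record`, `sign`, and `check_lineage` are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    executor: str
    signature: bytes
    output: str
    inputs: tuple = ()


def sign(key, executor, output, inputs=()):
    """Stand-in for SIGN: HMAC-SHA1 over E | O | I_1 | ... | I_n."""
    message = "|".join((executor, output) + tuple(inputs)).encode()
    return hmac.new(key, message, hashlib.sha1).digest()


def check_lineage(data_id, store, keys, executors):
    """Return True if every record in the lineage tree of data_id verifies.

    store maps a data id to its Record (GETROOT), keys maps an executor to
    its key (PKILOOKUP), and executors collects each E seen (OUTPUT).
    """
    record = store.get(data_id)
    if record is None:  # record unavailable, for example a remote node is down
        return False
    executors.append(record.executor)
    key = keys.get(record.executor)
    if key is None:
        return False
    expected = sign(key, record.executor, record.output, record.inputs)
    if not hmac.compare_digest(expected, record.signature):
        return False
    # Each recursive call may need a remote fetch, which is where
    # reliability drops; with no inputs, all() returns True.
    return all(check_lineage(i, store, keys, executors) for i in record.inputs)
```

A leaf record has no inputs, so the same code path covers both branches of the algorithm: the signature is checked over E and O alone, and the recursion ends.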

  17. Increasing availability. Traditional strategy: form a virtual topology and flood neighbors. Inefficient use of storage.

  18. Bonsai. Prune the lineage tree. [Figure: the top λ levels are stored locally; pruned subtrees must be recovered from a remote node.]

  19. Simplest pruning. Trade verification reliability for storage.
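
The pruning idea on slides 18 and 19 can be illustrated with a depth-limited copy of the tree. This is a sketch under assumptions: the class names, the `host` field, and the convention that `levels` counts the root as level 1 are hypothetical, and the slides do not specify how pruned branches are addressed.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class RemotePointer:
    """Placeholder for a pruned subtree that lives on another node."""
    record_id: str
    host: str


@dataclass
class LineageNode:
    record_id: str
    host: str
    inputs: List[Union["LineageNode", RemotePointer]] = field(default_factory=list)


def prune(node, levels):
    """Keep the top `levels` levels locally; replace the rest with pointers."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    if levels == 1:
        children = [RemotePointer(c.record_id, c.host) for c in node.inputs]
    else:
        children = [
            prune(c, levels - 1) if isinstance(c, LineageNode) else c
            for c in node.inputs
        ]
    return LineageNode(node.record_id, node.host, children)


def local_records(node):
    """Number of records stored locally in a (possibly pruned) tree."""
    return 1 + sum(
        local_records(c) for c in node.inputs if isinstance(c, LineageNode)
    )
```

Setting `levels` to 1 keeps only the root record with pointers to its inputs; larger values store more locally and leave fewer remote fetches for verification, which is the storage versus reliability trade named on slide 19.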

  20. Scaling to the Grid

  21. Grid computation. Compute nodes partially trusted. [Figure labels: external trusted databases (NCBI, JGI, TIGR, PDB, Swiss-Prot) supplying new data; the user's administrative domain with the GADU server, Pegasus planner, and comparative analysis database; queries sent to 300 Globus grid nodes ("trust but verify") running BLAST, PFAM, BLOCKS, THMM.]

  22. New problems. Long running jobs: simple pruning insufficient. Lineage generated in many trust domains: cryptographic key retrieval slow. Trusted timestamps require data upload: Grid data sets too large.

  23. Grid properties. Nodes have a stake in cooperating. Nontrivial quality of service. Low churn rates. Common software subset. Small number of domains.

  24. Exploratory strategies. Leverage transitive intra-domain trust (long paths in a single domain). Perform greedy verification (few malicious nodes). Embed forward-secure temporal witnesses (common stake).

  25. Managing Lineage Metadata

  26. "Grid" semantics. Not middleware-specific. Distributed system. Large files. Multiple administrative domains. Range of data sources. Loose collaboration. Non-interactive workloads.

  27. Motivation. Low latency lineage queries enable use for: Quality (GALE: dynamic toolchain selection); Safety / Reliability (check tool dependencies / versions); Trust (limit sources).

  28. Auditing. Baseline: userspace filesystem (FUSE); hooks in Linux kernels > 2.6.14; intercept open(), close(), read(), write(). Exploring: LibAudit, BSM; inter-process: fork(), exec(), clone(); network: connect(), accept().

  29. Process Table

      +-----------+--------------+
      | Field     | Type         |
      +-----------+--------------+
      | LPID      | int(11)      |
      | Host      | varchar(256) |
      | IP        | char(16)     |
      | Time      | datetime     |
      | PID       | int(11)      |
      | PID_Name  | varchar(256) |
      | PPID      | int(11)      |
      | PPID_Name | varchar(256) |
      | UID       | int(11)      |
      | UID_Name  | char(32)     |
      | GID       | int(11)      |
      | GID_Name  | char(32)     |
      | CmdLine   | varchar(256) |
      | Environ   | text         |
      +-----------+--------------+

  30. File Table

      +------------+--------------+
      | Field      | Type         |
      +------------+--------------+
      | LFID       | int(11)      |
      | FileName   | varchar(256) |
      | Time       | datetime     |
      | NewTime    | datetime     |
      | RdWt       | int(11)      |
      | LPID       | int(11)      |
      | Hash       | varchar(256) |
      | Signature  | varchar(256) |
      +------------+--------------+
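
A lineage query over the two tables on slides 29 and 30 might look like the sketch below, which uses an in-memory SQLite database in place of whatever database the slides' tables came from. The table names `Process` and `File`, the encoding of `RdWt` (0 for a read, 1 for a write), and the helper `inputs_of` are assumptions; only the column names and types come from the slides.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Process (
    LPID int(11), Host varchar(256), IP char(16), Time datetime,
    PID int(11), PID_Name varchar(256), PPID int(11), PPID_Name varchar(256),
    UID int(11), UID_Name char(32), GID int(11), GID_Name char(32),
    CmdLine varchar(256), Environ text
);
CREATE TABLE File (
    LFID int(11), FileName varchar(256), Time datetime, NewTime datetime,
    RdWt int(11), LPID int(11), Hash varchar(256), Signature varchar(256)
);
"""

READ, WRITE = 0, 1  # assumed encoding of the RdWt column


def inputs_of(conn, filename):
    """Files read by any process that wrote `filename`, joined via LPID."""
    rows = conn.execute(
        """
        SELECT DISTINCT r.FileName
        FROM File AS w JOIN File AS r ON r.LPID = w.LPID
        WHERE w.FileName = ? AND w.RdWt = ? AND r.RdWt = ?
        ORDER BY r.FileName
        """,
        (filename, WRITE, READ),
    ).fetchall()
    return [name for (name,) in rows]
```

The shared `LPID` column is what links a file access back to the process that performed it, so one self-join on `File` recovers a single step of lineage.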

  31. Auxiliary Files. Application not lineage-aware. Synchronization error. [Figure: black node "forgets"; gray subtree "lost".]

  32. Initial Approaches. Distributed transfer decouples metadata. Auxiliary files; data structure in filesystem; local database; large files; file server; headers / footers; application- / format-specific engineering; in-band encoding.

  33. Overloaded Namespace. Metadata in local database. In the overloaded namespace, metadata "appears" as a header. Change the open(), read(), close() sequence: transparently append lineage. Change the open(), write(), close() sequence: transparently extract lineage. seek() semantics not supported. Limited to ftp, scp, GridFTP, Web browsers.
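
The read-side and write-side transformations on slide 33 can be sketched as a pair of functions over byte strings. The framing used here (a 4-byte magic value followed by a 4-byte length) is invented for the example, as are the names `attach_lineage` and `extract_lineage`; the slide does not describe the actual header format.

```python
import struct

MAGIC = b"LNGE"  # hypothetical marker for a lineage header
HEADER = struct.Struct("!4sI")  # magic, then lineage length in bytes


def attach_lineage(payload, lineage):
    """Read path: present the file with its lineage as a leading header."""
    return HEADER.pack(MAGIC, len(lineage)) + lineage + payload


def extract_lineage(stream):
    """Write path: split incoming bytes into (lineage, payload).

    Data without a recognizable header is returned unchanged with empty
    lineage, so files from lineage-unaware sources still pass through.
    """
    if len(stream) < HEADER.size:
        return b"", stream
    magic, length = HEADER.unpack_from(stream)
    if magic != MAGIC or len(stream) < HEADER.size + length:
        return b"", stream
    start = HEADER.size
    return stream[start:start + length], stream[start + length:]
```

Because the header shifts every byte offset in the file, a sequential transfer tool works unchanged while random access does not, which matches the slide's note that seek() semantics are not supported.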

  34. SQL Response Time

  35. HQL Response Time

  36. Record Replication. Exploit redundancy between lineage trees. Merge the forest into a single graph. Weight vertices with metadata size. Weight edges with tree (DAG) count.
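
The merge described on slide 36 can be sketched as follows. The representation is an assumption: each lineage tree (or DAG) is given as a collection of `(output, input)` edges, `sizes` maps a record to its metadata size in bytes, and the function name `merge_forest` is hypothetical. How the resulting weights drive replication is not specified on the slide and is left out.

```python
from collections import Counter


def merge_forest(trees, sizes):
    """Merge lineage trees into one weighted graph.

    Returns (vertex_weight, edge_weight): vertex_weight maps each record to
    its metadata size, and edge_weight counts how many trees contain each
    (output, input) edge.
    """
    vertex_weight = {}
    edge_weight = Counter()
    for edges in trees:
        for edge in set(edges):  # count each edge at most once per tree
            edge_weight[edge] += 1
            for vertex in edge:
                vertex_weight[vertex] = sizes[vertex]
    return vertex_weight, edge_weight
```

An edge with a high count is shared by many lineage trees, so the records at its endpoints are the redundant ones that the slide proposes to exploit.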
