
Data Provenance and Reproducibility
Tim Schmidt, Syeda Hiba Ahmad



  1. Data Provenance and Reproducibility
     Tim Schmidt, Syeda Hiba Ahmad
     23.06.2020 | Fachbereich Informatik (Department of Computer Science) | Software Engineering for Artificial Intelligence | Tim Schmidt, Syeda Hiba Ahmad

  2. Outline
     ● Reproducibility
       ▪ Importance
       ▪ Crisis
       ▪ In ML
       ▪ In Companies
       ▪ What to do about it?
     ● Data Provenance
       ▪ Importance
       ▪ Challenges
       ▪ Current Standards
       ▪ Different Approaches
       ▪ Provenance Taxonomy
       ▪ Examples
     ● Future Work

  3. Reproducibility
     “A measure of whether results can be attained by a different research team, using the same methods.”
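The definition above can be turned into a simple check: a result counts as reproduced if a second, independent run of the same method lands within a tolerance of the first. The sketch below is only an illustration; the function name `is_reproduced` and the relative tolerance are assumptions, not part of the slides.

```python
import math


def is_reproduced(original: float, rerun: float, rel_tol: float = 0.01) -> bool:
    """Return True if the rerun matches the original within a relative tolerance.

    The 1% default tolerance is an arbitrary, hypothetical choice.
    """
    return math.isclose(original, rerun, rel_tol=rel_tol)
```

A reported accuracy of 0.912 and a rerun of 0.910 would pass this check, while a rerun of 0.85 would not.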

  4. Importance of Reproducibility
     ● Shows that there are no confounding variables
       ○ Protects against fraud
       ○ Protects against human error

  5. Reproducibility Crisis
     A crisis of repeatability: “Of these 100 studies, just 68 reproductions provided [..] results that matched the original findings.”
     A crisis of description: Of 400 algorithms [..] He found that only 6% [..] shared the algorithm’s code. Only a third shared the data they tested their algorithms on, and just half shared “pseudocode”.

  6. Reproducibility in ML
     ● ML swaps heuristics for a black box to get better results
     ● Randomness between runs (need to fix meta-parameters)
     ● Need to store data

  7. Reproducibility in Companies
     ● If a researcher drops out, somebody else should be able to step in
     ● Cope with changed requirements or platforms
       ○ Time saver in the long run
     ● Otherwise, it is a lot riskier to try different variations

  8. What to do about it?
     Science:
     ● Provide all the info: code, data, description
     Companies:
     ● No wrong incentives
     ● Repetition of experiments
     ● Keep the team educated

  9. What to do about it?
     ML:
     ● Disable features that cause non-deterministic results
     ● Versioning of models
       ➔ (Jupyter Notebooks are a nightmare)
     ● Data versioning
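One common way to remove run-to-run randomness is to fix the random seed and record it with the results. The sketch below uses only Python's standard library; `run_experiment` is a hypothetical stand-in for a training run, and real ML frameworks have their own seeding and determinism settings that would also need to be fixed.

```python
import random


def run_experiment(seed: int, n_samples: int = 5) -> list[float]:
    """Hypothetical experiment: draws pseudo-random numbers from a seeded generator."""
    rng = random.Random(seed)  # local generator, so no hidden global state is involved
    return [rng.random() for _ in range(n_samples)]


if __name__ == "__main__":
    first = run_experiment(seed=42)
    second = run_experiment(seed=42)
    print("identical runs:", first == second)
```

Using a local `random.Random` instance instead of the module-level functions keeps the seed explicit and makes it easy to store next to the model version.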

  10. Data Provenance
      “Data Provenance is the documentation of data in sufficient detail to allow reproducibility of a specific dataset.”

  11. Importance
      ● Compliance (e.g. GDPR, German: DSGVO)
      ● Necessary if accused of fraud
      ● Prevents manual errors
      ● Changes in underlying database
      ● Data needs to be trustworthy
      ● Root cause analysis

  12. Challenges
      ● Large data sizes
      ● Provenance overhead
      ● Archiving (vs changes)

  13. Archiving (vs changes)
      [Figure: archived versions compared with changes. Key: Added, Deleted, Modified]

  14. Current Standards
      “The PROV Family of Documents defines a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web.”
      https://www.w3.org/TR/prov-overview/
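As an illustration of what a PROV-style record can look like, the sketch below builds a small dictionary loosely modelled on the PROV-JSON serialization (entities, activities, and the relations between them). The identifiers such as `ex:raw.csv` and the helper `make_prov_record` are hypothetical, and the output is not validated against the W3C documents.

```python
import json


def make_prov_record(input_id: str, output_id: str, activity_id: str) -> dict:
    """Describe that `output_id` was produced from `input_id` by `activity_id`."""
    return {
        "prefix": {"ex": "http://example.org/"},  # assumed example namespace
        "entity": {input_id: {}, output_id: {}},
        "activity": {activity_id: {}},
        "used": {"_:u1": {"prov:activity": activity_id, "prov:entity": input_id}},
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": output_id, "prov:activity": activity_id}
        },
        "wasDerivedFrom": {
            "_:d1": {"prov:generatedEntity": output_id, "prov:usedEntity": input_id}
        },
    }


if __name__ == "__main__":
    record = make_prov_record("ex:raw.csv", "ex:clean.csv", "ex:cleaning-run")
    print(json.dumps(record, indent=2))
```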

  15. Different Approaches
      ● Where-Provenance (original source), Why-Provenance (contributing source) and How-Provenance (transformation)
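A toy query can make the three notions concrete. The sketch below is a simplification of the formal definitions: it filters a small, made-up `ORDERS` table and annotates each result with where the value was copied from, which source rows made it appear, and how it was computed. All names and data are hypothetical.

```python
# Hypothetical source table: row id -> row contents.
ORDERS = {
    "r1": {"customer": "alice", "amount": 120},
    "r2": {"customer": "bob", "amount": 80},
    "r3": {"customer": "alice", "amount": 300},
}


def big_spenders(orders: dict, threshold: int) -> dict:
    """Return customers with an order above `threshold`, annotated with provenance."""
    result: dict = {}
    for row_id, row in orders.items():
        if row["amount"] > threshold:
            entry = result.setdefault(
                row["customer"],
                {
                    "where": [],  # source cells the output value was copied from
                    "why": [],    # source rows that contributed to the result
                    "how": f"project(customer, select(amount > {threshold}, orders))",
                },
            )
            entry["where"].append(f"orders/{row_id}/customer")
            entry["why"].append(row_id)
    return result


if __name__ == "__main__":
    print(big_spenders(ORDERS, threshold=100))
```

With a threshold of 100, only "alice" is returned, and her entry lists rows `r1` and `r3` as contributing sources.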

  16. Provenance Taxonomy

  17. Use of Provenance
      ● Data Quality: Level of detail and error
      ● Audit Trail: Process with which data is produced
      ● Replication: Availability of similar sources
      ● Attribution: Pedigree (ownership)
      ● Informational: Metadata (descriptive)

  18. Subject of Provenance
      ● Data Oriented (Explicit) Model: Metadata from source data
      ● Process Oriented (Indirect): Metadata from process inputs and outputs
      ● Granularity: Level of detail

  19. Provenance Representation
      ● Annotation: Descriptions about source data and processes
      ● Inversion: Reverse-engineering queries
      ● Contents: Of annotation and inversion methods
      ● Syntactic Information: The form in which data is stored
      ● Semantic Information: The meaning given to the data

  20. Storing Provenance
      ● Tightly Coupled: Close relation with data
      ● Loosely Coupled: Slight relation with data
      ● Scalability: Growth of system
      ● Overhead: Management costs

  21. Provenance Dissemination

  22. Practical Examples
      ● DVC (for small projects)
        ○ Git extension for data
      ● Pachyderm (for bigger projects)
        ○ Runs on Kubernetes
      More information

  23. Data Version Control (DVC)
      Data versioning:
      ● DVC file
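The idea behind a DVC file can be sketched as a small pointer record: the large data file stays outside Git, and only a hash of its contents is versioned. The code below is not DVC's implementation; the helper `make_pointer` and the exact field layout are assumptions that merely resemble the `outs` entries found in `.dvc` files.

```python
import hashlib
import tempfile
from pathlib import Path


def make_pointer(data_path: Path) -> dict:
    """Build a small pointer record for a data file based on its content hash."""
    content = data_path.read_bytes()
    return {
        "outs": [
            {
                "md5": hashlib.md5(content).hexdigest(),  # identifies this exact version
                "size": len(content),
                "path": data_path.name,
            }
        ]
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "train.csv"  # hypothetical dataset
        data_file.write_text("x,y\n1,2\n")
        print(make_pointer(data_file))
```

Any change to the data changes the hash, so committing the pointer record alongside the code ties each code version to one exact data version.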

  24. Data Pipelines & Experiments
      ● Link processing steps together
      ● Store versions and parameters
      ● Compare versions
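The three points above can be illustrated with a minimal sketch: stages are chained, each run stores its parameters together with a hash of every intermediate output, and two runs can then be compared. The stage names (`scale`, `clip`), the parameters, and the helper functions are hypothetical and do not reflect how DVC implements pipelines.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Stable content hash of a JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


def run_pipeline(data: list, params: dict) -> tuple:
    """Run two linked stages and record parameters plus output hashes."""
    stages = [
        ("scale", lambda xs: [x * params["factor"] for x in xs]),
        ("clip", lambda xs: [min(x, params["max_value"]) for x in xs]),
    ]
    record = {"params": dict(params), "input": fingerprint(data), "stages": []}
    current = data
    for name, stage in stages:
        current = stage(current)  # the output of one stage feeds the next
        record["stages"].append({"name": name, "output": fingerprint(current)})
    return current, record


def compare_runs(a: dict, b: dict) -> dict:
    """Return the parameters that differ between two run records."""
    keys = sorted(set(a["params"]) | set(b["params"]))
    return {
        k: (a["params"].get(k), b["params"].get(k))
        for k in keys
        if a["params"].get(k) != b["params"].get(k)
    }
```

Comparing the stored records shows which parameter changed between two experiments, and the per-stage hashes show from which stage onward the outputs diverge.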

  25. Pachyderm
      ● Based on Kubernetes and Docker

  26. Pachyderm: Data Versioning
      [Figure: pipeline diagram with stages Pre1, Pre2 and Net]

  27. Pachyderm: Containerized Analysis
      [Figure: pipeline diagram with stages Pre1, Pre2 and Net]

  28. Pachyderm: Data Pipelines
      [Figure: pipeline diagram with stages Pre1, Pre2 and Net]

  29. Pachyderm: Scalable Stages
      [Figure: pipeline diagram with stages Pre1, Pre2 and Net]

  30. Pachyderm: Data Provenance
      [Figure: pipeline diagram with stages Pre1, Pre2 and Net]

  31. Google Dataset Search (GOODS)

  32. Google Dataset Search (GOODS)

  33. What is Next?
      ● How to rank datasets
      ● How to identify important datasets
      ● Handling missing metadata
      ● More work on data semantics
      ● Data citation
      ● Environment information
      ● Applications in ML, social media, blockchain, cybersecurity

  34. Questions?

  35. Thank you! :)
