
Big Data, Big Problem: Data-intensive systems are highly complex

Towards Big Data Provenance. Daniel Deutch, Blavatnik School of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University.


  1. Towards Big Data Provenance
     Daniel Deutch
     Blavatnik School of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University
     (cartoon by T. Gregorius)

  2. Big Data, Big Problem
     • Data-intensive systems are highly complex
       – They manipulate large-scale data in intricate ways
       – Machine learning and data mining systems are often black boxes, even black magic?
     • Error-prone
       – Errors in input (measurements, crowd, text)
       – Errors in processing (ambiguities, imperfect text understanding, “bugs”)
     • Inherent precision vs. recall tradeoff
     • Difficult to justify and explain (even correct) results
     • Essentially no one can know whether the system has used only legitimate data
     • There’s a danger of these systems (being perceived as) getting out of control

  3. Example: Leakage in Data Mining
     • “One concrete example we've seen occurred in a prostate cancer dataset. Hidden among hundreds of variables in the training data was a variable named PROSSURG. It turned out this represented whether the patient had received prostate surgery, an incredibly predictive but out-of-scope value.” (https://www.kaggle.com/wiki/Leakage)
     • “An account number feature used for predicting whether a potential customer would open an account at a bank.”
     • “An interviewer name feature, in a cellular company churn prediction problem.
       – […] It turns out that a specific salesperson was assigned to take over cases where customers had already notified they intend to churn.”
     “Leakage in Data Mining: Formulation, Detection, and Avoidance”, Kaufman, Rosset, Perlich, ACM Transactions on Knowledge Discovery from Data (TKDD) 6.4 (2012): 15
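The leakage examples on slide 3 share one symptom: a single feature predicts the label almost perfectly. A minimal screening sketch of that idea follows. It is not the detection method of the cited paper; the toy rows, the `psa_high` and `age_band` columns, the `has_cancer` label, and the 0.95 threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(rows, feature, label):
    """Accuracy of predicting `label` from `feature` alone,
    using the majority label seen for each feature value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[label]] += 1
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(rows)


def flag_suspicious_features(rows, label, threshold=0.95):
    """Return features whose single-feature accuracy reaches `threshold`.
    A flagged feature is a candidate for manual review, not proof of leakage."""
    features = [name for name in rows[0] if name != label]
    scores = {f: single_feature_accuracy(rows, f, label) for f in features}
    return {f: s for f, s in scores.items() if s >= threshold}


# Hypothetical toy data: PROSSURG mirrors the label exactly.
rows = [
    {"age_band": "60-69", "psa_high": 1, "PROSSURG": 1, "has_cancer": 1},
    {"age_band": "60-69", "psa_high": 0, "PROSSURG": 0, "has_cancer": 0},
    {"age_band": "70-79", "psa_high": 1, "PROSSURG": 1, "has_cancer": 1},
    {"age_band": "70-79", "psa_high": 1, "PROSSURG": 0, "has_cancer": 0},
    {"age_band": "50-59", "psa_high": 0, "PROSSURG": 0, "has_cancer": 0},
    {"age_band": "50-59", "psa_high": 0, "PROSSURG": 1, "has_cancer": 1},
]

suspicious = flag_suspicious_features(rows, "has_cancer")
print(suspicious)
```

On this toy data only `PROSSURG` is flagged. The check is evaluated on the same rows it is fitted on, so a feature with many distinct values (an account number, say) will also score high; whether a flagged feature is legitimate still depends on knowing where it came from.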

  4. Provenance to the rescue
     • Tracking where data came from, how it was extracted, and how it was manipulated
     • Provenance leads to better applications and reliable data
     • Goal: seamless provenance tracking through tools
       – Allow application owners to easily integrate provenance solutions
       – With reasonable overhead in time and storage
     • Fundamental challenge: big data, even bigger provenance
     • Many modeling, algorithmic and implementation challenges
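To make the idea on slide 4 concrete, here is a minimal sketch of values that carry their own provenance: the set of sources they came from and the sequence of operations applied. The `Tracked` class, the `from_source` and `apply` helpers, and the source names are hypothetical; this illustrates the general idea and is not one of the provenance models developed in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A value together with where it came from and how it was derived."""
    value: object
    sources: frozenset
    steps: tuple = ()


def from_source(value, source_id):
    """Wrap a raw input value, recording the source it was read from."""
    return Tracked(value, frozenset([source_id]))


def apply(op_name, fn, *inputs):
    """Apply `fn` to the raw values and merge the inputs' provenance."""
    result = fn(*(t.value for t in inputs))
    sources = frozenset().union(*(t.sources for t in inputs))
    steps = tuple(s for t in inputs for s in t.steps) + (op_name,)
    return Tracked(result, sources, steps)


# Hypothetical inputs: one sensor reading and one crowd-sourced reading.
a = from_source(36.6, "sensor_A")
b = from_source(38.2, "crowd_form_17")

mean = apply("mean", lambda x, y: (x + y) / 2, a, b)
flag = apply("is_fever", lambda m: m > 37.0, mean)

print(flag.value, sorted(flag.sources), flag.steps)
```

Even in this toy version the provenance grows with every operation while the result stays a single value, which is the "big data, even bigger provenance" challenge in miniature.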

  5. What can we do? (Examples)
     • Develop provenance models for expressive languages
       – Bourhis, D., Moskovitch, Analyzing Data-Centric Applications: Why, What-if, and How-to, ICDE ’16 (to appear)
       – D., Moskovitch, Tannen, A Provenance Framework for Data-Dependent Process Analysis, VLDB ’14
       – D., Roy, Milo, Tannen, Provenance Circuits for Datalog, ICDT ’14
     • Store only relevant parts of the provenance
       – D., Gilad, Moskovitch, Selective Provenance for Datalog Programs Using Top-K Queries, VLDB ’15
     • Summarize the provenance
       – Ainy, Bourhis, Davidson, D., Milo, Approximated Summarization of Data Provenance, CIKM ’15
     • Express it in natural language
       – D., Frost, Gilad, NLProv: Natural Language Provenance, submitted

  6. Thank you for listening
     (cartoon by T. Gregorius)
