Towards Big Data Provenance
Daniel Deutch
Blavatnik School of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University
cartoon by T. Gregorius
Big Data, Big Problem
• Data-intensive systems are highly complex
  – Manipulate large-scale data in intricate ways
  – Machine Learning, Data Mining systems are often a black box, even black magic?
• Error-prone
  – Errors in input (measurements, crowd, text)
  – Errors in processing (ambiguities, imperfect text understanding, “bugs”)
• Inherent precision vs. recall tradeoff
• Difficult to justify and explain (even correct) results
• Essentially no one can know if the system has used only legitimate data
• There’s a danger of these systems (being perceived as) getting out of control
Example: Leakage in Data Mining
• “One concrete example we've seen occurred in a prostate cancer dataset. Hidden among hundreds of variables in the training data was a variable named PROSSURG. It turned out this represented whether the patient had received prostate surgery, an incredibly predictive but out-of-scope value.” (https://www.kaggle.com/wiki/Leakage)
• “An account number feature used for predicting whether a potential customer would open an account at a bank.”
• “An interviewer name feature, in a cellular company churn prediction problem.
  – […] It turns out that a specific salesperson was assigned to take over cases where customers had already notified they intend to churn.”
“Leakage in Data Mining: Formulation, Detection, and Avoidance”, Kaufman, Rosset, Perlich, ACM Transactions on Knowledge Discovery from Data (TKDD) 6.4 (2012): 15
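The leakage examples above share one symptom: a single feature predicts the label almost perfectly by itself. The sketch below is a crude screening heuristic under that assumption, not the detection method of the cited paper. The function names, the toy rows and the 0.95 threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(rows, feature, label):
    """Accuracy of predicting `label` with the majority label of each `feature` value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[label]] += 1
    # For every feature value, the majority label is right this many times.
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(rows)


def flag_suspicious_features(rows, label, threshold=0.95):
    """Return features that, taken alone, reach at least `threshold` accuracy."""
    features = [name for name in rows[0] if name != label]
    return [f for f in features
            if single_feature_accuracy(rows, f, label) >= threshold]


# Hypothetical toy data, loosely modelled on the PROSSURG example.
rows = [
    {"age_band": "60s", "had_surgery": 1, "has_cancer": 1},
    {"age_band": "60s", "had_surgery": 0, "has_cancer": 0},
    {"age_band": "70s", "had_surgery": 1, "has_cancer": 1},
    {"age_band": "70s", "had_surgery": 0, "has_cancer": 0},
    {"age_band": "60s", "had_surgery": 1, "has_cancer": 1},
    {"age_band": "70s", "had_surgery": 0, "has_cancer": 0},
]
suspicious = flag_suspicious_features(rows, "has_cancer")
print(suspicious)
```

The accuracy is measured on the same rows it is computed from, so identifier-like features with many distinct values (such as an account number) are flagged as well. A flagged feature is only a candidate for manual review: deciding whether it is legitimate requires knowing where it came from, which is the role of provenance.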
Provenance to the rescue
• Tracking where data came from, how it was extracted, and how it was manipulated
• Provenance leads to better applications and more reliable data
• Goal: seamless provenance tracking through tools
  – Allow application owners to easily integrate provenance solutions
  – With reasonable overhead in time and storage
• Fundamental Challenge: Big data, even bigger provenance
• Many modeling, algorithmic and implementation challenges
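To make "tracking where data came from" concrete, here is a minimal sketch of lineage-style provenance: every tuple carries the set of input tuples it was derived from, and each operator propagates those sets. It illustrates the general idea only and is not the model of any specific system or paper mentioned in this talk. The `Annotated` class, the operators and the sample tables are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    values: tuple
    sources: frozenset  # identifiers of the input tuples this tuple derives from


def load(table_name, rows):
    """Tag every input row with its own identifier, e.g. 'readings:2'."""
    return [Annotated(tuple(row), frozenset({f"{table_name}:{i}"}))
            for i, row in enumerate(rows)]


def select(relation, predicate):
    """Selection keeps a tuple's provenance unchanged."""
    return [t for t in relation if predicate(t.values)]


def join(left, right, left_col, right_col):
    """A joined tuple depends on both of its inputs, so the source sets are unioned."""
    return [Annotated(l.values + r.values, l.sources | r.sources)
            for l in left for r in right
            if l.values[left_col] == r.values[right_col]]


patients = load("patients", [("p1", "clinicA"), ("p2", "clinicB")])
readings = load("readings", [("p1", 7.2), ("p2", 3.1), ("p1", 6.8)])

high = select(readings, lambda v: v[1] > 5.0)
result = join(patients, high, 0, 0)

for t in result:
    print(t.values, sorted(t.sources))
```

Even in this toy setting every output tuple stores a set of identifiers that grows with each join, which hints at why provenance can become larger than the data itself.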
What can we do? (Examples)
• Develop provenance models for expressive languages
  Bourhis, D., Moskovitch, Analyzing Data-Centric Applications: Why, What-if, and How-to, ICDE ’16 (to appear)
  D., Moskovitch, Tannen, A Provenance Framework for Data-Dependent Process Analysis, VLDB ’14
  D., Roy, Milo, Tannen, Provenance Circuits for Datalog, ICDT ’14
• Store only relevant parts of the provenance
  D., Gilad, Moskovitch, Selective Provenance for Datalog Programs Using Top-K Queries, VLDB ’15
• Summarize the provenance
  Ainy, Bourhis, Davidson, D., Milo, Approximated Summarization of Data Provenance, CIKM ’15
• Express it in Natural Language
  D., Frost, Gilad, NLProv: Natural Language Provenance, Submitted
Thank you for listening
cartoon by T. Gregorius