Dirty Data: It's a mess. It's your problem.
Friso van Vollenhoven @fzk frisovanvollenhoven@godatadriven.com
Go DataDriven, proudly part of the Xebia Group


  1. Dirty Data It’s a mess. It’s your problem. Friso van Vollenhoven @fzk frisovanvollenhoven@godatadriven.com Go DataDriven PROUDLY PART OF THE XEBIA GROUP

  2. 'februari-22 2013'

  3. A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.

  4. þ

  5. þ

  6. TSV == thorn separated values?

  7. þ == 0xFE

  8. or -2, in Hive:
     CREATE TABLE browsers (
       browser_id STRING,
       browser STRING
     )
     ROW FORMAT DELIMITED
     FIELDS TERMINATED BY '-2';
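The '-2' works because Hive reads the delimiter as a signed byte: 0xFE, the Latin-1 code for þ, is -2 in two's complement. A minimal Python sketch of that equivalence (the sample record is invented):

```python
# 0xFE as an unsigned byte is 254; as a signed 8-bit integer it is -2,
# which is why Hive accepts '-2' as the terminator for the þ character.
def to_signed_byte(b: int) -> int:
    """Interpret an unsigned byte value (0-255) as signed two's complement."""
    return b - 256 if b > 127 else b

assert to_signed_byte(0xFE) == -2

# Splitting a "thorn separated values" line; 0xFE decodes to þ in Latin-1.
line = b"42\xfeMozilla Firefox".decode("latin-1")
fields = line.split("\u00fe")  # þ is U+00FE
assert fields == ["42", "Mozilla Firefox"]
```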

  9. • The format will change • Faulty deliveries will occur • Your parser will break • Records will be mistakenly produced (over-logging) • Other people test in production too (and you get the data from it) • Etc., etc.
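Since breakage is a given, it pays to parse defensively: validate every record and route rejects to a side channel instead of failing the whole job. A minimal sketch (the field layout and validation rule are invented, not from the talk):

```python
def parse_lines(lines):
    """Split records into good rows and rejects; one bad line must not kill the job."""
    good, rejects = [], []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Invented rule: expect exactly 2 fields, the first numeric.
        if len(fields) == 2 and fields[0].isdigit():
            good.append((int(fields[0]), fields[1]))
        else:
            # Keep the raw line so the bad delivery can be inspected and replayed.
            rejects.append((lineno, line))
    return good, rejects

good, bad = parse_lines(["1\tFirefox\n", "oops\n", "2\tChrome\n"])
assert good == [(1, "Firefox"), (2, "Chrome")]
assert bad == [(2, "oops\n")]
```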

  10. • Simple deployment of ETL code • Scheduling • Scalable • Independent jobs • Fixable data store • Incremental where possible • Metrics

  11. EXTRACT TRANSFORM LOAD

  12. • No JVM startup overhead for Hadoop API usage • Relatively concise syntax (Python) • Mix Python standard library with any Java libs

  13. • Flexible scheduling with dependencies • Saves output • E-mails on errors • Scales to multiple nodes • REST API • Status monitor • Integrates with version control

  14. Deployment git push jenkins master

  15. Independent jobs:
      source (external)
        → HDFS upload + move in place → staging (HDFS)
        → MapReduce + HDFS move → hive-staging (HDFS)
        → Hive map external table + SELECT INTO → Hive
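Each hop in that chain writes to a temporary location and only a rename "moves it in place", so a downstream job never sees half-written data. A sketch of that pattern using local directories as stand-ins for HDFS paths (all names invented):

```python
import os
import tempfile

def move_in_place(data: bytes, staging_dir: str, name: str) -> str:
    """Write to a temp file first, then atomically rename into the staging dir."""
    os.makedirs(staging_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    final_path = os.path.join(staging_dir, name)
    # Rename is atomic on the same filesystem: readers see the old state
    # or the complete new file, never a partial write.
    os.rename(tmp_path, final_path)
    return final_path

staging = tempfile.mkdtemp()
path = move_in_place(b"browser data", staging, "part-00000")
assert open(path, "rb").read() == b"browser data"
```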

  16. Out-of-order jobs • At any point, you don’t really know what ‘made it’ to Hive • Will happen anyway, because some days the data delivery is going to be three hours late • Or you get half in the morning and the other half later in the day • It really depends on what you do with the data • This is where metrics + fixable data store help...

  17. Fixable data store • Using Hive partitions • Jobs that move data from staging create partitions • When new data / insight about the data arrives, drop the partition and re-insert • Be careful to reset any metrics in this case • Basically: instead of trying to make everything transactional, repair afterwards • Use metrics to determine whether data is fit for purpose
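Repairing a day then amounts to dropping its partition and re-inserting from staging. A sketch that builds the Hive statements as strings (table, column, and partition names are invented):

```python
def repair_partition(table: str, day: str, staging_table: str) -> list:
    """Hive statements to drop one day's partition and reload it from staging."""
    # A real repair job must also reset any metrics recorded for this day,
    # or counts will be double-booked after the re-insert.
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION (dt='{day}');",
        f"INSERT OVERWRITE TABLE {table} PARTITION (dt='{day}') "
        f"SELECT browser_id, browser FROM {staging_table} WHERE dt='{day}';",
    ]

stmts = repair_partition("browsers", "2013-02-22", "browsers_staging")
assert "DROP IF EXISTS PARTITION (dt='2013-02-22')" in stmts[0]
```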

  18. Metrics

  19. Metrics service • Job ran, so many units processed, took so much time • e.g. 10GB imported, took 1 hr • e.g. 60M records transformed, took 10 minutes • Dropped partition • Inserted X records into partition
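Those events all share one shape: job name, units processed, duration. A minimal sketch of such a recorder (storage here is an in-memory list; a real service would persist the events):

```python
import time

class Metrics:
    """Record 'job ran, so many units, took so much time' events."""

    def __init__(self):
        self.events = []

    def record(self, job: str, units: int, seconds: float):
        self.events.append({"job": job, "units": units, "seconds": seconds})

    def timed(self, job: str, units: int, fn):
        """Run fn, timing it, and record one event for the job."""
        start = time.monotonic()
        result = fn()
        self.record(job, units, time.monotonic() - start)
        return result

m = Metrics()
m.timed("import", 10_000, lambda: None)  # e.g. "10GB imported, took 1 hr"
assert m.events[0]["job"] == "import"
```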

  20. Go DataDriven We’re hiring / Questions? / Thank you! Friso van Vollenhoven @fzk frisovanvollenhoven@godatadriven.com
