Automated Data Curation at Scale

Bernhard Bicher (CEO)
Dr. Noah S. Bieler (Principal Data Scientist)
Winterthur, 12th of June 2015

Data Preparation Today

Data scientists spend up to 80% of their time preparing data.
Data preparation is not a self-service activity: it cannot be done without IT involvement.
Semi-automatic integration of more than 25 data sources is infeasible.
Data origins and lineage are frequently lost during processing.

Three Options

✔ Probabilistic: use statistics, NLP, ML
    Choosing and combining the right algorithms
    Only approximate results
✖ Manual: hire work force
    Unreliable
    Not sustainable
    Expensive
✖ Rule-based: ETL
    High maintenance
    Completeness
    Needs expensive IT guy

ETL  Extract Transform Load
NLP  Natural Language Processing
ML   Machine Learning

The Art of Data Integration

Steps: Identify Sources, Profile Data, Clean Data, Normalise Data, Identify Joins, Entity Resolution, Deduplication, Post-Processing
Result: Integrated Data
[Figure: automation potential of each step on a scale of Low, Medium, High, Very High; automation using probabilistic approaches]

Probabilistic Methods and Approaches

Identify Sources
Profile Data: Outlier Detection, Authoritative Data, Type Detection
Clean Data, Normalise Data: Encoding Errors Fixing, Pattern Mining, Column Swap
Identify Joins: Probability Distribution, Entropy Measurement
Entity Resolution: Naive vs. Advanced ML Approaches
Deduplication: Computational Complexity Reduction
Post-Processing

Profile Data
Example: Probabilistic Schema Detection

First Name  Last Name  Premium  City        Country
Hans        Müller     TRUE     Winterthur  N/A
Hans        Mueller    1        Winterthur  CH
Jan         Muster     FALSE    Windisch    CH

Identify missing values (Country: N/A).
Content detection using decision trees (node labels: String, Formatted Numbers, Dates, Phone Numbers, Mostly Characters, All Capital, Mixed).
Outlier detection based on histograms (Premium: TRUE, FALSE, 1, 0).
Profiling based on authoritative data (Last Name: Müller, Mundt, Muster, …).
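The profiling step above can be illustrated with a small Python sketch. It is not the authors' implementation: the slide names learned decision trees, whereas this sketch uses a hand-written rule cascade as a stand-in, and the token set `MISSING_TOKENS`, the category names and the 80% threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed spellings of "no value"; a real profiler would learn or configure these.
MISSING_TOKENS = {"", "n/a", "na", "null", "none", "-"}


def is_missing(value):
    """Return True if the cell holds no usable value."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def detect_content(value):
    """Classify one cell with fixed rules (a stand-in for a learned decision tree)."""
    if is_missing(value):
        return "missing"
    text = str(value).strip()
    # Checked before "number" so that 1 and 0 count as boolean spellings.
    if text.upper() in {"TRUE", "FALSE", "1", "0"}:
        return "boolean"
    if re.fullmatch(r"[+-]?\d+([.,]\d+)?", text):
        return "number"
    if text.isalpha() and text.isupper():
        return "all capital"
    letters = sum(ch.isalpha() for ch in text)
    if letters / len(text) >= 0.8:  # assumed threshold
        return "mostly characters"
    return "mixed"


def profile_column(values):
    """Histogram of detected content types for one column."""
    return Counter(detect_content(v) for v in values)


premium = ["TRUE", "1", "FALSE"]
country = ["N/A", "CH", "CH"]
print(profile_column(premium))
print(profile_column(country))
```

A histogram dominated by one category suggests the column type, and the minority categories are candidates for missing values or outliers.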

Clean, Normalise and Impute Data

First Name  Last Name   Premium  City        Country
Max         Morgenthal  TRUE     Winterthur
Hans        M⧫ller      TRUE     Winterthur  CH
Hans        Mueller     1        CH          Winterthur
Jan         Muster      FALSE    Windisch    CHE

Pattern mining: City = "Winterthur" implies Country = "CH".
Fix encoding errors: M⧫ller → Müller.
Column swap: City and Country are exchanged in the third row.
Normalisation according to a synonym table:

ISO2  ISO3  Name  …
CH    CHE   Schweiz
DE    DEU   Deutschland
FR    FRA   Frankreich
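A minimal Python sketch of the four cleaning operations on this slide follows. It only illustrates the idea under stated assumptions: the lookup tables are copied from the slide, the mined rule is passed in as a plain dictionary (`city_rules`, a hypothetical name), and the encoding repair matches against a list of authoritative names by treating the damaged character as a one-character wildcard. None of this is claimed to be the presenters' algorithm.

```python
import re

# Synonym table from the slide, keyed towards the ISO2 code.
ISO3_TO_ISO2 = {"CHE": "CH", "DEU": "DE", "FRA": "FR"}
NAME_TO_ISO2 = {"Schweiz": "CH", "Deutschland": "DE", "Frankreich": "FR"}
KNOWN_ISO2 = set(ISO3_TO_ISO2.values())


def normalise_country(value):
    """Map ISO3 codes and country names to ISO2; leave anything else unchanged."""
    if value in KNOWN_ISO2:
        return value
    return ISO3_TO_ISO2.get(value) or NAME_TO_ISO2.get(value) or value


def repair_encoding(value, authoritative, bad_char="\ufffd"):
    """Replace a damaged character if exactly one authoritative name fits."""
    if bad_char not in value:
        return value
    # Each damaged character may stand for any single character.
    pattern = re.compile(".".join(re.escape(p) for p in value.split(bad_char)))
    matches = [name for name in authoritative if pattern.fullmatch(name)]
    return matches[0] if len(matches) == 1 else value


def clean_record(record, city_rules):
    """Undo a City/Country swap, normalise the country, then impute it."""
    rec = dict(record)
    city_is_country = normalise_country(rec.get("city")) in KNOWN_ISO2
    country_is_country = normalise_country(rec.get("country")) in KNOWN_ISO2
    if city_is_country and not country_is_country:
        rec["city"], rec["country"] = rec["country"], rec["city"]
    rec["country"] = normalise_country(rec.get("country"))
    if not rec["country"]:
        # Mined rule of the form: City = "Winterthur" implies Country = "CH".
        rec["country"] = city_rules.get(rec.get("city"))
    return rec


rules = {"Winterthur": "CH"}
print(clean_record({"city": "CH", "country": "Winterthur"}, rules))
print(clean_record({"city": "Winterthur", "country": None}, rules))
print(repair_encoding("M\ufffdller", ["Müller", "Mundt", "Muster"]))
```

The order matters in this sketch: the swap check runs before normalisation and imputation, so that a country code sitting in the City column is not overwritten by a rule lookup.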

Identify Join Columns
Comparison of Probability Distribution

Datasilo 1
FirstName  ClientID   Premium  …
Martin     1028934-1  TRUE
Sara       7462946-5  TRUE
Anna       9471991-3  FALSE

Datasilo 2
CID        ProductName      ProductID  …
C-9471991  Monitor LCD      6413
C-7462946  Mouse Laser      5433
C-1028934  Keyboard QWERTY  961

[Figure: column distributions µ1, µ1' and µ2, with a "similar" marker]
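One way to read "comparison of probability distribution" is sketched below in Python. The slide does not say which distribution or which distance is used, so both choices here are assumptions: each column is summarised by the relative frequency of its digit characters, and two columns are compared with the total variation distance. The function names are hypothetical.

```python
from collections import Counter


def digit_distribution(values):
    """Relative frequency of each digit character in a column (empty if none)."""
    counts = Counter(ch for v in values for ch in str(v) if ch.isdigit())
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()} if total else {}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def rank_join_candidates(left, right):
    """Return (distance, left column, right column) tuples, most similar first."""
    pairs = []
    for left_name, left_values in left.items():
        p = digit_distribution(left_values)
        for right_name, right_values in right.items():
            q = digit_distribution(right_values)
            if p and q:  # skip columns without any digits
                pairs.append((total_variation(p, q), left_name, right_name))
    return sorted(pairs)


silo1 = {
    "FirstName": ["Martin", "Sara", "Anna"],
    "ClientID": ["1028934-1", "7462946-5", "9471991-3"],
    "Premium": ["TRUE", "TRUE", "FALSE"],
}
silo2 = {
    "CID": ["C-9471991", "C-7462946", "C-1028934"],
    "ProductName": ["Monitor LCD", "Mouse Laser", "Keyboard QWERTY"],
    "ProductID": ["6413", "5433", "961"],
}
for distance, left_name, right_name in rank_join_candidates(silo1, silo2):
    print(f"{left_name} ~ {right_name}: {distance:.3f}")
```

On the three rows from the slide, ClientID and CID come out as the closest pair. With so few values the distances are noisy, so a real system would need far larger samples and probably several features per column.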

Entity Resolution & Deduplication

Naive Approach

First Name  Last Name  Premium  City        Country
Hans        Müller     TRUE     Winterthur
Hans        Mueller    1        Winterthur  CH
Jan         Muster     FALSE    Windisch    CH

All weights $w_i$ are the same: $w_i = \{0.2, 0.2, 0.2, 0.2, 0.2\}$.
$s = \sum_i w_i s_i$

Advanced Approach

Cleaned data:

First Name  Last Name  Premium  City        Country
Hans        Müller     TRUE     Winterthur  CH
Hans        Müller     TRUE     Winterthur  CH
Jan         Muster     FALSE    Windisch    CH

De-noising and normalisation help to compare entities.
User feedback is incorporated into the estimate of the weights $\{w_i\}$ using ML.
Adapt the weights $w_i$ using ML and optimise similarity calculations: $w_i = \{0.3, 0.3, 0.1, 0.2, 0.1\}$.
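The weighted score $s = \sum_i w_i s_i$ can be sketched in Python as follows. The per-field similarity $s_i$ is not specified on the slide; this sketch assumes a character-based ratio from the standard library's `difflib.SequenceMatcher`. It also assumes the adapted weights are listed in column order (First Name, Last Name, Premium, City, Country). Learning the weights from user feedback is not shown.

```python
from difflib import SequenceMatcher

FIELDS = ["first_name", "last_name", "premium", "city", "country"]

# Naive approach: every field counts the same.
NAIVE_WEIGHTS = dict(zip(FIELDS, [0.2, 0.2, 0.2, 0.2, 0.2]))
# Adapted weights from the slide, assumed to follow the column order.
ADAPTED_WEIGHTS = dict(zip(FIELDS, [0.3, 0.3, 0.1, 0.2, 0.1]))


def field_similarity(a, b):
    """Similarity s_i in [0, 1] for one field; missing values score 0."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def record_similarity(rec_a, rec_b, weights):
    """Weighted sum s = sum_i w_i * s_i over all fields."""
    return sum(
        w * field_similarity(rec_a.get(f), rec_b.get(f))
        for f, w in weights.items()
    )


raw_a = dict(zip(FIELDS, ["Hans", "Müller", "TRUE", "Winterthur", None]))
raw_b = dict(zip(FIELDS, ["Hans", "Mueller", "1", "Winterthur", "CH"]))
other = dict(zip(FIELDS, ["Jan", "Muster", "FALSE", "Windisch", "CH"]))

print(record_similarity(raw_a, raw_b, NAIVE_WEIGHTS))
print(record_similarity(raw_a, other, NAIVE_WEIGHTS))
print(record_similarity(raw_a, raw_b, ADAPTED_WEIGHTS))
```

On the raw rows, the Premium and Country fields contribute nothing to the Hans Müller / Hans Mueller pair because "TRUE" and "1" share no characters and one country is missing. This is the effect that cleaning and normalising before comparison is meant to remove.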

Example: Deduplication of 1M records

[Figure: accuracy (60 to 100) against training iterations (0 to 5) for two series, "ML & pre-estimated Weights" and "Pure ML Approach"]

Time savings: 3 iterations with 1'000 manual feedback items at 30 sec each are required to achieve the same accuracy. Effort: 3.1 days.

Better out-of-the-box precision using ML and pre-estimated weights.
Start by initialising weights according to the column content.
In some cases, this can even eliminate the need for training altogether.

Tackling Complexity in Deduplication

Clustering

$n = 10^6$
$n^2 \to 10^{12}$
$0.5\,n^2 \to 0.5 \cdot 10^{12}$
$k \cdot n \cdot m + 0.5 \cdot k\,(n/k)^2 \to 10^{10}$ for $k = 10^2$, $m = 50$

n  Number of data records
k  Number of clusters
m  Number of iterations

Better scalability leads to faster execution.
Higher data locality: a "triangle" can run on a single node.
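The arithmetic on this slide can be checked with a few lines of Python. The function names are hypothetical, and the cost model is taken directly from the slide's formulas: a clustering cost of k · n · m plus half a pairwise comparison matrix (one "triangle") inside each of k equally sized clusters.

```python
def all_pairs_cost(n):
    """Comparisons when every record is compared with every other: about 0.5 * n^2."""
    return 0.5 * n ** 2


def clustered_cost(n, k, m):
    """Clustering cost k*n*m plus one comparison triangle per cluster of size n/k."""
    clustering = k * n * m
    within_clusters = 0.5 * k * (n / k) ** 2
    return clustering + within_clusters


n, k, m = 10 ** 6, 10 ** 2, 50
print(f"all pairs: {all_pairs_cost(n):.1e}")
print(f"clustered: {clustered_cost(n, k, m):.1e}")
print(f"reduction: {all_pairs_cost(n) / clustered_cost(n, k, m):.0f}x")
```

With the slide's values both terms of the clustered cost are 5 · 10^9, so the total is 10^10, a factor of 50 below the 0.5 · 10^12 comparisons of the all-pairs approach.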

State-of-the-Art Infrastructure

Map-Reduce style using Apache Spark.
Scalable: runs on a single laptop as well as on a 10k-node cluster.
Programmed in Scala: functional and object-oriented.
Supports streaming, and provides MLlib and GraphX for machine learning and graph algorithms.

Summary

1  Probabilistic methods save precious time
   Decide on the trade-off between fast data integration and precision.
2  Leverage machine learning
   Use business expert feedback to improve system precision and degree of automation.
3  Broad data analysis
   Mine over 100 instead of just 25 data sources.

Wealthport AG
Rütistrasse 16
CH-8952 Schlieren
+41 76 420 67 68
info@wealthport.ch
www.wealthport.ch
Twitter: @wealthport

Join us at www.meetup.com/spark-zurich!

Empowering organisations to unlock their wealth of data
