Building a Big Data DWH
Data Warehousing on Hadoop
Friso van Vollenhoven (@fzk), CTO
frisovanvollenhoven@godatadriven.com
GoDataDriven, proudly part of the Xebia Group
“In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis.” -- Wikipedia
ETL
How to:
• Add a column to the facts table?
• Change the granularity of dates from day to hour?
• Add a dimension based on some aggregation of facts?
Schemas are designed with questions in mind. Changing the schema requires redoing the ETL.
Schemas are designed with questions in mind. Changing the schema requires redoing the ETL.
So: push things to the facts level. Keep all source data available at all times.
And now?
• MPP databases?
• Faster / better / more SAN?
• (RAC?)
metadata + query engine
distributed processing
distributed storage
EXTRACT TRANSFORM LOAD
• No JVM startup overhead for Hadoop API usage
• Relatively concise syntax (Python)
• Mix the Python standard library with any Java libs
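A minimal sketch of what this looks like, assuming Jython with the Hadoop jars on the classpath (the staging path and the date pattern are made up for illustration):

import re  # Python standard library

# Hadoop Java API, importable directly under Jython
from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path

fs = FileSystem.get(Configuration())

# List staged files and keep only the ones carrying a date stamp in the name
for status in fs.listStatus(Path('/staging/clicks')):
    name = status.getPath().getName()
    if re.match(r'\d{4}-\d{2}-\d{2}', name):
        print name  # Jython is Python 2.x, so print is a statement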
• Flexible scheduling with dependencies
• Saves output
• E-mails on errors
• Scales to multiple nodes
• REST API
• Status monitor
• Integrates with version control
Deployment: git push jenkins master
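For illustration, a hedged sketch of driving this from code: Jenkins exposes job triggers over its REST API (the host and job name here are hypothetical):

import urllib2

# POST /job/<name>/build triggers a run of an unparameterized job.
# Passing a (here empty) body makes urllib2 issue a POST instead of a GET.
urllib2.urlopen('http://jenkins.example.com/job/etl-import-clicks/build', '')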
• Scheduling
• Simple deployment of ETL code
• Scalable
• Developer friendly
'februari-22 2013'
A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.
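One way to survive dates like 'februari-22 2013' is a lenient parser that yields None instead of failing the whole job. A sketch (not the deck's actual parser; the month table is simply the Dutch month names):

import re
from datetime import date

DUTCH_MONTHS = {
    'januari': 1, 'februari': 2, 'maart': 3, 'april': 4,
    'mei': 5, 'juni': 6, 'juli': 7, 'augustus': 8,
    'september': 9, 'oktober': 10, 'november': 11, 'december': 12,
}

def parse_dutch_date(text):
    # Matches e.g. 'februari-22 2013'; anything else is a parse failure.
    m = re.match(r"([a-z]+)-(\d{1,2}) (\d{4})$", text.strip().lower())
    if not m or m.group(1) not in DUTCH_MONTHS:
        return None  # count this in your metrics instead of crashing
    return date(int(m.group(3)), DUTCH_MONTHS[m.group(1)], int(m.group(2)))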
þ
TSV == thorn-separated values?
þ == 0xFE
or -2, in Hive:

CREATE TABLE browsers (
  browser_id STRING,
  browser STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '-2';
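The '-2' works because Hive reads the delimiter as a signed byte, and 0xFE interpreted as a signed byte is -2. A quick check, plus a sketch of reading such a file in Python 2 (the file name is hypothetical):

import csv
import struct

# 0xFE as a signed byte is -2, hence TERMINATED BY '-2' in the DDL above
print struct.unpack('b', '\xfe')[0]  # -> -2

# The csv module happily treats thorn as just another one-byte delimiter
with open('browsers.txt', 'rb') as f:
    for browser_id, browser in csv.reader(f, delimiter='\xfe'):
        print browser_id, browser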
• The format will change
• Faulty deliveries will occur
• Your parser will break
• Records will be mistakenly produced (over-logging)
• Other people test in production too (and you get the data from it)
• Etc., etc.
• Simple deployment of ETL code
• Scheduling
• Scalable
• Independent jobs
• Fixable data store
• Incremental where possible
• Metrics
Independent jobs

source (external)
  ↓ HDFS upload + move in place
staging (HDFS)
  ↓ MapReduce + HDFS move
hive-staging (HDFS)
  ↓ map external table + SELECT INTO
Hive
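A sketch of the last hop, with hypothetical table and path names: map the hive-staging directory as an external table, then select the data into the warehouse table (in Hive terms, INSERT INTO ... SELECT):

import subprocess

hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS clicks_staging (
  ts STRING, url STRING, browser_id STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/hive-staging/clicks/2013-02-22';

INSERT INTO TABLE clicks PARTITION (dt='2013-02-22')
SELECT ts, url, browser_id FROM clicks_staging;
"""
# The doubled backslash keeps a literal \t for Hive to unescape.
# The Hive CLI runs multiple ;-separated statements passed via -e.
subprocess.check_call(['hive', '-e', hiveql])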
Out of order jobs
• At any point, you don’t really know what ‘made it’ to Hive
• Will happen anyway, because some days the data delivery is going to be three hours late
• Or you get half in the morning and the other half later in the day
• It really depends on what you do with the data
• This is where metrics + fixable data store help...
Fixable data store
• Using Hive partitions
• Jobs that move data from staging create partitions
• When new data / insight about the data arrives, drop the partition and re-insert
• Be careful to reset any metrics in this case
• Basically: instead of trying to make everything transactional, repair afterwards
• Use metrics to determine whether data is fit for purpose
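In HiveQL, the repair amounts to dropping the partition and re-inserting it; a sketch with the same hypothetical names as before:

import subprocess

hiveql = """
ALTER TABLE clicks DROP IF EXISTS PARTITION (dt='2013-02-22');

INSERT INTO TABLE clicks PARTITION (dt='2013-02-22')
SELECT ts, url, browser_id FROM clicks_staging;
"""
# Remember to also reset any metrics recorded for this partition
subprocess.check_call(['hive', '-e', hiveql])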
Metrics
Metrics service
• Job ran, so many units processed, took so much time
• e.g. 10GB imported, took 1 hr
• e.g. 60M records transformed, took 10 minutes
• Dropped partition
• Inserted X records into partition
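What reporting to such a service could look like, as a hedged sketch (the endpoint and field names are made up):

import json
import urllib2

metric = {
    'job': 'import-clicks',  # hypothetical job name
    'records': 60000000,     # e.g. 60M records transformed...
    'seconds': 600,          # ...took 10 minutes
}
req = urllib2.Request('http://metrics.example.com/runs',
                      json.dumps(metric),
                      {'Content-Type': 'application/json'})
urllib2.urlopen(req)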
GoDataDriven
We’re hiring / Questions? / Thank you!
Friso van Vollenhoven (@fzk), CTO
frisovanvollenhoven@godatadriven.com