Building Scalable Big Data Pipelines NOSQL SEARCH ROADSHOW ZURICH

Building Scalable Big Data Pipelines
NOSQL SEARCH ROADSHOW ZURICH
Christian Gügi, Solution Architect
19.09.2013

AGENDA
• Opportunities & Challenges
• Integrating Hadoop
• Lambda Architecture
• Lambda in Practice
• Recommendations

ABOUT ME
• Solution Architect @ YMC
• Founder and organizer, Swiss Big Data User Group
  http://www.bigdata-usergroup.ch/
• Contact
  christian.guegi@ymc.ch
  http://about.me/cguegi
  @chrisgugi

ABOUT YMC
• Founded in 2001
• Based in Kreuzlingen, Switzerland
• Big Data Analytics, Web Solutions and Mobile Applications
• 24 experts
• Consulting, creation, engineering

OPPORTUNITIES & CHALLENGES

BIG DATA – WHAT IS THE BIG DEAL?
A. New sources and types from inside & outside organisations
• “Internet of things”, sensors, RFID, intelligent devices, etc.
• Unstructured information – documents, web logs, email, social media, etc.
• Trusted 3rd-party sources – industry providers & aggregators, government “Open Data”, weather, etc.
B. Technology innovations to exploit new world of data
• Low-cost storage and processing power (cloud, on-premise & hybrid)
• New software patterns to handle speed & volume, structured and unstructured (in-memory computation, Hadoop, MapReduce, etc.)
• Revolution in user experience, analytics, recommendations

BIG DATA – CHALLENGES
Character of data
• Volume
• Velocity
• Variety
• Veracity
Overwhelming landscape & integration
Organisational issues
• Align business and strategy
• Data management
• Privacy protection
Available talent
• Lack of skilled and experienced people

INTEGRATING HADOOP

TYPICAL RDBMS SCENARIO
[Diagram. Labels: web apps, mobile apps, BI systems; DWH (RDBMS); ETL; data sources: RDBMS, NFS, others]

BIG DATA SCENARIO
[Diagram. Labels: web apps, mobile apps, BI systems; DWH (RDBMS); Hadoop; data sources: RDBMS, NFS, logs, sensors, social media. Footnote 1): Recommendations, etc.]

HADOOP ECOSYSTEM

LAMBDA ARCHITECTURE

LAMBDA ARCHITECTURE
• Credits: Nathan Marz
• Former engineer at Twitter
• Storm, Cascalog, ElephantDB
http://www.manning.com/marz/

DESIGN PRINCIPLES
Lambda Architecture
• Human fault-tolerance
• Data immutability
• Re-computation

HUMAN FAULT-TOLERANCE
• Design for human error
  • Bugs in code
  • Accidental data loss
  • Data corruption
• Protect good data, so you can always fix what went wrong

DATA IMMUTABILITY
• Store data in its rawest form
• Create and read, but no update
• No data can be lost
• To fix the system, just delete bad data
• Can always revert to a true state

DATA IMMUTABILITY
Capturing change traditionally (mutability)
Name   Location      Name   Location
Alice  Zurich        Alice  Basel
Bob    Lucerne       Bob    Lucerne
Tom    Bern          Tom    Bern
Capturing change (immutability)
Name   Location  Time            Name   Location  Time
Alice  Zurich    2009/03/29      Alice  Zurich    2009/03/29
Bob    Lucerne   2012/04/12      Bob    Lucerne   2012/04/12
Tom    Bern      2010/04/09      Tom    Bern      2010/04/09
                                 Alice  Basel     2013/08/20
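
The immutable variant of the table can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `LocationFact`, `record_move` and `current_locations` are hypothetical and do not come from the slides. The idea shown is that a move is appended as a new timestamped fact, and the "current" table is derived from the log instead of being updated in place.

```python
from datetime import date
from typing import NamedTuple


class LocationFact(NamedTuple):
    """One immutable, timestamped fact (hypothetical name for this sketch)."""
    name: str
    location: str
    observed: date


# The facts from the slide, before Alice moves.
facts = [
    LocationFact("Alice", "Zurich", date(2009, 3, 29)),
    LocationFact("Bob", "Lucerne", date(2012, 4, 12)),
    LocationFact("Tom", "Bern", date(2010, 4, 9)),
]


def record_move(log, name, location, observed):
    """Return a new log with one more fact; existing facts are never modified."""
    return log + [LocationFact(name, location, observed)]


def current_locations(log, as_of=None):
    """Derive the latest known location per person, optionally as of a past date."""
    latest = {}
    for fact in sorted(log, key=lambda f: f.observed):
        if as_of is None or fact.observed <= as_of:
            latest[fact.name] = fact.location
    return latest


facts_after_move = record_move(facts, "Alice", "Basel", date(2013, 8, 20))
print(current_locations(facts_after_move))
print(current_locations(facts_after_move, as_of=date(2012, 12, 31)))
```

Because the old fact about Zurich is still in the log, the state at any earlier date can be reconstructed, which is what "can always revert to a true state" relies on.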

RE-COMPUTATION
• Always able to re-compute from historical data
• Basis for all data systems
• query = function(all data)
[Diagram: All Data -> Pre-computed views -> Query]
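
A minimal sketch of "query = function(all data)" with a pre-computed view in between. The event fields, campaign names and function names below are assumptions made up for illustration; the slides do not define them. The point is that the view is a pure function of the full data set, so it can be thrown away and rebuilt at any time.

```python
from collections import Counter

# Hypothetical raw events; in the deck's use case these would be tracking logs.
all_data = [
    {"campaign": "spring", "event": "click"},
    {"campaign": "spring", "event": "view"},
    {"campaign": "summer", "event": "click"},
    {"campaign": "spring", "event": "click"},
]


def clicks_per_campaign(data):
    """query = function(all data): the slow definition that scans everything."""
    return Counter(e["campaign"] for e in data if e["event"] == "click")


def build_view(data):
    """Batch step: pre-compute the result once so later queries avoid a full scan."""
    return dict(clicks_per_campaign(data))


def query(view, campaign):
    """Answer a query from the pre-computed view only."""
    return view.get(campaign, 0)


view = build_view(all_data)
print(query(view, "spring"))

# Re-computation: after new data arrives (or a bug is fixed), rebuild from scratch.
rebuilt_view = build_view(all_data + [{"campaign": "summer", "event": "click"}])
print(query(rebuilt_view, "summer"))
```

Rebuilding from scratch is slower than updating the view incrementally, but it keeps the view free of accumulated errors.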

LAYERS
http://www.ymc.ch/en/lambda-architecture-part-1

Lambda in Practice

ONLINE MARKETING
• Tracking and analytics solution
• Improve customer targeting and segmentation
• Various reports
• Real-time not required

OVERVIEW
[Diagram. Labels: AdServer, Flume (log); Campaign Database, Sqoop (csv); FTP, fs -put (csv); Web, Up- & Download; HDFS with Hive, Impala, Pig, HBase; Aggregated Data, DWH, BI apps; Cloudera Manager, Oozie, ZooKeeper]

DATA PIPELINE
[Diagram. Stages: Extracting, Transformation, Loading. Labels: AdServer, Flume (log); Campaign Database, Sqoop (csv); FTP, fs -put (csv); HDFS; M/R jobs producing Avro; Tracking, Profiles, Bulk Importer; DWH]

ADVANTAGES
• Extensible – easily add speed layer later on
• Complements existing DWH/BI system
• ETL phases are decoupled
• Reliable
  • Infrastructure
  • Each step can be replayed
• Scalable
  • Storage
  • Processing
• Highly available
• Ad-hoc analysis right from the beginning
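
The "decoupled phases, each step can be replayed" point can be sketched as three functions that only talk to each other through files. This is a local stand-in under assumptions, not the deck's Flume/Sqoop/MapReduce pipeline: the file names, the CSV columns and the temporary staging directory are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def extract(raw_csv: str, staging: Path) -> Path:
    """Extract: land the raw input unchanged so later steps can be re-run from it."""
    target = staging / "raw.csv"
    target.write_text(raw_csv, encoding="utf-8")
    return target


def transform(raw_file: Path, staging: Path) -> Path:
    """Transform: read the landed file, keep click events, write JSON lines."""
    target = staging / "clicks.jsonl"
    with raw_file.open(newline="", encoding="utf-8") as src, \
            target.open("w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            if row["event"] == "click":
                dst.write(json.dumps(row, sort_keys=True) + "\n")
    return target


def load(clean_file: Path) -> dict:
    """Load: aggregate clicks per campaign (stand-in for loading into the DWH)."""
    totals = {}
    with clean_file.open(encoding="utf-8") as src:
        for line in src:
            record = json.loads(line)
            totals[record["campaign"]] = totals.get(record["campaign"], 0) + 1
    return totals


RAW = "campaign,event\nspring,click\nspring,view\nsummer,click\nspring,click\n"

staging = Path(tempfile.mkdtemp())
raw_file = extract(RAW, staging)
clean_file = transform(raw_file, staging)
totals = load(clean_file)

# Replay: re-running transform and load from the landed raw file gives the same result.
replayed = load(transform(raw_file, staging))
print(totals, replayed)
```

Since each step overwrites its own output and never touches its input, a failed or buggy step can be fixed and re-run without repeating the extraction.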

RECOMMENDATIONS

RECOMMENDATIONS
• Not a fixed, one-size-fits-all approach
  • Adapt to your needs/requirements
  • Hadoop complements existing systems
  • How real-time do I need to be?
• Immutability and pre-computation are just good ideas!
  • Store information in rawest format possible
  • Use a serialization framework (Avro, Thrift, Protocol Buffers)
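
What a serialization framework buys you is an enforced schema on every record that is written or read. The sketch below is a standard-library stand-in for that idea, not Avro, Thrift or Protocol Buffers, and it uses none of their APIs. The schema fields and the `serialize`/`deserialize` helpers are hypothetical.

```python
import json

# Hypothetical schema for a tracking event: field name -> required Python type.
CLICK_SCHEMA = {"campaign": str, "user_id": str, "timestamp": int}


def serialize(record: dict, schema: dict) -> bytes:
    """Validate a record against the schema, then encode it as UTF-8 JSON."""
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing fields {sorted(missing)}, unexpected fields {sorted(extra)}")
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise TypeError(f"{field} must be {expected.__name__}")
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes, schema: dict) -> dict:
    """Decode a payload and reject it if it does not match the schema."""
    record = json.loads(payload.decode("utf-8"))
    serialize(record, schema)  # reuse the validation; the encoded result is discarded
    return record


good = {"campaign": "spring", "user_id": "u1", "timestamp": 1379548800}
payload = serialize(good, CLICK_SCHEMA)
print(deserialize(payload, CLICK_SCHEMA))
```

A real framework adds compact binary encoding and schema evolution on top of this; the sketch only shows the validation step that keeps malformed records out of the raw data store.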

THANK YOU!

CONTACT US
christian.guegi@ymc.ch
Tel. +41 (0)71 508 24 76
www.ymc.ch
@chrisgugi
YMC AG
Sonnenstrasse 4
CH-8280 Kreuzlingen
Switzerland
Photo Credits:
Slide 05: Success opportunity achieve by Stephen McCulloch
Slide 08: Matrix by Gamaliel Espinoza Macedo
Slide 12: Layers by Katelyn Leblanc
Slide 20: Mining For Information by JD Hancock
Slide 27: Warning Question by longzijun
