Data Engineering and Streaming Analytics
Welcome and Housekeeping
● You should have received instructions on how to participate in the training session
● If you have questions, you can use the Q&A window in GoToWebinar
● The recording of the session will be made available after the event
About Your Instructor
Doug Bateman is Director of Training and Education at Databricks. Prior to this role, he was Director of Training at NewCircle.
Apache Spark - Genesis and Open Source
Spark was originally created at the AMPLab at UC Berkeley, and its original creators went on to found Databricks. Spark was created to bring data and machine learning together. It was donated to the Apache Software Foundation to create the Apache Spark open source project.
VISION: Accelerate innovation by unifying data science, engineering and business
SOLUTION: Unified Analytics Platform
WHO WE ARE: Original creators of Apache Spark; 2000+ global companies use our platform across the big data & machine learning lifecycle
Apache Spark: The 1st Unified Analytics Engine
Uniquely combined Data & AI technologies
● Runtime: Spark Core Engine + Delta
● Big Data Processing: ETL + SQL + Streaming
● Machine Learning: MLlib + SparkR
Introducing Delta Lake
A New Standard for Building Data Lakes
● Open format based on Parquet
● With transactions
● Apache Spark APIs
Apache Spark - A Unified Analytics Engine
Apache Spark
“Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”
● Began as a research project at UC Berkeley in 2009
● APIs: Scala, Java, Python, R, and SQL
● Built by more than 1,200 developers from more than 200 companies
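To make the “unified engine” point concrete, here is a minimal PySpark sketch (not from the original slides): the same question is asked through the DataFrame API and through SQL, and both run on the same engine. The app name and sample data are illustrative only.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on Databricks a session already exists as `spark`.
spark = SparkSession.builder.appName("unified-apis-demo").getOrCreate()

# A tiny DataFrame used only for illustration.
people = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Same query two ways: the DataFrame API and SQL share one engine and optimizer.
people.filter(people.age > 30).select("name").show()

people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()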
HOW TO PROCESS LOTS OF DATA?
M&Ms
Spark Cluster
One Driver and many Executor JVMs
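As a rough illustration of that split (a sketch, not taken from the slides): transformations on a partitioned dataset run in the executor JVMs, while the driver plans the job and collects only the small final result. The row count and app name below are arbitrary.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cluster-roles-demo").getOrCreate()

# The driver only builds a query plan here; transformations are lazy, so no data moves yet.
df = spark.range(0, 10_000_000)
print("partitions handled by executors:", df.rdd.getNumPartitions())

# The action triggers distributed work on the executors; only the aggregated
# result is sent back to the driver.
total = df.selectExpr("sum(id) AS total").collect()[0]["total"]
print(total)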
Data Lakes - A Key Enabler of Analytics
The data lake feeds Data Science and ML workloads:
● Recommendation Engines
● Risk, Fraud, & Intrusion Detection
● Customer Analytics
● IoT & Predictive Maintenance
● Genomics & DNA Sequencing
Data Lake Challenges
● >65% of big data projects fail (per Gartner)
● Unreliable, low quality data
● Slow performance
These problems put the same Data Science and ML workloads (recommendations, fraud detection, customer analytics, IoT, genomics) at risk.
1. Data Reliability Challenges
● Failed production jobs leave data in a corrupt state, requiring tedious recovery
● Lack of schema enforcement creates inconsistent and low quality data
● Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming
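To illustrate the schema-enforcement gap on a plain Parquet data lake, here is a hypothetical sketch (paths and values are made up): two jobs append rows with conflicting types to the same directory, and nothing stops them, so downstream readers inherit the inconsistency.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-drift-demo").getOrCreate()

# First job writes event_date as a string.
spark.createDataFrame([(1, "2019-01-01")], ["id", "event_date"]) \
    .write.mode("append").format("parquet").save("/tmp/demo/events_parquet")

# A later job writes event_date as a number. The append succeeds anyway --
# plain Parquet directories have no table-level schema to enforce --
# and readers are now faced with files whose schemas do not agree.
spark.createDataFrame([(2, 20190102)], ["id", "event_date"]) \
    .write.mode("append").format("parquet").save("/tmp/demo/events_parquet")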
2. Performance Challenges
● Too many small or very big files - more time is spent opening & closing files than reading their contents (worse with streaming)
● Partitioning, aka “poor man’s indexing”, breaks down if you picked the wrong fields or when data has many dimensions or high cardinality columns
● No caching - cloud storage throughput is low (S3 is 20-50 MB/s/core vs 300 MB/s/core for local SSDs)
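One common hand-rolled mitigation for the small-files problem, sketched below with made-up paths, is to coalesce a job’s output into fewer, larger files before writing; Delta’s OPTIMIZE command (shown later) automates this kind of compaction.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-files-demo").getOrCreate()

events = spark.read.format("parquet").load("/tmp/demo/events_parquet")

# Without intervention, each task writes its own file, so a job with many small
# tasks produces many small files. Coalescing first bounds the output file count.
events.coalesce(8).write.mode("overwrite").format("parquet").save("/tmp/demo/events_compacted")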
Databricks Delta
Next-generation engine built on top of Spark, with co-designed compute & storage
● Compatible with Spark APIs
● Built on open standards (Parquet)
● Transactional versioned log with indexes & stats over Parquet files
● Leverages your cloud blob storage
Delta Makes Data Reliable
Updates/Deletes, Streaming, and Batch writes all land in a Delta Table (a transactional versioned log over Parquet files), keeping data reliable and always ready for analytics.
Key Features:
● ACID Transactions
● Schema Enforcement
● Upserts
● Data Versioning
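A minimal sketch of those guarantees in PySpark, assuming a Databricks runtime or a local session configured with the delta-spark package (paths and data are illustrative): appends are recorded as transactions, and a write whose schema conflicts with the table is rejected instead of silently corrupting it.

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("delta-reliability-demo").getOrCreate()

# Each append is an ACID transaction recorded in the Delta log.
spark.createDataFrame([(1, "2019-01-01")], ["id", "event_date"]) \
    .write.format("delta").mode("append").save("/tmp/demo/events_delta")

# Schema enforcement: a write with a conflicting type fails fast.
try:
    spark.createDataFrame([(2, 20190102)], ["id", "event_date"]) \
        .write.format("delta").mode("append").save("/tmp/demo/events_delta")
except AnalysisException as e:
    print("rejected by schema enforcement:", e)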
Delta Makes Data More Performant
Delta Engine adds I/O & query optimizations behind open Spark APIs, giving fast, highly responsive queries at scale on the Delta Table (transactional versioned log + Parquet files).
Key Features:
● Compaction
● Data skipping
● Caching
● Z-ordering
Get Started with Delta using Spark APIs
Instead of parquet ...
CREATE TABLE ... USING parquet
dataframe.write.format("parquet").save("/data")
… simply say delta
CREATE TABLE ... USING delta
dataframe.write.format("delta").save("/data")
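As a quick, hypothetical end-to-end sketch of that switch (the table name and path are illustrative, and a Delta-enabled Spark session is assumed): write a DataFrame in Delta format, then read it back by path or register it as a table for SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-quickstart-demo").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# The only change from a Parquet pipeline is the format string.
df.write.format("delta").mode("overwrite").save("/tmp/demo/people_delta")

# Read it back by path, or register it as a table for SQL access.
spark.read.format("delta").load("/tmp/demo/people_delta").show()
spark.sql("CREATE TABLE IF NOT EXISTS people USING delta LOCATION '/tmp/demo/people_delta'")
spark.sql("SELECT * FROM people").show()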
Using Delta with your Existing Parquet Tables
Step 1: Convert Parquet to Delta Tables
CONVERT TO DELTA parquet.`path/to/table` [NO STATISTICS]
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]
Step 2: Optimize Layout for Fast Queries
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
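The same conversion can be driven from Python; the snippet below is a sketch that uses the delta-spark DeltaTable helper on a hypothetical path, and reissues the slide’s OPTIMIZE statement through spark.sql (assuming `events` is a registered Delta table partitioned by date, as the slide implies).

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("convert-to-delta-demo").getOrCreate()

# Step 1: convert an existing Parquet directory to Delta in place.
DeltaTable.convertToDelta(spark, "parquet.`/tmp/demo/events_parquet`")

# Step 2: compact and Z-order recent data, exactly as in the slide's SQL.
spark.sql("OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)")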
Upsert/Merge: Fine-grained Updates
MERGE INTO customers -- Delta table
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
  UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
  INSERT (customerId, address) VALUES (updates.customerId, updates.address)
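The same upsert can be expressed with the DeltaTable Python API; this is a sketch mirroring the SQL above, assuming a registered Delta table named customers and an updates DataFrame (both hypothetical here).

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-demo").getOrCreate()

# Incoming changes; in practice this would come from a stream or an upstream job.
updates = spark.createDataFrame([(42, "1 Main St")], ["customerId", "address"])

customers = DeltaTable.forName(spark, "customers")

(customers.alias("customers")
    .merge(updates.alias("updates"), "customers.customerId = updates.customerId")
    .whenMatchedUpdate(set={"address": "updates.address"})
    .whenNotMatchedInsert(values={
        "customerId": "updates.customerId",
        "address": "updates.address"})
    .execute())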
Time Travel
Reproduce experiments & reports:
SELECT count(*) FROM events TIMESTAMP AS OF timestamp
SELECT count(*) FROM events VERSION AS OF version
Rollback accidental bad writes:
INSERT INTO my_table SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
Spark API:
spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
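A complementary sketch in Python (the path and version number are placeholders): read a specific version of a Delta table, and inspect its history to find candidate versions or timestamps for time travel.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel-demo").getOrCreate()

# Read the table as of an explicit version instead of a timestamp.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/events/")
print(v0.count())

# The transaction log records every version with its timestamp and operation.
spark.sql("DESCRIBE HISTORY delta.`/events/`").show(truncate=False)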
Apple: Threat Detection at Scale with Delta (Keynote Talk)
Detect signal across user, application and network logs; quickly analyze the blast radius with ad hoc queries; respond quickly in an automated fashion; scale across petabytes of data and 100s of security analysts.
● >100 TB of new data/day, >300B events/day
● Pipeline: Streaming ingestion → Refinement with Databricks Delta → Alerts, Data Science, and Machine Learning
BEFORE DELTA:
● Took 20 engineers, 24 weeks to build
● Only able to analyze a 2-week window of data
WITH DELTA:
● Took 2 engineers, 2 weeks to build
● Analyze 2 years of batch data together with streaming data
Spark References
● Databricks
● Apache Spark ML Programming Guide
● Scala API Docs
● Python API Docs
● Spark Key Terms
Questions?
Further Training Options: http://bit.ly/DBTrng
● Live Onsite Training
● Live Online
● Self Paced
Meet one of our Spark experts: http://bit.ly/ContactUsDB