
I ♥ Logs
Apache Kafka, Stream Processing, and Real-time Data
Jay Kreps

The Plan
1. What is Data Integration?
2. What is Apache Kafka?
3. Logs and Distributed Systems
4. Logs and Data Integration
5. Logs and Stream Processing

Data Integration

Maslow’s Hierarchy
• Self-Actualization
• Esteem
• Love & Belonging
• Safety
• Physiological

For Data
• Automation
• Understanding
• Semantics
• Acquisition/Collection

New Types of Data
• Database data
  – Users, products, orders, etc.
• Events
  – Clicks, impressions, pageviews, etc.
• Application metrics
  – CPU usage, requests/sec
• Application logs
  – Service calls, errors

New Types of Systems
• Live Stores
  – Voldemort
  – Espresso
  – Graph
  – OLAP
  – Search
  – InGraphs
• Offline
  – Hadoop
  – Teradata

Bad
[Diagram: source systems (Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics) and destination systems (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ..., Production Services), with no central log between them]

Good
[Diagram: the same source systems (Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics) and destination systems (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ..., Production Services), now with a single Log in the middle]

Apache Kafka
[Diagram: many producers, one kafka cluster, many consumers]

A brief history of Kafka

Three design principles
1. One pipeline to rule them all
2. Stream processing >> messaging
3. Clusters not servers

Characteristics
• Scalability of a filesystem
  – Hundreds of MB/sec/server throughput
  – Many TB per server
• Guarantees of a database
  – Messages strictly ordered
  – All data persistent
• Distributed by default
  – Replication
  – Partitioning model

Kafka At LinkedIn
• 175 TB of in-flight log data per colo
• Low-latency: ~1.5 ms
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration

Kafka is about logs

What is a log?

[Diagram: a log drawn as a row of records numbered 0 to 12, labeled "1st Record" and "Next Record Written"]
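The log in the diagram above can be sketched as an append-only list in which every record is identified by its position. This is a minimal in-memory illustration, not Kafka's implementation; the class name `Log` and its methods are hypothetical.

```python
class Log:
    """Append-only sequence of records; each record is identified by its offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Write the next record at the end and return the offset it was given."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Return the record stored at `offset`; records are never modified."""
        if not 0 <= offset < len(self._records):
            raise IndexError(f"no record at offset {offset}")
        return self._records[offset]

    def next_offset(self):
        """Offset that the next appended record will receive."""
        return len(self._records)


log = Log()
first = log.append("1st record")   # the first record gets offset 0
for i in range(1, 13):             # fill offsets 1..12, as in the diagram
    log.append(f"record {i}")
```

Because records are only ever added at the end, an offset doubles as a stable name for a record and as a description of how much of the log a reader has seen.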

Partitioning
[Diagram: a log split into three partitions (Partition 0, Partition 1, Partition 2), each with its own sequence of record numbers starting at 0]
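One way to picture partitioning is to route each record to a partition by hashing a key, so that every partition keeps its own independent numbering. The sketch below assumes three partitions and uses a CRC32 hash purely for illustration; it is not Kafka's actual partitioner, and `partition_for` and `append` are hypothetical names.

```python
import zlib

NUM_PARTITIONS = 3  # assumed, to match the three partitions in the diagram
partitions = [[] for _ in range(NUM_PARTITIONS)]


def partition_for(key: str) -> int:
    """Map a key to a partition with a stable hash (illustrative choice only)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def append(key: str, value: str):
    """Append to the key's partition; return (partition, offset within it)."""
    p = partition_for(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1


p1, o1 = append("user-1", "click")
p2, o2 = append("user-1", "pageview")
p3, o3 = append("user-2", "click")
```

Records with the same key land in the same partition and stay in the order they were written, while ordering across partitions is not defined.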

Logs: pub/sub done right
[Diagram: a Data Source writes to a Log with records 0 to 12; Destination System A reads (time = 7) and Destination System B reads (time = 11)]
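The idea in this slide, that each destination reads the same log at its own pace, can be sketched by giving every consumer its own position. Treating the slide's "time" as a position in the log is an assumption made for this example, and the `Consumer` class is hypothetical.

```python
# A log with offsets 0..12, as in the diagram.
log = [f"record {i}" for i in range(13)]


class Consumer:
    """A reader that tracks its own position; the log holds no per-reader state."""

    def __init__(self, name, position=0):
        self.name = name
        self.position = position  # offset of the next record to read

    def poll(self, log, max_records=1):
        """Return up to `max_records` records and advance this reader's position."""
        batch = log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


system_a = Consumer("Destination System A", position=7)
system_b = Consumer("Destination System B", position=11)

a_batch = system_a.poll(log, max_records=2)
b_batch = system_b.poll(log, max_records=5)  # only two records remain for B
```

Reading does not remove anything from the log, so a slow consumer does not hold back a fast one, and a new consumer can start from any retained position.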

Logs And Distributed Systems

Example: A Fault-tolerant CEO Hash Table

Operations
PUT('microsoft', 'bill gates')
PUT('apple', 'steve jobs')
PUT('microsoft', 'steve ballmer')
PUT('google', 'larry page')
PUT('yahoo', 'terry semel')
PUT('google', 'eric schmidt')
PUT('yahoo', 'jerry yang')
PUT('yahoo', 'carol bartz')
PUT('apple', 'tim cook')
PUT('google', 'larry page')
PUT('yahoo', 'scott thompson')
PUT('yahoo', 'marissa mayer')
PUT('microsoft', 'satya nadella')
Final State
{
'microsoft': 'satya nadella',
'apple': 'tim cook',
'google': 'larry page',
'yahoo': 'marissa mayer'
}
Replica 1, Replica 2

0 PUT(microsoft, bill gates)
1 PUT(apple, steve jobs)
2 PUT(microsoft, steve ballmer)
3 PUT(google, larry page)
4 PUT(yahoo, terry semel)
5 PUT(google, eric schmidt)
6 PUT(yahoo, jerry yang)
7 PUT(yahoo, carol bartz)
8 PUT(apple, tim cook)
9 PUT(google, larry page)
10 PUT(yahoo, scott thompson)
11 PUT(yahoo, marissa mayer)
12 PUT(microsoft, satya nadella)
Replica 1 (offset=10)
Replica 2 (offset=12)
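The hash table example can be sketched by replaying the log of PUT operations into an empty dictionary. The log entries are taken from the slide; the helper name `replay` is hypothetical, and reading "offset=10" as "the next entry to apply is entry 10" is an assumption, since the slide does not say whether the offset is inclusive.

```python
# The log of PUT operations from the slide, in offset order (0..12).
LOG = [
    ("microsoft", "bill gates"),
    ("apple", "steve jobs"),
    ("microsoft", "steve ballmer"),
    ("google", "larry page"),
    ("yahoo", "terry semel"),
    ("google", "eric schmidt"),
    ("yahoo", "jerry yang"),
    ("yahoo", "carol bartz"),
    ("apple", "tim cook"),
    ("google", "larry page"),
    ("yahoo", "scott thompson"),
    ("yahoo", "marissa mayer"),
    ("microsoft", "satya nadella"),
]


def replay(log, upto):
    """Build a table by applying log entries 0 .. upto-1 in order."""
    state = {}
    for key, value in log[:upto]:
        state[key] = value  # a later PUT for the same key overwrites an earlier one
    return state


replica_1 = replay(LOG, 10)        # assumed meaning of offset=10
replica_2 = replay(LOG, 12)        # assumed meaning of offset=12
caught_up = replay(LOG, len(LOG))  # every entry applied
```

Two replicas that have applied the same prefix of the log hold identical tables, so a replica's offset fully describes its state, and a lagging or restarted replica recovers by continuing from where it stopped.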

Two System Design Styles
State-machine Replication and Primary-backup
[Diagram labels: Reads, Writes, Requests, Master, Slave, Slave, state changes, Peer, Peer, Peer, The Log (one per style)]
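The difference between the two styles can be sketched with a single integer as the replicated state. In the state-machine style the log records the incoming requests and every peer executes them; in the primary-backup style the master executes each request and the log records the resulting state change. The request format and the function `apply_request` are hypothetical, chosen only to make the contrast concrete.

```python
def apply_request(value, request):
    """Deterministically apply one request to the current value."""
    op, arg = request
    if op == "add":
        return value + arg
    if op == "mul":
        return value * arg
    raise ValueError(f"unknown operation: {op}")


requests = [("add", 5), ("mul", 3), ("add", 1)]

# State-machine replication: the log holds the requests themselves,
# and every peer executes each request in log order.
request_log = list(requests)
peers = [0, 0, 0]
for request in request_log:
    peers = [apply_request(value, request) for value in peers]

# Primary-backup: the master executes each request and logs the resulting
# state; the slaves apply those state changes without executing anything.
master = 0
state_change_log = []
for request in requests:
    master = apply_request(master, request)
    state_change_log.append(master)

slaves = [0, 0]
for new_value in state_change_log:
    slaves = [new_value for _ in slaves]
```

Both styles depend on the same property: every replica consumes one log in one order, so all replicas converge on the same state.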

Example: User views job
[Diagram: the Jobs Frontend publishes Job Views to Kafka; Job Views are consumed by Job Poster Analytics, Rec. Engine, Hadoop, Security, and Monitoring]

It’s all one big distributed system
[Diagram: source systems (Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics) and destination systems (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ..., Production Services), with a single Log in the middle]

Comparing Data Transfer Mechanisms

Stream Processing

Stream Processing = Logs + Jobs
[Diagram: Log A, Log B, Log C, Log D, Log E, and Log F connected through Job 1, Job 2, and Job 3]
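The "logs + jobs" picture can be sketched as a job that reads an input log from a saved position, transforms each record, and appends the results to an output log. The wiring below (Job 1 reading Log A and writing Log D, Job 3 reading Log D and writing Log F) is an assumption for illustration, since the diagram's edges did not survive, and `run_job`, `job_1`, and `job_3` are hypothetical names.

```python
def run_job(input_log, transform, output_log, start=0):
    """Process input_log from `start`; return the position to resume from."""
    position = start
    while position < len(input_log):
        result = transform(input_log[position])
        if result is not None:  # None means the record is filtered out
            output_log.append(result)
        position += 1
    return position


def job_1(event):
    """Keep only job-view events."""
    return event if event["page"].startswith("/jobs/") else None


def job_3(event):
    """Extract the job id from a job-view event."""
    return event["page"].rsplit("/", 1)[-1]


log_a = [{"page": "/jobs/1"}, {"page": "/home"}, {"page": "/jobs/2"}]
log_d, log_f = [], []

pos_1 = run_job(log_a, job_1, log_d)
pos_3 = run_job(log_d, job_3, log_f)

# New input arrives later; each job resumes from its saved position.
log_a.append({"page": "/jobs/7"})
pos_1 = run_job(log_a, job_1, log_d, start=pos_1)
pos_3 = run_job(log_d, job_3, log_f, start=pos_3)
```

Running a job once over a log that never grows, starting at position 0, is an ordinary batch job; the streaming case differs only in that the job keeps its position and is run again as the input grows.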

Stream processing is a generalization of batch processing

Examples
• Monitoring
• Security
• Content processing
• Recommendations
• Newsfeed
• ETL

Systems Can Help

Samza Architecture
[Diagram: four Jobs on top of Samza, with Kafka and YARN underneath]

Log-centric Architecture
[Diagram labels: Log; Graph DB, Key-Value Store, Search, OLAP, Etc; Query Layer (twice); Monitoring & Graphs; Stream Processing; Hadoop]

Kafka: http://kafka.apache.org
Samza: http://samza.incubator.apache.org
Log Blog: http://linkd.in/199iMwY
Me: http://www.linkedin.com/in/jaykreps, @jaykreps
