I ♥ Logs
Apache Kafka, Stream Processing, and Real-time Data
Jay Kreps
The Plan
1. What is Data Integration?
2. What is Apache Kafka?
3. Logs and Distributed Systems
4. Logs and Data Integration
5. Logs and Stream Processing
Data Integration
Maslow’s Hierarchy (top to bottom): Self-Actualization, Esteem, Love & Belonging, Safety, Physiological
For Data (top to bottom): Automation, Understanding, Semantics, Acquisition/Collection
New Types of Data
• Database data
  – Users, products, orders, etc.
• Events
  – Clicks, impressions, pageviews, etc.
• Application metrics
  – CPU usage, requests/sec
• Application logs
  – Service calls, errors
New Types of Systems
• Live Stores
  – Voldemort
  – Espresso
  – Graph
  – OLAP
  – Search
  – InGraphs
• Offline
  – Hadoop
  – Teradata
Bad
[Diagram: data sources (several Espresso, Voldemort, and Oracle instances, User Tracking, Operational Logs, Operational Metrics) are each wired directly to the destinations (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ..., Production Services).]
Good
[Diagram: the same data sources (several Espresso, Voldemort, and Oracle instances, User Tracking, Operational Logs, Operational Metrics) all write to a central Log, and the destinations (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ..., Production Services) all read from it.]
Apache Kafka
[Diagram: many producers write to a Kafka cluster, and many consumers read from it.]
A brief history of Kafka
Three design principles
1. One pipeline to rule them all
2. Stream processing >> messaging
3. Clusters not servers
Characteristics
• Scalability of a filesystem
  – Hundreds of MB/sec/server throughput
  – Many TB per server
• Guarantees of a database
  – Messages strictly ordered
  – All data persistent
• Distributed by default
  – Replication
  – Partitioning model
Kafka At LinkedIn
• 175 TB of in-flight log data per colo
• Low-latency: ~1.5 ms
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration
Kafka is about logs
What is a log?
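[Figure: a log drawn as a sequence of records at offsets 0 through 12. The 1st record sits at offset 0, and the next record written is appended after offset 12.]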
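A minimal sketch of the log in the figure, assuming an in-memory list stands in for durable storage. The class and method names (`Log`, `append`, `read`, `next_offset`) are hypothetical and chosen for illustration; they are not Kafka's API.

```python
class Log:
    """Append-only sequence of records; each record gets the next offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Offsets are dense and increasing: 0, 1, 2, ...
        offset = len(self._records)
        self._records.append(record)
        return offset

    def read(self, offset, max_records=100):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]

    def next_offset(self):
        """Offset that the next appended record will receive."""
        return len(self._records)


log = Log()
offsets = [log.append(f"record-{i}") for i in range(13)]
```

Records are never modified in place, so a reader only needs to remember one number, the offset it has reached.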
Partitioning
[Figure: a log split into Partition 0, Partition 1, and Partition 2. Each partition has its own sequence of offsets starting at 0, and the partitions can differ in length.]
Logs: pub/sub done right
[Figure: a Data Source writes to the Log (offsets 0 through 12). Destination System A reads at time = 7 and Destination System B reads at time = 11, each from its own position.]
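One way to picture partitioning is to route each record by a hash of its key, so that every record for a given key lands in the same partition and keeps its relative order there. The sketch below assumes CRC32 as the hash purely for illustration (it is not a claim about what Kafka uses), and `PartitionedLog` is a hypothetical name.

```python
import zlib


class PartitionedLog:
    """A log split into partitions; ordering holds within a partition only."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption: CRC32 stands in for whatever hash a real system uses.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        # Return (partition, offset within that partition).
        return p, len(self.partitions[p]) - 1


plog = PartitionedLog(3)
positions = [plog.append(f"user-{i % 5}", i) for i in range(20)]
```

Each partition numbers its own records from 0, which is why the figure shows three independent offset sequences.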
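The picture can be sketched as two readers that each track their own position in a shared log. This is a simplified illustration: `Consumer` and `poll` are hypothetical names, and a plain list stands in for the log.

```python
class Consumer:
    """Reads a shared log at its own pace by remembering a position."""

    def __init__(self, name, log, position=0):
        self.name = name
        self.log = log
        self.position = position  # offset of the next record to read

    def poll(self, max_records):
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


log = list(range(13))          # records at offsets 0..12
system_a = Consumer("A", log)
system_b = Consumer("B", log)

system_a.poll(7)               # A has consumed offsets 0..6
system_b.poll(11)              # B has consumed offsets 0..10
remaining_for_a = system_a.poll(100)
```

Reading does not remove anything from the log, so a slow destination never affects a fast one; it just has a smaller position.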
Logs And Distributed Systems
Example: A Fault-tolerant CEO Hash Table
Operations
PUT('microsoft', 'bill gates')
PUT('apple', 'steve jobs')
PUT('microsoft', 'steve ballmer')
PUT('google', 'larry page')
PUT('yahoo', 'terry semel')
PUT('google', 'eric schmidt')
PUT('yahoo', 'jerry yang')
PUT('yahoo', 'carol bartz')
PUT('apple', 'tim cook')
PUT('google', 'larry page')
PUT('yahoo', 'scott thompson')
PUT('yahoo', 'marissa mayer')
PUT('microsoft', 'satya nadella')
Final State
{
  'microsoft': 'satya nadella',
  'apple': 'tim cook',
  'google': 'larry page',
  'yahoo': 'marissa mayer'
}
Replicas: Replica 1, Replica 2
The same operations as a log, with offsets
0 PUT(microsoft, bill gates)
1 PUT(apple, steve jobs)
2 PUT(microsoft, steve ballmer)
3 PUT(google, larry page)
4 PUT(yahoo, terry semel)
5 PUT(google, eric schmidt)
6 PUT(yahoo, jerry yang)
7 PUT(yahoo, carol bartz)
8 PUT(apple, tim cook)
9 PUT(google, larry page)
10 PUT(yahoo, scott thompson) <- Replica 1 (offset=10)
11 PUT(yahoo, marissa mayer)
12 PUT(microsoft, satya nadella) <- Replica 2 (offset=12)
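The replica states can be reproduced by replaying the log in order. The sketch below assumes that "offset=N" means the replica has applied every record up to and including offset N; the slide does not say whether the offset is inclusive, and `replay` is a hypothetical helper.

```python
# The 13 PUT operations from the slide, in log order (offsets 0..12).
LOG = [
    ("microsoft", "bill gates"),
    ("apple", "steve jobs"),
    ("microsoft", "steve ballmer"),
    ("google", "larry page"),
    ("yahoo", "terry semel"),
    ("google", "eric schmidt"),
    ("yahoo", "jerry yang"),
    ("yahoo", "carol bartz"),
    ("apple", "tim cook"),
    ("google", "larry page"),
    ("yahoo", "scott thompson"),
    ("yahoo", "marissa mayer"),
    ("microsoft", "satya nadella"),
]


def replay(log, through_offset):
    """Rebuild the hash table by applying records 0..through_offset in order."""
    state = {}
    for key, value in log[:through_offset + 1]:
        state[key] = value  # PUT overwrites any earlier value for the key
    return state


replica_1 = replay(LOG, 10)  # assumption: offset is inclusive
replica_2 = replay(LOG, 12)
```

Under that assumption Replica 1 still has 'scott thompson' for yahoo and 'steve ballmer' for microsoft, while Replica 2 holds the final state. The offset alone describes how far behind a replica is, and replaying the missing records brings it up to date.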
Two System Design Styles
• State-machine Replication
  – Requests are written to The Log; each Peer reads the log and applies them
• Primary-backup
  – Reads and writes go to the Master; the Master writes state changes to The Log; the Slaves apply them
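The difference between the two styles is what gets written to the log: the incoming requests, or the state changes that result from executing them. The following is a toy sketch under that reading of the slide; `apply_request` and the "add" operation are hypothetical and exist only to make the contrast concrete.

```python
def apply_request(state, request):
    """Execute one request against a key-value state (deterministically)."""
    op, key, amount = request
    if op == "add":
        state[key] = state.get(key, 0) + amount
    return state


requests = [("add", "x", 1), ("add", "x", 2), ("add", "y", 5)]

# State-machine replication: the log holds the requests themselves,
# and every peer executes each request in log order.
request_log = list(requests)
peers = [{}, {}, {}]
for peer in peers:
    for req in request_log:
        apply_request(peer, req)

# Primary-backup: the master executes each request and logs the
# resulting state change; backups only copy the changes.
master = {}
change_log = []
for req in requests:
    apply_request(master, req)
    key = req[1]
    change_log.append((key, master[key]))

backups = [{}, {}]
for backup in backups:
    for key, value in change_log:
        backup[key] = value
```

Both approaches converge on the same state. The first requires request execution to be deterministic on every peer; the second only requires that backups apply changes in log order.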
Example: User views job
[Diagram: the Jobs Frontend publishes Job Views to Kafka; the Job Views feed is read by Job Poster Analytics, Rec. Engine, Hadoop, Security, and Monitoring.]
It’s all one big distributed system
[Diagram: data sources (several Espresso, Voldemort, and Oracle instances, User Tracking, Operational Logs, Operational Metrics) write to a central Log, which feeds Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ..., and Production Services.]
Comparing Data Transfer Mechanisms
Stream Processing
Stream Processing = Logs + Jobs
[Diagram: a dataflow graph in which Job 1, Job 2, and Job 3 read from and write to Logs A through F.]
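A job in this picture can be sketched as a function that reads an input log from a saved position, transforms each record, and appends the results to an output log. The wiring below (Log A to Log D to Log F) is an assumption for illustration, since the slide's exact edges are not recoverable, and `run_job` is a hypothetical helper.

```python
def run_job(input_log, output_log, position, transform):
    """Process records from `position` onward and return the new position."""
    for record in input_log[position:]:
        for out in transform(record):
            output_log.append(out)
    return len(input_log)


log_a = ["a b", "b c"]   # input log of text lines
log_d = []               # intermediate log of words
log_f = []               # final log of upper-cased words

# First job: split lines into words. Second job: upper-case each word.
pos_split = run_job(log_a, log_d, 0, str.split)
pos_upper = run_job(log_d, log_f, 0, lambda word: [word.upper()])

# New input arrives later; each job resumes from its saved position.
log_a.append("d")
pos_split = run_job(log_a, log_d, pos_split, str.split)
pos_upper = run_job(log_d, log_f, pos_upper, lambda word: [word.upper()])
```

Because a job's output is itself a log, jobs compose into a graph, and each one can be restarted from its last position without reprocessing earlier records.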
Stream processing is a generalization of batch processing
Examples
• Monitoring
• Security
• Content processing
• Recommendations
• Newsfeed
• ETL
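One way to read this claim is that a batch job is a windowed computation whose window happens to be large, for example a whole day. The sketch below illustrates that reading with a hypothetical `windowed_counts` function and made-up events; it is an interpretation, not the talk's own example.

```python
from collections import Counter, defaultdict


def windowed_counts(events, window):
    """Count keys per time window.

    events: iterable of (timestamp_seconds, key)
    window: window length in seconds
    Returns {window_start: Counter of keys in that window}.
    """
    result = defaultdict(Counter)
    for ts, key in events:
        start = (ts // window) * window
        result[start][key] += 1
    return dict(result)


# Hypothetical events, all within one day.
events = [(0, "click"), (30, "view"), (65, "click"), (3600, "click")]

per_minute = windowed_counts(events, 60)       # stream-style, small windows
per_day = windowed_counts(events, 24 * 3600)   # batch-style, one big window
```

The same function serves both cases; only the window size changes.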
Systems Can Help
Samza Architecture
[Diagram: Jobs run on top of Samza, which sits on top of Kafka and YARN.]
Log-centric Architecture
[Diagram: a central Log feeds the Key-Value Query Layer, the Search Query Layer, Graph DB, OLAP Store, etc., Monitoring & Graphs, Stream Processing, and Hadoop.]
Kafka: http://kafka.apache.org
Samza: http://samza.incubator.apache.org
Log Blog: http://linkd.in/199iMwY
Me: http://www.linkedin.com/in/jaykreps, @jaykreps