I ♥ Logs: Apache Kafka, Stream Processing, and Real-time Data, by Jay Kreps


  1. I ♥ Logs Apache Kafka, Stream Processing, and Real-time Data Jay Kreps

  2. The Plan 1. What is Data Integration? 2. What is Apache Kafka? 3. Logs and Distributed Systems 4. Logs and Data Integration 5. Logs and Stream Processing

  3. Data Integration

  4. Maslow’s Hierarchy: Self-Actualization, Esteem, Love & Belonging, Safety, Physiological

  5. For Data: Automation, Understanding, Semantics, Acquisition/Collection

  6. New Types of Data • Database data – Users, products, orders, etc • Events – Clicks, Impressions, Pageviews, etc • Application metrics – CPU usage, requests/sec • Application logs – Service calls, errors

  7. New Types of Systems • Live Stores – Voldemort – Espresso – Graph – OLAP – Search – InGraphs • Offline – Hadoop – Teradata

  8. Bad [Diagram labels. Data sources: Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics, Production Services. Destination systems: Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ...]

  9. Good [Diagram labels. Data sources: Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics, Production Services. A Log. Destination systems: Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ...]

  10. The Plan 1. What is Data Integration? 2. What is Apache Kafka? 3. Logs and Distributed Systems 4. Logs and Data Integration 5. Logs and Stream Processing

  11. Apache Kafka [Diagram: many producers, a Kafka cluster, many consumers]

  12. A brief history of Kafka

  13. Three design principles 1. One pipeline to rule them all 2. Stream processing >> messaging 3. Clusters not servers

  14. Characteristics • Scalability of a filesystem – Hundreds of MB/sec/server throughput – Many TB per server • Guarantees of a database – Messages strictly ordered – All data persistent • Distributed by default – Replication – Partitioning model

  15. Kafka At LinkedIn • 175 TB of in-flight log data per colo • Low-latency: ~1.5 ms • Replicated to each datacenter • Tens of thousands of data producers • Thousands of consumers • 7 million messages written/sec • 35 million messages read/sec • Hadoop integration

  16. The Plan 1. What is Data Integration? 2. What is Apache Kafka? 3. Logs and Distributed Systems 4. Logs and Data Integration 5. Logs and Stream Processing

  17. Kafka is about logs

  18. What is a log?

  19. [Diagram: a log drawn as a sequence of records at offsets 0 through 12, with the labels "1st Record" and "Next Record Written"]
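
A minimal sketch of the structure on this slide, assuming the log is an in-memory, append-only list in which a record's offset is its position. The `Log` class and its method names are hypothetical, chosen for illustration only; they are not Kafka's API.

```python
class Log:
    """Append-only sequence of records, each addressed by its offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Write a record at the end and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Return the record stored at the given offset."""
        return self._records[offset]

    def next_offset(self):
        """Offset that the next appended record will receive."""
        return len(self._records)


log = Log()
# Thirteen appends are assigned offsets 0 through 12, in write order.
offsets = [log.append(f"record-{i}") for i in range(13)]
```

Records are never modified or reordered once written, so an offset acts as a stable name for a record and as a kind of logical timestamp for everything written before it.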

  20. Partitioning [Diagram: three logs side by side. Partition 0 with offsets 0 to 12; Partition 1 with offsets 0 to 11; Partition 2 with offsets 0 to 11]
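
The slide does not say how records are assigned to partitions. The sketch below assumes one common scheme, hashing a record key, so that all records with the same key land in the same partition and stay ordered relative to each other. `PartitionedLog` and its methods are hypothetical names, and `zlib.crc32` is used only because it gives a stable hash across runs.

```python
import zlib


class PartitionedLog:
    """A set of independent append-only logs, selected by record key."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption: key-hash partitioning; same key, same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append to the key's partition; return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


plog = PartitionedLog(num_partitions=3)
first = plog.append("user-1", "viewed job")
second = plog.append("user-1", "applied to job")
```

Offsets are only meaningful within a partition: ordering is total inside each partition, but there is no global order across partitions.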

  21. Logs: pub/sub done right [Diagram: a Data Source writes to a Log with offsets 0 to 12; Destination System A reads at time = 7; Destination System B reads at time = 11]
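
One way to read this slide is that each destination system keeps its own position in the shared log, so readers progress at different speeds without affecting each other or the writer. The sketch below assumes that reading; the `Reader` class is hypothetical, and "time = 7" and "time = 11" are interpreted as the offsets the two readers have reached.

```python
class Reader:
    """A subscriber that tracks its own position in a shared log."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.position = 0  # offset of the next record this reader will see

    def poll(self, max_records):
        """Return up to max_records unread records and advance the position."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


log = [f"event-{i}" for i in range(13)]  # offsets 0..12, written by the data source
system_a = Reader("Destination System A", log)
system_b = Reader("Destination System B", log)

system_a.poll(7)   # A has consumed offsets 0..6, so its position is 7
system_b.poll(11)  # B has consumed offsets 0..10, so its position is 11

remaining_for_a = system_a.poll(100)  # A catches up later, in the same order
```

Because the log is ordered, a single number per reader is enough to describe exactly what that reader has and has not seen.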

  22. Logs And Distributed Systems

  23. Example: A Fault-tolerant CEO Hash Table

  24. Operations:
      • PUT('microsoft', 'bill gates')
      • PUT('apple', 'steve jobs')
      • PUT('microsoft', 'steve ballmer')
      • PUT('google', 'larry page')
      • PUT('yahoo', 'terry semel')
      • PUT('google', 'eric schmidt')
      • PUT('yahoo', 'jerry yang')
      • PUT('yahoo', 'carol bartz')
      • PUT('apple', 'tim cook')
      • PUT('google', 'larry page')
      • PUT('yahoo', 'scott thompson')
      • PUT('yahoo', 'marissa mayer')
      • PUT('microsoft', 'satya nadella')
      Final State:
      { 'microsoft': 'satya nadella', 'apple': 'tim cook', 'google': 'larry page', 'yahoo': 'marissa mayer' }
      Replica 1, Replica 2

  25. The log, by offset:
      0 PUT(microsoft, bill gates)
      1 PUT(apple, steve jobs)
      2 PUT(microsoft, steve ballmer)
      3 PUT(google, larry page)
      4 PUT(yahoo, terry semel)
      5 PUT(google, eric schmidt)
      6 PUT(yahoo, jerry yang)
      7 PUT(yahoo, carol bartz)
      8 PUT(apple, tim cook)
      9 PUT(google, larry page)
      10 PUT(yahoo, scott thompson)
      11 PUT(yahoo, marissa mayer)
      12 PUT(microsoft, satya nadella)
      Replica 1 (offset=10), Replica 2 (offset=12)
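
The CEO hash table can be sketched as a replay of the PUT log into a dictionary. This assumes that a replica's offset means "the last log entry it has applied"; the slide does not spell that out, and the `replay` helper is a hypothetical name.

```python
# The PUT operations from the slide, in log order (offsets 0..12).
LOG = [
    ("microsoft", "bill gates"),
    ("apple", "steve jobs"),
    ("microsoft", "steve ballmer"),
    ("google", "larry page"),
    ("yahoo", "terry semel"),
    ("google", "eric schmidt"),
    ("yahoo", "jerry yang"),
    ("yahoo", "carol bartz"),
    ("apple", "tim cook"),
    ("google", "larry page"),
    ("yahoo", "scott thompson"),
    ("yahoo", "marissa mayer"),
    ("microsoft", "satya nadella"),
]


def replay(log, last_offset):
    """Rebuild the hash table by applying entries 0..last_offset in order."""
    state = {}
    for key, value in log[: last_offset + 1]:
        state[key] = value  # a later PUT overwrites an earlier one
    return state


replica_1 = replay(LOG, last_offset=10)  # assumed meaning of "offset=10"
replica_2 = replay(LOG, last_offset=12)

# Replica 1 catches up by applying only the entries it has not seen yet.
for key, value in LOG[11:]:
    replica_1[key] = value
```

Two replicas that have applied the same prefix of the log hold the same state, so the offset alone tells a lagging replica what it still has to apply.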

  26. Two System Design Styles [Diagram labels. State-machine Replication: Requests, The Log, Peer, Peer, Peer. Primary-backup: Reads, Writes, Master, state changes, The Log, Slave, Slave]
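
A small sketch contrasting the two styles under their usual definitions: in state-machine replication the log holds the incoming requests and every peer executes them, while in primary-backup the master executes each request and the log holds the resulting state changes. The arithmetic service and all function names here are hypothetical, chosen only to keep the example short.

```python
def apply_request(state, request):
    """Deterministically execute one request against the current state."""
    op, arg = request
    if op == "add":
        return state + arg
    if op == "mul":
        return state * arg
    raise ValueError(f"unknown operation: {op}")


requests = [("add", 5), ("mul", 2), ("add", 1)]


# State-machine replication: the log holds requests; each peer runs them all.
def run_peer(request_log):
    state = 0
    for request in request_log:
        state = apply_request(state, request)
    return state


peer_states = [run_peer(requests) for _ in range(3)]


# Primary-backup: the master executes requests and logs the resulting state.
def run_master(incoming):
    state, change_log = 0, []
    for request in incoming:
        state = apply_request(state, request)
        change_log.append(state)
    return state, change_log


def run_slave(change_log):
    state = 0
    for new_state in change_log:
        state = new_state  # slaves copy results; they do not re-execute
    return state


master_state, change_log = run_master(requests)
slave_state = run_slave(change_log)
```

The first style requires request handling to be deterministic on every peer; the second requires only the master to execute requests, at the cost of routing all writes through it.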

  27. The Plan 1. What is Data Integration? 2. What is Apache Kafka? 3. Logs and Distributed Systems 4. Logs and Data Integration 5. Logs and Stream Processing

  28. Example: User views job [Diagram: Jobs Frontend → Job Views → Kafka → Job Views → Hadoop, Security, Monitoring, Job Poster Analytics, Rec. Engine]

  29. It’s all one big distributed system [Diagram labels. Data sources: Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics, Production Services. A Log. Destination systems: Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ...]

  30. Comparing Data Transfer Mechanisms

  31. The Plan 1. What is Data Integration? 2. What is Apache Kafka? 3. Logs and Distributed Systems 4. Logs and Data Integration 5. Logs and Stream Processing

  32. Stream Processing

  33. Stream Processing = Logs + Jobs [Diagram labels: Log A, Log B, Log C, Job 1, Job 2, Log D, Log E, Job 3, Log F]
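
"Logs + Jobs" can be sketched as jobs that read from one log, remember how far they have read, and append their output to another log. The wiring below (Log A → Job 1 → Log D → Job 3 → Log F) is an assumed subset of the diagram, since the edges are not recoverable from the slide text, and the `Job` class is hypothetical rather than Samza's API.

```python
class Job:
    """Reads new records from a source log and appends results to a sink log."""

    def __init__(self, source, sink, transform):
        self.source = source
        self.sink = sink
        self.transform = transform
        self.position = 0  # offset of the next unread record in the source

    def run_once(self):
        """Process every record not yet consumed; return how many were read."""
        pending = self.source[self.position:]
        for record in pending:
            result = self.transform(record)
            if result is not None:  # returning None drops the record
                self.sink.append(result)
        self.position += len(pending)
        return len(pending)


log_a = ["view:job-1", "click:ad-9", "view:job-2"]
log_d, log_f = [], []

# Job 1 keeps only view events; Job 3 extracts and upper-cases the job id.
job_1 = Job(log_a, log_d, lambda r: r if r.startswith("view:") else None)
job_3 = Job(log_d, log_f, lambda r: r.split(":", 1)[1].upper())

job_1.run_once()
job_3.run_once()

# New input arrives later; each job resumes from its saved position.
log_a.append("view:job-3")
job_1.run_once()
job_3.run_once()
```

Since every job's output is itself a log, jobs compose into a graph, and each one can be restarted from its last position without coordinating with the others.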

  34. Stream processing is a generalization of batch processing

  35. Examples • Monitoring • Security • Content processing • Recommendations • Newsfeed • ETL

  36. Systems Can Help

  37. Samza Architecture [Diagram labels: Job, Job, Job, Job, Samza, Kafka, YARN]

  38. Log-centric Architecture [Diagram labels: Graph DB, OLAP Store, Etc; Key-Value Query Layer; Search Query Layer; Monitoring & Graphs; Stream Processing; Log; Hadoop]

  39. Kafka: http://kafka.apache.org • Samza: http://samza.incubator.apache.org • Log Blog: http://linkd.in/199iMwY • Me: http://www.linkedin.com/in/jaykreps, @jaykreps
