
Streaming: Why should I care? Christian Trebing, Blue Yonder GmbH



  1. Streaming: Why should I care? Christian Trebing, Blue Yonder GmbH, @ctrebing

  2. Agenda: Motivation • Streaming Intro • Implementation • Challenges

  3. Data Processing - The Monolith. [Diagram: Data Input, Data Validation, Machine Learning, and Data Output, all working against THE Database]

  4. Pain: Several teams are developing this application • Customer desperately wants a new feature in machine learning • But the data validation team is in the midst of refactoring their database structure ("will be finished in two weeks") • So you wait…

  5. Could Microservices Help? Great: • No dependency on single state • Independent development • Independent upgrades. Difficult: • Too much data to transfer • Too much data to store in each service. [Diagram: Data Input, Data Validation, Machine Learning, and Data Output, each with its own data]

  6. Are There Other Possibilities?

  7. Streaming Intro

  8. Databases and Streams: Same information in table and stream. Stream: A=1, B=5, C=3, A=8, A=4, C=2. Table after each event: {A: 1} → {A: 1, B: 5} → {A: 1, B: 5, C: 3} → {A: 8, B: 5, C: 3} → {A: 4, B: 5, C: 3} → {A: 4, B: 5, C: 2}
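The table/stream equivalence on this slide can be sketched in a few lines of plain Python. This is only an illustration, not part of the talk: the names STREAM and table_states are made up here, and the "table" is just a dict that is updated once per event.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]

# The stream from the slide, in arrival order.
STREAM: List[Event] = [("A", 1), ("B", 5), ("C", 3), ("A", 8), ("A", 4), ("C", 2)]


def table_states(events: Iterable[Event]) -> List[Dict[str, int]]:
    """Replay the events and return a snapshot of the table after each one."""
    table: Dict[str, int] = {}
    snapshots = []
    for key, value in events:
        table[key] = value             # a later event overwrites the earlier value
        snapshots.append(dict(table))  # copy, so earlier snapshots stay unchanged
    return snapshots


for position, snapshot in enumerate(table_states(STREAM), start=1):
    print(position, snapshot)
```

Replaying the whole stream always yields the current table, while the table alone cannot give back the history of changes.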

  9. Why does it matter? Different services can be in different states • Each service can consume the stream at its own speed • One service can be updated while the other runs. Stream: A=1, B=5, C=3, A=8, A=4, C=2. Service 1 is on index 3, Service 2 is on index 5.
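A minimal sketch of two services reading the same stream at their own pace. The Service class is hypothetical, and it assumes that "index" on the slide means the number of events a service has consumed so far.

```python
# The shared, append-only stream from the slide.
LOG = [("A", 1), ("B", 5), ("C", 3), ("A", 8), ("A", 4), ("C", 2)]


class Service:
    """Hypothetical consumer that remembers how far it has read."""

    def __init__(self, name):
        self.name = name
        self.offset = 0   # number of events consumed so far
        self.table = {}   # this service's own view of the data

    def consume(self, log, max_events):
        """Apply up to max_events unread events from the log."""
        batch = log[self.offset:self.offset + max_events]
        for key, value in batch:
            self.table[key] = value
        self.offset += len(batch)


service_1 = Service("service 1")
service_2 = Service("service 2")
service_1.consume(LOG, 3)   # service 1 has seen three events
service_2.consume(LOG, 5)   # service 2 has seen five events
print(service_1.offset, service_1.table)
print(service_2.offset, service_2.table)
```

Neither service blocks the other: service 1 can be stopped, upgraded, and resumed from its own offset while service 2 keeps reading.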

  10. Partitioned Streams: Sales stream, partitioned by location. Each partition could be handled by a different processor.

      location   product    sales_date  quantity
      Rimini     Spaghetti  2017-07-10  5
      Rimini     Ravioli    2017-07-10  8
      Rimini     Pizza      2017-07-11  1
      Rimini     Spaghetti  2017-07-11  7
      Bilbao     Pizza      2017-07-10  3
      Bilbao     Ravioli    2017-07-11  9
      Bilbao     Spaghetti  2017-07-11  5
      Karlsruhe  Pizza      2017-07-10  8
      Karlsruhe  Ravioli    2017-07-10  3
      Karlsruhe  Ravioli    2017-07-11  7
      Karlsruhe  Spaghetti  2017-07-11  7
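Partitioning by key could look roughly like the sketch below. It is an illustration only: partition_for and partition_stream are made-up helpers, and CRC32 merely stands in for whatever hash function the streaming platform really uses.

```python
import zlib
from collections import defaultdict

# A few of the sales records from the slide, in arrival order.
SALES = [
    {"location": "Rimini", "product": "Spaghetti", "sales_date": "2017-07-10", "quantity": 5},
    {"location": "Bilbao", "product": "Pizza", "sales_date": "2017-07-10", "quantity": 3},
    {"location": "Karlsruhe", "product": "Pizza", "sales_date": "2017-07-10", "quantity": 8},
    {"location": "Rimini", "product": "Ravioli", "sales_date": "2017-07-10", "quantity": 8},
    {"location": "Bilbao", "product": "Ravioli", "sales_date": "2017-07-11", "quantity": 9},
    {"location": "Karlsruhe", "product": "Ravioli", "sales_date": "2017-07-10", "quantity": 3},
]


def partition_for(key, num_partitions):
    """Map a key to a partition number with a stable hash (assumption: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_stream(records, num_partitions):
    """Distribute records by location; order is preserved within each partition."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_for(record["location"], num_partitions)].append(record)
    return partitions


for partition, records in sorted(partition_stream(SALES, 3).items()):
    print(partition, [(r["location"], r["product"]) for r in records])
```

All records of one location land in the same partition, so ordering per location is kept. Several locations may share a partition when there are more keys than partitions.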

  11. Data Processing - With Streaming. [Diagram: Data Input, Data Validation, Machine Learning, and Data Output, each with its own data, connected through a Streaming Platform]

  12. What did we gain? Independent Development • Independent Upgrade • Scalability. Did we throw out databases completely? • Let's see…

  13. Is it magic? No, it's a tradeoff: • A database is so powerful: ACID guarantees, SQL language. You can do nearly everything • This comes at a price: dependency on single state, and scaling is hard. So let's think more about what we really need

  14. What do we lose? Database: ACID • SQL queries. Stream: Ordering on stream partition • Service is responsible for keeping its state. You have to decide whether you can live with that.

  15. Implementation

  16. Apache Kafka

  17. Kafka Clients in Python • pykafka, python-kafka, confluent-kafka-client • Nice comparison has been done here: http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/ • Most performant currently is confluent-kafka-client • Uses the C library librdkafka

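  18. Producer

      from confluent_kafka import Producer

      # Connect to the brokers listed in bootstrap.servers
      p = Producer({'bootstrap.servers': 'mybroker,mybroker2'})
      for data in some_data_source:
          # Messages are sent as bytes, so encode each string first
          p.produce('mytopic', data.encode('utf-8'))
      # Block until all queued messages have been delivered
      p.flush()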

  19. Consumer
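The consumer code from this slide is not part of the transcript. A minimal sketch of what a confluent_kafka poll loop could look like, mirroring the producer slide; the group name 'mygroup', the message limit, and the helper names consume and run_against_kafka are assumptions made for illustration.

```python
def consume(consumer, handle, max_messages):
    """Poll `consumer` until `max_messages` payloads have been handled."""
    handled = 0
    while handled < max_messages:
        msg = consumer.poll(1.0)   # returns None if nothing arrived within 1 second
        if msg is None:
            continue
        if msg.error():
            print("consumer error:", msg.error())
            continue
        handle(msg.value().decode("utf-8"))
        handled += 1


def run_against_kafka():
    """Not called here: needs confluent_kafka installed and a reachable broker."""
    from confluent_kafka import Consumer

    c = Consumer({
        'bootstrap.servers': 'mybroker,mybroker2',
        'group.id': 'mygroup',             # assumed consumer group name
        'auto.offset.reset': 'earliest',   # start at the beginning if no offset is stored
    })
    c.subscribe(['mytopic'])
    try:
        consume(c, print, max_messages=10)
    finally:
        c.close()
```

Keeping the loop separate from the connection setup makes it possible to exercise the loop with a stand-in object that offers the same poll() method.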

  20. Apache Avro: Data Serialization, Enabling Schema Evolution. Clearly defined schema with: • Schema evolution • Schema registry. Writer's schema: string location, string product, string sales_date, int quantity. Reader's schema: string location, string product, string sales_date, int quantity, int delivery_id (default=0)
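How a reader with a newer schema can still read older records may be sketched as follows. This is a simplified illustration in plain Python, not the Avro library: the record name "Sale" and the resolve function are assumptions, and it assumes that delivery_id with its default belongs to the reader's schema.

```python
# Writer's schema from the slide, written in Avro's JSON form as a Python dict.
WRITER_SCHEMA = {
    "type": "record",
    "name": "Sale",  # assumed record name
    "fields": [
        {"name": "location", "type": "string"},
        {"name": "product", "type": "string"},
        {"name": "sales_date", "type": "string"},
        {"name": "quantity", "type": "int"},
    ],
}

# Reader's schema: same fields plus delivery_id with a default value.
READER_SCHEMA = {
    "type": "record",
    "name": "Sale",
    "fields": WRITER_SCHEMA["fields"] + [
        {"name": "delivery_id", "type": "int", "default": 0},
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = record[name]        # field known to both sides
        elif "default" in field:
            resolved[name] = field["default"]    # missing in old data: use the default
        else:
            raise ValueError("no value and no default for field " + name)
    return resolved


old_record = {"location": "Rimini", "product": "Pizza",
              "sales_date": "2017-07-11", "quantity": 1}
print(resolve(old_record, WRITER_SCHEMA, READER_SCHEMA))
```

Fields that only the writer knows are dropped by the same projection, so an older reader can also consume records from a newer writer.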

  21. Avro Schema

  22. AvroProducer

  23. AvroConsumer

  24. Example: Data Validation. Separate valid and invalid sales records
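The slide's own validation code is not in the transcript. A rough sketch of such a processor in plain Python, with lists standing in for output topics; the validation rules and the topic name sales_invalid are assumptions chosen for illustration.

```python
from datetime import date


def validate(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field in ("location", "product"):
        if not record.get(field):
            problems.append("missing " + field)
    sales_date = record.get("sales_date")
    try:
        if not isinstance(sales_date, str):
            raise ValueError("not a string")
        date.fromisoformat(sales_date)
    except ValueError:
        problems.append("sales_date is not an ISO date")
    quantity = record.get("quantity")
    if not isinstance(quantity, int) or quantity < 0:
        problems.append("quantity must be a non-negative integer")
    return problems


def route(records):
    """Send each record to the valid or the invalid output topic."""
    topics = {"sales_validated": [], "sales_invalid": []}
    for record in records:
        problems = validate(record)
        if problems:
            # Keep the reasons next to the record so they can be inspected later.
            topics["sales_invalid"].append({"record": record, "problems": problems})
        else:
            topics["sales_validated"].append(record)
    return topics
```

Because invalid records go to their own topic instead of being dropped, another processor can pick them up later without touching the validation service.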

  25. Additional Processors: Need to evolve your application: • Add processors for evaluation topics • Try a new variant of validation logic. Database: • remember processing state, same for each processor. Streaming: • offset in each processor, can work independently

  26. Challenge - Machine Learning Still Batch

  27. How to get the input data? Remember: There is no possibility to query a stream. All this data has to live somewhere. Options: • Memory of the service • Serving database • Blob store. Yes, that's duplication. We'll have to live with it.

  28. Write Path / Read Path. [Two diagrams. First: data validation → write path → THE database → read path (query) → machine learning. Second: data validation → write path → Blob store → read path (query) → machine learning]

  29. Machine Learning Input Data. [Diagram: locations_validated and products_validated are each kept as a table; sales_validated is joined with both tables and the result is appended to a file]
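One possible reading of this pipeline, sketched in plain Python. The helper names, the extra columns country and category, and the CSV output format are assumptions made for illustration; the slide only names the three validated streams, the tables, the join, and the append.

```python
import csv


def build_table(stream, key):
    """Collapse a stream of records into a table holding the latest record per key."""
    table = {}
    for record in stream:
        table[record[key]] = record
    return table


def append_joined_sales(sales, locations, products, path):
    """Join each sale with its location and product row and append it to a CSV file."""
    written = 0
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for sale in sales:
            location = locations.get(sale["location"])
            product = products.get(sale["product"])
            if location is None or product is None:
                continue  # no matching row yet; a real pipeline has to decide how to handle this
            writer.writerow([
                sale["sales_date"], sale["location"], location["country"],
                sale["product"], product["category"], sale["quantity"],
            ])
            written += 1
    return written
```

The batch machine learning job can then read the file as a whole, without ever querying the stream.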

  30. Challenge - State in Processors

  31. State - Nightmare of every distributed systems engineer. Streaming: Data just rushes through. Why do we need state? • Time window processing • Data you want to join with. Formerly, the database did it for you
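Time window processing is a small example of state that a processor has to keep by itself. A minimal sketch, assuming fixed-size (tumbling) windows and event timestamps in whole seconds; the class TumblingWindowSum is hypothetical.

```python
from collections import defaultdict


class TumblingWindowSum:
    """Sums quantities per key in fixed, non-overlapping time windows."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        # The state the processor must keep: (window start, key) -> running sum.
        self.sums = defaultdict(int)

    def add(self, timestamp, key, quantity):
        """Add one event and return the start of the window it fell into."""
        window_start = timestamp - timestamp % self.window_seconds
        self.sums[(window_start, key)] += quantity
        return window_start


window = TumblingWindowSum(window_seconds=60)
window.add(0, "Pizza", 3)
window.add(59, "Pizza", 2)    # still in the window starting at 0
window.add(60, "Pizza", 1)    # first event of the next window
print(dict(window.sums))
```

If the processor dies, the sums dictionary is gone, which is exactly the problem the following slides address.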

  32. State - Some Challenges: • Failure of a processor • Scaling

  33. State in Stream Processors - Possible Solutions • Just keep in memory. Reprocess stream to warm up • Each processor keeps its own db • Save condensed in stream • Get it from other service. Frameworks exist in other languages: • for example: Kafka Streams, Apache Samza. Up to yesterday: none in Python. Then heard about https://github.com/wintoncode/winton-kafka-streams
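Two of these options can be compared in a small sketch: warming up by reprocessing the whole stream, and restoring from a condensed snapshot plus the remaining events. The CountingProcessor class and its snapshot format are assumptions for illustration, not taken from any of the frameworks mentioned.

```python
class CountingProcessor:
    """Keeps per-product totals in memory, together with its stream offset."""

    def __init__(self):
        self.totals = {}
        self.offset = 0   # number of events already applied

    def process(self, log):
        """Apply all events in `log` that have not been applied yet."""
        for product, quantity in log[self.offset:]:
            self.totals[product] = self.totals.get(product, 0) + quantity
        self.offset = len(log)

    def snapshot(self):
        """Condensed state that could be saved to a stream or a local db."""
        return {"offset": self.offset, "totals": dict(self.totals)}

    @classmethod
    def from_snapshot(cls, snap):
        processor = cls()
        processor.offset = snap["offset"]
        processor.totals = dict(snap["totals"])
        return processor


LOG = [("Pizza", 3), ("Ravioli", 9), ("Pizza", 8), ("Spaghetti", 5)]

original = CountingProcessor()
original.process(LOG[:2])
saved = original.snapshot()      # taken before the processor fails

# Option 1: a fresh processor warms up by reprocessing the whole stream.
warmed_up = CountingProcessor()
warmed_up.process(LOG)

# Option 2: restore the condensed state and process only the missing events.
restored = CountingProcessor.from_snapshot(saved)
restored.process(LOG)

print(warmed_up.totals == restored.totals)
```

Reprocessing needs no extra storage but takes longer as the stream grows; the snapshot restores quickly but must be written regularly and kept consistent with the offset.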

  34. Summary • You have more options for your data processing applications than you might have thought • As always, there are some tradeoffs • You know the challenges

  35. Questions?

  36. Blue Yonder: Best decisions, delivered daily. Blue Yonder GmbH: Ohiostraße 8, 76149 Karlsruhe, Germany, +49 721 383117 0. Blue Yonder Software Limited: 19 Eastbourne Terrace, London, W2 6LG, United Kingdom, +44 20 3626 0360. Blue Yonder Analytics, Inc.: 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA
