Building Stream Processing Pipelines
Gyula Fóra (gyfora@sics.se)

Introduction
• Stream processing is getting extremely relevant
• New open source systems triggered an application shift towards “real-time”
• We are seeing more and more complex streaming applications
[Figure: files feed batch processors, which serve high-latency apps; event streams feed stream processors, which serve low-latency apps]

What does streaming enable?
1. Data integration
  • Data becomes available at production rate
  • Apps read data changes, instead of reading the "current" (= last night's/week's) state
  • cf. Kleppmann: "Turning the DB inside out with Samza"
  [Figure labels: data sources; ETL into DW; state of the world; history of the world]
2. Low-latency applications
  • Fresh recommendations, fraud detection, etc.
  • Internet of Things, intelligent manufacturing
  • Results "right here, right now"
3. Batch < Streaming

What can go wrong?
http://www.confluent.io/blog/stream-data-platform-1/

Parts of a streaming infrastructure
[Figure: server logs, sensors, transaction logs, … → Gathering → Broker → Analysis]

Parts of a streaming infrastructure
• Gathering
  • Collect data from different sources: logs, sensors, etc.
  • Integration frameworks (Apache Camel)
• Broker service
  • Store the collected data and make it available for processing
  • Fault-tolerant message queues and logging services
• Stream analysis
  • Analyze the incoming data on the fly
  • Feed results back into the broker for other systems

Broker Systems

Broker systems
• Middle-man between data gathering and processing
• Decouples data collection from analysis
  • Different systems can analyze the data in different ways
  • Stored data can be further analyzed later
[Figure: several producers write to the broker; several consumers read from it]
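The decoupling described above can be sketched with a toy in-memory broker. This is only an illustration: the `Broker` class, the `clicks` topic and the consumer names are hypothetical and do not correspond to any real broker API.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: an append-only list of messages per topic."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> stored messages
        self._offsets = defaultdict(int)   # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        """Called by producers; they never see who consumes the data."""
        self._topics[topic].append(message)

    def poll(self, consumer, topic):
        """Return the messages this consumer has not read yet."""
        start = self._offsets[(consumer, topic)]
        messages = self._topics[topic][start:]
        self._offsets[(consumer, topic)] = start + len(messages)
        return messages


broker = Broker()
broker.publish("clicks", {"user": "a", "page": "/home"})
broker.publish("clicks", {"user": "b", "page": "/cart"})

# Two independent analyses read the same stored data, each at its own pace.
dashboard_batch = broker.poll("dashboard", "clicks")
broker.publish("clicks", {"user": "a", "page": "/cart"})
audit_batch = broker.poll("audit", "clicks")
```

Because each consumer keeps its own read position, a slow or late consumer does not affect the producers or the other consumers.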

Requirements
• Persistence
  • Durability: replay any set of messages on demand
  • This is critical to tolerate failures or modifications in other parts of the pipeline
• High throughput, low latency
  • As the primary data hub it needs to provide high read/write throughput with low latency
• Scalability
• Flexible APIs
• Commonly used broker systems: Kafka, RabbitMQ, ActiveMQ
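A minimal sketch of the "replay on demand" requirement, assuming a single append-only file with one JSON message per line. The `DurableLog` class and its methods are made up for this illustration; real brokers add segmenting, replication and retention on top of the same idea.

```python
import json
import os
import tempfile


class DurableLog:
    """Append-only log persisted as one JSON document per line."""

    def __init__(self, path):
        self.path = path
        with open(self.path, "a"):
            pass  # create the file if it is missing

    def _length(self):
        with open(self.path) as f:
            return sum(1 for _ in f)

    def append(self, message):
        """Write a message and return its offset (its line index)."""
        offset = self._length()
        with open(self.path, "a") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the write to disk before returning
        return offset

    def replay(self, start=0, end=None):
        """Re-read the messages with offsets in [start, end)."""
        with open(self.path) as f:
            lines = f.readlines()
        return [json.loads(line) for line in lines[start:end]]


log_path = os.path.join(tempfile.mkdtemp(), "events.log")
log = DurableLog(log_path)
for reading in [{"sensor": 1, "temp": 20.5}, {"sensor": 2, "temp": 19.0}, {"sensor": 1, "temp": 21.0}]:
    log.append(reading)

# A new reader (for example after a crash) can replay from any offset.
recovered = DurableLog(log_path).replay(start=1)
```

Since messages are never modified in place, a downstream job that failed or changed its logic can simply re-read the range it needs.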

Apache Kafka
• Distributed, partitioned, replicated commit log service
• Very high read/write throughput
• Main concepts
  • Topic
  • Producer
  • Consumer (consumer group)
• Data is partitioned so producers and consumers work in parallel
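The partitioning idea can be sketched as follows. The helper names are hypothetical, and `zlib.crc32` with round-robin assignment is only a stand-in chosen for this example; Kafka's own partitioner and group assignment strategies differ in detail.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition; the same key always lands in the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin over the members of one consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [("alice", "login"), ("bob", "login"), ("alice", "click"), ("alice", "logout")]
for user, event in events:
    partitions[partition_for(user, NUM_PARTITIONS)].append((user, event))

# Each partition is read by exactly one member of the group.
group = assign_partitions(NUM_PARTITIONS, ["consumer-0", "consumer-1"])
```

All events of one key stay in one partition in their original order, which is why ordering holds per partition rather than across the whole topic.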

Apache Kafka
Guarantees
• Partial message ordering
• At-least-once delivery by default
• Exactly-once delivery can be implemented by the applications
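One common way for an application to get exactly-once results on top of at-least-once delivery is to make processing idempotent. The sketch below is an assumption-laden illustration, not a Kafka API: `IdempotentCounter` is a made-up class, and a real system would also have to persist the counts and the last applied offsets together atomically.

```python
class IdempotentCounter:
    """Counts events per key, ignoring redelivered (partition, offset) pairs."""

    def __init__(self):
        self.counts = {}
        self.last_offset = {}  # partition -> highest offset already applied

    def process(self, partition, offset, key):
        # Within a partition offsets only grow, so anything at or below the
        # last applied offset is a redelivery and must not be counted again.
        if offset <= self.last_offset.get(partition, -1):
            return False
        self.counts[key] = self.counts.get(key, 0) + 1
        self.last_offset[partition] = offset
        return True


# At-least-once delivery: offset 1 of partition 0 arrives twice after a retry.
deliveries = [(0, 0, "a"), (0, 1, "b"), (0, 1, "b"), (1, 0, "a"), (0, 2, "a")]
counter = IdempotentCounter()
applied = [counter.process(p, o, k) for p, o, k in deliveries]
```

The duplicate delivery is detected by its offset and skipped, so the final counts are the same as if every message had arrived once.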

Running Kafka
1. Start ZooKeeper server
2. Start Kafka server
3. Create Kafka topics
4. Set up producers/consumers
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

Stream processors

Large-scale stream processing
• Now that the data is sitting in a broker system we need something to process it
• General requirements
  • High throughput (keep up with the broker + more)
  • Expressivity
  • Fault tolerance
  • Low latency
• There are plenty of stream processing systems tailored more towards specific applications

Apache Storm
Overview
• True data streaming
• Low latency – lower throughput
• Low-level API (Bolts, Spouts) + Trident
• At-least-once processing guarantees

Storm – Word Count
[Code listing not recovered: rolling word count, standard Storm API]
[Code listing not recovered: rolling word count, Trident]

Apache Flink
Overview
• True streaming with adjustable latency and throughput
• Rich functional API
• Fault-tolerant operator states and flexible windowing
• Exactly-once processing guarantees

Flink – Word Count

    case class Word(word: String, frequency: Int)

Rolling word count

    val lines: DataStream[String] = env.fromSocketStream(...)

    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()

Window word count

    val lines: DataStream[String] = env.fromSocketStream(...)

    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
      .groupBy("word").sum("frequency")
      .print()
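To make the semantics of the two word counts concrete, here is a plain Python sketch (not Flink code). It assumes each line carries a timestamp in seconds and that the window is a 5-second sliding window evaluated every second; the function names are hypothetical.

```python
from collections import Counter


def rolling_word_count(lines):
    """Yield (word, running_total) after every word, like a rolling sum per key."""
    totals = Counter()
    for line in lines:
        for word in line.split(" "):
            totals[word] += 1
            yield word, totals[word]


def sliding_window_word_count(timed_lines, size=5, slide=1):
    """Count words in windows of `size` seconds, evaluated every `slide` seconds.

    `timed_lines` is a list of (timestamp_in_seconds, line) pairs. Each result
    is (window_end, counts) for the interval [window_end - size, window_end).
    """
    if not timed_lines:
        return []
    last = max(ts for ts, _ in timed_lines)
    results = []
    end = slide
    while end - size <= last:  # stop once a window would start after the last event
        counts = Counter()
        for ts, line in timed_lines:
            if end - size <= ts < end:
                counts.update(line.split(" "))
        results.append((end, dict(counts)))
        end += slide
    return results


rolling = list(rolling_word_count(["a b", "a"]))
windows = dict(sliding_window_word_count([(0, "a b"), (2, "a"), (6, "b")]))
```

The rolling version emits an updated total for every incoming word, while the windowed version only counts words whose timestamps fall inside the current window, so old words drop out as the window slides.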

Apache Spark Streaming
Overview
• Stream processing emulated on a batch system
• High throughput – higher latency
• Functional API (DStreams)
• Exactly-once processing guarantees
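"Stream processing emulated on a batch system" can be pictured as cutting the stream into small batches and running an ordinary batch job on each one. The following is a plain Python sketch of that idea, not the DStream API; `micro_batches` and `run_batch_job` are hypothetical names and the 1-second interval is an assumption.

```python
from collections import Counter


def micro_batches(timed_events, batch_interval):
    """Group (timestamp, value) pairs into consecutive fixed-length batches."""
    batches = {}
    for ts, value in timed_events:
        batch_index = int(ts // batch_interval)
        batches.setdefault(batch_index, []).append(value)
    # Intervals without any events are simply skipped in this sketch.
    return [batches[i] for i in sorted(batches)]


def run_batch_job(batch):
    """An ordinary batch computation, applied to one small batch at a time."""
    return dict(Counter(word for line in batch for word in line.split(" ")))


events = [(0.2, "a b"), (0.9, "a"), (1.1, "b"), (2.5, "c a")]
results = [run_batch_job(batch) for batch in micro_batches(events, batch_interval=1.0)]
```

A result for an event can only be produced once its batch has closed, which is where the higher latency comes from; processing many events per job is what helps throughput.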

Spark – Word Count
[Code listing not recovered: window word count]
[Code listing not recovered: rolling word count (kind of)]

Putting it all together

Streaming pipeline in action
Streaming infrastructure at Bouygues Telecom
• Network and subscriber data gathered
• Added to broker in raw format
• Transformed and analyzed by streaming engine
• Stored back for further processing
• Results processed by other systems
Read more: http://data-artisans.com/flink-at-bouygues.html

Let’s start with something simple
Interactive analysis using Flink, Kafka and Python
1. Load stream into Kafka
2. Create a Flink job to process the data
3. Store results back into Kafka
4. Analyze results using Python notebook
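The four steps can be mimicked end to end in a few lines of plain Python. This is only a stand-in for the real demo: a dictionary plays the role of Kafka, a function plays the role of the Flink job, and the topic names `lines` and `counts` are made up.

```python
from collections import defaultdict

topics = defaultdict(list)  # stand-in for the broker: topic name -> messages

# 1. Load the stream into the broker.
for line in ["to be or not", "to be"]:
    topics["lines"].append(line)


def word_count_job(input_topic, output_topic):
    """2. Process the data and 3. store the results back into the broker."""
    totals = defaultdict(int)
    for line in topics[input_topic]:
        for word in line.split():
            totals[word] += 1
            topics[output_topic].append({"word": word, "count": totals[word]})


word_count_job("lines", "counts")

# 4. Analyze the results, here by keeping the latest count per word.
latest = {record["word"]: record["count"] for record in topics["counts"]}
top_word = max(latest, key=latest.get)
```

Writing the results back to a topic instead of returning them directly is what lets a separate tool, such as a notebook, pick them up later.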

Demo
https://github.com/gyfora/summer-school

What have we learnt?
• Building a proper streaming infrastructure is not trivial (but it’s certainly possible)
• Stream processors are just part of the big picture; other components are critical as well
• There is no single system to provide an end-to-end solution
• Mix and match the different components

Thank you!
