Patterns Of Streaming Applications Monal Daxini 11/6/2018 @monaldax
Profile • 4+ years building stream processing platform at Netflix • Drove technical vision, roadmap, led implementation • 17+ years building distributed systems @monaldax
Structure Of The Talk • Stream Processing? • Set The Stage • 8 Patterns: 5 Functional, 3 Non-Functional @monaldax
Disclaimer Inspired by true events encountered building and operating a stream processing platform, and by use cases that are in production or in the ideation phase in the cloud. Some code and identifying details have been changed, and artistic liberties have been taken, to protect the privacy of streaming applications and to share the know-how. Some use cases may have been simplified. @monaldax
Stream Processing? Processing Data-In-Motion @monaldax
Lower Latency Analytics
User Activity Stream - Batched (diagram: Flash, Jessica, Luke activity on Feb 25 / Feb 26) @monaldax
Sessions - Batched User Activity Stream (diagram) @monaldax
Correct Sessions - Batched User Activity Stream (diagram) @monaldax
Stream Processing Is Natural For User Activity Stream Sessions (diagram) @monaldax
Why Stream Processing? 1. Low latency insights and analytics 2. Process unbounded data sets 3. ETL as data arrives 4. Ad-hoc analytics and event-driven applications @monaldax
Set The Stage Architecture & Flink
Stream Processing App Architecture Blueprint: Source → Stream Processing Job → Sink @monaldax
Stream Processing App Architecture Blueprint: Sources (+ side input) → Stream Processing Job → Sinks @monaldax
Why Flink?
Flink Programs Are Streaming Dataflows – Streams And Transformation Operators @monaldax Image adapted, source: Flink Docs
Streams And Transformation Operators - Windowing (10-second window) @monaldax Image source: Flink Docs
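The slides show windowing only as a diagram; a minimal sketch of a keyed 10-second window in Flink's Scala DataStream API follows. The PlayEvent type, its fields, and the sample elements are illustrative assumptions, not from the talk.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WindowedCounts {
  // Hypothetical event type; the name and fields are assumptions for illustration.
  case class PlayEvent(titleId: String, count: Long)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env
      .fromElements(PlayEvent("show-a", 1), PlayEvent("show-b", 1), PlayEvent("show-a", 1))
      .keyBy(_.titleId)               // partition the stream by key
      .timeWindow(Time.seconds(10))   // 10-second windows, as in the diagram
      .sum("count")                   // aggregate counts within each window
      .print()

    env.execute("windowed-counts")
  }
}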
Streaming Dataflow DAG @monaldax Image adapted, source: Flink Docs
Scalable Automatic Scheduling Of Operations – the Job Manager (process) schedules operator subtasks across TaskManager processes (e.g., operators at parallelism 2, sink at parallelism 1) @monaldax Image adapted, source: Flink Docs
Flexible Deployment Containers VM / Cloud Bare Metal @monaldax
Stateless Stream Processing No state maintained across events @monaldax Image adapted from: Stephan Ewen
Fault-Tolerant Processing – Stateful Processing: the streaming application (Flink TaskManager) has in-memory / on-disk local state access from source to sink; state is persisted via checkpoints (periodic, asynchronous, incremental) and savepoints (explicitly triggered) @monaldax
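As a rough sketch (not from the talk) of how a job opts into this fault-tolerance model, the snippet below enables periodic checkpoints and an incremental RocksDB state backend; the checkpoint interval, mode, and S3 path are placeholder assumptions.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Periodic, asynchronous checkpoints every 30 seconds with exactly-once semantics.
    env.enableCheckpointing(30000, CheckpointingMode.EXACTLY_ONCE)

    // RocksDB keeps local state on disk; the second argument enables incremental
    // checkpoints. The S3 URI is a placeholder for the durable checkpoint store.
    env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true))

    // Savepoints are triggered explicitly, e.g. `flink savepoint <jobId>` from the CLI.
    env.fromElements(1, 2, 3).map(_ * 2).print()
    env.execute("checkpointed-job")
  }
}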
Levels Of API Abstraction In Flink Source: Flink Documentation
Describing Patterns @monaldax
Describing Design Patterns ● Use Case / Motivation ● Pattern ● Code Snippet & Deployment mechanism ● Related Pattern, if any @monaldax
Patterns Functional
1. Configurable Router @monaldax
1.1 Use Case / Motivation – Ingest Pipelines • Create ingest pipelines for different event streams declaratively • Route events to data warehouse, data stores for analytics • With at-least-once semantics • Streaming ETL - Allow declarative filtering and projection @monaldax
1.1 Keystone Pipeline – A Self-serve Product • SERVERLESS • Turnkey – ready to use • 100% in the cloud • No code, Managed Code & Operations @monaldax
1.1 UI To Provision 1 Data Stream, A Filter, & 3 Sinks
1.1 Optional Filter & Projection (Out of the box)
1.1 Provision 1 Kafka Topic (play_events) And 3 Configurable Router Jobs – fan-out of 3; each router job runs Filter, Projection, and a Connector to its sink (e.g., Elasticsearch, Kafka consumer) @monaldax
1.1 Keystone Pipeline Scale ● Up to 1 trillion new events / day ● Peak: 12M events / sec, 36 GB / sec ● ~4 PB of data transported / day ● ~2000 Router Jobs / 10,000 containers @monaldax
1.1 Pattern: Configurable Isolated Router – producer events flow through a Configurable Router Job (declarative processors) to a sink @monaldax
1.1 Code Snippet: Configurable Isolated Router No User Code
val kafkaSource = getSourceBuilder.fromKafka("topic1").build()
val selectedSink = getSinkBuilder()
  .toSelector(sinkName).declareWith("kafkasink", kafkaSink)
  .or("s3sink", s3Sink).or("essink", esSink).or("nullsink", nullSink).build()
kafkaSource
  .filter(KeystoneFilterFunction).map(KeystoneProjectionFunction)
  .addSink(selectedSink)
@monaldax
1.2 Use Case / Motivation – Ingest Large Streams With High Fan-out Efficiently • Popular stream / topic has a high fan-out factor • Requires large Kafka clusters, expensive (diagram: producer events into Kafka TopicA / TopicB on Cluster1, multiple router jobs with filter and projection) @monaldax
1.2 Pattern: Configurable Co-Isolated Router – merge routing to the same Kafka cluster into one job (producer events filtered and projected into TopicA / TopicB on Cluster1) @monaldax
1.2 Code Snippet: Configurable Co-Isolated Router No User Code
ui_A_Clicks_KafkaSource
  .map(transformer)
  .flatMap(outputFlatMap)
  .map(outputConverter)
  .addSink(kafkaSinkA_Topic2)

ui_A_Clicks_KafkaSource
  .filter(filter)
  .map(projection)
  .map(outputConverter)
  .addSink(kafkaSinkA_Topic1)
@monaldax
2. Script UDF* Component [Static / Dynamic] *UDF – User Defined Function @monaldax
2. Use Case / Motivation – Configurable Business Logic Code for operations like transformations and filtering, embedded in a managed router / streaming job (Source → Job DAG with Biz Logic → Sink) @monaldax
2. Pattern: Static or Dynamic Script UDF (stateless) Component – a script engine executes a function defined in the UI inside the streaming job (Source → UDF → Sink); comes with all the pros and cons of a scripting engine @monaldax
2. Code Snippet: Script UDF Component Contents configurable at runtime
// Script engine
val sm = new ScriptEngineManager()
val se: ScriptEngine = sm.getEngineByName("nashorn")
se.eval(script)

// Job using script UDFs, configurable at runtime
val xscript = new DynamicConfig("x.script")
kafkaSource
  .map(new ScriptFunction(xscript))
  .filter(new ScriptFunction(xscript2))
  .addSink(new NoopSink())
@monaldax
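The ScriptFunction used above isn't defined in the slides; one plausible shape for such a wrapper, assuming a Nashorn-evaluated JavaScript snippet exposing a transform(event) function, is sketched below. The class and function names are hypothetical.

import javax.script.{Invocable, ScriptEngineManager}
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

// Hypothetical UDF wrapper: evaluates a JavaScript snippet with Nashorn and applies
// its transform(event) function to each record.
class ScriptMapFunction(script: String) extends RichMapFunction[String, String] {

  @transient private var invocable: Invocable = _

  override def open(parameters: Configuration): Unit = {
    // The script engine is not serializable, so it is created per task in open().
    val engine = new ScriptEngineManager().getEngineByName("nashorn")
    engine.eval(script)
    invocable = engine.asInstanceOf[Invocable]
  }

  override def map(event: String): String =
    invocable.invokeFunction("transform", event).asInstanceOf[String]
}

// Usage sketch: in practice the script body would come from dynamic configuration.
// kafkaSource.map(new ScriptMapFunction("function transform(e) { return e.toUpperCase(); }"))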
3. The Enricher @monaldax
Next 3 Patterns (3-5) Require Explicit Deployment @monaldax
3. Use Case - Generating Play Events For Personalization And Show Discovery @monaldax
3. Use-case: Create play events enriched with current data from services and a lookup table, for analytics. Using a lookup table keeps originating events lightweight. (Streaming job: Play Logs → rate-limited service calls to Playback History Service and Video Metadata, plus periodically updated lookup data) @monaldax
3. Pattern: The Enricher - Rate limit with a source or service rate limiter, or with resources - Pull or push data, sync / async - Side input: service call, lookup from a data store, or static / periodically updated lookup data (Streaming Job: Source → enrich → Sink) @monaldax
3. Code Snippet: The Enricher
val kafkaSource = getSourceBuilder.fromKafka("topic1").build()
val parsedMessages = kafkaSource.flatMap(parser).name("parser")
val enrichedSessions = parsedMessages
  .filter(reflushFilter).name("filter")
  .map(playbackEnrichment).name("service")
  .map(dataLookup)
enrichedSessions.addSink(sink).name("sink")
@monaldax
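The snippet above enriches synchronously with plain map operators. Where the pattern mentions sync / async calls, an asynchronous variant could use Flink's AsyncDataStream; the sketch below is an assumption-laden illustration (PlayEvent, EnrichedPlayEvent, and videoMetadataClient are made-up stand-ins), not the talk's code.

import java.util.concurrent.TimeUnit

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.async.{AsyncFunction, ResultFuture}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.{Failure, Success}

object AsyncEnrichmentSketch {
  // Hypothetical types and client; placeholders for the real events and services.
  case class PlayEvent(titleId: String)
  case class EnrichedPlayEvent(event: PlayEvent, metadata: String)
  object videoMetadataClient {
    def lookup(titleId: String): String = s"metadata-for-$titleId" // stand-in for a real call
  }

  // Issues the lookup off the main thread and completes the result future when done.
  class MetadataEnricher extends AsyncFunction[PlayEvent, EnrichedPlayEvent] {
    override def asyncInvoke(event: PlayEvent,
                             resultFuture: ResultFuture[EnrichedPlayEvent]): Unit =
      Future(videoMetadataClient.lookup(event.titleId)).onComplete {
        case Success(meta) => resultFuture.complete(Iterable(EnrichedPlayEvent(event, meta)))
        case Failure(e)    => resultFuture.completeExceptionally(e)
      }
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val events = env.fromElements(PlayEvent("t1"), PlayEvent("t2"))

    // Non-blocking enrichment: 1-second timeout, at most 100 in-flight requests.
    AsyncDataStream
      .unorderedWait(events, new MetadataEnricher, 1000, TimeUnit.MILLISECONDS, 100)
      .print()

    env.execute("async-enrichment-sketch")
  }
}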
4. The Co-process Joiner @monaldax
4. Use Case – Play-Impressions Conversion Rate @monaldax
4. Impressions And Plays Scale • 130+ M members • 10+ B impressions / day • 2.5+ B play events / day • ~2 TB processing state @monaldax
4. Join Large Streams With Delayed, Out-Of-Order Events Based On Event Time • # impressions per user play • Impression attributes leading to the play (streaming job joins the impressions and plays Kafka topics into a sink) @monaldax
Understanding Event Time – input events arriving between 10:00 and 15:00 in processing time are assigned to 1-hour windows by event time in the output. Image adapted from the Apache Beam presentation material
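Event-time processing requires the job to extract timestamps and emit watermarks. A minimal sketch (not from the talk) using a bounded-out-of-orderness extractor and a 1-hour event-time window follows; the Impression type, the 10-minute lateness bound, the placeholder aggregation, and the sample data are assumptions.

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object EventTimeSketch {
  // Hypothetical event carrying its own event-time timestamp in milliseconds.
  case class Impression(profileId: String, titleId: String, eventTimeMs: Long)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    env
      .fromElements(
        Impression("p1", "t1", 1000L),
        Impression("p1", "t1", 5000L))
      .assignTimestampsAndWatermarks(
        // Tolerate up to 10 minutes of delayed, out-of-order events.
        new BoundedOutOfOrdernessTimestampExtractor[Impression](Time.minutes(10)) {
          override def extractTimestamp(e: Impression): Long = e.eventTimeMs
        })
      .keyBy(i => s"${i.profileId}_${i.titleId}")
      .timeWindow(Time.hours(1))       // 1-hour event-time window, as in the figure
      .reduce((first, _) => first)     // placeholder aggregation per key and window
      .print()

    env.execute("event-time-sketch")
  }
}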
4. Use Case: Join Impressions And Plays Stream On Event Time – both Kafka topics are keyed (keyBy F1 / keyBy F2), the co-process step holds events in keyed state, merges a matching impression and play (e.g., I2 + P2), and emits the result @monaldax
4. Pattern: The Co-process Joiner • Process and coalesce events for each stream, grouped by the same key (keyed state) • Join if there is a match; evict when joined or timed out (Source 1 → keyBy F1 → State 1, Source 2 → keyBy F2 → State 2, co-process → Sink) @monaldax
4. Code Snippet – The Co-process Joiner, Setup sources
env.setStreamTimeCharacteristic(EventTime)

val impressionSource = kafkaSrc1
  .filter(eventTypeFilter)
  .flatMap(impressionParser)
  .keyBy(in => s"${in.profile_id}_${in.title_id}")

val playbackSource = kafkaSrc2
  .flatMap(playbackParser)
  .keyBy(in => s"${in.profile_id}_${in.title_id}")
@monaldax
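The slide shows only the source setup; a rough sketch of the co-process step is below, assuming hypothetical Impression / Play types, a one-hour eviction timer, and event-time timestamps assigned upstream (the talk's actual types, output, and timeout are not shown).

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.util.Collector

// Hypothetical event and output types; placeholders for the real job's types.
case class Impression(profileId: String, titleId: String)
case class Play(profileId: String, titleId: String)
case class ImpressionPlay(impression: Impression, play: Play)

// Buffers the first event from each side in keyed state, emits the joined pair when
// both have arrived, and relies on an event-time timer to evict unmatched state.
class ImpressionPlayJoiner extends CoProcessFunction[Impression, Play, ImpressionPlay] {

  @transient private var impressionState: ValueState[Impression] = _
  @transient private var playState: ValueState[Play] = _

  override def open(parameters: Configuration): Unit = {
    impressionState = getRuntimeContext.getState(
      new ValueStateDescriptor("impression", classOf[Impression]))
    playState = getRuntimeContext.getState(
      new ValueStateDescriptor("play", classOf[Play]))
  }

  override def processElement1(impression: Impression,
                               ctx: CoProcessFunction[Impression, Play, ImpressionPlay]#Context,
                               out: Collector[ImpressionPlay]): Unit = {
    val play = playState.value()
    if (play != null) {                      // matching play already buffered: join and evict
      out.collect(ImpressionPlay(impression, play))
      playState.clear()
    } else {                                 // otherwise buffer and arm an eviction timer
      impressionState.update(impression)
      ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60L * 60 * 1000)
    }
  }

  override def processElement2(play: Play,
                               ctx: CoProcessFunction[Impression, Play, ImpressionPlay]#Context,
                               out: Collector[ImpressionPlay]): Unit = {
    val impression = impressionState.value()
    if (impression != null) {
      out.collect(ImpressionPlay(impression, play))
      impressionState.clear()
    } else {
      playState.update(play)
      ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60L * 60 * 1000)
    }
  }

  override def onTimer(timestamp: Long,
                       ctx: CoProcessFunction[Impression, Play, ImpressionPlay]#OnTimerContext,
                       out: Collector[ImpressionPlay]): Unit = {
    // Timed out without a match: drop whatever is still buffered for this key.
    impressionState.clear()
    playState.clear()
  }
}

// Wiring, following the slide's source names:
// impressionSource.connect(playbackSource).process(new ImpressionPlayJoiner).addSink(sink)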