Real-time Data Pipelines with Structured Streaming in Apache Spark
Tathagata "TD" Das (@tathadas)
DataEngConf 2018, 18th April, San Francisco
About Me
Started the Spark Streaming project in AMPLab, UC Berkeley.
Currently focused on building Structured Streaming.
PMC member of Apache Spark.
Engineer on the StreamTeam @ Databricks: "we make all your streams come true"
[Diagram: the Apache Spark stack. Applications (Streaming, SQL, ML, Graph) run on a unified processing engine, across environments such as EC2 and YARN, over many data sources.]
Data Pipelines – 10,000 ft view
Unstructured data streams are dumped into a Data Lake; ETL turns the unstructured dump into structured data in a Data Warehouse, which feeds Analytics.
Data Pipeline @ Fortune 100 Company
Trillions of records flow in from:
Security infra (IDS/IPS, DLP, antivirus, load balancers, proxy servers)
Cloud infra & apps (AWS, Azure, Google Cloud)
Server infra (Linux, Unix, Windows)
Network infra (routers, switches, WAPs, databases, LDAP)
The data is dumped into data lakes (DATALAKE1, DATALAKE2), and complex ETL feeds separate warehouses for each type of analytics (DW1, DW2, DW3) serving incidence response, alerting, and reports.
Problems: messy data not ready for analytics, hours of delay in accessing data, very expensive to scale, proprietary formats, no advanced analytics (ML).
New Pipeline @ Fortune 100 Company
Data is dumped and transformed by a complex ETL built on STRUCTURED STREAMING into DELTA, then served with SQL, ML, and streaming for incidence response, alerting, and reports.
Benefits: data usable in minutes/seconds, easy to scale, open formats, enables advanced analytics.
STRUCTURED STREAMING
you should not have to reason about streaming
you should write simple queries & Spark should continuously update the answer
Treat Streams as Unbounded Tables
New data in the data stream = new rows appended to an unbounded input table.
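As a minimal illustration of this model (not from the deck), the sketch below uses Spark's built-in rate source, which emits one timestamped row per second, and the console sink; every trigger appends the newly arrived rows to the logical unbounded table and processes only those rows. The app name and rate are arbitrary choices.

    // Sketch: a stream as an ever-growing table, using the built-in "rate" source.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("unbounded-table-demo").getOrCreate()

    val stream = spark.readStream
      .format("rate")                  // emits columns: timestamp, value
      .option("rowsPerSecond", 1)
      .load()

    stream.writeStream
      .format("console")               // each micro-batch prints the newly appended rows
      .start()
      .awaitTermination()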
Anatomy of a Streaming Query
Example ETL: read JSON data from Kafka, parse the nested JSON, store it in a structured Parquet table, and get end-to-end failure guarantees.
Anatomy of a Streaming Query: Source

    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", ...)
      .option("subscribe", "topic")
      .load()

Specify where to read data from. Built-in support for Files / Kafka / Kinesis*. Can include multiple sources of different types using join() / union(). Returns a DataFrame.
*Available only on Databricks Runtime
DataFrame ⇔ Table
static data = bounded table
streaming data = unbounded table
Single API!
DataFrame / Dataset / SQL

SQL:

    spark.sql("""
      SELECT type, sum(signal)
      FROM devices
      GROUP BY type
    """)

Most familiar to BI analysts. Supports SQL-2003, HiveQL.

DataFrame:

    val df: DataFrame =
      spark.table("device-data")
        .groupBy("type")
        .sum("signal")

Great for data scientists familiar with Pandas and R dataframes.

Dataset:

    val ds: Dataset[(String, Double)] =
      spark.table("device-data")
        .as[DeviceData]
        .groupByKey(_.type)
        .mapValues(_.signal)
        .reduceGroups(_ + _)

Great for data engineers who want compile-time type safety.

Same semantics, same performance. Choose your hammer for whatever nail you have!
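The Dataset example assumes a DeviceData case class that the deck never defines; a plausible minimal definition (field names are guesses based on the columns used) is sketched below. Note that type is a Scala keyword, so a real definition needs backticks; the slide's _.type is shorthand for that.

    // Assumed case class backing the Dataset example above (not shown in the deck).
    // `type` must be backtick-escaped because it is a reserved word in Scala.
    case class DeviceData(device: String, `type`: String, signal: Double)

    // Hypothetical usage: turn a few records into a typed Dataset.
    import spark.implicits._
    val sample = Seq(
      DeviceData("d1", "sensor", 10.0),
      DeviceData("d2", "camera", 25.0)
    ).toDS()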
Anatomy of a Streaming Query

    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", ...)
      .option("subscribe", "topic")
      .load()

Kafka DataFrame:

    key      | value    | topic   | partition | offset | timestamp
    [binary] | [binary] | "topic" | 0         | 345    | 1486087873
    [binary] | [binary] | "topic" | 3         | 2890   | 1486086721
Anatomy of a Streaming Query: Transformations

    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", ...)
      .option("subscribe", "topic")
      .load()
      .selectExpr("cast (value as string) as json")
      .select(from_json($"json", schema).as("data"))

Cast the bytes from the Kafka records to a string, parse it as JSON, and generate nested columns. 100s of built-in, optimized SQL functions like from_json; user-defined functions, lambdas, and function literals with map, flatMap, ...
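The schema passed to from_json is never shown in the deck; a hypothetical definition for a nested device payload, together with the imports the snippet relies on, might look like:

    // Hypothetical `schema` for the JSON payload; field names are assumptions.
    import org.apache.spark.sql.functions.from_json   // needed for from_json above
    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add("device", StringType)
      .add("type", StringType)
      .add("signal", DoubleType)
      .add("timestamp", TimestampType)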
Anatomy of a Streaming Query: Sink

    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", ...)
      .option("subscribe", "topic")
      .load()
      .selectExpr("cast (value as string) as json")
      .select(from_json($"json", schema).as("data"))
      .writeStream
      .format("parquet")
      .option("path", "/parquetTable/")

Write the transformed output to external storage systems. Built-in support for Files / Kafka. Use foreach to execute arbitrary code with the output data. Some sinks are transactional and exactly once (e.g. files).
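The foreach sink mentioned above is driven by a ForeachWriter; a minimal sketch follows. The println body is purely illustrative; a real writer would open a connection to an external system and send each row to it.

    // Illustrative ForeachWriter; println stands in for writing to an external system.
    import org.apache.spark.sql.{ForeachWriter, Row}

    val writer = new ForeachWriter[Row] {
      def open(partitionId: Long, version: Long): Boolean = true  // set up per partition/epoch
      def process(record: Row): Unit = println(record)            // handle one output row
      def close(errorOrNull: Throwable): Unit = ()                // tear down
    }

    // Hypothetical usage on the query above:
    // .writeStream.foreach(writer).start()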
Anatomy of a Streaming Query: Processing Details

    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", ...)
      .option("subscribe", "topic")
      .load()
      .selectExpr("cast (value as string) as json")
      .select(from_json($"json", schema).as("data"))
      .writeStream
      .format("parquet")
      .option("path", "/parquetTable/")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .option("checkpointLocation", "…")
      .start()

Trigger: when to process the data. Options are fixed-interval micro-batches, as-fast-as-possible micro-batches, or continuous processing (new in Spark 2.3). Trigger.ProcessingTime comes from org.apache.spark.sql.streaming.Trigger.
Checkpoint location: for tracking the progress of the query.
Spark automatically streamifies!

    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", ...)
      .option("subscribe", "topic")
      .load()
      .selectExpr("cast (value as string) as json")
      .select(from_json($"json", schema).as("data"))
      .writeStream
      .format("parquet")
      .option("path", "/parquetTable/")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .option("checkpointLocation", "…")
      .start()

Spark SQL converts this batch-like query (DataFrames, Datasets, SQL → logical plan → optimized plan) into a series of incremental execution plans operating on each new batch of data: read from Kafka (Kafka source) → project device, signal → filter signal > 15 → write to Parquet (Parquet sink), using optimized operators (codegen, off-heap, etc.). At t = 1, t = 2, t = 3, ... only the newly arrived data is processed.
Fault-tolerance with Checkpointing
At each trigger (t = 1, t = 2, t = 3, ...) new data is processed and the processed offset info is saved to stable storage in a write-ahead log. It is saved as JSON for forward compatibility. This allows recovery from any failure and gives end-to-end exactly-once guarantees. You can resume after limited changes to your streaming transformations (e.g. adding new filters to drop corrupted data).
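Concretely, recovery needs nothing beyond restarting the same query with the same checkpoint location; a hedged sketch, with a hypothetical checkpoint path and reusing the parsedData DataFrame from the later slides:

    // Restarting with the same checkpointLocation resumes from the offsets
    // recorded in the write-ahead log; the checkpoint path below is hypothetical.
    val query = parsedData.writeStream
      .format("parquet")
      .option("path", "/parquetTable/")
      .option("checkpointLocation", "/checkpoints/etl-query")
      .start()
    // After a crash, running this same code again picks up where it left off.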
Anatomy of a Streaming Query: ETL

    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", ...)
      .option("subscribe", "topic")
      .load()
      .selectExpr("cast (value as string) as json")
      .select(from_json($"json", schema).as("data"))
      .writeStream
      .format("parquet")
      .option("path", "/parquetTable/")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .option("checkpointLocation", "…")
      .start()

Raw data from Kafka is available as structured data in seconds, ready for querying.
Performance: Benchmark
Structured Streaming reuses the Spark SQL optimizer and the Tungsten engine.
40-core throughput (millions of records/s): Kafka Streams ~0.7M, Apache Flink ~22M, Structured Streaming ~65M, i.e. roughly 3x the throughput of Flink at lower cost.
More details in our blog post.
Business Logic independent of Execution Mode

    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", ...)
      .option("subscribe", "topic")
      .load()
      .selectExpr("cast (value as string) as json")
      .select(from_json($"json", schema).as("data"))
      .writeStream
      .format("parquet")
      .option("path", "/parquetTable/")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .option("checkpointLocation", "…")
      .start()

The business logic remains unchanged.
Business Logic independent of Execution Mode

    spark.read.format("kafka")
      .option("kafka.bootstrap.servers", ...)
      .option("subscribe", "topic")
      .load()
      .selectExpr("cast (value as string) as json")
      .select(from_json($"json", schema).as("data"))
      .write
      .format("parquet")
      .option("path", "/parquetTable/")
      .save()

The business logic remains unchanged; only the peripheral code decides whether it is a batch or a streaming query.
Business Logic independent of Execution Mode

    .selectExpr("cast (value as string) as json")
    .select(from_json($"json", schema).as("data"))

Batch: execute on-demand, high latency (hours/minutes), high throughput.
Micro-batch Streaming: low latency (seconds), efficient resource allocation, high throughput.
Continuous Streaming**: ultra-low latency (milliseconds), static resource allocation.
**experimental release in Spark 2.3, read our blog
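For the continuous mode above, the only code change is the trigger; below is a sketch of a hypothetical Kafka-to-Kafka query (continuous processing in Spark 2.3 supports only map-like operations, and the servers, topics, and paths are placeholders):

    // Hypothetical continuous-processing query (experimental in Spark 2.3).
    import org.apache.spark.sql.streaming.Trigger

    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")     // placeholder
      .option("subscribe", "topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")        // map-like transformation only
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")      // placeholder
      .option("topic", "output-topic")                     // placeholder
      .option("checkpointLocation", "/checkpoints/continuous")
      .trigger(Trigger.Continuous("1 second"))              // checkpoint interval, not a batch interval
      .start()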
Event time Aggregations
Windowing is just another type of grouping in Structured Streaming.

    parsedData
      .groupBy(window($"timestamp", "1 hour"))
      .count()                                            // number of records every hour

    parsedData
      .groupBy($"device", window($"timestamp", "10 minutes"))
      .avg("signal")                                      // avg signal strength of each device every 10 mins

Supports UDAFs!
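A runnable version of the first aggregation, using the built-in rate source and the console sink purely for illustration (the source, window size, and complete output mode are choices made here, not taken from the deck):

    // Sketch: count records per 1-minute event-time window on the rate source.
    import org.apache.spark.sql.functions.{col, window}

    val counts = spark.readStream
      .format("rate").option("rowsPerSecond", 10).load()   // columns: timestamp, value
      .groupBy(window(col("timestamp"), "1 minute"))
      .count()

    counts.writeStream
      .outputMode("complete")   // without a watermark, aggregations need complete/update mode
      .format("console")
      .start()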