Easy, Scalable, Fault-tolerant Stream Processing with Structured Streaming
Burak Yavuz
DataEngConf NYC, October 31st, 2017
Who am I
● Software Engineer – Databricks - "We make your streams come true"
● Apache Spark Committer
● MS in Management Science & Engineering - Stanford University
● BS in Mechanical Engineering - Bogazici University, Istanbul
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
MISSION: Making Big Data Simple
PRODUCT: Unified Analytics Platform
building robust stream processing apps is hard
Complexities in stream processing
COMPLEX DATA: diverse data formats (json, avro, binary, …); data can be dirty, late, out-of-order
COMPLEX WORKLOADS: combining streaming with interactive queries; machine learning
COMPLEX SYSTEMS: diverse storage systems (Kafka, S3, Kinesis, RDBMS, …); system failures
Structured Streaming
stream processing on the Spark SQL engine: fast, scalable, fault-tolerant
rich, unified, high-level APIs: deal with complex data and complex workloads
rich ecosystem of data sources: integrate with many storage systems
you should not have to reason about streaming
you should write simple queries & Spark should continuously update the answer
Anatomy of a Streaming Query Streaming word count
Anatomy of a Streaming Query

spark.readStream
  .format("kafka")
  .option("subscribe", "input")
  .load()
  .groupBy($"value".cast("string"))
  .count()
  .writeStream
  .format("kafka")
  .option("topic", "output")
  .trigger("1 minute")
  .outputMode(OutputMode.Complete())
  .option("checkpointLocation", "…")
  .start()

Source
• Specify one or more locations to read data from
• Built-in support for Files/Kafka/Socket, pluggable
• Can include multiple sources of different types using union() (see the sketch below)
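A minimal sketch (not from the slides) of that last point: two sources of different types combined with union(). The broker address, directory path, and the single-column event schema are made-up placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("multi-source").getOrCreate()
import spark.implicits._

// Source 1: a Kafka topic, value decoded into a string column "event".
val kafkaEvents = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
  .option("subscribe", "input")
  .load()
  .selectExpr("cast(value as string) as event")

// Source 2: JSON files arriving in a directory; file sources need an
// explicit schema.
val schema = new StructType().add("event", StringType)
val fileEvents = spark.readStream
  .schema(schema)
  .json("/data/incoming")                              // placeholder path
  .select($"event")

// One streaming DataFrame over both sources, same schema on both sides.
val allEvents = kafkaEvents.union(fileEvents)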
Anatomy of a Streaming Query

spark.readStream
  .format("kafka")
  .option("subscribe", "input")
  .load()
  .groupBy('value.cast("string") as 'key)
  .agg(count("*") as 'value)
  .writeStream
  .format("kafka")
  .option("topic", "output")
  .trigger("1 minute")
  .outputMode(OutputMode.Complete())
  .option("checkpointLocation", "…")
  .start()

Transformation
• Using DataFrames, Datasets and/or SQL
• Catalyst figures out how to execute the transformation incrementally
• Internal processing is always exactly-once
Spark automatically streamifies!

input = spark.readStream
  .format("kafka")
  .option("subscribe", "topic")
  .load()

result = input
  .select("device", "signal")
  .where("signal > 15")

result.writeStream
  .format("parquet")
  .start("dest-path")

[Diagram: the query becomes a pipeline of Kafka source → Project (device, signal) → Filter (signal > 15) → sink. DataFrames/Datasets/SQL are compiled to a logical plan, then an optimized physical plan (operator codegen, off-heap data, etc.), and finally a series of incremental execution plans run at t = 1, t = 2, t = 3 as new data arrives.]

Spark SQL converts a batch-like query into a series of incremental execution plans operating on new batches of data.
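A sketch of what "streamifies" means in code (the JSON directory, schema, and output paths are assumptions): the same DataFrame program can run as a one-shot batch job or as a continuously updated streaming query; only read vs. readStream and write vs. writeStream change.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()
val schema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)

// Batch: process the files that exist right now, once.
spark.read.schema(schema).json("/data/devices")
  .select("device", "signal")
  .where("signal > 15")
  .write.parquet("/out/batch")

// Streaming: the identical transformation, updated continuously as new
// files arrive; Spark plans the incremental execution itself.
spark.readStream.schema(schema).json("/data/devices")
  .select("device", "signal")
  .where("signal > 15")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/out/checkpoint")
  .start("/out/streaming")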
Anatomy of a Streaming Query (same query as above)

Sink
• Accepts the output of each batch
• Where supported, sinks are transactional and exactly-once (e.g., Files)
• Use foreach to execute arbitrary code (see the sketch below)
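A minimal sketch (not from the slides) of the foreach sink: the "rate" test source stands in for the real aggregation above, and println is a placeholder for arbitrary per-row code.

import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Stand-in streaming aggregation built on the "rate" test source.
val counts = spark.readStream.format("rate").load()
  .groupBy($"value" % 10 as "key")
  .count()

val query = counts.writeStream
  .outputMode("update")
  .foreach(new ForeachWriter[Row] {
    // Called once per partition of each batch; open connections here.
    def open(partitionId: Long, epochId: Long): Boolean = true
    // Called for every output row; replace println with real side effects.
    def process(row: Row): Unit = println(row)
    // Called when the partition finishes; close connections here.
    def close(errorOrNull: Throwable): Unit = ()
  })
  .start()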
Anatomy of a Streaming Query (same query as above, now with .outputMode("update"))

Output mode – What's output
• Complete – output the whole answer every time
• Update – output changed rows only
• Append – output new rows only

Trigger – When to output
• Specified as a time; eventually supports data size
• No trigger means as fast as possible
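The slides abbreviate the trigger call; a sketch of the full Spark API, reusing the counts aggregation from the previous sketch (the console sink and checkpoint path are placeholder choices).

import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val timedQuery = counts.writeStream
  .format("console")                               // print each batch, for illustration
  .trigger(Trigger.ProcessingTime("1 minute"))     // fire a micro-batch every minute
  .outputMode(OutputMode.Update())                 // or Complete() / Append()
  .option("checkpointLocation", "/tmp/checkpoints/word-count")  // placeholder path
  .start()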
Anatomy of a Streaming Query (same query as above)

Checkpoint (.option("checkpointLocation", "…"))
• Tracks the progress of a query in persistent storage
• Can be used to restart the query if there is a failure
Fault-tolerance with Checkpointing

Checkpointing tracks the progress (offsets) of consuming data from the source, as well as intermediate state. Offsets and metadata are saved as JSON in a write-ahead log, so the query can resume even after changing your streaming transformations.

This gives end-to-end exactly-once guarantees.
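A minimal sketch (paths and the "rate" test source are placeholders) of how the checkpoint is used in practice: the checkpoint location is just a directory, and restarting the same query with the same directory resumes from the saved offsets and state.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Stand-in streaming source; any real query works the same way.
val events = spark.readStream.format("rate").load()

// First run: offsets, metadata, and state are saved under the
// checkpoint directory as the query makes progress.
val query = events.writeStream
  .format("parquet")
  .option("checkpointLocation", "/checkpoints/demo")   // placeholder path
  .start("/tables/demo")                               // placeholder path

// ... the process crashes or is stopped ...

// Second run: starting the identical query with the identical
// checkpointLocation resumes from the saved offsets instead of
// reprocessing the source from the beginning.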
Complex Streaming ETL
Traditional ETL

[Diagram: file dump → table; raw files land within seconds, but the table is ready only after hours.]

Raw, dirty, un/semi-structured data is dumped as files. Periodic jobs run every few hours to convert the raw data into structured data ready for further analytics.
Traditional ETL

Hours of delay before taking decisions on the latest data. Unacceptable when time is of the essence (intrusion detection, anomaly detection, etc.).
Streaming ETL w/ Structured Streaming

Structured Streaming enables raw data to be available as structured data as soon as possible (seconds instead of hours).
Streaming ETL w/ Structured Streaming

Example: JSON data is being received in Kafka. Parse the nested JSON and flatten it, store it in a structured Parquet table, and get end-to-end failure guarantees.

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

val query = parsedData.writeStream
  .option("checkpointLocation", "/checkpoint")
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable")
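The slide omits the pieces needed to actually run this; a self-contained sketch follows. The broker list, the payload schema fields, and the output paths are assumptions for illustration, not part of the original example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("streaming-etl").getOrCreate()
import spark.implicits._

// Assumed schema of the JSON payload carried in the Kafka value field.
val schema = new StructType()
  .add("timestamp", LongType)
  .add("device", StringType)
  .add("date", StringType)
  .add("signal", IntegerType)

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  // placeholder brokers
  .option("subscribe", "topic")
  .load()

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

val query = parsedData.writeStream
  .option("checkpointLocation", "/checkpoint")
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable")

// Block the driver until the streaming query terminates.
query.awaitTermination()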
Reading from Kafka

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()

Specify options to configure:

How?   kafka.bootstrap.servers => broker1,broker2

What?  subscribe        => topic1,topic2,topic3     // fixed list of topics
       subscribePattern => topic*                   // dynamic list of topics
       assign           => {"topicA":[0,1]}         // specific partitions

Where? startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}}
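A sketch of the "Where?" option, reusing the spark session from above (the broker, topic, and offset values are made up): read specific partitions starting from specific offsets.

// Read only partitions 0 and 1 of topicA, starting at offsets 23 and 345.
val fromOffsets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("assign", """{"topicA":[0,1]}""")
  .option("startingOffsets", """{"topicA":{"0":23,"1":345}}""")
  .load()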
Reading from Kafka

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()

The rawData dataframe has the following columns:

key       value     topic     partition  offset  timestamp
[binary]  [binary]  "topicA"  0          345     1486087873
[binary]  [binary]  "topicB"  3          2890    1486086721
Transforming Data

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

Cast the binary value to a string and name the column json. Then parse the json string with from_json and expand it into nested columns, naming the struct data; select("data.*") flattens it.

json                                                   data (nested): timestamp, device, …
{ "timestamp": 1486087873, "device": "devA", …}   →    1486087873, devA, …
{ "timestamp": 1486082418, "device": "devX", …}   →    1486082418, devX, …
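from_json needs a schema for the payload; a minimal sketch of one matching the example above (only the two fields shown on the slide, the rest depends on the actual messages):

import org.apache.spark.sql.types._

// Assumed schema for the nested JSON payload; extend with the
// remaining fields of the real messages.
val schema = new StructType()
  .add("timestamp", LongType)    // e.g. 1486087873
  .add("device", StringType)     // e.g. "devA"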