Processing Fast Data with Apache Spark: The Tale of Two Streaming APIs Gerard Maas Senior SW Engineer, Lightbend, Inc.
Gerard Maas
Señor SW Engineer
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
Agenda
● What is Spark and Why We Should Care?
● Streaming APIs in Spark
  - Structured Streaming
  - Interactive Session 1
  - Spark Streaming
  - Interactive Session 2
● Spark Streaming x Structured Streaming
@maasg
Streaming | Big Data @maasg
100 TB · 5 MB/s @maasg
"Stream = Dataset, Dataset = Stream" - Tyler Akidau, Google @maasg
Once upon a time...
The Spark stack: Apache Spark Core and Data Sources at the base, Spark SQL (Datasets/DataFrames) on top, and the libraries Structured Streaming, Spark Streaming, GraphFrames and Spark MLlib built on them. @maasg
Structured Streaming @maasg
Structured Streaming: sources (Kafka, HDFS/S3, Sockets, Custom) -> Streaming DataFrame @maasg
Structured Streaming: Streaming DataFrame -> Query (with an Output Mode) -> sinks (Kafka, Files on HDFS/S3, foreachSink, console, memory)
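A minimal sketch of how source, query and sink fit together; the socket source, the host/port values and the word-count aggregation are illustrative, not part of the demo:

// Sketch: socket source -> aggregation -> console sink
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("sketch").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")              // could also be "kafka", files, ...
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = counts.writeStream
  .format("console")             // or "kafka", "memory", a foreach sink, ...
  .outputMode("complete")        // the output mode must match the query
  .start()

query.awaitTermination()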
Interactive Session 1 Structured Streaming @maasg
Sensor Anomaly Detection: Sensor Data Multiplexer (local process) -> Structured Streaming (Spark Notebook) @maasg
Live @maasg
Interactive Session 1 Structured Streaming QUICK RECAP @maasg
Sources

val rawData = sparkSession.readStream
  .format("kafka")  // csv, json, parquet, socket, rate
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "latest")
  .load()
Operations

...
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]
...
Event Time

...
val movingAverage = sensorData
  .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
  .withWatermark("timestamp", "30 seconds")
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"temp"))
...
Sinks

...
val visualizationQuery = sensorData.writeStream
  .queryName("visualization")  // this will be the SQL table name
  .format("memory")
  .outputMode("update")
  .start()
...
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter")
  .format("kafka")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .start()
Use Cases
● Streaming ETL
● Stream aggregations, windows
● Event-time oriented analytics
● Arbitrary stateful stream processing
● Join streams with other streams and with fixed datasets (sketched below)
● Apply machine learning models
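A sketch of the stream-to-fixed-dataset join listed above; the `sensorInfo` reference table and its paths are assumptions, not from the demo, and both sides are assumed to share an "id" column:

// Sketch: enriching the streaming sensorData Dataset with a static reference table
val sensorInfo = spark.read.parquet("/path/to/sensor-reference")      // static Dataset
val enriched   = sensorData.join(sensorInfo, Seq("id"), "left_outer") // stream-static join

val enrichedQuery = enriched.writeStream
  .format("parquet")
  .option("path", "/tmp/enriched")
  .option("checkpointLocation", "/tmp/enriched-checkpoint")
  .start()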
Structured Streaming @maasg
Spark Streaming: sources (Kafka, Flume, Kinesis, Twitter, HDFS, Sockets, Custom) -> Apache Spark (Spark Streaming, Spark SQL, Spark ML) -> sinks (Databases, API Server, HDFS/S3, Streams) @maasg
The DStream model: at each batch interval (t0, t1, t2, ..., ti, ti+1) a DStream[T] materializes one RDD[T]; a transformation T -> U turns each RDD[T] into an RDD[U]; actions then run on every resulting RDD.
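In code, the same model looks roughly like this; the socket source and word count are illustrative, and `streamingContext` is assumed to exist (see the Streaming Context slide):

// Sketch of the DStream model: each batch interval yields an RDD[T],
// a transformation turns it into an RDD[U], and an action produces output.
val lines  = streamingContext.socketTextStream("localhost", 9999)   // DStream[String]
val words  = lines.flatMap(_.split(" "))                            // transformation: T -> U
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)              // per-batch word count
counts.print()                                                      // action on every RDD

streamingContext.start()
streamingContext.awaitTermination()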
API: Transformations
map, flatMap, filter
count, reduce, countByValue, reduceByKey
union, join, cogroup
@maasg
API: Transformations: mapWithState ... @maasg
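A hedged sketch of mapWithState keeping a running count per key; the `keyedStream` input is an assumption:

// Sketch: mapWithState maintaining a running count per key across batches.
// `keyedStream: DStream[(String, Int)]` is assumed to exist.
import org.apache.spark.streaming.{State, StateSpec}

def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(newCount)                       // persist the running total
  (key, newCount)
}

val runningCounts = keyedStream.mapWithState(StateSpec.function(updateCount _))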
API: Transformations: transform

val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform { rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)
}
Actions
● print: shows a sample of each batch, e.g.
    -------------------------------------------
    Time: 1459875469000 ms
    -------------------------------------------
    data1
    data2
● saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles
● foreachRDD (*): the gateway to any other API (Spark SQL DataFrames, GraphFrames, ...)
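For example, foreachRDD is the bridge from DStreams into the Spark SQL API; a sketch, where `sensorStream`, the `spark` session and the SensorData case class are assumptions:

// Sketch: using Spark SQL inside foreachRDD on each micro-batch.
// `sensorStream: DStream[SensorData]` is assumed; SensorData is a case class.
import spark.implicits._

sensorStream.foreachRDD { rdd =>
  val df = rdd.toDF()                               // micro-batch as a DataFrame
  df.createOrReplaceTempView("sensors")
  spark.sql("SELECT id, avg(temp) AS avg_temp FROM sensors GROUP BY id").show()
}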
Interactive Session 2 Spark Streaming @maasg
Sensor Anomaly Detection: Sensor Data Multiplexer (local process) -> Structured Streaming (Spark Notebook) @maasg
Live @maasg
Interactive Session 2 Spark Streaming QUICK RECAP @maasg
Streaming Context

import org.apache.spark.streaming.{Seconds, StreamingContext}
val streamingContext = new StreamingContext(sparkContext, Seconds(10))
Source

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> kafkaBootstrapServer,
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "largest",
  "enable.auto.commit" -> (false: java.lang.Boolean).toString
)
val topics = Set(topic)

@transient val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)
Transformations

import spark.implicits._
val sensorDataStream = stream.transform { rdd =>
  val jsonData = rdd.map { case (k, v) => v }
  val ds = sparkSession.createDataset(jsonData)
  val jsonDF = spark.read.json(ds)
  val sensorDataDS = jsonDF.as[SensorData]
  sensorDataDS.rdd
}
DIY: Custom Model

val model = new M2Model()
...
model.trainOn(inputData)
...
val scoredDStream = model.predictOnValues(inputData)
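The custom model mirrors the trainOn/predictOnValues shape of Spark's built-in streaming estimators; for comparison, a sketch using StreamingLinearRegressionWithSGD, where the training and test DStreams and the feature count are assumptions:

// Sketch: Spark's built-in streaming regression exposes the same shape.
// `trainingData: DStream[LabeledPoint]` and `testData: DStream[(Long, Vector)]`
// are assumed to exist; the number of features is illustrative.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val regression = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))

regression.trainOn(trainingData)                 // keep learning as data arrives
regression.predictOnValues(testData).print()     // score a keyed stream of feature vectors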
Output

suspects.foreachRDD { rdd =>
  val sample = rdd.take(20).map(_.toString)
  val total = s"total found: ${rdd.count}"
  outputBox(total +: sample)
}
Use Cases
● Complex computing/state management (local + cluster)
● Streaming machine learning: learn and score
● Join streams with updatable datasets
● RDD-based streaming computations
● [-] Event-time oriented analytics
● [-] Optimizations: query & data
● [-] Continuous processing
@maasg
Sensor Anomaly Detection (Real-Time Detection): Sensor Data Multiplexer (local process) -> Structured Streaming -> Structured Streaming @maasg
+ Structured Streaming
Spark Streaming + Structured Streaming

val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???

Spark Streaming:

val dstream = KafkaUtils.createDirectStream(...)

val f = parse andThen process andThen serialize

dstream.foreachRDD { rdd =>
  val ds = sparkSession.createDataset(rdd)
  val result = f(ds)
  result.write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", writeTopic)
    .option("checkpointLocation", checkpointLocation)
    .save()
}

Structured Streaming:

val kafkaStream = spark.readStream ...

val f = parse andThen process andThen serialize

val result = f(kafkaStream)

result.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", writeTopic)
  .option("checkpointLocation", checkpointLocation)
  .start()
Streaming Pipelines (Structured Streaming): Keyword Extraction -> Keyword Relevance -> Similarity -> DB Storage @maasg
Structured Streaming vs. Spark Streaming

              Structured Streaming                          Spark Streaming
Time          Abstract (processing time, event time)        Fixed to the micro-batch (streaming) interval
Execution     Micro-batch, best-effort micro-batch,         Fixed micro-batch
              continuous (near real time)
Abstraction   DataFrame/Dataset                             DStream, RDD; access to the scheduler
New Project? ~80% of the time Structured Streaming, ~20% Spark Streaming
lightbend.com/fast-data-platform
Gerard Maas
Señor SW Engineer
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
Thank You!