
The Tale of Two Streaming APIs (Gerard Maas, Senior SW Engineer)



  1. Processing Fast Data with Apache Spark: The Tale of Two Streaming APIs Gerard Maas Senior SW Engineer, Lightbend, Inc.

  2. Gerard Maas, Señor SW Engineer @maasg https://github.com/maasg https://www.linkedin.com/in/gerardmaas/ https://stackoverflow.com/users/764040/maasg

  3. Agenda
      What is Spark and Why We Should Care?
      Streaming APIs in Spark
        - Structured Streaming
        - Interactive Session 1
        - Spark Streaming
        - Interactive Session 2
      Spark Streaming X Structured Streaming

  4. Streaming | Big Data

  5. 100Tb 5Mb

  6. 100Tb 5Mb/s

  7. ∑ Stream = Dataset 𝚬 Dataset = Stream - Tyler Akidau, Google

  8. Once upon a time...

  9. Apache Spark stack (diagram): Apache Spark Core, Data Sources, Spark SQL, Datasets/Frames, Structured Streaming, Spark Streaming, Spark MLLib, GraphFrames

  10. Apache Spark stack (diagram): Apache Spark Core, Data Sources, Spark SQL, Datasets/Frames, Structured Streaming, Spark Streaming, Spark MLLib, GraphFrames

  11. Structured Streaming

  12. Structured Streaming

  13. Structured Streaming (diagram): sources Kafka, Sockets, HDFS/S3, Custom feed a Streaming DataFrame

  14. Structured Streaming (diagram): sources Kafka, Sockets, HDFS/S3, Custom feed a Streaming DataFrame

  15. Structured Streaming (diagram): sources Kafka, Sockets, HDFS/S3, Custom feed a Streaming DataFrame; a Query with an Output Mode writes to the sinks Kafka, Files, foreachSink, console, memory

  16. Interactive Session 1: Structured Streaming

  17. Sensor Anomaly Detection (diagram): Sensor Data, Multiplexer, Local Process, Structured Streaming, Spark Notebook

  18. Live

  19. Interactive Session 1: Structured Streaming, QUICK RECAP

  20. Sources
      val rawData = sparkSession.readStream
        .format("kafka") // csv, json, parquet, socket, rate
        .option("kafka.bootstrap.servers", kafkaBootstrapServer)
        .option("subscribe", sourceTopic)
        .option("startingOffsets", "latest")
        .load()

  21. Operations
      ...
      val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
      val jsonValues = rawValues.select(from_json($"value", schema) as "record")
      val sensorData = jsonValues.select("record.*").as[SensorData]
      …
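The parsing step on this slide can be tried on a plain (non-streaming) Dataset, since the same select/from_json calls apply to both. The following is a minimal sketch, not the presenter's code: the `SensorData` fields (`id`, `ts`, `temp`) are assumed from the column names used on the Event Time slide, and deriving the schema from the case class is an illustrative choice.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.from_json

object ParseSensorJson {
  // Assumed record shape; the slides only show the columns id, ts and temp.
  final case class SensorData(id: String, ts: Long, temp: Double)

  // Turns JSON strings into typed records, mirroring the Operations slide.
  def parse(spark: SparkSession, raw: Dataset[String]): Dataset[SensorData] = {
    import spark.implicits._
    // Derive the JSON schema from the case class instead of writing it by hand.
    val schema = Encoders.product[SensorData].schema
    raw
      .select(from_json($"value", schema) as "record") // a Dataset[String] exposes one column named "value"
      .select("record.*")                              // flatten the struct into top-level columns
      .as[SensorData]
  }
}
```

Because `parse` only depends on the Dataset API, it could be handed either a batch Dataset (as in a quick local check) or the streaming Dataset produced by `readStream`.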

  22. Event Time
      ...
      val movingAverage = sensorData
        .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
        .withWatermark("timestamp", "30 seconds")
        .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
        .agg(avg($"temp"))
      ...
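With a 30-second window sliding every 10 seconds, each event contributes to three overlapping windows. The sketch below is a plain-Scala model of that assignment, assuming half-open windows `[start, end)` aligned to multiples of the slide; `SlidingWindows`, `Window` and `windowsFor` are hypothetical names and this is not Spark's implementation.

```scala
object SlidingWindows {
  // A half-open event-time window [startSec, endSec), in seconds.
  final case class Window(startSec: Long, endSec: Long)

  // Lists every window containing tsSec, earliest first.
  // Assumes window starts are multiples of slideSec.
  def windowsFor(tsSec: Long, windowSec: Long = 30L, slideSec: Long = 10L): List[Window] = {
    // Latest window start that is not after the event.
    val lastStart = Math.floorDiv(tsSec, slideSec) * slideSec
    Iterator
      .iterate(lastStart)(_ - slideSec)               // walk back one slide at a time
      .takeWhile(start => start + windowSec > tsSec)  // stop once the window ends at or before the event
      .map(start => Window(start, start + windowSec))
      .toList
      .reverse
  }
}
```

For an event at second 45 this gives the windows starting at 20, 30 and 40, which is why the moving average for one reading shows up in three result rows.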

  23. Sinks
      ...
      val visualizationQuery = sensorData.writeStream
        .queryName("visualization") // this will be the SQL table name
        .format("memory")
        .outputMode("update")
        .start()
      ...
      val kafkaWriterQuery = kafkaFormat.writeStream
        .queryName("kafkaWriter")
        .format("kafka")
        .outputMode("append")
        .option("kafka.bootstrap.servers", kafkaBootstrapServer)
        .option("topic", targetTopic)
        .option("checkpointLocation", "/tmp/spark/checkpoint")
        .start()

  24. Use Cases
      ● Streaming ETL
      ● Stream aggregations, windows
      ● Event-time oriented analytics
      ● Arbitrary stateful stream processing
      ● Join Streams with other streams and with Fixed Datasets
      ● Apply Machine Learning Models

  25. Structured Streaming

  26. Spark Streaming (diagram): inputs Kafka, Flume, Kinesis, Twitter, Sockets, HDFS/S3, Custom; processing in Apache Spark with Spark SQL, Spark ML, ...; outputs Databases, HDFS, API Server, Streams

  27. DStream[T] (diagram): one RDD[T] per batch interval along the timeline t0, t1, t2, t3, ..., ti, ti+1

  28. DStream[T] (diagram): one RDD[T] per batch interval along the timeline t0, t1, t2, t3, ..., ti, ti+1; a Transformation T -> U turns each RDD[T] into an RDD[U]

  29. DStream[T] (diagram): one RDD[T] per batch interval along the timeline t0, t1, t2, t3, ..., ti, ti+1; a Transformation T -> U turns each RDD[T] into an RDD[U]; Actions run on each RDD[U]
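The diagrams describe a DStream as a sequence of RDDs, one per batch interval, where a transformation is applied to every batch and an action consumes every batch. A toy model in plain Scala can make that concrete; `ToyDStream` is a hypothetical stand-in that uses `Seq` in place of `RDD` and has none of Spark's scheduling or distribution.

```scala
object MicroBatchModel {
  // Toy stand-in for DStream[T]: batches(i) plays the role of the RDD[T] at interval ti.
  final case class ToyDStream[T](batches: Seq[Seq[T]]) {
    // Transformation T -> U: applied independently to every batch.
    def map[U](f: T => U): ToyDStream[U] =
      ToyDStream(batches.map(_.map(f)))

    // Action: hands each batch, with its interval index, to a side-effecting function.
    def foreachBatch(action: (Int, Seq[T]) => Unit): Unit =
      batches.zipWithIndex.foreach { case (batch, i) => action(i, batch) }
  }
}
```

A call such as `ToyDStream(Seq(Seq(1, 2), Seq(3))).map(_ * 2)` keeps the batch boundaries and only changes the elements, which is the property the T -> U row of the diagram illustrates.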

  30. API: Transformations: map, flatMap, filter; count, reduce, countByValue, reduceByKey; union, join, cogroup

  31. API: Transformations mapWithState … …
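The slide only names `mapWithState`, so the following is a sketch of the idea behind it rather than the Spark API: state kept per key is updated by each micro-batch and carried to the next. `StatefulModel` and `runningSums` are hypothetical names, and the running sum is an assumed example of a state update.

```scala
object StatefulModel {
  // Folds each micro-batch of (key, value) pairs into a running sum per key
  // and returns the state snapshot as it stands after every batch.
  def runningSums(batches: Seq[Seq[(String, Int)]]): Seq[Map[String, Int]] =
    batches
      .scanLeft(Map.empty[String, Int]) { (state, batch) =>
        batch.foldLeft(state) { case (acc, (key, value)) =>
          acc.updated(key, acc.getOrElse(key, 0) + value)
        }
      }
      .tail // drop the initial empty state
}
```

In Spark Streaming the per-key update would be supplied to `mapWithState` through a `StateSpec`, and Spark would keep the state across batches; here the `scanLeft` plays that role.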

  32. API: Transformations transform
      val iotDstream = MQTTUtils.createStream(...)
      val devicePriority = sparkContext.cassandraTable(...)
      val prioritizedDStream = iotDstream.transform{rdd =>
        rdd.map(d => (d.id, d)).join(devicePriority)
      }

  33. Actions: print; saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles; foreachRDD (*). The slide shows sample print output (a "Time: 1459875469000 ms" header between dashed rules, followed by data1, data2) and output files xxx, yyy, zzz for the save actions.

  34. Actions: print; saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles; foreachRDD (*). The slide shows the same print output and output files, and adds Spark SQL, Dataframes, GraphFrames, Any API next to foreachRDD.

  35. Interactive Session 2: Spark Streaming

  36. Sensor Anomaly Detection (diagram): Sensor Data, Multiplexer, Local Process, Structured Streaming, Spark Notebook

  37. Live

  38. Interactive Session 2: Spark Streaming, QUICK RECAP

  39. Streaming Context
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      val streamingContext = new StreamingContext(sparkContext, Seconds(10))

  40. Source
      val kafkaParams = Map[String, String](
        "metadata.broker.list" -> kafkaBootstrapServer,
        "group.id" -> "sensor-tracker-group",
        "auto.offset.reset" -> "largest",
        "enable.auto.commit" -> (false: java.lang.Boolean).toString
      )
      val topics = Set(topic)
      @transient val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        streamingContext, kafkaParams, topics)

  41. Transformations
      import spark.implicits._
      val sensorDataStream = stream.transform{rdd =>
        val jsonData = rdd.map{case (k,v) => v}
        val ds = sparkSession.createDataset(jsonData)
        val jsonDF = spark.read.json(ds)
        val sensorDataDS = jsonDF.as[SensorData]
        sensorDataDS.rdd
      }

  42. DIY Custom Model
      val model = new M2Model()
      …
      model.trainOn(inputData)
      …
      val scoredDStream = model.predictOnValues(inputData)

  43. Output
      suspects.foreachRDD{rdd =>
        val sample = rdd.take(20).map(_.toString)
        val total = s"total found: ${rdd.count}"
        outputBox(total +: sample)
      }

  44. Use Cases
      ● Complex computing/state management (local + cluster)
      ● Streaming Machine Learning
          ○ Learn
          ○ Score
      ● Join Streams with Updatable Datasets
      ● RDD-based streaming computations
      ● [-] Event-time oriented analytics
      ● [-] Optimizations: Query & Data
      ● [-] Continuous processing

  45. Sensor Anomaly Detection (Real Time Detection) (diagram): Sensor Data, Multiplexer, Local Process, Structured Streaming, Structured Streaming

  46. + Structured Streaming

  47. Spark Streaming + Structured Streaming
      val parse: Dataset[String] => Dataset[Record] = ???
      val process: Dataset[Record] => Dataset[Result] = ???
      val serialize: Dataset[Result] => Dataset[String] = ???

      Structured Streaming:
      val kafkaStream = spark.readStream …
      val f = parse andThen process andThen serialize
      val result = f(kafkaStream)
      result.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrapServers)
        .option("topic", writeTopic)
        .option("checkpointLocation", checkpointLocation)
        .start()

      Spark Streaming:
      val dstream = KafkaUtils.createDirectStream(...)
      // foreachRDD is the output operation that exposes each batch RDD; map would receive single elements
      dstream.foreachRDD{rdd =>
        val ds = sparkSession.createDataset(rdd)
        val f = parse andThen process andThen serialize
        val result = f(ds)
        result.write.format("kafka")
          .option("kafka.bootstrap.servers", bootstrapServers)
          .option("topic", writeTopic)
          .option("checkpointLocation", checkpointLocation)
          .save()
      }
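The point of this slide is that the business logic is written once as three functions and composed with `andThen`, so the same pipeline can be fed by either API. The sketch below shows that composition in plain Scala, with `Seq` standing in for `Dataset` so it runs without Spark; the `Record` and `Result` fields, the comma-separated input format and the 30.0 threshold are all assumptions made for illustration.

```scala
import scala.util.Try

object SharedPipeline {
  // Hypothetical shapes; the slide leaves Record and Result unspecified.
  final case class Record(id: String, temp: Double)
  final case class Result(id: String, alert: Boolean)

  // Parses "id,temp" lines and silently drops lines that do not fit that shape.
  val parse: Seq[String] => Seq[Record] = _.flatMap { line =>
    line.split(",") match {
      case Array(id, temp) => Try(Record(id.trim, temp.trim.toDouble)).toOption
      case _               => None
    }
  }

  // Flags readings above an assumed threshold of 30.0.
  val process: Seq[Record] => Seq[Result] = _.map(r => Result(r.id, r.temp > 30.0))

  // Renders each result as "id:alert".
  val serialize: Seq[Result] => Seq[String] = _.map(r => s"${r.id}:${r.alert}")

  // The composed pipeline, built exactly as on the slide.
  val pipeline: Seq[String] => Seq[String] = parse andThen process andThen serialize
}
```

Keeping the stages as values of function type is what allows the composed `f` on the slide to be applied to a streaming Dataset in one column and to a per-batch Dataset in the other.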

  48. Streaming Pipelines (diagram): Structured Streaming; Keyword Extraction, Keyword Relevance, Similarity, DB Storage

  49. Structured Streaming vs. Spark Streaming
      Time: Structured Streaming: Abstract (Processing Time, Event Time); Spark Streaming: Fixed to microbatch Streaming Interval
      Execution: Structured Streaming: Fixed Micro batch, Best Effort MB, Continuous (NRT); Spark Streaming: Fixed Micro batch
      Abstraction: Structured Streaming: DataFrames/Dataset; Spark Streaming: DStream, RDD, Access to the scheduler

  50. New Project? 80% Structured Streaming 20%

  51. lightbend.com/fast-data-platform

  52. @maasg

  53. Gerard Maas, Señor SW Engineer @maasg https://github.com/maasg https://www.linkedin.com/in/gerardmaas/ https://stackoverflow.com/users/764040/maasg

  54. Thank You!
