
Lie to me: Demystifying Spark Accumulators (Sergey Zhemzhitsky)

Lie to me: Demystifying Spark Accumulators. Sergey Zhemzhitsky, s.zhemzhitsky@cleverdata.ru. About DMPkit: Data Management Platform, Data Monetization Technology; 1DMC: Data Exchange. More than 13 years in IT, 7 years of friendship.


  1. DMP Segmentation. Find all users who have
     0) taken part in campaign "Star Wars" and
     1) viewed banner "Darth Vader" or "Luke Skywalker" during the last 6 days and
     2) clicked banner "Darth Vader's lightsaber" and
     3) visited the buying area of "Darth Vader's lightsaber" and
     4) not visited the order-confirmed area of "Darth Vader's lightsaber".
     Sample events for cookie c1:

     id | cookie | event_id                 | event_type | campaign_id | timestamp
     1  | c1     | Darth Vader              | impression | Star Wars   | 2018-10-18 14:25:11.462
     2  | c1     | Darth Vader's lightsaber | click      | Star Wars   | 2018-10-18 06:31:12.157
     3  | c1     | Darth Vader's lightsaber | tr. pixel  | Star Wars   | 2018-10-18 18:57:19.628

  2. DMP Segmentation. The same rule and events; the events of c1 are matched against the predicates, producing the pairs (c1, 0), (c1, 1), (c1, 2), (c1, 3).

  3. DMP Segmentation. The same rule and events; predicate 4 ("not visited the order-confirmed area") produces nothing for c1 (Ø).

  4. DMP Segmentation. The rule restated: find all users who have taken part in campaign "Star Wars", viewed banner "Darth Vader" or "Luke Skywalker" during the last 6 days, clicked banner "Darth Vader's lightsaber", visited the buying area of "Darth Vader's lightsaber", and not visited the order-confirmed area of "Darth Vader's lightsaber".

  5. DMP Segmentation, map step: each event is mapped to the (cookie, predicate index) pairs it satisfies: (c1, 0), (c1, 1), (c1, 2), (c1, 3); the negated predicate yields nothing (Ø).

  6. DMP Segmentation, reduce step: the pairs for a cookie are combined into (c1, {0; 1; 2; 3}) and the rule is evaluated as true(0) and true(1) and true(2) and true(3) and not false(4).

  7. DMP Segmentation, reduce step: the rule evaluates to true, so cookie c1 matches the segment.

  8. DMP Segmentation:

      val predicateMatches = events.flatMap { event =>
        rules.value.foldLeft(Set[((String, String), Set[Int])]()) {
          case (acc, (ruleId, rule)) =>
            if (rule.applyGlobal(event)) acc + ((event.cookie, ruleId) -> rule.getMatched(event))
            else acc
        }
      }

      val ruleMatches = predicateMatches
        .reduceByKey(_ ++ _)
        .filter { case ((uid, ruleId), predicates) => rules.value(ruleId).evalMatched(predicates) }
        .keys

      ruleMatches.saveAsNewAPIHadoopDataset(...)

  9. DMP Segmentation: the same code, with the flatMap / reduceByKey / filter / keys transformations highlighted.

  10. DMP Segmentation: the same code, with the saveAsNewAPIHadoopDataset action highlighted.

  11. DMP Segmentation: the whole pipeline, condensed:

      val ruleMatches = events
        .flatMap(...)
        .reduceByKey(...)
        .filter(...)
        .keys

      ruleMatches.saveAsNewAPIHadoopDataset(...)
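The segmentation code above relies on a broadcast rules map and a Rule abstraction (applyGlobal, getMatched, evalMatched) that the deck never shows. A minimal sketch of what such an interface might look like, with the method names taken from the slides and everything else purely illustrative:

      import org.apache.spark.broadcast.Broadcast

      // Hypothetical event record matching the table on slide 1.
      case class Event(cookie: String, eventId: String, eventType: String,
                       campaignId: String, timestamp: java.sql.Timestamp)

      // Hypothetical rule contract assumed by the flatMap / filter above.
      trait Rule extends Serializable {
        // true if the event is relevant for this rule at all
        def applyGlobal(event: Event): Boolean
        // indices of the rule's predicates that this single event satisfies
        def getMatched(event: Event): Set[Int]
        // final verdict once all matched predicate indices of a user are combined
        def evalMatched(predicates: Set[Int]): Boolean
      }

      // The rules would be broadcast to the executors, keyed by rule id:
      // val rules: Broadcast[Map[String, Rule]] = sc.broadcast(loadRules())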

  12. RDDs

  13. DMP Spark Actions: the job actually runs two actions on the same RDD, the save and a treeAggregate that computes statistics:

      val ruleMatches = events
        .flatMap(...)
        .reduceByKey(...)
        .filter(...)
        .keys

      ruleMatches.saveAsNewAPIHadoopDataset(...)
      val stats = ruleMatches.treeAggregate(...)

  14. DMP Spark Actions:

      val data = 1L to 1000000L
      sc.makeRDD(data)
        .map("%09d".format(_))
        .saveAsTextFile("/data/input")

      val rdd = sc.textFile("/data/input", 5)
      rdd.saveAsTextFile("/data/output")

      val stats = FileSystem.getStatistics("file", classOf[RawLocalFileSystem])
      stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2)

  15. DMP Spark Actions: the same test, with the textFile / saveAsTextFile calls highlighted.

  16. DMP Spark Actions: the same test, with the getBytesRead assertion highlighted. BYTES READ: 10 486 160 (~10 MB)

  17. DMP Spark Actions: a second action, a treeAggregate that counts the records, is added:

      val data = 1L to 1000000L
      sc.makeRDD(data)
        .map("%09d".format(_))
        .saveAsTextFile("/data/input")

      rdd.saveAsTextFile("/data/output")
      val numRecords = rdd.treeAggregate(0L)(
        (r: Long, t: String) => r + 1L,
        (r1: Long, r2: Long) => r1 + r2
      )

      stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2)

  18. DMP Spark Actions: the same test; with two actions the input is read twice. BYTES READ: 20 841 256 (~20 MB)

  19. DMP Spark Actions: by default, each transformed RDD may be recomputed each time you run an action on it.

  20. DMP Spark Actions: by default, each transformed RDD may be recomputed each time you run an action on it; with two actions and no caching that means 2x data read.
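Consolidating the test from slides 14 to 18 into one block makes the effect easier to see (same code as in the slides, only reformatted):

      // Without caching, both actions re-read /data/input from scratch,
      // so the input bytes are counted twice in the FileSystem statistics.
      val rdd = sc.textFile("/data/input", 5)      // no cache()

      rdd.saveAsTextFile("/data/output")           // action #1: reads the whole input
      val numRecords = rdd.treeAggregate(0L)(      // action #2: reads the whole input again
        (r: Long, t: String) => r + 1L,
        (r1: Long, r2: Long) => r1 + r2
      )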

  21. By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations

  22. DMP Spark Actions: cache the input RDD:

      val data = 1L to 1000000L
      sc.makeRDD(data)
        .map("%09d".format(_))
        .saveAsTextFile("/data/input")

      val rdd = sc.textFile("/data/input", 5).cache()
      rdd.saveAsTextFile("/data/output")
      val numRecords = rdd.treeAggregate(0L)(
        (r: Long, t: String) => r + 1L,
        (r1: Long, r2: Long) => r1 + r2
      )

      stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2)

  23. DMP Spark Actions: the same test with cache(); the input is read only once. BYTES READ: 10 486 160 (~10 MB)

  24. DMP Spark Actions: now limit the memory available to the test:

      sparkConf
        .set("spark.testing", "true")
        .set("spark.testing.memory", (75*1024*1024).toString)
      ...
      val rdd = sc.textFile("/data/input", 5).cache()
      rdd.saveAsTextFile("/data/output")
      val numRecords = rdd.treeAggregate(0L)(
        (r: Long, t: String) => r + 1L,
        (r1: Long, r2: Long) => r1 + r2
      )
      ...

  25. DMP Spark Actions: the same test; the cached partitions no longer all fit in memory, so part of the input is read again. BYTES READ: 14 557 992 (~14 MB)
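A possible mitigation, not shown in the deck: cache() is persist(StorageLevel.MEMORY_ONLY), and MEMORY_ONLY partitions that do not fit are silently recomputed from the source; persisting with MEMORY_AND_DISK spills them to local disk instead, trading disk I/O for a single read of the input.

      import org.apache.spark.storage.StorageLevel

      // Partitions that do not fit into the 75 MB test heap are spilled to disk
      // rather than dropped and recomputed from /data/input.
      val rdd = sc.textFile("/data/input", 5).persist(StorageLevel.MEMORY_AND_DISK)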

  26. DMP Spark Accumulators: feed the matches into an accumulator inside a map:

      val acc = new MyAccumulator()
      sc.register(acc)

      val ruleMatches = events
        .flatMap(...)
        .reduceByKey(...)
        .filter(...)
        .keys
        .map { item =>
          acc.add(item)
          item
        }

      ruleMatches.saveAsNewAPIHadoopDataset(...)

  27. For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware that each task's update may be applied more than once if tasks or job stages are re-executed. https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
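A minimal illustration of the difference (illustrative code, not from the deck):

      val inTransformation = sc.longAccumulator("in-map")
      val inAction         = sc.longAccumulator("in-foreach")

      val lines = sc.textFile("/data/input")

      // Update inside a transformation: may be re-applied if the stage is
      // re-executed (task retries, speculative tasks, shuffle fetch failures,
      // recomputation of a non-cached RDD).
      val mapped = lines.map { line =>
        inTransformation.add(1L)
        line
      }

      // Update inside an action: counted exactly once per successfully completed task.
      mapped.foreach(_ => inAction.add(1L))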

  28. DMP Spark Accumulators:

      val data = 1L to 10L
      val acc = sc.longAccumulator

      val rdd = sc.makeRDD(data)
        .map { num =>
          acc.add(1L)
          num
        }
        .repartition(3)

      rdd.map(failStage(2))
        .saveAsTextFile("/data/output")

      acc.value shouldBe data.length

  29. DMP Spark Accumulators: the same code.

  30. DMP Spark Accumulators: the same code.

  31. DMP Spark Accumulators: the same code.

  32. DMP Spark Accumulators: the same code, with a comment added: .repartition(3) // <== inserts a new stage here

  33. DMP Spark Accumulators: the same code and comment.

  34. DMP Spark Accumulators: the same code, with the failStage(2) call highlighted.

  35. DMP Spark Accumulators: the assertion fails; the injected fetch failure makes Spark re-execute the map stage, so its accumulator updates are applied again:

      acc.value shouldBe data.length
      // 30 was not equal to 10
      // Expected: 10
      // Actual:   30

  36. DMP Spark Accumulators: how the stage failure is injected:

      val blockManager = SparkEnv.get.blockManager
      val block = blockManager.diskBlockManager.getAllBlocks()
        .filter(_.isInstanceOf[ShuffleDataBlockId])
        .map(_.asInstanceOf[ShuffleDataBlockId])
        .head

      throw new FetchFailedException(
        blockManager.blockManagerId,
        block.shuffleId, block.mapId, block.reduceId,
        "__spark_stage_failed__")
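The failStage helper called in the earlier slides is not shown in full; a sketch of how it might wrap the snippet above, written against the Spark 2.x APIs used in the talk (the stageAttemptNumber guard and the stage-id check are guesses):

      import org.apache.spark.{SparkEnv, TaskContext}
      import org.apache.spark.shuffle.FetchFailedException
      import org.apache.spark.storage.ShuffleDataBlockId

      // Throws a FetchFailedException once, in the given stage only, so the
      // scheduler resubmits the preceding shuffle map stage and re-runs its
      // accumulator updates.
      def failStage[T](stageId: Int): T => T = { record =>
        val ctx = TaskContext.get()
        if (ctx.stageId() == stageId && ctx.stageAttemptNumber() == 0) {
          val blockManager = SparkEnv.get.blockManager
          val block = blockManager.diskBlockManager.getAllBlocks()
            .collectFirst { case b: ShuffleDataBlockId => b }
            .getOrElse(sys.error("no shuffle blocks written yet"))
          throw new FetchFailedException(
            blockManager.blockManagerId,
            block.shuffleId, block.mapId, block.reduceId,
            "__spark_stage_failed__")
        }
        record
      }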

  37. DMP Spark Accumulators: back to the segmentation job from slide 26: here the accumulator is updated inside a transformation (map), so its updates may be applied more than once.

  38. DMP Spark Accumulators: RDD actions (from the Spark programming guide):

      Action               | Meaning
      collect()            | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
      count()              | Return the number of elements in the dataset.
      ...                  | ...
      saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
      ...                  | ...
      foreach(func)        | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.

  39. DMP Spark Accumulators: the same table, with foreach(func) highlighted: it is an action, so accumulator updates made inside it are applied exactly once.

  40. DMP Spark Accumulators: update the accumulator in an action instead:

      val acc = new MyAccumulator()
      sc.register(acc)

      val ruleMatches = events
        .flatMap(...)
        .reduceByKey(...)
        .filter(...)
        .keys

      ruleMatches.saveAsNewAPIHadoopDataset(...)
      ruleMatches.foreach(acc.add)

  41. DMP Spark Accumulators: but foreach is a second action, so the input is read twice again:

      val acc = sc.longAccumulator
      ...
      val rdd = sc.textFile("/data/input", 5)
      rdd.saveAsTextFile("/data/output")
      rdd.foreach(_ => acc.add(1L))
      ...
      stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2)

      BYTES READ: 20 841 256 (~20 MB)

  42. DMP, about the task: a custom RDD? This is what saveAsHadoopDataset looks like inside Spark:

      def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
        // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
        val hadoopConf = conf
        val outputFormatInstance = hadoopConf.getOutputFormat
        val keyClass = hadoopConf.getOutputKeyClass
        val valueClass = hadoopConf.getOutputValueClass
        if (outputFormatInstance == null) {
          throw new SparkException("Output format class not set")
        }
        if (keyClass == null) {
          throw new SparkException("Output key class not set")
        }
        if (valueClass == null) {
          throw new SparkException("Output value class not set")
        }
        SparkHadoopUtil.get.addCredentials(hadoopConf)

        logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + ", " +
          valueClass.getSimpleName + ")")

        if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(self.conf)) {
          // FileOutputFormat ignores the filesystem parameter
          val ignoredFs = FileSystem.get(hadoopConf)
          hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf)
        }

        val writer = new SparkHadoopWriter(hadoopConf)
        writer.preSetup()

        val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
          // Hadoop wants a 32-bit task attempt ID, so if ours is bigger than Int.MaxValue,
          // roll it around by taking a mod. We expect that no task will be attempted 2 billion times.
          val taskAttemptId = (context.taskAttemptId % Int.MaxValue).toInt

          val (outputMetrics, callback) = SparkHadoopWriterUtils.initHadoopOutputMetrics(context)

          writer.setup(context.stageId, context.partitionId, taskAttemptId)
          writer.open()
          var recordsWritten = 0L

          Utils.tryWithSafeFinallyAndFailureCallbacks {
            while (iter.hasNext) {
              val record = iter.next()
              writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef])

              // Update bytes written metric every few records
              SparkHadoopWriterUtils.maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
              recordsWritten += 1
            }
          }(finallyBlock = writer.close())
          writer.commit()
          outputMetrics.setBytesWritten(callback())
          outputMetrics.setRecordsWritten(recordsWritten)
        }

        self.context.runJob(self, writeToFile)
        writer.commitJob()
      }

  43. DMP, about the task: a custom RDD? The same saveAsHadoopDataset listing.

  44. DMP, about the task: a custom RDD or …? The same listing with the task body elided; the interesting line is self.context.runJob(self, writeToFile): the save action ends up calling SparkContext.runJob.

  45. … or SparkContext?

      def foreach(f: T => Unit): Unit = withScope {
        val cleanF = sc.clean(f)
        sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
      }

  46. … or SparkContext? The same code, with the sc.runJob call highlighted: foreach, too, ends up in SparkContext.runJob.
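The same holds for the other actions; for example, in Spark 2.x RDD.count is roughly:

      // RDD.count also funnels through SparkContext.runJob,
      // so intercepting runJob covers every action, not just foreach.
      def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum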

  47. … or SparkContext?

      trait ActionAccumulable extends SparkContext {

        private val accumulators = new ConcurrentHashMap[Long, ActionCallable[_]]()

        abstract override def register(acc: AccumulatorV2[_, _]): Unit = {
          super.register(acc)
          acc match {
            case _: ActionAccumulator[_, _] =>
              this.accumulators.put(acc.id, ActionCallable(acc))
            case _ =>
          }
        }
        ...
      }

  48. … or SparkContext? The same trait.

  49. … or SparkContext? The same trait.
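ActionAccumulator and ActionCallable are the speaker's own types and are not shown in the deck; an illustrative guess at their shape, just to make the trait above readable:

      import org.apache.spark.util.AccumulatorV2

      // Hypothetical marker base class: an accumulator that should be fed
      // every record flowing through every action.
      abstract class ActionAccumulator[IN, OUT] extends AccumulatorV2[IN, OUT]

      // Hypothetical type-erased wrapper so the context can call add()
      // without knowing the accumulator's input type.
      case class ActionCallable[IN](acc: ActionAccumulator[IN, _]) {
        def add(record: Any): Unit = acc.add(record.asInstanceOf[IN])
      }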

  50. … or SparkContext?

      trait ActionAccumulable extends SparkContext {

        abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
            func: (TaskContext, Iterator[T]) => U, ...): Unit = {

          val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
            val accIter = new Iterator[T] {
              override def hasNext: Boolean = iter.hasNext
              override def next(): T = {
                val rec: T = iter.next()
                accumulators.values.foreach(_.add(rec))
                rec
              }
            }
            func(tc, accIter)
          }

          super.runJob(rdd, accFunc, partitions, resultHandler)
        }
      }

  51. … or SparkContext? The same runJob override, with the wrapping Iterator highlighted.

  52. … or SparkContext? The same runJob override, with the func(tc, accIter) call highlighted.

  53. … or SparkContext? The same runJob override, with the super.runJob(rdd, accFunc, partitions, resultHandler) call highlighted.

  54. SparkContext!

      val sc = new SparkContext(...) with ActionAccumulable
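Putting the pieces together (an illustrative sketch; MyActionAccumulator stands for any ActionAccumulator subclass, e.g. a simple counter):

      val sc = new SparkContext(sparkConf) with ActionAccumulable

      val acc = new MyActionAccumulator()   // hypothetical ActionAccumulator subclass
      sc.register(acc)                      // picked up by ActionAccumulable.register

      val rdd = sc.textFile("/data/input", 5)
      // No extra map or foreach step: every record flowing through the action below
      // is also fed to the registered accumulator inside the overridden runJob.
      rdd.saveAsTextFile("/data/output")
      println(acc.value)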

  55. DMP Spark Accumulators:

      sparkConf
        .set("spark.testing", "true")
        .set("spark.testing.memory", (75*1024*1024).toString)
      ...
      val acc = longAccumulator
      sc.register(acc)
      ...
      val rdd = sc.textFile("/data/input", 5) //.cache()
      ...
      rdd.count()
      acc.value shouldBe data.length
      ...

  56. DMP Spark Accumulators: the same test.

  57. DMP Spark Accumulators: the same test, with the textFile call highlighted.

  58. DMP Spark Accumulators: the same test, with the count() action highlighted.

  59. DMP Spark Accumulators: the same test.

  60. DMP Spark Accumulators: the same test. BYTES READ: 10 486 160 (~10 MB)

  61. DMP Spark Accumulators: the same test once more.

  62. DMP Spark Accumulators: the count() action is commented out:

      // rdd.count()
      acc.value shouldBe data.length

  63. DMP Spark Accumulators: saveAsTextFile is used as the only action instead:

      // rdd.count()
      rdd.saveAsTextFile("/data/output")
      acc.value shouldBe data.length

  64. DMP Spark Accumulators: Task not serializable

      org.apache.spark.SparkException: Task not serializable
        at ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
        at ClosureCleaner$.clean(ClosureCleaner.scala:108)
        at SparkContext.clean(SparkContext.scala:2104)
      Caused by: java.io.NotSerializableException: JobConf
      Serialization stack:
        - object not serializable (class: JobConf, value: Configuration: ...
        - field class: PairRDDFunctions$$anonfun$saveAsHadoopDataset$1, name: conf$4, type: class JobConf
        - object class PairRDDFunctions$$anonfun$saveAsHadoopDataset$1, <function0>
        ...
        - field class: ActionAccumulable$$anonfun$1, name: func$1, type: interface.Function2
        - object class ActionAccumulable$$anonfun$1, <function2>

  65. DMP Spark Accumulators: the same stack trace.

  66. DMP Spark Accumulators: the same stack trace, with the ClosureCleaner.clean / SparkContext.clean frames highlighted.

  67. DMP ClosureCleaner intro: the ClosureCleaner traverses the hierarchy of enclosing closures and nulls out any references that are not actually used by the starting closure but are still included in the compiled classes, turning "usually" non-serializable closures into serializable ones.

  68. DMP ClosureCleaner intro: the same definition.

  69. DMP ClosureCleaner intro: the same definition.

  70. DMP ClosureCleaner intro:

      def foreach(f: T => Unit): Unit = withScope {
        val cleanF = sc.clean(f)
        sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
      }

  71. DMP ClosureCleaner intro: the same code, with the function parameter f highlighted.

  72. DMP ClosureCleaner intro: the same code, with the sc.clean(f) call and the cleaned closure cleanF highlighted.

  73. DMP ClosureCleaner intro: the runJob override from slide 50 again, with the wrapped func highlighted.

  74. DMP ClosureCleaner intro: the same override; unlike the built-in actions, it never cleans func before shipping it inside accFunc.

  75. DMP ClosureCleaner intro: the fix is to clean the incoming closure first:

      trait ActionAccumulable extends SparkContext {

        abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
            func: (TaskContext, Iterator[T]) => U, ...): Unit = {

          val cleanF = clean(func)

          val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
            val accIter = new Iterator[T] {
              override def hasNext: Boolean = iter.hasNext
              override def next(): T = {
                val rec: T = iter.next()
                accumulators.values.foreach(_.add(rec))
                rec
              }
            }
            cleanF(tc, accIter)
          }

          super.runJob(rdd, accFunc, partitions, resultHandler)
        }
      }

  76. DMP Spark Accumulators: the test from slide 63 again, now with the cleaning runJob override:

      sparkConf
        .set("spark.testing", "true")
        .set("spark.testing.memory", (75*1024*1024).toString)
      ...
      val acc = longAccumulator
      sc.register(acc)
      ...
      val rdd = sc.textFile("/data/input", 5) //.cache()
      ...
      // rdd.count()
      rdd.saveAsTextFile("/data/output")
      acc.value shouldBe data.length
      ...

  77. DMP Spark Accumulators: the same test; the accumulator is filled during the single save action and the input is read only once. BYTES READ: 10 486 160 (~10 MB)

  78. Datasets

  79. DMP Why Datasets?
