DMP Segmentation c1 find all users who have 0 taken part in campaign[s] "Star Wars" [and] (c1, 0) viewed banner[s] "Darth Vader" or "Luke Skywalker" 1 (c1, 1) during [last] 6 day[s] [and] 2 clicked banner[s] "Darth Vader's lightsaber" [and] (c1, 2) 3 visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber" 4 id cookie event_id event_type campaign_id timestamp … 1 c1 Darth Vader impression Star Wars 2018-10-18 14:25:11.462 … 2 c1 Darth Vader's lightsaber click Star Wars 2018-10-18 06:31:12.157 … 3 c1 Darth Vader's lightsaber tr. pixel Star Wars 2018-10-18 18:57:19.628 …
DMP Segmentation c1 find all users who have 0 taken part in campaign[s] "Star Wars" [and] (c1, 0) viewed banner[s] "Darth Vader" or "Luke Skywalker" 1 (c1, 1) during [last] 6 day[s] [and] 2 clicked banner[s] "Darth Vader's lightsaber" [and] (c1, 2) 3 visited buying area of "Darth Vader's lightsaber" [and] (c1, 3) not visited order confirmed area of "Darth Vader's lightsaber" 4 id cookie event_id event_type campaign_id timestamp … 1 c1 Darth Vader impression Star Wars 2018-10-18 14:25:11.462 … 2 c1 Darth Vader's lightsaber click Star Wars 2018-10-18 06:31:12.157 … 3 c1 Darth Vader's lightsaber tr. pixel Star Wars 2018-10-18 18:57:19.628 …
DMP Segmentation c1 find all users who have 0 taken part in campaign[s] "Star Wars" [and] (c1, 0) viewed banner[s] "Darth Vader" or "Luke Skywalker" 1 (c1, 1) during [last] 6 day[s] [and] 2 clicked banner[s] "Darth Vader's lightsaber" [and] (c1, 2) 3 visited buying area of "Darth Vader's lightsaber" [and] (c1, 3) not visited order confirmed area of "Darth Vader's lightsaber" Ø 4 id cookie event_id event_type campaign_id timestamp … 1 c1 Darth Vader impression Star Wars 2018-10-18 14:25:11.462 … 2 c1 Darth Vader's lightsaber click Star Wars 2018-10-18 06:31:12.157 … 3 c1 Darth Vader's lightsaber tr. pixel Star Wars 2018-10-18 18:57:19.628 …
DMP Segmentation find all users who have taken part in campaign[s] "Star Wars" viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] clicked banner[s] "Darth Vader's lightsaber" visited buying area of "Darth Vader's lightsaber" not visited order confirmed area of "Darth Vader's lightsaber"
DMP Segmentation map find all users who have (c1, 0) taken part in campaign[s] "Star Wars" viewed banner[s] "Darth Vader" or (c1, 1) "Luke Skywalker" during [last] 6 day[s] (c1, 2) clicked banner[s] "Darth Vader's lightsaber" (c1, 3) visited buying area of "Darth Vader's lightsaber" not visited order confirmed area of "Darth Vader's lightsaber" Ø
DMP Segmentation map reduce find all users who have (c1, 0) (c1, 0;1;2;3) taken part in campaign[s] "Star Wars" viewed banner[s] "Darth Vader" or true(0) and (c1, 1) "Luke Skywalker" during [last] 6 day[s] true(1) and true(2) and (c1, 2) clicked banner[s] "Darth Vader's lightsaber" true(3) and (c1, 3) visited buying area of "Darth Vader's lightsaber" not false(4) not visited order confirmed area of "Darth Vader's lightsaber" Ø
DMP Segmentation map reduce find all users who have (c1, 0) (c1, 0;1;2;3) taken part in campaign[s] "Star Wars" viewed banner[s] "Darth Vader" or true(0) and (c1, 1) "Luke Skywalker" during [last] 6 day[s] true(1) and c1 true(2) and (c1, 2) clicked banner[s] "Darth Vader's lightsaber" true(3) and (c1, 3) visited buying area of "Darth Vader's lightsaber" not false(4) not visited order confirmed area of "Darth Vader's lightsaber" Ø
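The next slides run these rules as a Spark job. For readability, here is a rough sketch of what a rule object could look like; the talk never shows this class, so the shape below is only an assumption that mirrors the methods called on the following slides (applyGlobal, getMatched, evalMatched) and the columns of the event table above.

// Hypothetical shapes, not the talk's actual code.
case class Event(cookie: String, eventId: String, eventType: String,
                 campaignId: String, timestamp: java.sql.Timestamp)

case class Rule(predicates: IndexedSeq[Event => Boolean],  // numbered predicates 0..n
                expr: Set[Int] => Boolean) {               // the and/or/not expression over them
  // does this single event touch the rule at all (campaign, time window, ...)?
  def applyGlobal(event: Event): Boolean = predicates.exists(_(event))
  // indices of the predicates this single event satisfies
  def getMatched(event: Event): Set[Int] =
    predicates.zipWithIndex.collect { case (p, i) if p(event) => i }.toSet
  // once all matched indices for a user are merged, is the whole expression true?
  def evalMatched(matched: Set[Int]): Boolean = expr(matched)
}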
DMP Segmentation val predicateMatches = events.flatMap { event => rules.value.foldLeft(Set[((String, String), Set[Int])]()) { case (acc, (ruleId, rule)) => if (rule.applyGlobal(event)) acc + ((event.cookie, ruleId) -> rule.getMatched(event)) else acc } } val ruleMatches = predicateMatches .reduceByKey(_ ++ _) .filter { case ((uid, ruleId), predicates) => rules.value(ruleId).evalMatched(predicates) } .keys ruleMatches.saveAsNewAPIHadoopDataset(...)
DMP Segmentation val ruleMatches = events . flatMap (...) . reduceByKey (...) . filter (...) . keys ruleMatches . saveAsNewAPIHadoopDataset (...)
RDDs
DMP Spark Actions val ruleMatches = events .flatMap(...) .reduceByKey(...) .filter(...) .keys ruleMatches . saveAsNewAPIHadoopDataset (...) val stats = ruleMatches . treeAggregate (...)
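The stats call is elided on the slide. As an assumed example (not the talk's actual code), per-rule match counts could be computed with treeAggregate's zero value, seqOp and combOp:

// ruleMatches: RDD[(String, String)] of (cookie, ruleId) pairs, as built earlier.
val stats: Map[String, Long] = ruleMatches.treeAggregate(Map.empty[String, Long])(
  seqOp  = (acc, m) => acc.updated(m._2, acc.getOrElse(m._2, 0L) + 1L),
  combOp = (a, b) => b.foldLeft(a) { case (acc, (rule, n)) =>
    acc.updated(rule, acc.getOrElse(rule, 0L) + n)
  }
)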
DMP Spark Actions val data = 1L to 1000000L sc . makeRDD ( data ) . map ( "%09d" . format (_)) . saveAsTextFile ( "/data/input" ) val rdd = sc .textFile( "/data/input" , 5) rdd.saveAsTextFile( "/data/output" ) val stats = FileSystem . getStatistics ( "file" , classOf [RawLocalFileSystem]) stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2)
DMP Spark Actions val data = 1L to 1000000L sc .makeRDD(data) .map( "%09d" .format(_)) .saveAsTextFile( "/data/input" ) val rdd = sc . textFile ( "/data/input" , 5 ) rdd . saveAsTextFile ( "/data/output" ) val stats = FileSystem . getStatistics ( "file" , classOf [RawLocalFileSystem]) stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2)
DMP Spark Actions val data = 1L to 1000000L sc .makeRDD(data) .map( "%09d" .format(_)) .saveAsTextFile( "/data/input" ) rdd.saveAsTextFile( "/data/output" ) val stats = FileSystem . getStatistics ( "file" , classOf [ RawLocalFileSystem ]) stats . getBytesRead shouldBe data . length * 10L + ( data . length / 2 +- data . length / 2 ) BYTES READ: 10 486 160 (~10 MB)
DMP Spark Actions val data = 1L to 1000000L sc .makeRDD(data) .map( "%09d" .format(_)) .saveAsTextFile( "/data/input" ) rdd. saveAsTextFile ("/data/output") val numRecords = rdd. treeAggregate (0L)( (r: Long, t: String) => r + 1L, (r1: Long, r2: Long) => r1 + r2 ) stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2)
DMP Spark Actions val data = 1L to 1000000L sc .makeRDD(data) .map( "%09d" .format(_)) .saveAsTextFile("/data/input") rdd. saveAsTextFile ("/data/output") val numRecords = rdd. treeAggregate (0L)( (r: Long, t: String) => r + 1L, (r1: Long, r2: Long) => r1 + r2 ) stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2) BYTES READ: 20 841 256 (~20 MB)
DMP Spark Actions By default, each transformed RDD may be recomputed each time you run an action on it .
DMP Spark Actions By default, each transformed RDD may be recomputed each time you run an action on it. 2x data read
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it . https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations
DMP Spark Actions val data = 1L to 1000000L sc .makeRDD(data) .map( "%09d" .format(_)) .saveAsTextFile( "/data/input" ) val rdd = sc .textFile("/data/input", 5). cache () rdd. saveAsTextFile ("/data/output") val numRecords = rdd. treeAggregate (0L)( (r: Long, t: String) => r + 1L, (r1: Long, r2: Long) => r1 + r2 ) stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2)
DMP Spark Actions val data = 1L to 1000000L sc .makeRDD(data) .map( "%09d" .format(_)) .saveAsTextFile( "/data/input" ) val rdd = sc .textFile("/data/input", 5). cache () rdd. saveAsTextFile ("/data/output") val numRecords = rdd. treeAggregate (0L)( (r: Long, t: String) => r + 1L, (r1: Long, r2: Long) => r1 + r2 ) stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2) BYTES READ: 10 486 160 (~10 MB)
DMP Spark Actions sparkConf . set ( "spark.testing" , "true" ) . set ( "spark.testing.memory" , ( 75*1024*1024 ). toString ) ... val rdd = sc .textFile("/data/input", 5). cache () rdd. saveAsTextFile ("/data/output") val numRecords = rdd. treeAggregate (0L)( (r: Long, t: String) => r + 1L, (r1: Long, r2: Long) => r1 + r2 ) ... ... ...
DMP Spark Actions sparkConf . set ( "spark.testing" , "true" ) . set ( "spark.testing.memory" , ( 75*1024*1024 ). toString ) ... val rdd = sc .textFile("/data/input", 5). cache () rdd. saveAsTextFile ("/data/output") val numRecords = rdd. treeAggregate (0L)( (r: Long, t: String) => r + 1L, (r1: Long, r2: Long) => r1 + r2 ) ... ... ... BYTES READ: 14 557 992 (~14 MB)
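The ~14 MB here means part of the cache was evicted from the small test heap and re-read from the input. One possible mitigation, sketched as an assumption rather than what the talk did, is a disk-backed storage level so evicted partitions are spilled to local disk instead of being re-read from /data/input:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("/data/input", 5).persist(StorageLevel.MEMORY_AND_DISK)
rdd.saveAsTextFile("/data/output")
val numRecords = rdd.treeAggregate(0L)(
  (r: Long, _: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)   // second action is served from memory/disk blocks, not from the input files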
DMP Spark Accumulators val acc = new MyAccumulator() sc .register(acc) val ruleMatches = events .flatMap(...) .reduceByKey(...) .filter(...) .keys . map { item => acc.add(item) item } ruleMatches . saveAsNewAPIHadoopDataset (...)
For accumulator updates performed inside actions only , Spark guarantees that each task’s update to the accumulator will only be applied once , i.e. restarted tasks will not update the value. In transformations , users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed. https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
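A minimal illustration of that guarantee (illustrative code, not from the deck): the accumulator updated in a transformation can over-count when tasks are retried or partitions recomputed, while the one updated inside an action is applied exactly once per element.

val inMap     = sc.longAccumulator("updated-in-transformation")
val inForeach = sc.longAccumulator("updated-in-action")

val numbers = sc.parallelize(1 to 100, 4)
val mapped  = numbers.map { x => inMap.add(1L); x }   // may run again on retry or recompute
mapped.foreach(_ => inForeach.add(1L))                // guaranteed to count each element once
// inForeach.value is exactly 100; inMap.value can exceed 100 after re-executions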
DMP Spark Accumulators val data = 1L to 10L val acc = sc . longAccumulator val rdd = sc . makeRDD ( data ) . map { num => acc . add ( 1L ) num } . repartition ( 3 ) rdd . map ( failStage ( 2 )) . saveAsTextFile ( "/data/output" ) acc . value shouldBe data . length
DMP Spark Accumulators val data = 1L to 10L val acc = sc . longAccumulator val rdd = sc . makeRDD ( data ) . map { num => acc . add ( 1L ) num } . repartition ( 3 ) // <== Inserts new Stage here rdd . map ( failStage ( 2 )) . saveAsTextFile ( "/data/output" ) acc . value shouldBe data . length
DMP Spark Accumulators val data = 1L to 10L val acc = sc . longAccumulator val rdd = sc . makeRDD ( data ) . map { num => acc . add ( 1L ) num } . repartition ( 3 ) rdd .map( failStage (2)) .saveAsTextFile("/data/output") acc.value shouldBe data.length
DMP Spark Accumulators val data = 1L to 10L val acc = sc.longAccumulator val rdd = sc.makeRDD(data) .map { num => acc.add(1L) num } .repartition(3) rdd.map(failStage(2)) .saveAsTextFile("/data/output") acc.value shouldBe data.length 30 was not equal to 10 Expected: 10 Actual: 30
DMP Spark Accumulators val blockManager = SparkEnv. get .blockManager val block = blockManager. diskBlockManager .getAllBlocks() .filter(_.isInstanceOf[ShuffleDataBlockId]) .map(_.asInstanceOf[ShuffleDataBlockId]) .head throw new FetchFailedException( blockManager. blockManagerId , block.shuffleId, block.mapId, block.reduceId, "__spark_stage_failed__" )
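The failStage helper used in the earlier test is not shown in full. A sketch of how it might wrap this throw, assuming it fails only the first attempt of the chosen stage via TaskContext (stageAttemptNumber requires Spark 2.3+), could be:

import org.apache.spark.TaskContext

// Hypothetical helper: fake a fetch failure on the first attempt of `stageId`,
// so Spark resubmits the parent shuffle stage and the map with acc.add runs again.
def failStage[T](stageId: Int)(value: T): T = {
  val ctx = TaskContext.get()
  if (ctx.stageId() == stageId && ctx.stageAttemptNumber() == 0) {
    throw buildFetchFailed()   // assumed wrapper around the FetchFailedException shown above
  }
  value
}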
DMP Spark Accumulators val acc = new MyAccumulator() sc .register(acc) val ruleMatches = events .flatMap(...) .reduceByKey(...) .filter(...) .keys . map { item => acc.add(item) item } ruleMatches . saveAsNewAPIHadoopDataset (...)
DMP Spark Accumulators Action / Meaning: collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. count(): Return the number of elements in the dataset. ... saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. ... foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
DMP Spark Accumulators val acc = new MyAccumulator () sc .register(acc) val ruleMatches = events .flatMap(...) .reduceByKey(...) .filter(...) .keys ruleMatches. saveAsNewAPIHadoopDataset (...) ruleMatches. foreach ( acc . add )
DMP Spark Accumulators val acc = sc . longAccumulator ... ... val rdd = sc .textFile("/data/input", 5) rdd. saveAsTextFile ("/data/output") rdd. foreach (_ => acc.add ( 1L )) ... stats.getBytesRead shouldBe data.length * 10L + (data.length / 2 +- data.length / 2) BYTES READ: 20 841 256 (~20 MB)
DMP About the task: Custom RDD? def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope { // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038). val hadoopConf = conf val outputFormatInstance = hadoopConf.getOutputFormat val keyClass = hadoopConf.getOutputKeyClass val valueClass = hadoopConf.getOutputValueClass if (outputFormatInstance == null) { throw new SparkException("Output format class not set") } if (keyClass == null) { throw new SparkException("Output key class not set") } if (valueClass == null) { throw new SparkException("Output value class not set") } SparkHadoopUtil.get.addCredentials(hadoopConf) logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + ", " + valueClass.getSimpleName + ")") if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(self.conf)) { // FileOutputFormat ignores the filesystem parameter val ignoredFs = FileSystem.get(hadoopConf) hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf) } val writer = new SparkHadoopWriter(hadoopConf) writer.preSetup() val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => { // Hadoop wants a 32-bit task attempt ID, so if ours is bigger than Int.MaxValue, roll it // around by taking a mod. We expect that no task will be attempted 2 billion times. val taskAttemptId = (context.taskAttemptId % Int.MaxValue).toInt val (outputMetrics, callback) = SparkHadoopWriterUtils.initHadoopOutputMetrics(context) writer.setup(context.stageId, context.partitionId, taskAttemptId) writer.open() var recordsWritten = 0L Utils.tryWithSafeFinallyAndFailureCallbacks { while (iter.hasNext) { val record = iter.next() writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef]) // Update bytes written metric every few records SparkHadoopWriterUtils.maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten) recordsWritten += 1 } }(finallyBlock = writer.close()) writer.commit() outputMetrics.setBytesWritten(callback()) outputMetrics.setRecordsWritten(recordsWritten) } self.context.runJob(self, writeToFile) writer.commitJob() }
DMP About the task: Custom RDD or …? def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope { // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038). val hadoopConf = conf val outputFormatInstance = hadoopConf.getOutputFormat val keyClass = hadoopConf.getOutputKeyClass val valueClass = hadoopConf.getOutputValueClass if (outputFormatInstance == null) { throw new SparkException("Output format class not set") } if (keyClass == null) { throw new SparkException("Output key class not set") } if (valueClass == null) { throw new SparkException("Output value class not set") } SparkHadoopUtil.get.addCredentials(hadoopConf) logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + ", " + valueClass.getSimpleName + ")") if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(self.conf)) { // FileOutputFormat ignores the filesystem parameter val ignoredFs = FileSystem.get(hadoopConf) hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf) } val writer = new SparkHadoopWriter(hadoopConf) writer.preSetup() val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => { ... Utils.tryWithSafeFinallyAndFailureCallbacks { while (iter.hasNext) { val record = iter.next() writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef]) // Update bytes written metric every few records SparkHadoopWriterUtils.maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten) recordsWritten += 1 } }(finallyBlock = writer.close()) writer.commit() outputMetrics.setBytesWritten(callback()) outputMetrics.setRecordsWritten(recordsWritten) } self.context.runJob(self, writeToFile) writer.commitJob() }
… or SparkContext? def foreach(f: T => Unit): Unit = withScope { val cleanF = sc.clean(f) sc.runJob( this , (iter: Iterator[T]) => iter.foreach(cleanF)) }
… or SparkContext? def foreach(f: T => Unit): Unit = withScope { val cleanF = sc.clean(f) sc . runJob (this, ( iter: Iterator [ T ]) => iter.foreach ( cleanF )) }
… or SparkContext? trait ActionAccumulable extends SparkContext { private val accumulators = new ConcurrentHashMap[Long, ActionCallable[_]]() abstract override def register(acc: AccumulatorV2[_, _]): Unit = { super .register(acc) acc match { case _: ActionAccumulator[_, _] => this . accumulators .put(acc.id, ActionCallable (acc)) case _ => } } ... }
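The trait refers to two helper types the slides leave out. A hypothetical shape for them (names taken from the slide, bodies assumed) could be:

import org.apache.spark.util.AccumulatorV2

// Marker subtype the context looks for when an accumulator is registered.
abstract class ActionAccumulator[IN, OUT] extends AccumulatorV2[IN, OUT]

// Erases the element type so runJob can push arbitrary records into the accumulator.
case class ActionCallable[IN](acc: ActionAccumulator[IN, _]) {
  def add(rec: Any): Unit = acc.add(rec.asInstanceOf[IN])
}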
… or SparkContext? trait ActionAccumulable extends SparkContext { abstract override def runJob [T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U, ...): Unit = { val accFunc = (tc: TaskContext, iter: Iterator[T]) => { val accIter = new Iterator[T] { override def hasNext: Boolean = iter.hasNext override def next(): T = { val rec: T = iter.next() accumulators.values.foreach(_.add(rec)) rec } } func(tc, accIter) } super.runJob(rdd, accFunc, partitions, resultHandler) } }
… or SparkContext? trait ActionAccumulable extends SparkContext { abstract override def runJob[T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U, ...): Unit = { val accFunc = (tc: TaskContext, iter: Iterator[T]) => { val accIter = new Iterator[T] { override def hasNext : Boolean = iter . hasNext override def next (): T = { val rec : T = iter . next () accumulators.values.foreach(_.add(rec)) rec } } func(tc, accIter) } super.runJob(rdd, accFunc, partitions, resultHandler) } }
… or SparkContext? trait ActionAccumulable extends SparkContext { abstract override def runJob[T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U , ...): Unit = { val accFunc = (tc: TaskContext, iter: Iterator[T]) => { val accIter = new Iterator[T] { override def hasNext: Boolean = iter.hasNext override def next(): T = { val rec: T = iter.next() accumulators.values.foreach(_.add(rec)) rec } } func( tc, accIter) } super .runJob(rdd, accFunc, partitions, resultHandler) } }
… or SparkContext? trait ActionAccumulable extends SparkContext { abstract override def runJob[T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U, ...): Unit = { val accFunc = (tc: TaskContext, iter: Iterator[T]) => { val accIter = new Iterator[T] { override def hasNext: Boolean = iter.hasNext override def next(): T = { val rec: T = iter.next() accumulators.values.foreach(_.add(rec)) rec } } func( tc, accIter ) } super.runJob( rdd, accFunc , partitions, resultHandler ) } }
SparkContext! val sc = new SparkContext(...) with ActionAccumulable
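For completeness, a minimal assumed implementation of an accumulator that plugs into this pattern, simply counting every record that flows through an action (the talk's MyAccumulator is not shown):

import org.apache.spark.util.AccumulatorV2

class RecordCountAccumulator extends ActionAccumulator[Any, Long] {
  private var count = 0L
  override def isZero: Boolean = count == 0L
  override def copy(): RecordCountAccumulator = {
    val c = new RecordCountAccumulator
    c.count = count
    c
  }
  override def reset(): Unit = count = 0L
  override def add(v: Any): Unit = count += 1L
  override def merge(other: AccumulatorV2[Any, Long]): Unit = count += other.value
  override def value: Long = count
}

val acc = new RecordCountAccumulator
sc.register(acc)   // the ActionAccumulable context now updates it from every action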
DMP Spark Accumulators sparkConf . set ( "spark.testing" , "true" ) . set ( "spark.testing.memory" , ( 75*1024*1024 ). toString ) ... val acc = longAccumulator sc.register(acc) ... val rdd = sc .textFile("/data/input", 5)//.cache() ... rdd.count() acc.value shouldBe data.length ...
DMP Spark Accumulators sparkConf .set("spark.testing", "true") .set("spark.testing.memory", (75*1024*1024).toString) ... val acc = longAccumulator sc.register(acc) ... val rdd = sc .textFile("/data/input", 5)//.cache() ... rdd.count() acc.value shouldBe data.length ...
DMP Spark Accumulators sparkConf .set("spark.testing", "true") .set("spark.testing.memory", (75*1024*1024).toString) ... val acc = longAccumulator sc.register(acc) ... val rdd = sc . textFile ("/data/input", 5) //.cache() ... rdd.count() acc.value shouldBe data.length ...
DMP Spark Accumulators sparkConf .set("spark.testing", "true") .set("spark.testing.memory", (75*1024*1024).toString) ... val acc = longAccumulator sc.register(acc) ... val rdd = sc . textFile ("/data/input", 5) //.cache() ... rdd . count () acc.value shouldBe data.length ...
DMP Spark Accumulators sparkConf .set("spark.testing", "true") .set("spark.testing.memory", (75*1024*1024).toString) ... val acc = longAccumulator sc.register(acc) ... val rdd = sc . textFile ("/data/input", 5) //.cache() ... rdd . count () acc.value shouldBe data.length ... BYTES READ: 10 486 160 (~10 MB)
DMP Spark Accumulators sparkConf .set("spark.testing", "true") .set("spark.testing.memory", (75*1024*1024).toString) ... val acc = longAccumulator sc.register(acc) ... val rdd = sc .textFile("/data/input", 5)//.cache() ... rdd . count () acc.value shouldBe data.length ...
DMP Spark Accumulators sparkConf .set("spark.testing", "true") .set("spark.testing.memory", (75*1024*1024).toString) ... val acc = longAccumulator sc.register(acc) ... val rdd = sc .textFile("/data/input", 5)//.cache() ... // rdd . count () acc.value shouldBe data.length ...
DMP Spark Accumulators sparkConf .set("spark.testing", "true") .set("spark.testing.memory", (75*1024*1024).toString) ... val acc = longAccumulator sc.register(acc) ... val rdd = sc .textFile("/data/input", 5)//.cache() ... // rdd . count () rdd . saveAsTextFile ("/data/output") acc.value shouldBe data.length ...
DMP Spark Accumulators Task not serializable org.apache.spark.SparkException: Task not serializable at ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at ClosureCleaner$.clean(ClosureCleaner.scala:108) at SparkContext.clean(SparkContext.scala:2104) Caused by: java.io.NotSerializableException: JobConf Serialization stack : - object not serializable (class: JobConf, value: Configuration: ... - field class: PairRDDFunctions$$anonfun$saveAsHadoopDataset$1, name: conf$4, type: class JobConf - object class PairRDDFunctions$$anonfun$saveAsHadoopDataset$1, <function0> ... - field class: ActionAccumulable$$anonfun$1, name: func$1, type: interface.Function2 - object class ActionAccumulable$$anonfun$1, <function2>
DMP ClosureCleaner intro. Traverses the hierarchy of enclosing closures to null out any references that are not actually used by the starting closure but are still included in the compiled classes, so that "usually" non-serializable closures become serializable.
DMP ClosureCleaner intro. def foreach(f: T => Unit): Unit = withScope { val cleanF = sc.clean(f) sc.runJob( this , (iter: Iterator[T]) => iter.foreach(cleanF)) }
DMP ClosureCleaner intro. def foreach( f : T => Unit): Unit = withScope { val cleanF = sc.clean( f ) sc.runJob( this , (iter: Iterator[T]) => iter.foreach(cleanF)) }
DMP ClosureCleaner intro. def foreach( f : T => Unit): Unit = withScope { val cleanF = sc. clean ( f ) sc.runJob( this , (iter: Iterator[T]) => iter.foreach( cleanF )) }
DMP ClosureCleaner intro. trait ActionAccumulable extends SparkContext { abstract override def runJob[T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U, ...): Unit = { val accFunc = (tc: TaskContext, iter: Iterator[T]) => { val accIter = new Iterator[T] { override def hasNext: Boolean = iter.hasNext override def next(): T = { val rec: T = iter.next() accumulators.values.foreach(_.add(rec)) rec } } func(tc, accIter) } super .runJob(rdd, accFunc , partitions, resultHandler) } }
DMP ClosureCleaner intro. trait ActionAccumulable extends SparkContext { abstract override def runJob[T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U, ...): Unit = { val cleanF = clean(func) val accFunc = (tc: TaskContext, iter: Iterator[T]) => { val accIter = new Iterator[T] { override def hasNext: Boolean = iter.hasNext override def next(): T = { val rec: T = iter.next() accumulators.values.foreach(_.add(rec)) rec } } cleanF(tc, accIter) } super.runJob(rdd, accFunc, partitions, resultHandler) } }
DMP Spark Accumulators sparkConf .set("spark.testing", "true") .set("spark.testing.memory", (75*1024*1024).toString) ... val acc = longAccumulator sc.register(acc) ... val rdd = sc .textFile("/data/input", 5) //.cache() ... // rdd . count () rdd . saveAsTextFile ("/data/output") acc.value shouldBe data.length ...
DMP Spark Accumulators sparkConf .set("spark.testing", "true") .set("spark.testing.memory", (75*1024*1024).toString) ... val acc = longAccumulator sc.register(acc) ... val rdd = sc .textFile("/data/input", 5) //.cache() ... // rdd . count () rdd . saveAsTextFile ("/data/output") acc.value shouldBe data.length ... BYTES READ: 10 486 160 (~10 MB)
Datasets
DMP Why Datasets?