
Starting with Apache Spark, Best Practices and Learning from the Field (PowerPoint presentation)

Starting with Apache Spark, Best Practices and Learning from the Field
Felix Cheung, Principal Engineer and Spark Committer, Spark @ Microsoft

Topics: best practices, enterprise solutions. Spark background: resilient (fault tolerant), 19,500+ commits, Tungsten execution engine, originated at AMPLab.


  1. Repartition: to numPartitions or by columns. Increases parallelism, but will shuffle. coalesce() combines partitions in place (no shuffle).

  2. Cache: cache() or persist(). Cached blocks are flushed least-recently-used (LRU), so make sure there is enough memory! Use MEMORY_AND_DISK to avoid expensive recomputation (but spilling to disk is slow).

  3. Streaming: use Structured Streaming (2.1+). If not, and you have reliable messaging (Kafka), use the Direct DStream.

  4. Metadata checkpointing saves: config; the position in the streaming source (aka the offset) - replays could produce duplicates! (at-least-once); and pending batches.

  5. Persist stateful transformations - data is lost if not saved. Cut short execution (lineage) that could otherwise grow indefinitely.

  6. Direct DStream: checkpoints also store the offset. Turn off auto commit - commit only when in a good state, for exactly-once semantics.
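The "turn off auto commit" advice is a Kafka consumer setting. A sketch of the consumer parameters (broker address and group id are illustrative placeholders); the matching commit-after-processing call, commitAsync on the batch's offset ranges, lives in the Scala/Java Direct DStream API:

```python
# Kafka consumer parameters for a Direct stream with manual offset commits.
kafka_params = {
    "bootstrap.servers": "broker:9092",     # placeholder
    "group.id": "spark-app",                # placeholder
    "auto.offset.reset": "earliest",
    # Turn off auto commit: commit offsets yourself only after the batch has
    # been fully processed, so a crash cannot mark unprocessed data as done.
    "enable.auto.commit": "false",
}
```

With auto commit on, Kafka advances the committed offset on a timer, regardless of whether Spark finished the batch - exactly the failure mode the slide warns against.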

  7. Checkpointing: Streaming/ML/Graph/SQL - more efficient for indefinite/iterative jobs, and enables recovery. Generally not safe across Spark versions. Use a reliable distributed file system (caution with "object stores").

  8. External data sources - architecture diagram: hourly front-end and WebLog data lands in Hadoop/Hive; Spark SQL with the Hive Metastore over HDFS serves BI tools.

  9. Spark near-real-time ML (end-to-end roundtrip: 8-20 sec) - architecture diagram: front end feeds Kafka into Spark Streaming, with offline analysis on HDFS.

  10. BI tools - architecture diagram: BI tools query Spark SQL and Hive, alongside a SQL appliance and an RDBMS.
