
Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing



  1. Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing Ben Blamey, Andreas Hellander and Salman Toor Uppsala University, Sweden Ben.Blamey@it.uu.se Bench’19, Denver, USA, November 2019. http://www.benchcouncil.org/bench19/index.html

  2. Summary • Performance Benchmark for Streaming Frameworks – Apache Spark (under various integrations…) – HarmonicIO • Large Message Size (and higher processing cost) – Scientific use cases: microscopy • Key finding: ‘islands’ of good performance over that 2D domain, varying utility w.r.t. theoretical bounds.

  3. Background • Apache Spark – Enterprise grade (resilient, great features, etc.) – Proven performance for typical enterprise use cases. • HASTE Project: – Microscopy use cases – Message Size 1-10MB, >1 second per message. • How well do enterprise tools adapt to sci. computing?

  4. The Parameter Space • 2D Parameter Space • Theoretical Bounds – Network – CPU • How does performance generalize across this domain? [Figure: Message Size (x-axis) against Map Function CPU Cost (y-axis), with regions (A) CPU Bound, (B) Network Bound, (C) ‘Framework’ Bound.]
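
The two theoretical bounds on this slide can be made concrete with a short calculation. The following is a minimal sketch, not the authors' code: the function name `max_frequency_hz`, the link bandwidth and the core count are all assumptions chosen for illustration.

```python
def max_frequency_hz(message_bytes, cpu_seconds_per_message,
                     link_bytes_per_second, total_cores):
    """Upper bound on throughput (messages/second) at one point of the
    (message size, CPU cost) parameter space."""
    # Network bound: the link carries at most bandwidth / size messages per second.
    network_bound = link_bytes_per_second / message_bytes
    # CPU bound: each core completes at most 1 / cost messages per second.
    if cpu_seconds_per_message > 0:
        cpu_bound = total_cores / cpu_seconds_per_message
    else:
        cpu_bound = float("inf")  # nil CPU load: only the network limits throughput
    return min(network_bound, cpu_bound)


# Hypothetical cluster: 1 GB/s link, 40 worker cores (assumed values).
LINK_BYTES_PER_SECOND = 1e9
TOTAL_CORES = 40

# 1 MB messages costing 1 s each: the CPU bound dominates.
print(max_frequency_hz(1_000_000, 1.0, LINK_BYTES_PER_SECOND, TOTAL_CORES))
# 10 MB messages with nil CPU cost: the network bound dominates.
print(max_frequency_hz(10_000_000, 0.0, LINK_BYTES_PER_SECOND, TOTAL_CORES))
```

Evaluating this function over a grid of message sizes and CPU costs gives the ideal surface against which a framework's measured frequency can be compared.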

  5. HarmonicIO • Favors P2P message transfer. • Fallback to Master Queue • Processing runs inside Docker containers. • Intended for scientific computing applications. Image source: Torruangwatthana et al., HarmonicIO: Scalable Data Stream Processing for Scientific Datasets, IEEE Services 2018
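
The "P2P first, queue as fallback" behaviour can be illustrated with a toy dispatcher. This is a sketch of the general idea only, not HarmonicIO's implementation or API: the class `Dispatcher` and its methods are hypothetical names introduced for this example.

```python
from collections import deque


class Dispatcher:
    """Toy model: send directly to an idle worker, else queue at the master."""

    def __init__(self, worker_ids):
        self.idle = deque(worker_ids)   # workers currently free
        self.queue = deque()            # master queue for backlog

    def send(self, message):
        # Preferred path: hand the message straight to an idle worker (P2P).
        if self.idle:
            return ("p2p", self.idle.popleft())
        # Fallback path: no worker is free, so hold the message at the master.
        self.queue.append(message)
        return ("queued", None)

    def worker_done(self, worker):
        # A worker that finishes takes queued work first, if there is any.
        if self.queue:
            return (worker, self.queue.popleft())
        self.idle.append(worker)
        return (worker, None)


d = Dispatcher(["w1"])
print(d.send("a"))            # sent directly to w1
print(d.send("b"))            # w1 is busy, so the message is queued
print(d.worker_done("w1"))    # w1 picks up "b" from the queue
```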

  6. Methodology [Figure: four architectures, each with a Stream Source, a Master and Workers 1…N. HarmonicIO: Message Transfer – P2P Mode, Message Transfer – Queue Mode. Apache Spark Streaming w. File Streaming: File Transfer (NFS Share), File Listing (NFS Share). Apache Spark Streaming w. TCP: Message Transfer. Apache Spark Streaming w. Kafka: Message Transfer, with a Kafka Server.]

  7. Experimental Setup [Figure: Streaming Source, Spark Streaming Benchmarking Application, Throttling Application, Monitoring. Message content: CPU Pause, padded to length.]
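
The slide describes messages that carry a CPU pause and are padded to length. One way such a synthetic message could be built and consumed is sketched below. The wire format (an ASCII number, a `;` separator, then filler bytes), the helper names `make_message` and `process_message`, and the use of a busy-wait are all assumptions for illustration; the benchmark's actual encoding is not given on the slide.

```python
import time


def make_message(cpu_pause_seconds, total_bytes):
    """Build a message of exactly total_bytes that encodes its own CPU cost."""
    header = f"{cpu_pause_seconds:.6f};".encode("ascii")
    if len(header) > total_bytes:
        raise ValueError("total_bytes is too small to hold the header")
    # Pad with filler bytes so the message has the requested size.
    return header + b"x" * (total_bytes - len(header))


def process_message(message):
    """Simulate the map function: occupy one core for the encoded duration."""
    pause = float(message.split(b";", 1)[0])
    deadline = time.perf_counter() + pause
    while time.perf_counter() < deadline:
        pass  # busy-wait (assumed) so the core is actually occupied
    return pause


msg = make_message(0.01, 1000)
print(len(msg))               # message length in bytes
print(process_message(msg))   # the pause that was applied, in seconds
```

Keeping size and CPU cost independent in this way lets the two axes of the parameter space be varied separately.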

  8. The Workload StreamingBenchmark.scala

  9. Dark = High Freq Black = Best Light = Low Freq

  10. - Excellent Performance near origin. - 300 kHz - Relatively weaker for high CPU Load - Cores used for message forwarding - Crashes for Large Messages.

  11. - Excellent Performance near origin. - At origin, beaten by Spark+TCP - Weaker for high CPU load - Overhead of Kafka server - Weaker for larger messages. - Not intended use case.

  12. - Great Performance at low frequencies. - Spark’s filesystem polling struggles at high frequency.

  13. - Good overall performance. - Able to match performance of Spark+FS, and Spark+Kafka in their regions of good performance - …and in between. - Struggles at higher frequencies near origin.

  14. Results – Theoretical Bounds

  15. Performance for nil CPU Load

  16. Discussion • ‘islands’ of good performance.

  17. Conclusions • Choice of Spark Integration matters – depends on the parameters, frequency. • 2D Parameter Sweep is a nice way to visualize performance. • Various phenomena visible only in some regions: – Bottlenecks, overhead costs. – Varying utility (w.r.t. theoretical bounds). • ‘Middle Region’ – 1-10MB, >1 second cost – Neglected in streaming benchmark studies? – A region where HarmonicIO does well.

  18. Funding The HASTE Project (http://haste.research.it.uu.se/) is funded by the Swedish Foundation for Strategic Research (SSF) under award no. BD15-0008, and the eSSENCE strategic collaboration for eScience.

  19. Questions? Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing Ben Blamey, Andreas Hellander and Salman Toor Uppsala University, Sweden http://haste.research.it.uu.se/ https://github.com/HASTE-project/HarmonicIO https://github.com/HASTE-project

  20. Results
