  1. Performance and Insights on File Formats – 2.0 Luca Menichetti, Vag Motesnitsalis

  2. Design and Expectations
  - 2 use cases: exhaustive (an operation using all values of a record) and selective (an operation using limited values of a record).
  - 5 data formats: CSV, Parquet, serialized RDD objects, JSON, Apache Avro.
  - The tests gave insights on specific advantages and disadvantages of each format, as well as their time and space performance.

  3. Experiment descriptions
  - For the "exhaustive" use case (UC1) we used EOS logs "processed" data. Current default data format: CSV.
  - For the "selective" use case (UC2) we used experiment Job Monitoring data from the Dashboard. Current default data format: JSON.
  - For each use case, all formats were generated a priori (from the default format) and the tests were then executed.
  - Technology: Spark (Scala) with the SparkSQL library.
  - No test was performed with compression.

  4. Formats
  - CSV – text files, comma-separated values, one record per line
  - JSON – text files, JavaScript objects, one per line
  - Serialized RDD Objects (SRO) – Spark datasets serialized to text files
  - Avro – serialization format with binary encoding
  - Parquet – columnar format with binary encoding
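To make the difference between the two text formats concrete, here is a minimal sketch (Python standard library only; the record and its field names are purely illustrative, not taken from the EOS or Job Monitoring data) encoding one record as a CSV line and as a JSON line:

```python
import csv
import io
import json

# One hypothetical log record; field names are illustrative only.
record = {"path": "/eos/user/a/file.root", "bytes_read": 1048576, "ts": 1436400000}

# CSV: values only, one record per line -- the schema lives outside the file.
buf = io.StringIO()
csv.writer(buf).writerow(record.values())
csv_line = buf.getvalue().strip()

# JSON: one self-describing object per line -- the keys are repeated in every record.
json_line = json.dumps(record)

print(csv_line)
print(json_line)
```

The JSON line carries its keys in every record, which is why JSON is the most space-hungry format in the measurements below, while CSV stores values only and leaves the schema implicit.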

  5. Space Requirements (in GB)

              CSV     JSON    SRO     Avro    Parquet
    UC1       12.7*   23.4    19.9    11.7    6.3
    UC2       6.2     13.4*   8.1     6.6     2.1

  (* = current default format: CSV for the EOS logs, JSON for Job Monitoring.)

  6. Spark executions

    for i in {1..50}; do
      for format in CSV JSON SRO Avro Parquet; do
        for UC in UC1 UC2; do
          spark-submit --num-executors 2 --executor-cores 2 \
            --executor-memory 2G \
            --class ch.cern.awg.Test$UC$format \
            formats-analyses.jar input-$UC-$format > output-$UC-$format-$i
        done
      done
    done

  We took the times from all (UC, format) jobs to calculate an average for each type of execution (deleting outliers). Times include reading and computation (the test jobs don't write any file; they just print the result to stdout).
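The slides don't specify the exact outlier-deletion rule used for the averages; a minimal sketch of one plausible choice (drop the smallest and largest measurements before averaging; the timing values below are made up for illustration):

```python
def trimmed_avg(times, trim=1):
    """Average after dropping the `trim` smallest and largest values
    (one plausible reading of 'deleting outliers'; the slides don't
    state the exact rule)."""
    s = sorted(times)
    kept = s[trim:len(s) - trim]
    return sum(kept) / len(kept)

# Hypothetical wall-clock times (seconds) from repeated runs of one (UC, format) job;
# the 150.2 s run is the kind of outlier the trimming removes.
runs = [80.1, 79.5, 81.0, 150.2, 78.9, 80.4]
print(round(trimmed_avg(runs), 2))
```

Dropping extremes rather than using the raw mean keeps one slow run (e.g. a busy cluster) from dominating a 50-run average.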

  7. Times: UC1 "Exhaustive"
  (input sizes: CSV 12.7 GB, JSON 23.4 GB, SRO 19.9 GB, Avro 11.7 GB, Parquet 6.3 GB)

              CSV     JSON    SRO     Avro    Parquet
    AVG (s)   78.5    245.9   155.8   80.7    108.4
    MIN (s)   65.41   205     131.6   80.3    64.5

  (Values read from the chart's data labels; per-format AVG/MIN pairing inferred.)

  8. Times: UC2 "Selective"
  (input sizes: CSV 6.2 GB, JSON 13.4 GB, SRO 8.1 GB, Avro 6.6 GB, Parquet 2.1 GB)

              CSV     JSON    SRO     Avro    Parquet
    AVG (s)   40.4    109.5   74.6    52.6    23.7
    MIN (s)   35.1    83.3    63.6    42      18.4

  (Values read from the chart's data labels; per-format AVG/MIN pairing inferred.)

  9. Time Comparison between UC1 and UC2
  [Figure: average execution times (seconds, 0-300) for UC1 vs UC2, per format: CSV, JSON, SRO, Avro, Parquet.]

  10. Space and Time Performance Gain/Loss (compared to the current default format)

  UC1 [EOS logs, default format: CSV]
                        CSV    JSON     SRO     Avro    Parquet
    Space               =      +84%     +56%    -8%     -51%
    Time performance    =      +215%    +93%    =       +35%

  UC2 [Job Monitoring, default format: JSON]
                        CSV    JSON     SRO     Avro    Parquet
    Space               -54%   =        -40%    -51%    -84%
    Time performance    -64%   =        -35%    -54%    -79%
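The space rows can be re-derived from the slide-5 sizes; a quick check in Python (sizes in GB taken from slide 5; one-point differences from the slide's percentages are expected, depending on how the slide rounded):

```python
# Sizes in GB from the "Space Requirements" slide.
sizes_uc1 = {"CSV": 12.7, "JSON": 23.4, "SRO": 19.9, "Avro": 11.7, "Parquet": 6.3}
sizes_uc2 = {"CSV": 6.2, "JSON": 13.4, "SRO": 8.1, "Avro": 6.6, "Parquet": 2.1}

def gain_vs(default, sizes):
    """Percent size change of each format relative to the default format."""
    base = sizes[default]
    return {fmt: round(100 * (size - base) / base) for fmt, size in sizes.items()}

print(gain_vs("CSV", sizes_uc1))   # UC1: vs CSV (EOS default)
print(gain_vs("JSON", sizes_uc2))  # UC2: vs JSON (Job Monitoring default)
```

For UC1 this reproduces JSON +84% and Avro -8% exactly, with SRO and Parquet within one percentage point of the slide's figures.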

  11. Pros and Cons

  CSV
  - Pros: always supported and easy to use; efficient.
  - Cons: no schema change allowed; no type definitions; no declaration control.

  JSON
  - Pros: encoded in plain text (easy to use); schema changes allowed.
  - Cons: inefficient; high space consumption; no declaration control.

  Serialized RDD Objects (SRO)
  - Pros: declaration control; a middle ground "between" CSV and JSON (for space and time); good for storing aggregate results.
  - Cons: Spark only; no compression; schema changes allowed but must be implemented manually.

  Avro
  - Pros: schema changes allowed; efficiency comparable to CSV; compression definition included in the schema.
  - Cons: space consumption like CSV (not really a negative); needs a plugin (we found an incompatibility between our Spark version and the Avro library; we had to fix and recompile it).

  Parquet
  - Pros: low space consumption (RLE); extremely efficient for "selective" use cases, with good performance in other cases too.
  - Cons: needs a plugin; slow to generate.

  12. Data Formats - Overview
  - Supports change of schema: CSV NO; JSON YES; SRO YES; Avro YES; Parquet YES.
  - Primitive/complex types: CSV -; JSON YES (but with general numerics); SRO YES; Avro YES; Parquet YES.
  - Declaration control: CSV -; JSON NO; SRO YES; Avro YES; Parquet YES.
  - Supports compression: CSV YES; JSON YES; SRO NO; Avro YES; Parquet YES.
  - Storage consumption: CSV Medium; JSON High; SRO Medium/High; Avro Medium; Parquet Low (RLE).
  - Supported by which technologies: CSV All; JSON All (to be parsed from text); SRO Spark only; Avro All (needs a plugin); Parquet All (needs a plugin).
  - Possibility to print a snippet as a sample: CSV YES; JSON YES; SRO NO; Avro YES (with avro-tools); Parquet NO (yes with unofficial tools).

  13. Conclusions
  There is no "ultimate" file format, but...
  - Avro shows promising results for exhaustive use cases, with performance comparable to CSV.
  - Parquet shows extremely good results for selective use cases and very low space consumption.
  - JSON is good for directly storing, without any additional effort, data coming from web-like services that might change their format in the future, but it is too inefficient and consumes too much space.
  - CSV is still quite efficient in time and space, but the schema is frozen and validation is left up to the user.
  - Serialized Spark RDDs are a good solution for storing Scala objects that need to be reused soon (such as aggregated results to plot, or intermediate results saved for future computation), but they are not advisable as a final format, since they are not a general-purpose format.

  14. Thank You 14

  15. Spark UC1 executions
  [Figure: UC1 execution times (0-500 s) per format (CSV, JSON, SRO, Avro, Parquet) for three Spark configurations (EM, NE, EC): 2G 2 1, 2G 2 2, 2G 4 2.]

  16. Spark UC2 executions
  [Figure: UC2 execution times (0-250 s) per format (CSV, JSON, SRO, Avro, Parquet) for three Spark configurations (EM, NE, EC): 2G 2 1, 2G 2 2, 2G 4 2.]
