Are objects the right level of abstraction to enable the convergence between HPC and Big Data at storage level?

  1. Are objects the right level of abstraction to enable the convergence between HPC and Big Data at storage level? Pierre Matri*, Alexandru Costan✝, Gabriel Antoniu◇, Jesús Montes*, María S. Pérez*, Luc Bougé◇ — * Universidad Politécnica de Madrid, Madrid, Spain — ✝ INSA Rennes / IRISA, Rennes, France — ◇ Inria Rennes – Bretagne Atlantique, Rennes, France

  2. A Catalyst for Convergence: Data Science

  3. An Approach: The BigStorage H2020 Project. Can we build a converged storage system for HPC and Big Data?

  4. The BigStorage Consortium

  5. One concern

  6. HPC App

  7. HPC App (POSIX) File System

  8. HPC App (POSIX) File System

  9. Folder / file hierarchies, permissions, supports random reads and writes to files, atomic file renaming, multi-user protection

  10. Supports random reads and writes to files

  11. Supports random reads and writes to files: Objects

  12. HPC App Object Storage System

  13. HPC App Big Data App Object Storage System

  14. HPC App Big Data App Object Storage System K/V Store DB FS

  15. Big Data App HPC App K/V Store DB Object Storage System FS

  16. Big Data App HPC App K/V Store DB Object Storage System Object Storage System

  17. Big Data App HPC App K/V Store DB Converged Object Storage System

  18. A Big Data use case: MonALISA, the monitoring platform of the ALICE experiment at the CERN LHC

  19. One problem… A scientific monitoring service for the ALICE experiment at the CERN LHC: - ingests events at a rate of up to 16 GB/s - produces more than 10^9 data files per year - computes 35,000+ aggregates in real time. The current lock-based platform does not scale. …and multiple requirements: - multi-object write synchronization support - atomic, lock-free writes - high-performance reads - horizontal scalability

  20. Why is write synchronization needed? Aggregate computation is a three-step operation: 1. read the current value remotely from storage (read(count)), 2. update it with the new data, 3. write the updated value remotely back to storage (write(count, 6)). The aggregate update therefore needs to be atomic (a transaction). In addition, appending new data to persistent storage and updating the related aggregates needs to be performed atomically as well.
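To make the race concrete, here is a minimal, self-contained sketch of the three-step update; storage_read and storage_write are in-memory stand-ins for the remote calls, not the actual MonALISA or Týr code.

```c
/* Sketch of the read-update-write race on an aggregate.
 * storage_read / storage_write are hypothetical stand-ins for remote calls. */
#include <stdint.h>
#include <stdio.h>

static int64_t remote_count = 0;              /* stands in for the stored object */

static int64_t storage_read(void)       { return remote_count; }  /* step 1 */
static void    storage_write(int64_t v) { remote_count = v;    }  /* step 3 */

static void update_aggregate(int64_t delta) {
    int64_t current = storage_read();         /* 1. read current value remotely  */
    int64_t updated = current + delta;        /* 2. update it with the new data  */
    storage_write(updated);                   /* 3. write updated value remotely */
}

int main(void) {
    /* Two clients interleaving steps 1-3 on the same object can overwrite
     * each other's update; the whole sequence must run as one transaction. */
    update_aggregate(6);
    printf("count = %lld\n", (long long)remote_count);
    return 0;
}
```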

  21. At which level should concurrency management be handled?

  22. At the application level? Enables fine-grained synchronization (application knowledge), but significantly complicates application design and typically only guarantees isolation. At a middleware level (a synchronization layer)? Eases application design, but has a performance cost (zero knowledge of the storage) and usually also only guarantees isolation. At the storage level (a transactional object storage system)? Also eases application design, offers better performance than middleware (storage knowledge), and may offer additional consistency guarantees.
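As a rough illustration of the application-level option, the sketch below guards the read-update-write sequence with a process-local lock; the lock, the variable names and the in-memory stand-in for remote storage are all illustrative.

```c
/* Application-level synchronization sketch: the application itself guards
 * the read-update-write sequence with a lock it has to manage. */
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int64_t remote_count = 0;        /* stands in for the stored aggregate */

static void update_aggregate_locked(int64_t delta) {
    pthread_mutex_lock(&count_lock);    /* fine-grained, app-chosen lock scope */
    int64_t current = remote_count;     /* read                                */
    remote_count = current + delta;     /* update + write back                 */
    pthread_mutex_unlock(&count_lock);
    /* This only isolates the threads of one process; it does not compose
     * across clients, and it pushes synchronization complexity into the app. */
}

int main(void) {
    update_aggregate_locked(6);
    return 0;
}
```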

  23. Aren’t existing transactional object stores enough?

  24. Not quite. Existing transactional systems typically only ensure the consistency of writes. In most current systems, reads are atomic only because objects are small enough to be located on a single server, i.e. records for database systems, or values for key-value stores. Yet, for large objects, reads spanning multiple chunks should always return a consistent view.
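A small, self-contained sketch of the torn-read problem on a two-chunk object; the chunk size and the in-memory stand-ins are illustrative, not drawn from any real system.

```c
/* Why multi-chunk reads need a consistent view: without snapshot isolation,
 * a reader can observe half old and half new data. */
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 8
static char chunk0[CHUNK_SIZE + 1] = "AAAAAAAA";   /* stored on server 1 */
static char chunk1[CHUNK_SIZE + 1] = "AAAAAAAA";   /* stored on server 2 */

/* A reader fetching chunks one by one, with no snapshot isolation. */
static void read_object(char *out) {
    strcpy(out, chunk0);                 /* fetch chunk 0                      */
    /* ...a concurrent writer atomically replaces the WHOLE object here...    */
    strcpy(chunk0, "BBBBBBBB");
    strcpy(chunk1, "BBBBBBBB");
    strcat(out, chunk1);                 /* fetch chunk 1                      */
}

int main(void) {
    char buf[2 * CHUNK_SIZE + 1];
    read_object(buf);
    /* The reader observes "AAAAAAAABBBBBBBB": a state that never existed.
     * A consistent multi-chunk read protocol prevents this. */
    printf("%s\n", buf);
    return 0;
}
```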

  25. Týr transactional design: Týr internally maps all writes to transactions - multi-chunk, and even multi-object, operations are processed in a serializable order - all chunk replicas are kept consistent. Týr uses a high-performance, sequentially-consistent transaction chain algorithm: Warp [1]. [1] R. Escriva et al., Warp: Lightweight Multi-Key Transactions for Key-Value Stores
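From the client's perspective, a multi-object transactional write could look roughly like the sketch below; tyr_tx_begin, tyr_tx_write and tyr_tx_commit are hypothetical names chosen for illustration, not the documented Týr API.

```c
/* Rough client-side view of a multi-object transactional write.
 * All function and object names here are hypothetical. */
#include <stddef.h>
#include <stdint.h>

typedef struct tyr_tx tyr_tx;          /* opaque transaction handle (assumed) */

tyr_tx *tyr_tx_begin(void);            /* start a transaction                 */
int     tyr_tx_write(tyr_tx *tx, const char *object, uint64_t offset,
                     const void *buf, size_t len);
int     tyr_tx_commit(tyr_tx *tx);     /* ordered via WARP-style chains       */

/* Append a monitoring event and update the related aggregate atomically:
 * either both writes become visible, or neither does. */
int store_event_and_aggregate(const void *event, size_t elen, uint64_t eoff,
                              const void *aggregate, size_t alen) {
    tyr_tx *tx = tyr_tx_begin();
    if (tx == NULL) return -1;
    tyr_tx_write(tx, "monitoring-data", eoff, event, elen);
    tyr_tx_write(tx, "aggregate-count", 0, aggregate, alen);
    return tyr_tx_commit(tx);
}
```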

  26. Týr is alive! Fully implemented as a prototype with ~22,000 lines of C. Lock-free, queue-free, asynchronous design. It leverages well-known technologies: - Google LevelDB [1] for node-local persistent storage - Google FlatBuffers [2] for message serialization - UDT [3] as the network transfer protocol. [1] http://leveldb.org/ [2] https://google.github.io/flatbuffers [3] http://udt.sourceforge.net/
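For reference, node-local persistence with LevelDB is exposed through its C API (leveldb/c.h); the following is a generic usage sketch showing how a chunk could be stored under a key. It is not code from the Týr prototype, and the key layout is made up.

```c
/* Generic LevelDB (C API) usage sketch: persist one chunk under a key. */
#include <leveldb/c.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char *err = NULL;

    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_t *db = leveldb_open(opts, "/tmp/chunk-store", &err);
    if (err) { fprintf(stderr, "open: %s\n", err); return 1; }

    /* Key: a (made-up) chunk identifier; value: the chunk payload. */
    const char *key = "object42/chunk0";
    const char *val = "chunk payload bytes";
    leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
    leveldb_put(db, wopts, key, strlen(key), val, strlen(val), &err);
    if (err) { fprintf(stderr, "put: %s\n", err); return 1; }

    leveldb_writeoptions_destroy(wopts);
    leveldb_options_destroy(opts);
    leveldb_close(db);
    return 0;
}
```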

  27. Týr evaluation with MonALISA: the MonALISA data collection was re-implemented atop Týr and evaluated using real data. Týr was compared to other state-of-the-art, object-based storage systems: - RADOS / librados (Ceph) - Azure Storage Blobs (Microsoft) - BlobSeer (Inria). Experiments were run on the Microsoft Azure cloud on up to 256 nodes, with a 3x replication factor for all systems.

  28. Synchronized write performance: evaluating transactional write performance. We add fine-grained, application-level, lock-based synchronization to Týr's competitors. The performance of Týr's competitors decreases due to the synchronization cost. Clear advantage of atomic operations over read-update-write aggregate updates. [Chart: average throughput (millions of ops/sec, 0 to 5) vs. concurrent writers (0 to 500); series: Týr (atomic operations), Týr (read-update-write), RADOS (synchronized), BlobSeer (synchronized), Azure Blobs (synchronized).]

  29. Read performance. We simulate MonALISA reads, varying the number of concurrent readers. Slightly lower performance than RADOS, but Týr offers read consistency guarantees. Týr's lightweight read protocol allows it to outperform BlobSeer and Azure Storage. [Chart: average throughput (millions of ops/sec, 0 to 8) vs. concurrent readers (0 to 500); series: Týr, RADOS, BlobSeer, Azure Blobs.]

  30. The next step: Týr as a base layer for higher-level storage abstractions (RDB, K/V store) for Big Data applications? Týr for HPC applications? [Diagram: Big Data App and HPC App on top of an RDB / K/V Store layer, on top of a Converged Object Storage System.]

  31. Before that: a feasibility study

  32. Current storage stack: HPC Apps issue I/O library calls to an I/O Library, which issues POSIX-like calls to an HPC PFS; Big Data Apps issue framework calls to a Big Data Framework, which sits on top of a Big Data DFS.

  33. “Converged” storage stack: the I/O Library and Big Data Framework layers are kept, but their calls now go to an HPC Adapter and a Big Data Adapter respectively, which translate them into object-based storage calls on a single Converged Object Storage System.

  34. Object-oriented primitives: - Object access: random object read, object size - Object manipulation: random object write, truncate - Object administration: create object, delete object - Namespace access: scan all objects. These operations are similar to those permitted by the POSIX I/O API on a single file. Directory-level operations have no object-based storage counterpart (due to the flat nature of these kinds of systems); there are few of them, and they are emulated using the scan operation (far from optimal, but compensated by the gains permitted by a flat namespace and simpler semantics). A minimal sketch of such an interface follows below.
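A minimal C sketch of how these primitives could be expressed as an interface; names and signatures are illustrative, not the actual Týr client API.

```c
/* Illustrative object-store interface covering the primitives listed above. */
#include <stddef.h>
#include <stdint.h>

typedef struct obj_store obj_store;   /* opaque client handle (assumed) */

/* Object access */
int64_t obj_read(obj_store *s, const char *name,
                 uint64_t offset, void *buf, size_t len);       /* random read  */
int64_t obj_size(obj_store *s, const char *name);

/* Object manipulation */
int64_t obj_write(obj_store *s, const char *name,
                  uint64_t offset, const void *buf, size_t len); /* random write */
int     obj_truncate(obj_store *s, const char *name, uint64_t new_size);

/* Object administration */
int     obj_create(obj_store *s, const char *name);
int     obj_delete(obj_store *s, const char *name);

/* Namespace access: iterate over all objects of the flat namespace
 * (used to emulate the few directory-level POSIX operations). */
int     obj_scan(obj_store *s,
                 int (*visit)(const char *name, void *arg), void *arg);
```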

  35. Representative set of HPC/BD applications:
      Platform    | Application         | Usage                | Total reads | Total writes | R/W ratio | Profile
      HPC/MPI     | mpiBLAST            | Protein docking      | 27.7 GB     | 12.8 MB      | 2.1*10^3  | Read-intensive
      HPC/MPI     | MOM                 | Oceanic model        | 19.5 GB     | 3.2 GB       | 6.01      | Read-intensive
      HPC/MPI     | ECOHAM              | Sediment propagation | 0.4 GB      | 9.7 GB       | 4.2*10^-2 | Write-intensive
      HPC/MPI     | Ray Tracing         | Video processing     | 67.4 GB     | 71.2 GB      | 0.94      | Balanced
      Cloud/Spark | Sort                | Text processing      | 5.8 GB      | 5.8 GB       | 1.00      | Balanced
      Cloud/Spark | Connected Component | Graph processing     | 13.1 GB     | 71.2 MB      | 0.18      | Read-intensive
      Cloud/Spark | Grep                | Text processing      | 55.8 GB     | 863.8 MB     | 64.52     | Read-intensive
      Cloud/Spark | Decision Tree       | Machine learning     | 59.1 GB     | 4.7 GB       | 12.58     | Read-intensive
      Cloud/Spark | Tokenizer           | Text processing      | 55.8 GB     | 235.7 GB     | 0.24      | Write-intensive

  36. [Figure-only slide]

  37. Rewriting POSIX operations as object operations (see the sketch below):
      Original operation | Rewritten operation
      create(/foo/bar)   | create(/foo__bar)
      open(/foo/bar)     | open(/foo__bar)
      read(fd)           | read(bd)
      write(fd)          | write(bd)
      mkdir(/foo)        | dropped operation
      opendir(/foo)      | scan(/), return all files matching /foo__*
      rmdir(/foo)        | scan(/), remove all files matching /foo__*

      Directory-level operation counts:
      Operation                      | Action              | Operation count
      mkdir                          | Create directory    | 43
      rmdir                          | Remove directory    | 43
      opendir (input data directory) | Open/List directory | 5
      opendir (other directories)    | Open/List directory | 0
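A small, self-contained C sketch of the rewriting rule: directory separators are folded into the object name, and opendir is emulated by a prefix match over a flat-namespace scan. flatten_path and matches_directory are illustrative helpers, not code from the study.

```c
/* Path flattening and opendir emulation over a flat object namespace. */
#include <stdio.h>
#include <string.h>

/* "/foo/bar" -> "foo__bar" (written into out, which must be large enough). */
static void flatten_path(const char *path, char *out, size_t outlen) {
    size_t j = 0;
    for (size_t i = (path[0] == '/') ? 1 : 0; path[i] && j + 2 < outlen; i++) {
        if (path[i] == '/') { out[j++] = '_'; out[j++] = '_'; }
        else                { out[j++] = path[i]; }
    }
    out[j] = '\0';
}

/* opendir("/foo") emulation: keep only objects whose name starts with "foo__",
 * as returned by a scan over the whole flat namespace. */
static int matches_directory(const char *object_name, const char *dir_path) {
    char prefix[256];
    flatten_path(dir_path, prefix, sizeof prefix - 2);
    strcat(prefix, "__");
    return strncmp(object_name, prefix, strlen(prefix)) == 0;
}

int main(void) {
    char name[256];
    flatten_path("/foo/bar", name, sizeof name);
    printf("%s -> %s (in /foo: %d)\n", "/foo/bar", name,
           matches_directory(name, "/foo"));
    return 0;
}
```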

  38. [Figure-only slide]

  39. Týr and RADOS vs. Lustre (HPC) and HDFS / CephFS (Big Data): - Grid'5000 experimental testbed, distributed over 11 sites in France and Luxembourg (parapluie cluster, Rennes) - 2 x 12-core 1.7 GHz AMD Opteron 6164 HE, 48 GB of RAM and a 250 GB HDD per node - HPC applications: Lustre 2.9.0 and MPICH 3.2 on a 32-node cluster - Big Data applications: Spark 2.1.0, Hadoop / HDFS 2.7.3 and Ceph Kraken on a 32-node cluster

  40. HPC applications

  41. BD applications

  42. HPC/BD applications

  43. Conclusions: - Týr is a novel high-performance object-based storage system providing built-in multi-object transactions - Object-based storage convergence is possible, leading to significant performance improvements on both platforms (HPC and Cloud) - Completion time improves by up to 25% for Big Data applications and 15% for HPC applications when using object-based storage

  44. Thank you!
