Scibox: Online Sharing of Scientific Data via the Cloud

  1. Scibox: Online Sharing of Scientific Data via the Cloud
     Jian Huang†, Xuechen Zhang†, Greg Eisenhauer†, Karsten Schwan†, Matthew Wolf†,‡, Stephane Ethier ǂ, Scott Klasky‡
     † CERCS Research Center, Georgia Tech
     ǂ Princeton Plasma Physics Laboratory
     ‡ Oak Ridge National Laboratory
     Supported in part by funding from the US Department of Energy for DOE SDAV SciDAC

  2. Outline
     • Background and Motivation
     • Problems and Challenges
     • Design and Implementation
     • Evaluation
     • Conclusion and Future Work

  5. Cloud Storage is Popular
     • Ease of use
     • Pay-as-you-go model
     • Universal accessibility
     • Good scalability and durability
     Services built on cloud storage:
     • Dropbox, Google Drive, iCloud, SkyDrive, etc.
     Scibox: focus on scientific data sharing

  12. Use Cases for Cloud Storage
      [Diagram labels: Combustion Experimental Data, Visualization, Image Processing, GTS/LAMMPS, Private Cloud, Public Cloud, Georgia Tech (Atlanta), WSU (Detroit), OSU (Columbus), Student PC, Vogue, Aero Cluster]
      1. Ease of use
      2. Universal accessibility
      3. Good scalability

  13. Outline
      • Background and Motivation
      • Problems and Challenges
      • Design and Implementation
      • Evaluation
      • Conclusion and Future Work

  17. Cloud Storage is Not Ready for HEC
      Scientific applications are data-intensive
      • Generate large amounts of data
      Networking bandwidth is limited
      • Inadequate levels of ingress and egress bandwidth available to/from remote cloud stores
      High costs imposed by cloud providers
      • Expensive for large amounts of data when using the pay-as-you-go model
      An example:
      • A GTS run on 29K cores on the Jaguar machine at OLCF generates over 54 terabytes of data in a 24-hour period.
      • Amazon S3: ~$0.03/GB for storage and $0.09/GB for data transfer out.
      • Cost: $6635.52/day, increasing with the number of collaborators
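The daily cost quoted on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes 1 TB = 1024 GB (which matches the slide's figure) and that every additional collaborator downloads the full output once. That per-consumer egress model and the function name `daily_cost` are illustrative assumptions, not something the slides specify.

```python
GB_PER_TB = 1024  # assumption: binary terabytes, which reproduces the slide's number


def daily_cost(tb_per_day, storage_per_gb=0.03, egress_per_gb=0.09, consumers=1):
    """Estimate the daily cloud bill in dollars.

    Storage is paid once for the uploaded data; transfer-out is paid once
    per consumer (hypothetical model: each consumer downloads everything).
    """
    gb = tb_per_day * GB_PER_TB
    return gb * storage_per_gb + gb * egress_per_gb * consumers


# One GTS run producing 54 TB in 24 hours, read by a single consumer.
print(f"{daily_cost(54):.2f}")  # prints 6635.52

# The same data shared with three collaborators under the assumed model.
print(f"{daily_cost(54, consumers=3):.2f}")
```

Under this model only the transfer-out term grows with the number of collaborators, which is one way to read the slide's remark that the cost increases as more people share the data.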

  18. Problem: Too Much Data Movement
      Issue: the naïve approach transfers lots of data, even if only some of it is needed
      Example: output of the GTS fusion modeling simulation
      • Checkpoint data, diagnosis data, visualization data, etc.
      • Each data subset includes many elements
      [Figure: time steps 0, 1, …, each with subsets A0 … An, B0 … Bn, C0 … Cn, shown on both the data producer and data consumer sides of the cloud]

  23. Problem: Too Much Data Movement
      Issue: the naïve approach transfers lots of data, even if only some of it is needed
      • Scientific data formats: BP/HDF5 (structured and metadata-rich)
      • Standard I/O interface: ADIOS (almost transparent to users)
      [Figure: time steps 0, 1, …, each with subsets A0 … An, B0 … Bn, C0 … Cn; data producer, cloud, data consumer]
      Goal: reduce data transfer from producers to consumers
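The gap between the naïve transfer and the stated goal can be made concrete with a toy model. The sketch below is not ADIOS or BP/HDF5 code: it uses a plain Python dict as a stand-in for a self-describing time step, and the names `make_time_step`, `size_in_bytes`, and `select` are hypothetical helpers introduced only for illustration.

```python
from array import array


def make_time_step(n_elements):
    """Build one hypothetical time step holding three named subsets of doubles."""
    values = [float(i) for i in range(n_elements)]
    return {name: array("d", values) for name in ("A", "B", "C")}


def size_in_bytes(step):
    """Total payload size of the subsets contained in a time step."""
    return sum(len(data) * data.itemsize for data in step.values())


def select(step, wanted):
    """Keep only the subsets a consumer asked for, looked up by name."""
    return {name: step[name] for name in wanted if name in step}


steps = [make_time_step(1000) for _ in range(2)]

# Naive approach: every subset of every time step crosses the network.
naive = sum(size_in_bytes(step) for step in steps)

# Selective approach: the consumer only needs subset "B".
selective = sum(size_in_bytes(select(step, ["B"])) for step in steps)

print(naive, selective)
```

Because the consumer names the subset it wants instead of giving file offsets, it needs no knowledge of the data layout. In this toy case the selective transfer is one third of the naïve one.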

  29. Solutions for Minimizing Data Transfers for Data Sharing
      Compression
      • Helps, but the compression ratio can be low for floating-point scientific data
      Data selection
      • Files: users can ask for subsets of data files by specifying file offsets
      • Requires knowledge of the data layout in files
      Content-based data indexing
      • Useful, but may require large amounts of metadata
      Scibox approaches:
      • Filter unnecessary data at the producer side via metadata (uploads)
      • Merge overlapping subsets when multiple users share the same data (uploads)
      • Minimize data sharing cost in cloud storage via a new software protocol (downloads)
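The second Scibox approach, merging overlapping subsets requested by multiple users, can be sketched as interval merging. The slides do not describe Scibox's actual request format or algorithm. The example below assumes each request is a hypothetical `(variable, start, end)` tuple with a half-open element range, and `merge_requests` and `upload_volume` are illustrative names.

```python
from collections import defaultdict


def merge_requests(requests):
    """Merge overlapping half-open index ranges, separately for each variable."""
    by_var = defaultdict(list)
    for var, start, end in requests:
        by_var[var].append((start, end))

    merged = {}
    for var, ranges in by_var.items():
        ranges.sort()
        out = [ranges[0]]
        for start, end in ranges[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:
                # Overlapping or touching: extend the previous range.
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[var] = out
    return merged


def upload_volume(merged):
    """Number of elements the producer uploads after merging."""
    return sum(end - start for ranges in merged.values() for start, end in ranges)


# Two consumers ask for overlapping parts of A and disjoint parts of B.
requests = [("A", 0, 600), ("A", 400, 1000), ("B", 0, 100), ("B", 500, 600)]
merged = merge_requests(requests)

print(merged)
print(upload_volume(merged))  # prints 1200
```

Serving the four requests independently would upload 1400 elements. After merging, the shared region of A (elements 400 to 599) is uploaded once, giving 1200.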
