The Scientific Data Management Center
Arie Shoshani (PI)
Lawrence Berkeley National Laboratory
Co-Principal Investigators
DOE Laboratories:
ANL: Rob Ross
LBNL: Doron Rotem
LLNL: Chandrika Kamath
ORNL: Nagiza Samatova
PNNL: Terence Critchlow, Jarek Nieplocha
Universities:
NCSU: Mladen Vouk
NWU: Alok Choudhary
UCD: Bertram Ludaescher
SDSC: Ilkay Altintas
UUtah: Claudio Silva
Centers/Institutes meeting, October 24-25, 2008
Problems and Mandate
Why is managing scientific data important for scientific investigations?
• Sheer volume and increasing complexity of data being collected are already interfering with the scientific investigation process
• Managing the data by scientists greatly wastes scientists' effective time in performing their applications work
• Data collection, storage, transfer, and archival often conflict with effectively using computational resources
• Effectively managing and analyzing this data and associated metadata requires a comprehensive, end-to-end approach that encompasses all of the stages from the initial data acquisition to the final analysis of the data
• Enable scientists to most effectively discover new knowledge by removing data management bottlenecks and enabling effective data analysis
• Improve productivity of data management infrastructure
• Taking away the burden from scientists
• Engaging scientists, education
Focus of SDM center
• High performance
    • Fast, scalable
    • Parallel I/O, parallel file systems
    • Indexing, data movement
• Usability and effectiveness
    • Easy-to-use tools and interfaces
    • Use of workflow, dashboards
    • End-to-end use (data and metadata)
• Enabling data understanding
    • Parallelize analysis tools
    • Streamline use of analysis tools
    • Real-time data search tools
• Sustainability
    • Robustness
    • Productize software
    • Work with vendors, computing centers
• Establish dialog with scientists
    • Outreach
    • Partner with scientists
    • Education (students, scientists)
Organization of the center: based on three-layer organization of technologies
Integrated approach:
• To provide a scientific workflow and dashboard capability
• To support data mining and analysis tools
• To accelerate storage and access to data
Benefits scientists by:
• Hiding underlying parallel technology
• End-to-end support of applications
• Permitting assembly of modules using workflow description tool
• Tracking data management tasks through web-based dashboards
[Diagram: layered architecture, top to bottom]
Scientific Process Automation (SPA) Layer: Workflow Management Engine (Kepler); Specialized Workflow components; Scientific Dashboard
Data Mining and Analysis (DMA) Layer: Data Analysis and Feature Identification; Parallel R Statistical Analysis; Efficient indexing (Bitmap Index)
Storage Efficient Access (SEA) Layer: Storage Resource Manager (SRM); Parallel I/O (ROMIO); Parallel NetCDF; Parallel Virtual File System; Adaptable I/O System (ADIOS); Active Storage
Hardware, Operating Systems, and Storage Systems
Results
• High Performance Technologies
• Usability and effectiveness
• Enabling Data Understanding
The I/O Software Stack
PVFS on IBM Blue Gene/P
Speeding data transfer with PnetCDF
• Enables high performance parallel I/O to netCDF data sets
• Achieves up to 10-fold performance improvement over HDF5
[Figure: processes P0-P3 writing to a parallel file system, once through netCDF with inter-process communication and once directly through Parallel netCDF. Illustration: A. Tovey]
Early performance testing showed PnetCDF outperformed HDF5 for some critical access patterns. The HDF5 team has responded by improving their code for these patterns, and the two teams now actively collaborate to better understand application needs and system characteristics, leading to I/O performance gains in both libraries.
Contacts: Rob Ross, ANL; Alok Choudhary, NWU
Improving IO in accelerator design simulation on Jaguar/Cray XT*
• Application: SLAC accelerator design
• Omega3P: simulation program that uses higher-order tetrahedral elements
• Had bad reading patterns that do not scale
• Use netCDF files
• Scaling from regular meshes to adaptive meshes
Before (in seconds)
N-CPUs    Writing Time    Solver Time
128       30.27           634.74
256       59.26           324.16
512       146.24          163.30
1024      340.15          94.86
2048      499.21          45.86
4096      965.64          26.08
Time for Writing File >> Time for Solver !!!
(*) Lie-Quan (Rich) Lee (SLAC) and Stephen Hodson (ORNL)
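The scaling problem in the "Before" table can be made concrete by computing what share of the combined write-plus-solver time goes to writing. The sketch below only restates the table's numbers; the function name `write_share` is a hypothetical helper, and "combined time" here assumes the run consists of just these two phases.

```python
# Timings copied from the "Before" table: CPUs -> (writing s, solver s).
before = {
    128: (30.27, 634.74),
    256: (59.26, 324.16),
    512: (146.24, 163.30),
    1024: (340.15, 94.86),
    2048: (499.21, 45.86),
    4096: (965.64, 26.08),
}


def write_share(write_s, solver_s):
    """Fraction of (writing + solver) time spent writing the file."""
    return write_s / (write_s + solver_s)


# Share of time spent writing at each CPU count.
shares = {n: write_share(w, s) for n, (w, s) in before.items()}

# Smallest CPU count at which writing takes more than half the time.
crossover = min(n for n, share in shares.items() if share > 0.5)

if __name__ == "__main__":
    for n in sorted(shares):
        print(f"{n:5d} CPUs: {shares[n]:.1%} of time spent writing")
    print("writing dominates from", crossover, "CPUs")
```

The solver time roughly halves each time the CPU count doubles, while the writing time grows, so the writing share rises steadily with scale.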
Using Parallel-netCDF instead of netCDF and using MPI_Info
After (in seconds)
N-CPUs    Writing Time    Solver Time
512       1.50            163.30
1024      3.27            94.86
2048      7.90            45.86
[Figure: "I/O Time" chart. Y axis: time in seconds (0 to 1200). X axis: num of CPUs (128 to 4096). Series: Writing-netCDF, Writing Parallel-netCDF, Solver time]
• Time for writing data reduced by up to about 100 times (between roughly 63 and 104 times across the three runs above)
• Time for Writing File << Time for Solver
• Expected to behave better for larger problem sizes
Contact: Alok Choudhary, NWU
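The reduction in writing time can be checked directly from the "Before" and "After" tables. This is a minimal sketch that only divides the reported timings; the variable names are illustrative and the three CPU counts are the ones for which both measurements were reported.

```python
# Writing times in seconds, copied from the "Before" and "After" tables.
netcdf_write = {512: 146.24, 1024: 340.15, 2048: 499.21}
pnetcdf_write = {512: 1.50, 1024: 3.27, 2048: 7.90}
# Solver times in seconds (identical in both tables).
solver = {512: 163.30, 1024: 94.86, 2048: 45.86}

# Speedup of Parallel-netCDF over netCDF for writing, per CPU count.
speedup = {n: netcdf_write[n] / pnetcdf_write[n] for n in pnetcdf_write}

# After the change, writing should be cheaper than solving at every scale.
write_below_solver = all(pnetcdf_write[n] < solver[n] for n in pnetcdf_write)

if __name__ == "__main__":
    for n in sorted(speedup):
        print(f"{n:5d} CPUs: writing is {speedup[n]:.1f}x faster")
    print("writing < solver at every scale:", write_below_solver)
```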
Parallel netCDF (no hints)
[Figure: file offset versus time]
• Block depiction of 28 GB file
• Y axis larger here
• Record variable scattered
• Default "cb_buffer_size" hint not good for interleaved netCDF record variables
• Reading in way too much data!
Parallel netCDF (hints)
[Figure: file offset versus time]
• With tuning, much less reading
• Still some overlap
• Better efficiency, but still short of MPI-IO
• "cb_buffer_size" now size of one netCDF record
• Better efficiency, at slight perf cost
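The tuning described above sets the "cb_buffer_size" hint to the size of one netCDF record. The sketch below shows one way such a value could be derived, under stated assumptions: the variable names, shapes, and element sizes are hypothetical, a "record" is taken to be one slice of every record variable, and any format-specific padding is ignored. In a real program the resulting key and value would be stored in an MPI_Info object and passed when the file is created or opened; that step is omitted here so the example runs without MPI.

```python
from math import prod

# Hypothetical record variables: name -> (per-record shape, bytes per element).
# These are illustrative values, not taken from the Omega3P files.
record_vars = {
    "field_a": ((512, 512), 8),  # e.g. doubles
    "field_b": ((512, 512), 4),  # e.g. floats
}


def record_size_bytes(variables):
    """Bytes in one record: one slice of every record variable (no padding)."""
    return sum(prod(shape) * itemsize for shape, itemsize in variables.values())


def collective_buffering_hints(variables):
    """Build the hint as a string pair, since MPI_Info keys and values are strings."""
    return {"cb_buffer_size": str(record_size_bytes(variables))}


hints = collective_buffering_hints(record_vars)

if __name__ == "__main__":
    print(hints)
```

Matching the collective buffer to one record is meant to keep each aggregator from reading the interleaved data of neighboring records; the right value for a given file depends on its actual variable layout.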