Data at the Leibniz-Institute for Astrophysics
Kristin Riebe
AIP – Leibniz-Institute for Astrophysics Potsdam
• Research areas:
  – cosmic magnetic fields (solar/stellar physics, magnetohydrodynamics)
  – extragalactic astrophysics (galactic archeology, galaxies and quasars, cosmology)
• Development of Research Technology and Infrastructure
  – Robotic telescopes, (3D) spectroscopy
  – Supercomputing and E-Science
• Participation in many projects
  – e.g. RAVE, ROSAT, XMM-Newton, LOFAR, MUSE, ...
Example data types at AIP
• Observations:
  – RAVE
    • Radial velocity measurements + spectra
  – SDSS
    • Mirror of DR7, catalog server
  – "Minor data sets":
    • Plate archive (historical plates)
    • CALIFA (spectra of galaxies)
    • Cepheids (collection of data for time series), ...
• Simulation data:
  – Magnetohydrodynamics
  – Cosmological simulations: particle data, dark matter halo catalogues, halo merger history, ...
Behind the scenes
• Supercomputers: Leibniz, Babel, for in-house simulations, data processing
• Almagest: Graywulf cluster for archiving, exchanging data, hosting databases, publishing data, 700 TB disk space
• Virtual research environment:
  – Erebos: ~ 250 TB disk space
  – Used by CLUES collaboration to exchange and process data
• Web servers for publishing smaller data sets
Data center task: Extract – Transform – Load
• Extract (server): from different sources
• Transform: checking, corrections, additions; bring into (standard) format
• Load (web server): publish the data
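To make the three stages concrete, here is a minimal sketch in Python. It is an illustration only, not the tooling used at AIP: the column names (`id`, `x_kpc`, `mass`), the `halos` table, and the in-memory SQLite database are all assumptions chosen to keep the example self-contained.

```python
import csv
import io
import math
import sqlite3

# Hypothetical raw input as it might arrive from a data producer.
RAW = "id,x_kpc,mass\n1,1500.0,1e12\n2,nan,5e11\n3,250.0,2e12\n"

def extract(text):
    """Extract: read the delivered rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert units (kpc -> Mpc) and reject non-finite values."""
    good, rejected = [], []
    for r in rows:
        x_mpc = float(r["x_kpc"]) / 1000.0
        mass = float(r["mass"])
        if not (math.isfinite(x_mpc) and math.isfinite(mass)):
            rejected.append(int(r["id"]))
            continue
        good.append((int(r["id"]), x_mpc, mass))
    return good, rejected

def load(conn, rows):
    """Load: insert the cleaned rows into the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS halos "
        "(id INTEGER PRIMARY KEY, x_mpc REAL, mass REAL)"
    )
    conn.executemany("INSERT INTO halos VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
good, rejected = transform(extract(RAW))
load(conn, good)
```

Keeping the rejected ids instead of silently dropping them makes it possible to report problems back to the data producers.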
Example: MultiDark Database
• Collaboration with Spanish MultiDark project
• Publish data of cosmological simulations in a simulation database
• Similar success to MillenniumDB! :-)
• http://www.multidark.org
• 2 simulations uploaded (12+6 TB)
• > 1 million queries in 2 years, ~ 1500 per day, 4 TB downloaded
• ~ 140 registered users
Example workflow: MultiDark Database
• Extract:
  – Cosmologists produce data, copy them to a server at AIP (VRE)
• Transform:
  – We check data and reading routines, data curation (C/Fortran/Perl/Python)
• Load:
  – Ingest data into database (SQL, bulk copy)
• Check and test:
  – Check the data for completeness, consistency (SQL)
  – Create Peano-Hilbert keys, indexes (C#, Spatial 3D library (T. Budavari, G. Lemson))
• Publish:
  – Using simpledb (Gerard Lemson, Millennium DB, JSP)
  – Write/update documentation; update admin tables of the database
  – Inform users
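The "check and test" step can be sketched with a few SQL queries. The example below runs them from Python against an in-memory SQLite database. The `halos` table, its columns, and the expected row counts are hypothetical and do not reflect the MultiDark schema.

```python
import sqlite3

def check_table(conn, expected_rows, box_size):
    """Return simple completeness and consistency findings for a halo table."""
    cur = conn.cursor()
    report = {}
    # Completeness: snapshots whose row count differs from the expected count.
    cur.execute("SELECT snapnum, COUNT(*) FROM halos GROUP BY snapnum")
    counts = dict(cur.fetchall())
    report["wrong_count"] = sorted(
        s for s, n in expected_rows.items() if counts.get(s, 0) != n
    )
    # Consistency: positions must lie inside the simulation box.
    cur.execute("SELECT COUNT(*) FROM halos WHERE x < 0 OR x >= ?", (box_size,))
    report["outside_box"] = cur.fetchone()[0]
    # Consistency: halo ids must be unique.
    cur.execute("SELECT COUNT(*) - COUNT(DISTINCT halo_id) FROM halos")
    report["duplicate_ids"] = cur.fetchone()[0]
    return report

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE halos (halo_id INTEGER, snapnum INTEGER, x REAL, mass REAL)"
)
conn.executemany(
    "INSERT INTO halos VALUES (?, ?, ?, ?)",
    [
        (1, 0, 10.0, 1e12),
        (2, 0, 999.9, 5e11),
        (3, 1, 1200.0, 2e12),  # outside a 1000-unit box
        (3, 1, 50.0, 3e12),    # duplicate id
    ],
)
report = check_table(conn, expected_rows={0: 2, 1: 3}, box_size=1000.0)
```

In this toy data set the report flags snapshot 1 as incomplete, one position outside the box, and one duplicated identifier.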
Transform: Data curation
• Check completeness of data sets
• Create homogeneous data sets, bring into useful (standard) formats
• Add identifiers, grid indexes etc. for faster queries & for representing relations in the database
• Cross-link data with other catalogues
=> usually we applied tailor-made solutions, tuned to each individual data set, custom reading routines required
=> now things are improving ...
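As an illustration of what a grid index for faster queries might look like, the sketch below maps a 3D position in a cubic box onto a single integer cell id. The function name, the row-major cell numbering, and the clamping at the box edges are assumptions for this example, not a description of the scheme used in the AIP databases.

```python
def grid_index(x, y, z, box_size, ngrid):
    """Map a position to one integer cell id on an ngrid^3 grid (row-major)."""
    def cell(v):
        i = int(v / box_size * ngrid)
        # Clamp so that positions on or beyond the box edge stay in range.
        return min(max(i, 0), ngrid - 1)

    ix, iy, iz = cell(x), cell(y), cell(z)
    return (ix * ngrid + iy) * ngrid + iz

# A position in cell (1, 2, 3) of a 10^3 grid in a box of side 1000.
example_id = grid_index(150.0, 250.0, 350.0, box_size=1000.0, ngrid=10)
```

Storing such a cell id as an indexed integer column lets a spatial query first select the few cells that overlap the search region and only then compare exact coordinates.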
DBIngestor and libhilbert
• DBIngestor library + AsciiIngest
  – Adrian Partl, https://github.com/adrpar/DBIngestor, …/AsciiIngest
  – Apply converters (unit conversions, adding identifiers for db indexing, spatial grid indexes)
  – Apply asserters (nan, inf etc.)
  – => transform and load in one go
  – Easy to write own converters & add own reading routines for binary data
• C-library libhilbert
  – For creating indexes of space-filling Peano-Hilbert curve in 20 dimensions
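To show what a Peano-Hilbert key is, here is a small Python sketch of the classic iterative algorithm for the two-dimensional case. It does not use or mirror the libhilbert API. The function name `hilbert_key_2d` is hypothetical, and the restriction to 2D and to power-of-two grid sizes is a simplification for illustration.

```python
def hilbert_key_2d(n, x, y):
    """Return the position of cell (x, y) along a Hilbert curve on an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order all cells of an 8 x 8 grid along the curve.
n = 8
cells = sorted(
    ((x, y) for x in range(n) for y in range(n)),
    key=lambda c: hilbert_key_2d(n, *c),
)
```

Cells that are consecutive along the curve are always neighbours on the grid, which is why sorting or indexing rows by such a key tends to keep spatially close objects close together on disk.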
Data publication
• Many possibilities, very often individual solutions for each project
• Now: new web app Daiquiri, http://escience.aip.de/daiquiri/
• Developed by Jochen Klar and Adrian Partl
• Web application for publishing data
• Modular, highly customizable
• Using PHP, Zend Framework
• Modern interface using Bootstrap, jQuery
• Authentication, Query Interface
• WordPress integration
• One code base to serve most needs, open source, (easily) extendable
Daiquiri examples
• MultiDark2
• CALIFA
• 4MOST workshop
• Plate Archive
• Jubilee, Curie simulation database in Madrid
http://escience.aip.de/daiquiri/
Screenshot
VO compliance
• Currently working on including VO protocols with Daiquiri
  – Download data as VOTables (MySQL-VOTable-Dump, see GitHub)
  – TAP protocol for accessing data
  – UWS for job queues (MySQL query queue)
• Problems:
  – No public PHP libraries for IVOA protocols available (only in Java)
  – But the community rather needs PHP or Python implementations
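For readers unfamiliar with TAP, the sketch below builds the URL of a synchronous TAP query in Python using only the standard library. The parameter names (`REQUEST`, `LANG`, `QUERY`, `FORMAT`) follow the IVOA TAP 1.0 recommendation. The base URL and the table name are placeholders and do not point to an actual Daiquiri service. No network request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def tap_sync_url(base_url, adql, fmt="votable"):
    """Build the URL for a synchronous TAP 1.0 query (no request is sent)."""
    params = {
        "REQUEST": "doQuery",
        "LANG": "ADQL",
        "FORMAT": fmt,
        "QUERY": adql,
    }
    return base_url.rstrip("/") + "/sync?" + urlencode(params)

# Placeholder service URL and hypothetical table name.
url = tap_sync_url("http://example.org/tap", "SELECT TOP 10 * FROM mdr1.fof")
query_params = parse_qs(urlsplit(url).query)
```

Long-running queries would instead go through the asynchronous endpoint, where the job life cycle is managed with UWS.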
Concluding Remarks
• Common tasks for each data publication: extracting, transforming, uploading the data
• Different tool for each data set?
  – Should rather use only a few generalized tools that are reusable and easier to maintain
  – Takes a lot of time to develop
  – => Collect tools from data centers? Combine efforts?
• Would like to have more implementations/libraries of VO protocols, in different languages