The Sloan Digital Sky Survey From Big Data to Big Database to Big Compute Heidi Newberg Rensselaer Polytechnic Institute
Summary • History of the data deluge from a personal perspective. • The transformation of astronomy with the Sloan Digital Sky Survey. • The discovery of density substructure in the Milky Way stellar spheroid. • Using MilkyWay@home to fit more complex models to the data.
The new 1024x1024 CCD camera required a new computer to store the data from just one night of observing (2 megabytes every five minutes). We also needed to write to Exabyte tape drives rather than reel-to-reel magnetic tapes, so the data would be easier to carry home on the airplane.
The beginning of the data deluge (1990’s) • New CCD cameras produced enough data that we could no longer look at each astronomical object individually. Automated algorithms were needed. • Mag tapes hold 100 Mbytes each, ~2 hrs of observing time per tape. (Requires large backpack to transport home.) Exabyte tapes made data transport easier. • I still own all of these tapes, but it is likely that they are not readable. All astronomical data from that era is lost forever.
The Sloan Digital Sky Survey (SDSS) is a joint project of The University of Chicago, Fermilab, the Institute for Advanced Study, the Japan Participation Group, The Johns Hopkins University, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Princeton University, the U.S. Naval Observatory, and the University of Washington. (11 institutions) Funding for the project has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Aeronautics and Space Administration, the National Science Foundation, the U.S. Department of Energy, the Japanese Monbukagakusho, and the Max Planck Society.
The Data • Images of 14,000 square degrees of sky in 5 passbands (raw data 20 TB) • A catalog of a billion objects detected in those images (20 TB SQL database), ~400 parameters per object • Other data products (DAS – 34 TB) • 1.5 million spectra of galaxies, stars, and quasars (3.3 TB) • Spectral parameters (450 Gbytes) Data reduction??
I discussed the data processing,
Alex Szalay and his group at Johns Hopkins took on the enormous task of putting all of this data into a database, preserving as much provenance as possible, and making the data as accessible as possible. There are serious issues with speed in a database of this size, so his group needed to think hard about how the data would be accessed, and thus how it should be organized.
Scientists were asked for example scientific queries, so the database could be optimized.
The 20 Queries
Q1: Find all galaxies without unsaturated pixels within 1' of a given point of ra=75.327, dec=21.023.
Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, and -10<supergalactic latitude (sgb)<10, and declination less than zero.
Q3: Find all galaxies brighter than magnitude 22, where the local extinction is >0.75.
Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity>0.5, and with the major axis of the ellipse having a declination of between 30" and 60" arc seconds.
Q5: Find all galaxies with a de Vaucouleurs profile (r^(1/4) falloff of intensity on disk) and the photometric colors consistent with an elliptical galaxy.
Q6: Find galaxies that are blended with a star, output the deblended galaxy magnitudes.
Q7: Provide a list of star-like objects that are 1% rare.
Q8: Find all objects with unclassified spectra.
Q9: Find quasars with a line width >2000 km/s and 2.5<redshift<2.7.
Q10: Find galaxies with spectra that have an equivalent width in Hα >40Å (Hα is the main hydrogen spectral line.)
Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
Q12: Create a gridded count of galaxies with u-g>1 and r<21.5 over 60<declination<70, and 200<right ascension<210, on a grid of 2', and create a map of masks over the same grid.
Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u-0.5g-0.2i<1.25 && r<21.75, output it in a form adequate for visualization.
Q14: Find stars with multiple measurements and have magnitude variations >0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
Q15: Provide a list of moving objects consistent with an asteroid.
Q16: Find all objects similar to the colors of a quasar at 5.5<redshift<6.5.
Q17: Find binary stars where at least one of them has the colors of a white dwarf.
Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is where the color ratios u-g, g-r, r-i are less than 0.05m.
Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
Q20: For each galaxy in the BCG data set (brightest color galaxy), in 160<right ascension<170, -25<declination<35 count of galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
From talk by Jim Gray (2001)
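Many of these queries ask for objects within some angular distance of a point or of each other (1', 10", 30"). As a rough illustration only, the sketch below runs a Q1-style "within 1' of a given point" search over a tiny in-memory list. The catalog rows, field names, and function names are hypothetical, and a linear scan like this is not how a billion-object database would answer the query; a real system would rely on a spatial index such as the HTM triangles mentioned in Q13.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (ra, dec) points in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for the small angles used in these queries.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for near-antipodal points.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(catalog, ra, dec, radius_arcmin):
    """Return catalog rows within radius_arcmin of (ra, dec) by brute-force scan."""
    radius_deg = radius_arcmin / 60.0
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]

# Hypothetical toy catalog (not SDSS data), placed around the Q1 example point.
catalog = [
    {"id": 1, "ra": 75.327, "dec": 21.023},  # at the query point
    {"id": 2, "ra": 75.327, "dec": 21.033},  # 0.6' to the north
    {"id": 3, "ra": 75.400, "dec": 21.023},  # roughly 4' to the east
]

nearby = cone_search(catalog, ra=75.327, dec=21.023, radius_arcmin=1.0)
print([row["id"] for row in nearby])  # prints [1, 2]
```

The remaining conditions in a query such as Q1 (object type, pixel flags) would be extra filters applied to the rows this search returns.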
Sky survey “Navigate” tool lets you browse through the images
Over a billion hits to the SDSS site, leveling off at 150 million per year. Over 2,000,000 SQL queries per month on the database.
Computational Science • Traditional Empirical Science – Scientist gathers data by direct observation – Scientist analyzes data • Computational Science – Data captured by instruments, or data generated by simulator – Processed by software – Placed in a database – Scientist analyzes database. From talk by Jim Gray 10/10/2001
What’s needed? (not drawn to scale) [Diagram labels: Scientists: Science Data & Questions. Miners: Data Mining Algorithms. Plumbers: Database to store data, Execute Queries. Tools: Question & Answer, Visualization.] Slide from talk by Jim Gray 4/10/2002
Astronomy Information Age • Astronomical data is processed without anyone looking at the individual images/spectra. Astronomers used to classify galaxies by eye. Sometimes a graduate student would classify thousands of galaxies from a computer screen. At three per minute, this might take hours, days, or even weeks of time. The SDSS found 10^8 galaxies. At three per minute, classification would take 63 years of 24 hours per day, seven days per week. The “Galaxy Zoo” is a project that allows private citizens to look at data by eye, and contribute classifications to scientists. • More data is obtained than anyone can analyze alone (drinking from a fire hose). The SDSS SkyServer, the Virtual Observatory, Google Sky, and WikiSky are all projects aimed at giving people better access to the data from SDSS. • New surveys, including Pan-STARRS, LSST, Guo Shou Jing (LAMOST), DES, RAVE, SEGUE, HERMES, and WFMOS are planned or in progress, patterned on the success of the Sloan Digital Sky Survey.
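The 63-year figure on this slide can be checked with a few lines of arithmetic. The sketch assumes the galaxy count is 10^8 (one hundred million), the rate of three classifications per minute quoted on the slide, and a 365.25-day year; the variable names are illustrative.

```python
GALAXIES = 10**8          # assumed SDSS galaxy count (one hundred million)
RATE_PER_MINUTE = 3       # classifications per minute, from the slide

minutes = GALAXIES / RATE_PER_MINUTE
years = minutes / (60 * 24 * 365.25)  # working around the clock, no breaks

print(f"{years:.1f} years")  # prints 63.4 years
```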
$\rho \propto r^{-3.5}$, $r = \sqrt{x^2 + y^2 + (z/q)^2}$
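The symbols on this slide appear to describe a flattened power-law spheroid: a density that falls as the 3.5 power of a radius in which the z coordinate is scaled by a flattening parameter q. Under that reading, which is an assumption, a minimal sketch of the density function looks like this. The function name, the normalization `rho0`, and the sample coordinates are hypothetical.

```python
import math

def spheroid_density(x, y, z, q=1.0, alpha=3.5, rho0=1.0):
    """Power-law spheroid density, rho0 * r**(-alpha), with flattening q.

    r is the flattened radius sqrt(x^2 + y^2 + (z/q)^2). With q < 1 the
    spheroid is squashed along z, with q = 1 it is spherical, and with
    q > 1 it is prolate. Units are arbitrary in this sketch.
    """
    r = math.sqrt(x * x + y * y + (z / q) ** 2)
    if r == 0.0:
        raise ValueError("a pure power law diverges at the origin")
    return rho0 * r ** (-alpha)

# The same point 10 units up the z axis, under three assumed shapes.
for label, q in [("squashed", 0.5), ("spherical", 1.0), ("prolate", 2.0)]:
    print(label, spheroid_density(0.0, 0.0, 10.0, q=q))
```

At a fixed point on the z axis, a smaller q gives a larger flattened radius and therefore a lower density, which is why star counts toward the Galactic poles constrain the shape.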
The SDSS survey was funded as an extragalactic project, but Galactic stars could not be completely avoided.
Statistical Photometric Parallax The use of statistical knowledge of the absolute magnitudes of stellar populations to determine the density distributions of stars.
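The distance step behind photometric parallax is the standard distance-modulus relation, d = 10^((m - M + 5)/5) parsecs, where m is the apparent magnitude and M the absolute magnitude. The sketch below applies it with an assumed typical absolute magnitude and an assumed spread for the population. Both numbers are illustrative placeholders, not values taken from this talk, and extinction is ignored.

```python
def distance_pc(apparent_mag, absolute_mag):
    """Distance in parsecs from the distance modulus m - M = 5 log10(d) - 5."""
    return 10 ** ((apparent_mag - absolute_mag + 5) / 5)

# Illustrative assumptions only: a population with typical absolute
# magnitude 4.2 and a spread of 0.6 magnitudes around that value.
ASSUMED_ABS_MAG = 4.2
ASSUMED_SPREAD = 0.6

m = 19.2  # apparent magnitude of a hypothetical star
typical = distance_pc(m, ASSUMED_ABS_MAG)                  # about 10 kpc
near = distance_pc(m, ASSUMED_ABS_MAG + ASSUMED_SPREAD)    # intrinsically fainter, so closer
far = distance_pc(m, ASSUMED_ABS_MAG - ASSUMED_SPREAD)     # intrinsically brighter, so farther

print(f"typical {typical / 1000:.1f} kpc, range {near / 1000:.1f} to {far / 1000:.1f} kpc")
```

No single star's distance is known well this way. The method is statistical: the known distribution of absolute magnitudes is used to infer the density distribution of the whole population.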
[Figure from Newberg et al. 2002. Labeled structures: Monoceros stream, Stream in the Galactic Plane, Galactic Anticenter Stellar Stream, Canis Major Stream, Argo Navis Stream; Vivas overdensity, or Virgo Stellar Stream; Stellar Spheroid?; Sagittarius Dwarf Tidal Stream.]
[Figure panels: Squashed halo; Spherical halo; Prolate halo; Exponential disk. Newberg et al. 2002]
Kathryn Johnston
David Law
A map of stars in the outer regions of the Milky Way Galaxy, derived from the SDSS images of the northern sky, shown in a Mercator-like projection. The color indicates the distance of the stars, while the intensity indicates the density of stars on the sky. Structures visible in this map include streams of stars torn from the Sagittarius dwarf galaxy, a smaller 'orphan' stream crossing the Sagittarius streams, the 'Monoceros Ring' that encircles the Milky Way disk, trails of stars being stripped from the globular cluster Palomar 5, and excesses of stars found towards the constellations Virgo and Hercules. Circles enclose new Milky Way companions discovered by the SDSS; two of these are faint globular star clusters, while the others are faint dwarf galaxies. Credit: V. Belokurov and the Sloan Digital Sky Survey.
Why is this important? • Small dwarf galaxies are merging with the Milky Way at the present time. • The Milky Way itself was created by a long history of smaller galaxies merging to make larger ones. • The tidal streams are an archeological record of the merger history that created our galaxy. • The tidal streams encode the gravitational potential through which the dwarf galaxy traveled, and can therefore tell us about the distribution of dark matter in the Milky Way.