Commentary on “Techniques for Massive-Data Machine Learning in Astronomy”
Nick Ball
Herzberg Institute of Astrophysics, Victoria, Canada
SCMA V, Penn State, Jun 14th 2011
The Problem
• Astronomy faces enormous datasets
• Their size, dimensionality, and complexity require intelligent, automated investigation
• Exponential increase in data size: algorithms cannot scale worse than O(N log N)
• Most data mining algorithms naïvely scale as N² or worse
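To make the scaling argument concrete, here is a minimal sketch that compares idealized operation counts for an all-pairs algorithm and a tree-based one. The function names are illustrative, constant factors are ignored, and the values of N are arbitrary choices, so the ratios are only indicative.

```python
import math

def pairwise_ops(n):
    """Idealized operation count for a naive all-pairs algorithm: N^2."""
    return n * n

def tree_ops(n):
    """Idealized operation count for a tree-based algorithm: N log2 N."""
    return n * math.log2(n)

# Arbitrary example sizes; constant factors are deliberately ignored.
for n in (10**4, 10**7, 10**9):
    ratio = pairwise_ops(n) / tree_ops(n)
    print(f"N = {n:.0e}: N^2 / (N log2 N) = {ratio:.1e}")
```

The gap between the two counts widens as N grows, which is why an N² method that is tolerable for thousands of objects becomes impractical for survey-sized catalogues.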
The Solution
• Make data mining algorithms that scale as N log N (or better)!
• May have to compromise accuracy slightly
• Deploy them so that astronomers are willing and able to use them
• They must work on real astronomical data
Collaboration is Vital
• Successful use of astrostatistics and data mining requires expertise in computer science, statistics, and astronomy
• Collaboration enables novelty that would not arise from a single group
• So, computer scientists supplying algorithms in this way is excellent
But
• ... expertise in computer science, statistics, and astronomy
• Successful collaborations have involved astronomers who are experts in computing/statistics, or who are working closely and over time with these experts
And
• Astronomy data are messy:
  - Large, complex, increasingly high-dimensional, time-domain
  - Missing data: non-observation or non-detection
  - Heteroscedastic, non-Gaussian, underestimated errors
  - Outliers, artifacts, false detections, systematic effects
  - Correlated inputs
  - Etc.
An Example
• How do you apply astrostatistics and fast algorithms to this?
The Next Generation Virgo Cluster Survey
• 10σ point source limiting magnitude g = 25.7 (faint!)
• Photometric (few spectra), ~100 deg², 5 bands (ugriz, like Sloan)
• 10⁷+ galaxies, 2.6 terabytes of data
• 40 people at 23 institutions in Canada, France, etc. (PI Laura Ferrarese @ HIA)
• 2009-2012
Virgo is an actual cluster of galaxies, the nearest large one to us
NGVS Statistical Challenges
• Object detection and classification
• Photometric redshifts (photo-z)
• Virgo cluster membership / background
• Missing data
• Field-to-field variation
• Multi-wavelength data
• Completeness (mag, SB, etc.)
Object detection: low surface brightness galaxies
Cluster membership: photometric redshift using k nearest neighbours
Missing data: NGVS fields (not final) don’t all contain all 5 bands ugriz
Multi-wavelength data
Canadian Astronomy Data Centre
• CADC is one of the world’s largest astronomy data centres
• ~500 terabytes of data (will grow to petabytes)
• Uses Virtual Observatory standards
• Staffed by astronomers and computer specialists, but not statisticians
CANFAR
• Canadian Advanced Network for Astronomical Research, at CADC
• Combines cluster job scheduling with cloud computing resources
• Users manage their own virtual machines
So
• Put fast data mining tools on the CANFAR infrastructure
• ... but early days, not much to say yet
Guide to Data Mining in Astronomy
• Virtual Observatory KDD-IG guide: http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/IvoaKDDguide
• Emphasizes data mining, which is part of astroinformatics
• But this overlaps with astrostatistics
• -> potential outreach channel to wider community
kNN Quasar Photometric Redshifts
• Use kd-tree for fast kNN assignment of photo-zs to Sloan Digital Sky Survey quasars
• Single neighbour, perturb input features to make a PDF in redshift
• Removing multi-peaked PDFs removes almost all catastrophic outliers
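The three steps on this slide (kd-tree lookup, perturbing the inputs to build a redshift PDF, rejecting multi-peaked PDFs) can be sketched as follows. This is an illustration only, not the code used for the SDSS quasars: the training set is synthetic, the kd-tree is a textbook version, and every name and parameter (`photoz_pdf`, `count_peaks`, the bin width, the peak threshold, the error values) is a hypothetical choice made for the example.

```python
import random

def build_kdtree(points, depth=0):
    """Build a kd-tree of (features, redshift) pairs as nested tuples."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, depth=0, best=None):
    """Return the (features, redshift) pair closest to query."""
    if node is None:
        return best
    point, left, right = node
    if best is None or sq_dist(query, point[0]) < sq_dist(query, best[0]):
        best = point
    axis = depth % len(query)
    diff = query[axis] - point[0][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the far branch only if the splitting plane is closer than the best match.
    if diff ** 2 < sq_dist(query, best[0]):
        best = nearest(far, query, depth + 1, best)
    return best

def photoz_pdf(tree, colours, errors, n_draws=100, bin_width=0.1, rng=random):
    """Histogram the single-neighbour redshift over Gaussian-perturbed inputs."""
    counts = {}
    for _ in range(n_draws):
        perturbed = [rng.gauss(c, e) for c, e in zip(colours, errors)]
        z = nearest(tree, perturbed)[1]
        b = int(z / bin_width)
        counts[b] = counts.get(b, 0) + 1
    return {b: n / n_draws for b, n in counts.items()}

def count_peaks(pdf, threshold=0.05):
    """Count runs of adjacent bins above threshold (a crude peak definition)."""
    bins = sorted(b for b, p in pdf.items() if p >= threshold)
    return sum(1 for i, b in enumerate(bins) if i == 0 or b - bins[i - 1] > 1)

# Synthetic training set (assumption): 4 colours that vary linearly with redshift.
rng = random.Random(42)
training = []
for _ in range(2000):
    z = rng.uniform(0.0, 5.0)
    colours = tuple(0.3 * (k + 1) * z + rng.gauss(0.0, 0.05) for k in range(4))
    training.append((colours, z))
tree = build_kdtree(training)

# Hypothetical object whose colours correspond to z = 2 in the toy model.
query = (0.6, 1.2, 1.8, 2.4)
pdf = photoz_pdf(tree, query, errors=(0.05,) * 4, rng=rng)
n_peaks = count_peaks(pdf)
print("peaks:", n_peaks, "-> keep" if n_peaks == 1 else "-> reject")
```

In this toy model the colours map to redshift monotonically, so multi-peaked PDFs are rare. In real quasar data, colour degeneracies between widely separated redshifts are what would produce the extra peaks that the slide proposes to remove.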
kNN Quasar Photometric Redshifts
[Figure: z mean versus z spec, both axes 0 to 6, colour scale 20 to 120; σ = 0.34397]
kNN Quasar Photometric Redshifts
[Figure: z one peak versus z spec, both axes 0 to 6, colour scale 20 to 120; σ = 0.11096]
Questions
• Can we overcome the problems of real data?
• Will there be data of high intrinsic dimension?
• Will astronomers be able to deploy the algorithms?
• Where do GPUs fit? (GPU+brute force may be just as fast?)
Conclusions
• Provided the data can be suitably prepared, and the science-driven usage of the algorithm intelligently motivated, the fast algorithms presented here have excellent potential for advancing astronomical research