Commentary on “Techniques for Massive-Data Machine Learning in Astronomy”
Nick Ball, Herzberg Institute of Astrophysics, Victoria, Canada
SCMA V, Penn State, Jun 14th 2011 (24 slides)
The Problem
• Astronomy faces enormous datasets
• Their size, dimensionality, and complexity require intelligent, automated investigation
• Exponential increase in data size: algorithms cannot scale worse than O(N log N)
• Most data mining algorithms naïvely scale as N² or worse
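To make the scaling argument concrete, here is a rough back-of-the-envelope sketch comparing N log N with N² operation counts. The function name `operation_counts` and the chosen values of N are illustrative assumptions, and constant factors are ignored.

```python
import math

def operation_counts(n):
    """Return rough operation counts (n log2 n, n^2) for n objects, ignoring constants."""
    return n * math.log2(n), float(n) ** 2

# Illustrative catalogue sizes (assumed for this sketch).
for n in (10**4, 10**6, 10**7):
    nlogn, nsq = operation_counts(n)
    print(f"N = {n:.0e}: N log N ~ {nlogn:.2e}, N^2 ~ {nsq:.2e}, ratio ~ {nsq / nlogn:.1e}")
```

At catalogue sizes of order 10⁷ objects, the gap between the two scalings is several hundred thousand-fold, which is why the naïve N² algorithms become impractical.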
The Solution
• Make data mining algorithms that scale as N log N (or better)!
• May have to compromise accuracy slightly
• Deploy them so that astronomers are willing and able to use them
• They must work on real astronomical data
Collaboration is Vital
• Successful use of astrostatistics and data mining requires expertise in computer science, statistics, and astronomy
• Collaboration enables novelty that would not arise from a single group
• So, computer scientists supplying algorithms in this way is excellent
But
• ... expertise in computer science, statistics, and astronomy
• Successful collaborations have involved astronomers who are experts in computing/statistics, or who are working closely and over time with these experts
And
• Astronomy data are messy:
  - Large, complex, increasingly high-dimensional, time-domain
  - Missing data: non-observation or non-detection
  - Heteroscedastic, non-Gaussian, underestimated errors
  - Outliers, artifacts, false detections, systematic effects
  - Correlated inputs
  - Etc.
An Example
• How do you apply astrostatistics and fast algorithms to this?
The Next Generation Virgo Cluster Survey
• 10σ point source limiting magnitude g = 25.7 (faint!)
• Photometric (few spectra), ~100 deg², 5 bands (ugriz, like Sloan)
• 10⁷+ galaxies, 2.6 terabytes of data
• 40 people at 23 institutions in Canada, France, etc. (PI Laura Ferrarese @ HIA)
• 2009-2012
Virgo is an actual cluster of galaxies, the nearest large one to us
NGVS Statistical Challenges
• Object detection and classification
• Photometric redshifts (photo-z)
• Virgo cluster membership / background
• Missing data
• Field-to-field variation
• Multi-wavelength data
• Completeness (mag, SB, etc.)
Object detection: low surface brightness galaxies
Cluster membership: photometric redshift using k nearest neighbours
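As a minimal sketch of the idea, a kNN photometric redshift can be taken as the mean spectroscopic redshift of the k training objects closest in colour space. This is an illustration under stated assumptions, not the NGVS pipeline: the function `knn_photoz`, the two-colour feature space, and the toy training values are all hypothetical, and the search is brute force rather than tree-based.

```python
import math

def knn_photoz(train_colours, train_z, query, k=3):
    """Mean spectroscopic redshift of the k training objects nearest to `query` in colour space."""
    # Brute-force search: compute every distance, then keep the k smallest.
    dists = sorted((math.dist(c, query), z) for c, z in zip(train_colours, train_z))
    nearest = dists[:k]
    return sum(z for _, z in nearest) / len(nearest)

# Hypothetical toy training set: two colours per object, with known redshifts.
train_colours = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (1.0, 1.1), (1.1, 0.9)]
train_z = [0.10, 0.12, 0.11, 0.80, 0.90]

print(knn_photoz(train_colours, train_z, query=(0.12, 0.18), k=3))
```

A membership decision would then need a cut on the estimated redshift, which is not shown here. The brute-force search costs O(N) per query; at survey scale a space-partitioning tree would replace it.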
Missing data: NGVS fields (not final) don’t all contain all 5 bands ugriz
Multi-wavelength data
Canadian Astronomy Data Centre
• CADC is one of the world’s largest astronomy data centres
• ~500 terabytes of data (will grow to petabytes)
• Uses Virtual Observatory standards
• Staffed by astronomers and computer specialists, but not statisticians
CANFAR
• Canadian Advanced Network for Astronomical Research, at CADC
• Combines cluster job scheduling with cloud computing resources
• Users manage their own virtual machines
So
• Put fast data mining tools on the CANFAR infrastructure
• ... but early days, not much to say yet
Guide to Data Mining in Astronomy
• Virtual Observatory KDD-IG guide: http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/IvoaKDDguide
• Emphasizes data mining, which is part of astroinformatics
• But this overlaps with astrostatistics
• So: a potential outreach channel to the wider community
kNN Quasar Photometric Redshifts
• Use kd-tree for fast kNN assignment of photo-zs to Sloan Digital Sky Survey quasars
• Single neighbour, perturb input features to make a PDF in redshift
• Removing multi-peaked PDFs removes almost all catastrophic outliers
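The three steps above can be sketched in a few dozen lines of plain Python. This is a toy illustration, not the code used for the SDSS quasars: the function names, the Gaussian perturbation, the toy training set, and in particular the peak definition (runs of occupied histogram bins separated by empty bins) are all assumptions made for the sketch.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree from (features, redshift) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Return (distance, (features, redshift)) for the single nearest neighbour."""
    if node is None:
        return best
    feats, _ = node["point"]
    d = math.dist(feats, query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - feats[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the current best.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

def photoz_samples(tree, features, errors, n_draws=100, rng=None):
    """Perturb the features (assumed Gaussian errors) and record the neighbour's redshift per draw."""
    rng = rng or random.Random(0)
    samples = []
    for _ in range(n_draws):
        perturbed = [rng.gauss(f, e) for f, e in zip(features, errors)]
        _, (_, z) = nearest(tree, perturbed)
        samples.append(z)
    return samples

def count_peaks(samples, z_max=6.0, n_bins=30):
    """Count runs of occupied histogram bins (an assumed, simplistic peak definition)."""
    hist = [0] * n_bins
    for z in samples:
        hist[min(int(z * n_bins / z_max), n_bins - 1)] += 1
    peaks, inside = 0, False
    for count in hist:
        if count and not inside:
            peaks += 1
        inside = bool(count)
    return peaks

# Hypothetical training set: two well-separated colour groups at different redshifts.
train = []
for i in range(20):
    train.append(((0.20 + 0.005 * i, 0.30), 1.0 + 0.01 * i))
    train.append(((0.80 + 0.005 * i, 0.10), 3.5 + 0.01 * i))
tree = build_kdtree(train)

clean = photoz_samples(tree, (0.25, 0.30), (0.01, 0.01))      # small errors
ambiguous = photoz_samples(tree, (0.55, 0.20), (0.20, 0.20))  # large errors, between groups
print(count_peaks(clean), count_peaks(ambiguous))
```

In this toy set-up the first object should give a single-peaked redshift PDF, while the second, sitting between the two groups with large errors, should give two peaks and would be dropped by a one-peak cut.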
kNN Quasar Photometric Redshifts
[Figure: z mean versus z spec, both axes 0 to 6; σ = 0.34397]
kNN Quasar Photometric Redshifts
[Figure: z one peak versus z spec, both axes 0 to 6; σ = 0.11096]
Questions
• Can we overcome the problems of real data?
• Will there be data of high intrinsic dimension?
• Will astronomers be able to deploy the algorithms?
• Where do GPUs fit? (GPU+brute force may be just as fast?)
Conclusions
• Provided the data can be suitably prepared, and the science-driven usage of the algorithm intelligently motivated, the fast algorithms presented here have excellent potential for advancing astronomical research