
SCMA V, Penn State, Jun 14th 2011
Commentary on “Techniques for Massive-Data Machine Learning in Astronomy”
Nick Ball
Herzberg Institute of Astrophysics
Victoria, Canada

The Problem
• Astronomy faces enormous datasets
• Their size, dimensionality, and complexity require intelligent, automated investigation
• Exponential increase in data size: algorithms cannot scale worse than O(N log N)
• Most data mining algorithms naïvely scale as N² or worse

The Solution
• Make data mining algorithms that scale as N log N (or better)!
• May have to compromise accuracy slightly
• Deploy them so that astronomers are willing and able to use them
• They must work on real astronomical data
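To make the scaling argument concrete, here is a minimal sketch that compares rough operation counts for N log N and N² at a few catalogue sizes, including the ~10⁷ objects quoted for NGVS later in the talk. The base-2 logarithm and the neglect of constant factors are assumptions made for illustration; `operation_counts` is a hypothetical helper, not something from the talk.

```python
import math

def operation_counts(n):
    """Return rough operation counts (n * log2(n), n**2), ignoring constant factors."""
    return n * math.log2(n), float(n) ** 2

for n in (10**5, 10**7, 10**9):
    nlogn, n2 = operation_counts(n)
    # The ratio shows how much more work the naive N^2 approach needs.
    print(f"N = {n:.0e}: N log N ~ {nlogn:.1e}, N^2 ~ {n2:.1e}, ratio ~ {n2 / nlogn:.1e}")
```

The ratio grows roughly in proportion to N, so the gap widens as surveys get larger.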

Collaboration is Vital
• Successful use of astrostatistics and data mining requires expertise in computer science, statistics, and astronomy
• Collaboration enables novelty that would not arise from a single group
• So, computer scientists supplying algorithms in this way is excellent

But
• ... expertise in computer science, statistics, and astronomy
• Successful collaborations have involved astronomers who are experts in computing/statistics, or who are working closely and over time with these experts

And
• Astronomy data are messy:
  - Large, complex, increasingly high-dimensional, time-domain
  - Missing data: non-observation or non-detection
  - Heteroscedastic, non-Gaussian, underestimated errors
  - Outliers, artifacts, false detections, systematic effects
  - Correlated inputs
  - Etc.

An Example
• How do you apply astrostatistics and fast algorithms to this?

The Next Generation Virgo Cluster Survey
• 10σ point source limiting magnitude g = 25.7 (faint!)
• Photometric (few spectra), ~100 deg², 5 bands (ugriz, like Sloan)
• 10⁷+ galaxies, 2.6 terabytes of data
• 40 people at 23 institutions in Canada, France, etc. (PI Laura Ferrarese @ HIA)
• 2009-2012

Virgo is an actual cluster of galaxies, the nearest large one to us

NGVS Statistical Challenges
• Object detection and classification
• Photometric redshifts (photo-z)
• Virgo cluster membership / background
• Missing data
• Field-to-field variation
• Multi-wavelength data
• Completeness (mag, SB, etc.)

Object detection: low surface brightness galaxies

Cluster membership: photometric redshift using k nearest neighbours

Missing data: NGVS fields (not final) don’t all contain all 5 bands ugriz

Multi-wavelength data

Canadian Astronomy Data Centre
• CADC is one of the world’s largest astronomy data centres
• ~500 terabytes of data (will grow to petabytes)
• Uses Virtual Observatory standards
• Staffed by astronomers and computer specialists, but not statisticians

CANFAR
• Canadian Advanced Network for Astronomical Research, at CADC
• Combines cluster job scheduling with cloud computing resources
• Users manage their own virtual machines

So
• Put fast data mining tools on the CANFAR infrastructure
• ... but early days, not much to say yet

Guide to Data Mining in Astronomy
• Virtual Observatory KDD-IG guide: http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/IvoaKDDguide
• Emphasizes data mining, which is part of astroinformatics
• But this overlaps with astrostatistics
• -> potential outreach channel to wider community

kNN Quasar Photometric Redshifts
• Use kd-tree for fast kNN assignment of photo-zs to Sloan Digital Sky Survey quasars
• Single neighbour, perturb input features to make a PDF in redshift
• Removing multi-peaked PDFs removes almost all catastrophic outliers
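The three steps above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the code used for the results in the talk: the kd-tree is a simple median-split tree, the perturbations are assumed Gaussian with the quoted feature errors, "multi-peaked" is approximated by counting separated clumps in a histogram, and the training data are made-up toy values. All function names (`build_kdtree`, `nearest`, `redshift_samples`, `count_peaks`) are hypothetical.

```python
import random

def build_kdtree(points, depth=0):
    """Build a kd-tree from (features, redshift) pairs by median splits."""
    if not points:
        return None
    axis = depth % len(points[0][0])  # cycle through the feature axes
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (squared distance, (features, redshift)) of the single nearest neighbour."""
    if node is None:
        return best
    features = node["point"][0]
    d2 = sum((a - b) ** 2 for a, b in zip(features, query))
    if best is None or d2 < best[0]:
        best = (d2, node["point"])
    diff = query[node["axis"]] - features[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if diff ** 2 < best[0]:  # the splitting plane is closer than the current best
        best = nearest(far, query, best)
    return best

def redshift_samples(tree, features, errors, n_draws=200, rng=random):
    """Perturb the features (assumed Gaussian errors) and record the neighbour's redshift."""
    samples = []
    for _ in range(n_draws):
        perturbed = [rng.gauss(f, e) for f, e in zip(features, errors)]
        samples.append(nearest(tree, perturbed)[1][1])
    return samples

def count_peaks(samples, z_max=6.0, n_bins=30, min_frac=0.05):
    """Count separated clumps of populated bins in a histogram of the samples."""
    counts = [0] * n_bins
    for z in samples:
        index = max(0, min(int(z * n_bins / z_max), n_bins - 1))
        counts[index] += 1
    threshold = min_frac * len(samples)
    peaks, in_peak = 0, False
    for c in counts:
        if c >= threshold and not in_peak:
            peaks += 1
        in_peak = c >= threshold
    return peaks

# Toy data only: four made-up "colours" per object and a made-up redshift in [0, 6).
rng = random.Random(0)
training = []
for _ in range(2000):
    z = rng.uniform(0.0, 6.0)
    training.append(([z + rng.gauss(0.0, 0.1) for _ in range(4)], z))

tree = build_kdtree(training)
samples = redshift_samples(tree, [2.0, 2.0, 2.0, 2.0], [0.05] * 4, rng=rng)
print("mean photo-z:", sum(samples) / len(samples))
print("peaks in PDF:", count_peaks(samples))
```

An object whose samples form more than one clump would be flagged as multi-peaked and could be dropped. The bin width and the 5% threshold are arbitrary choices here and would need tuning on real data.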

kNN Quasar Photometric Redshifts
[Figure: z mean versus z spec, both axes from 0 to 6, with a density scale from 20 to 120; σ = 0.34397]

kNN Quasar Photometric Redshifts
[Figure: z one peak versus z spec, both axes from 0 to 6, with a density scale from 20 to 120; σ = 0.11096]

Questions
• Can we overcome the problems of real data?
• Will there be data of high intrinsic dimension?
• Will astronomers be able to deploy the algorithms?
• Where do GPUs fit? (GPU+brute force may be just as fast?)
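On the GPU question, it may help to see why brute force is a natural fit for parallel hardware. The sketch below is plain serial Python, not GPU code, and `brute_force_nearest` is a hypothetical name; it only illustrates that each distance evaluation is independent of the others, so the inner loop could in principle be spread across many threads. Whether that beats a tree-based search in practice is the open question posed above.

```python
def brute_force_nearest(training, query):
    """Return the index of the training point closest to `query`.

    Each squared distance is computed independently of the others,
    so the loop body has no dependence between iterations except
    for the final minimum.
    """
    best_index, best_d2 = -1, float("inf")
    for i, point in enumerate(training):
        d2 = sum((a - b) ** 2 for a, b in zip(point, query))
        if d2 < best_d2:
            best_index, best_d2 = i, d2
    return best_index

# Toy values for illustration only.
training = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.5)]
print(brute_force_nearest(training, (0.9, 1.2)))
```

For N queries against N training points this costs N² distance evaluations in total, which is the scaling the tree-based methods are designed to avoid.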

Conclusions
• Provided the data can be suitably prepared, and the science-driven usage of the algorithm intelligently motivated, the fast algorithms presented here have excellent potential for advancing astronomical research
