Knowledge Discovery Workflows in the Exploration of Complex Astronomical Datasets Raffaele D’Abrusco Harvard-Smithsonian Center for Astrophysics Raffaele D’Abrusco (CfA) KD workflows in Astronomy IAU 2012 - August 29th 2012 1 / 22
Galilean experimental method
Setting the stage
Knowledge Discovery - KD - is the “automatic processing of large amount of data to extract patterns that can represent knowledge about the data”.
KD in the real world
Outside our real and virtual domes, KD methodology has already shaped how data are processed and knowledge is extracted in several (expected and unexpected) fields:
- Social sciences: advertisement placement, social networks...
- Finance: market analysis tools, derivatives trading...
- Life sciences: genetics, epidemiology, drug testing...
- Security: face recognition, behavior tracking...
- Google and the like...
For most of these fields, KD is the only way to make sense of the overwhelming amount of data gathered.
The opportunity in Astronomy
Advances in astronomical technology (hardware and software) allow us to go larger, deeper and to higher resolution, both spatially and spectrally, changing the nature of astronomical data.
[Figure: # sources (axis from 10^3 to 10^6) against data complexity in bytes (axis from 10^9 to 10^18).]
Facilities like LSST, SKA, ALMA, Euclid, etc., together with the access to and federation of archival data provided by the VOs, will accelerate this change by making large multivariate datasets (also spanning the time axis) easily available.
Not just a needle in the haystack
A KD workflow is a sequence of analysis steps, carried out with KD techniques, that extracts the most knowledge from (usually) large amounts of (complex) data.
Goals:
- Discovery
  - Find new complex correlations;
  - Extend known correlations to more dimensions;
  - Find new simple correlations, so far overlooked;
- Using the discovery
  - Insight into astrophysics;
  - Classification, regression, new ways to look at things...
While high-dimensional regions of the observable parameter space are still completely unexplored, not all low-dimensional feature spaces have been investigated yet, because we tend to look only where we expect to find something. A systematic way to search for “something” is necessary because it does not depend on our biases, our prioritization, or the limited time and resources available.
A first try
Extraction of optical candidate quasars from the SDSS photometric dataset using a spectroscopic base of knowledge.
A combination of two unsupervised clustering (UC) techniques and the a priori knowledge available for a subset of confirmed SDSS quasars was used to extract optical candidate quasars from photometric data.
[Figure: workflow flowchart with the boxes Start; Spectroscopic data; PPS; NEC; Selection of best clusterization; Successful cluster? (Yes/No); Characterization in parameter space; Photometric data; Selection of photometric objects; Candidate quasars; End.]
The Weak Gated Expert
The Weak Gated Expert (WGE) is a KD procedure for determining photometric redshifts (z_phot) of galaxies and quasars, based on clustering in the color space and on training an ensemble of neural networks for regression.
- The UC algorithm splits the feature space into more homogeneous chunks to prevent under- or over-fitting of the experts;
- Multiple distinct experts (neural networks) are trained on different regions of the feature space;
- The gate combines the outputs of the individual experts in order to maximize the accuracy of the reconstruction and minimize biases.
[Figure: Fuzzy k-means partitions the feature space into Clusters 1 to 8; the Experts produce z_phot(1), z_phot(2), ..., z_phot(N); the Gating Network outputs z_phot(Best).]
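The three WGE stages (clustering, per-region experts, gate) can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the data are a toy one-dimensional "color" with a synthetic target, least-squares lines stand in for the neural-network experts, and a membership-weighted average stands in for the trained gating network. It is not the WGE implementation.

```python
def memberships(x, centers, m=2.0):
    """Fuzzy c-means membership of scalar x in each cluster (values sum to 1)."""
    w = [(abs(x - c) + 1e-12) ** (-2.0 / (m - 1.0)) for c in centers]
    total = sum(w)
    return [wk / total for wk in w]

def fuzzy_cmeans(xs, k, m=2.0, iters=100):
    """Plain fuzzy c-means on 1-D data, started from evenly spaced centers."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    for _ in range(iters):
        u = [memberships(x, centers, m) for x in xs]
        centers = [
            sum(un[j] ** m * x for un, x in zip(u, xs)) / sum(un[j] ** m for un in u)
            for j in range(k)
        ]
    return centers

def fit_line(xs, zs):
    """Ordinary least-squares line z = a + b * x; returns (a, b)."""
    n = len(xs)
    if n == 0:  # an expert with no training points predicts zero
        return 0.0, 0.0
    mx, mz = sum(xs) / n, sum(zs) / n
    var = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / var if var > 0 else 0.0
    return mz - b * mx, b

def train_experts(xs, zs, centers):
    """One linear expert per cluster, trained on the points assigned to it."""
    groups = [([], []) for _ in centers]
    for x, z in zip(xs, zs):
        u = memberships(x, centers)
        j = u.index(max(u))  # hard assignment to the most likely cluster
        groups[j][0].append(x)
        groups[j][1].append(z)
    return [fit_line(gx, gz) for gx, gz in groups]

def gate(x, centers, experts):
    """Stand-in gate: membership-weighted average of the expert outputs."""
    u = memberships(x, centers)
    return sum(uk * (a + b * x) for uk, (a, b) in zip(u, experts))

def rmse(pred, truth):
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

# Toy data (hypothetical): one "color" x and a target z whose slope changes at x = 1.
xs = [i / 20 for i in range(41)]
zs = [0.5 * x if x < 1 else 0.5 + 2.0 * (x - 1) for x in xs]

centers = fuzzy_cmeans(xs, k=2)
experts = train_experts(xs, zs, centers)
rmse_gated = rmse([gate(x, centers, experts) for x in xs], zs)

a, b = fit_line(xs, zs)  # single global expert, for comparison
rmse_global = rmse([a + b * x for x in xs], zs)
print(f"gated experts: {rmse_gated:.3f}, single global fit: {rmse_global:.3f}")
```

On this toy relation the two local experts can each follow one slope, which a single global fit cannot do; that is the motivation for splitting the feature space before training.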
A more general question
What if the goal is not to improve the accuracy of a quantity obtained by regression (z_phot) or the binary classification of sources (stars vs. quasars)? What if the goal is to find out whether any pattern occurs in any feature space, using clustering techniques?
The tenet
Spontaneous aggregations of sources in their observable space, the clusters, reflect similarities (common traits) shared by these sources. Anisotropies in the distribution of cluster populations relative to other observables reflect the existence of significant patterns.
The CLaSPS method
Clustering-Labels-Scores Patterns Spotter (CLaSPS)
1. A UC algorithm is used to produce clusterings in the parameter space generated by any subset of the observables (the features);
2. Other observables not employed for the clustering (the labels) are used as tags to identify interesting sets of clusters using the score;
3. The patterns in the selected sets of clusters are identified and studied.
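A minimal sketch of the first two steps, under stated assumptions: a plain k-means with a naive initialisation stands in for whichever UC algorithm is used, the six sources and their "a"/"b" labels are invented toy data, and the helper names `kmeans` and `label_fractions` are hypothetical. It only shows how a label that was kept out of the clustering can be tabulated per cluster afterwards.

```python
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain k-means; naive initialisation from the first k points (assumption)."""
    centers = [tuple(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda j: sum((a - c) ** 2 for a, c in zip(p, centers[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def label_fractions(assign, labels, k):
    """f[i][c]: fraction of members of cluster i whose label falls in class c."""
    classes = sorted(set(labels))
    table = []
    for i in range(k):
        counts = Counter(l for a, l in zip(assign, labels) if a == i)
        n = sum(counts.values())
        table.append({c: (counts[c] / n if n else 0.0) for c in classes})
    return table

# Toy sample (hypothetical): two features per source, one label kept out of the clustering.
features = [(0.0, 0.1), (3.0, 3.1), (0.2, 0.0), (2.9, 3.0), (0.1, 0.2), (3.1, 2.8)]
labels = ["a", "b", "a", "b", "a", "a"]

# Step 1: several clusterings of the same feature space.
clusterings = {k: kmeans(features, k) for k in (2, 3)}
# Step 2: tag each cluster with the distribution of the label among its members.
fractions = {k: label_fractions(a, labels, k) for k, a in clusterings.items()}
for k, table in fractions.items():
    print(k, table)
```

Clusters whose label fractions are strongly unbalanced are the ones a score would flag for closer study in step 3.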
The choice of the clustering(s)
Sets of clusters (or single clusters) are picked according to the degree of correlation between the distribution of cluster members in the feature space and their distribution in the label space.
S_{tot} = \frac{1}{N_{clust}} \cdot \sum_{i=1}^{N_{clust}} S_i = \frac{1}{N_{clust}} \sum_{i=1}^{N_{clust}} \sum_{j=1}^{M(j)-1} \| f_{ij} - f_{i(j+1)} \|
where f_{ij} is the fraction of members of the i-th cluster with values of the label in the j-th class.
[Table: lab. 1, value per cluster with the number of cluster members in parentheses, for clusterings with N_clust = 3 to 8; the Total row also carries a bracketed value.]
N_clust     3             4              5              6              7              8
Total       0.54 [0.174]  0.558 [0.131]  0.529 [0.105]  0.516 [0.087]  0.472 [0.075]  0.485 [0.065]
Cluster 8   0             0              0              0              0              0.6 (5)
Cluster 7   0             0              0              0              0.409 (22)     0 (1)
Cluster 6   0             0              0              0.483 (29)     0 (1)          0.357 (14)
Cluster 5   0             0              0.474 (38)     0.487 (39)     0.714 (7)      0.435 (23)
Cluster 4   0             0.48 (25)      0.517 (29)     0.25 (4)       0.696 (23)     0.619 (21)
Cluster 3   0.48 (25)     0.459 (37)     0.308 (13)     0.619 (21)     0.438 (16)     0.435 (23)
Cluster 2   0.514 (70)    0.541 (37)     0.682 (22)     0.714 (7)      0.4 (25)       0.75 (8)
Cluster 1   0.625 (16)    0.75 (12)      0.667 (9)      0.545 (11)     0.647 (17)     0.688 (16)
[Figure: S_tot and S'_tot as functions of N_clust (3 to 8) for the K-means, SOM and HC clusterings.]
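The score on this slide can be tried out on a small worked example. The sketch below rests on two assumptions that the slide does not spell out: the delimiters around f_ij - f_i(j+1) are read as an absolute value, and the inner sum is taken over adjacent pairs of the label classes. The fractions are invented for illustration, and the function names are hypothetical.

```python
def cluster_score(fractions):
    """S_i: summed absolute differences between adjacent label-class fractions."""
    return sum(abs(fractions[j] - fractions[j + 1]) for j in range(len(fractions) - 1))

def total_score(fraction_table):
    """S_tot: mean of the per-cluster scores S_i."""
    scores = [cluster_score(row) for row in fraction_table]
    return sum(scores) / len(scores)

# Hypothetical fractions f_ij for 3 clusters and a label split into 2 classes.
f = [
    [1.0, 0.0],    # every member falls in the first class
    [0.5, 0.5],    # members evenly split between the classes
    [0.25, 0.75],  # three quarters of the members in the second class
]
s_i = [cluster_score(row) for row in f]
s_tot = total_score(f)
print(s_i, s_tot)
```

Under this reading a cluster dominated by one label class scores high and an evenly mixed cluster scores zero. The averaging step is also consistent with the lab. 1 table if its per-cluster entries are read as the S_i: for N_clust = 3, (0.625 + 0.514 + 0.48) / 3 is about 0.54, the value in the Total row.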
An interesting finding
CLaSPS has been applied to a sample of AGNs with multi-wavelength observations spanning from radio to γ-rays (features and labels) to characterize their SEDs in the color feature space.
→ Dataset: AGNs catalog
→ Features: UV (GALEX) + Optical (SDSS) + NIR (UKIDSS) + IR (WISE)
→ Labels: AGNs class., Blazars spectral class., γ-ray emission of Blazars
Three clusters composed of Blazars stood out with large values of the scores when the spectral classification was used as label. Further experiments using the γ-ray detection and the FSRQs-BL Lacs classification as labels showed that such patterns depend on WISE mid-infrared colors.
[Figure annotations: labels, features, labels.]