Big Bang, Big Iron: CMB Data Analysis at the Petascale and Beyond

Julian Borrill
Computational Cosmology Center, LBL & Space Sciences Laboratory, UCB
with Christopher Cantalupo, Theodore Kisner, Radek Stompor, et al., and the BOOMERanG, EBEX, MAXIMA, Planck & PolarBear teams

3DAPAS – June 8th, 2011
The Cosmic Microwave Background

About 400,000 years after the Big Bang, the expanding Universe cools through the ionization temperature of hydrogen: p⁺ + e⁻ → H. Without free electrons to scatter off, the photons free-stream to us today.
• COSMIC – filling all of space.
• MICROWAVE – redshifted by the expansion of the Universe from 3000 K to 3 K.
• BACKGROUND – primordial photons coming from “behind” all astrophysical sources.
CMB Science

• Primordial photons give the earliest possible image of the Universe.
• The existence of the CMB supports a Big Bang over a Steady State cosmology (NP1; NP = Nobel Prize).
• Tiny fluctuations in the CMB temperature (NP2) and polarization encode the fundamentals of
  – Cosmology: geometry, topology, composition, history, …
  – Highest-energy physics: grand unified theories, the dark sector, inflation, …
• Current goals:
  – a definitive T measurement provides complementary constraints for all dark energy experiments.
  – a detection of the cosmological B-mode gives the energy scale of inflation from primordial gravitational waves (NP3).
The Concordance Cosmology

• High-z Supernova Search & Supernova Cosmology Project (1998): Cosmic Dynamics (Ω_Λ − Ω_m)
• BOOMERanG & MAXIMA (2000): Cosmic Geometry (Ω_Λ + Ω_m)
⇒ 70% Dark Energy + 25% Dark Matter + 5% Baryons = 95% Ignorance

What (and why) is the Dark Universe?
Observing the CMB

• With very sensitive, very cold detectors – coloured noise.
• Scanning all of the sky from space, or some of the sky from the stratosphere or from high, dry ground.
Analysing The CMB
CMB Satellite Evolution

Evolving science goals require (i) higher resolution & (ii) polarization sensitivity.
CMB Data Analysis

• In principle very simple – assume Gaussianity and maximize the likelihood
  1. of maps given the observations and their noise statistics (analytic).
  2. of power spectra given maps and their noise statistics (iterative).
• In practice very complex
  – Foregrounds, glitches, asymmetric beams, non-Gaussian noise, etc.
  – Algorithms & implementations must scale with the evolution of
    • CMB data-set size
    • HPC architecture
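In standard notation (a gloss added here, not on the original slide): writing P for the pointing matrix, N for the time-domain noise covariance, and d for the time-stream, the two maximization steps are

```latex
% Step 1 (analytic): the maximum-likelihood map
\hat{m} = \left( P^{T} N^{-1} P \right)^{-1} P^{T} N^{-1} d
% Step 2 (iterative): maximize over the power spectrum C_\ell, with signal
% covariance S(C_\ell) and pixel noise covariance N_p = (P^T N^{-1} P)^{-1}
-2 \ln L\left( C_\ell \mid \hat{m} \right)
  = \hat{m}^{T} \left[ S(C_\ell) + N_p \right]^{-1} \hat{m}
  + \ln \det \left[ S(C_\ell) + N_p \right] + \text{const.}
```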
The CMB Data Challenge

• Extracting fainter signals (polarization, high resolution) from the data requires:
  – larger data volumes to provide higher signal-to-noise.
  – more complex analyses to control fainter systematic effects.

Experiment   Start   Goals                        N_t     N_p
COBE         1989    All-sky, low-res, T          10^9    10^4
BOOMERanG    1997    Cut-sky, high-res, T         10^9    10^6
WMAP         2001    All-sky, mid-res, T+E        10^10   10^7
Planck       2009    All-sky, high-res, T+E(+B)   10^12   10^9
PolarBear    2012    Cut-sky, high-res, T+E+B     10^13   10^7
QUIET-II     2015    Cut-sky, high-res, T+E+B     10^14   10^7
CMBpol       2020+   All-sky, high-res, T+E+B     10^15   10^10

• 1000x increase in data volume over both the past and the future 15 years – linear analysis algorithms must scale through 10 past + 10 future Moore's Law doublings (1000x ≈ 2^10)!
CMB Data Analysis Evolution

Data volume & computational capability dictate the analysis approach.

Date        System     Data                    Map                            Power Spectrum
1997-2000   Cray T3E   B98 (x 700)             Explicit Maximum Likelihood    Explicit Maximum Likelihood
                                               (matrix invert – N_p^3)        (matrix Cholesky + tri-solve – N_p^3)
2000-2003   IBM SP3    B2K2 (x 3,000)          Explicit Maximum Likelihood    Explicit Maximum Likelihood
                                               (matrix invert – N_p^3)        (matrix invert + multiply – N_p^3)
2003-2007   IBM SP3    Planck SF (x 6,000)     PCG Maximum Likelihood         Monte Carlo
                                               (band-limited FFT – few N_t)   (Sim + Map – many N_t)
2007-2010   Cray XT4   Planck AF, EBEX         PCG Maximum Likelihood         Monte Carlo
                       (x 40,000)              (band-limited FFT – few N_t)   (SimMap – many N_t)
2010-2014   Cray XE6   Planck MC, PolarBear    PCG Maximum Likelihood         Monte Carlo
                       (x 150,000)             (band-limited FFT – few N_t)   (Hybrid SimMap – many N_t)

The first transition (explicit solves to PCG + Monte Carlo) changed the algorithms; the later ones changed their implementations.
Scaling In Practice

• 2000: BOOMERanG T-map
  – 10^8 samples → 10^5 pixels
  – 128 Cray T3E processors.
• 2005: Planck T-map
  – 10^10 samples → 10^8 pixels
  – 6,000 IBM SP3 processors.
• 2008: EBEX T/P-maps
  – 10^11 samples → 10^6 pixels
  – 15,360 Cray XT4 cores.
• 2010: Planck Monte Carlo, 1,000 noise T-maps
  – 10^14 samples → 10^11 pixels
  – 32,000 Cray XT4 cores.
Simulation & Mapping: Calculations

Given the instrument noise statistics & beams, a scanning strategy, and a sky:

1) SIMULATION: d_t = n_t + s_t = n_t + P_tp s_p
   – A realization of the piecewise-stationary noise time-stream:
     • pseudo-random number generation & FFT
   – A signal time-stream scanned & beam-smoothed from the sky map:
     • SHT
2) MAPPING: (P^T N^-1 P) d_p = P^T N^-1 d_t  (A x = b)
   – Build the RHS:
     • FFT & sparse matrix-vector multiply
   – Solve for the map:
     • PCG over FFT & sparse matrix-vector multiply
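The following is a minimal, self-contained sketch of this simulate-and-map cycle in Python/NumPy: a toy 1D "sky", a white + 1/f noise spectrum realized via FFT, N^-1 applied as a band-limited FFT (circulant) operator, and plain, unpreconditioned conjugate gradients in place of the full PCG solver. All sizes, spectra, and names are illustrative assumptions, not the production pipeline; beam smoothing and the SHT are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samp, n_pix = 2**16, 256

pix = rng.integers(0, n_pix, n_samp)   # toy pointing: which pixel each sample hits
sky = rng.normal(size=n_pix)           # input sky map s_p

# Noise power spectrum: white + 1/f (real noise is only piecewise stationary)
freq = np.fft.rfftfreq(n_samp)
psd = 1.0 + 0.01 / np.maximum(freq, freq[1])   # regularize f = 0

def sim_noise():
    """SIMULATION, noise part: PRNG + FFT realization of the PSD."""
    w = np.fft.rfft(rng.normal(size=n_samp))
    return np.fft.irfft(w * np.sqrt(psd), n=n_samp)

def apply_Ninv(x):
    """Apply N^-1 as a band-limited FFT (circulant) operator."""
    return np.fft.irfft(np.fft.rfft(x) / psd, n=n_samp)

def apply_A(m):
    """Apply A = P^T N^-1 P: scan the map, weight, bin back into pixels."""
    return np.bincount(pix, weights=apply_Ninv(m[pix]), minlength=n_pix)

# SIMULATION: d_t = n_t + P_tp s_p
d = sim_noise() + sky[pix]

# MAPPING: solve (P^T N^-1 P) d_p = P^T N^-1 d_t by conjugate gradients
b = np.bincount(pix, weights=apply_Ninv(d), minlength=n_pix)   # the RHS
m = np.zeros(n_pix)
r = b - apply_A(m)
p, rs = r.copy(), r @ r
for _ in range(100):
    Ap = apply_A(p)
    alpha = rs / (p @ Ap)
    m += alpha * p
    r -= alpha * Ap
    rs_new = r @ r
    if np.sqrt(rs_new) < 1e-8 * np.sqrt(b @ b):
        break
    p, rs = r + (rs_new / rs) * p, rs_new

print("map rms error:", np.std(m - sky))
```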
Simulation & Mapping: Scaling

• In theory such analyses should scale
  – perfectly to arbitrary numbers of cores (strong scaling – within an experiment).
  – linearly with the number of observations (weak scaling – between experiments).
• In practice this does not happen, because of
  – IO (reading pointing & writing time-streams; reading time-streams & writing maps)
  – communication (gathering maps from all processes)
  – calculation inefficiency (linear operations only, minimal data re-use)
• Ultimately it all comes down to delivering data to the cores fast enough.
• Code development has been an ongoing history of addressing this challenge anew with each new CMB data volume and HPC system/scale.
IO – Before

For each MC realization:
    Sim:  For each detector:
              read detector pointing
              write detector time-stream
    Map:  For all detectors:
              read detector time-streams & pointing
          write map

⇒ Read:  56 × Realizations × Detectors × Observations bytes
  Write:  8 × Realizations × (Detectors × Observations + Pixels) bytes

E.g. for Planck, read 500 PB & write 70 PB.
IO – Optimizations

• Read sparse telescope pointing instead of dense detector pointing
  – calculate individual detector pointing on the fly (see the sketch below).
• Remove the redundant write/read of time-streams between simulation & mapping
  – generate simulations on the fly, only when the map-maker requests data.
• Put the MC loop inside the map-maker
  – amortize common data reads over all realizations.
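A toy illustration of the first optimization, reconstructing detector pointing on the fly from sparse boresight pointing: linear interpolation of angles plus a fixed focal-plane offset. Real codes interpolate quaternions and handle azimuth wrapping, so treat every name and number here as an assumption.

```python
import numpy as np

def detector_pointing(t_sparse, az_sparse, el_sparse, t_samp, d_az, d_el):
    """Densify sparse telescope (boresight) pointing onto the detector sample
    times, then add this detector's fixed focal-plane offset."""
    az = np.interp(t_samp, t_sparse, az_sparse) + d_az
    el = np.interp(t_samp, t_sparse, el_sparse) + d_el
    return az, el

# e.g. boresight stored at 1 Hz, detectors sampled at 100 Hz
t_sparse = np.arange(0.0, 60.0, 1.0)
t_samp = np.arange(0.0, 59.0, 0.01)
az, el = detector_pointing(t_sparse,
                           np.radians(6.0 * t_sparse),               # slow azimuth scan
                           np.full(t_sparse.size, np.radians(45.0)), # constant elevation
                           t_samp, d_az=0.001, d_el=-0.002)
```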
IO – After

Read telescope pointing
For each detector:
    Calculate detector pointing
For each MC realization:
    SimMap:  For all detectors:
                 simulate time-stream
             Write map

⇒ Read:  24 × Sparse Observations bytes
  Write:  8 × Realizations × Pixels bytes

E.g. for Planck, read 2 GB & write 70 TB (10^8-fold read & 10^3-fold write compression).
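A back-of-the-envelope check of these figures, with assumed round numbers (about 10^4 realizations, 10^12 total samples, 10^9 pixels, and 10^8 sparse pointing samples for Planck); it reproduces the quoted 500 PB / 70 PB / 2 GB / 70 TB values to order of magnitude:

```python
realizations, samples, pixels, sparse = 1e4, 1e12, 1e9, 1e8  # assumed values

before_read  = 56 * realizations * samples             # ~6e17 B, i.e. ~500 PB
before_write = 8 * realizations * (samples + pixels)   # ~8e16 B, i.e. tens of PB
after_read   = 24 * sparse                             # ~2e9 B,  i.e. ~2 GB
after_write  = 8 * realizations * pixels               # ~8e13 B, i.e. tens of TB

print(f"read compression:  {before_read / after_read:.0e}")    # ~1e8
print(f"write compression: {before_write / after_write:.0e}")  # ~1e3
```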
Communication Details

• The time-ordered data from all the detectors are distributed over the processes, subject to:
  – load balance
  – common telescope pointing
• Each process therefore holds
  – some of the observations
  – for some of the pixels.
• In each PCG iteration, each process solves with its observations.
• At the end of each iteration, each process needs to gather the total result for all of the pixels it observed.
Communication Implementation

• Initialize a process & MPI task on every core
• Distribute the time-stream data & hence the pixels
• After each PCG iteration (see the sketch below):
  – each process creates a full map vector by zero-padding
  – call MPI_Allreduce(map, world)
  – each process extracts the pixels of interest to it & discards the rest
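A minimal mpi4py/NumPy sketch of this zero-pad + Allreduce gather (the HEALPix-style pixel count and the random "observed pixels" are illustrative assumptions; the production codes are compiled):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_pix_full = 12 * 512**2   # e.g. an nside = 512 all-sky map (assumed)

# Each process holds partial results for only the pixels its samples hit
rng = np.random.default_rng(comm.rank)
local_pix = np.unique(rng.integers(0, n_pix_full, 10000))
local_val = rng.normal(size=local_pix.size)

# Zero-pad the local result into a full map vector...
full = np.zeros(n_pix_full)
full[local_pix] = local_val

# ...sum over the world communicator...
comm.Allreduce(MPI.IN_PLACE, full, op=MPI.SUM)

# ...and keep only the pixels this process actually observes
local_val = full[local_pix]
```

Run with, e.g., mpirun -n 4 python allreduce_sketch.py. Note that every process communicates the full map vector regardless of how few pixels it observes, which is exactly the cost the following optimizations attack.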
Communication Challenge
Communication Optimizations

• Reduce the number of MPI tasks (hybridize)
  – use MPI only for off-node communication
  – use threads for on-node communication
• Minimize the total message volume (a sketch follows below)
  – determine the pixel overlap for every process pair: a one-time initialization cost.
  – if the total overlap data volume is smaller than a full map, use Alltoallv in place of Allreduce
    • typically Alltoallv for all-sky surveys, Allreduce for part-sky surveys
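A sketch of the Alltoallv variant, again in mpi4py/NumPy with illustrative sizes: the pixel overlaps are computed once up front, and thereafter each iteration exchanges only the shared pixels rather than reducing a full map.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nproc = comm.rank, comm.size
n_pix_full = 12 * 512**2

rng = np.random.default_rng(rank)
local_pix = np.unique(rng.integers(0, n_pix_full, 10000))  # sorted, unique
local_val = rng.normal(size=local_pix.size)

# One-time initialization: which of our pixels does each peer also observe?
all_pix = comm.allgather(local_pix)
send_idx = [np.nonzero(np.isin(local_pix, all_pix[p]))[0] for p in range(nproc)]
send_idx[rank] = send_idx[rank][:0]                  # no self-exchange
send_counts = np.array([ix.size for ix in send_idx])
recv_counts = np.array(comm.alltoall(send_counts.tolist()))

# Use Alltoallv only if the total overlap is smaller than a full map
if send_counts.sum() < n_pix_full:
    sdispl = np.zeros(nproc, int); sdispl[1:] = np.cumsum(send_counts)[:-1]
    rdispl = np.zeros(nproc, int); rdispl[1:] = np.cumsum(recv_counts)[:-1]
    sendbuf = np.concatenate([local_val[ix] for ix in send_idx])
    recvbuf = np.empty(recv_counts.sum())
    comm.Alltoallv([sendbuf, send_counts, sdispl, MPI.DOUBLE],
                   [recvbuf, recv_counts, rdispl, MPI.DOUBLE])
    # Accumulate each peer's contribution; both sides order shared pixels
    # ascending, so the received segments line up with our local masks
    total = local_val.copy()
    for p in range(nproc):
        if recv_counts[p]:
            seg = recvbuf[rdispl[p]:rdispl[p] + recv_counts[p]]
            total[np.isin(local_pix, all_pix[p])] += seg
```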
Communication Improvement
Communication Evaluation

• Which enhancement is more important, and is this system-dependent?
• Compare unthreaded vs. threaded, and Allreduce vs. Alltoallv, on Cray XT4, XT5 & XE6 systems over 200 – 16,000 cores:
  – Alltoallv wins at the low end
  – threading wins at the high end
Petascale Communication (I)
HPC System Evolution

• Clock speed is no longer able to maintain Moore's Law.
• Multi-core CPUs and GPGPUs are the two major approaches.
• Both will require performance modeling, experimentation & auto-tuning.
• E.g. NERSC's new XE6 system, Hopper:
  – 6,384 nodes
  – 2 Magny-Cours processors per node
  – 2 NUMA nodes per processor
  – 6 cores per NUMA node
• What is the best way to run hybrid code on such a system?
  – “wisdom” says 4 processes × 6 threads, to avoid NUMA effects (see the sketch below).
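A toy helper for reasoning about that choice, enumerating the exact decompositions of a 24-core node and flagging those whose threads fit inside a single 6-core NUMA domain (the node parameters come from the slide; the helper itself is an illustrative assumption, not a NERSC tool):

```python
cores_per_node, numa_domains = 24, 4              # Hopper XE6 node (from the slide)
cores_per_numa = cores_per_node // numa_domains   # = 6

for threads in range(1, cores_per_node + 1):
    if cores_per_node % threads:
        continue                                  # keep only exact decompositions
    ranks = cores_per_node // threads
    # A rank's threads stay inside one NUMA domain iff they divide 6 evenly
    aligned = cores_per_numa % threads == 0
    print(f"{ranks:2d} MPI ranks x {threads:2d} threads"
          + ("   <- NUMA-aligned" if aligned else ""))
```

Running it flags 24x1, 12x2, 8x3, and 4x6 as NUMA-aligned, with 4 ranks × 6 threads the conventional choice quoted above.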