LS1 Activities of the ATLAS Software Project
Markus Elsing
Report at the PH-SFT group meeting, December 9th, 2013
[cover image: reconstructed event in the Phase-2 tracker]
Introduction and Outline
• the challenges
➡ pileup drives resource needs
• not only in Tier-0
➡ GRID “luminosity” is limited
• full simulation is costly
➡ physics requires increasing the rate
• Run-2 data-taking rate 1 kHz (?)
➡ technologies are evolving fast
• software needs to follow
➡ support detector upgrade studies
• not covered in this talk
[figures: GRID CPU consumption pie chart (MC Simulation, MC Reconstruction, Data Reconstruction, Group Production, Group Analysis, Final Analysis, Others); reconstruction CPU vs pileup for LHC@25 ns and LHC@50 ns]
• outline of the talk
1. work of the Future Software Technologies Forum (FSTF)
2. algorithmic improvements
3. the Integrated Simulation Framework (ISF) for Run-2
4. new Analysis Model for Run-2
5. goals and plans for Data Challenge-14 (DC-14)
6. completion of the LS1 program for the restart of data taking
Evolution of WLCG Resources
• upgrades of existing centers
➡ additional resources expected mainly from advancements in technology (CPU or disk)
➡ will not match additional needs in coming years
• today’s infrastructure
➡ x86 based, 2-3 GB per core, commodity CPU servers
➡ applications running “event” parallel on separate cores
➡ jobs are sent to the data to avoid transfers
• technology is evolving fast
➡ network bandwidth is the fastest growing resource
• data transfer to remote jobs is less of a problem
• strict MONARC model no longer necessary
• flexible data placement with data-popularity driven replication, remote I/O and storage federations
➡ modern processors: vectorization of the applications and optimization for data locality (avoid cache misses)
➡ “many core” processors like Intel Phi (MIC) or GPGPUs
• much less memory per core!
[figures: WLCG disk growth (PB) and CPU growth (kHS06) for CERN, Tier-1 and Tier-2, 2008-2020, with 2008-12 linear extrapolations; photo of an Intel Phi card]
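The point about vectorization and data locality can be made concrete with a small sketch. The following is illustrative C++ only (not ATLAS code; the HitAoS/HitsSoA names and fields are invented): the same hit payload stored as an array-of-structures and as a structure-of-arrays. The SoA layout keeps each quantity contiguous in memory, which lets the compiler auto-vectorize simple loops and reduces cache misses.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Array-of-structures: one hit per struct, fields interleaved in memory.
struct HitAoS { double x, y, z, charge; };

// Structure-of-arrays: each field contiguous, friendly to SIMD units and caches.
struct HitsSoA { std::vector<double> x, y, z, charge; };

double sumChargeAoS(const std::vector<HitAoS>& hits) {
    double sum = 0.0;
    for (const auto& h : hits) sum += h.charge;      // strided loads (32-byte stride)
    return sum;
}

double sumChargeSoA(const HitsSoA& hits) {
    double sum = 0.0;
    for (std::size_t i = 0; i < hits.charge.size(); ++i)
        sum += hits.charge[i];                       // unit-stride loads, easy to auto-vectorize
    return sum;
}

int main() {
    std::vector<HitAoS> aos(1000, HitAoS{1.0, 2.0, 3.0, 0.5});
    HitsSoA soa;
    soa.x.assign(1000, 1.0); soa.y.assign(1000, 2.0);
    soa.z.assign(1000, 3.0); soa.charge.assign(1000, 0.5);
    std::printf("AoS sum = %f, SoA sum = %f\n", sumChargeAoS(aos), sumChargeSoA(soa));
    return 0;
}
```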
High Performance Computing in ATLAS
• infrastructure is getting heterogeneous
➡ mostly opportunistic usage of additional resources
• commercial Cloud providers (e.g. Google, Amazon)
• free CPU in High Performance Computing centers
➡ big HPC centers outperform the WLCG in CPU
• x86, BlueGene, NVIDIA GPUs, ARM, ...
➡ GRID (ARC middleware) or Cloud (OpenStack) interfaces
• suitable applications
➡ CPU hungry with low data throughput
• physics generators or detector simulation
➡ x86 based systems
• small overhead to migrate applications
➡ GPU based systems
• complete rewrite necessary (so far) or dedicated code
• ATLAS (ADC) working group to evaluate HPC opportunities
➡ first successful test productions on commercial clouds and HPC clusters
[photos: SuperMUC (München), NVIDIA GPU]
Future Software Technologies Forum
• coordinates all technology R&D efforts in ATLAS
➡ drives ATLAS developments on vectorization and parallel programming
• examples: AthenaMP, AthenaHive, Eigen, VDT/libimf, ...
• studies of compilers, allocators, auto-vectorization, ...
• explores new languages (ISPC, Cilk Plus, OpenMP 4, etc.)
➡ forum for R&D on GPGPUs and other co-processors
• algorithm development, sharing experience, identifying successful strategies
• gaining experience on ARM and Intel Phi
➡ pool of experienced programmers
• educating the development community
➡ software optimization with profiling tools (together with the PMB)
• tools like: perfmon, gperftools, GoODA
• code optimization and identification of hot spots in ATLAS applications
• examples: b-field access, z-finder in the HLT, optimizing neural nets
• liaison with the Concurrency Forum and OpenLab
➡ integration of ATLAS efforts into LHC-wide activities
AthenaMP (Multi-Process) (V. Tsulaia)
• not a new development, but not yet in production
➡ event-parallel processing, aim to share memory (see GaudiMP)
➡ successful simulation, digitization and reconstruction tests recently
• still issues with I/O, e.g. on EOS
➡ goal is to put AthenaMP in full production by ~ this summer
[figure: memory sharing between worker processes]
• next version of AthenaMP improves GRID integration
➡ including the new “event service” I/O model in ProdSys-2
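As an illustration of the multi-process model described above, here is a minimal sketch (not AthenaMP itself; all names and numbers are invented) of fork-based event-parallel workers: the parent initializes once, then forks workers that inherit the initialized state via copy-on-write pages, so read-only data is physically shared while events are processed in parallel.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

// Illustrative sketch of the fork/copy-on-write pattern behind a
// multi-process, event-parallel job: initialize once, fork workers,
// let each worker process a disjoint share of the events.
int main() {
    const int nWorkers = 4;
    const int nEvents  = 100;

    // ... expensive common initialization (geometry, conditions, ...) would
    //     happen here, before the fork, so its memory is shared read-only ...

    std::vector<pid_t> workers;
    for (int w = 0; w < nWorkers; ++w) {
        pid_t pid = fork();
        if (pid == 0) {                        // child: process its share of events
            for (int evt = w; evt < nEvents; evt += nWorkers)
                std::printf("worker %d processing event %d\n", w, evt);
            _exit(0);
        }
        workers.push_back(pid);
    }
    for (pid_t pid : workers)                  // parent: wait for all workers
        waitpid(pid, nullptr, 0);
    return 0;
}
```

A production system like AthenaMP of course adds much more on top of this pattern (worker bootstrap, event distribution, output handling), which the sketch deliberately omits.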
AthenaHive Testbed (C. Leggett)
• based on the GaudiHive project
➡ model is multi-threading at the algorithm level (DAG scheduling)
➡ demonstrator study using calorimeter reconstruction
• factor 3.3 speedup w.r.t. sequential running (on more cores), 28% more memory
[figures: calorimeter testbed dataflow (CaloCellMaker, CmbTowerBldr, CaloTopoCluster, CaloClusterMakerSWCmb, CaloCell2TopoCluster, StreamESD) with per-algorithm timings; memory usage and timing for 100 events — 1 store/1 algorithm: 523 MB, 316 s; 3 stores/3 algorithms: 607 MB, 161 s; 3 stores/5 algorithms: 618 MB, 134 s; 4 stores/4 algorithms: 667 MB, 129 s]
• still a long way to go
➡ all framework services need to support multi-threading
➡ making ATLAS services, tools and algorithms thread-safe, adapting the configuration
➡ in the demonstrator we see the limits of the DAG approach (Amdahl’s law at play)
• work on Hive is a necessary step towards the final multi-threading goal
• need parallelism at all levels (especially for the tracking algorithms)
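To make the algorithm-level DAG model concrete, here is a minimal sketch using the Intel TBB flow graph (the task library GaudiHive builds on); it is not Gaudi/Athena code. The node names echo the testbed diagram above, but the bodies are trivial placeholders and the dependency edges are only an assumed example of such a dataflow.

```cpp
#include <tbb/flow_graph.h>
#include <iostream>

int main() {
    using namespace tbb::flow;
    graph g;

    // Each node stands in for one reconstruction algorithm; edges encode the
    // data dependencies, so independent algorithms can run concurrently.
    continue_node<continue_msg> cellMaker(g, [](continue_msg) {
        std::cout << "CaloCellMaker\n"; return continue_msg(); });
    continue_node<continue_msg> towerBldr(g, [](continue_msg) {
        std::cout << "CmbTowerBldr\n"; return continue_msg(); });
    continue_node<continue_msg> topoCluster(g, [](continue_msg) {
        std::cout << "CaloTopoCluster\n"; return continue_msg(); });
    continue_node<continue_msg> swCmb(g, [](continue_msg) {
        std::cout << "CaloClusterMakerSWCmb\n"; return continue_msg(); });
    continue_node<continue_msg> cell2Topo(g, [](continue_msg) {
        std::cout << "CaloCell2TopoCluster\n"; return continue_msg(); });

    make_edge(cellMaker, towerBldr);     // towers need cells
    make_edge(cellMaker, topoCluster);   // topo-clustering needs cells
    make_edge(towerBldr, swCmb);         // sliding-window clusters need towers
    make_edge(topoCluster, cell2Topo);   // cell-to-cluster links need topo clusters

    cellMaker.try_put(continue_msg());   // kick off one "event"
    g.wait_for_all();                    // block until the whole graph has run
    return 0;
}
```

Once CaloCellMaker finishes, the tower-building and topo-clustering branches execute concurrently, which is the source of the observed speedup; the remaining strictly serial chains are what Amdahl’s law penalizes.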
Current Tracking Software Chain
• tracking is the resource driver in reconstruction
➡ current software is optimized for early rejection
• avoid combinatorial overhead as much as possible!
➡ early rejection requires strategic candidate processing and hit removal
• not a heavily parallel approach, it is a SEQUENTIAL approach!
➡ good scaling with pileup (factor 6-8 for 4 times the pileup), but the absolute cost is still catastrophic
• implications for making it heavily parallel?
➡ Amdahl’s law at work: t_∥ = P/N + S
• the current strategy has a small parallel part P, while it is heavy on the sequential part S
➡ hence: if we want to gain from a large number of threads N, we need to reduce S
• compromise on early rejection, which means more combinatorial overhead
• as a result, we will spend more CPU if we go parallel
➡ this only makes sense if we use additional processing power that otherwise would not be usable! (many-core processors)
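Written out, the scaling argument is the standard Amdahl's law; the numbers below are purely illustrative and not from the talk.

```latex
t_\parallel(N) = S + \frac{P}{N},
\qquad
\mathrm{speedup}(N) = \frac{S + P}{S + P/N},
\qquad
\lim_{N \to \infty} \mathrm{speedup}(N) = 1 + \frac{P}{S}.
```

With the runtime normalized to S + P = 1, a chain that is 80% sequential (S = 0.8) can never gain more than a factor 1.25, no matter how many threads are used; shrinking the sequential part to S = 0.2, even at the price of extra combinatorial work, raises the ceiling to a factor 5. This is why relaxing early rejection can pay off, but only on otherwise idle many-core resources.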
Tracking Developments during LS1
• work on technology to improve the CURRENT algorithms
➡ modified track seeding to exploit the 4th pixel layer
➡ Eigen migration - faster vector+matrix algebra
➡ use of vectorized trigonometric functions (VDT, Intel libimf)
➡ F90 to C++ migration of the b-field (speed improvement in Geant4 as well)
➡ simplified EDM design, less OO (which was the “hip” thing 10 years ago)
➡ xAOD: a new analysis EDM, maybe more... (may allow for data locality)
• work will continue beyond this, examples:
➡ (auto-)vectorize the Runge-Kutta propagator, the fitter, etc. and take full benefit from Eigen
➡ use only the curvilinear frame inside the extrapolator
➡ faster tools like a reference Kalman filter
➡ optimized seeding strategy for high pileup
• hence, a mix of SIMD and algorithm tuning
• may give us a factor 2 (maybe more...)
➡ further speedups probably require “new” thinking
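As an illustration of what the Eigen-based algebra looks like, here is a minimal, self-contained sketch of a single Kalman filter update step for a 5-parameter track state and a 2D measurement, using Eigen fixed-size matrices so the expressions can be unrolled and vectorized at compile time. It is not the ATLAS reference Kalman fitter; all numbers and the measurement model are invented for the example.

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
    using Vector5  = Eigen::Matrix<double, 5, 1>;
    using Matrix5  = Eigen::Matrix<double, 5, 5>;
    using Vector2  = Eigen::Matrix<double, 2, 1>;
    using Matrix2  = Eigen::Matrix<double, 2, 2>;
    using Matrix25 = Eigen::Matrix<double, 2, 5>;

    Vector5 x = Vector5::Zero();            // predicted track parameters
    Matrix5 C = Matrix5::Identity() * 0.1;  // predicted covariance
    Matrix25 H = Matrix25::Zero();          // projection onto the measured coordinates
    H(0, 0) = 1.0;
    H(1, 1) = 1.0;
    Vector2 m(0.02, -0.01);                 // measurement (e.g. local x, y)
    Matrix2 V = Matrix2::Identity() * 1e-4; // measurement covariance

    // Gain matrix and filtered state/covariance (standard gain formalism).
    Eigen::Matrix<double, 5, 2> K =
        C * H.transpose() * (V + H * C * H.transpose()).inverse();
    Vector5 xFiltered = x + K * (m - H * x);
    Matrix5 CFiltered = (Matrix5::Identity() - K * H) * C;

    std::cout << "filtered parameters: " << xFiltered.transpose() << "\n";
    std::cout << "filtered cov(0,0):   " << CFiltered(0, 0) << "\n";
    return 0;
}
```

The same fixed-size pattern applies to the Jacobian and covariance transport in the Runge-Kutta propagator mentioned above.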
Improved Physics Performance
• algorithms are an essential part of the LS1 development work, examples:
➡ improved topo-clustering for calorimeter showers
➡ new tau reconstruction exploiting decay substructure (e.g. τ+ → π+ π0 ν)
➡ new jet and missing E_T software, improved pileup stability
➡ particle flow jets
• software for Phase-0 upgrades
➡ full inclusion of the IBL in track reconstruction
➡ emulation of the FTK in the trigger simulation chain (next slide)
[figures: identifying substructure in tau decays (π+/π0 deposits in EM1/EM2 and HCAL, e+e- conversions, tracking inefficiency); CATIA drawing of the ATLAS IBL staves, module flexes and PP0/PP1 services (not yet finalized)]