1st IEEE Workshop on High-Performance Interconnection Networks towards the Exascale and Big-Data Era
Chicago, 8 September 2015

Modeling a Large Data-Acquisition Network in a Simulation Framework

Tommaso Colombo 1,2 • Holger Fröning 2 • Pedro Javier García 3 • Wainer Vandelli 1
1 Physics Department, CERN
2 Institut für Technische Informatik, Universität Heidelberg
3 Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha
Data-acquisition systems
● In a scientific experiment, a data-acquisition (DAQ) system handles the experimental signals
● Main functions:
  – Signal processing (e.g. analog-to-digital conversion)
  – Data gathering (collection of signals from different devices)
  – Filter / Trigger (discarding faulty / uninteresting data)
  – Storage
● Usually implemented as a mix of custom hardware and software running on commodity hardware
Data-acquisition systems
● Key requirement: DAQ efficiency
  – Fraction of correctly acquired experimental data
  – Ideally 100%: experimental data is precious!
  – An inefficient DAQ might introduce bias in the data
  ⬇
● Stringent requirements on:
  – System availability
  – Buffer depth
  – Latency
Data-acquisition systems
● Systematically studying the performance envelope of a DAQ system is difficult:
  – A DAQ system is a mission-critical component of an experiment
  – System availability for performance studies is limited
  – Hardware or system software modifications are usually not possible
● Simulation models give more freedom
  – Must be accurate enough in reproducing the system's behavior
  – Must be reasonably fast
Case study: the ATLAS experiment
Large-scale machine built to discover and study rare particle physics phenomena
Case study: the ATLAS experiment
Observes proton collisions delivered by the LHC accelerator at CERN
Case study: the ATLAS experiment
● Basic parameters:
  – LHC delivers a collision “event” every 25 ns (40 MHz)
  – Each event is separately detected and measured
  – An event corresponds to 1-2 MB
● The data-acquisition system incorporates a data filtering component:
  – If all collision events were acquired, ATLAS would produce up to 80 TB/s and hundreds of EB per year!
  – After two filtering stages, ~1/10000 events survive
  – Data is recorded at 1-4 GB/s
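A quick sanity check of these numbers (a minimal Python sketch; the ~10^7 s of LHC live time per year is an assumption, and the ~1 kHz final output rate is taken from a later slide):

```python
# Sanity check of the ATLAS data rates quoted above.
EVENT_RATE_HZ = 40e6           # one collision "event" every 25 ns
EVENT_SIZE_MB = 2.0            # upper end of the 1-2 MB range
LIVE_SECONDS_PER_YEAR = 1e7    # assumed LHC live time per year (not stated on the slide)

raw_tb_per_s = EVENT_RATE_HZ * EVENT_SIZE_MB / 1e6       # MB/s -> TB/s
yearly_eb = raw_tb_per_s * LIVE_SECONDS_PER_YEAR / 1e6    # TB -> EB

# The two filtering stages reduce the 40 MHz input to roughly 1 kHz (later slides);
# 1-2 MB events at ~1 kHz give the 1-4 GB/s recording rate quoted here.
FINAL_RATE_HZ = 1e3
recorded_gb_per_s = FINAL_RATE_HZ * EVENT_SIZE_MB / 1e3   # MB/s -> GB/s

print(f"raw rate:      {raw_tb_per_s:.0f} TB/s")      # -> 80 TB/s
print(f"yearly volume: {yearly_eb:.0f} EB")           # -> ~800 EB, i.e. hundreds of EB
print(f"recorded rate: {recorded_gb_per_s:.0f} GB/s") # -> ~2 GB/s, within 1-4 GB/s
```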
First stage: custom hardware
● Synchronous, pipelined electronics
● Selects and acquires 1/400 events
● 40 MHz input, 100 kHz output
● ~80 million input channels, aggregated into ~2000 outputs (“Event fragments”)
● Output is striped over ~100 “Readout” nodes with deep buffers
[Diagram: ATLAS DAQ/HLT data flow — readout channels (~80 million) → readout drivers (~1800) → Level-1 trigger → readout systems / readout buffers (~100) → High-Level Trigger farm (~2000 worker nodes with an HLT supervisor, per-node Data Collection Manager, and HLT processing units) → data loggers (~10) → permanent storage]
Second stage: distributed software
● Commodity hardware: ~10000 CPU cores in ~2000 worker nodes
● Events are processed in parallel, as soon as acquired by the first stage
● 100 kHz input, ~1 kHz output
[Diagram: same ATLAS DAQ/HLT data-flow diagram as on the previous slide]
Second stage: distributed software
● A “Supervisor” assigns events to free cores (“Processing Units”)
● Each Unit handles a different event:
  – Retrieves the event fragments
  – Decides if the event is to be kept
  – Avg time per event: 50 ms
● I/O is mediated by a per-node “Manager”
[Diagram: same ATLAS DAQ/HLT data-flow diagram as on the previous slide]
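A minimal sketch of the per-event flow described above; `supervisor`, `manager`, and `filter_event` are hypothetical stand-in interfaces, not the actual ATLAS software:

```python
# Sketch of one High-Level Trigger Processing Unit.
def processing_unit_loop(supervisor, manager, filter_event):
    """Handle one event at a time, as described on this slide."""
    while True:
        event_id = supervisor.next_event()      # Supervisor assigns an event to this free core
        fragments = manager.collect(event_id)   # per-node Manager retrieves the event fragments
        accept = filter_event(fragments)        # decide if the event is to be kept (~50 ms on average)
        supervisor.done(event_id, accept)       # report the decision; rejected events are cleared upstream
```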
Second stage: commodity hardware
● Datacenter technologies
● Two large 10GbE routers
  – Several hundred ports each
● Readout buffer nodes
  – 2x 10GbE links to each router
● Worker nodes organized in racks of 40 nodes each
  – One switch per rack
  – GbE links from nodes to switch
  – 10GbE links from switch to each core router
[Diagram: network topology — ~98 Readout Systems connected by 10 Gbps links (196 per router) to two core routers; 50 HLT racks of 40 nodes each, with 1 Gbps node-to-switch links and a 10 Gbps uplink from each rack switch to each router; the HLT Supervisor and ~10 Data Loggers are also attached to the routers]
DAQ traffic pattern
● Need to aggregate data from different instruments
  ➡ Communication pattern: many-to-one
● Data transfers are driven by the experimental conditions
  ➡ Bursty traffic
● In ATLAS:
  – Event fragments are striped over all the readout nodes
  – A processing unit needs fragments from multiple nodes at the same time
  – Many nodes will start sending fragments at the same time to the same destination, creating instantaneous network congestion
[Diagram: data funneling from the Readout Systems through a core router (10 Gbps links) into a rack switch and down a 1 Gbps link to an HLT node — a bandwidth mismatch at the last hop]
DAQ traffic pattern
● On a lossy network such as Ethernet, the DAQ traffic pattern leads to the incast pathology:
  – A client (the worker node) simultaneously receives short bursts of data from multiple sources (the readout nodes)
  – The switch buffers overflow
  – All the packets from one source are dropped
  – TCP congestion control mechanisms cannot prevent this
● A dramatic increase in data transfer latency is observed
  – Incast triggers slow TCP timeout-based retransmission
  – Causes under-utilization of computing power
  – Can lead to violating DAQ latency requirements
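A back-of-envelope sketch of the buffer overflow: a minimal model assuming a whole ~2 MB event arrives as a single synchronized burst over the 10 Gbps path and is drained at 1 Gbps (the buffer sizes are the ones quoted in the test setup later in this talk):

```python
# Why a synchronized fragment burst overflows a per-port switch buffer.
BURST_BYTES       = 2e6      # ~2 MB: all fragments of one event sent at once (simplification)
UPSTREAM_BW_BPS   = 10e9     # the burst reaches the rack switch at up to 10 Gbps
DOWNSTREAM_BW_BPS = 1e9      # it is drained towards the worker node at only 1 Gbps

# While the burst is arriving, the switch buffer fills at the rate difference.
backlog_bytes = BURST_BYTES * (1 - DOWNSTREAM_BW_BPS / UPSTREAM_BW_BPS)   # ~1.8 MB

PER_PORT_BUFFER_BYTES = 600e3   # per-port buffered switch (see the test setup below)
SHARED_BUFFER_BYTES   = 10e6    # shared-buffer switch

print(f"peak backlog: ~{backlog_bytes / 1e6:.1f} MB")
print("per-port buffer (600 kB):", "overflows -> drops" if backlog_bytes > PER_PORT_BUFFER_BYTES else "fits")
print("shared buffer (10 MB):   ", "overflows -> drops" if backlog_bytes > SHARED_BUFFER_BYTES else "fits")

# Dropped packets are recovered only by a TCP retransmission timeout (typically
# hundreds of ms), far longer than the ~16 ms needed to drain 2 MB at 1 Gbps --
# hence the dramatic latency increase.
```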
DAQ traffic pattern
● Simple incast mitigation strategy: client-side traffic shaping
  – Smoothing the rate of data requests limits the maximum size of the traffic bursts
● Key metric: data collection time
  – Time required to gather all fragments of an event
● Implementation in ATLAS:
  – Each worker node has a fixed number of credits available
  – Each requested fragment “costs” one credit
● Results:
  – Few traffic shaping credits: data collection time grows because the worker nodes cannot fully utilize the network bandwidth
  – Many traffic shaping credits: high latency due to incast
  – Optimal working point must be found manually
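A minimal sketch of such a credit-based shaper (illustrative only: the asyncio framing and the `request_fragment` callable are assumptions, not the ATLAS implementation):

```python
import asyncio

class CreditShaper:
    """Client-side traffic shaping: at most `credits` fragment requests in flight."""
    def __init__(self, credits: int):
        self._credits = asyncio.Semaphore(credits)   # fixed per-node pool of credits

    async def fetch(self, request_fragment, readout_node, event_id):
        async with self._credits:                    # each outstanding request costs one credit
            return await request_fragment(readout_node, event_id)

async def collect_event(shaper, request_fragment, readout_nodes, event_id):
    """Gather all fragments of one event and report the data collection time."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    fragments = await asyncio.gather(
        *(shaper.fetch(request_fragment, node, event_id) for node in readout_nodes))
    return fragments, loop.time() - start
```

Too few credits leave the 1 Gbps link idle; too many recreate the synchronized burst, which is why the working point has to be tuned by hand.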
Quantifying the problem
● Measurements in test system: one worker rack
● Synthetic traffic:
  – 2.1 MB events, assigned to Processing Units at 750 Hz
  ➡ 1.6 GB/s input
● Core routers have huge buffers ➡ no drops
● Two worker rack switches tested:
  – Per-port buffers (600 kB each)
  – Shared buffers (2x 10 MB)