DAQ algorithms on CPUs
Philip Rodrigues, University of Oxford
May 24, 2018
Introduction
◮ Work done in the context of comparing the processing resources needed for CPU, GPU and FPGA
◮ I used the simplest possible pedestal subtraction, noise filtering and hit finding on FD MC, to get a “lower bound” on the resources needed
◮ I’ll try to concentrate more on the algorithms than the performance today
Back-of-envelope calculation
◮ Collection wire samples/s/APA = 2e6 × 960 ≈ 2e9
◮ On a 2 GHz CPU, that gives us 1 clock cycle per sample if we want to handle 1 APA
◮ But all machines today have multiple cores, and we have single-instruction-multiple-data (SIMD)
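The budget above can be written down explicitly. This is a minimal sketch of the slide's arithmetic; the 2 MHz sampling rate, 960 collection wires/APA and 2 GHz clock are from the slides, while the function names are mine:

```cpp
#include <cassert>

// Samples per second produced by the collection wires of one APA:
// 2e6 samples/s/wire * 960 collection wires = 1.92e9 samples/s
constexpr double samplesPerSecPerAPA() { return 2e6 * 960; }

// Clock cycles available per sample, given the CPU clock, the number of
// cores, and the number of SIMD lanes (samples processed per instruction)
constexpr double cyclesPerSample(double cpuHz, int nCores, int simdLanes)
{
    return cpuHz * nCores * simdLanes / samplesPerSecPerAPA();
}
```

With one scalar core at 2 GHz this gives just over 1 cycle per sample; 16-wide SIMD on N cores multiplies the budget by 16N, matching the next slide.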
SIMD
Credit: Decora at English Wikipedia, CC Attribution-Share Alike 3.0
◮ Act on multiple values simultaneously in one instruction
◮ Machines I can access have AVX2 with 256-bit registers, ie 16 16-bit numbers at a time
◮ Now our back-of-envelope looks better: 16N clock cycles per sample with N cores
◮ Just got access to a system at CERN with AVX-512
How I use SIMD
[register layout: first register holds (ch0,t0) (ch1,t0) (ch2,t0) … (ch15,t0); next register holds (ch0,t1) (ch1,t1) (ch2,t1) … (ch15,t1)]
◮ A register holds the samples for 16 channels at the same time tick
◮ Makes operations on adjacent ticks in the same channel easy
◮ (Makes operations on adjacent channels in the same tick hard)
◮ (Incidentally this is the opposite of the order I store the input in memory, but it doesn’t seem to hurt too much?)
Extracting waveforms
◮ Much easier to work on this outside larsoft, so I extracted waveforms from non-zero-suppressed MC using gallery (http://art.fnal.gov/gallery/)
◮ Converted waveforms to a text format: simple to import/plot from C++ and python
Raw waveforms
[plot: ADC counts (≈400–600) vs time tick (0–4000) for several channels]
◮ Some channels from SN MC which I selected because they look nice
◮ Reminder: FD MC noise is low, no coherent noise
Step 1: pedestal finding
[plot: waveforms with pedestal estimate overlaid]
◮ Intuition: the pedestal is the median of the waveform
◮ “Frugal streaming”¹ gives an approximation that converges to the median:
1. Start with an estimate of the median, read the next sample
2. If sample > median, increase median by 1
3. If sample < median, decrease median by 1
◮ Unfortunately this “follows” hits too much, so try a modification. . .
¹ https://arxiv.org/abs/1407.1121
Step 1: modified pedestal finding
[plot: waveforms with modified pedestal estimate overlaid]
1. Start with an accumulator = 0 and an estimate of the median; read the next sample
2. If sample > median, increase accumulator by 1
3. If sample < median, decrease accumulator by 1
4. If accumulator = X, increase median by 1 and reset the accumulator to 0
5. If accumulator = −X, decrease median by 1 and reset the accumulator to 0
◮ I used X = 10 because it was the first number I thought of
◮ Larger values of X mean you follow hits less, but also respond less to real changes in the pedestal. For serious work this would need some investigation
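The steps above can be sketched in scalar C++ as follows. This is a simplified single-channel version (the talk's real code processes 16 channels at once with AVX2); the function name and the choice to seed the median from the first sample are my assumptions:

```cpp
#include <cstdint>
#include <vector>

// Modified "frugal streaming" pedestal finder: the running median estimate
// only moves after X consecutive-majority votes, so it follows hits less.
// Returns the pedestal estimate at every tick.
std::vector<int16_t> findPedestal(const std::vector<int16_t>& samples, int X = 10)
{
    std::vector<int16_t> pedestal;
    pedestal.reserve(samples.size());
    int16_t median = samples.empty() ? 0 : samples[0]; // initial estimate (assumption)
    int accum = 0;
    for (int16_t s : samples) {
        if (s > median) ++accum;   // step 2
        if (s < median) --accum;   // step 3
        if (accum ==  X) { ++median; accum = 0; } // step 4
        if (accum == -X) { --median; accum = 0; } // step 5
        pedestal.push_back(median);
    }
    return pedestal;
}
```

For example, a 30-tick hit 100 ADC above a flat pedestal only drags the estimate up by 30/X = 3 counts before it recovers.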
Step 2: noise filtering
[plot: filtered waveforms vs time tick; vertical scale now ±10000]
◮ I used a simple FIR lowpass filter (= a discrete convolution with a fixed kernel)
◮ Hardcoded filter size (7 taps), unrolled inner loop. Lowpass filter with cutoff at 0.1 of the Nyquist frequency
◮ I’m using integer coefficients, which is why the scale changed
◮ Probably need a bigger filter for more realistic noise
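A scalar sketch of such a 7-tap integer FIR filter is below. The coefficients here are illustrative placeholders, not the talk's actual taps; any symmetric lowpass kernel quantised to integers works the same way, and the integer tap sum (here 99) is why the output scale grows:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 7-tap FIR lowpass filter with integer coefficients (placeholder values).
// Output sample i is the dot product of the taps with input samples i-6..i,
// so the output is shorter than the input by 6 samples.
std::vector<int32_t> firFilter(const std::vector<int16_t>& x)
{
    static const int16_t taps[7] = {2, 9, 23, 31, 23, 9, 2}; // sum = 99
    std::vector<int32_t> y;
    if (x.size() < 7) return y;
    y.reserve(x.size() - 6);
    for (size_t i = 6; i < x.size(); ++i) {
        int32_t acc = 0;
        for (int k = 0; k < 7; ++k)  // the talk's version unrolls this loop
            acc += taps[k] * x[i - k];
        y.push_back(acc);
    }
    return y;
}
```

A constant input of value v comes out as 99·v, illustrating the rescaling mentioned on the slide.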
Step 3: hit finding
[plot: filtered waveforms with found hits marked]
◮ Algorithm: the first sample over a fixed threshold starts a hit. Integrate time and charge until the waveform falls below the threshold again
◮ Possible refinements: make the threshold depend on the pedestal RMS; require a number of samples above threshold; emit multiple primitives for long time-over-threshold
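A minimal scalar sketch of this threshold-crossing hit finder follows. The Hit struct and its field names are my invention; only the start tick, time over threshold and summed charge are recorded, as on the slide:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A found hit: where it started, how long it stayed over threshold,
// and the integrated charge while over threshold
struct Hit {
    size_t  startTick;
    size_t  timeOverThreshold;
    int64_t charge;
};

// The first sample over threshold starts a hit; integrate until the
// waveform falls below threshold again
std::vector<Hit> findHits(const std::vector<int32_t>& wf, int32_t threshold)
{
    std::vector<Hit> hits;
    bool inHit = false;
    for (size_t t = 0; t < wf.size(); ++t) {
        if (wf[t] > threshold) {
            if (!inHit) { hits.push_back({t, 0, 0}); inHit = true; }
            hits.back().timeOverThreshold++;
            hits.back().charge += wf[t];
        } else {
            inHit = false;
        }
    }
    return hits;
}
```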
Benchmarking results summary
◮ Tested with a chunk of collection channel MC large enough to not fit in cache
◮ With about 4 threads, the multicore CPU I tested on can keep up with 1 APA worth of data
◮ More details in the backups
Extensions/TODOs for benchmarking
◮ Consider more realistic input data, like from the electronics:
  ◮ Samples are 12-bit numbers, not 16-bit
  ◮ Ordering of channels is different
  ◮ Input is 8b/10b encoded
◮ Run on a different machine with more cores, no virtualization
◮ Time individual steps, vary parameters (eg number of taps)
◮ Check the distribution of timings (eg, do we occasionally get very long times?)
◮ Eventually, test more complex algorithms
◮ Stream data into memory, eg using a GPU (idea from Babak)
Algorithm extensions
◮ Deal with coherent noise somehow. Eg the MicroBooNE technique: subtract the median of a group of channels at the same tick
◮ MicroBooNE also has “harmonic” noise at fixed frequencies, which would require a large FIR filter to deal with. Not sure if there is another technique available
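The MicroBooNE-style median subtraction could be sketched like this. The interface (one channel group at a time, in place) and the wf[channel][tick] layout are my assumptions, not the talk's:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Coherent noise removal sketch: for each tick, subtract the median of the
// samples across one group of channels. wf[ch][t] is channel ch at tick t.
void subtractCoherent(std::vector<std::vector<int16_t>>& wf)
{
    if (wf.empty()) return;
    const size_t nticks = wf[0].size();
    std::vector<int16_t> tmp(wf.size());
    for (size_t t = 0; t < nticks; ++t) {
        // Gather this tick's sample from every channel in the group
        for (size_t ch = 0; ch < wf.size(); ++ch) tmp[ch] = wf[ch][t];
        // Partial sort is enough to find the median
        std::nth_element(tmp.begin(), tmp.begin() + tmp.size() / 2, tmp.end());
        const int16_t med = tmp[tmp.size() / 2];
        for (size_t ch = 0; ch < wf.size(); ++ch) wf[ch][t] -= med;
    }
}
```

Note this needs samples from many channels at the same tick, which is the "hard" access pattern for the channel-major SIMD layout described earlier.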
Physics performance studies
◮ We need to understand how well any given algorithm performs, especially in the presence of more realistic noise. I haven’t done this at all
◮ This also needs a more serious noise model (which doesn’t have to be in larsoft: it can be standalone, glued to the larsoft signal simulation)
Backup slides
Detour: memory hierarchy and bandwidth
https://software.intel.com/en-us/articles/memory-performance-in-a-nutshell
◮ Main memory bandwidth sets an upper limit on how much data we can process
◮ 100 GB/s is more than enough to handle 1 or 2 APAs
Measuring memory bandwidth
◮ Can we actually achieve this memory bandwidth?
◮ Used the STREAM benchmark², which effectively just does memcpy
◮ Ran on dunegpvm01. With 1 thread, get ∼10 GB/s; with 4 threads, get ∼35 GB/s (17.5 GB/s in + 17.5 GB/s out)
² https://github.com/jeffhammond/STREAM, http://www.cs.virginia.edu/stream/ref.html
Strategy details
◮ Run on DUNE FD detector MC (it’s all I’ve got. . . )
◮ Use the simplest algorithms I can think of
◮ Use only collection channels, all calculations with short integers (16 bits)
◮ Write simple non-SIMD code to check the results
◮ SIMD code written in C++ using “intrinsic” functions
◮ Nicer interfaces exist, though I haven’t tried them. Eg http://quantstack.net/xsimd, http://agner.org/optimize/#vectorclass
What code with intrinsics looks like

// s holds the samples for 16 channels at the same tick.
// This whole block achieves the following:
//   if the sample s is greater than the median, add one to accum
//   if the sample s is less than the median, subtract one from accum
// For reasons that I don't understand, there's no cmplt intrinsic
// for "compare less-than", so we have to compare greater, compare
// equal, and take everything else to be compared less-than.
// 'epi16' is a type marker for '16-bit signed integer'.

// Create masks for which channels are >, == the median
__m256i is_gt = _mm256_cmpgt_epi16(s, median);
__m256i is_eq = _mm256_cmpeq_epi16(s, median);

// The value we add to the accumulator in each channel: start at -1,
// overwrite with +1 where s > median and with 0 where s == median
__m256i to_add = _mm256_set1_epi16(-1);
// Really want an epi16 version of blendv, but the cmpgt and cmpeq
// intrinsics set their epi16 lanes to 0xffff or 0x0, so treating
// everything as epi8 works the same
to_add = _mm256_blendv_epi8(to_add, _mm256_set1_epi16(1), is_gt);
to_add = _mm256_blendv_epi8(to_add, _mm256_set1_epi16(0), is_eq);

// Actually do the adding
accum = _mm256_add_epi16(accum, to_add);
Test details
◮ Use DUNE FD MC waveforms, as seen above
◮ Using 4492 samples × 1835 collection wires × 16 repeats = 69 APA·ms (since 960 collection wires/APA)
◮ Using short int (2 bytes) for samples, so the size is 252 MB (big enough to not fit in cache)
◮ Start with this data in memory, ordered like: (c0,t0), (c0,t1), … (c0,tN), (c1,t0), (c1,t1), …
◮ Loop over the data, and store the output hits (not the intermediate steps). Repeat 10 times, take the average
◮ Timing doesn’t include putting the input data in memory, allocating the output buffer, or compacting the output hits from the SIMD code
◮ Ran with multiple threads. Each thread gets a contiguous block of 16N channels to deal with
◮ No time chunking: all 4492 ticks get processed at once
System under test: 2 (system 1 in backups)
lscpu:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
CPU MHz:               1200.281
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
Timing results: system 2

Threads          1       2       4      8     16     32     64
Non-SIMD
  ms          1322.3   667.1   347.5  195.0  110.4   80.6   86.3
  APA/server     0.1     0.1     0.2    0.4    0.6    0.9    0.8
  GB/s           0.4     0.7     1.4    2.5    4.4    6.1    5.7
SIMD
  ms           124.9    76.0    48.4   24.3   14.9   10.9   11.6
  APA/server     0.6     0.9     1.4    2.8    4.6    6.3    5.9
  GB/s           3.9     6.5    10.1   20.2   33.1   45.0   42.3

◮ Apologies for the gigantic table: I’ve highlighted the most interesting values
◮ “APA/server” is just the ratio of “APA·ms data processed” (69, in this case) to ms elapsed, so take it with a pinch of salt beyond observing whether it’s greater than 1
◮ ie 2–4 cores can keep up with the data from one APA
◮ There’s a few 10s of % variation between runs in these numbers