Interactive NanoAOD analysis
Nick Amin
Aug 19, 2019
Introduction
⚫ Condor jobs have a lot of overhead
  • Transferring and setting up the environment for every job
  • Scheduling/internal condor overhead
⚫ Condor jobs are not interactive
  • Designed for monolithic background processing
  • Job status through condor_q spam
⚫ Interactive login node use will not really be "interactive" anymore
  • Several versions (e.g., years/periods) of ntuples → everyone turning to parallel looping to make histograms on login nodes
Introduction
⚫ Goal: Enable faster interactive analysis with NanoAOD using a message broker and HTCondor
  • Preliminary code in https://github.com/aminnj/redis-htcondor
⚫ Not a new concept! There are lots of tools designed to do this more generally (dynamic task execution with brokered communication):
  • htmap, parsl, dask, celery, ray, airflow, dramatiq, …
  • Limitations on open ports/communication within/outside the batch system mean it's difficult to use these out of the box
⚫ Wrote my own simple task queue system based on redis to allow low-level customization to suit our use case (HTCondor, data locality, caching, compressed communication, hardware, hadoop, …)
  • Jobs are "embarrassingly parallel", so there's no real need for dynamic logic, inter-worker communication, DAGs, etc.
  • This relies on exactly one redis master server (one single public-facing ip/port)
Setup
1. First, the user submits N condor jobs (each condor job is 1 "worker") which listen to a FIFO task queue of the redis server (the broker)
  • There are actually N+1 task queues: 1 general ("tasks") and N specific ("tasks:worker1" … "tasks:workerN")
  • Workers listen to the general task queue and also the one specific to themselves
  • In this way, we can send targeted tasks to specific workers
2. User communicates workload/tasks to the redis server
3. Workers perform the work (reading files from hadoop) and push their output into results queues
4. User checks the results queues and combines/reduces
[Diagram: user on the UAF submits condor jobs and sends tasks to the redis broker; workers pull tasks, read hadoop files, and push into per-worker results queues; a task = (function, arguments); payloads are compressed/decompressed with lz4]
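A minimal sketch of the worker side of this setup, using the real redis-py, lz4, and cloudpickle packages. The general queue name "tasks" and the per-worker "tasks:workerN" queues come from the slide; the "results:<worker>" queue names, the redis URL, and the (function, arguments) payload format are assumptions for illustration, not the exact redis-htcondor implementation.

```python
import socket

import cloudpickle
import lz4.frame
import redis

REDIS_URL = "redis://uaf-10.t2.ucsd.edu:6379"   # hypothetical broker host/port
worker_name = socket.gethostname()              # stand-in for "workerN"

r = redis.Redis.from_url(REDIS_URL)

while True:
    # Block until a task appears on either the general queue or this worker's queue
    queue, payload = r.brpop(["tasks", f"tasks:{worker_name}"])
    # Tasks are lz4-compressed, cloudpickled (function, arguments) tuples
    func, args = cloudpickle.loads(lz4.frame.decompress(payload))
    result = func(args)
    # Push the compressed result onto this worker's results queue for the user to collect
    r.lpush(f"results:{worker_name}", lz4.frame.compress(cloudpickle.dumps(result)))
```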
Simple example
⚫ Someone hosts a redis server (I have one on uaf-10 right now)
⚫ User first externally submits 30 jobs to condor, each of which is a worker
⚫ User writes a function and has the option to map it over a list of arguments locally or remotely
⚫ Very low overhead — 40ms to send tasks to 10 workers and get the results back
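The user-side counterpart to the worker loop above would look roughly like this: a remote_map-style helper that fans tasks out over the general queue and collects the results. This is a sketch of the mechanics, not the actual remote_map in the repo; the queue names, worker list, and redis URL are assumptions.

```python
import cloudpickle
import lz4.frame
import redis

r = redis.Redis.from_url("redis://uaf-10.t2.ucsd.edu:6379")  # hypothetical broker
workers = [f"worker{i}" for i in range(1, 31)]               # the 30 condor workers

def remote_map(func, list_of_args):
    # Fan out one task per argument onto the general task queue
    for args in list_of_args:
        r.lpush("tasks", lz4.frame.compress(cloudpickle.dumps((func, args))))
    # Collect one result per task from the per-worker results queues
    result_queues = [f"results:{w}" for w in workers]
    results = []
    while len(results) < len(list_of_args):
        _, payload = r.brpop(result_queues)
        results.append(cloudpickle.loads(lz4.frame.decompress(payload)))
    return results

# e.g. square some numbers on the workers
print(remote_map(lambda x: x * x, list(range(10))))
```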
Dimuon example
⚫ Take a 200M event subset of Run 2 /DoubleMuon/ data
⚫ Make 448 chunks of up to 500k events each, over the 133 files
⚫ Make a function that histograms the dimuon invariant mass
  • Simple selection reading 4 branches
⚫ remote_map runs in under 2 minutes — 2MHz with 30 workers
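A sketch of the kind of task function described above, using the modern uproot/awkward APIs. Which four branches the original selection reads is not specified; here pt/eta/phi/charge and a massless-muon approximation are assumed, and the chunk format (filename, entry_start, entry_stop) is an assumption.

```python
import awkward as ak
import numpy as np
import uproot

def dimuon_mass_hist(args):
    # args assumed to be (filename, entry_start, entry_stop)
    filename, entry_start, entry_stop = args
    tree = uproot.open(filename)["Events"]
    mu = tree.arrays(
        ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_charge"],
        entry_start=entry_start,
        entry_stop=entry_stop,
    )
    # Keep events with at least two muons, then require the leading pair to be opposite-sign
    mu = mu[ak.num(mu["Muon_pt"]) >= 2]
    mu = mu[mu["Muon_charge"][:, 0] * mu["Muon_charge"][:, 1] < 0]
    pt1, pt2 = mu["Muon_pt"][:, 0], mu["Muon_pt"][:, 1]
    deta = mu["Muon_eta"][:, 0] - mu["Muon_eta"][:, 1]
    dphi = mu["Muon_phi"][:, 0] - mu["Muon_phi"][:, 1]
    # m^2 = 2 pT1 pT2 (cosh(deta) - cos(dphi)) in the massless-muon limit
    mass = np.sqrt(2.0 * ak.to_numpy(pt1 * pt2 * (np.cosh(deta) - np.cos(dphi))))
    counts, edges = np.histogram(mass, bins=120, range=(0.0, 120.0))
    return counts, edges

# chunks = [(filename, entry_start, entry_stop), ...] for the 448 chunks, then
# partial_hists = remote_map(dimuon_mass_hist, chunks)
```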
Dimuon example
⚫ Sum up the partial result histograms and plot the ~100M muon pairs
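Reducing the remote_map output is just a sum over the per-chunk histograms. This assumes the dimuon_mass_hist and chunks names from the earlier sketches.

```python
import matplotlib.pyplot as plt
import numpy as np

results = remote_map(dimuon_mass_hist, chunks)       # list of (counts, edges) per chunk
edges = results[0][1]
total = np.sum([counts for counts, _ in results], axis=0)

plt.step(edges[:-1], total, where="post")
plt.xlabel("dimuon invariant mass [GeV]")
plt.ylabel("events / bin")
plt.yscale("log")
plt.show()
```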
What did the workers do?
⚫ The results include metadata about start and stop time, so we can plot the blocks during which the 30 workers are working
  • Tasks are distributed to workers on a first-come first-serve basis → no scheduling
  • No white gaps → negligible communication overhead
  • Some "tail" workers can cause inefficiency near the end
⚫ Sustained network read speed ~0.26GB/s while processing
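A sketch of this worker-occupancy plot, assuming each result carries metadata like {"worker": ..., "t_start": ..., "t_stop": ...} (the exact metadata keys are an assumption).

```python
import matplotlib.pyplot as plt

def plot_worker_blocks(results_metadata):
    workers = sorted({m["worker"] for m in results_metadata})
    fig, ax = plt.subplots(figsize=(10, 6))
    for iy, w in enumerate(workers):
        # One horizontal row of (start, duration) blocks per worker
        spans = [(m["t_start"], m["t_stop"] - m["t_start"])
                 for m in results_metadata if m["worker"] == w]
        ax.broken_barh(spans, (iy, 0.8))
    ax.set_xlabel("time [s]")
    ax.set_ylabel("worker")
    plt.show()
```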
Can we do better? More workers
⚫ Consider two toy examples
  • "single MET branch" — read 1 MET branch from the DoubleMu dataset
  • "dimuon invariant mass" — function from the previous slides, which reads 4 branches
⚫ Run a handful of times for different numbers of workers and plot the event rate (MHz)
⚫ Rate scales approximately linearly with number of branches
  • For dimuon/four branches, the rate saturates at 7MHz with around 150 workers
  • cabinet/sdsc nodes have a 10gbit connection. With 150 workers we see on the right that the peak read speed is ~10gbit (this is an empirical statement; is it really the limit?)
⚫ Not shown here, but the computation of the invariant mass from momentum components is not the bottleneck. Each worker can compute invariant masses at ~10MHz with numpy's vectorization
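A quick way to check the last claim, that the invariant-mass arithmetic itself is cheap: time the vectorized massless-pair formula on a few million random muon pairs. The array sizes and distributions here are illustrative.

```python
import time

import numpy as np

n = 10_000_000
pt1, pt2 = np.random.exponential(30, n), np.random.exponential(30, n)
eta1, eta2 = np.random.uniform(-2.4, 2.4, n), np.random.uniform(-2.4, 2.4, n)
phi1, phi2 = np.random.uniform(-np.pi, np.pi, n), np.random.uniform(-np.pi, np.pi, n)

t0 = time.time()
mass = np.sqrt(2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2)))
print(f"{n / (time.time() - t0) / 1e6:.1f} MHz")   # O(10) MHz on a single core
```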
Can we do better? Caching
⚫ Need to introduce scheduling of tasks in order to take advantage of caching
  • I.e., if events A-B of file C were processed on sdsc-48 and the branches were cached there, subsequent submissions should put that task on sdsc-48 again
⚫ Ensure that the function has a placeholder for a branch cache (see the sketch below); subsequent runs of remote_map will use the same workers and this cache parameter will be filled automatically on the workers
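A sketch of what such a cache placeholder could look like: the task function takes a cache argument that the worker fills in with a dict persisting across tasks. The exact signature and cache key used in redis-htcondor are assumptions.

```python
import uproot

def dimuon_mass_hist_cached(args, cache=None):
    # args assumed to be (filename, entry_start, entry_stop, branches);
    # `cache` is injected by the worker and survives between tasks
    filename, entry_start, entry_stop, branches = args
    if cache is None:
        cache = {}
    arrays = {}
    tree = None
    for b in branches:
        key = (filename, b, entry_start, entry_stop)
        if key not in cache:                # only touch hadoop on a cache miss
            if tree is None:
                tree = uproot.open(filename)["Events"]
            cache[key] = tree[b].array(entry_start=entry_start, entry_stop=entry_stop)
        arrays[b] = cache[key]
    # ... same selection/histogramming as before, using `arrays` ...
    return arrays                           # placeholder; in practice return the partial histogram
```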
Can we do better? Caching
⚫ If we have 150 workers and each gets allocated 15GB of disk, that's ~2TB of NanoAOD we could cache on disk instead of hadoop
  • Turns out this is slower than reading from hadoop, because if we have 8 workers running on a given node they compete to read from the same disk, introducing tail jobs
⚫ 6GB of RAM per worker means ~1TB of NanoAOD could be cached in RAM
  • Promising, especially if we operate on a few columns at a time from the same dataset
⚫ Running the dimuon example a second time, we go from ~7MHz to 50-80MHz because the branches have been cached
[Plots: event rate with 500k events/task and with 1M events/task]
Possible next steps
⚫ Right now, we read through the hadoop fuse mount since I couldn't get xrootd wrappers installed with python3
  • Potentially another big speedup if we submit jobs to other sites
⚫ Investigate "columnarization" of NanoAOD
  • Each branch gets converted into a single standalone file which is a gzipped numpy array
  • Offers a 2-3x speedup over reading branches from ROOT files initially, but once branches are cached in RAM, the speed is the same
  • If files are smaller than the hadoop block size (64MB), we can submit workers to the same nodes hosting the file. Rough tests give me ~3x read speed increase when running over a file on the same node wrt a different node
⚫ Intelligent worker selection
  • Some nodes are worse than others
  • Important when nodes are shared with other people
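A sketch of the columnarization idea: dump each branch of a NanoAOD tree to its own standalone gzipped numpy file. The on-disk layout (one content array plus a counts array for jagged branches) and the file naming are assumptions, not the author's exact format.

```python
import gzip

import awkward as ak
import numpy as np
import uproot

def columnarize(filename, branches, outdir="."):
    tree = uproot.open(filename)["Events"]
    for b in branches:
        arr = tree[b].array()
        if arr.ndim == 1:                                    # flat branch (e.g. MET_pt)
            payload = {"content": ak.to_numpy(arr)}
        else:                                                # jagged branch (e.g. Muon_pt)
            payload = {"content": ak.to_numpy(ak.flatten(arr)),
                       "counts": ak.to_numpy(ak.num(arr))}
        for suffix, a in payload.items():
            with gzip.open(f"{outdir}/{b}.{suffix}.npy.gz", "wb") as f:
                np.save(f, a)
```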
Backup
Task overhead
⚫ 150 workers, 10k tasks that sleep a random amount from 0.2 to 0.4 seconds
⚫ Efficiency based on the fraction of inter-task whitespace (ignoring whitespace on the right side) is histogrammed over the 150 workers on the right
  • Mean efficiency is ~99%
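One possible reading of this efficiency definition, as a sketch: per worker, the busy time divided by the span from its first task start to its last task stop (so trailing whitespace on the right is ignored). Metadata keys follow the earlier sketch and are assumptions.

```python
from collections import defaultdict

import numpy as np

def worker_efficiencies(results_metadata):
    by_worker = defaultdict(list)
    for m in results_metadata:
        by_worker[m["worker"]].append((m["t_start"], m["t_stop"]))
    effs = []
    for spans in by_worker.values():
        busy = sum(stop - start for start, stop in spans)
        span = max(stop for _, stop in spans) - min(start for start, _ in spans)
        effs.append(busy / span)
    return np.array(effs)     # histogram this over the 150 workers
```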
Node metrics
⚫ Separate pub/sub queue to poll metrics for the worker process or the whole node. E.g., for nodes as a whole, …
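A minimal sketch of such a metrics channel with redis pub/sub and psutil: nodes publish periodic snapshots, the user subscribes. The channel name, payload fields, and polling interval are assumptions.

```python
import json
import socket
import time

import psutil
import redis

r = redis.Redis.from_url("redis://uaf-10.t2.ucsd.edu:6379")  # hypothetical broker

def publish_metrics():
    # node side: publish a snapshot every few seconds
    while True:
        r.publish("metrics:nodes", json.dumps({
            "node": socket.gethostname(),
            "cpu_percent": psutil.cpu_percent(),
            "mem_percent": psutil.virtual_memory().percent,
            "net_bytes_recv": psutil.net_io_counters().bytes_recv,
        }))
        time.sleep(5)

def poll_metrics():
    # user side: subscribe and print snapshots as they arrive
    pubsub = r.pubsub()
    pubsub.subscribe("metrics:nodes")
    for msg in pubsub.listen():
        if msg["type"] == "message":
            print(json.loads(msg["data"]))
```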
Vs dask
⚫ Also try dask-distributed vs mine
⚫ 100 workers, 8GB array cache
  • Run dask 6 times and mine 6 times: 3 with a cold cache, 3 with a warm cache
⚫ With a warm cache, the overhead with dask makes it roughly ~2x slower than mine when ignoring tail jobs
  • Lots of variance with dask
⚫ Is this because of allow_other_workers/work stealing causing the cache to not be used?
[Plots: task timelines for mine vs dask, with cold and warm caches]
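For reference, the dask-distributed side of this comparison would look roughly like the sketch below; the scheduler address, preferred_workers list, and reuse of dimuon_mass_hist/chunks from the earlier sketches are assumptions.

```python
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")         # placeholder scheduler address
preferred_workers = ["sdsc-48.local"]                # hypothetical workers holding cached branches

futures = client.map(dimuon_mass_hist, chunks,
                     workers=preferred_workers,      # prefer workers that hold the cache...
                     allow_other_workers=True)       # ...but allow idle workers to steal tasks
results = client.gather(futures)
```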
Compression (LZMA)
⚫ Study a 1.7M event DoubleMuon NanoAOD file — 1.4GB file, compressed with LZMA (the default NanoAOD workflow)
⚫ Read ~20 branches on my laptop with an SSD
⚫ Everything done with as warm a cache as possible (run cells multiple times)
⚫ First, just open/initialize the file and TTree
⚫ The icicle plot shows time on the x-axis, with nested child function calls along the y-axis
⚫ 1s overhead to open the file and get the tree
Compression (LZMA)
⚫ Now actually read those branches
⚫ Takes 18s, and 80% of that is decompression
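The kind of measurement behind these icicle plots can be reproduced with cProfile (the icicle visualization itself looks like snakeviz output, but that is a guess); the file path and branch list below are placeholders.

```python
import cProfile
import pstats

import uproot

tree = uproot.open("doublemu_lzma.root")["Events"]
branches = ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"]

profiler = cProfile.Profile()
profiler.enable()
arrays = tree.arrays(branches)          # the branch read being profiled
profiler.disable()

# Look at how much cumulative time lands in the decompression routines
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```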
Compression (LZ4)
⚫ Take the same file and convert to LZ4 (hadd -O -f404 doublemu_lz4_reopt.root doublemu_lzma.root) — gives a 2.7GB file
⚫ Opening the file and getting the tree takes 0.74s and 1% of that is decompression
  • Thus, the intrinsic overhead of uproot opening the file/tree is on the order of 1s
  ‣ So 50 workers and 500 files of 1M events each will never surpass 50MHz
Compression (LZ4)
⚫ Reading branches
⚫ Takes 2.7s and 11% of that is spent decompressing
⚫ So, LZMA took 13s to decompress vs 0.28s for LZ4 — 40x faster to decompress, and the remaining walltime is interpretation overhead
Comparing compression algos
⚫ Convert the previous files with several algorithms
  • Compression enums from https://github.com/root-project/root/blob/master/core/zip/inc/Compression.h
  • Default NanoAOD/LZMA is 208, and the recommended LZ4 setting is 404
  • 401-408 produce similar filesizes and decompression speeds, so proceed with 404
  • Filesizes:
⚫ Time to read 1-12 branches, on the right
⚫ LZ4 and uncompressed are basically equivalent
⚫ LZ4 is ~10x faster than LZMA