Interactive NanoAOD analysis
Nick Amin
Aug 19, 2019
Introduction
⚫ Condor jobs have a lot of overhead
  • Transferring and setting up the environment for every job
  • Scheduling/internal condor overhead
⚫ Condor jobs are not interactive
  • Designed for monolithic background processing
  • Job status through condor_q spam
⚫ Interactive login node use will not really be "interactive" anymore
  • Several versions (e.g., years/periods) of ntuples → everyone turning to parallel looping to make histograms on login nodes
Introduction
⚫ Goal: Enable faster interactive analysis with NanoAOD using a message broker and HTCondor
  • Preliminary code in https://github.com/aminnj/redis-htcondor
⚫ Not a new concept! There are lots of tools designed to do this more generally (dynamic task execution with brokered communication):
  • htmap, parsl, dask, celery, ray, airflow, dramatiq, …
  • Limitations on open ports/communication within/outside the batch system mean it's difficult to use these out of the box
⚫ Wrote my own simple task queue system based on redis to allow low-level customization to suit our use case (HTCondor, data locality, caching, compressed communication, hardware, hadoop, …)
  • Jobs are "embarrassingly parallel", so there's no real need for dynamic logic, inter-worker communication, DAGs, etc.
  • This relies on exactly one redis master server (one single public-facing ip/port)
Setup
1. First, the user submits N condor jobs (each condor job is 1 "worker") which listen to a FIFO task queue of the redis server (the broker)
  • There are actually N+1 task queues: 1 general ("tasks") and N specific ("tasks:worker1" … "tasks:workerN")
  • Workers listen to the general task queue and also the one specific to themselves
  • In this way, we can send targeted tasks to specific workers
2. User communicates workload/tasks to the redis server
3. Workers perform the work (reading files from hadoop) and push their output into results queues
4. User checks the results queues and combines/reduces
[Diagram: user on the UAF submits condor jobs and sends tasks to the redis broker; workers pull tasks, read hadoop files, and push into per-worker results queues; a task = (function, arguments); payloads are compressed/decompressed with lz4]
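A minimal sketch of the worker side of this setup, using the real redis-py, lz4, and cloudpickle packages. The general queue name "tasks" and the per-worker "tasks:workerN" queues come from the slide; the "results:<worker>" queue names, the redis URL, and the (function, arguments) payload format are assumptions for illustration, not the exact redis-htcondor implementation.

```python
import socket

import cloudpickle
import lz4.frame
import redis

REDIS_URL = "redis://uaf-10.t2.ucsd.edu:6379"   # hypothetical broker host/port
worker_name = socket.gethostname()              # stand-in for "workerN"

r = redis.Redis.from_url(REDIS_URL)

while True:
    # Block until a task appears on either the general queue or this worker's queue
    queue, payload = r.brpop(["tasks", f"tasks:{worker_name}"])
    # Tasks are lz4-compressed, cloudpickled (function, arguments) tuples
    func, args = cloudpickle.loads(lz4.frame.decompress(payload))
    result = func(args)
    # Push the compressed result onto this worker's results queue for the user to collect
    r.lpush(f"results:{worker_name}", lz4.frame.compress(cloudpickle.dumps(result)))
```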
Simple example
⚫ Someone hosts a redis server (I have one on uaf-10 right now)
⚫ User first externally submits 30 jobs to condor, each of which is a worker
⚫ User writes a function and has the option to map it over a list of arguments locally or remotely
⚫ Very low overhead — 40ms to send tasks to 10 workers and get the results back
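The user-side counterpart to the worker loop above would look roughly like this: a remote_map-style helper that fans tasks out over the general queue and collects the results. This is a sketch of the mechanics, not the actual remote_map in the repo; the queue names, worker list, and redis URL are assumptions.

```python
import cloudpickle
import lz4.frame
import redis

r = redis.Redis.from_url("redis://uaf-10.t2.ucsd.edu:6379")  # hypothetical broker
workers = [f"worker{i}" for i in range(1, 31)]               # the 30 condor workers

def remote_map(func, list_of_args):
    # Fan out one task per argument onto the general task queue
    for args in list_of_args:
        r.lpush("tasks", lz4.frame.compress(cloudpickle.dumps((func, args))))
    # Collect one result per task from the per-worker results queues
    result_queues = [f"results:{w}" for w in workers]
    results = []
    while len(results) < len(list_of_args):
        _, payload = r.brpop(result_queues)
        results.append(cloudpickle.loads(lz4.frame.decompress(payload)))
    return results

# e.g. square some numbers on the workers
print(remote_map(lambda x: x * x, list(range(10))))
```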
Dimuon example
⚫ Take a 200M event subset of Run 2 /DoubleMuon/ data
⚫ Make 448 chunks of up to 500k events each, over the 133 files
⚫ Make a function that histograms the dimuon invariant mass
  • Simple selection reading 4 branches
⚫ remote_map runs in under 2 minutes — 2MHz with 30 workers
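A sketch of the kind of task function described above, using the modern uproot/awkward APIs. Which four branches the original selection reads is not specified; here pt/eta/phi/charge and a massless-muon approximation are assumed, and the chunk format (filename, entry_start, entry_stop) is an assumption.

```python
import awkward as ak
import numpy as np
import uproot

def dimuon_mass_hist(args):
    # args assumed to be (filename, entry_start, entry_stop)
    filename, entry_start, entry_stop = args
    tree = uproot.open(filename)["Events"]
    mu = tree.arrays(
        ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_charge"],
        entry_start=entry_start,
        entry_stop=entry_stop,
    )
    # Keep events with at least two muons, then require the leading pair to be opposite-sign
    mu = mu[ak.num(mu["Muon_pt"]) >= 2]
    mu = mu[mu["Muon_charge"][:, 0] * mu["Muon_charge"][:, 1] < 0]
    pt1, pt2 = mu["Muon_pt"][:, 0], mu["Muon_pt"][:, 1]
    deta = mu["Muon_eta"][:, 0] - mu["Muon_eta"][:, 1]
    dphi = mu["Muon_phi"][:, 0] - mu["Muon_phi"][:, 1]
    # m^2 = 2 pT1 pT2 (cosh(deta) - cos(dphi)) in the massless-muon limit
    mass = np.sqrt(2.0 * ak.to_numpy(pt1 * pt2 * (np.cosh(deta) - np.cos(dphi))))
    counts, edges = np.histogram(mass, bins=120, range=(0.0, 120.0))
    return counts, edges

# chunks = [(filename, entry_start, entry_stop), ...] for the 448 chunks, then
# partial_hists = remote_map(dimuon_mass_hist, chunks)
```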
Dimuon example
⚫ Sum up the partial result histograms and plot the ~100M muon pairs
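Reducing the remote_map output is just a sum over the per-chunk histograms. This assumes the dimuon_mass_hist and chunks names from the earlier sketches.

```python
import matplotlib.pyplot as plt
import numpy as np

results = remote_map(dimuon_mass_hist, chunks)       # list of (counts, edges) per chunk
edges = results[0][1]
total = np.sum([counts for counts, _ in results], axis=0)

plt.step(edges[:-1], total, where="post")
plt.xlabel("dimuon invariant mass [GeV]")
plt.ylabel("events / bin")
plt.yscale("log")
plt.show()
```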
What did the workers do?
⚫ The results include metadata about start and stop time, so we can plot the blocks during which the 30 workers are working
  • Tasks are distributed to workers on a first-come first-serve basis → no scheduling
  • No white gaps → negligible communication overhead
  • Some "tail" workers can cause inefficiency near the end
⚫ Sustained network read speed ~0.26GB/s while processing
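A sketch of this worker-occupancy plot, assuming each result carries metadata like {"worker": ..., "t_start": ..., "t_stop": ...} (the exact metadata keys are an assumption).

```python
import matplotlib.pyplot as plt

def plot_worker_blocks(results_metadata):
    workers = sorted({m["worker"] for m in results_metadata})
    fig, ax = plt.subplots(figsize=(10, 6))
    for iy, w in enumerate(workers):
        # One horizontal row of (start, duration) blocks per worker
        spans = [(m["t_start"], m["t_stop"] - m["t_start"])
                 for m in results_metadata if m["worker"] == w]
        ax.broken_barh(spans, (iy, 0.8))
    ax.set_xlabel("time [s]")
    ax.set_ylabel("worker")
    plt.show()
```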
Can we do better? More workers
⚫ Consider two toy examples
  • "single MET branch" — read 1 MET branch from the DoubleMu dataset
  • "dimuon invariant mass" — function from the previous slides, which reads 4 branches
⚫ Run a handful of times for different numbers of workers and plot the event rate (MHz)
⚫ Rate scales approximately linearly with number of branches
  • For dimuon/four branches, the rate saturates at 7MHz with around 150 workers
  • cabinet/sdsc nodes have a 10gbit connection. With 150 workers we see on the right that the peak read speed is ~10gbit (this is an empirical statement; is it really the limit?)
⚫ Not shown here, but the computation of the invariant mass from momentum components is not the bottleneck. Each worker can compute invariant masses at ~10MHz with numpy's vectorization
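A quick way to check the last claim, that the invariant-mass arithmetic itself is cheap: time the vectorized massless-pair formula on a few million random muon pairs. The array sizes and distributions here are illustrative.

```python
import time

import numpy as np

n = 10_000_000
pt1, pt2 = np.random.exponential(30, n), np.random.exponential(30, n)
eta1, eta2 = np.random.uniform(-2.4, 2.4, n), np.random.uniform(-2.4, 2.4, n)
phi1, phi2 = np.random.uniform(-np.pi, np.pi, n), np.random.uniform(-np.pi, np.pi, n)

t0 = time.time()
mass = np.sqrt(2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2)))
print(f"{n / (time.time() - t0) / 1e6:.1f} MHz")   # O(10) MHz on a single core
```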
Can we do better? Caching
⚫ Need to introduce scheduling of tasks in order to take advantage of caching
  • I.e., if events A-B of file C were processed on sdsc-48 and the branches were cached there, subsequent submissions should put that task on sdsc-48 again
⚫ Ensure that the function has a placeholder for a branch cache (see the sketch below); subsequent runs of remote_map will use the same workers and this cache parameter will be filled automatically on the workers
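A sketch of what such a cache placeholder could look like: the task function takes a cache argument that the worker fills in with a dict persisting across tasks. The exact signature and cache key used in redis-htcondor are assumptions.

```python
import uproot

def dimuon_mass_hist_cached(args, cache=None):
    # args assumed to be (filename, entry_start, entry_stop, branches);
    # `cache` is injected by the worker and survives between tasks
    filename, entry_start, entry_stop, branches = args
    if cache is None:
        cache = {}
    arrays = {}
    tree = None
    for b in branches:
        key = (filename, b, entry_start, entry_stop)
        if key not in cache:                # only touch hadoop on a cache miss
            if tree is None:
                tree = uproot.open(filename)["Events"]
            cache[key] = tree[b].array(entry_start=entry_start, entry_stop=entry_stop)
        arrays[b] = cache[key]
    # ... same selection/histogramming as before, using `arrays` ...
    return arrays                           # placeholder; in practice return the partial histogram
```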
Can we do better? Caching
⚫ If we have 150 workers and each gets allocated 15GB of disk, that's ~2TB of NanoAOD we could cache on disk instead of hadoop
  • Turns out this is slower than reading from hadoop, because if we have 8 workers running on a given node they compete to read from the same disk, introducing tail jobs
⚫ 6GB of RAM per worker means ~1TB of NanoAOD could be cached in RAM
  • Promising, especially if we operate on a few columns at a time from the same dataset
⚫ Running the dimuon example a second time, we go from ~7MHz to 50-80MHz because the branches have been cached
[Plots: event rate with 500k events/task and with 1M events/task]
Possible next steps
⚫ Right now, we read through the hadoop fuse mount since I couldn't get xrootd wrappers installed with python3
  • Potentially another big speedup if we submit jobs to other sites
⚫ Investigate "columnarization" of NanoAOD
  • Each branch gets converted into a single standalone file which is a gzipped numpy array
  • Offers a 2-3x speedup over reading branches from ROOT files initially, but once branches are cached in RAM, the speed is the same
  • If files are smaller than the hadoop block size (64MB), we can submit workers to the same nodes hosting the file. Rough tests give me ~3x read speed increase when running over a file on the same node wrt a different node
⚫ Intelligent worker selection
  • Some nodes are worse than others
  • Important when nodes are shared with other people
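A sketch of the columnarization idea: dump each branch of a NanoAOD tree to its own standalone gzipped numpy file. The on-disk layout (one content array plus a counts array for jagged branches) and the file naming are assumptions, not the author's exact format.

```python
import gzip

import awkward as ak
import numpy as np
import uproot

def columnarize(filename, branches, outdir="."):
    tree = uproot.open(filename)["Events"]
    for b in branches:
        arr = tree[b].array()
        if arr.ndim == 1:                                    # flat branch (e.g. MET_pt)
            payload = {"content": ak.to_numpy(arr)}
        else:                                                # jagged branch (e.g. Muon_pt)
            payload = {"content": ak.to_numpy(ak.flatten(arr)),
                       "counts": ak.to_numpy(ak.num(arr))}
        for suffix, a in payload.items():
            with gzip.open(f"{outdir}/{b}.{suffix}.npy.gz", "wb") as f:
                np.save(f, a)
```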
Backup
Task overhead
⚫ 150 workers, 10k tasks that sleep a random amount from 0.2 to 0.4 seconds
⚫ Efficiency based on the fraction of inter-task whitespace (ignoring whitespace on the right side) is histogrammed over the 150 workers on the right
  • Mean efficiency is ~99%
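One possible reading of this efficiency definition, as a sketch: per worker, the busy time divided by the span from its first task start to its last task stop (so trailing whitespace on the right is ignored). Metadata keys follow the earlier sketch and are assumptions.

```python
from collections import defaultdict

import numpy as np

def worker_efficiencies(results_metadata):
    by_worker = defaultdict(list)
    for m in results_metadata:
        by_worker[m["worker"]].append((m["t_start"], m["t_stop"]))
    effs = []
    for spans in by_worker.values():
        busy = sum(stop - start for start, stop in spans)
        span = max(stop for _, stop in spans) - min(start for start, _ in spans)
        effs.append(busy / span)
    return np.array(effs)     # histogram this over the 150 workers
```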
Node metrics
⚫ Separate pub/sub queue to poll metrics for the worker process or the whole node. E.g., for nodes as a whole, …
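A minimal sketch of such a metrics channel with redis pub/sub and psutil: nodes publish periodic snapshots, the user subscribes. The channel name, payload fields, and polling interval are assumptions.

```python
import json
import socket
import time

import psutil
import redis

r = redis.Redis.from_url("redis://uaf-10.t2.ucsd.edu:6379")  # hypothetical broker

def publish_metrics():
    # node side: publish a snapshot every few seconds
    while True:
        r.publish("metrics:nodes", json.dumps({
            "node": socket.gethostname(),
            "cpu_percent": psutil.cpu_percent(),
            "mem_percent": psutil.virtual_memory().percent,
            "net_bytes_recv": psutil.net_io_counters().bytes_recv,
        }))
        time.sleep(5)

def poll_metrics():
    # user side: subscribe and print snapshots as they arrive
    pubsub = r.pubsub()
    pubsub.subscribe("metrics:nodes")
    for msg in pubsub.listen():
        if msg["type"] == "message":
            print(json.loads(msg["data"]))
```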
Vs dask
⚫ Also try dask-distributed vs mine
⚫ 100 workers, 8GB array cache
  • Run dask 6 times and mine 6 times: 3 with a cold cache, 3 with a warm cache
⚫ With a warm cache, the overhead with dask makes it roughly ~2x slower than mine when ignoring tail jobs
  • Lots of variance with dask
⚫ Is this because of allow_other_workers/work stealing causing the cache to not be used?
[Plots: task timelines for mine vs dask, with cold and warm caches]
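For reference, the dask-distributed side of this comparison would look roughly like the sketch below; the scheduler address, preferred_workers list, and reuse of dimuon_mass_hist/chunks from the earlier sketches are assumptions.

```python
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")         # placeholder scheduler address
preferred_workers = ["sdsc-48.local"]                # hypothetical workers holding cached branches

futures = client.map(dimuon_mass_hist, chunks,
                     workers=preferred_workers,      # prefer workers that hold the cache...
                     allow_other_workers=True)       # ...but allow idle workers to steal tasks
results = client.gather(futures)
```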
Compression (LZMA)
⚫ Study a 1.7M event DoubleMuon NanoAOD file — 1.4GB file, compressed with LZMA (the default NanoAOD workflow)
⚫ Read ~20 branches on my laptop with an SSD
⚫ Everything done with as warm a cache as possible (run cells multiple times)
⚫ First, just open/initialize the file and TTree
⚫ The icicle plot shows time on the x-axis, with nested child function calls along the y-axis
⚫ 1s overhead to open the file and get the tree
Compression (LZMA)
⚫ Now actually read those branches
⚫ Takes 18s, and 80% of that is decompression
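The kind of measurement behind these icicle plots can be reproduced with cProfile (the icicle visualization itself looks like snakeviz output, but that is a guess); the file path and branch list below are placeholders.

```python
import cProfile
import pstats

import uproot

tree = uproot.open("doublemu_lzma.root")["Events"]
branches = ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"]

profiler = cProfile.Profile()
profiler.enable()
arrays = tree.arrays(branches)          # the branch read being profiled
profiler.disable()

# Look at how much cumulative time lands in the decompression routines
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```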
Compression (LZ4)
⚫ Take the same file and convert to LZ4 (hadd -O -f404 doublemu_lz4_reopt.root doublemu_lzma.root) — gives a 2.7GB file
⚫ Opening the file and getting the tree takes 0.74s and 1% of that is decompression
  • Thus, the intrinsic overhead of uproot opening the file/tree is on the order of 1s
  ‣ So 50 workers and 500 files of 1M events each will never surpass 50MHz
Compression (LZ4)
⚫ Reading branches
⚫ Takes 2.7s and 11% of that is spent decompressing
⚫ So, LZMA took 13s to decompress vs 0.28s for LZ4 — 40x faster to decompress, and the remaining walltime is interpretation overhead
Comparing compression algos
⚫ Convert the previous files with several algorithms
  • Compression enums from https://github.com/root-project/root/blob/master/core/zip/inc/Compression.h
  • Default NanoAOD/LZMA is 208, and the recommended LZ4 setting is 404
  • 401-408 produce similar filesizes and decompression speeds, so proceed with 404
  • Filesizes:
⚫ Time to read 1-12 branches, on the right
⚫ LZ4 and uncompressed are basically equivalent
⚫ LZ4 is ~10x faster than LZMA