Managing a Dynamic Sharded Pool
Anthony Tiradani
HTCondor Week 2019
22 May 2019
Introduction
• Some archaeology from my time at Fermilab
  – Earliest archived Fermilab talks at HTCondor Week – 15 years ago!
  – My earliest HTCondor Week talk was in 2012
• Describe the current state of the cluster(s)
• Along the way, I hope to:
  – Show some (maybe) unique uses of HTCondor
  – Explain why we did what we did
  – Give a peek into some future activities
In the Beginning… (At least for me)
• There was HTCondor! And it was Good.
  – When I started, the silent "HT" hadn't been added to the name yet
• CMS Tier-1: single-VO pool (CMS + OSG), Grid-enabled, priority-based scheduling
• GPGrid: multi-VO pool (many experiments + OSG), Grid-enabled, quota-based scheduling
• CMS LPC: single-VO pool, local analysis only, priority-based scheduling
Net Batch Slot Utilization – 2013 Scientific Computing Portfolio Review
[Chart: queued, idle, and busy batch slots (~24,000 total) over the last 3 months, with a visible dip over the holidays]
FIFEBatch
• FIFEBatch was created using GlideinWMS
  – The main motivation was the desire to use OSG resources seamlessly
[Diagram: the FIFEBatch (GlideinWMS) pool sends pilots to both GPGrid and the OSG]
FIFEBatch
• FIFEBatch was a GlideinWMS pool
  – All slots are similar – controlled by the pilot (glidein)
  – Used the GlideinWMS Frontend to implement policies
  – Used the OSG Factory for pilot submission
  – Pilot "shape" defined by the Factory
  – All of the benefits of GlideinWMS and OSG
• All FNAL experiment jobs ran within the FIFEBatch pool
• FIFEBatch was managed by the experimental support team
• GPGrid was managed by the Grid Computing team
SC-PMT – GPGrid
• Processing requests: large memory or multi-core as a single slot
  – We began to see increased demand for large-memory and multi-core slots (shown in last year's SC-PMT review)
  – For context: a "standard" slot was defined as 1 core, 2 GB RAM
• Partitionable slots are limited by the pilot size
  – Unable to use extra worker resources beyond what is claimed by the pilot
  – (See the config sketch after this list)
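The slides don't show the slot configuration itself; below is a minimal sketch of the partitionable-slot startd setup a glidein typically carries (not FNAL's exact config). The key point: "100%" means 100% of what the pilot claimed from the batch system, not 100% of the physical node, which is why jobs cannot use worker resources beyond the pilot's size.

    # Minimal partitionable-slot configuration (sketch).
    # One partitionable slot owns all resources granted to the pilot;
    # dynamic slots are carved out of it per job request.
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE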
Combined: GPGrid + FIFEBatch = FermiGrid
[Diagram: GlideinWMS and OSG services submit pilots through the OSG to the FermiGrid worker nodes; quota-based and priority-based scheduling coexist in the combined pool]
CMS Tier-1 + LPC
• New requirements:
  – Make LPC available to CMS Connect
  – Make CRAB3 jobs run on LPC resources
• LPC workers were reconfigured to remove all extra storage mounts
  – Now LPC workers look identical to the Tier-1 workers
• LPC needed a Grid interface for CMS Connect and CRAB3
  – The Tier-1 was already Grid-enabled
• However, there are two competing usage models:
  – The Tier-1 wants to be fully utilized
  – The LPC wants resources at the time of need
CMS Tier-1 + LPC
[Diagram: CRAB3 jobs (via CRAB submit), other CMS jobs, and CMS Connect jobs arrive as CMS Global Pool pilots through the CMS Tier-1 and CMS LPC HTCondor-CEs into the combined CMS pool of Tier-1 and LPC workers; a reserved glidein path serves jobs from CRAB submit or CMS Connect; LPC users submit directly to the CMS LPC schedd from interactive login nodes]
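The slides don't show how the two usage models were reconciled in configuration; a common HTCondor pattern for "fully utilized" vs. "available on demand" is hierarchical group quotas with surplus sharing. A minimal sketch, with hypothetical group names and quota numbers:

    # Hypothetical group names and quotas, for illustration only.
    GROUP_NAMES = group_tier1, group_lpc
    GROUP_QUOTA_group_tier1 = 8000
    GROUP_QUOTA_group_lpc = 4000
    # Tier-1 production may soak up idle LPC slots beyond its quota...
    GROUP_ACCEPT_SURPLUS_group_tier1 = TRUE
    # ...but the LPC keeps a hard claim on its own share.
    GROUP_ACCEPT_SURPLUS_group_lpc = FALSE

Getting slots back "at the time of need" quickly also generally requires a preemption policy for jobs running on surplus; that trade-off is a site policy decision and is not shown here.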
CMS – Docker
• Workers advertise:
  – FERMIHTC_DOCKER_CAPABLE = True
  – FERMIHTC_DOCKER_TRUSTED_IMAGES = <comma-separated list>
• GlideinWMS pilots advertise:
  – FERMIHTC_DOCKER_CAPABLE = False
• The HTCondor-CE Job Router:
  – Sets WantDocker = MachineAttrFERMIHTC_DOCKER_CAPABLE0
  – Sets DockerImage = <image expression>
• The LPC schedd job transform does the same:
  – Sets WantDocker = MachineAttrFERMIHTC_DOCKER_CAPABLE0
  – Sets DockerImage = <image expression>
(A config sketch follows)
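The slide names the attributes but not the configuration; below is a minimal sketch of how this pattern can be wired up. It assumes SYSTEM_JOB_MACHINE_ATTRS is used to copy the machine's FERMIHTC_DOCKER_CAPABLE value into the job ad at match time (which is what makes MachineAttrFERMIHTC_DOCKER_CAPABLE0 resolvable), the image name is hypothetical, and the transform uses the 8.8-era ClassAd syntax (newer releases use the native SET syntax).

    # --- On the worker (startd): advertise Docker capability ---
    FERMIHTC_DOCKER_CAPABLE = True
    FERMIHTC_DOCKER_TRUSTED_IMAGES = "cmssw/cc7:latest"   # hypothetical list
    STARTD_ATTRS = $(STARTD_ATTRS), FERMIHTC_DOCKER_CAPABLE, FERMIHTC_DOCKER_TRUSTED_IMAGES

    # --- On the schedd: record the machine attribute into the job ad when it
    #     matches, creating MachineAttrFERMIHTC_DOCKER_CAPABLE0 ---
    SYSTEM_JOB_MACHINE_ATTRS = $(SYSTEM_JOB_MACHINE_ATTRS) FERMIHTC_DOCKER_CAPABLE

    # --- LPC schedd job transform ---
    JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) Docker
    JOB_TRANSFORM_Docker @=end
    [
      /* WantDocker stays an unevaluated expression, so it only becomes
         true on machines that advertised FERMIHTC_DOCKER_CAPABLE */
      set_WantDocker = MachineAttrFERMIHTC_DOCKER_CAPABLE0;
      set_DockerImage = "cmssw/cc7:latest";   /* hypothetical image expression */
    ]
    @end

The HTCondor-CE Job Router applies the equivalent edits to routed jobs; since GlideinWMS pilots advertise FERMIHTC_DOCKER_CAPABLE = False, glidein slots never flip a job into the Docker universe.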
HEPCloud – Drivers for Evolving the Facility
• HEP computing needs will be 10-100x current capacity
  – Two new programs coming online (DUNE, High-Luminosity LHC), while new physics search programs (Mu2e) will be operating
• Scale of industry at or above R&D
  – Commercial clouds offering increased value for decreased cost compared to the past
[Chart: price of one core-year on a commercial cloud]
HEPCloud – Drivers for Evolving the Facility: Elasticity
• Usage is not steady-state
• Computing schedules are driven by real-world considerations (detector, accelerator, …) but also ingenuity – this is research and development of cutting-edge science
[Chart: NOvA jobs in the queue at FNAL, plotted against the facility size]
HEPCloud – Classes of Resource Providers
• Grid – "Things you borrow" (trust federation)
  – Virtual Organizations (VOs) of users trusted by Grid sites
  – VOs get allocations ➜ pledges
  – Unused allocations become opportunistic resources
• Cloud – "Things you rent" (economic model)
  – Community clouds: similar trust federation to Grids
  – Commercial clouds: pay-as-you-go model
    • Strongly accounted
    • Near-infinite capacity ➜ elasticity
    • Spot price market
• HPC – "Things you are given" (grant allocation)
  – Researchers granted access to HPC installations
  – Peer review committees award allocations
    • Awards model designed for individual PIs rather than large collaborations
HEPCloud
• New DOE requirements: use LCF facilities
• HEPCloud adds Cloud and HPC resources to the pool
• Cloud and HPC resource requests are carefully curated for specific classes of jobs
  – Only want appropriate jobs to land on Cloud and HPC resources
  – An additional negotiator also gives more flexibility in handling new resource types (see the sketch after this list)
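The slides don't spell out how the additional negotiator is configured; one standard way to run an extra negotiator that only considers curated resources is a second negotiator instance with slot and job constraints. A minimal sketch, where FERMIHTC_HEPCLOUD_RESOURCE (slot side) and WantHEPCloud (job side) are hypothetical attribute names:

    # On the central manager: run a second negotiator under a local name.
    DAEMON_LIST = $(DAEMON_LIST) NEGOTIATOR_HEPCLOUD
    NEGOTIATOR_HEPCLOUD = $(NEGOTIATOR)
    NEGOTIATOR_HEPCLOUD_ARGS = -f -local-name HEPCLOUD
    NEGOTIATOR.HEPCLOUD.NEGOTIATOR_NAME = hepcloud

    # The default negotiator stays away from curated HPC/Cloud slots...
    NEGOTIATOR_SLOT_CONSTRAINT = (FERMIHTC_HEPCLOUD_RESOURCE =!= True)
    # ...while the HEPCloud negotiator matches only those slots, and only
    # jobs explicitly flagged for them.
    NEGOTIATOR.HEPCLOUD.NEGOTIATOR_SLOT_CONSTRAINT = (FERMIHTC_HEPCLOUD_RESOURCE =?= True)
    NEGOTIATOR.HEPCLOUD.NEGOTIATOR_JOB_CONSTRAINT = (WantHEPCloud =?= True)

Splitting the pool this way keeps curated Cloud/HPC slots invisible to ordinary matchmaking while letting the HEPCloud negotiator apply its own policies to the new resource types.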
HEPCloud Era
[Diagram: the combined CMS pool now spans LPC workers, Tier-1 workers, HPC pilots, and Cloud pilots provisioned by HEPCloud services; the LPC negotiator and the HEPCloud negotiator share the pool, alongside the Tier-1 scheduler]
Monitoring – Negotiation Cycles
[Dashboard: negotiation cycle time, idle jobs, successful matches, rejected jobs, and considered jobs]
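These panels plot the negotiator's per-cycle statistics, which the negotiator publishes in its own ClassAd (the trailing 0 indexes the most recent completed cycle). A quick way to pull the same numbers at the command line:

    # Query the negotiator ad directly; attribute names are from the
    # negotiator ClassAd (index 0 = most recent completed cycle).
    condor_status -negotiator -af:h \
        LastNegotiationCycleDuration0 \
        LastNegotiationCycleNumIdleJobs0 \
        LastNegotiationCycleMatches0 \
        LastNegotiationCycleRejections0 \
        LastNegotiationCycleNumJobsConsidered0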
Monitoring – Central Manager
[Dashboard: average match rates and recent collector updates]
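The update counts come from the collector's self-ad statistics; something like the following can reproduce them at the command line (the Recent* variants cover a sliding recent window; exact attribute names can vary by HTCondor version):

    # Inspect the collector's own ad for update statistics
    # (e.g. UpdatesTotal, UpdatesLost, and their Recent* counterparts).
    condor_status -collector -long | grep -i updates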
Next Steps
• CI/CD pipelines for Docker containers
• Containerizing workers? (Kubernetes, DC/OS, etc.)
• HTCondor on HPC facilities with no outbound networking
• Better handling of MPI jobs
  – No dedicated FIFO scheduler
  – No preemption
Questions, Comments?