CERN’s Virtual File System for Global-Scale Software Delivery
Jakob Blomer, for the CernVM-FS team
CERN, EP-SFT
MSST 2019, Santa Clara University
Agenda
• High Energy Physics Computing Model
• Software Distribution Challenge
• CernVM-FS: A Purpose-Built Software File System
High Energy Physics Computing Model
Accelerate & Collide
Measure & Analyze
• Billions of independent “events”
• Each event subject to complex software processing
⇒ High-Throughput Computing
Federated Computing Model
• Physics and computing: international collaborations
• “The Grid”: ≈ 160 data centers, plus additional opportunistic resources (e.g. HPC, backfill slots)
• Approximately a global batch system
• Code moves to the data rather than vice versa
Software Distribution Challenge
The Anatomy of a Scientific Software Stack

Layered stack, from changing (top) to stable (bottom):
• Individual Analysis Code (0.1 MLOC)
• Experiment Software Framework (4 MLOC)
• High Energy Physics Libraries (5 MLOC)
• Compiler, System Libraries, OS Kernel (20 MLOC)

Key Figures for LHC Experiments
• Hundreds of (novice) developers
• > 100 000 files per release
• 1 TB / day of nightly builds
• ∼ 100 000 machines world-wide
• Daily production releases, which remain available “eternally”
Container Image Distribution
• Containers are easier to create than to roll out at scale
• Due to network congestion: long start-up times in large clusters
• Impractical image cache management on worker nodes
• Ideally: containers for isolation and orchestration, but not for distribution
Shared Software Area on General Purpose DFS

Working Set
• ≈ 2 % to 10 % of all available files are requested at runtime
• Median file size: < 4 kB

Flash Crowd Effect (≈ a distributed DoS on the shared software area)
• O(MHz) meta-data request rate
• O(kHz) file open rate
Software vs. Data

Software                          Data
POSIX interface                   put, get, seek, streaming
File dependencies                 Independent files
O(kB) per file                    O(GB) per file
Whole files                       File chunks
Absolute paths                    Relocatable
WORM (“write-once-read-many”)
Billions of files
Versioned

Software is massive not in volume but in number of objects and meta-data rates
CernVM-FS: A Purpose-Built Software File System
Design Objectives

[Figure: a read/write file system on the software publisher (master source) is transformed into content-addressed objects organized as a Merkle tree, delivered over HTTP with caching and replication, and mounted as a read-only file system on the worker nodes]

Several CDN options:
• Apache + Squids
• Ceph/S3
• Commercial CDN

1. World-wide scalability
2. Infrastructure compatibility
3. Application-level consistency
4. Efficient meta-data access
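A minimal sketch of the transformation step in the figure above: a file is compressed and stored under its content hash, so any plain web server can deliver it by that name. The file name, the fan-out by the first two hash characters, and the `objects/` directory are illustrative choices, not the exact on-disk layout used by CernVM-FS.

[ ~ ]# f=libfoo.so
[ ~ ]# hash=$(sha1sum "$f" | cut -d' ' -f1)          # content hash of the file
[ ~ ]# mkdir -p objects/${hash:0:2}                  # fan-out directory named after the hash prefix
[ ~ ]# gzip -c "$f" > objects/${hash:0:2}/${hash:2}  # store the object compressed and content-addressed
[ ~ ]# # any HTTP server pointed at ./objects can now serve the object by its hash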
Scale of Deployment

[Figure: deployment hierarchy from the source (stratum 0) via replicas (stratum 1) to site and edge caches]

LHC infrastructure:
• > 1 billion files
• ≈ 100 000 nodes
• 5 replicas, 400 web caches
High-Availability by Horizontal Scaling

Server side: stateless services

[Figure, built up over several steps: worker nodes in a data center talk HTTP to caching proxies (O(100) nodes per server), which in turn talk HTTP to mirror servers (O(10) data centers per server)]
• Load balancing and failover across the site’s caching proxies
• Geo-IP selection and failover across mirror servers
• Caches can be pre-populated
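On the client side, the proxy and mirror chains from the figure are plain configuration. A hedged sketch using the standard CernVM-FS client parameters; the repository name, proxy hosts, and mirror hosts are placeholders:

# /etc/cvmfs/default.local (illustrative values)
CVMFS_REPOSITORIES=atlas.cern.ch
# two load-balanced site proxies, with a direct connection as last-resort failover group
CVMFS_HTTP_PROXY="http://proxy1.example.org:3128|http://proxy2.example.org:3128;DIRECT"
# ordered list of mirror (stratum 1) servers to fail over between
CVMFS_SERVER_URL="http://cvmfs-s1-a.example.org/cvmfs/@fqrn@;http://cvmfs-s1-b.example.org/cvmfs/@fqrn@"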
Reading

[Figure: the CernVM-FS Fuse client sits below the basic system utilities in the OS and reads through an HTTP cache hierarchy: memory buffer (∼ 1 GB), persistent local cache (∼ 20 GB), CernVM-FS repository on HTTP or S3 (∼ 10 TB, all versions available)]

• Fuse based, independent mount points, e.g. /cvmfs/atlas.cern.ch
• High cache efficiency because the entire cluster is likely to use the same software
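A hedged sketch of what reading looks like from a worker node’s point of view, using standard client parameters; cache location and size are illustrative:

# /etc/cvmfs/default.local (illustrative values)
CVMFS_REPOSITORIES=atlas.cern.ch
CVMFS_CACHE_BASE=/var/lib/cvmfs   # persistent local cache directory
CVMFS_QUOTA_LIMIT=20000           # soft limit for the local cache, in MB

[ ~ ]# cvmfs_config probe atlas.cern.ch   # mounts the repository and checks connectivity
[ ~ ]# ls /cvmfs/atlas.cern.ch            # browse the software tree like any other file system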
Writing

[Figure: on the publisher, a union file system overlays a read/write staging area on top of the read-only CernVM-FS mount; on publish, the changes are written to the backend storage (file system or S3)]

Publishing new content:

[ ~ ]# cvmfs_server transaction containers.cern.ch
[ ~ ]# cd /cvmfs/containers.cern.ch && tar xvf ubuntu1610.tar.gz
[ ~ ]# cvmfs_server publish containers.cern.ch
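If the staged changes should be thrown away instead of published, the server tools also provide an abort operation; a sketch using the same repository name as above:

[ ~ ]# cvmfs_server abort containers.cern.ch   # discard the open transaction and its staged changes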
Use of Content-Addressable Storage

[Figure: the repository tree, e.g. /cvmfs/alice.cern.ch/amd64-gcc6.0/4.2.0/ChangeLog . . . , is transformed by compression and hashing into content-addressed objects such as 806fbb67373e9...]

Object Store
• Compressed files and chunks
• De-duplicated

File Catalogs
• Directory structure, symlinks
• Content hashes of regular files
• Large files: chunked with rolling checksum
• Digitally signed
• Time to live
• Partitioned / Merkle hashes (possibility of sub-catalogs)

⊕ Immutable files, trivial to check for corruption, versioning, efficient replication
⊖ Compute-intensive, garbage collection required
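Because every object is named after its content hash, a corruption check reduces to recomputing the hash after download. A minimal sketch, reusing the illustrative `objects/` layout and `$hash` variable from the earlier sketch (not CernVM-FS’s exact layout or tooling):

[ ~ ]# obj=objects/${hash:0:2}/${hash:2}
[ ~ ]# [ "$(gzip -dc "$obj" | sha1sum | cut -d' ' -f1)" = "$hash" ] \
         && echo "object intact" || echo "object corrupted"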
Partitioning of Meta-Data

[Figure: repository tree with sub-catalogs at certificates, aarch64, and x86_64, and below x86_64 at gcc/v8.3 and Python/v3.4]

• Locality by software version
• Locality by frequency of changes
• Partitioning is up to the software librarian, steered through .cvmfscatalog magic marker files (see the sketch below)
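A hedged sketch of how a software librarian would cut a new sub-catalog for a release directory during a publish; the repository name and path are illustrative:

[ ~ ]# cvmfs_server transaction sft.cern.ch
[ ~ ]# touch /cvmfs/sft.cern.ch/x86_64/gcc/v8.3/.cvmfscatalog   # marker file: this subtree gets its own catalog
[ ~ ]# cvmfs_server publish sft.cern.ch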
Deduplication and Compression

24 months of software releases for a single LHC experiment

[Bar chart comparing file system entries (×10^6) and volume [GB] before and after de-duplication and compression]
Site-local network traffic: CernVM-FS compared to NFS

[Monitoring plots: NFS server traffic before and after the switch; site squid web cache traffic before and after the switch]

Source: Ian Collier
Latency sensitivity: CernVM-FS compared to AFS

Use case: starting the “stressHepix” standard benchmark

[Plot “Startup overhead vs. latency”: start-up overhead ∆t [min] and throughput [Mbit/s] as a function of round trip time (LAN up to 150 ms), for AFS and CernVM-FS]
Principal Application Areas

❶ Production Software
Example: /cvmfs/ligo.egi.eu
• Most mature use case
★ Fully unprivileged deployment of fuse module

❷ Integration Builds
Example: /cvmfs/lhcbdev.cern.ch
• High churn, requires regular garbage collection
★ Update propagation from minutes to seconds

❸ Unpacked Container Images
Example: /cvmfs/singularity.opensciencegrid.org
• Works out of the box with Singularity (see the sketch below)
• CernVM-FS driver for Docker
★ Integration with containerd / kubernetes

❹ Auxiliary Data Sets
Example: /cvmfs/alice-ocdb.cern.ch
• Benefits from internal versioning
• Depending on volume, requires more planning for the CDN components

★ Current focus of developments
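For item ❸, a hedged usage sketch: running a command inside an image that is already unpacked on /cvmfs. The image path below is illustrative; actual paths depend on how the repository organizes its images.

[ ~ ]# singularity exec /cvmfs/singularity.opensciencegrid.org/library/ubuntu:latest cat /etc/os-release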
Summary

• CernVM-FS: special-purpose virtual file system that provides a global shared software area for many scientific collaborations
• Content-addressed storage and asynchronous writing (publishing) are key to meta-data scalability
• Current areas of development:
  • Fully unprivileged deployment
  • Integration with the containerd/kubernetes image management engine

https://github.com/cvmfs/cvmfs
Backup Slides