Research in Production Clouds Designed for Transition
Mike May, Technical Director
Intelligent architectures and big data science
WHO I AM
• Mike May, Technical Director
  – 30+ cloud deployments
    • 4 production US Government clouds
    • 6 simultaneous DARPA research clouds
  – Background
    • Cybersecurity
    • HPC systems engineering
    • Stacker since 2013
WHAT WE DO
• Supporting 10 active R&D programs
  – Mostly DARPA programs
  – Design and deploy upstream IaaS
    • OpenStack
    • Mesos
    • Kubernetes
    • Burst support to public clouds
  – Drastically diverse research goals
    • Data science and analytics heavy
WHAT WE DEPLOY
BANANA (🍍) FOR SCALE
• 1 DARPA program cluster (OpenStack and Mesos)
  – 476 raw CPU cores (952 threads/vCPUs)
  – 13TB RAM; never overprovisioned
  – 2PB raw disk
  – 14 GPU nodes
  – 48 Pascal GPUs
    • 172,032 CUDA cores
    • 576GB of VRAM
  – Bare metal Mesos with burst to OpenStack
  – GPU development VMs available in OpenStack
  – GPU batch job support in Mesos
  – 100% open source tools
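As a sanity check on the slide's totals (the per-GPU figures here are an inference, not from the talk: 3,584 CUDA cores and 12GB of VRAM per card match a Pascal-class part such as the Tesla P100):

$$48 \times 3584 = 172{,}032 \ \text{CUDA cores}, \qquad 48 \times 12\,\mathrm{GB} = 576\,\mathrm{GB\ of\ VRAM}$$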
HOW IS IT USED
• Seamless development-to-production experience
  – Hardware is shifted between IaaS offerings as needed
  – Development-heavy start, production-heavy transition
• Simultaneous provisioned and batch job support
• CI/CD processes automatically promote work to the relevant cluster resources (pipeline sketch after this slide)
  – Provided to users automatically; they retain full control
  – Fire-and-forget methodology; fail fast
• L2 isolation by default
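The talk doesn't show the promotion pipeline itself, so here is a minimal sketch of what such automatic promotion could look like in GitLab CI (the tool named later in this deck). The stage names, job names, and playbook paths are illustrative assumptions, not the team's actual configuration:

```yaml
# .gitlab-ci.yml (hypothetical): test, deploy to development VMs, then
# promote the work to the production batch cluster. All names are illustrative.
stages:
  - test
  - deploy-dev
  - promote

run-tests:
  stage: test
  script:
    - pytest tests/

deploy-to-dev:
  stage: deploy-dev
  script:
    # Assumed playbook that provisions development VMs in OpenStack
    - ansible-playbook -i inventories/dev deploy.yml

promote-to-batch:
  stage: promote
  only:
    - master          # merged work is promoted automatically: fire and forget
  script:
    # Assumed playbook that hands the workload to the Mesos batch cluster
    - ansible-playbook -i inventories/prod promote.yml
```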
SYSTEM LIMITS
• CPU overprovisioning is never more than 8x (which is a lot)
  – CPU NOPs are our enemy too
  – We collect performance metadata, gathered on boxes that are not currently overprovisioned
  – Set per program
  – Decided case by case
SYSTEM LIMITS (CONT.)
• RAM is NEVER overprovisioned (enforcement sketch after this slide)
  – Problems bubbled up as bare metal OS issues
• GPUs are (painfully) special
  – In and out of batch processing pipelines
  – Obvious but important: development and experiments change use cases and needs
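A minimal sketch of how these two limits could be enforced on an OpenStack compute node, assuming Ansible manages nova.conf directly. cpu_allocation_ratio and ram_allocation_ratio are standard Nova options; the host group, task names, and handler are ours:

```yaml
# Hypothetical play pinning Nova's overcommit behavior to the limits above:
# CPU capped at 8x, RAM never overprovisioned.
- hosts: compute
  become: true
  tasks:
    - name: Cap CPU overcommit at 8x
      ini_file:
        path: /etc/nova/nova.conf
        section: DEFAULT
        option: cpu_allocation_ratio
        value: "8.0"
      notify: restart nova-compute

    - name: Never overcommit RAM
      ini_file:
        path: /etc/nova/nova.conf
        section: DEFAULT
        option: ram_allocation_ratio
        value: "1.0"
      notify: restart nova-compute

  handlers:
    - name: restart nova-compute
      service:
        name: nova-compute
        state: restarted
```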
STARTING WITH A BASELINE
• We needed a baseline that others could reproduce locally
• Fuel was a great start: its web interface made the process much easier to ingest
CUSTOMIZE OFF OF BASELINE
• Ansible supported all customizations applied after a baseline deployment
• "Program-public" for all to see
CLOUD OPS
• Cloud administration
  – Configuration management
    • "Ansiblize" all the things
    • Idempotency is key (example after this slide)
  – Automation
    • "(Almost) any task I have done more than once is to be automated/scripted"
      – Easier said than done
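A small example of the idempotent style this slide advocates. The play below is a generic illustration (host group and packages assumed), not taken from the deck:

```yaml
# Hypothetical play: every task declares a desired state, so re-running the
# play reports "ok" instead of repeating work -- the idempotency the slide
# calls key.
- hosts: controllers
  become: true
  tasks:
    - name: Ensure chrony is installed
      package:
        name: chrony
        state: present

    - name: Ensure time sync is enabled and running
      service:
        name: chronyd
        state: started
        enabled: true
```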
SYSTEMS FROM CODE
• All it takes to build an OpenStack base image:
  – GitLab
  – Packer
  – Ansible
A LITTLE CODE
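The code from this slide is not preserved in the transcript. As a stand-in, here is a minimal sketch of the GitLab + Packer + Ansible flow the previous slide names, with an assumed template filename and job name; the template would use Packer's ansible provisioner for configuration:

```yaml
# .gitlab-ci.yml (hypothetical): build the OpenStack base image with Packer.
# "openstack-base.json" and the job name are illustrative assumptions.
build-base-image:
  stage: build
  script:
    - packer validate openstack-base.json
    - packer build openstack-base.json
  only:
    - master
```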
CLOUDS AS CODE
• Code for our current deployments
• Every management and service task is captured and reviewable by the entire team
SELF-SERVICE PROXY WITH AUTHENTICATION
• User-driven authentication
LESSONS LEARNED
• Putting off automation reduces the chance you will ever do it
• Monitoring is hard to do right but powerful for understanding users' interactions with services
• Ground truth / root cause EVERYTHING
  – Issues, alerts, crashes, user reports
• Researchers are biased (and so are admins and operators)
• Evacuation must always be an option
  – Resource planning
• Document and train by default
THANK YOU!