Stratus: Clouds with Microarchitectural Resource Management
Kaveh Razavi and Animesh Trivedi
Once Upon a Time in the Cloud
Instance types offered by the cloud provider:
● Large: 4 cores, 16 GB
● Medium: 2 cores, 8 GB
● Small: 1 core, 4 GB
One tenant calls allocate(small) and another calls allocate(medium); both are placed on the same machine and share its CPU, L3 cache, and DRAM.
What could possibly go wrong here when two tenants share the L3 cache?
Problems with (Unsupervised) Sharing
Setting: a CPU with a shared L3 cache.
(a) Performance: dCat (EuroSys'18) reports 57.6% improvements for Redis with noisy neighbours.
(b) Security: NetCAT (S&P'20) detects the activity of another tenant over the network.
Are these two examples L3 cache specific?
References:
Cong Xu, et al. DCat: Dynamic Cache Management for Efficient, Performance-Sensitive Infrastructure-as-a-Service. ACM EuroSys 2018.
Kurth, M.; Gras, B.; Andriesse, D.; Giuffrida, C.; Bos, H.; and Razavi, K. NetCAT: Practical Cache Attacks from the Network. In S&P, 2020.
New Classes of Microarchitectural Resources
Resource: microarchitectural resources
● CPU: caches, TLBs, hyperthreads, ALUs
● Smart NICs: caches (memory, requests, connection), TLBs, RMT pipelines, DMA engines
● NVM storage: blocks, pages, internal r/w ports, programmable cores, SRAM
● GPUs: memories, caches, execution units
● In-network switches: SRAM and TCAM memories, match-action unit processors, ALUs
● And more: TPUs, FPGAs, near-memory compute elements …
What is happening
● Diverse microarchitectural resources are here to stay
● They have security and performance ramifications
● The root cause is unsupervised sharing (or lack of isolation)
Key challenge: how to manage microarchitectural resources in a principled manner?
Stratus: Clouds with Principled Microarchitectural Resource Management
Key property: isolation
● A cloud resource allocation framework
● Security and performance requirements are the two sides of isolation
● Captures and reasons about microarchitectural resource isolation
Stratus: Clouds with Principled Microarchitectural Resource Management
Q1: How to capture isolation?
Q2: How to reason about isolation?
Q3: How to charge for isolation?
Capturing Isolation: A Declarative Interface
Isolation is captured as constraints on resource allocations:
    handle = ISOLATE (resource, scale, quantity);
● resource: either hard (discrete allocation, e.g., LLC slots, ALUs, TLBs) or soft (contended in time, e.g., DRAM bandwidth)
● scale: extent of isolation requested, {0,1}
● quantity: number of resources for which this constraint must be satisfied
Multiple constraints can be "attached" (AND) by their handles:
    ATTACH (handle1, handle2, ...);
Multiple microarchitectural constraints are passed during cloud resource allocations (also labeled grouped constraints; see the paper):
    ALLOCATE cloud_resource, .. where constraints
E.g., mitigating NetCAT:
    ALLOCATE small where
        h1 = ISOLATE (CPU.LLC, 1.0, 64), // the first 64 lines
        h2 = ISOLATE (NIC.*, *, ...),    // all NIC-level uarch
        ATTACH (h1, h2);                 // both for small VM allocations
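The ISOLATE / ATTACH / ALLOCATE flow can be mimicked in a few lines of Python to make the data flow concrete. This is only a sketch under assumptions: the function names `isolate`, `attach`, and `allocate`, the integer handles, and the dictionary returned by `allocate` are all hypothetical and are not the Stratus API. The third argument of the NIC constraint is elided on the slide, so it is kept elided here as Python's `...` (Ellipsis).

```python
from dataclasses import dataclass
from itertools import count

_next_handle = count(1)  # hypothetical: handles are just increasing integers


@dataclass(frozen=True)
class Constraint:
    """One ISOLATE(resource, scale, quantity) constraint."""
    handle: int
    resource: str     # e.g. "CPU.LLC", or a wildcard such as "NIC.*"
    scale: object     # extent of isolation requested, or "*"
    quantity: object  # number of resources the constraint covers


@dataclass(frozen=True)
class Group:
    """Constraints joined by ATTACH; all of them must hold (AND)."""
    members: tuple


def isolate(resource, scale, quantity):
    """Create a constraint and give it a fresh handle."""
    return Constraint(next(_next_handle), resource, scale, quantity)


def attach(*constraints):
    """Group constraints so they are requested together."""
    if not constraints:
        raise ValueError("ATTACH needs at least one handle")
    return Group(tuple(constraints))


def allocate(instance_type, where):
    """Build an allocation request; a real provider would place a VM here."""
    return {
        "instance": instance_type,
        "constraints": [(c.resource, c.scale, c.quantity) for c in where.members],
    }


# Mirrors the NetCAT example from the slide.
h1 = isolate("CPU.LLC", 1.0, 64)  # the first 64 lines
h2 = isolate("NIC.*", "*", ...)   # all NIC-level uarch; quantity elided on the slide
request = allocate("small", where=attach(h1, h2))
print(request["instance"], len(request["constraints"]))
```

Keeping constraints as plain immutable records separates what the tenant declares from how the provider later decides to satisfy it, which is the point of a declarative interface.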
Building a Reasoning Framework
Cloud Knowledge Base (CKB)
● Similar in spirit to the SKB in the Barrelfish OS (SOSP'09)
● Structured representation of knowledge in one place
● Can be queried
Connected to the CKB on the slide:
● Resources
  - Topology, numbers, types
  - Datasheet information
● Online measurements
  - Utilization, spare
  - Occupancy
● Tenant's constraints (CNF format)
● Allocation strategy
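To illustrate what querying a knowledge base with CNF constraints could look like, the sketch below treats the CKB as a table of spare resource units per host and a tenant's request as a conjunction of clauses, each clause being a disjunction of "needs at least N units of resource R" literals. Everything here is an assumption made for illustration: the host names, the resource names `NIC.cache` and `NIC.tlb`, the numbers, and the function names are hypothetical, and the real CKB may represent and solve constraints quite differently.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, int]  # (resource name, units required)
Clause = List[Literal]     # literals inside a clause are OR-ed
CNF = List[Clause]         # clauses are AND-ed

# Hypothetical CKB contents: spare units of each resource on each host.
ckb: Dict[str, Dict[str, int]] = {
    "host-a": {"CPU.LLC": 32, "NIC.cache": 8},
    "host-b": {"CPU.LLC": 128, "NIC.cache": 0},
    "host-c": {"CPU.LLC": 64, "NIC.cache": 4},
}


def satisfies(spare: Dict[str, int], cnf: CNF) -> bool:
    """True if every clause has at least one literal the host can meet."""
    return all(
        any(spare.get(resource, 0) >= need for resource, need in clause)
        for clause in cnf
    )


def candidate_hosts(kb: Dict[str, Dict[str, int]], cnf: CNF) -> List[str]:
    """Query the knowledge base for hosts that can satisfy the constraints."""
    return sorted(host for host, spare in kb.items() if satisfies(spare, cnf))


# (64 LLC units) AND (4 NIC cache units OR 1 NIC TLB unit)
constraints: CNF = [
    [("CPU.LLC", 64)],
    [("NIC.cache", 4), ("NIC.tlb", 1)],
]
print(candidate_hosts(ckb, constraints))
```

An allocation strategy would then pick among the candidate hosts, for example by utilization; that choice is left out of the sketch.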
Charging for Isolation: Isolation Credits
Isolation spectrum:
● Low isolation: + high utilization, + better efficiency, - low perf/security
● High isolation: + low interference, + better perf/security, - low utilization
Isolation credit: a currency to capture this tradeoff
● Encourages tenants to only specify relevant constraints
● Encourages providers to innovate in better isolation mechanisms
Two ways to interact with the Stratus cloud:
● A tenant who knows its constraints submits the constraints and is charged in credits (42 credits on the slide)
● A tenant with a budget submits a credit budget (credit budget = 42 on the slide) and receives constraints
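The two directions of the credit exchange (constraints in, credits out; budget in, constraints out) can be sketched with a toy pricing rule. The talk does not specify how credits are computed, so the rule used here, unit price times scale times quantity, is purely an assumption, as are the unit prices, the resource name `DRAM.bandwidth`, the greedy budget fitting, and the function names.

```python
from typing import Dict, List, Tuple

Constraint = Tuple[str, float, int]  # (resource, scale, quantity)

# Made-up prices in credits per isolated unit; not taken from the talk.
UNIT_PRICE: Dict[str, float] = {"CPU.LLC": 0.5, "DRAM.bandwidth": 1.0}


def price(constraints: List[Constraint]) -> float:
    """Tenant who knows its constraints: constraints in, credits out."""
    return sum(UNIT_PRICE[res] * scale * qty for res, scale, qty in constraints)


def fit_budget(constraints: List[Constraint], budget: float):
    """Tenant with a budget: keep constraints, in listed order, while they fit."""
    chosen: List[Constraint] = []
    spent = 0.0
    for constraint in constraints:
        cost = price([constraint])
        if spent + cost <= budget:
            chosen.append(constraint)
            spent += cost
    return chosen, spent


wanted = [
    ("CPU.LLC", 1.0, 64),
    ("DRAM.bandwidth", 1.0, 10),
    ("CPU.LLC", 1.0, 16),
]
total = price(wanted)
chosen, spent = fit_budget(wanted, budget=42)
print(total, spent, len(chosen))
```

Under any rule of this shape, every extra constraint costs credits, which is what pushes tenants to request only the isolation they actually need.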
Summary: Stratus
Managing microarchitectural resources in a principled manner
Three key ideas:
1. A declarative interface
2. A Cloud Knowledge Base (CKB)
3. Isolation Credits
Challenges and Discussion Points
Enforcing isolation
● Mechanisms
● Policies
Right constraints
● Profile guided
● Security libraries
Scalability
● CKB
● O(1-10ms)
Discussion questions:
● Is microarchitectural resource management really worth it?
● More efforts on mechanisms or better policies?
● Can we have better hardware support from vendors?
● What are we missing from a cloud-operation point of view?