Containment Domains Resilience Mechanisms and Tools Toward Exascale Resilience Mattan Erez The University of Texas at Austin
2 (c) Mattan Erez • Yes, resilience is an exascale concern – Checkpoint-restart not good enough on its own – Commercial datacenters face different problems – Heterogeneity keeps growing – Correctness also at risk (integrity)
3 (c) Mattan Erez • Containment Domains (CDs) – Isolate application resilience from system – Increase performance and efficiency – Simplify defensive (resilient) codes – Adapt hardware and software • Portable, Performant, Resilient, Proportional
4 (c) Mattan Erez • Efficient resilience is an exascale problem
5 (c) Mattan Erez • Failure rate possibly too high for checkpoint/restart • Correctness also at risk [Figure: performance efficiency, 0% to 100%; legend entries: "CDs, NT", "h-CPR, 80%", "g-CPR, 80%"]
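To make the checkpoint/restart concern concrete, here is a minimal sketch that estimates efficiency with Young's first-order approximation for the checkpoint interval, tau = sqrt(2 * delta * M), where delta is the checkpoint cost and M is the mean time between failures. The model, the function names, and the sample numbers are illustrative assumptions; they are not the data behind the plot on this slide.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Young's first-order optimal checkpoint interval: tau = sqrt(2 * delta * M).
// Both arguments are in seconds.
double young_interval(double checkpoint_cost_s, double mtbf_s) {
    assert(checkpoint_cost_s > 0.0 && mtbf_s > 0.0);
    return std::sqrt(2.0 * checkpoint_cost_s * mtbf_s);
}

// First-order efficiency estimate (assumption: restart time is ignored).
// Waste = time spent checkpointing per interval
//       + expected rework of half an interval per failure.
double cpr_efficiency(double checkpoint_cost_s, double mtbf_s) {
    const double tau = young_interval(checkpoint_cost_s, mtbf_s);
    const double waste = checkpoint_cost_s / tau + tau / (2.0 * mtbf_s);
    return std::max(0.0, 1.0 - waste);  // clamp: no forward progress below zero
}
```

Under this model, a 50 s checkpoint with a 10,000 s MTBF gives a 1,000 s interval and an estimated efficiency of about 0.9. When the MTBF shrinks to the checkpoint cost itself (for example 60 s and 60 s), the estimate collapses to zero, which is the regime the slide warns about.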
6 (c) Mattan Erez • Energy also problematic [Figure: energy overhead, 0% to 20%, at machine scales 2.5PF, 10PF, 40PF, 160PF, 640PF, 1.2EF, 2.5EF; legend entries: "CDs, NT", "h-CPR, 80%", "g-CPR, 80%"]
7 (c) Mattan Erez • Something bad every ~ minute at exascale • Something bad every year commercially – Smaller units of execution – Different requirements – Different ramifications
8 (c) Mattan Erez • Rapid adoption of new technology and accelerators – Again, potential mismatch with commercial setting
9 (c) Mattan Erez • So who’s responsible for resilience? • Hardware? Software? • Algorithm?
10 (c) Mattan Erez • Can hardware alone solve the problem? • Yes, but costly – Significant and fixed/hidden overheads – Different tradeoffs in commercial settings
11 (c) Mattan Erez • Fixed overhead examples (estimated), in energy and/or throughput – Up to ~25% for chipkill correct vs. chipkill detect – 20–40% for pipeline SDC reduction – >2X for arbitrary correction – Even greater overhead if approximate units allowed
12 (c) Mattan Erez • Relaxed reliability and precision – Some lunacy (rare easy-to-detect errors + parallelism) – Lunatic fringe: bounded imprecision – Lunacy: live with real unpredictable errors [Figure: bar chart of "Arith." and "Headroom" (axis 0 to 50) for Today, Scaled, Researchy, Some lunacy, Lunatic fringe, Lunacy; rough estimated numbers for illustration purposes]
13 (c) Mattan Erez • Can software do it alone? – Detection likely very costly – Recovery effectiveness depends on error/failure frequency – Tradeoffs more limited
14 (c) Mattan Erez • Locality and hierarchy are key – Hierarchical constructs – Distributed operation • Algorithm is key: – Correctness is a range
15 (c) Mattan Erez • Containment Domains elevate resilience to first-class abstraction – Program-structure abstractions – Composable resilient program components – Regimented development flow – Supporting tools and mechanisms – Ideally combined with adaptive hardware reliability • Portable, Performant, Resilient, Proportional
18 (c) Mattan Erez • CDs help bridge the gap – Help us figure out exactly how – Open source: lph.ece.utexas.edu/public/CDs bitbucket.org/cdresilience/cdruntime
19 (c) Mattan Erez CDs Embed Resilience within Application • Express resilience as a tree of CDs – Match CD, task, and machine hierarchies – Escalation for differentiated error handling • Semantics – Erroneous data never communicated – Each CD provides recovery mechanism • Components of a CD – Preserve data on domain start – Compute (domain body) – Detect faults before domain commits – Recover from detected errors [Figure: tree of CDs; labels: Root CD, Child CD]
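The four components listed on this slide (preserve, compute, detect, recover) and the escalation rule can be sketched as a toy model. Everything below (`run_cd`, `Escalate`, `max_attempts`) is hypothetical and written only for illustration; it is not the CD runtime API, and it assumes the preserved state is small enough to copy.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical: thrown when a domain exhausts its retries so that the
// enclosing (parent) domain can apply its own recovery.
struct Escalate : std::runtime_error {
    Escalate() : std::runtime_error("escalate to parent CD") {}
};

// Toy containment domain around `body` operating on `state`:
//   preserve -> compute -> detect -> (commit | restore and re-execute | escalate).
// Returns the number of executions that were needed.
template <typename State, typename Body, typename Detect>
int run_cd(State& state, Body&& body, Detect&& detect_ok, int max_attempts = 3) {
    assert(max_attempts > 0);
    const State preserved = state;             // preserve data on domain start
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        body(state);                           // compute (domain body)
        if (detect_ok(std::as_const(state))) { // detect faults before commit
            return attempt;                    // commit
        }
        state = preserved;                     // recover: restore, then re-execute
    }
    // State has been restored, so erroneous data does not leave the domain.
    throw Escalate{};
}
```

A nested call to `run_cd` inside another domain's `body` gives the tree structure: a child that throws `Escalate` makes the parent's own detect-and-recover path take over.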
20-23 (c) Mattan Erez Mapping example: SpMV

    void task<inner> SpMV(in M, in Vi, out Ri) {
      cd = GetCurrentCD()->CreateAndBegin();
      cd->Preserve(matrix, size, kCopy);
      forall (…) reduce(…)
        SpMV(M[…], Vi[…], Ri[…]);
      cd->Complete();
    }

    void task<leaf> SpMV(…) {
      cd = GetCurrentCD()->CreateAndBegin();
      cd->Preserve(M, sizeof(M), kRef);
      cd->Preserve(Vi, sizeof(Vi), kCopy);
      for r=0..N
        for c=rowS[r]..rowS[r+1]
          resi[r] += data[c]*Vi[cIdx[c]];
          cd->CDAssert(idx > prevIdx, kSoft);
          prevC = c;
      cd->Complete();
    }

[Figure: Matrix M split into blocks M11, M12, M21, M22 and Vector V split into V1, V2; distributed to 4 nodes holding (M11, V1), (M21, V1), (M12, V2), (M22, V2)]
24 (c) Mattan Erez Concise abstraction for complex behavior

    void task<leaf> SpMV(…) {
      cd = GetCurrentCD()->CreateAndBegin();
      cd->Preserve(M, sizeof(M), kRef);
      cd->Preserve(Vi, sizeof(Vi), kCopy);
      for r=0..N
        for c=rowS[r]..rowS[r+1]
          resi[r] += data[c]*Vi[cIdx[c]];
          cd->CDAssert(idx > prevIdx, kSoft);
          prevC = c;
      cd->Complete();
    }

[Figure labels: Local copy or regen; Sibling; Parent (unchanged)]
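As a runnable counterpart to the leaf task on this slide, the following sketch performs a CSR sparse matrix-vector product with a soft check on the column indices and restores the matrix from an unchanged parent copy when the check fails. It is a plain C++ stand-in, not the CD runtime: `CsrMatrix`, `spmv_attempt`, and `resilient_spmv` are hypothetical names, and the reading of `kRef` as "recover from data the parent still holds unchanged" is an assumption. Only column-index corruption is checked; corrupted row offsets are out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CSR matrix: row r owns entries [row_start[r], row_start[r + 1]).
struct CsrMatrix {
    std::vector<std::size_t> row_start;  // size = rows + 1
    std::vector<std::size_t> col_idx;    // column of each stored value
    std::vector<double> data;            // stored values
};

// One execution of the leaf body. Returns false when the soft check fails:
// column indices must be in range and strictly increasing within a row.
bool spmv_attempt(const CsrMatrix& m, const std::vector<double>& v,
                  std::vector<double>& result) {
    assert(!m.row_start.empty());
    const std::size_t rows = m.row_start.size() - 1;
    result.assign(rows, 0.0);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = m.row_start[r]; c < m.row_start[r + 1]; ++c) {
            const std::size_t idx = m.col_idx[c];
            const bool first_in_row = (c == m.row_start[r]);
            if (idx >= v.size() || (!first_in_row && idx <= m.col_idx[c - 1])) {
                return false;  // detected: index data looks corrupted
            }
            result[r] += m.data[c] * v[idx];
        }
    }
    return true;
}

// CD-style wrapper: on a failed check, restore the working matrix from the
// parent's unchanged copy and re-execute; give up (escalate) after max_attempts.
bool resilient_spmv(CsrMatrix& working, const CsrMatrix& parent_copy,
                    const std::vector<double>& v, std::vector<double>& result,
                    int max_attempts = 2) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (spmv_attempt(working, v, result)) return true;  // commit
        working = parent_copy;                               // recover
    }
    return false;  // caller (parent domain) must handle the failure
}
```

The input vector is passed by const reference and never modified, so it needs no restoration here; in the slide's version it is preserved with `kCopy` because the leaf cannot assume that.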
25 (c) Mattan Erez • General abstractions – a “language” for resilience – Replicate in space or time or none? [Figure: tasks A, B, C executed redundantly with results compared; labels: Local copy or regen; Sibling; Parent (unchanged)]
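Replication in time can be illustrated with a small sketch: execute the same computation repeatedly and accept a result only once two executions agree. `run_redundant` and `max_runs` are hypothetical names chosen for this example, and exact equality is assumed to be a meaningful comparison for the result type.

```cpp
#include <cassert>
#include <optional>
#include <type_traits>
#include <utility>
#include <vector>

// Temporal redundancy: re-execute `compute` until two executions return the
// same value, up to `max_runs` executions in total.
// Returns std::nullopt when no two results agree (the caller should escalate).
template <typename F>
auto run_redundant(F&& compute, int max_runs = 3)
    -> std::optional<std::decay_t<decltype(compute())>> {
    assert(max_runs >= 2);  // agreement needs at least two executions
    std::vector<std::decay_t<decltype(compute())>> seen;
    for (int run = 0; run < max_runs; ++run) {
        auto value = compute();
        for (const auto& earlier : seen) {
            if (earlier == value) return value;  // two executions agree: accept
        }
        seen.push_back(std::move(value));
    }
    return std::nullopt;
}
```

With no faults this costs two executions; a single faulty execution costs a third. Replication in space would run the copies on different resources at the same time instead, trading hardware for latency.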
26 (c) Mattan Erez • CDs natural fit for: – Hierarchical SPMD – Task-based systems • CDs still general: – Opportunistic approaches to add hierarchical resilience – Always fall back to more checkpoint-like mappings
27 (c) Mattan Erez • Reminder of why you care
28 (c) Mattan Erez • CDs enable per-experiment/system “optimality” – (Portable) Use same resilience abstractions across programming models and implementations • MPI ULFM? MPI-Reinit? OpenMP? UPC++? Legion? – Don’t keep rethinking correctness and recovery • CPU, GPU, FPGA accelerator, memory accelerator, …? – (Performant) Resilient patterns that scale • Hierarchical / local • Aware of application semantics • Auto-tuned efficiency/reliability tradeoffs – (Resilient) Defensive coding • Algorithms, implementations, and systems • Reasonable default schemes • Programmer customization – (Proportional) Adapt hardware and software redundancy
29 (c) Mattan Erez CD Runtime System Architecture – Annotations, persistence, reporting, recovery, tools [Figure: block diagram with legend External Tool / Internal Tool / Future Plan; labeled blocks include CD-annotated Applications/Libraries, Compiler Support, Debugger, User Interaction for customized error detection/handling/tolerance/injection, CD-App Mapper, Scaling Tool, CD Runtime System, Persistence Layer (LWM2), State Preservation, Runtime Logging, Communication Logging, Error Detector, Error Handling, Unified Runtime, Auto-tuner Interface, CD Auto Tuner, Profiling & Visualization (Sight), Low-Level Machine Check Interface, BLCR, Communication Runtime Library (Legion + GasNet), Legion + Libc, CD-Storage Mapping Interface, HW/SW I/F, storage levels DRAM, SSD, Buddy, PFS, HDD, Error Reporting Architecture, Hardware]