HIVE: Fault Containment for Shared-Memory Multiprocessors
J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, A. Gupta
CSE 598C
Presented by: Sandra Rueda

The Problem
• O.S. for managing the FLASH architecture (a large shared-memory multiprocessor)
• Set of nodes connected in a mesh
• NUMA
[Figure: a FLASH node; labels: Proc, 2L$, Mem, CC, Net, I/O]
Jan 30, 2006

Hive: Main Goals
• Memory sharing: improves performance
• Possible failures:
  • A faulty node makes that node's memory inaccessible
  • A faulty node returns wrong values for reads
  • Software failures may corrupt other nodes' memory

Main Goals
• Fault containment: hardware or software faults are confined to the cell where they occur; as a consequence, only that cell crashes.
• Scalability:
  • Few resources are shared among different cells.
  • More processors → more cells → more parallelism

O.S. Architecture
• Multicellular architecture: processors are grouped into cells. An independent kernel (UNIX SVR4) manages each cell.
[Figure: two panels, "Memory Organization" and "Cell Organization"; labels: Global Address Space, Local Address Space, Cell 0, Cell 1, …, Cell n]

Fault Containment (1): Failure Sources
• Sources and control methods:
  • Message exchange (RPC): timeout + message check
  • Remote reads: careful_reference + message check
  • Remote writes:
    • Internal data: firewall
    • User-level data:
      • Protection of local space
      • Preemptive discard
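
The "timeout + message check" defence for RPCs can be sketched roughly as follows. This is only an illustration: `rpc_call`, `RpcFailure`, and the reply fields `request_id` and `status` are hypothetical names, not taken from Hive, and a thread plus a queue stands in for the real message transport.

```python
import queue
import threading


class RpcFailure(Exception):
    """The remote cell is treated as suspect; the caller does not panic."""


def rpc_call(server, request: dict, timeout_s: float) -> dict:
    """Call `server(request)` but never trust it to answer, or to answer sanely."""
    reply_box = queue.Queue(maxsize=1)

    def worker():
        reply_box.put(server(request))

    threading.Thread(target=worker, daemon=True).start()

    # Timeout: a cell that never replies must not hang the caller.
    try:
        reply = reply_box.get(timeout=timeout_s)
    except queue.Empty:
        raise RpcFailure("RPC timed out") from None

    # Message check: validate the reply before any field is used.
    if not isinstance(reply, dict):
        raise RpcFailure("reply is not a message")
    if reply.get("request_id") != request["request_id"]:
        raise RpcFailure("reply does not match the request")
    if reply.get("status") not in ("ok", "error"):
        raise RpcFailure("reply has an invalid status")
    return reply
```

Either failure path ends in the same exception, so the caller can treat a silent cell and a cell that sends garbage as the same kind of failure hint.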

Fault Containment (2): Control Methods
• The careful_reference protocol prevents errors from causing a kernel panic:
  • Save context
  • Check that the memory range belongs to the expected cell
  • Copy data values
  • Check every remote data structure
  • Careful_off
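
The five steps of the protocol can be mirrored in a small sketch. Everything here is an assumption made for illustration: the `Cell` class, the type-tag check, and the use of a Python exception in place of a trapped bus error are not Hive's actual kernel interfaces.

```python
from dataclasses import dataclass


class RemoteAccessError(Exception):
    """Raised instead of panicking when a remote read cannot be trusted."""


@dataclass
class Cell:
    cell_id: int
    base: int      # first address owned by this cell
    size: int      # number of addresses owned
    memory: dict   # address -> (type_tag, value); stands in for physical memory

    def owns(self, addr: int, length: int) -> bool:
        return self.base <= addr and addr + length <= self.base + self.size


def careful_read(cell: Cell, addr: int, length: int, expected_tag: str) -> list:
    # 1. "Save context": the try block plays the role of the saved context
    #    that a bus error would return to.
    try:
        # 2. Check that the range belongs to the expected cell.
        if not cell.owns(addr, length):
            raise RemoteAccessError("address range is outside the expected cell")
        # 3. Copy the values to local storage before using them.
        local_copy = [cell.memory[a] for a in range(addr, addr + length)]
    except KeyError as exc:  # stands in for a bus error on a failed node
        raise RemoteAccessError("remote memory is unreadable") from exc
    # 4. Sanity-check every copied structure (here: a type tag).
    for tag, _ in local_copy:
        if tag != expected_tag:
            raise RemoteAccessError("remote structure failed its type check")
    # 5. "careful_off": returning ends the protected region.
    return [value for _, value in local_copy]
```

The caller only ever works on the local copy, so a remote cell that changes or corrupts the data afterwards cannot affect a value that has already been checked.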

Fault Containment (3): Control Methods
• The firewall controls which processors are allowed to modify each region of main memory.
  • Only the local processor can change firewall bits.
  • Rights are assigned to:
    • The first process that requests a writable mapping to the page.
    • All the processors in a cell.
• Preemptive discard (recovery)
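
One way to picture the firewall is as a per-page bit vector with one bit per processor. The sketch below assumes that layout; `PageFirewall` and its methods are hypothetical names, and the rule that the owning processor always keeps write access is an assumption of the sketch.

```python
class FirewallError(Exception):
    pass


class PageFirewall:
    """Per-page bit vector: bit p set means processor p may write the page."""

    def __init__(self, owner: int):
        self.owner = owner
        self.bits = 1 << owner  # assumption: the local processor may always write

    def _require_owner(self, requester: int) -> None:
        if requester != self.owner:
            raise FirewallError("only the local processor may change firewall bits")

    def grant(self, requester: int, processors) -> None:
        self._require_owner(requester)
        for p in processors:
            self.bits |= 1 << p

    def revoke(self, requester: int, processors) -> None:
        self._require_owner(requester)
        for p in processors:
            self.bits &= ~(1 << p)

    def may_write(self, processor: int) -> bool:
        return bool((self.bits >> processor) & 1)
```

Granting rights to a whole cell is then a single `grant` call with that cell's processor numbers, and recovery can clear them again with `revoke`.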

Fault Containment (4): Detection
• Detection of a failure:
  • An RPC request times out
  • A memory read causes a bus error
  • Periodic updating of a shared location fails
  • Data fails a consistency check
• When a failure is detected, an agreement protocol is run among the other cells
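
The "periodic updating of a shared location" hint and the agreement step can be sketched as below. The slides do not describe the agreement protocol itself, so the simple majority vote in `agree_failed` is only a stand-in, and `HeartbeatMonitor` and its threshold are hypothetical.

```python
class HeartbeatMonitor:
    """Suspects a cell whose shared counter stops changing."""

    def __init__(self, max_stale_checks: int = 2):
        self.max_stale_checks = max_stale_checks
        self.last_seen = {}  # cell_id -> last counter value observed
        self.stale = {}      # cell_id -> consecutive checks without a change

    def observe(self, cell_id: int, counter: int) -> bool:
        """Record one check; return True if the cell is now suspected."""
        if self.last_seen.get(cell_id) == counter:
            self.stale[cell_id] = self.stale.get(cell_id, 0) + 1
        else:
            self.last_seen[cell_id] = counter
            self.stale[cell_id] = 0
        return self.stale[cell_id] >= self.max_stale_checks


def agree_failed(votes: dict) -> bool:
    """Stand-in agreement: a strict majority of the other cells must concur.

    `votes` maps the id of each voting cell to True (thinks the suspect failed)
    or False.
    """
    return sum(votes.values()) * 2 > len(votes)
```

A suspicion from one cell is only a hint; the vote keeps a single faulty observer from declaring a healthy cell dead.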

Fault Containment (5): Recovery
• First phase:
  • Each cell flushes its TLB and removes any remote mappings.
• Second phase:
  • At the end of the first phase there are no pending remote accesses, so it is possible to revoke firewall write permissions.
  • The virtual memory subsystem detects pages that were writable by a failed cell and notifies the file system.
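
The ordering constraint between the two phases can be made explicit in a sketch. `RecoveringCell` and `recover` are hypothetical names; sets and dicts stand in for the TLB, the mapping tables, and the firewall state.

```python
class RecoveringCell:
    def __init__(self, cell_id: int, remote_mappings, firewall: dict):
        self.cell_id = cell_id
        self.tlb = set(remote_mappings)              # cached translations
        self.remote_mappings = set(remote_mappings)  # (owner cell, page) pairs
        # local page -> ids of cells that hold write permission
        self.firewall = {page: set(cells) for page, cells in firewall.items()}
        self.discard_notices = []  # pages reported to the file system

    def phase_one(self) -> None:
        """Flush the TLB and remove every remote mapping."""
        self.tlb.clear()
        self.remote_mappings.clear()

    def phase_two(self, failed: int) -> None:
        """Revoke the failed cell's write permission and report affected pages."""
        for page, writers in self.firewall.items():
            if failed in writers:
                writers.discard(failed)
                self.discard_notices.append(page)


def recover(cells, failed: int) -> None:
    for cell in cells:
        cell.phase_one()
    # Barrier: phase two may start only when no remote access can be pending.
    if any(cell.remote_mappings for cell in cells):
        raise RuntimeError("phase one did not complete on every cell")
    for cell in cells:
        cell.phase_two(failed)
```

Running phase two on every cell only after the barrier is what makes revoking write permission safe: no surviving cell can still be in the middle of a remote access.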

Fault Containment (6): Recovery
• Preemptive discard:
  • It is possible for a process to fetch stale data from disk after a recovery.
  • Only processes that opened a file before a failure will receive I/O errors. This is implemented with a generation number; a mismatch in the number generates an error.
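
The generation-number check can be sketched in a few lines. `FileObject`, `FileHandle`, and `StaleFileError` are hypothetical names chosen for the example, not Hive's data structures.

```python
class StaleFileError(IOError):
    """I/O error returned to a process holding a pre-failure open file."""


class FileObject:
    def __init__(self, data: str):
        self.data = data
        self.generation = 0

    def discard_after_failure(self) -> None:
        # Pages writable by the failed cell were discarded: bump the generation.
        self.generation += 1


class FileHandle:
    def __init__(self, file: FileObject):
        self.file = file
        self.generation = file.generation  # remembered at open time

    def read(self) -> str:
        if self.generation != self.file.generation:
            raise StaleFileError("file was discarded after this handle was opened")
        return self.file.data
```

A handle opened after the recovery copies the new generation number, so it reads normally; only handles opened before the failure see the error.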

Memory Sharing (1)
• Two types of memory sharing:
  • Logical level: a process on a cell maps a data page from another cell into its address space
[Figure: cell i and cell j, each with memory pages and a pfdat table; entries labeled imp and exp]

Memory Sharing (2)
• Two types of memory sharing:
  • Physical level: one cell transfers control over a page frame to another cell
[Figure: cell i and cell j, each with memory pages and a pfdat table; entries labeled brw and X]
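
The difference between the two levels shows up in the bookkeeping each cell keeps. The sketch below is a rough model only: `CellMemory` and the three functions are hypothetical, and the mapping of the figure labels (exp, imp, brw) onto these fields is an assumption.

```python
class CellMemory:
    def __init__(self, cell_id: int, frames):
        self.cell_id = cell_id
        self.free_frames = set(frames)  # frames this cell can allocate
        self.exported = {}     # frame -> ids of cells importing it ("exp")
        self.imported = set()  # (owner id, frame) pairs mapped here ("imp")
        self.loaned = {}       # frame -> id of the borrowing cell
        self.borrowed = set()  # (owner id, frame) pairs controlled here ("brw")


def share_logical(owner: CellMemory, client: CellMemory, frame: int) -> None:
    """Logical level: the client maps the owner's data page; the owner keeps the frame."""
    owner.exported.setdefault(frame, set()).add(client.cell_id)
    client.imported.add((owner.cell_id, frame))


def loan_frame(owner: CellMemory, borrower: CellMemory, frame: int) -> None:
    """Physical level: control of a free frame moves to the borrower."""
    if frame not in owner.free_frames:
        raise ValueError("only a free frame can be loaned")
    owner.free_frames.remove(frame)
    owner.loaned[frame] = borrower.cell_id
    borrower.borrowed.add((owner.cell_id, frame))


def return_frame(owner: CellMemory, borrower: CellMemory, frame: int) -> None:
    """Undo a loan: the frame becomes allocatable by its owner again."""
    borrower.borrowed.remove((owner.cell_id, frame))
    del owner.loaned[frame]
    owner.free_frames.add(frame)
```

In the logical case the owner still manages the frame and merely records who maps it; in the physical case the owner gives up the right to allocate the frame until it is returned.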

Memory Sharing (3)
• WAX:
  • It is a user-level process that may have access to all cells. In this way it is able to consolidate a global view of the system.
  • Some decisions are made based on the global view, for instance process priorities.
[Figure: WAX spanning Cell 0 … Cell n, alongside processes Proc i … Proc m]

RPC: Optimization
• Sometimes cells exchange information via RPC
• The FLASH architecture includes hardware support to minimize RPC latency
• The mechanism (SIPS: Short Interprocessor Send Facility) is based on the cache-line delivery mechanism used by the cache coherency protocol
  • The primitive is reliable
  • No message fragmentation
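
Because the primitive delivers a message as one cache line and never fragments, a sender has to fit each message into a single fixed-size unit. A minimal sketch of that constraint follows; `pack_sips` is a hypothetical helper, and `LINE_BYTES = 128` is an assumed size, since the slides do not state one.

```python
LINE_BYTES = 128  # assumed cache-line size; the slides do not state one


def pack_sips(payload: bytes) -> bytes:
    """Pad a payload to exactly one line, or refuse it outright."""
    if len(payload) > LINE_BYTES:
        # No fragmentation: a larger request needs some other mechanism.
        raise ValueError("payload does not fit in a single message")
    return payload.ljust(LINE_BYTES, b"\x00")
```

Zero padding discards the original length, so a real message format would need to carry a length or type field inside the payload.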

Experimental Results
• At the time of the paper:
  • Hive was a prototype
  • FLASH hardware was not yet available
  • The authors used SimOS

Simulation Environment
• Hardware
  • 4 MIPS processors, 200 MHz
  • 128 MB memory
  • 4 disk controllers, each with one attached disk
  • 4 Ethernet interfaces
  • 4 consoles
• Hive
  • 4 cells
  • Each cell: 1 processor, 32 MB memory, 1 interface, 1 disk
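
The per-cell figures follow from dividing the simulated machine evenly among the cells. The sketch below reproduces that arithmetic; `Machine`, `CellConfig`, and `partition` are hypothetical names, and the even split is an assumption that matches the four-cell numbers on the slide.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Machine:
    processors: int = 4
    memory_mb: int = 128
    disks: int = 4
    ethernet_interfaces: int = 4


@dataclass(frozen=True)
class CellConfig:
    processors: int
    memory_mb: int
    disks: int
    ethernet_interfaces: int


def partition(machine: Machine, cells: int) -> list:
    """Split every resource evenly; refuse a cell count that does not divide it."""
    shares = {}
    for f in fields(machine):
        total = getattr(machine, f.name)
        if total % cells != 0:
            raise ValueError(f"{f.name} cannot be split evenly into {cells} cells")
        shares[f.name] = total // cells
    return [CellConfig(**shares) for _ in range(cells)]
```

With the defaults above, `partition(Machine(), 4)` gives four cells of 1 processor, 32 MB, 1 disk, and 1 interface each.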

Simulation Environment
• Memory hierarchy (per processor):
  • Instruction cache: 32 KB, two-way associative
  • Primary data cache: 32 KB, two-way associative
  • Secondary unified cache: 1 MB, two-way associative
• Given miss penalty
• Given SIPS latency
• Given interrupt latency
• Given disk latency
• Some values are based on other models

Simulation Environment
• Performance tests
  • Expected workloads (scientific application, parallel application)
  • Times for IRIX 5.2 (reference)
  • Different configurations: one, two, and four cells
• Conclusion: partitioning into cells has little effect on performance, and it allows fault containment

Simulation Environment
• Fault injection tests
  • It is difficult to predict the reliability of a complex system
  • Fault injection tests are used to check whether the reliability mechanisms work properly
  • The authors chose to inject failures in situations where it seemed that a fault in one cell could corrupt another
  • They checked files after recovery to detect data corruption
  • The simulator allowed them to recreate scenarios from a specific checkpoint

Conclusion: Simulation Environment
• Advantages [1]
  • Evaluation of hardware support
  • Evaluation of designed mechanisms
  • Evaluation of tradeoffs
• Problems [2]
  • Simulator bugs
  • Omissions
  • Lack of detail
• Key features determine whether it is useful

References
• [1] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta. Hive: Fault Containment for Shared-Memory Multiprocessors. SIGOPS, 1995.
• [2] Jeff Gibson, Robert Kunz, David Ofelt, Mark Horowitz, John Hennessy, and Mark Heinrich. Flash vs. Simulated Flash: Closing the Simulation Loop. SIGARCH, Volume 28, Issue 5 (December 2000).
