Shared Memory Multiprocessors
Ken Birman
Draws extensively on slides by Ravikant Dintyala

Big picture debate
• How best to exploit hardware parallelism?
  – “Old” model: develop an operating system married to the hardware; use it to run one of the major computational science packages
  – “New” models: seek to offer a more transparent way of exploiting parallelism
• Today’s two papers offer distinct perspectives on this topic

Contrasting perspectives
• Disco:
  – Here, the basic idea is to use a new VMM to make the parallel machine look like a very fast cluster
  – Disco runs a commodity operating system on it
• Question raised
  – Given that interconnects are so fast, why not just buy a real cluster?
  – Disco: focus is on the benefits of a shared VM

Time warp…
• As it turns out, Disco found a commercially important opportunity
  – But it wasn’t exploitation of ccNUMA machines
  – Disco morphed into VMware, a major product for running Windows on Linux and vice versa
  – Company was ultimately sold for $550M
• …proving that research can pay off!

Contrasting perspectives
• Tornado:
  – Here, the assumption is that shared memory will be the big attraction to the end user
  – But performance can be whacked by contention and false sharing
  – Want the “illusion” of sharing, but with a hardware-sensitive implementation
  – They also believe that the user is working in an OO paradigm (today we would point to languages like Java and C#, or platforms like .NET and CORBA)
  – Goal becomes: provide amazingly good support for shared component integration in a world of threads and objects that interact heavily

Bottom line here?
• Key idea: clustered object
  – Looks like a shared object
  – But actually implemented cleverly, with one local object instance per thread…
• Tornado was interesting…
  – …and got some people PhDs and tenure
  – …but it ultimately didn’t change the world in any noticeable way
• Why?
  – Is this a judgment on the work? (Very architecture-dependent)
  – Or a comment about the nature of “majority” OS platforms (Linux, Windows, perhaps QNX)?
Trends when work was done
• A period when multiprocessors were
  – Fairly tightly coupled, with memory coherence
  – Viewed as a possible cost/performance winner for server applications
• And cluster interconnects were still fairly slow
• Research focused on several kinds of concerns:
  – Higher memory latencies; TLB management is critical
  – Large write-sharing costs on many platforms
  – Large secondary caches needed to mask disk delays
  – NUMA h/w, which suffers from false sharing of cache lines
  – Contention for shared objects
  – Large system sizes

OS issues for multiprocessors
• Efficient sharing
• Scalability
• Flexibility (keep pace with new hardware innovations)
• Reliability

Ideas
• Statically partition the machine and run multiple, independent OS’s that export a partial single-system image (map locality and independence in the applications to their servicing – localization-aware scheduling and caching/replication hiding NUMA)
• Partition the resources into cells that coordinate to manage the hardware resources efficiently and export a single system image
• Handle resource management in a separate wrapper between the hardware and OS
• Design a flexible object-oriented framework that can be optimized in an incremental fashion

Virtual Machine Monitor
• Additional layer between hardware and operating system
• Provides a hardware interface to the OS, manages the actual hardware
• Can run multiple copies of the operating system
• Fault containment – OS and hardware
• Open issues: overhead, uninformed resource management, communication and sharing between virtual machines?

DISCO
[Figure: the DISCO architecture – commodity OS, SMP-OS, and thin OS images run on top of DISCO, which runs on the PEs of a ccNUMA multiprocessor joined by an interconnect]
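The layering in the DISCO figure above can be made concrete with a small sketch. This is not DISCO code: the struct names (`monitor`, `virtual_machine`, `vcpu`), the fixed limits, and the scheduling helper are all invented for illustration. The point is only that the monitor owns the physical PEs and multiplexes virtual machines, each carrying its own OS image, across them.

```c
/* Illustrative sketch of the DISCO layering, not the actual implementation.
 * The monitor owns the real ccNUMA processors and exports a set of virtual
 * machines, each running its own commodity (or thin, special-purpose) OS. */
#include <stddef.h>

#define MAX_PE 32 /* physical processing elements in the ccNUMA machine */
#define MAX_VM  8 /* virtual machines multiplexed onto them             */

enum guest_kind { COMMODITY_OS, SMP_OS, THIN_LIBRARY_OS };

struct vcpu {
    int assigned_pe;                   /* physical PE this VCPU runs on     */
    unsigned long privileged_regs[32]; /* shadow of guest privileged state  */
};

struct virtual_machine {
    enum guest_kind kind;
    struct vcpu vcpus[4];              /* the guest sees a small SMP        */
    size_t nvcpus;
    unsigned long guest_phys_base;     /* contiguous "physical" space at 0
                                          for the guest, backed by machine
                                          memory                            */
    size_t guest_phys_size;
};

struct monitor {
    struct virtual_machine vms[MAX_VM];
    size_t nvms;
    int pe_owner[MAX_PE];              /* which VM a PE is lent to          */
};

/* The monitor, not the guest OS, decides where virtual CPUs run; each guest
 * keeps believing it owns the hardware interface it was written for. */
static void schedule_vcpu(struct monitor *m, int vm, int vcpu, int pe)
{
    m->pe_owner[pe] = vm;
    m->vms[vm].vcpus[vcpu].assigned_pe = pe;
}

int main(void)
{
    struct monitor m = { .nvms = 2 };
    m.vms[0].kind = COMMODITY_OS;
    m.vms[1].kind = THIN_LIBRARY_OS;
    schedule_vcpu(&m, 0, 0, 0);        /* VM 0, VCPU 0 -> PE 0 */
    schedule_vcpu(&m, 1, 0, 4);        /* VM 1, VCPU 0 -> PE 4 */
    return 0;
}
```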
Interface
• Virtual CPU
• Virtual Physical Memory
• Virtual I/O Devices
• Virtual Disks
• Virtual Network Interface

Implementation
• Processors – MIPS R10000 processor (kernel pages in unmapped segments)
• Physical memory – contiguous physical address space starting at address zero (non-NUMA-aware)
• I/O devices – virtual disks (private/shared), virtual networking (each virtual machine is assigned a distinct link-level address on an internal virtual subnet managed by DISCO; for communication with the outside world, DISCO acts as a gateway), other devices have appropriate device drivers
• All in 13,000 lines of code

Major Data Structures
[Figure: DISCO’s major data structures]

Virtual CPU
• Virtual processors are time-shared across the physical processors (under “data locality” constraints)
• Each virtual CPU has a “process table entry” + privileged registers + TLB contents
• DISCO runs in kernel mode, the guest OS runs in supervisor mode, and everything else runs in user mode
• Operations that cannot be issued in supervisor mode are emulated (on trap – update the privileged registers of the virtual processor and jump to the virtual machine’s trap vector)

Virtual Physical Memory
• Mapping from physical address (virtual machine physical) to machine address maintained in the pmap
• Processor TLB contains the virtual-to-machine mapping
• Kernel pages – relink the operating system code and data into the mapped region
• Recent TLB history saved in a second-level software cache
• Tagged TLB not used

NUMA Memory Management
• Migrate/replicate pages to maintain locality between a virtual CPU and its memory
• Uses hardware support for detecting “hot pages”
  – Pages heavily used by one node are migrated to that node
  – Pages that are read-shared are replicated to the nodes most heavily accessing them
  – Pages that are write-shared are not moved
  – Number of moves of a page is limited
• Maintains an “inverted page table” analogue (memmap) to keep TLB and pmap entries consistent after replication/migration
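To make the pmap/memmap bookkeeping and the hot-page policy on the last two slides more concrete, here is a rough C sketch. The structure layouts, the per-node reference counts standing in for the hardware hot-page counters, and the 90% migration threshold are assumptions for illustration, not DISCO’s actual data structures or policy parameters.

```c
/* Illustrative sketch (not DISCO's code) of the two mapping tables and of
 * the hot-page policy: migrate pages used heavily by one node, replicate
 * read-shared pages, and leave write-shared pages where they are. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 4
#define MAX_MOVES 8  /* cap on how often a single page may migrate */

/* pmap: per-VM, guest-physical page -> machine page */
struct pmap_entry {
    uint64_t machine_pfn;
};

/* memmap: per machine page, the "inverted page table" analogue that lets
 * the monitor find and fix every TLB/pmap entry after a move or copy. */
struct memmap_entry {
    uint64_t guest_pfn;        /* back-pointer into the owning VM's pmap   */
    int      owner_vm;
    uint32_t refs[MAX_NODES];  /* per-node reference counts ("hot page")   */
    bool     write_shared;
    int      moves;            /* how many times this page has migrated    */
};

enum page_action { KEEP, MIGRATE, REPLICATE };

/* Decide what to do with a hot page; *hot_node receives the node that
 * accounts for most of the references (the migration target). */
enum page_action classify_hot_page(const struct memmap_entry *e, int *hot_node)
{
    uint32_t total = 0, max = 0;
    *hot_node = 0;
    for (int n = 0; n < MAX_NODES; n++) {
        total += e->refs[n];
        if (e->refs[n] > max) { max = e->refs[n]; *hot_node = n; }
    }
    if (e->write_shared || e->moves >= MAX_MOVES || total == 0)
        return KEEP;       /* write-shared pages are never moved            */
    if (max * 10 >= total * 9)
        return MIGRATE;    /* ~90% of accesses from one node (invented cut) */
    return REPLICATE;      /* read-shared: copy to the hottest nodes        */
}
```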
Page Migration
[Figure: four-step page migration example – Node 0 and Node 1, each with a VCPU, showing a virtual page mapped through the TLB to a physical page and then to a machine page; after the page moves between nodes, the memmap, pmap, and TLB entries are updated]

Virtual I/O Devices
• Each DISCO device defines a monitor call used to pass all command arguments in a single trap
• Special device drivers added into the OS
• DMA maps intercepted and translated from physical addresses to machine addresses
• Virtual network devices emulated using (copy-on-write) shared memory

Virtual Disks
• The virtual disk / machine memory relation is similar to buffer aggregates and shared memory in IO-Lite
• The machine memory is like a cache (disk requests are serviced from machine memory whenever possible)
• Two B-trees are maintained per virtual disk: one keeps track of the mapping between disk addresses and machine addresses, the other keeps track of the updates made to the virtual disk by the virtual processor
• Propose to log the updates in a disk partition (the actual implementation handles non-persistent virtual disks in the above manner; persistent disk writes are routed to the physical disk)
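A rough sketch of the virtual-disk read path described above. The paper keeps two B-trees per virtual disk; here a flat lookup array stands in for the disk-address-to-machine-address tree, the physical-disk read and mapping calls are stubs, and all names are invented. What it shows is the caching behavior: a request is served from machine memory when the page is already present, and shared pages are handed out read-only so that later writes can be handled copy-on-write.

```c
/* Sketch only: a flat table replaces the per-disk B-trees, and the monitor
 * services are stubbed.  The point is the read path: hit in machine memory
 * when possible, and map shared pages read-only so sharing stays safe. */
#include <stdint.h>

#define DISK_PAGES 1024
#define NO_PAGE    UINT64_MAX

struct virtual_disk {
    /* maps virtual-disk page number -> machine page number (or NO_PAGE) */
    uint64_t disk_to_machine[DISK_PAGES];
};

/* Stubbed monitor services (in DISCO these would touch real hardware). */
static uint64_t next_free_pfn = 1;
static uint64_t machine_page_alloc(void) { return next_free_pfn++; }
static void physical_disk_read(uint64_t disk_page, uint64_t machine_pfn)
{ (void)disk_page; (void)machine_pfn; /* DMA from the real disk */ }
static void map_into_vm(int vm, uint64_t guest_pfn, uint64_t machine_pfn,
                        int writable)
{ (void)vm; (void)guest_pfn; (void)machine_pfn; (void)writable;
  /* update the VM's pmap and TLB entries */ }

void vdisk_init(struct virtual_disk *vd)
{
    for (int i = 0; i < DISK_PAGES; i++)
        vd->disk_to_machine[i] = NO_PAGE;
}

void vdisk_read(struct virtual_disk *vd, int vm,
                uint64_t disk_page, uint64_t guest_pfn)
{
    uint64_t mpfn = vd->disk_to_machine[disk_page];
    if (mpfn == NO_PAGE) {
        /* Miss: go to the physical disk once, then remember the page. */
        mpfn = machine_page_alloc();
        physical_disk_read(disk_page, mpfn);
        vd->disk_to_machine[disk_page] = mpfn;
    }
    /* Hit or freshly filled: no copy, just map the machine page into the
     * requesting VM's "physical" space, read-only for copy-on-write. */
    map_into_vm(vm, guest_pfn, mpfn, /*writable=*/0);
}
```

The design point this is meant to illustrate is that a copy-on-write virtual disk lets many virtual machines share a single machine-memory copy of common read-only data, such as kernel text and widely used binaries.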
Virtual Disks
[Figure: the physical memory of VM0 and VM1 – code, data, and buffer-cache pages – backed by machine memory divided into private pages, shared pages, and free pages]

Virtual Network Interface
• Messages transferred between virtual machines are mapped read-only into both the sending and the receiving virtual machine’s physical address spaces
• Updated device drivers maintain data alignment
• Cross-layer optimizations
[Figure: three-step NFS example between a server VM and a client VM, each with its buffer cache, mbufs, physical pages, and machine pages – a read request arrives from the client; the data page is remapped from the source’s machine address space to the destination’s; finally the data page from the driver’s mbuf is remapped into the client’s buffer cache]

Running Commodity OS
• Modified the Hardware Abstraction Level (HAL) of IRIX to reduce the overhead of virtualization and improve resource use
• Relocate the kernel to use the mapped supervisor segment in place of the unmapped segment
• Access to privileged registers – convert frequently used privileged instructions to use non-trapping load and store instructions to a special page of the address space that contains these registers
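The last bullet above (non-trapping access to privileged registers) can be sketched as follows. The page layout, the field names, and the assumption that only the frequent read path is converted while side-effecting writes still go through a monitor call are illustrative guesses, not the actual IRIX HAL changes.

```c
/* Sketch of the HAL trick: instead of trapping into the monitor on every
 * read of a privileged register, the guest kernel loads the value from a
 * special page that DISCO keeps up to date for each virtual CPU. */
#include <stdint.h>

/* Assumed layout of the per-VCPU special page, mapped into the guest's
 * address space by the monitor and refreshed when shadow state changes. */
struct vcpu_special_page {
    uint64_t status_reg;   /* e.g. interrupt-enable bits      */
    uint64_t cause_reg;    /* last trap cause                 */
    uint64_t tlb_context;  /* address-space identifier        */
};

/* In unmodified IRIX this would be a privileged instruction (mfc0 on MIPS)
 * that traps into the monitor; after the HAL change it is an ordinary load. */
static inline uint64_t read_status(const volatile struct vcpu_special_page *sp)
{
    return sp->status_reg;  /* non-trapping: just a memory load */
}

/* Stores whose side effects the monitor must act on immediately would
 * presumably still need a monitor call; only frequent, read-mostly accesses
 * are shown converted here. */
```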