Operating Systems Principles
Advanced Architectures
Mark Kampe (markk@cs.ucla.edu)

Advanced Architectures
  15A. Distributed Computing
  15B. Multi-Processor (and NUMA) Systems
  15C. Tightly Coupled (SSI) Clusters
  15D. Loosely Coupled (Horizontally Scalable)
  15E. Cloud Models
  15F. Virtual Machines

Goals of Distributed Computing
• better services
  – scalability
    • apps too big to run on a single computer
    • grow system capacity to meet growing demand
  – improved reliability and availability
  – improved ease of use, reduced CapEx/OpEx
• new services
  – applications that span multiple system boundaries
  – global resource domains, services (vs. systems)
  – complete location transparency

Major Classes of Distributed Systems
• Symmetric Multi-Processors (SMP)
  – multiple CPUs, sharing memory and I/O devices
• Single-System Image (SSI) & Cluster Computing
  – a group of computers, acting like a single computer
• loosely coupled, horizontally scalable systems
  – coordinated, but relatively independent systems
• application level distributed computing
  – peer-to-peer, application level protocols
  – distributed middle-ware platforms

Evaluating Distributed Systems
• Performance
  – overhead, scalability, availability
• Functionality
  – adequacy and abstraction for target applications
• Transparency
  – compatibility with previous platforms
  – perfect application transparency
  – scope and degree of location independence
• Degree of Coupling
  – on how many things do distinct systems agree
  – how is that agreement achieved

SMP systems and goals
• Characterization:
  – multiple CPUs sharing memory and devices
• Motivations:
  – price performance (lower price per MIP)
  – scalability (economical way to build huge systems)
• Example:
  – single socket, multi-core Intel CPUs
SMP Price/Performance
• a computer is much more than a CPU
  – mother-board, disks, controllers, power supplies, case
  – CPU might cost 10-15% of the cost of the computer
• adding CPUs to a computer is very cost-effective
  – a second CPU yields cost of 1.1x, performance 1.9x
  – a third CPU yields cost of 1.2x, performance 2.7x
• same argument also applies at the chip level
  – making a machine twice as fast is ever more difficult
  – adding more cores to the chip gets ever easier
• massive multi-processors are the obvious direction

Symmetric Multi-Processors
[Figure: four CPUs (CPU 1-4), each with its own cache, attached along with an interrupt controller, memory, and device controllers to shared memory & device busses]

SMP Operating System Design
• one processor boots with power on
  – it controls the starting of all other processors
• same OS code runs in all processors
  – one physical copy in memory, shared by all CPUs
• each CPU has its own registers, cache, MMU
  – they must cooperatively share memory and devices
• ALL kernel operations must be Multi-Thread-Safe
  – protected by appropriate locks/semaphores
  – very fine grained locking to avoid contention

SMP Parallelism
• scheduling and load sharing
  – each CPU can be running a different process
  – just take the next ready process off the run-queue
  – processes run in parallel
  – most processes don't interact (other than in kernel)
• serialization
  – mutual exclusion achieved by locks in shared memory
  – locks can be maintained with atomic instructions
  – spin locks acceptable for VERY short critical sections
  – if a process blocks, that CPU finds the next ready process
  (a C sketch of a spin-lock-protected run-queue appears below, after the Managing Memory Contention slide)

The Challenge of SMP Performance
• scalability depends on memory contention
  – memory bandwidth is limited, can't handle all CPUs
  – most references satisfied from per-core cache
  – if too many requests go to memory, CPUs slow down
• scalability depends on lock contention
  – waiting for spin-locks wastes time
  – context switches waiting for kernel locks waste time
• contention wastes cycles, reduces throughput
  – 2 CPUs might deliver only 1.9x performance
  – 3 CPUs might deliver only 2.7x performance

Managing Memory Contention
• fast n-way memory is very expensive
  – without it, memory contention taxes performance
  – cost/complexity limits how many CPUs we can add
• Non-Uniform Memory Architectures (NUMA)
  – each CPU has its own memory
    • each CPU has a fast path to its own memory
  – connected by a Scalable Coherent Interconnect
    • a very fast, very local network between memories
    • accessing memory over the SCI may be 3-20x slower
  – these interconnects can be highly scalable
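The SMP Parallelism slide above describes scheduling and serialization in words; the sketch below shows, in C11, one plausible shape for a run-queue that lives in shared memory, is shared by all CPUs, and is protected by a spin lock built from an atomic test-and-set instruction. It is illustrative only and not part of the course material: the names (spin_t, struct proc, runq_head, pick_next, make_ready) are invented for this example.

/*
 * Hypothetical sketch: a shared run-queue guarded by a spin lock
 * built from an atomic test-and-set (C11 <stdatomic.h>).
 */
#include <stdatomic.h>
#include <stddef.h>

typedef struct { atomic_flag locked; } spin_t;

static void spin_lock(spin_t *l) {
    /* atomic test-and-set; busy-wait until we win the lock.          */
    /* acceptable only because the critical sections below are short. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                                    /* spin */
}

static void spin_unlock(spin_t *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* a minimal ready queue, shared (in shared memory) by all CPUs */
struct proc { struct proc *next; /* saved registers, state, etc. */ };

static struct proc *runq_head;                    /* next ready process   */
static spin_t runq_lock = { ATOMIC_FLAG_INIT };   /* guards the run-queue */

/* an idle CPU just takes the next ready process off the run-queue */
struct proc *pick_next(void) {
    spin_lock(&runq_lock);
    struct proc *p = runq_head;
    if (p != NULL)
        runq_head = p->next;
    spin_unlock(&runq_lock);
    return p;                  /* NULL means no ready process right now */
}

/* when a process becomes ready (e.g. wakes after blocking), requeue it */
void make_ready(struct proc *p) {
    spin_lock(&runq_lock);
    p->next = runq_head;
    runq_head = p;
    spin_unlock(&runq_lock);
}

A real SMP kernel would use priorities or per-CPU queues and would hold the lock for only a few instructions, precisely because spinning burns a CPU; the point here is only that mutual exclusion comes from an atomic instruction on a word of shared memory.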
OS design for NUMA systems
• it is all about local memory hit rates
  – every outside reference costs us 3-20x performance
  – we need 75-95% hit rate just to break even
• how can the OS ensure high hit-rates?
  – replicate shared code pages in each CPU's memory
  – assign processes to CPUs, allocate all memory there
  – migrate processes to achieve load balancing
  – spread kernel resources among all the CPUs
  – attempt to preferentially allocate local resources
  – migrate resource ownership to the CPU that is using it
  (a user-level memory/CPU placement sketch appears at the end of this section)

Non-Uniform Memory Architecture
[Figure: two SMP nodes, each with its own CPUs (CPU n, CPU n+1), local cache and memory, a PCI bridge and bus with device controllers, and a CC-NUMA interface; the nodes are joined by a Scalable Coherent Interconnect]

The Dream
Programs don't run on hardware, they run atop operating systems. All the resources that processes see are already virtualized. Instead of merely virtualizing all the resources in a single system, virtualize all the resources in a cluster of systems. Applications that run in such a cluster are (automatically and transparently) distributed.
[Figure: several physical systems combine into a virtual HA computer with 4x the MIPS and memory — one global pool of devices (CD1, CD3, LP2, LP3, SCN4), one pool of processes (101, 103, 106, 202, 204, 205, 301, 305, 306, 403, 405, 407), shared locks (1A, 3B), and one large HA virtual file system with primary copies (disks 1A-4A) and secondary replicas (disks 1B-4B)]

Single System Image (SSI) Clusters
• Characterization:
  – a group of seemingly independent computers collaborating to provide SMP-like transparency
• Motivation:
  – higher reliability, availability than SMP/NUMA
  – more scalable than SMP/NUMA
  – excellent application transparency
• Examples:
  – Locus, Sun Clusters, MicroSoft Wolf-Pack, OpenSSI
  – enterprise database servers

Modern Clustered Architecture
[Figure: clients reach four SMP systems through ethernet and a request switch; the systems share dual-ported RAID at the primary site, with FC replication through a replication switch providing geographic fail-over to a back-up site]
Active systems service independent requests in parallel. They cooperate to maintain shared global locks, and are prepared to take over a partner's work in case of failure. State replication to a back-up site is handled by external mechanisms.

Structure of a Modern OS
[Figure: layered structure of a modern OS, from the system call interfaces and the user-visible OS model (file namespace, authorization, file, file I/O, IPC, process/thread, exception, and synchronization models), through higher level services (run-time loader, configuration, fault management, quality of service), file systems and transport protocols, stream/volume/hot-plug/block-I/O services, the memory abstraction (logging & tracing, swapping, paging, scheduling), device class drivers (network, serial, display, storage I/O) and device drivers, the kernel core (DMA services, thread dispatching, memory allocation and segments, synchronization, boot strap, context switching, processor abstraction, kernel debugger), down to the hardware itself (processor modes, memory mapping, cache management, atomic operations and updates, DMA, interrupts, traps, timers)]
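The "OS design for NUMA systems" slide earlier in this section says that a process and its memory should end up on the same node so that most references stay local. On Linux, the libnuma library exposes that placement to user code as well; the fragment below is a minimal sketch under that assumption (Linux with libnuma, compiled with -lnuma) and is not part of the course material. It pins the calling process to one node and allocates its working memory from that node's local RAM.

/*
 * Hypothetical sketch (assumes Linux + libnuma): keep a process and its
 * memory on the same NUMA node so references stay off the interconnect.
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int node = 0;                       /* for illustration: use node 0 */

    /* run this process only on CPUs belonging to that node ...        */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return 1;
    }

    /* ... and allocate its working memory from that node's local RAM  */
    size_t len = 64 * 1024 * 1024;
    void *buf = numa_alloc_onnode(len, node);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    memset(buf, 0, len);                /* touch it: all references are local */
    printf("process and %zu MB of memory both placed on node %d\n",
           len >> 20, node);

    numa_free(buf, len);
    return 0;
}

Inside the kernel the same policy shows up as "allocate page frames from the faulting CPU's node first, and schedule the process back onto that node"; the user-level calls above simply let an application state that preference explicitly.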