18: Distributed Systems



  1. 18: Distributed Systems
     Last Modified: 7/3/2004 1:49:01 PM

     A Distributed System

     Loosely Coupled Distributed Systems
     • Users are aware of multiplicity of machines. Access to resources of various machines is done explicitly by:
       • Remote logging into the appropriate remote machine.
       • Transferring data from remote machines to local machines, via the File Transfer Protocol (FTP) mechanism.

     Tightly Coupled Distributed-Systems
     • Users not aware of multiplicity of machines. Access to remote resources similar to access to local resources.
     • Examples
       • Data Migration: transfer data by transferring entire file, or transferring only those portions of the file necessary for the immediate task.
       • Computation Migration: transfer the computation, rather than the data, across the system.

     Distributed-Operating Systems (Cont.)
     • Process Migration: execute an entire process, or parts of it, at different sites.
       • Load balancing: distribute processes across network to even the workload.
       • Computation speedup: subprocesses can run concurrently on different sites.
       • Hardware preference: process execution may require specialized processor.
       • Software preference: required software may be available at only a particular site.
       • Data access: run process remotely, rather than transfer all data locally.

     Why Distributed Systems?
     • Communication
       • Dealt with this when we talked about networks
     • Resource sharing
     • Computational speedup
     • Reliability
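The load-balancing motive for process migration listed above can be illustrated with a small placement routine. This is a minimal sketch under stated assumptions, not the policy of any particular distributed OS: the greedy "largest process first, onto the currently least-loaded site" heuristic, the function name `assign_least_loaded`, and the per-process cost estimates are all hypothetical and do not appear in the slides.

```python
import heapq


def assign_least_loaded(process_costs, sites):
    """Greedily place each process on the site with the smallest load so far.

    process_costs: dict mapping a process id to an estimated cost (assumed known).
    sites: list of site names.
    Returns a dict mapping each site to the list of process ids placed there.
    """
    # Min-heap of (current load, site); every site starts empty.
    heap = [(0.0, site) for site in sites]
    heapq.heapify(heap)
    placement = {site: [] for site in sites}

    # Place the most expensive processes first so they spread across sites.
    for pid, cost in sorted(process_costs.items(), key=lambda kv: -kv[1]):
        load, site = heapq.heappop(heap)
        placement[site].append(pid)
        heapq.heappush(heap, (load + cost, site))
    return placement


if __name__ == "__main__":
    costs = {"p1": 5, "p2": 3, "p3": 2, "p4": 2}
    print(assign_least_loaded(costs, ["A", "B"]))
```

A real system would also have to weigh the cost of moving a process against the benefit of evening the workload, which this sketch ignores.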

  2. Resource Sharing
     • Distributed Systems offer access to specialized resources of many systems
     • Example:
       • Some nodes may have special databases
       • Some nodes may have access to special hardware devices (e.g. tape drives, printers, etc.)
     • DS offers benefits of locating processing near data or sharing special devices

     OS Support for resource sharing
     • Resource Management?
       • Distributed OS can manage diverse resources of nodes in system
       • Make resources visible on all nodes
         • Like VM, can provide functional illusion but rarely hides the performance cost
     • Scheduling?
       • Distributed OS could schedule processes to run near the needed resources
       • If need to access data in a large database, may be easier to ship code there and results back than to request data be shipped to code

     Design Issues
     • Transparency: the distributed system should appear as a conventional, centralized system to the user.
     • Fault tolerance: the distributed system should continue to function in the face of failure.
     • Scalability: as demands increase, the system should easily accept the addition of new resources to accommodate the increased demand.
     • Clusters vs Client/Server
       • Clusters: a collection of semi-autonomous machines that acts as a single system.

     Why Distributed Systems?
     • Resource sharing
     • Computational speedup
     • Reliability

     Computation Speedup
     • Some tasks too large for even the fastest single computer
       • Real time weather/climate modeling, human genome project, fluid turbulence modeling, ocean circulation modeling, etc.
       • http://www.nersc.gov/research/GC/gcnersc.html
     • What to do?
       • Leave the problem unsolved?
       • Engineer a bigger/faster computer?
       • Harness resources of many smaller (commodity?) machines in a distributed system?

     Breaking up the problems
     • To harness computational speedup must first break up the big problem into many smaller problems
     • More art than science?
     • Sometimes break up by function
       • Pipeline?
       • Job queue?
     • Sometimes break up by data
       • Each node responsible for portion of data set?
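The "job queue" decomposition mentioned on the Breaking up the problems slide can be sketched on a single machine, with threads standing in for nodes. This is only an illustration under that assumption: the function name `run_job_queue` and its parameters are hypothetical, and a real distributed job queue would also need network transport and failure handling.

```python
import queue
import threading


def run_job_queue(jobs, num_workers, work):
    """Apply work(job) to every job using workers that pull from one shared queue.

    jobs must be hashable because results are returned as {job: result}.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)  # all jobs are queued before any worker starts

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker is done
            outcome = work(job)
            with lock:  # protect the shared results dict
                results[job] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Each job is a chunk number; the "work" here is a stand-in computation.
    print(run_job_queue(range(6), num_workers=3, work=lambda n: n * n))
```

A fast worker simply comes back for more jobs, so uneven job sizes balance out without any up-front partitioning of the data.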

  3. Decomposition Examples
     • Decrypting a message
       • Easily parallelizable, give each node a set of keys to try
       • Job queue: when tried all your keys, go back for more?
     • Modeling ocean circulation
       • Give each node a portion of the ocean to model (N square ft region?)
       • Model flows within region locally
       • Communicate with nodes managing neighboring regions to model flows into other regions

     Decomposition Examples (con't)
     • Barnes-Hut: calculating effect of bodies in space on each other
       • Could divide space into NxN regions?
       • Some regions have many more bodies
     • Instead divide up so have roughly same number of bodies
       • Within a region, bodies have lots of effect on each other (close together)
       • Abstract other regions as a single body to minimize communication

     Linear Speedup
     • Linear speedup is often the goal.
       • Allocate N nodes to the job, goes N times as fast
     • Once you've broken up the problem into N pieces, can you expect it to go N times as fast?
       • Are the pieces equal?
       • Is there a piece of the work that cannot be broken up (inherently sequential?)
       • Synchronization and communication overhead between pieces?

     Super-linear Speedup
     • Sometimes can actually do better than linear speedup!
       • Especially if divide up a big data set so that the piece needed at each node fits into main memory on that machine
       • Savings from avoiding disk I/O can outweigh the communication/synchronization costs
     • When split up a problem, tension between duplicating processing at all nodes for reliability and simplicity and allowing nodes to specialize

     OS Support for Parallel Jobs
     • Process Management?
       • OS could manage all pieces of a parallel job as one unit
       • Allow all pieces to be created, managed, destroyed at a single command line
       • Fork (process, machine)?
     • Scheduling?
       • Programmer could specify where pieces should run and/or OS could decide
         • Process Migration? Load Balancing?
       • Try to schedule pieces together so can communicate effectively

     OS Support for Parallel Jobs (con't)
     • Group Communication?
       • OS could provide facilities for pieces of a single job to communicate easily
       • Location independent addressing?
       • Shared memory?
       • Distributed file system?
     • Synchronization?
       • Support for mutually exclusive access to data across multiple machines
       • Can't rely on HW atomic operations any more
       • Deadlock management?
       • We'll talk about clock synchronization and two-phase commit later
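The Linear Speedup slide asks whether N nodes can really make a job go N times as fast when part of the work is inherently sequential. One standard way to put a number on that is Amdahl's law, which the slides do not name; it is used here as an assumed model. The function name `amdahl_speedup` and the parameter `parallel_fraction` (the share of the work that can be split across nodes) are illustrative.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Upper bound on speedup under Amdahl's law.

    speedup = 1 / ((1 - p) + p / N), where p is the fraction of the work
    that can be divided among N nodes and (1 - p) must run sequentially.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With 90% of the work parallelizable, the bound never reaches 10 no matter how many nodes are added. The model ignores both the communication and synchronization overhead the slide mentions (which lowers real speedup) and the memory effects behind super-linear speedup (which can raise it).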

  4. Why Distributed Systems?
     • Resource sharing
     • Computational speedup
     • Reliability

     Reliability
     • Distributed system offers potential for increased reliability
       • If one part of system fails, rest could take over
       • Redundancy, fail-over
     • !BUT! Often reality is that distributed systems offer less reliability
       • "A distributed system is one in which some machine I've never heard of fails and I can't do work!"
       • Hard to get rid of all hidden dependencies
       • No clean failure model
         • Nodes don't just fail, they can continue in a broken state
         • Partition network = many many nodes fail at once! (Determine who you can still talk to; Are you cut off or are they?)
         • Network goes down and up and down again!

     Robustness
     • Detect and recover from site failure, function transfer, reintegrate failed site
       • Failure detection
       • Reconfiguration

     Failure Detection
     • Detecting hardware failure is difficult.
     • To detect a link failure, a handshaking protocol can be used.
     • Assume Site A and Site B have established a link. At fixed intervals, each site will exchange an I-am-up message indicating that they are up and running.
     • If Site A does not receive a message within the fixed interval, it assumes either (a) the other site is not up or (b) the message was lost.
     • Site A can now send an Are-you-up? message to Site B.
     • If Site A does not receive a reply, it can repeat the message or try an alternate route to Site B.

     Failure Detection (cont)
     • If Site A does not ultimately receive a reply from Site B, it concludes some type of failure has occurred.
     • Types of failures:
       - Site B is down
       - The direct link between A and B is down
       - The alternate link from A to B is down
       - The message has been lost
     • However, Site A cannot determine exactly why the failure has occurred.
     • B may be assuming A is down at the same time
     • Can either assume it can make decisions alone?

     Reconfiguration
     • When Site A determines a failure has occurred, it must reconfigure the system:
       1. If the link from A to B has failed, this must be broadcast to every site in the system.
       2. If a site has failed, every other site must also be notified indicating that the services offered by the failed site are no longer available.
     • When the link or the site becomes available again, this information must again be broadcast to all other sites.
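The handshaking protocol on the Failure Detection slides can be sketched as a small monitor run by Site A. This is an illustrative sketch, not a definitive implementation: the class name `LinkMonitor`, the `probes` callables (each one stands in for sending an Are-you-up? message over one route and reporting whether a reply arrived), and the `retries` count are assumptions that do not appear in the slides. Time is passed in explicitly so the example needs no real network or clock.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LinkMonitor:
    interval: float                   # expected gap between I-am-up messages
    probes: List[Callable[[], bool]]  # hypothetical: one Are-you-up? sender per route
    retries: int = 2                  # assumed number of attempts per route
    last_heard: float = 0.0           # time the last sign of life arrived

    def on_i_am_up(self, now: float) -> None:
        """Record an I-am-up message received from the other site."""
        self.last_heard = now

    def check(self, now: float) -> str:
        """Return "up" or "suspected-failure" for the other site."""
        if now - self.last_heard <= self.interval:
            return "up"
        # Heartbeat missed: the site may be down or the message was lost.
        # Send Are-you-up?, repeating and then trying alternate routes.
        for probe in self.probes:
            for _ in range(self.retries):
                if probe():
                    self.last_heard = now
                    return "up"
        # No reply on any route. Site A cannot tell whether the site, a link,
        # or the message itself failed, so it only suspects a failure.
        return "suspected-failure"


if __name__ == "__main__":
    monitor = LinkMonitor(interval=5.0, probes=[lambda: False, lambda: True])
    monitor.on_i_am_up(now=0.0)
    print(monitor.check(now=3.0))   # heartbeat still fresh
    print(monitor.check(now=10.0))  # direct route silent, alternate route replies
```

The result is deliberately named "suspected-failure" rather than "down", reflecting the slides' point that Site A cannot determine why no reply arrived and that Site B may be drawing the same conclusion about Site A.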
