Partition Cast - Modelling and Optimizing the Distribution of Large Data Sets in PC Clusters
Felix Rauch, Christian Kurmann, Thomas M. Stricker
Laboratory for Computer Systems, ETH Zürich (Eidgenössische Technische Hochschule Zürich)
CoPs project: http://www.cs.inf.ethz.ch/CoPs/
31 August 2000 1
Clusters of PCs • Scientific computing (computational grids) • Enterprise computing (distrib. databases/datamining) • Corporate computing (multimedia/collaborative work) • Education and training (classrooms) 2
Common Problem
Maintenance of software installations is hard:
• Different operating systems or applications in the cluster
• Temporary installations: tests, experiments, courses
• Software rejuvenation to combat the software rotting process
Manual install: days, network installs: hours, cloning: minutes 3
Partition Cast (cloning)
Fast replication of entire system installations (OS image, applications, data) on clusters is helpful.
• How to do ultra-fast data distribution in clusters?
Essential tradeoffs:
• What network is needed? (Gigabit / switches / hubs)
• Which protocol family? (multicast, broadcast, unicast)
• Compressed or raw data?
• Best logical topology for the distribution path? 4
Overview • Network topologies and embedding • Related work • Analytical model for partition cast • Implemented tools for partition cast • Evaluation of alternative topologies • Model vs. measurement • Conclusion 5
Network Topologies
Given:
• Physical network topology
• Resource constraints (maximal throughput over links or through nodes)
Wanted:
• Best logical network topology for data distribution
• Best embedding of the logical network into the physical network
• Limit on throughput for distribution of big data sets (partition cast) 6
Physical Network
[Diagram: COPS (16 nodes) and Patagonia (8 nodes) clusters attached to Cabletron SSR 8000 / SSR 8600 switches via Fast and Gigabit Ethernet; Math./Phys. Linneus Beowulf clusters (16 and 192 nodes) behind a Cabletron SSR 8600 matrix]
• Graph given by cables, nodes and switches 7
Logical Network
[Diagram: a distribution tree spanning the server and switches]
• Spanning tree, embedded into the physical network 8
Previous and Related Work
• Protocols and tools for the distribution of data to large numbers of clients [Kotsopoulos and Cooperstock, USENIX 1996]
• Model is based on ideas for throughput-oriented memory system performance for MPP computers [Stricker and Gross, ISCA 95]
• High-speed multicast leads to great variation in perceived bandwidth, is complex to implement and quite resource intensive. High speeds seem impossible. [Rauch, master's thesis, ETH Zürich 1997] 9
Simple Model of Partition Cast
Definitions:
• Node types
• Capacity constraints
• Algorithm for evaluation of the model
Example:
• Heterogeneous network: Gigabit / Fast Ethernet 10
Node Types
• Active node: participates in partition cast, can duplicate and store the stream
• Passive node: can neither duplicate nor store data, passes one or more streams between active nodes 11
Capacity Constraints
• Reliable transfer promise
• Fair sharing of links
• Edge capacity → link at 125 MB/s carrying 2 logical channels: < 62.5 MB/s each
• Node capacity → switch at 30 MB/s carrying 3 streams: < 10 MB/s each 12
Model Algorithm (Constraint Satisfaction)
Algorithm "evaluate basic model":
1 Choose a logical network
2 Embed it into the given physical network
3 For all edges: post bandwidth limitations due to edge congestion
4 For all nodes: post bandwidth limitations due to node congestion
5 Over all posted limitations: find the minimum bandwidth 13
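The five steps above can be sketched directly. This is a minimal sketch: the graph encoding, the function name and the tiny star example at the bottom are illustrative assumptions, not from the talk.

```python
from collections import Counter

def evaluate_model(edge_capacity, node_capacity, logical_links, embed):
    """Steps 3-5 of "evaluate basic model": post a limit capacity/streams
    for every congested physical edge and node, return the minimum.
    Steps 1-2 are the caller's choice of `logical_links` and `embed`."""
    edge_streams, node_streams = Counter(), Counter()
    for link in logical_links:
        edges, nodes = embed[link]        # physical path of this logical link
        edge_streams.update(edges)        # step 3: count streams per edge
        node_streams.update(nodes)        # step 4: count streams per node
    limits = [edge_capacity[e] / n for e, n in edge_streams.items()]
    limits += [node_capacity[v] / n for v, n in node_streams.items()]
    return min(limits)                    # step 5: minimum over all limits

# Illustrative star: server S feeds clients A and B through one switch.
edge_cap = {"S-sw": 125.0, "sw-A": 12.5, "sw-B": 12.5}  # Gigabit uplink, Fast Ethernet drops
node_cap = {"sw": 30.0}                                  # switch forwarding capacity
links = [("S", "A"), ("S", "B")]
embed = {("S", "A"): (["S-sw", "sw-A"], ["sw"]),
         ("S", "B"): (["S-sw", "sw-B"], ["sw"])}
bandwidth = evaluate_model(edge_cap, node_cap, links, embed)
```

With two streams sharing the Gigabit uplink (125/2) and the switch (30/2), the Fast Ethernet drop at 12.5 MB/s is the binding limit in this toy topology.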
Example Network
[Diagram sequence (slides 14-18): a logical spanning tree rooted at server S is embedded into the physical network; limits are posted step by step: per-link limits (< 12.5 MB/s Fast Ethernet, < 125 MB/s Gigabit Ethernet, shared Gigabit links < 125/2 and < 125/3), per-switch node limits (< 30/2, < 30/3, < 30/4), switch backplane limits (< 4000/5, < 4000/6) and the < 24 MB/s disk limit at active nodes] 14-18
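Step 5 of the algorithm collapses all limits posted in the example above into a single bound; the arithmetic can be checked in one line, using exactly the numbers posted on the slides.

```python
# All bandwidth limits posted in the example network:
posted = [
    12.5,                    # Fast Ethernet link
    125 / 2, 125 / 3,        # Gigabit links shared by 2 and 3 streams
    30 / 2, 30 / 3, 30 / 4,  # switch node capacity shared by 2-4 streams
    4000 / 5, 4000 / 6,      # switch backplane shared by 5 and 6 streams
    24,                      # SCSI disk at active nodes
]
bound = min(posted)          # the switch shared by 4 streams: 30/4 = 7.5 MB/s
```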
Detailed Model of Active Nodes • In the simple model active nodes were black boxes • Detailed model would allow accurate predictions of achievable data stream bandwidths • Requires detailed knowledge of: • Flows of node-internal data streams • Limits of involved subsystems • Complexity of handling and coordinating data streams and subsystems 19
Detailed Example: Data Streams
[Diagram: logical topology next to the data streams within an active node: network → DMA → system buffers → copy → user buffer → copy → gunzip (uncompress) → copy → system buffer → DMA → SCSI disk] 20
Limitations in Active Nodes
• Link capacity: Gigabit Ethernet 125 MB/s, Fast Ethernet 12.5 MB/s
• Disk system: Seagate Cheetah SCSI hard disk 24 MB/s
• I/O bus capacity: current 32-bit PCI bus 132 MB/s
• CPU utilisation: processing power required for each stream, depending on speed and complexity of handling 21
Detailed Example of an Active Node
• Modelling switching capacity: binary spanning-tree topology with Fast Ethernet and compression (b: bandwidth, c: compression factor, assumed constant, e.g. c = 2):
  b/c < 12.5 MB/s (link receive)
  2b/c < 12.5 MB/s (link send)
  b < 24 MB/s (SCSI disk)
  3b/c + b < 132 MB/s (I/O, PCI)
  8b + 3b < 180 MB/s (memory)
  CPU utilisation terms for receiving, sending, decompression and disk I/O, each linear in b and dependent on c, must sum to < 1 (100% CPU)
Solving the equations for b: the node can handle 5.25 MB/s 22
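Because every constraint above is linear in b, the achievable bandwidth is simply the smallest capacity/coefficient ratio. The sketch below encodes the unambiguous constraints from the slide; the CPU cost per MB/s is a placeholder assumption, not the talk's exact CPU model (the talk's coefficients yield 5.25 MB/s for c = 2).

```python
def max_bandwidth(c, cpu_cost_per_mbps=0.19):
    """Largest stream bandwidth b (MB/s) an active node sustains in a
    binary tree with compression factor c.  cpu_cost_per_mbps is an
    assumed CPU-utilisation coefficient, not the talk's detailed model."""
    constraints = [                # (coefficient of b, capacity)
        (1 / c, 12.5),             # link receive: b/c < 12.5 MB/s
        (2 / c, 12.5),             # link send, two children: 2b/c < 12.5 MB/s
        (1.0, 24.0),               # SCSI disk: b < 24 MB/s
        (3 / c + 1, 132.0),        # PCI bus: 3b/c + b < 132 MB/s
        (11.0, 180.0),             # memory: 8b + 3b < 180 MB/s
        (cpu_cost_per_mbps, 1.0),  # CPU: utilisation < 1 (100%)
    ]
    return min(cap / coeff for coeff, cap in constraints)
```

With c = 2 and the assumed CPU coefficient, the CPU bound (about 5.3 MB/s) dominates the 12.5 MB/s send-link bound, consistent with the slide's conclusion that the node, not the network, limits a compressing stream.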
Implementation (tools for partition cast)
• dd/NFS: built-in OS function and network file system, based on UDP/IP (simple, but permits star topology only)
• Dolly: small application for streaming with cloning, based on TCP/IP (reliable data streaming)
Dolly supports reliable data casting on all spanning trees:
• star (n-ary)
• 2-ary, 3-ary trees
• chain (unary) 23
Active Nodes with Dolly
• Simple receiver for star topologies
• Advanced cloning node (multi-drop receiver) for multi-drop chains
• Active node cloning streams for general spanning trees 24
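The multi-drop receiver above can be sketched as a copy loop that stores each block locally and duplicates it to the next node in the chain. In this minimal sketch, file-like objects stand in for Dolly's TCP connections, and the block size is an assumed value; Dolly's actual protocol and error handling are omitted.

```python
import io

BLOCK_SIZE = 64 * 1024  # assumed transfer granularity

def clone_stream(upstream, disk, downstream=None):
    """Dolly-style multi-drop node: read the partition image from the
    upstream node, store every block locally and forward it to the
    downstream node (None for the last node in the chain)."""
    total = 0
    while True:
        block = upstream.read(BLOCK_SIZE)
        if not block:                  # end of partition image
            break
        disk.write(block)              # store locally
        if downstream is not None:
            downstream.write(block)    # forward along the chain
        total += len(block)
    return total

# Two-node chain fed from an in-memory "image":
image = bytes(range(256)) * 1024
node1_disk, link = io.BytesIO(), io.BytesIO()
node2_disk = io.BytesIO()
clone_stream(io.BytesIO(image), node1_disk, link)      # first node in chain
clone_stream(io.BytesIO(link.getvalue()), node2_disk)  # last node
```

Because each node forwards while it writes, the chain streams the image at roughly the speed of its slowest node rather than dividing the server's bandwidth among all clients.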
Experimental Evaluation • Topologies: • Star • 3-ary spanning tree • Multi-drop chain • Fast Ethernet / Gigabit Ethernet • Compressed / Uncompressed Images All experiments: Distribute 2 GByte to 1..15 clients 25
Star topology (Standard NFS)
[Chart: execution time (s) and per-node bandwidth (MByte/s) vs. number of nodes (1-20) for four configurations: Fast Ethernet compressed, Fast Ethernet raw, Gigabit Ethernet compressed, Gigabit Ethernet raw] 26
3-Tree (Dolly)
[Chart: execution time (s) and per-node bandwidth (MByte/s) vs. number of nodes (1-19) for Fast Ethernet raw, Gigabit Ethernet raw, Fast Ethernet compressed, Gigabit Ethernet compressed] 27
Multi-Drop Chain (Dolly)
[Chart: execution time (s) and per-node bandwidth (MByte/s) vs. number of nodes (1-20) for Fast Ethernet compressed, Fast Ethernet raw, Gigabit Ethernet compressed, Gigabit Ethernet raw] 28
Scalability
[Chart: aggregate bandwidth (MByte/s) vs. number of nodes (1-20) for multi-drop/raw, spanning tree/raw and star/compressed configurations on Fast and Gigabit Ethernet, compared against the theoretical limit (disk speed)] 29
Predictions and Measurements
[Bar chart: modelled vs. measured per-node bandwidth (MByte/s) for the multi-drop chain topology (3 clients raw, 5 clients compressed) and the star topology (raw and compressed), on Fast and Gigabit Ethernet; per-node bandwidths range from 3.6 to 11.1 MByte/s, with model and measurement in close agreement] 30
Conclusions
• A simple model captures network topology and node congestion
• An extended model also captures the utilisation of basic resources in nodes and switches
• Optimal configurations can be derived from our model
• For most physical networks a linear multi-drop chain is better than any other spanning-tree configuration for distributing large data sets
• Dolly, our simple tool, transfers an entire 2 GB Windows NT partition to 24 workstations in less than 5 minutes, with a sustained transfer rate of 9 MB/s per node 31
Questions/Discussion?
Our project: CoPs - Cluster of PCs, Laboratory for Computer Systems, ETH Zürich, Switzerland
Dolly is available for download under the GNU General Public License (source code included).
http://www.cs.inf.ethz.ch/CoPs/ 32