Ninth IEEE International Symposium on High Performance Distributed Computing, Pittsburgh, Pennsylvania, August 1-4, 2000
Speculative Defragmentation: A Technique to Improve the Communication Software Efficiency for Gigabit Ethernet
Ch. Kurmann, F. Rauch, M. Müller, T. Stricker
Laboratory for Computer Systems, ETHZ - Swiss Federal Institute of Technology, CH-8092 Zurich
Comm. Speeds of Commodity PCs
Figure (bar chart): transfer rate [MByte/s], axis 0 to 140, bars grouped under Myrinet 32bit-PCI and Gigabit Ethernet 32bit-PCI
  MPI-Linux 2.0-BIP   126
  TCP-Windows NT       20
  TCP-Linux 2.2        42
  MPI-Linux 2.2        35
→ For Gigabit Ethernet and TCP/IP the OS software cannot keep up with the hardware speed
Overview
• Why Gigabit Ethernet
• Packet Defragmentation
• TCP/IP Overheads
• Speculative Packet Defragmentation
• Performance Analysis
• Conclusion
Problem Statement
How can we sustain network bandwidths of 75-100 MByte/s with a commodity PC cluster node:
• memory copy 90 MByte/s
• 32-bit PCI I/O-bus 132 MByte/s
• commodity Gigabit Ethernet adapter 100 MByte/s
• standard TCP/IP protocol
• fully transparent standard socket-API
Papers 10 Years Ago
The same problem, 30-100 times slower
• memory copy < 3 MByte/s
• VME I/O-bus < 3 MByte/s
• commodity 10BaseT Ethernet adapter 1 MByte/s
• special purpose blast transfer protocol [Zwaenepoel85]
• optimistic bulk transfers [Carter89]
• transparent blasts by header padding [Peterson90]
Not standard protocol & not fully transparent
→ Solutions did not find their way into current systems!
Why Gigabit Ethernet
• Compatible with Ethernet and Fast Ethernet (UTP Cat5)
• Uncomplicated technology, which results in high reliability and low cost
• Switched Ethernet provides link level flow control on full duplex channels
• In larger networks only an unacknowledged, connectionless datagram delivery service → TCP needed
• Standard frame size is still limited to 46-1500 Byte of data
Alternatives / Extensions
• Dedicated network hardware with customized lightweight protocols: Myrinet, SCI, Giganet, ServerNet → primarily designed for internal communication in server farms
• Jumbo Frames (9 KByte) for Gigabit Ethernet, so that the MTU (Maximum Transmission Unit) can hold a memory page →
  • change of standard
  • higher latencies in store and forward switches
  • do not solve the header/payload separation
Packet [De]Fragmentation
• IP standard technique
• Data to be sent is fragmented into chunks smaller than the network MTU (Maximum Transmission Unit)
• Network protocols enclose each chunk in a frame with header and trailer
• Receiver separates the headers from the payload and defragments the data again
• Implications for Ethernet:
  • MTU < Memory Page
  • DMA-logic not optimal
  → Therefore memory copy for packet [de]fragmentation
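The fragmentation and reassembly steps listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the driver's code: the function names are hypothetical, headers are left out entirely, and the chunk size of 1480 bytes assumes a 1500 byte Ethernet payload minus a 20 byte IP header without options. Because the TCP header is ignored, the chunk sizes differ from the driver layout shown later in the talk.

```python
from typing import List, Tuple

def fragment(payload: bytes, max_chunk: int) -> List[Tuple[int, bytes]]:
    """Split payload into (offset, chunk) pairs, each chunk at most max_chunk bytes."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    return [(off, payload[off:off + max_chunk])
            for off in range(0, len(payload), max_chunk)]

def defragment(fragments: List[Tuple[int, bytes]]) -> bytes:
    """Reassemble by copying every chunk to its offset.

    This per-chunk copy models the memory copy the slide refers to.
    """
    total = sum(len(chunk) for _, chunk in fragments)
    buf = bytearray(total)
    for off, chunk in fragments:
        buf[off:off + len(chunk)] = chunk
    return bytes(buf)

# A 4 KByte page split for an assumed 1480 byte chunk limit.
page = bytes(i % 256 for i in range(4096))
frags = fragment(page, 1480)
```

Since each chunk carries its offset, `defragment` restores the page regardless of arrival order, at the price of touching every byte once more.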
TCP/IP Host Overheads
PII 400MHz, Linux 2.2
• Single largest overhead: copying and checksums → Zero-copy techniques
• Per-packet processing and interrupt overhead also high → Interrupt coalescing
Figure (stacked bar, percent CPU, 0 to 100): Host Overhead for TCP/IP over Gigabit Ethernet, split into Copy & Checksum, Interrupt, TCP/IP, Driver/DMA, Init
OS Environment
Figure (layer diagram): control path and data path from the application down to the NIC
  Control Path               Data Path
  User space:
    Application              User Mapped Data Pages
    Middleware (CORBA, MPI)  ORB Marshalling, Buffering
  Protection boundary: copies (addressed by previous work)
  Kernel space:
    Socket Layer             System Page Pool
    TCP/IP Stack             Protocol handling, Packet Generation
    NIC Driver               Driver copies (addressed by Speculative Defragmentation)
  DMA over PCI Bus
  NIC:
    NIC Firmware             Send and Receive Buffers
Required Technologies
• Well known solutions to eliminate the User/Kernel copy:
  • User-Level Network Interface (U-Net) or Virtual Interface Architecture (VIA)
  • User/Kernel Shared Memory (FBufs, IOLite), Copy Emulation or Page Remapping with Copy on Write
• The Driver copy remains for Gigabit Ethernet
→ Goal: Elimination of driver copy for the packet defragmentation and header separation
→ True zero-copy
Commodity GE-Adapters
• Until now, zero-copy support is only available for "intelligent" network adapters (ATM, SiliconTCP)
• Today's Gigabit Ethernet adapters are too simple:
  • no processor or TLBs on board
  • limited DMA capabilities
  • no protocol filtering implemented
→ Deterministic zero-copy implementation with commodity GE adapters is not possible!
• Approach: Making just the common case fast → Speculation Techniques for Defragmentation
Speculative Defragmentation I
• Our driver manages to send/receive entire 4 KByte pages
• Decomposition of 4 KByte IP-packets into 3 IP-fragments on driver level (standard IP fragmentation)
• Attachment of headers to the payload data with a separate DMA-descriptor
Figure: a 4096 byte page split into three fragments, each described by DMA descriptors (status, length) for its header and its zcdata payload
  1st Frag.   headers ETH, IP, TCP (14, 20, 20 bytes)   data 1460
  2nd Frag.   headers ETH, IP (14, 20 bytes)            data 1480
  3rd Frag.   headers ETH, IP (14, 20 bytes)            data 1156
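The fragment sizes 1460, 1480 and 1156 in the figure follow from the header lengths on the slide. The sketch below recomputes them, assuming a 1500 byte Ethernet payload limit and IP and TCP headers without options; the function name and the dictionary layout are illustrative, not taken from the driver.

```python
ETH_HDR, IP_HDR, TCP_HDR = 14, 20, 20   # header lengths from the slide
ETH_MAX_PAYLOAD = 1500                  # assumed standard Ethernet payload limit
PAGE_SIZE = 4096

def page_fragments(page_size: int = PAGE_SIZE) -> list:
    """Return one dict per IP fragment needed to carry a page."""
    per_frame = ETH_MAX_PAYLOAD - IP_HDR   # 1480 bytes of IP payload per frame
    fragments = []
    offset, remaining, first = 0, page_size, True
    while remaining > 0:
        # Only the first fragment carries the TCP header, so it has less room.
        room = per_frame - (TCP_HDR if first else 0)
        data = min(room, remaining)
        headers = (ETH_HDR, IP_HDR, TCP_HDR) if first else (ETH_HDR, IP_HDR)
        fragments.append({"headers": headers, "data": data, "page_offset": offset})
        offset += data
        remaining -= data
        first = False
    return fragments

layout = page_fragments()
```

The first two fragments each carry 1480 bytes of IP payload (20 + 1460 and 1480), a multiple of 8 as IP fragment offsets require.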
Speculative Defragmentation II
What are we speculating about?
• Speculation that all fragments of a whole page will be received in order
• Speculation about the precise packet format (header lengths, data fields)
• The receiver has to fix the DMA descriptors without knowledge about the next packets to arrive
• In clusters with one or two switches, the probability is high that the three fragments arrive in order
• Software cleanup on mis-speculation
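A toy model may help to picture the speculation. It assumes three pre-posted descriptors that split each arriving frame at a fixed header length, a one-byte fragment index standing in for the real header fields, and a cleanup path that simply re-copies the payloads in the right order. All names and the header encoding are hypothetical; the real driver works on DMA descriptor rings, not Python byte strings.

```python
# (header bytes, payload bytes) per pre-posted descriptor: 14+20+20, 14+20, 14+20
SLOTS = [(54, 1460), (34, 1480), (34, 1156)]

def make_frames(page: bytes) -> list:
    """Build three toy frames; every header byte holds the fragment index."""
    frames, off = [], 0
    for idx, (hdr_len, data_len) in enumerate(SLOTS):
        frames.append(bytes([idx]) * hdr_len + page[off:off + data_len])
        off += data_len
    return frames

def receive(frames: list) -> tuple:
    """Return (page, bytes_copied_by_cleanup) for frames in arrival order."""
    page = bytearray()
    hit = len(frames) == len(SLOTS)
    for slot, (raw, (hdr_len, data_len)) in enumerate(zip(frames, SLOTS)):
        # The descriptor was fixed in advance, so the split point is a guess.
        header, payload = raw[:hdr_len], raw[hdr_len:hdr_len + data_len]
        page += payload  # models the DMA write straight into the page
        hit = hit and header[0] == slot and len(raw) == hdr_len + data_len
    if hit:
        return bytes(page), 0  # speculation held: no copy needed
    # Mis-speculation: software cleanup copies each payload to its place.
    ordered = sorted(frames, key=lambda raw: raw[0])
    fixed = b"".join(raw[SLOTS[raw[0]][0]:] for raw in ordered)
    return fixed, len(fixed)

page = bytes(i % 251 for i in range(4096))
frames = make_frames(page)
in_order = receive(frames)
swapped = receive([frames[1], frames[0], frames[2]])
```

In the in-order case the page is complete with no copy; with the first two frames swapped, the cleanup path restores the same page but copies all 4096 bytes.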
Speculative Defragmentation III
Fragmentation/Defragmentation of a 4 KByte memory page by the DMA of the network interface
Figure: on the sending side an sk_buff references the protocol headers and the 4 KByte page (zcdata); the DMA fragments the page into a 1st, 2nd and 3rd frame carrying 1460, 1480 and 1156 bytes, which cross the Ethernet network; on the receiving side the DMA defragments them into an sk_buff header and a zcdata page.
Performance Evaluation
• Gains by Successful Speculation
• Penalty for Speculation Misses
• Speculation Success Rates in Applications
• Consequences:
  - Network Control Architecture
  - Suggested Hardware Improvements
Gains with Speculation
TCP/IP Performance of Gigabit Ethernet
Figure (bar chart): transfer rate [MByte/s], axis 0 to 80
  Configuration                                     Copies   MByte/s
  Linux 2.2 Standard                                         42
  ZeroCopy Remapping with Copying Driver            1 copy   46
  Speculative Defrag. with Copying Socket API       1 copy   45
  Spec. Defragmentation with ZeroCopy Remapping     0 copy   65
  Spec. Defragmentation with ZeroCopy FBufs         0 copy   75
→ 80 % increase in performance (bandwidth)
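The 80 % figure can be checked against the bar values on this slide. The short calculation below takes Linux 2.2 Standard as the baseline; the dictionary keys are shortened labels chosen here for illustration.

```python
BASELINE = 42  # MByte/s, Linux 2.2 Standard

RATES = {  # MByte/s, read off the bar chart
    "remapping + copying driver": 46,
    "spec. defrag + copying socket API": 45,
    "spec. defrag + remapping": 65,
    "spec. defrag + FBufs": 75,
}

def gain_percent(rate: float, baseline: float = BASELINE) -> float:
    """Relative bandwidth increase over the baseline, in percent."""
    return (rate - baseline) / baseline * 100.0

gains = {name: round(gain_percent(rate)) for name, rate in RATES.items()}
```

The best configuration gives (75 - 42) / 42, about 79 %, which the slide rounds to 80 %. Removing only one of the two copies yields 10 % or less.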
Penalty with Speculation Misses
TCP/IP Performance of Gigabit Ethernet
Figure (bar chart): transfer rate [MByte/s], axis 0 to 80
  Case            Sender             Receiver             MByte/s
  Operation       Standard Sender    Standard Receiver    42
  Fallback        Standard Sender    Zero-Copy Receiver   35
  Compatibility   Zero-Copy Sender   Standard Receiver    45
→ The common case is fast, the fallback not much slower
Evaluation of Success Rates
• Application traces show success of speculative transfers
  Traces                   Oracle TPC-D                 TreadMarks SOR
                           Master   Host1   Host2      Master   Host1   Host2
  Ethernet frames  total   129835   67524   62311      68182    51095   50731
                   large    90725   45877   44848      44010    30707   30419
                   zcopy    79515   37833   41682      44004    30693   30405
                   ok       38235   37833   41682      44004    30675   30399
  Success Rate              48 %    100 %   100 %      100 %    > 99 %  > 99 %
• TreadMarks has an inherent scheduling that prevents interference
• TPC-D needs a control architecture or hardware changes
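The success rates in the table match the ratio of the "ok" row to the "zcopy" row. That reading is inferred from the numbers rather than stated on the slide, and the sketch below recomputes the rates under that assumption.

```python
TRACES = {  # (zcopy frames, ok frames) per host, copied from the table
    "TPC-D Master": (79515, 38235),
    "TPC-D Host1": (37833, 37833),
    "TPC-D Host2": (41682, 41682),
    "SOR Master": (44004, 44004),
    "SOR Host1": (30693, 30675),
    "SOR Host2": (30405, 30399),
}

def success_rate(zcopy: int, ok: int) -> float:
    """Percentage of zero-copy frames whose speculation succeeded."""
    return ok / zcopy * 100.0

rates = {host: success_rate(z, ok) for host, (z, ok) in TRACES.items()}
```

Only the TPC-D master, which receives from both hosts, falls to about 48 %; every other host stays above 99.9 %.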
Network Control Architecture
• Problem: Multiple synchronous, fast receives may garble the zero-copy frames
• Solution: Admission Control on Ethernet driver level with negotiation for one single sender to blast
• Implicit channel allocation by OS works
• Fully transparent
• No explicit scheduling of transfers through a special interface → the API remains the same
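The idea of admitting a single sender at a time can be pictured as a small grant table on the receiving side. The sketch below is hypothetical: the class name, the FIFO waiting queue and the request/release calls are assumptions made for illustration, and the slide does not describe the negotiation protocol the driver actually uses.

```python
from collections import deque

class BlastAdmission:
    """Grants the zero-copy channel to one sender at a time (toy model)."""

    def __init__(self):
        self.owner = None       # sender currently allowed to blast
        self.waiting = deque()  # senders queued in request order

    def request(self, sender: str) -> bool:
        """Return True if sender may blast now, otherwise queue it."""
        if self.owner is None or self.owner == sender:
            self.owner = sender
            return True
        if sender not in self.waiting:
            self.waiting.append(sender)
        return False

    def release(self, sender: str):
        """Give up the channel and return the next owner, if any."""
        if self.owner != sender:
            return None  # only the current owner can release
        self.owner = self.waiting.popleft() if self.waiting else None
        return self.owner

ctl = BlastAdmission()
first = ctl.request("host1")      # channel is free
second = ctl.request("host2")     # host1 holds the channel, host2 waits
next_owner = ctl.release("host1")
```

In this model a second sender is never refused outright; it waits until the current blast ends. Denied senders could also fall back to the standard copying path.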
Suggested Hardware Improvements
• Additional control-path between the checksumming- and the DMA-logic for detection of protocol & header fields → Reliable header/payload separation
• Stream detection with a simple matching register and a separate DMA descriptor chain for fast transfers:
  → Detection of at least one high performance stream
  → Separation of this stream with its DMA descriptors
  → Improvement of the speculation rate
  → Lower driver complexity