Programmable NICs: What they mean for parallel middleware (and are they here to stay?)
Anthony Skjellum, Vijay Velusamy, ChangZheng Rao, Boris Protopopov*
Department of Computer Science, Mississippi State University
* Now with Mercury Computer Systems
June 18, 2002

MSU Motivations
- Long-term efforts in parallel programming design (MPI-1, MPI-2, MPI/RT, PacketWay)
- Interested in the performance triple (latency, overhead, bandwidth), not (latency, bandwidth) alone, for real applications
- Emphasis on lowering overhead and delivering
  - Predictability
  - Overlap of communication and computation
- Interested in offloading work to NICs (security, pt2pt primitives, collective operations, ...)
- Long-term interest in programmable NICs
6/20/2002 2

Outline
- Introduction
- Survey of Programmable NICs
- Similarities, Differences
- Advantages, Disadvantages
- Economics of Programmability
- Conclusions
- Future work

Introduction
- Programmable NICs are great for research
  - Hosts
  - Switches
  - New algorithms, strategies, ideas
- Programmable NICs are flexible
- Often difficult to program
- More expensive per unit than ASICs
- Where do they fit in, when, and for how long?

Some Programmable NICs
- Myrinet LANai 2.x – 9.x (Myricom)
- Alteon (Netgear, Farallon)
- Quadrics ELAN 3 (successor to Meiko Computing Surface, which used ELAN)
- Sitera Prism
- IBM IB HCA (PPC-based)

Myrinet LANai
- Very popular NIC/network
- Simultaneous DMAs on host side (1 active) [since LANai 5]
- Simultaneous DMAs on network side (2 active) [since LANai 2]
- Chainable DMA (host + network send, host + network recv) [LANai 9]
- Programmable LANai processor (no cache, high-speed SRAM)
- No support for non-contiguous physical page DMA
- Not necessarily enough memory bandwidth for the LANai to execute many instructions while all DMAs are running (the degree varies with NIC generation)
- Clock resolution for real-time purposes: 0.5 µs
- Supports user-space communication
- NICs moderately expensive; overall solution “cost effective/upscale”
- For the last few years, official firmware (GM - Glenn’s Messages) has been promoted in lieu of end users programming the NIC themselves
- GM-based overhead has not proven as low as that of ASIC-based SANs, such as Giganet cLAN...

Alteon NIC
- 2 MIPS processors
- Dual DMA engines, DMA assist engine
- Supports TCP/IP checksum offloading, interrupt coalescing, failover
- Features jumbo frames (bigger packets of data, which offload the CPU by reducing interrupts)

Quadrics ELAN 3 [4]
- Two primary processing engines
  - Microcode processor
  - Thread processor
  - Supports four hardware threads (each can individually issue pipelined memory requests to the memory system)
- Full programmability with user thread(s)
- 32-bit RISC thread processor
- MMU contains a 16-entry, fully associative look-aside buffer and a small data path and state machine
- DMA engine prioritizes outstanding DMAs and time-slices large DMAs to prevent adverse blocking of small DMAs
- Only one packet in flight in the end-to-end protocol for ELAN 3; ELAN 4 fixes this (stop-and-wait vs. go-back-N link-layer protocols)
- Very expensive NICs; overall solution “high end”

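To make the "one packet in flight" limitation concrete, here is a minimal sketch of the textbook loss-free throughput bound for a window of packets in flight. It is not a model of the ELAN protocol itself; the function name and all numbers are hypothetical, chosen only for illustration.

```python
def window_throughput(packet_bytes, bandwidth_bytes_per_s, round_trip_s, window=1):
    """Loss-free throughput bound (bytes/s) with `window` packets in flight.

    window=1 corresponds to stop-and-wait; larger windows correspond to
    go-back-N style pipelining. Retransmissions are ignored (assumption).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    # Time to push one packet onto the wire.
    serialization_s = packet_bytes / bandwidth_bytes_per_s
    # One window of packets is acknowledged per (serialization + round trip) cycle.
    cycle_s = serialization_s + round_trip_s
    # The link itself caps the achievable rate.
    return min(bandwidth_bytes_per_s, window * packet_bytes / cycle_s)


# Hypothetical link: 1 KiB packets, 400 MB/s, 5 microsecond round trip.
one_in_flight = window_throughput(1024, 400e6, 5e-6, window=1)
four_in_flight = window_throughput(1024, 400e6, 5e-6, window=4)
print(f"window=1: {one_in_flight / 1e6:.1f} MB/s, window=4: {four_in_flight / 1e6:.1f} MB/s")
```

With these made-up numbers a single packet in flight leaves most of the link idle, while a small window is already enough to saturate it.
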
Other Noteworthy NICs
- Sitera Prism
- Intel EtherExpressPro
- Alacritech
- VI/IP from Emulex
- IBM 1st generation IB HCA (PPC)
- Myrinet FI32
- Other NICs that hold host protocols

Similarities, Differences
- DMA engines
- Handling virtual memory, pages, MMU
- Division of labor/resources
  - What part of the protocol executes where
  - How notification is accomplished
  - How resources are divided among users of the NIC

Advantages, Disadvantages
- Advantages
  - Programmability means flexible uses in non-TCP/IP situations
  - Programming models: Message Passing, DSM, Transaction, …
  - Experimentation with different kinds of transports
  - Scale-relevant transport choices (performance vs. resource consumption)
  - Evolution (IB)
- Disadvantages
  - Processor overhead remains for things that don’t fit on the NIC
  - Relatively slow NICs in most cases

Economics of NIC Programmability
- Cost to develop/maintain/upgrade transports
- Cost to deliver a NIC with programmability
- First generation parts: flexibility and low volume win out
- Next generation parts: flexibility not needed, high volume demands savings (read ASIC)
- Even programmable parts may not be programmable by end users if the vendor won’t reveal this feature, because supporting it costs the vendor money and effort

Qualitative 1st vs. 2nd generation cost tradeoff
[Figure: aggregate whole cost (vertical axis) vs. number of parts or volume (horizontal axis). One line for the 1st generation part (e.g., µP + FPGA(s)) and one for the 2nd generation part (e.g., ASIC); the lines cross at a break-even point.]
- Programmability a plus for first generation prototyping
- Programmability most often a casualty of the economy of scale sought in the 2nd generation
- Economies of scale within a generation/technology probably make for sublinear growth, so our lines are qualitative upper bounds...

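The break-even point on this slide can be sketched as the crossing of two straight cost lines. The linear form cost(n) = fixed + unit * n and every number below are assumptions for illustration only; the slide itself notes that real growth is probably sublinear, so the lines are upper bounds.

```python
def break_even_volume(fixed_1, unit_1, fixed_2, unit_2):
    """Volume n at which two linear aggregate-cost lines cross.

    Each line is modeled (assumption) as cost_k(n) = fixed_k + unit_k * n.
    Returns None when the lines never cross at a positive volume.
    """
    if unit_1 == unit_2:
        return None  # parallel lines: no crossing
    n = (fixed_2 - fixed_1) / (unit_1 - unit_2)
    return n if n > 0 else None


# Hypothetical figures: a programmable 1st generation part with low up-front
# cost and high unit cost, versus an ASIC with high up-front cost and low unit cost.
n_star = break_even_volume(fixed_1=100_000, unit_1=500,
                           fixed_2=2_000_000, unit_2=120)
print(f"break-even at {n_star:.0f} parts")
```

Below the break-even volume the programmable part is cheaper in aggregate; above it the ASIC wins, which is the economy-of-scale pressure the slide describes.
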
Myricom-like tradeoff
[Figure: aggregate whole cost (vertical axis) vs. number of parts or volume (horizontal axis), with lines for increasing product generations and the annotation “ASIC someday?”]
- Economies of scale within a generation/technology probably make for sublinear growth, so our lines are qualitative upper bounds...

Qualitative Costs of Programmability ($ and/or Performance)
- Capital cost:
  - Additional cost of the fielded system relative to the nearest-equivalent non-programmable system
  - Additional hardware needed to reach “level X of required performance” for acceptability of the system (some app. suite)
- Cost of ownership:
  - Additional maintenance of the extra hardware/software needed because of any inefficiency, including extra failures of the larger system
  - Lower efficacy of a higher-overhead system causes more cycles to be spent on overhead-intolerant applications

Security Implications of Programmable NICs
- DMA memory operations potentially mean cross-SAN access to the memory of disparate processes (security important)
- NIC memory holds contents that can persist between sessions and co-exist for multi-level security (headache)
- Potential new opportunity for covert channels (covert channels cannot be disproven, but the NIC appears to be a source)
- Some intrusion attacks: denial of service by messing up the NIC, malicious NIC code, using the NIC to damage a local or remote resource
- Some insider threats: copying data not belonging to your process, obtaining more than a fair share of resources in a QoS setting, executing code improperly on the NIC, using the NIC as a means to move data between unrelated processes in one system

Real Time Implications of Programmable NICs
- Known problem of coordinating two CPUs with a single schedule
- If the CPU drives the NIC in non-programmable mode, then all need for programmability is moot
- For example, Xinyan Zan found in her MS thesis that GM is not at all predictable
- Chakravarthi et al. implemented stringent Myrinet control programs to achieve high predictability (less bandwidth, higher latency, better predictability)

When is Offloading to NIC better than using the HOST?
- Limiting cases
  - Zero-performance NIC -> use host [e.g., case of Meiko CS-1 ELAN]
  - Very fast integer-performance NIC relative to host -> use NIC; host becomes like a “second level” (as with Myricom 2-level multicomputer) or attached processor
- In between, a legitimately interesting constrained combinatorial optimization of (latency, overhead, bandwidth), aimed at minimizing application runtime
- Studying, modeling, and describing this problem is a key area of our current research effort. (B. Protopopov is completing a PhD in this area at MSU.)

When is Offloading to NIC better than using the HOST?
- Memory hierarchy also impacts whether a host or a NIC should do certain work
- A concurrency metric, (h/w_latency) * (h/w_bandwidth) / quantum_work, provides an upper bound on the number of concurrent activities a host needs in order to mask network latency… how big is this number?
- quantum_work is the number of bytes needed for something useful

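The concurrency metric on this slide is simple enough to evaluate directly. The sketch below computes it as written; the function and parameter names and the sample hardware figures are hypothetical and do not describe any particular NIC.

```python
def concurrency_bound(hw_latency_s, hw_bandwidth_bytes_per_s, quantum_work_bytes):
    """Upper bound on concurrent activities needed to mask network latency.

    Implements (h/w_latency) * (h/w_bandwidth) / quantum_work, where
    quantum_work is the number of bytes needed for something useful.
    """
    if quantum_work_bytes <= 0:
        raise ValueError("quantum_work must be positive")
    # latency * bandwidth is the number of bytes "in flight" on the network.
    bytes_in_flight = hw_latency_s * hw_bandwidth_bytes_per_s
    return bytes_in_flight / quantum_work_bytes


# Hypothetical network: 10 microsecond latency, 250 MB/s, 1 KiB quantum of work.
bound = concurrency_bound(10e-6, 250e6, 1024)
print(f"about {bound:.1f} concurrent activities")
```

For these made-up figures the bound is roughly 2.4, so only a handful of concurrent activities would be needed; a higher latency-bandwidth product or a smaller quantum of work pushes the number up.
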
What should be left for host?
- Connection management, flow control, per-connection-sized work/space [UNM/Sandia/Portals philosophy]
- Items that naturally travel up the memory hierarchy anyway
- Some of these items may be handled cheaply now, given additional thread concurrency (e.g., hyperthreading), whose cycles may be resources in excess anyway

Protopopov’s Analysis, Pt 1
- If CPU does protocol stack
  - There are potentially several cache-assisted memory copies of data
    - Checksum, protocol-layer packet formatting
- If NIC does protocol stack
  - Does the NIC do the work in place (in host DRAM, in local memory)?
  - What is the implication of the cache flush/invalidate on moving data between NIC and host?

Protopopov’s Analysis, Pt 2
- System achieves a lower bound on performance when a component becomes a bottleneck, i.e., is utilized 100% (e.g., NIC)
- System achieves a certain level of (higher) performance when the utilization of components is balanced; this provides a constraint curve to follow toward higher utilization

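One simple way to picture the bottleneck argument is a pipeline in which each component spends a fixed time per message. This is an illustrative assumption, not Protopopov's actual model; the component names and per-message times are hypothetical.

```python
def pipeline_utilization(service_time_s):
    """Throughput and per-component utilization of a simple message pipeline.

    service_time_s maps a component name to the time it spends per message.
    Assumption: components work in parallel on different messages, so the
    slowest one sets the message rate and is 100% utilized.
    """
    bottleneck_s = max(service_time_s.values())
    throughput = 1.0 / bottleneck_s  # messages per second
    utilization = {name: t / bottleneck_s for name, t in service_time_s.items()}
    return throughput, utilization


# Hypothetical split of per-message work: the NIC is the bottleneck.
rate, util = pipeline_utilization({"host_cpu": 2e-6, "nic": 8e-6, "link": 4e-6})
print(f"{rate:.0f} msgs/s", util)

# Moving some work from the NIC back to the host balances utilization
# and raises the message rate.
rate_balanced, util_balanced = pipeline_utilization(
    {"host_cpu": 5e-6, "nic": 5e-6, "link": 4e-6})
print(f"{rate_balanced:.0f} msgs/s", util_balanced)
```

In this toy setting the unbalanced split leaves the host mostly idle while the NIC saturates; shifting work until utilizations are closer raises throughput, which is the direction the constraint curve points.
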
Protopopov’s Analysis, Pt 3
- “Computational slack” [cf. BSP model] available to the application helps establish whether a NIC should be specified
- Slack is the amount of computation available in processors to manage communication overhead without increasing the overall application critical path (time to solution)

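As a rough illustration of the slack idea, the sketch below treats one application step as having a fixed duration on the critical path and asks whether the host's idle time within that step can absorb the communication overhead. The per-step framing, the function name, and the numbers are assumptions made for illustration, not the analysis from the talk.

```python
def slack_check(step_time_s, compute_time_s, comm_overhead_s):
    """Return (slack, fits) for one application step.

    slack: host time in the step not already spent computing (assumption:
           the step length is fixed by the critical path).
    fits:  True if the host can absorb the communication overhead without
           lengthening the step; False suggests offloading is worth a look.
    """
    slack_s = max(0.0, step_time_s - compute_time_s)
    return slack_s, comm_overhead_s <= slack_s


# Hypothetical step: 1 ms long, 0.7 ms of computation on the host.
slack, fits = slack_check(1e-3, 0.7e-3, comm_overhead_s=0.2e-3)
print(f"slack = {slack * 1e3:.2f} ms, host can absorb overhead: {fits}")
```

When the overhead exceeds the slack, the host would stretch the critical path, and moving that work to a NIC becomes the more attractive option.
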