TRAM: Improving Fine-grained Communication Performance with Topological Routing and Aggregation of Messages


  1. TRAM: Improving Fine-grained Communication Performance with Topological Routing and Aggregation of Messages
     Presented by Lukasz Wesolowski
     11th Annual Charm++ Workshop, April 15-16, 2013

  2. Topological Routing and Aggregation Module

  3. Topological: exploits the physical network topology

  4. Routing: determines the message path

  5. Aggregation: combines messages

  6. Module: a component of a larger system

  7. Introduction
     • Charm++ library
       – Prototype: Mesh Streamer
       – Originally developed for the 2011 Charm++ HPC Challenge submission
     • Aggregates fine-grained messages to improve communication performance

  8. Why Aggregation?
     • Sending a message involves overhead
       – Allocating a buffer
       – Serializing into the buffer
       – Injecting onto the network
       – Routing
       – Receiving
       – Scheduling

  9. Communication Overhead
     • Some overhead depends on data size
       – Serialization
     • Some does not
       – Scheduling
     • Aggregation targets the latter, constant overhead
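
This split is captured by the standard linear cost model (stated here for concreteness; $\alpha$ and $\beta$ are generic names for the constant per-message cost and the per-byte cost, not values from the talk): sending a $b$-byte message takes $t(b) = \alpha + \beta b$. Sending $k$ items of $s$ bytes each as separate messages costs $k\alpha + k\beta s$, while one aggregated message costs roughly $\alpha + k\beta s$, eliminating $(k-1)\alpha$ of constant overhead.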

  10. Two Types of Constant Overhead
     • Processing overhead
       – Processing time involved in sending a message
     • Bandwidth overhead
       – To send some bytes on the network you must first … send some more bytes on the network
       – What does it mean to send a 0-byte message?
         • Answer: in Charm++, to send at least 48 bytes

  11. Bandwidth Overhead
     • Message header
       – Charm++ envelope: 48 bytes
     • Network overhead
       – Routing
       – Error checking
       – Partially filled packets
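
To see how quickly the envelope dominates small messages, take a hypothetical 8-byte data item (the payload size is assumed for illustration; the 48-byte envelope is from the slide above): sent alone it puts $48 + 8 = 56$ bytes on the wire, so roughly $48/56 \approx 86\%$ of the traffic is overhead. Aggregating 100 such items under a single envelope gives $48 + 100 \cdot 8 = 848$ bytes, cutting the overhead share to $48/848 \approx 5.7\%$.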

  12. Bandwidth Methodology
     • Network bandwidth is tricky to deal with
     • Fundamentally, it is a property of a single link, but our tendency is to try to distill it into a single value (e.g., bisection bandwidth)
     • If all links are utilized equally and link bandwidth is saturated, then each link's consumption is significant
       – We can then add up each link's utilization and concern ourselves with this aggregate bandwidth

  13. Fine-grained Communication
     • Constant communication overhead really adds up when sending large numbers of small messages
       – What about large numbers of large messages?
     • Sources of fine-grained communication
       – Control messages, acknowledgments, requests, etc.
     • For strong scaling, communication becomes increasingly fine-grained with increasing processor count

  14. Why Routing?
     • By routing, we mean not selection of the links along which messages travel, but instead:
       – Selection of intermediate destination nodes or processes, and delivery of the message to the runtime system at the intermediate destinations
       – Analogy: a bus route
     • Why does a passenger bus make stops before reaching the end of the route?

  15. Why Routing?
     • Why does a bus make stops before reaching the end of the route?
       – To serve more people along its direction of travel
         • Picking up people who want to board the bus at ANY stop along the route
         • Dropping off people whose destination is ANY subsequent stop along the route
       – Stopping at n stops serves n(n-1)/2 separate trips (source/destination pairs: choose a boarding stop and any later alighting stop)
     • This is why a relatively small number of buses can serve a large area of a city

  16. Why Topological?
     • It is infeasible to have a separate hardware network link between every pair of nodes in the system
     • Consequences:
       – Some messages must travel through one or more intermediate nodes or switches
         • How it happens is normally invisible to the application and runtime system
       – Aggregate bandwidth consumed grows linearly with every additional link along the route

  17. Congestion
     • Messages traveling concurrently along a link must split the bandwidth, leading to congestion
     • Due to aggregation, TRAM messages are larger than typical, so congestion is of higher concern

  18. Network Topology
     • No single network topology for supercomputers is accepted as best, so in practice several are in use
     [Figures: example network topologies. Sources: en.wikipedia.org; Bhatele et al., SC '11; wiki.ci.uchicago.edu]

  19. Virtual Topology
     • The nodes of a physical topology can be mapped onto a virtual topology
     • The same virtual topology can be reused for various physical topologies
     • TRAM employs a mesh virtual topology

  20. Topological Routing
     • Most messages pass across multiple links to reach the destination
     • We can try combining messages, taking advantage of intermediate destinations analogously to bus stops
     • But hardware-level routing is transparent to the runtime system
       – Solution: lift routing into software, at the level of the runtime system
       – Possible pitfall: routing will still happen independently in hardware

  21. Minimal Routing
     • Routing is minimal if every message sent travels over the minimum number of links possible to reach its destination
     • Our goal with TRAM is to preserve minimal routing if possible
       – Reason: non-minimal routing consumes additional aggregate bandwidth
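
On a mesh, this minimum has a simple closed form (a standard fact, stated for concreteness): a message traveling from coordinates $(s_1, \ldots, s_N)$ to $(d_1, \ldots, d_N)$ must cross at least $\sum_{i=1}^{N} |d_i - s_i|$ links, the Manhattan distance between the two coordinate vectors; a routing scheme is minimal exactly when it never exceeds this bound.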

  22. Virtual to Physical Topology Mapping
     • Simplest and often best: make the virtual topology identical to the physical one
       – Using the Charm++ Topology Manager
     • For high-dimensional meshes and tori
       – Reduce the number of dimensions while preserving minimal routing
     • Fat trees
       – Use a 2D virtual topology: one dimension within nodes, one across nodes
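
As one illustration of dimension reduction (a sketch under my own assumptions: the linearization scheme and the names foldCoords and unfoldIndex are illustrative, not what TRAM or the Topology Manager actually does), two physical dimensions of sizes a and b can be folded into one virtual dimension of size a*b, so that movement along the virtual dimension still maps onto the same two physical dimensions:

    #include <utility>

    // Fold physical coordinates (x, y), with 0 <= x < a, into a single
    // virtual index in [0, a*b); illustrative sketch only.
    int foldCoords(int x, int y, int a) {
      return x + a * y;  // x varies fastest
    }

    // Recover the physical coordinates from a virtual index.
    std::pair<int, int> unfoldIndex(int v, int a) {
      return { v % a, v / a };
    }

Whether such a folding preserves minimal routing depends on how the underlying network routes between arbitrary peers along the folded pair of dimensions.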

  23. Data Item
     • The unit of fine-grained communication sent by TRAM
     • Sent to a particular destination
     • Submitted using a local library call instead of the regular Charm++ syntax for a message send

  24. TRAM Peers
     • In the context of TRAM, a process is allowed to communicate only with its peers
       – Peers are all the processes that can be reached from it by moving arbitrarily far strictly along a single dimension
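
Concretely, in a virtual mesh with dimension sizes $n_1, \ldots, n_N$, every process therefore has $\sum_{i=1}^{N} (n_i - 1)$ peers. The helper below is an illustrative sketch of that count, not TRAM code:

    #include <vector>

    // Number of TRAM peers of any process in a virtual mesh: along each
    // dimension, every other index position is a peer.
    int peerCount(const std::vector<int>& dims) {
      int peers = 0;
      for (int n : dims) peers += n - 1;
      return peers;
    }
    // peerCount({32, 32, 32}) == 93, matching the footprint example on slide 28.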

  25. Mesh Routing Algorithm
     • Order the N dimensions in the virtual topology
     • According to the order, send data items along the highest dimension whose index does not match the destination's
       – To the peer whose index does match the final destination's index along that dimension
     • Aggregate at the source and each intermediate destination
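
A minimal sketch of this dimension-ordered selection (illustrative only; TRAM's actual implementation is structured differently): given the current and final coordinates of a data item, the next intermediate destination corrects the highest mismatched dimension and leaves the rest unchanged.

    #include <cstddef>
    #include <vector>

    // Next intermediate destination under dimension-ordered mesh routing.
    // Coordinates run from the highest dimension (index 0) downward.
    // Returns cur unchanged when the item has already arrived.
    std::vector<int> nextHop(const std::vector<int>& cur,
                             const std::vector<int>& dst) {
      std::vector<int> next = cur;
      for (std::size_t i = 0; i < cur.size(); ++i) {
        if (cur[i] != dst[i]) {  // highest dimension that does not match
          next[i] = dst[i];      // jump to the peer matching the destination
          break;                 // correct one dimension per hop
        }
      }
      return next;
    }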

  26. Mesh Routing and Aggregation [diagram slide]

  27. Aggregation Buffer Size
     • Buffers should be large enough to give good bandwidth utilization, but no larger
       – Buffering time should be relatively low
     • On Blue Gene/P, buffers of 4 KB or more are sufficient to almost saturate the bandwidth
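
A rough way to quantify "large enough" (a standard half-peak-bandwidth argument, reusing the illustrative constants from the cost model earlier, not measurements): flushing a buffer of $s$ bytes achieves effective bandwidth $s / (\alpha + s/\beta)$, where $\alpha$ is the per-message cost and $\beta$ the link bandwidth. This reaches half of $\beta$ at $s = \alpha\beta$ and grows only slowly beyond a few multiples of that point, so larger buffers mainly add buffering delay.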

  28. TRAM Memory Footprint
     • The number of peers is typically a small fraction of all the processes in the run
       – For example, in a 32 x 32 x 32 topology:
         • 32,768 processes
         • 93 peers
     • This allows TRAM's memory footprint to remain relatively small
       – Small enough to fit in a lower-level cache
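
As a rough worked example (assuming, purely for illustration, one 4 KB aggregation buffer per peer, the size suggested on the Blue Gene/P slide above): $93 \times 4\,\mathrm{KB} \approx 372\,\mathrm{KB}$ of buffer space per process, consistent with the claim above.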

  29. TRAM Usage Pattern
     • Start-up
     • Initialization
     • Sending and receiving
     • Termination
     • Re-initialization
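
In code, the pattern might look roughly like the sketch below (hypothetical only: the class Streamer and the methods insertData and done are illustrative stand-ins modeled on this usage pattern, not TRAM's actual API or signatures):

    #include <cstddef>
    #include <cstdio>
    #include <map>
    #include <vector>

    // Hypothetical TRAM-like streamer: aggregates items per destination
    // and flushes a buffer once it is full. Illustrative sketch only.
    class Streamer {
      std::size_t threshold_;                  // items per flush
      std::map<int, std::vector<int>> bufs_;   // destination -> pending items
      void flush(int dest) {
        std::printf("send %zu items to %d in one message\n",
                    bufs_[dest].size(), dest);
        bufs_[dest].clear();
      }
     public:
      explicit Streamer(std::size_t threshold) : threshold_(threshold) {}
      void insertData(int item, int dest) {    // submit one fine-grained item
        bufs_[dest].push_back(item);
        if (bufs_[dest].size() >= threshold_) flush(dest);
      }
      void done() {                            // flush all remaining buffers
        for (auto& kv : bufs_)
          if (!kv.second.empty()) flush(kv.first);
      }
    };

    int main() {
      Streamer s(4);                 // initialization
      for (int i = 0; i < 10; ++i)   // sending phase
        s.insertData(i, i % 3);
      s.done();                      // termination; re-initialize to reuse
      return 0;
    }

Here termination is simply a final flush; in a real distributed run it also involves detecting that no aggregated messages remain in flight before re-initialization.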

  30. Alltoall Performance on Blue Gene/P [results chart]

  31. ChaNGa on Blue Gene/Q [results chart]

  32. EpiSimdemics on Blue Waters [results chart]

  33. Future Plans
     • Develop alternative virtual topologies for non-mesh networks
     • Generalize
       – First within Charm++
       – Then to other communication models
     • Automate
       – Library parameter selection
       – Virtual topology dimensions
       – Choice of which messages to aggregate

  34. Acknowledgements
     This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. EpiSimdemics results courtesy of Jae-Seung Yeom and the EpiSimdemics team.

  35. Thank You
