Burst Buffer Simulation in a Dragonfly Network
Jian Peng, Michael Lang
Illinois Institute of Technology, Los Alamos National Laboratory
Purpose:
• Residing in the compute-node network and built on solid-state drives (SSDs), burst buffers bring a significant I/O performance boost compared with traditional external hard-disk-drive (HDD) storage systems.
• The bottlenecks that limit how fully burst buffers can be leveraged are still unknown:
  • Shared network
  • Limited number of burst buffer nodes
• Because of the large system scale, it is usually too expensive to change the system setup and configuration on the real machine, which motivates studying these questions in simulation.
Trinity System Overview:
• System level
• Cabinet level
• Chassis level
Inside a Group:
Trinity Phase II Network Configuration:
• Global link: 37.6 GB/s
• Local link:
  • Intra-chassis: 5.25 GB/s
  • Inter-chassis: 15.75 GB/s (3 tiles); see the arithmetic check below
• Intra-blade link: PCIe 3.0, 16 GB/s
*All bandwidths are bi-directional.
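The inter-chassis figure follows directly from the per-tile rate; a minimal C check of that arithmetic is sketched below (the constant names are ours, not identifiers from the simulator):

    #include <assert.h>
    #include <stdio.h>

    /* Link bandwidths from the Trinity Phase II configuration above (GB/s).
     * Constant names are illustrative, not taken from the simulator. */
    #define BW_GLOBAL_LINK   37.6   /* optical inter-group link             */
    #define BW_TILE          5.25   /* one tile, intra-chassis link         */
    #define BW_INTER_CHASSIS 15.75  /* inter-chassis link, built of 3 tiles */
    #define BW_PCIE3_X16     16.0   /* node-to-router PCIe 3.0 x16 link     */

    int main(void)
    {
        /* The 15.75 GB/s inter-chassis link is 3 tiles of 5.25 GB/s each. */
        assert(3 * BW_TILE == BW_INTER_CHASSIS);
        printf("inter-chassis link = %.2f GB/s (3 x %.2f GB/s tiles)\n",
               BW_INTER_CHASSIS, BW_TILE);
        return 0;
    }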
General:
• All-to-all interconnection among routers within a group.
• Optical cable links between groups. In Trinity, each group-to-group link is made of 2 cables, and each cable provides 4.7 GB/s of bandwidth (the sketch below works out the resulting group-to-group bandwidth).
• At the cabinet level, each group contains 2 cabinets connected by backplane electrical links, and each cabinet contains 3 chassis. Each inter-chassis link provides 15.75 GB/s.
• Inside a chassis, each router-to-router link provides 5.25 GB/s.
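A similarly minimal sketch of the inter-group cabling arithmetic, assuming only the 2-cable, 4.7 GB/s-per-cable figures above (names are illustrative):

    #include <stdio.h>

    /* Inter-group (optical) wiring figures from the text above.
     * Names are illustrative, not taken from the simulator. */
    #define CABLES_PER_GROUP_LINK 2
    #define GBPS_PER_CABLE        4.7

    int main(void)
    {
        /* Aggregate optical bandwidth between one pair of groups. */
        double group_pair_bw = CABLES_PER_GROUP_LINK * GBPS_PER_CABLE;
        printf("group-to-group optical bandwidth: %.1f GB/s\n", group_pair_bw);
        return 0;
    }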
Connections-Router
• All nodes are connected through routers.
• 10 inter-group ports
• 15 inter-chassis ports
• 15 inter-blade ports
Connections-Chassis
• 16 blades
• 40 connectors to other groups
• 5 connectors to other chassis per blade
• Backplane connections among blades
• PCIe 3.0 x16 between a node and its blade
(The sketch below tallies the per-router port counts listed above.)
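Putting the router port counts together, a small illustrative C struct (field names are ours) tallies the 40 network ports each router exposes:

    #include <stdio.h>

    /* Per-router port counts from the bullets above; field names are ours. */
    struct router_ports {
        int inter_group;    /* 10 optical ports to other groups       */
        int inter_chassis;  /* 15 electrical ports to other chassis   */
        int inter_blade;    /* 15 backplane ports within the chassis  */
    };

    int main(void)
    {
        struct router_ports p = { 10, 15, 15 };
        printf("network ports per router: %d\n",
               p.inter_group + p.inter_chassis + p.inter_blade);  /* 40 */
        return 0;
    }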
Connections-Inter-group
• Each inter-group connection joins a group port on one router in each of the two groups.
• One link between each pair of groups.
• Uses the absolute (direct) connection pattern (illustrated in the sketch below).
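As a rough illustration of a direct pattern with one link per group pair (not the simulator's actual wiring code), the loop below enumerates the inter-group links for the 23-group Phase II configuration described in the simulation details:

    #include <stdio.h>

    /* Illustrative enumeration of a direct inter-group pattern:
     * exactly one global link between every pair of groups.
     * This sketches the idea, not the CODES dragonfly wiring code. */
    #define NUM_GROUPS 23   /* Trinity Phase II group count */

    int main(void)
    {
        int links = 0;
        for (int src = 0; src < NUM_GROUPS; src++)
            for (int dst = src + 1; dst < NUM_GROUPS; dst++)
                links++;    /* one link joining group src and group dst */
        printf("inter-group links: %d\n", links);   /* 23*22/2 = 253 */
        return 0;
    }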
DataWarp
• Burst buffers are implemented as DataWarp nodes in the Cray XC40.
Simulation Detail
• Each group contains:
  • 96 routers
  • 10 burst buffer nodes
  • 2 LNET nodes (224 LNET nodes in the final phase)
  • 360 compute nodes
  • 384 nodes in total, 372 of which are simulated
• Trinity Phase II: 23 groups
  • 230 burst buffer nodes
  • 8280 compute nodes
• Adaptive routing
(These counts are cross-checked in the sketch below.)
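The per-group and system-wide counts are mutually consistent; a quick C check, with the numbers copied from the bullets above and illustrative variable names:

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        /* Per-group counts from the bullets above. */
        const int groups            = 23;
        const int bb_per_group      = 10;
        const int lnet_per_group    = 2;
        const int compute_per_group = 360;
        const int nodes_per_group   = 384;

        /* 360 + 10 + 2 = 372 nodes per group are actually simulated. */
        assert(compute_per_group + bb_per_group + lnet_per_group == 372);

        /* System-wide Phase II totals. */
        assert(groups * bb_per_group      == 230);
        assert(groups * compute_per_group == 8280);

        printf("simulated nodes per group: %d of %d\n",
               compute_per_group + bb_per_group + lnet_per_group,
               nodes_per_group);
        return 0;
    }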
Simulation Framework
• Application layer
  • IOR workload
  • Darshan 3.1 workload
• Model-net layer
  • Burst buffer process
• Built on CODES 0.5.2
(A hedged sketch of this layering follows.)
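Below is a hedged sketch of how these layers fit together, assuming an IOR/Darshan-style request record handed from the application layer to a burst buffer process through the model-net layer; every identifier is a placeholder of ours, not the CODES 0.5.2 API:

    #include <stdint.h>
    #include <stdio.h>

    /* Hedged sketch of the layering above: an application-layer process
     * replays IOR/Darshan-style records and hands them to a burst buffer
     * process via the model-net layer.  All names here are illustrative
     * placeholders, not the CODES 0.5.2 API. */

    enum io_op { IO_WRITE, IO_READ };

    struct io_request {
        enum io_op op;       /* operation replayed from the workload */
        uint64_t   size;     /* request size in bytes                */
        int        src_rank; /* issuing compute process              */
    };

    /* Burst buffer process: in the real model this is where SSD service
     * time would be charged before forwarding data onward. */
    static void bb_handle_request(const struct io_request *req)
    {
        printf("BB node got %s of %llu bytes from rank %d\n",
               req->op == IO_WRITE ? "write" : "read",
               (unsigned long long)req->size, req->src_rank);
    }

    /* Stand-in for the model-net layer, which would model the dragonfly
     * links described earlier instead of delivering the request directly. */
    static void send_over_model_net(const struct io_request *req)
    {
        bb_handle_request(req);
    }

    int main(void)
    {
        /* Application layer: one 8 MB write, matching the stripe size
         * used in the N-N experiments. */
        struct io_request req = { IO_WRITE, 8ull * 1024 * 1024, 0 };
        send_over_model_net(&req);
        return 0;
    }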
Results: N-N Write
• 4 procs per node
• 8 MB stripe size on the burst buffer
• <1024: 32 GB per proc
• >=1024: 2 GB per proc
(Aggregate write volumes are worked out in the sketch below.)
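For a sense of scale, the sketch below works out the aggregate write volume implied by these parameters, under the assumption that the 1024 threshold refers to the process count (function and variable names are ours):

    #include <stdint.h>
    #include <stdio.h>

    /* Aggregate N-N write volume for a given process count, using the
     * per-process sizes quoted above.  We assume the 1024 threshold
     * refers to the number of processes; names are illustrative. */
    static uint64_t total_write_gb(uint64_t nprocs)
    {
        uint64_t per_proc_gb = (nprocs < 1024) ? 32 : 2;
        return nprocs * per_proc_gb;
    }

    int main(void)
    {
        uint64_t scales[] = { 256, 512, 1024, 2048 };
        for (int i = 0; i < 4; i++)
            printf("%5llu procs -> %llu GB written in total\n",
                   (unsigned long long)scales[i],
                   (unsigned long long)total_write_gb(scales[i]));
        return 0;
    }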
Results: N-1 Write
Results: N-N Read
Problems
• Darshan traces of applications:
  • I/O at LANL is mostly checkpointing.
  • Lustre is fast enough.
  • Applications must use the DataWarp APIs.
  • The DataWarp software is still being updated.
• Currently modeling Trinity Phase II; the final phase is under way:
  • 576 burst buffer nodes
  • ~20,000 compute nodes
  • More modeling details need to be confirmed.
• READ bug (see the sketch after this list):
  • The simulation terminates early when compute nodes scale up to 2048 in the N-N Read case.
  • It terminates even sooner in the N-1 Read simulation.
  • The messages appear to have already been freed when a reverse event is received.
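The symptom points at a common pitfall in optimistic parallel discrete-event simulation: state that a reverse handler still needs is released in the forward handler, so a rollback touches freed memory. The C sketch below illustrates that failure pattern generically; it is not the actual CODES/ROSS read path:

    #include <stdlib.h>

    /* Generic illustration of the suspected READ bug: in an optimistic
     * PDES engine (such as ROSS), every forward event handler may later
     * be rolled back by a matching reverse handler.  If the forward
     * handler frees state that the reverse handler still needs, a
     * rollback touches freed memory.  This is a sketch of the failure
     * pattern, not the CODES read path. */

    struct read_msg {
        char  *payload;   /* buffered read data */
        size_t len;
    };

    /* Forward handler: BUGGY - releases the payload immediately, even
     * though the event may still be rolled back. */
    static void read_complete_forward(struct read_msg *m)
    {
        free(m->payload);
        m->payload = NULL;
    }

    /* Reverse handler: runs on rollback and expects the payload to still
     * be valid - but it has already been freed above. */
    static void read_complete_reverse(struct read_msg *m)
    {
        /* m->payload is gone here, so the forward step cannot be undone.
         * The fix is to defer the free until the event can no longer be
         * rolled back, or to keep enough state in the message to undo
         * the forward handler. */
        (void)m;
    }

    int main(void)
    {
        struct read_msg m = { malloc(16), 16 };
        read_complete_forward(&m);
        read_complete_reverse(&m);   /* rollback after the payload is gone */
        return 0;
    }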