HMC-Sim 2.0: A Simulation Platform for Exploring Custom Memory Cube Operations
John D. Leidel, Yong Chen
May 23, 2016
AsHES 2016

Overview
• Introduction & Overview
• CMC Simulation
• Sample CMC Mutexes
• Future Research

INTRODUCTION & OVERVIEW
Hybrid Memory Cube Device Simulation

GC64 Driving Research
• Driving force behind the GC64 architecture research is the ability to find and exploit memory bandwidth
• Exhaustive search on forthcoming memory technologies
  • Traditional DDR/GDDR devices did not provide sufficient accessibility and bandwidth
  • Hybrid Memory Cube devices were chosen
http://gc64.org

Intro to Hybrid Memory Cube
• Technology
  • Through-silicon-via [TSV] design that combines logic layer and DRAM layers
  • Packetized interface specification that behaves similarly to a network device
  • Routing capabilities built into the device logic layer
  • Device-to-device routing
• Hybrid Memory Cube Consortium
  • Standards body to drive the public HMC specification
  • Similar in function to JEDEC for DDR memory
  • http://www.hybridmemorycube.org/

HMC TSV Technology
• Substrate
  • Contains the physical pin-out for data, power and ground
  • SERDES
• Logic Layer
  • Contains the logic necessary to perform:
    • Routing
    • Arbitration (weakly ordered)
    • Addressing
    • AMO
• DRAM Layers
  • Contains the DRAM arrays
Source: H. M. C. Consortium. Hybrid memory cube specification 2.1, 2015.

HMC-Sim Overview
• Our architecture research required access to a configurable HMC simulation platform
  • None existed that were: 1) open source and/or 2) available without an NDA
• We exhaustively studied the HMC specification and developed HMC-Sim based upon the spec
  • …as opposed to an individual device SKU
• HMC-Sim Design Requirements
  • Configurable for different host CPUs (link connectivity, clock frequency, packet configuration, etc.)
  • Configurable for different device SKUs
  • Support for device-to-device routing
  • Simulation of all the internal queuing arbitration stages as defined by the spec
  • Cycle-based simulation
  • Discrete logging capabilities
  • Packaged as a library (can be integrated into other high-level simulators)

HMC-Sim 1.0
• Developed the first open source HMC simulation platform
  • Designed to explore how different applications affect memory throughput & latency
  • Becoming the standard for HMC modeling and simulation
• Permits us to model different concurrency mechanisms to determine the best mixture of parallelism and bandwidth across different algorithms and applications

HMC-Sim 2.0
• Several users of HMC-Sim requested a number of new features in future revisions:
  • Support for Gen2 HMC specification
  • Gen2 specification's inclusive support for atomic memory operations
  • Gen2 packet specification
  • Custom Memory Cube (CMC) exploration
• CMC Exploration
  • What if we could implement new operations in the HMC logic layer?
  • What if these operations were NOT just simple memory operations?
  • Additional atomic operations, transactional operations, arithmetic reductions, logical reductions, processing near memory, etc.
  • If we could have any operation embedded in the HMC logic layer, what would it be?

CMC SIMULATION
Custom Memory Cube Operation Simulation

CMC Support Requirements
• API Compatibility
  • Existing integration with other simulators shouldn't be broken (Sandia SST)
• External Implementation
  • CMC implementer should focus on CMC, not learning HMC-Sim internals
• Creative Experimentation
  • No limitation to the user's creativity in implementing CMC ops
• Utilize Existing HMC Packet Formatting
  • Existing crack/decode logic should be maintained
• Discrete Tracing
  • HMC-Sim 1.0 had extensive support for logging, CMC ops will need this as well
• Separable Implementation
  • Current HMC-Sim is BSD licensed. We want to make sure users can develop/distribute their CMC ideas separate from the simulator
• No Simulation Perturbation
  • No perturbation to existing simulation results!

CMC Support Architecture
• We explicitly map all the unused HMC opcodes to CMC* ops
  • 70 potential CMC opcodes
• We provide a template infrastructure to construct a single CMC operation mapped to a single opcode in a shared library
• We provide one additional API interface to load the CMC shared library at runtime
• Runtime processing is otherwise the same for CMC operations!
[Figure: libhmcsim.a holds the HMC data structures and commands (RD16, RD32, ..., WR16, WR32, ...) together with the CMC data structures and function pointers. CMC opcodes (CMC04, CMC05, CMC07, CMC20, CMC21, CMC22, CMC23, ...) map to shared libraries such as libMY_CMC_1.so, libMY_CMC_2.so and libSomeCMC.so.]

CMC Library Architecture
• The CMC library requires the user to define the structure of the CMC operation:
  • CMC Name (string): used for logging
  • Request command enum (from the list of 70)
  • Request & Response packet lengths
  • Response command enum (can be custom response)
• One function must be implemented by the user:
  • hmcsim_execute_cmc()
• Everything else is provided in our example CMC implementation
CMC Tutorial: http://gc64.org/?page_id=140
[Figure: layout of hmc_cmc_t]
  Data Structures:
    hmc_rqst_t rqst
    uint32_t cmd
    uint32_t rqst_len
    uint32_t rsp_len
    hmc_response_t rsp_cmd
    uint8_t rsp_cmd_code
    uint32_t active
    void *handle
  Function Pointers:
    int (*cmc_register)(hmc_rqst_t *, uint32_t *, uint32_t *, uint32_t *, hmc_response_t *, uint8_t *);
    int (*cmc_execute)(void *, uint32_t, uint32_t, uint32_t, uint32_t, uint64_t, uint32_t, uint64_t, uint64_t, uint64_t *, uint64_t *);
    void (*cmc_str)(char *);
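
The user-facing side of a CMC library might look like the following minimal sketch. It is not the HMC-Sim example implementation: the enum definitions are placeholders for the simulator's own types, the names cmc_register_sketch and cmc_str_sketch are hypothetical, and the parameter names given to hmcsim_execute_cmc() are assumptions, since the slide shows only the parameter types. The metadata values are taken from the hmc_lock row of the CMC Mutex Implementation slide.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder enums: in HMC-Sim these types come from the simulator headers. */
typedef enum { CMC125 = 125 } hmc_rqst_t;
typedef enum { RD_RS, WR_RS } hmc_response_t;

/* Operation description (values from the hmc_lock row of the mutex table). */
static const char           *op_name     = "hmc_lock";
static const hmc_rqst_t      op_rqst     = CMC125;
static const uint32_t        op_cmd      = 125;
static const uint32_t        op_rqst_len = 2;     /* FLITs */
static const uint32_t        op_rsp_len  = 2;     /* FLITs */
static const hmc_response_t  op_rsp_cmd  = WR_RS;
static const uint8_t         op_rsp_code = 0;     /* assumption: unused for a standard response */

/* Matches the cmc_register pointer type: hands the description to the simulator. */
int cmc_register_sketch(hmc_rqst_t *rqst, uint32_t *cmd, uint32_t *rqst_len,
                        uint32_t *rsp_len, hmc_response_t *rsp_cmd,
                        uint8_t *rsp_cmd_code)
{
    if (!rqst || !cmd || !rqst_len || !rsp_len || !rsp_cmd || !rsp_cmd_code)
        return -1;
    *rqst = op_rqst;
    *cmd = op_cmd;
    *rqst_len = op_rqst_len;
    *rsp_len = op_rsp_len;
    *rsp_cmd = op_rsp_cmd;
    *rsp_cmd_code = op_rsp_code;
    return 0;
}

/* Matches the cmc_str pointer type: copies the name used for logging.
 * The signature carries no length, so the caller must supply a large enough buffer. */
void cmc_str_sketch(char *out)
{
    assert(out != NULL);
    strcpy(out, op_name);
}

/* The one function the user must implement. Parameter names are assumptions.
 * This stub performs no memory operation and only reports success. */
int hmcsim_execute_cmc(void *hmc, uint32_t dev, uint32_t quad, uint32_t vault,
                       uint32_t bank, uint64_t addr, uint32_t length,
                       uint64_t head, uint64_t tail,
                       uint64_t *rqst_payload, uint64_t *rsp_payload)
{
    (void)hmc; (void)dev; (void)quad; (void)vault; (void)bank; (void)addr;
    (void)length; (void)head; (void)tail; (void)rqst_payload; (void)rsp_payload;
    return 0;
}
```

Keeping the description in one place means a new operation only changes the constants and the body of hmcsim_execute_cmc().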

CMC Registration
extern int hmcsim_load_cmc( struct hmcsim_t *hmc, char *cmc );
• Is HMC-Sim initialized?
  • No: return error
  • Yes: begin registering the CMC library
• Initiate the dynamic loader: dlopen( char *cmc, RTLD_NOW )
• Shared library loaded?
  • No: return error
  • Yes: register the CMC function pointers with dlsym( handle, FUNC ):
      int (*cmc_register)(hmc_rqst_t *, uint32_t *, uint32_t *, uint32_t *, hmc_response_t *, uint8_t *);
      int (*cmc_execute)(void *, uint32_t, uint32_t, uint32_t, uint32_t, uint64_t, uint32_t, uint64_t, uint64_t, uint64_t *, uint64_t *);
      void (*cmc_str)(char *);
• Execute the registration function: int (*cmc_register)(...)
• Save data to the hmc_cmc_t structure
• return success
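
The registration flow can be sketched with the POSIX dynamic loader as follows. This is an illustration under stated assumptions, not the body of hmcsim_load_cmc(): cmc_slot_t is a hypothetical stand-in for hmc_cmc_t, the request and response types are placeholders, the sim_initialized flag replaces the check on struct hmcsim_t, and the exported symbol names are invented, because the slide shows only dlsym(handle, FUNC).

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>

typedef int hmc_rqst_t;      /* placeholder for the simulator's request enum */
typedef int hmc_response_t;  /* placeholder for the simulator's response enum */

/* Hypothetical stand-in for hmc_cmc_t, using the fields listed on the slide. */
typedef struct {
    hmc_rqst_t     rqst;
    uint32_t       cmd;
    uint32_t       rqst_len;
    uint32_t       rsp_len;
    hmc_response_t rsp_cmd;
    uint8_t        rsp_cmd_code;
    uint32_t       active;
    void          *handle;
    int  (*cmc_register)(hmc_rqst_t *, uint32_t *, uint32_t *, uint32_t *,
                         hmc_response_t *, uint8_t *);
    int  (*cmc_execute)(void *, uint32_t, uint32_t, uint32_t, uint32_t,
                        uint64_t, uint32_t, uint64_t, uint64_t,
                        uint64_t *, uint64_t *);
    void (*cmc_str)(char *);
} cmc_slot_t;

/* Assumed symbol names exported by the CMC shared library. */
#define SYM_REGISTER "cmc_register"
#define SYM_EXECUTE  "cmc_execute"
#define SYM_STR      "cmc_str"

/* Returns 0 on success and -1 on any failure, mirroring the flowchart. */
int load_cmc_sketch(int sim_initialized, const char *path, cmc_slot_t *slot)
{
    assert(slot != NULL);
    if (!sim_initialized || path == NULL)
        return -1;                        /* simulator not initialized */

    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL)
        return -1;                        /* shared library not loaded */

    /* POSIX idiom for storing a dlsym() result in a function pointer. */
    *(void **)(&slot->cmc_register) = dlsym(handle, SYM_REGISTER);
    *(void **)(&slot->cmc_execute)  = dlsym(handle, SYM_EXECUTE);
    *(void **)(&slot->cmc_str)      = dlsym(handle, SYM_STR);
    if (!slot->cmc_register || !slot->cmc_execute || !slot->cmc_str) {
        dlclose(handle);
        return -1;
    }

    /* Execute the registration function and save its data into the slot. */
    if (slot->cmc_register(&slot->rqst, &slot->cmd, &slot->rqst_len,
                           &slot->rsp_len, &slot->rsp_cmd,
                           &slot->rsp_cmd_code) != 0) {
        dlclose(handle);
        return -1;
    }

    slot->handle = handle;
    slot->active = 1;
    return 0;
}
```

RTLD_NOW resolves every symbol at load time, so a library with unresolved references fails at dlopen() rather than in the middle of a simulation. On some systems the program must be linked with -ldl.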

CMC Processing
Flowchart of extern int hmcsim_process_rqst(...), which services the HMC vault request queue:
• Decode packet header & tail
• Find available response queue slot
  • Not available: return error
• Examine the request command code
• Normal HMC command?
  • Yes: process the command
  • No: process the CMC command
    • Is the CMC command active?
    • Retrieve the execution function pointer from struct hmc_cmc_t cmc[]
    • Execute the CMC command using the function pointer
• Response required?
  • Yes: register the response
• return success
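
The dispatch step can be illustrated with a small opcode table. This is a sketch under stated assumptions rather than the code of hmcsim_process_rqst(): the table size of 128 is inferred from the highest opcode named in the slides (CMC127), the two-argument cmc_exec_fn is a simplified stand-in for the 11-argument cmc_execute pointer, demo_increment is a hypothetical operation, and treating an inactive CMC opcode as an error is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_OPCODES 128u  /* assumption: opcodes 0..127 */

/* Simplified stand-in for the cmc_execute function pointer. */
typedef int (*cmc_exec_fn)(const uint64_t *rqst_payload, uint64_t *rsp_payload);

typedef struct {
    int         is_cmc;   /* 0: normal HMC command, 1: opcode mapped to a CMC op */
    uint32_t    active;   /* set once a CMC library has registered this opcode */
    cmc_exec_fn execute;  /* execution function pointer saved at registration */
} opcode_entry_t;

/* Hypothetical CMC operation: response word 0 = request word 0 + 1. */
int demo_increment(const uint64_t *rqst_payload, uint64_t *rsp_payload)
{
    rsp_payload[0] = rqst_payload[0] + 1;
    return 0;
}

/* Returns 0 on success, -1 on error. */
int process_rqst_sketch(const opcode_entry_t *table, uint32_t cmd,
                        int rsp_slot_free,
                        const uint64_t *rqst_payload, uint64_t *rsp_payload)
{
    assert(table != NULL);

    if (!rsp_slot_free)
        return -1;                 /* no response queue slot available */
    if (cmd >= NUM_OPCODES)
        return -1;                 /* command code outside the table */

    const opcode_entry_t *entry = &table[cmd];
    if (!entry->is_cmc)
        return 0;                  /* normal HMC command: handled elsewhere */

    if (!entry->active || entry->execute == NULL)
        return -1;                 /* assumption: inactive CMC op is an error */

    /* Retrieve the function pointer and execute the CMC command. */
    return entry->execute(rqst_payload, rsp_payload) == 0 ? 0 : -1;
}
```

Because the lookup is a single indexed access, normal commands pay no extra cost beyond the is_cmc test, which fits the requirement that CMC support must not perturb existing simulation results.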

CMC MUTEXES
Locking Primitives as CMC Operations

CMC Mutexes
• We implemented several CMC commands as initial tests
• What if we could accelerate traditional mutex operations?
  • HMC_LOCK
  • HMC_TRYLOCK
  • HMC_UNLOCK
• Designed to perform pthread-style mutex operations
  • Note: does not block on HMC_LOCK
[Figure: 128-bit mutex payload. Bits 127:64 hold the Thread/Task ID; bits 63:0 hold the Lock.]
• Each HMC mutex payload is a 16-byte memory location
  • Lower 8 bytes: LOCK region
  • Upper 8 bytes: Thread/Task ID
    • "Owner" of the LOCK region
    • Relative to the user's process space
• 16 bytes is wasteful… but
  • 16 bytes is the minimum request size for normal HMC RD/WR requests
  • Minimal logic overhead required to implement our mutexes

CMC Mutex Implementation

Operation    Command Enum  Request Command  Request Length  Response Command  Response Length
hmc_lock     CMC125        125              2 FLITS         WR_RS             2 FLITS
hmc_trylock  CMC126        126              2 FLITS         RD_RS             2 FLITS
hmc_unlock   CMC127        127              2 FLITS         WR_RS             2 FLITS

Pseudocode:
• hmc_lock: IF ( ADDR[63:0] == 0 ) { ADDR[127:64] = TID; ADDR[63:0] = 1; RET 1 } ELSE { RET 0 }
• hmc_trylock: IF ( ADDR[63:0] == 0 ) { ADDR[127:64] = TID; ADDR[63:0] = 1; RET ADDR[127:64] } ELSE { RET ADDR[127:64] }
• hmc_unlock: IF ( ADDR[127:64] == TID && ADDR[63:0] == 1 ) { ADDR[63:0] = 0; RET 1 } ELSE { RET 0 }

HMC_LOCK
  if( LOCK == 0 ){
    TID = MY_TID;
    LOCK = 1;
    return 1;
  }else{
    return 0;
  }

HMC_TRYLOCK
  if( LOCK == 0 ){
    TID = MY_TID;
    LOCK = 1;
    return TID;
  }else{
    return TID;
  }

HMC_UNLOCK
  if( TID == MY_TID && LOCK == 1 ){
    LOCK = 0;
    return 1;
  }else{
    return 0;
  }
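
The three operations translate directly into C. The sketch below models the 16-byte payload as a struct and applies the pseudocode to it on the host. The struct and function names are hypothetical, and the code models only the state transitions, not packet handling or simulator timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-byte mutex payload: lock in the lower 8 bytes, owner ID in the upper 8. */
typedef struct {
    uint64_t lock;  /* ADDR[63:0]:   0 = free, 1 = held */
    uint64_t tid;   /* ADDR[127:64]: thread/task ID of the owner */
} hmc_mutex_t;

/* HMC_LOCK: take the lock if it is free. Returns 1 on success, 0 otherwise.
 * It never blocks; the caller decides whether to retry. */
uint64_t hmc_lock_op(hmc_mutex_t *m, uint64_t my_tid)
{
    assert(m != NULL);
    if (m->lock == 0) {
        m->tid = my_tid;
        m->lock = 1;
        return 1;
    }
    return 0;
}

/* HMC_TRYLOCK: take the lock if it is free. Always returns the owner ID,
 * so the caller succeeded exactly when the result equals its own ID. */
uint64_t hmc_trylock_op(hmc_mutex_t *m, uint64_t my_tid)
{
    assert(m != NULL);
    if (m->lock == 0) {
        m->tid = my_tid;
        m->lock = 1;
    }
    return m->tid;
}

/* HMC_UNLOCK: release the lock only if the caller owns it.
 * Returns 1 on success, 0 otherwise. */
uint64_t hmc_unlock_op(hmc_mutex_t *m, uint64_t my_tid)
{
    assert(m != NULL);
    if (m->tid == my_tid && m->lock == 1) {
        m->lock = 0;
        return 1;
    }
    return 0;
}
```

For example, starting from a zeroed payload, hmc_lock_op(&m, 7) returns 1, a following hmc_lock_op(&m, 9) returns 0, and hmc_trylock_op(&m, 9) returns 7 until task 7 unlocks.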

CMC Mutex Experimentation
• Attempt to perform naïve spin-wait locks on a single mutex location
  • Deliberate hot-spotting
• Scale the number of parallel threads/tasks from 2-100
• Execute the tests for different HMC configurations
  • 4LINK-4GB
  • 8LINK-8GB
• Record:
  • Min_Cycle: Minimum number of cycles for any thread to obtain the lock
  • Max_Cycle: Maximum number of cycles for any thread to obtain the lock
  • Avg_Cycle: Average number of cycles for all threads to obtain the lock

Algorithm 1 CMC Mutex Algorithm
  for Nthreads do
    HMC_LOCK(ADDR)
    if LOCK_SUCCESS then
      HMC_UNLOCK(ADDR)
    else
      HMC_TRYLOCK(ADDR)
      while LOCK_FAILED do
        HMC_TRYLOCK(ADDR)
      end while
      HMC_UNLOCK(ADDR)
    end if
  end for
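
The shape of the experiment can be reproduced with a toy model. The sketch below is not HMC-Sim and does not model links, vaults or queues: it assumes every request costs exactly one cycle and that tasks issue requests in round-robin order, which are simplifications introduced here for illustration only. All names are hypothetical. It follows Algorithm 1 for each task and records the three statistics listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 100  /* the slides scale from 2 to 100 threads/tasks */

typedef struct { uint64_t lock; uint64_t tid; } hmc_mutex_t;
typedef struct { uint64_t min_cycle; uint64_t max_cycle; double avg_cycle; } lock_stats_t;

enum task_state { ST_LOCK, ST_SPIN, ST_UNLOCK, ST_DONE };

/* Toy model: one request per cycle, tasks served round-robin.
 * Returns 0 on success, -1 if n is out of range or out is NULL. */
int run_mutex_experiment(size_t n, lock_stats_t *out)
{
    if (n == 0 || n > MAX_TASKS || out == NULL)
        return -1;

    hmc_mutex_t m = {0, 0};
    enum task_state st[MAX_TASKS];
    uint64_t acquired[MAX_TASKS];   /* cycle at which each task got the lock */
    for (size_t t = 0; t < n; t++) { st[t] = ST_LOCK; acquired[t] = 0; }

    uint64_t cycle = 0;
    size_t done = 0;
    while (done < n) {
        for (size_t t = 0; t < n; t++) {
            uint64_t tid = (uint64_t)t + 1;   /* nonzero task IDs */
            if (st[t] == ST_DONE)
                continue;
            cycle++;                           /* every request costs one cycle */
            if (st[t] == ST_UNLOCK) {          /* HMC_UNLOCK by the owner */
                if (m.tid == tid && m.lock == 1)
                    m.lock = 0;
                st[t] = ST_DONE;
                done++;
            } else if (m.lock == 0) {          /* HMC_LOCK or HMC_TRYLOCK succeeds */
                m.tid = tid;
                m.lock = 1;
                acquired[t] = cycle;
                st[t] = ST_UNLOCK;
            } else {                           /* lock held: keep spinning on TRYLOCK */
                st[t] = ST_SPIN;
            }
        }
    }

    uint64_t min = acquired[0], max = acquired[0], sum = 0;
    for (size_t t = 0; t < n; t++) {
        assert(acquired[t] > 0);               /* every task obtained the lock */
        if (acquired[t] < min) min = acquired[t];
        if (acquired[t] > max) max = acquired[t];
        sum += acquired[t];
    }
    out->min_cycle = min;
    out->max_cycle = max;
    out->avg_cycle = (double)sum / (double)n;
    return 0;
}
```

Even this crude model shows the effect of deliberate hot-spotting: the first task obtains the lock immediately, while later tasks wait through the failed requests of every task ahead of them.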