• Cray XMT
  – Scalable, multithreaded, shared memory machine
  – Designed for single-word random global access patterns
  – Very good at large graph problems
• Next Generation Cray XMT Goals
  – Memory System Improvements
    · Improve bandwidth for random access
    · Improve capacity for large graphs
  – Hot Spot Avoidance
    · Shared memory programming models are generally susceptible to hot spotting
    · The current XMT is no exception
    · Add hot spot avoidance hardware to the CPU
• Relative latency to memory continues to increase
  – Vector processors amortize memory latency
  – Cache-based microprocessors reduce memory latency
  – Multithreaded processors tolerate memory latency
• Multithreading is most effective when:
  – Parallelism is abundant
  – Data locality is scarce
• Large graph problems perform well on the Cray XMT
  – Semantic databases
  – Big data
• A thread is a software object
  – A program counter and a set of registers
  – Very lightweight
  – Not pthreads: no OS state
• A stream is a hardware object
  – Stores and manipulates a thread's state
  – Very lightweight stream creation: a single instruction executed from user space
• More threads than streams
  – Threads are multiplexed onto the processor's streams
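To make the thread/stream split concrete, here is a minimal sketch in Cray XMT C using the future construct (hedged: the syntax follows my reading of the XMT programming environment docs, and fib is a toy example, not from the slides):

```c
/* Sketch: each future spawns a lightweight thread; the runtime
 * multiplexes threads onto the processor's hardware streams. */
long fib(long n)
{
    if (n < 2) return n;
    future long x$;        /* sync result variable, starts empty */
    future x$(n)           /* spawn a thread for fib(n - 1) */
    {
        return fib(n - 1);
    }
    long y = fib(n - 2);   /* the parent thread keeps working */
    return x$ + y;         /* reading x$ blocks until it is full */
}
```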
• The XMT memory word has 66 bits
  – 64 bits of data, byte addressable
    · Data is stored big-endian
  – 2 tag bits
    · The full/empty bit: used for synchronization
    · The extended bit: set when the entry is forwarded or when a trap bit is set

  [Diagram: word layout with 64 data bits, an extended bit, and a full/empty bit]
• Specified by pointer or instruction
• Three access modes
  – FE_NORMAL
  – FE_FUTURE
  – FE_SYNC (readFE, writeEF)
• Provides efficient, abundant, fine-grained synchronization

  [Diagram: Stream 1 runs Code A, then writeEF X; Stream 2 blocks in readFE X until X is full, then runs Code B]
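The two-stream handoff in the diagram might look like this in Cray XMT C, assuming the readfe/writeef compiler generics (compute_a and do_b are hypothetical helpers):

```c
sync long x$;                /* shared sync word; starts empty */

void stream1(void)           /* producer */
{
    long a = compute_a();    /* "Code A" (hypothetical helper) */
    writeef(&x$, a);         /* wait for empty, store, set full */
}

void stream2(void)           /* consumer */
{
    long v = readfe(&x$);    /* wait for full, load, set empty */
    do_b(v);                 /* "Code B" (hypothetical helper) */
}
```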
[Diagram: Cray XMT blade with Threadstorm3 processors]
• Storage to track up to 1024 memory references
• Performs data address translation
  – Relocation according to domain data state
  – Scrambling to hash address bits
  – Distribution to spread references across the machine
• Issues requests to the Switch
• Handles retries if necessary
• Updates stream state upon completion
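As a toy illustration of the scrambling step (the actual Threadstorm hash is not given here; this XOR-fold is purely illustrative):

```c
#include <stdint.h>

/* Toy address scrambler: XOR-fold high address bits into low ones so
 * that strided or clustered addresses scatter across memory nodes.
 * Illustration only; not the real Threadstorm hash function. */
static uint64_t scramble(uint64_t addr)
{
    addr ^= addr >> 17;
    addr ^= addr >> 31;
    return addr;
}
```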
• All remote memory references go through the RMA block in the HyperTransport Bridge
• The RMA block serves three purposes:
  – Bypass HT native addressing to allow up to 512 TB of memory to be directly referenced
  – Support extended memory semantics
  – Encapsulate multiple references in each HT packet for efficient use of the link
• All RMA traffic is packed into the 64-byte payload of HT posted writes
[Diagram: Next Generation Cray XMT blade with Threadstorm4 processors]
• Two memory controllers per node
  – Each 50% faster than the current implementation
  – 3x bandwidth improvement
  – 8x capacity improvement
• Optimized for single 8B-word random address accesses
• 64-bit adder for atomic Fetch&Add
• 128 KB buffer cache between Switch and DIMMs
  – No coherency issues: all DIMM operations go through the cache
  – The buffer is associated with the physical memory, not the processor
  – 64 B cache line
• Standard DIMMs store 9 bytes per address
  – 8 bytes for data
  – 1 byte for check bits
• Each DIMM rank is implemented with 18 4-bit memory parts
  – Correct any number of errors in a single part
• Gang two DIMMs together
  – Reed-Solomon code implemented over two flit times
  – 288 bits total
    · 32 parts for data
    · 1 part for state
    · 3 parts for check bits

  [Diagram: DIMM0 and DIMM1 as a ganged pair]
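A quick consistency check on those totals (my arithmetic, inferred from the figures above rather than stated on the slide):

  2 DIMMs x 18 parts x 4 bits/part x 2 flits = 288 bits
  2 DIMMs x 18 parts = 36 parts = 32 data + 1 state + 3 check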
• DDR2 registered DIMMs at 300 MHz
  – Supports Burst=4
    · Allows a 64 B cache line in ganged mode
    · Better for single-word random accesses
    · DDR3 only supports Burst=8, doubling the cache line size
  – Better timing windows
• DIMMs supported by hardware:
  – 4 GB dual rank
  – 8 GB dual rank
  – 8 GB quad rank
• 8 DIMM slots per node
  – 32 GB per node using 4 GB DIMMs
  – 64 GB per node using 8 GB DIMMs
• Many streams may access the same memory location simultaneously
• Threadstorm4 solves the problem in the M-unit
  – Allow only one outstanding reference of a given type for each address
  – Use the network more efficiently
• Synchronized Reference CAM for readFE (or writeEF)
  – Only one operation can find the location full (or empty)
  – Others are deferred and tried later
• Fetch&Add Combining CAM
  – Fetch&Add operands to the same address are combined in the M-unit
  – One network request satisfies multiple Fetch&Add requests
• readFE waits for full, then loads and sets empty
• writeEF waits for empty, then stores and sets full
• A critical code segment may be protected by readFE/writeEF
  – If frequently executed, readFE may cause a hot spot
• Retries handled by the M-unit: one round trip to memory and back for each retry
• Each processor may issue about 100 readFE operations at once
  – At most one will be successful
  – The others just consume network and memory bandwidth
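For concreteness, a minimal sketch of this lock pattern in Cray XMT C, assuming the readfe/writeef generics (shared_total is a hypothetical shared variable, and I assume a statically initialized sync variable starts full):

```c
/* A lock built from one sync word: the classic hot-spot pattern.
 * On the current XMT, every waiting stream's readfe retry costs a
 * full round trip to the memory node holding lock$. */
sync int lock$ = 1;                 /* assumed initialized full */
double shared_total = 0.0;          /* hypothetical shared data */

void add_to_total(double delta)
{
    int t = readfe(&lock$);         /* acquire: wait full, set empty */
    shared_total += delta;          /* critical section */
    writeef(&lock$, t);             /* release: store, set full */
}
```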
• The SynchRef CAM in the next generation Cray XMT avoids hot spots
  – Only one readFE to a given address can succeed
  – Don't allow more than one on the network
• When a readFE would be injected, check in the CAM
  – The CAM entry is deallocated when the response is received
• Test the SynchRef CAM with the worst possible program
  – A large reduction protected by a readFE/writeEF pair
  – Only one stream at a time does work
• Run on 100 streams per processor
  – For N processors, 100*N streams compete to read the location
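A sketch of what such a worst-case reduction might look like (hedged: the actual benchmark code is not shown on the slides; data and n are hypothetical):

```c
/* Worst-case reduction: every iteration serializes on one sync word.
 * With 100 streams per processor, 100*N streams contend for acc$. */
sync double acc$ = 0.0;

double reduce(const double *data, long n)
{
    #pragma mta assert parallel
    for (long i = 0; i < n; i++) {
        double a = readfe(&acc$);    /* only one stream gets through */
        writeef(&acc$, a + data[i]);
    }
    return readff(&acc$);            /* read final value, leave full */
}
```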
• The Cray XMT supports fetching and non-fetching atomic add operations
• A single memory location may be accessed by all streams
  – A queue pointer or a global reduction
• Each processor generates about 100 Fetch&Add requests
  – Oversubscribes the memory node
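In XMT C the fetching form is exposed as the int_fetch_add generic; a sketch of both forms (the queue example is mine, and my understanding is that an ordinary shared-counter update can compile to a non-fetching atomic add):

```c
long tail = 0;          /* shared queue pointer */
long hits = 0;          /* shared counter */

/* Fetching add: returns the old value, so each stream claims a
 * unique queue slot. */
long claim_slot(void)
{
    return int_fetch_add(&tail, 1);
}

/* Non-fetching add: no result needed, just the atomic update
 * (assumed compiled to a non-fetching atomic add). */
void count_hit(void)
{
    hits += 1;
}
```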
• Fetch&Add Combining in the next generation Cray XMT eliminates hot spots
• A Fetch&Add operation checks in the F&A Combining CAM (FACC)
  – If a match is not found, allocate an entry in the FACC
  – If a match is found, attach itself to a linked list of dependents
• FACC entry
  – Accumulates data
  – Generates a network request after a specified wait time
• F&A Retirement CAM entry
  – Allocated when the network request is made
  – Holds a pointer to the linked list of dependents
  – When the response is received, multiple register file writes are generated
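A toy software model of the combining semantics may help (my illustration of the idea, not the hardware design): if memory holds B and requests for deltas d0, d1, d2 combine, one network Fetch&Add of (d0+d1+d2) returns B, and each requester's fetched value is rebuilt from locally recorded prefix sums:

```c
/* Toy model of Fetch&Add combining (illustration only). Requests to
 * one address within the wait window are summed locally; a single
 * network Fetch&Add carries the combined delta; each dependent's
 * fetched value is reconstructed when the response arrives. */
typedef struct {
    long sum;        /* combined delta so far */
    int  nreq;       /* number of combined requests */
    long prefix[8];  /* each requester's share of the old value */
} facc_entry;

/* Join a request; returns this requester's index in the entry. */
int facc_join(facc_entry *e, long delta)
{
    e->prefix[e->nreq] = e->sum;  /* sum of the earlier deltas */
    e->sum += delta;
    return e->nreq++;
}

/* One Fetch&Add of e->sum returns `base`, the old memory value.
 * Requester i's fetched value is then base + prefix[i]. */
long facc_fetched(const facc_entry *e, long base, int i)
{
    return base + e->prefix[i];
}
```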
• Current Cray XMT trick when updating a global accumulator
  – Make several copies of the accumulator
  – Randomly select one to update
  – Requires an additional computation at the end
• Test the F&A Combining logic using this trick
  – Perform a global additive reduction
  – Vary the number of copies: 1, 2, 4, 8, 16, 32
• Current Cray XMT
  – Hot spot created with small numbers of copies
  – Performance improves as copies are added
• Next generation Cray XMT
  – Performs best with a single copy
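A sketch of the multi-copy trick on the current XMT (hedged: the copy count, the hash standing in for a random pick, and all names are my own illustration):

```c
/* Multi-copy accumulator workaround: spread Fetch&Add traffic
 * across NCOPIES counters, then pay for an extra reduction over
 * the copies at the end. */
#define NCOPIES 16
long copies[NCOPIES];

void accumulate(const long *vals, long n)
{
    #pragma mta assert parallel
    for (long i = 0; i < n; i++) {
        /* Cheap multiplicative hash stands in for a random pick. */
        int k = (int)((i * 2654435761UL) % NCOPIES);
        int_fetch_add(&copies[k], vals[i]);
    }
}

long total(void)    /* the "additional computation at the end" */
{
    long t = 0;
    for (int k = 0; k < NCOPIES; k++)
        t += copies[k];
    return t;
}
```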
• The Next Generation builds on the successful Cray XMT
• Memory system improved significantly
  – 3x improvement in bandwidth
  – 8x improvement in capacity
• Hot Spot Avoidance
  – Productivity: the simple implementation performs best
  – Reliability: difficult programs cannot interrupt system services
  – Performance: uses the network more efficiently