Niagara(T1) A CMT PROCESSOR Rao Shoaib Solaris Core Technology group rao.shoaib@sun.com
Agenda:
● Why CMT Processors
● Highlights of Sun Niagara Processor
● Performance characteristics of T1
● Need for Virtualization
● CMT & Virtualization
● Sun Virtualization Solutions
● HW and Software Network Virtualization
Sun Proprietary Information
Case For CMT Processors
Traditional processor behavior
[Figure: two thread timelines, each alternating compute (C) and memory latency (M) phases. Top: single scalar processor. Bottom: processor optimized for ILP, with the "Time Saved" relative to the scalar case marked.]
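The timeline above can be read as a two-term cost model: each phase costs some compute time plus some memory latency, and ILP shortens only the compute term. The sketch below works that arithmetic through. The function name and the numbers are illustrative assumptions, not figures from the slides.

```python
def ilp_speedup(compute, memory, ilp_factor):
    """Overall speedup when only the compute part runs ilp_factor times faster.

    compute, memory: time spent computing and waiting on memory (same unit).
    """
    before = compute + memory
    after = compute / ilp_factor + memory  # memory latency is unchanged by ILP
    return before / after

# Assumed workload: 1 unit of compute for every 3 units of memory stall.
speedup = ilp_speedup(compute=1.0, memory=3.0, ilp_factor=2.0)
print(f"{speedup:.2f}x")  # 1.14x
```

Under these assumed numbers, doubling compute speed yields only about a 14% gain, because the memory stall dominates the total.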
Characteristics of Commercial Workloads
● High degree of thread level parallelism (TLP)
● Large working sets result in poor locality of reference, leading to high cache miss rates
● There is significant data sharing among threads, resulting in coherence misses
● There is low instruction level parallelism (ILP) due to high cache miss rates, hard-to-predict branches, etc.
● Performance is bottlenecked by stalls on memory access
Sun Solution: NIAGARA Chip Multi-Threaded Processor
Niagara(T1)
● Uses CPU threads to exploit TLP
– Memory and pipeline stall times are hidden by running multiple threads
– Shared L2 cache allows efficient data sharing between threads
● Memory system is designed for high throughput
– High bandwidth interface to L2 cache for L1 misses
– Highly associative L2 cache
– High bandwidth interface to DRAM
Designed for Performance and Efficiency
[Figure: Niagara block diagram. Eight cores (C1 to C8) connect through a crossbar (Xbar) to four L2$ banks and an FPU; four integrated memory controllers drive four DDR-2 SDRAM channels; a system interface buffer switch core connects to the bus. Callouts highlight on-chip simplicity, dedicated integrated memory controllers, integrated internal communications, and a clean-sheet design for performance and efficiency.]
Niagara Specs
● Up to 32 threads, 8 cores
● Unique L1$: 16KB-I, 8KB-D per core
● Shared L2$: 3MB, 134GB/s, 12-way associative
● Radically changed cache coherency processing
● 4x DDR2 on-chip memory controllers, 23GB/sec
● Up to 128 GB memory
● SSL support: 7X the RSA throughput of Xeon
● Requires about 70 Watts
● Each thread requires just about 2.0 watts
● No recompilation required
Thread Selection Policy
● CPU switches between available threads every cycle, giving priority to the least recently executed thread
● Threads become unavailable due to:
– Long latency ops: loads, branch, mul, div
– Pipeline stalls such as cache misses, traps, and resource conflicts
● Loads are speculated as cache hits, and the thread is switched in with lower priority.
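The selection rule above (every cycle, pick the least recently executed thread among those available) can be sketched as a toy scheduler. This is a simplified model under stated assumptions: the function names and the `stalled_at` input format are hypothetical, and the lower priority given to threads with speculated loads is not modeled.

```python
def select_thread(available, last_run, cycle):
    """Pick the available thread that executed least recently."""
    if not available:
        return None  # every thread is stalled: the pipeline idles this cycle
    choice = min(available, key=lambda t: last_run[t])
    last_run[choice] = cycle
    return choice

def simulate(stalled_at, cycles, threads=4):
    """Return the thread chosen in each cycle.

    stalled_at maps a cycle number to the set of thread ids that are
    unavailable in that cycle (hypothetical input format).
    """
    last_run = {t: -1 for t in range(threads)}  # -1 means "never executed"
    schedule = []
    for cycle in range(cycles):
        stalled = stalled_at.get(cycle, set())
        available = [t for t in range(threads) if t not in stalled]
        schedule.append(select_thread(available, last_run, cycle))
    return schedule

# Thread 1 is unavailable in cycles 1 and 2 (for example, waiting on a load).
print(simulate({1: {1}, 2: {1}}, cycles=6))  # [0, 2, 3, 1, 0, 2]
```

With no stalls the schedule degenerates to round-robin. When thread 1 stalls, the other threads fill its slots, and thread 1 is picked first once it becomes available again because it is then the least recently executed.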
Multithreaded Process on Niagara
[Figure: timelines for eight pipelines (Pipe0 to Pipe7), each interleaving four threads (Thread 1 to Thread 4); each thread alternates compute and memory latency phases along the time axis.]
A larger number of memory references outstanding from overlapping h/w threads leads to higher throughput.
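A rough way to quantify the overlap shown above is to estimate how busy one pipeline stays as hardware threads are added. The sketch assumes an idealized model: each thread alternates a fixed compute time with a fixed memory latency, and threads do not contend for caches or memory bandwidth. The function name and the numbers are illustrative, not from the slides.

```python
def pipeline_utilization(threads, compute, memory):
    """Fraction of cycles one pipeline is busy with `threads` interleaved threads.

    Each thread needs the pipeline for `compute` out of every
    `compute + memory` time units; utilization saturates at 1.0.
    """
    return min(1.0, threads * compute / (compute + memory))

# Assumed workload: 1 unit of compute per 3 units of memory latency.
for n in (1, 2, 4):
    print(n, pipeline_utilization(n, compute=1.0, memory=3.0))
```

Under these assumptions one thread keeps the pipeline 25% busy, and four threads are enough to saturate it.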
SWaP (Space, Watts and Perf)
SWaP Rating = Performance / (Space * Watts)
Sun Fire T2000:
● Performance: 19,000 users (1)
● Space: 2RU
● Watts: 312
● SWaP Rating = 30.4
1. Lotus R6iNotes
Sun Confidential: Sun Employees and Authorized Partners Only
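The rating above follows directly from the formula, as the short check below shows. The function name is an illustrative choice; the inputs are the Sun Fire T2000 figures from the slide.

```python
def swap_rating(performance, rack_units, watts):
    """SWaP rating = performance / (space * watts)."""
    return performance / (rack_units * watts)

# Sun Fire T2000 figures from the slide: 19,000 users, 2RU, 312 watts.
rating = swap_rating(performance=19_000, rack_units=2, watts=312)
print(round(rating, 1))  # 30.4
```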
Sun Fire T1000 Crushes Xeon and p5+
Sun Fire T1000 vs.   Dell SC1425     IBM p5+ 520
Benchmark            SPECjbb2005     SPECjbb2005
Performance          2.1X            1.6X
Power Usage          1/2             1/2
Space                Same            1/4
SWaP                 4.4X            14X
Niagara-2 (T2): True System on a Chip
● Better performance than Niagara-1
● Up to 8 cores
● Up to 64 threads per CPU
● Same power envelope as T1
● On-chip NICs
● And much more that I cannot state
Performance Characteristics of T1
Positive Characteristics
● If a strand is stalled, its cycles can be utilized by other threads
● Multiple threads running the same application benefit by sharing text and data in L2 cache
● These characteristics make CMT ideal for throughput computing.
Not so Positive Characteristics
● If one thread is thrashing the L1 instruction cache, data cache, or TLBs on a core, it can adversely affect other threads on that core.
● If all four threads on a core are busy, each gets only about one-quarter of that core's CPU time.
● So CMT is not ideal for real-time applications.
Scaling issues to be aware of
● Hot locks are the most common reason applications fail to scale on CMT processors
● Tuning Critical Sections
● Apply more threads, as CMT is a thread-rich environment.
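One common way to cool a hot lock is to split the protected data into stripes, each with its own lock, and to keep each critical section as small as possible. The sketch below illustrates that structure with a striped counter; the class and function names are hypothetical. It shows the pattern only: CPython's global interpreter lock means this particular script will not demonstrate a speedup.

```python
import threading

class StripedCounter:
    """Counter split into stripes, each with its own lock, to spread contention."""

    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._counts = [0] * stripes

    def increment(self, key):
        i = hash(key) % len(self._locks)  # threads with different keys tend to use different locks
        with self._locks[i]:              # critical section: a single small update
            self._counts[i] += 1

    def total(self):
        total = 0
        for i, lock in enumerate(self._locks):
            with lock:
                total += self._counts[i]
        return total

def worker(counter, worker_id, n):
    for _ in range(n):
        counter.increment(worker_id)

counter = StripedCounter()
threads = [threading.Thread(target=worker, args=(counter, w, 1000)) for w in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.total())  # 32000
```

With 32 workers and 8 stripes, at most four workers compete for any one lock, instead of all 32 competing for a single global lock.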
Server Virtualization
Benefits of Virtualization
● Virtualization is the masking and sharing of server resources
● Results in:
– Server consolidation
– Higher server utilization
– Increased operational efficiency
– Improved manageability
CMT and Virtualization
● CMT provides hooks for server virtualization
● Each strand can be a virtual CPU
● Niagara-2 also provides support for network virtualization
Solaris Virtualization Solutions
● Containers (similar to BSD Jails)
● Logical Domains (individual OS instance per domain)
● Xen
Logical Domains + Zones
• Partitioning capability
> Create virtual machines, each with a sub-set of resources
> Protection & isolation using HW+firmware combination
[Figure: three logical domains (LDom 1 and LDom 2 running Solaris 10, LDom 3 running Solaris 11) host applications, some of them inside zones (Zone 1, Zone 2). The domains sit on a Hypervisor, which runs on hardware with shared CPU, memory, and I/O.]
Network Virtualization
HW Based Network Virtualization
● Niagara-2 (T2) has on-chip network interfaces
● Supports network virtualization/partitioning
– Multiple partitions can co-exist within a port
– Only the cable, MAC and RX FIFOs are shared.
● Virtualization/partitioning can be based on
– VLANs: up to 4K per port
– MAC address: up to 16 per port
– Service addresses (IP addresses, TCP/UDP ports): up to 256 per device
● Interrupts for a flow are sent to a particular CPU
● Full register sets are provided to control RX rings
NIU RX Classification Model
Incoming flows are classified at layer 2, 3, or 4 and put into an RX DMA channel according to the classification rules that matched the flow.
[Figure: incoming traffic enters the MAC, passes through the NIU flow classifier, and is distributed across multiple RX DMA channels.]
Solaris Classification Interface:
m_l2_classify_add()
m_l2_classify_remove()
m_classify_add()
m_classify_remove()
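The classification step can be pictured as a rule table lookup: each rule names layer 2, 3, or 4 fields and the RX DMA channel that matching packets go to. The sketch below is a software toy under stated assumptions. The `Packet`, `Rule`, and `classify` names are hypothetical, it does not model the Solaris interface listed above, and both the first-match ordering and the default channel for unmatched traffic are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Packet:
    dst_mac: str
    vlan: Optional[int]
    dst_ip: str
    dst_port: int

@dataclass(frozen=True)
class Rule:
    """A rule matches when every field it sets equals the packet's field."""
    channel: int
    dst_mac: Optional[str] = None   # layer 2
    vlan: Optional[int] = None      # layer 2
    dst_ip: Optional[str] = None    # layer 3
    dst_port: Optional[int] = None  # layer 4

    def matches(self, pkt):
        fields = ("dst_mac", "vlan", "dst_ip", "dst_port")
        return all(
            getattr(self, f) is None or getattr(self, f) == getattr(pkt, f)
            for f in fields
        )

def classify(pkt, rules, default_channel=0):
    """Return the RX DMA channel of the first matching rule (assumed policy)."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.channel
    return default_channel  # assumption: unmatched traffic uses a default channel

rules = [
    Rule(channel=3, dst_ip="10.0.0.5", dst_port=443),  # layer 3/4 service address
    Rule(channel=2, vlan=100),                         # layer 2 VLAN
    Rule(channel=1, dst_mac="00:14:4f:00:00:01"),      # layer 2 MAC address
]

pkt = Packet(dst_mac="00:14:4f:00:00:01", vlan=100, dst_ip="10.0.0.5", dst_port=443)
print(classify(pkt, rules))  # 3
```

Listing the most specific rules first lets a service-address rule take precedence over a VLAN or MAC rule that would also match the same packet.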
Software Based Network Virtualization
● Not all NICs have HW support for virtualization
● Software creates virtual stacks over 1Gb and 10Gb NICs
● Virtual stacks are isolated from each other (for both resource and security purposes)
● Each virtual stack can be tuned separately
Virtualized Networking
[Figure: a NIC with a flow classifier steers traffic into per-zone memory areas (Global Zone, Zone 1 ... Zone n) and virtual NICs, which are common to all virtual machines. Above them, specific to containers, each zone has its own squeue and network stack: one exclusive network stack, and network stacks shared with the global zone.]
Virtual Network with XEN
[Figure: a NIC with a flow classifier steers traffic into per-OS memory areas (host OS; guest OS 1 split into HTTP, HTTPS, and default; guest OS 2 for all traffic). Each memory area is backed by a virtual NIC and a squeue. VNICs and virtual squeues connect these to a NIC virtualization engine in the Solaris host OS and in each Solaris guest OS.]
Future Work
● More work is needed to characterize different workloads on CMT processors and define best practices
● Open interfaces are needed to implement virtualization
● Network bandwidth/resource control support is needed in HW
References
● Various Sun internal and external documents and publications on Niagara