Advanced Seminar – Computer Engineering
Philipp Gsching
08.12.2015
ARM big.LITTLE Technology
1. Introduction
2. ARM Architecture
   1. Instruction Set
   2. Microarchitecture
   3. CPUs
3. big.LITTLE
   1. Cache Coherency
   2. Distributed Virtual Memory
   3. Performance
4. Conclusion
Smartphone/Tablet use cases:
1. Idle most of the time: low-power CPU
2. High-performance requirements: high-performance CPU
Difficult to achieve with one CPU
Idea: ARM big.LITTLE
Fusing a low-power and a high-performance CPU in one chip
big (Cortex A57, cores 1–4, L2 Cache):
• Gaming
• HD videos
• Rich Web
• Internet
• …
LITTLE (Cortex A53, cores 1–4, L2 Cache):
• OS
• UI Services
• E-Mail
• …
Both clusters are connected by a Cache Coherent Interconnect
Basics
Advanced RISC Machines (ARM)
Founded: 1990 by Acorn, Apple and VLSI
Origin: Microcontrollers / Embedded Systems
Business model: design and licensing of Intellectual Property (IP)
Revenue: 1.2 billion USD (Intel: 55.8 billion USD)
Employees: 3,300 (Intel: 106,700)
Market share: > 90% (2014, smartphone/tablet)
ARM Instruction Set: RISC (Reduced Instruction Set Computing)
RISC (ARM)            CISC (IA-32)
MOV r2, #8            ADD $1, 4(%eax, %ebx, 8)
MUL r1, r1, r2
ADD r0, r0, r1
ADD r0, r0, #4
LDR r3, [r0]
ADD r3, r3, #1
STR r3, [r0]
ARM Instruction Set: RISC (Reduced Instruction Set Computing)
• 16 general purpose registers + 2 status registers
• 32-bit fixed-size instructions
• Condition Codes for (almost) all instructions
• Barrel Shifter for ALU
• 16-bit fixed-size THUMB instructions
• Digital Signal Processing (DSP) instructions
• Cryptography Extension Instructions
Not strictly RISC
RISC (ARM)                 CISC (IA-32)
MOV r2, #8                 ADD $1, 4(%eax, %ebx, 8)
MUL r1, r1, r2             (Microcode)
ADD r0, r0, r1
ADD r0, r0, #4
LDR r3, [r0]
ADD r3, r3, #1
STR r3, [r0]

RISC (ARM), using the barrel shifter and pre-indexed addressing:
ADD r0, r0, r1, LSL #3
LDR r3, [r0, #4]!
ADD r3, #1
STR r3, [r0]
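To check that both ARM sequences do the same work as the single IA-32 instruction (increment the word at base + index·8 + 4), here is a minimal Python sketch that models registers and memory as dictionaries. The function names, the sample values, and the pairing of r0/r1 with %eax/%ebx are assumptions made for illustration; they are not part of the slides.

```python
def arm_simple(regs, mem):
    """Seven-instruction version: the address is built with explicit MUL/ADD."""
    r, m = dict(regs), dict(mem)
    r["r2"] = 8                       # MOV r2, #8
    r["r1"] = r["r1"] * r["r2"]       # MUL r1, r1, r2
    r["r0"] = r["r0"] + r["r1"]       # ADD r0, r0, r1
    r["r0"] = r["r0"] + 4             # ADD r0, r0, #4
    r["r3"] = m[r["r0"]]              # LDR r3, [r0]
    r["r3"] = r["r3"] + 1             # ADD r3, r3, #1
    m[r["r0"]] = r["r3"]              # STR r3, [r0]
    return r, m


def arm_optimized(regs, mem):
    """Four-instruction version: shift folded into ADD, offset folded into LDR."""
    r, m = dict(regs), dict(mem)
    r["r0"] = r["r0"] + (r["r1"] << 3)  # ADD r0, r0, r1, LSL #3
    r["r0"] = r["r0"] + 4               # LDR r3, [r0, #4]!  (write-back to r0 ...)
    r["r3"] = m[r["r0"]]                #                    (... then load)
    r["r3"] = r["r3"] + 1               # ADD r3, #1
    m[r["r0"]] = r["r3"]                # STR r3, [r0]
    return r, m


# Hypothetical inputs: base address in r0, element index in r1.
regs = {"r0": 0x1000, "r1": 3, "r2": 0, "r3": 0}
target = 0x1000 + 3 * 8 + 4
mem = {target: 41}

_, mem_simple = arm_simple(regs, mem)
_, mem_optimized = arm_optimized(regs, mem)
print(mem_simple == mem_optimized, mem_optimized[target])
```

Both functions leave memory in the same state; the shorter one relies on the barrel shifter (LSL #3 multiplies by 8) and on pre-indexed addressing with write-back.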
Instruction Set Architecture (ISA) has no significant impact on performance and power consumption
Figure: Average Power (normalized) for
• A8 (ARM, 0.6GHz, 65nm, iPhone 4)
• A15 (ARM, 1.66GHz, 32nm, Galaxy S4)
• Atom (x86, 1.66GHz, 45nm, Netbook)
• i7 (x86, 3.4GHz, 32nm, Desktop)
Tech-independent, scaled to 1GHz, 45 nm process, normalized to A8
ARM Microarchitecture:
• Technology-node and feature size: reducing capacitance
• Voltage and Frequency Scaling: dynamically adjusting supply voltage and clock speed according to need
• Power-domains: power supply for different sections of the core can be turned on/off independently
• Clock-gating: clock for different sections of the core can be turned on/off independently
• Power-modes: predefined low-power modes utilizing the above-mentioned features
• Pipelining: reducing idle time of different parts of the core
• Caches: reducing time- and power-intensive accesses to main memory
• SoC (System-On-A-Chip) design: adjusting all components of a processor to one another
ARM's emphasis is on power consumption and size: momentum for the mobile market
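Why capacitance, voltage and frequency are the main levers can be seen from the standard first-order model of dynamic CMOS power, P ≈ α·C·V²·f. The sketch below evaluates that model for two operating points. The model is textbook material rather than something stated on the slides, and all numeric values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz, activity=1.0):
    """First-order dynamic power model: P = activity * C * V^2 * f, in watts."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating points (illustrative values, not measurements).
high = dynamic_power(1e-9, 1.0, 2.0e9)  # 1 nF switched, 1.0 V, 2.0 GHz
low = dynamic_power(1e-9, 0.5, 1.0e9)   # same capacitance, half voltage, half clock

print(f"high: {high:.3f} W, low: {low:.3f} W, ratio: {high / low:.1f}x")
```

Because voltage enters quadratically, halving both voltage and frequency cuts dynamic power to one eighth in this model, which is why voltage and frequency are scaled together.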
Figure: Core 1 (MMU, L1 Cache for instructions and data, µTLB for instructions and data, TLB, Arbiter, BUS) inside a Cluster with Snoop Controller Unit and shared L2 Cache, inside the SoC
Cortex A53: 8-stage (integer), in-order
Cortex A57: 15-stage (integer), out-of-order
                         LITTLE            big
CPU                      Cortex A53        Cortex A57
64-bit                   Yes               Yes
Cores                    1 – 4             1 – 4
Frequency*               1.3 GHz           1.9 GHz
L1 Cache                 8 – 64 kB         48/32 kB
L2 Cache                 128 – 2,048 kB    512 – 2,048 kB
Pipeline: Integer depth  8                 15
Pipeline: Out-of-order   No                Yes
Performance              2.3 DMIPS/MHz     4.1 DMIPS/MHz
Technology node*         20 nm             20 nm
Core Size*               0.70 mm²          2.05 mm²
Cluster Size*            4.58 mm²          15.10 mm²
* Values for SoC Samsung Exynos 5433 (Galaxy Note 4)
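The table figures can be combined into rough per-core numbers. The sketch below multiplies DMIPS/MHz by the listed frequency and relates the result to core area. It uses only values from the table; treating the product as a peak single-core figure is a simplifying assumption, and the dictionary layout and function name are hypothetical.

```python
# Values copied from the table above (frequency and core size are for the Exynos 5433).
cores = {
    "Cortex A53": {"dmips_per_mhz": 2.3, "mhz": 1300, "core_mm2": 0.70},
    "Cortex A57": {"dmips_per_mhz": 4.1, "mhz": 1900, "core_mm2": 2.05},
}


def peak_dmips(spec):
    """Idealized single-core throughput: DMIPS/MHz times clock in MHz."""
    return spec["dmips_per_mhz"] * spec["mhz"]


for name, spec in cores.items():
    total = peak_dmips(spec)
    density = total / spec["core_mm2"]
    print(f"{name}: {total:.0f} DMIPS, {density:.0f} DMIPS per mm²")
```

Under these assumptions the A57 core reaches about 7,790 DMIPS against about 2,990 for the A53, while the A53 delivers more DMIPS per mm² of core area.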
Cortex-A Power Consumption
Figure: Power Consumption (W, 0 to 8) versus Frequency (MHz, 400 to 1900) for A53 (1 Core), A53 (4 Cores), A57 (1 Core) and A57 (4 Cores)
SoC: Samsung Exynos 5433 (Galaxy Note 4)
Heterogeneous multi-processing
Connecting two heterogeneous clusters…
Figure: big cluster (Cortex A57, cores 1–4, L2 Cache) and LITTLE cluster (Cortex A53, cores 1–4, L2 Cache)
Binary compatible
Figure: big cluster (Cortex A57) and LITTLE cluster (Cortex A53), each with cores 1–4 and an L2 Cache, each with its own AXI port
AXI = Advanced eXtensible Interface
AXI channels: Read_Address, Read_Data, Write_Address, Write_Data, Write_Ack
ACE = AXI Coherency Extension
ACE channels: C_Address, C_Data, C_Response
Both clusters attach through AXI ACE to the Cache Coherent Interconnect
Figure: SoC containing both clusters (cores 1–4 with L2 Cache each), the Cache Coherent Interconnect, Memory Controller, BUS, Periphery, Display and GPU
Coherency States
                 Valid                          Invalid
          Unique          Shared
Dirty     Unique Dirty    Shared Dirty          Invalid
Clean     Unique Clean    Shared Clean
Analogous to the MOESI protocol: Modified, Owned, Exclusive, Shared, Invalid
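The five states follow from three attributes of a cache line: valid or invalid, unique or shared, dirty or clean. The sketch below derives the state name from those attributes and pairs it with the MOESI name it is usually compared to. The slide only says the two schemes are analogous; the one-to-one pairing shown here is the commonly cited correspondence, and the function and dictionary names are hypothetical.

```python
# Commonly cited correspondence between ACE state names and MOESI names.
ACE_TO_MOESI = {
    "UniqueDirty": "Modified",
    "SharedDirty": "Owned",
    "UniqueClean": "Exclusive",
    "SharedClean": "Shared",
    "Invalid": "Invalid",
}


def ace_state(valid, unique=False, dirty=False):
    """Build the ACE state name from the three attributes in the table."""
    if not valid:
        return "Invalid"
    return ("Unique" if unique else "Shared") + ("Dirty" if dirty else "Clean")


for valid, unique, dirty in [
    (True, True, True),
    (True, False, True),
    (True, True, False),
    (True, False, False),
    (False, False, False),
]:
    state = ace_state(valid, unique, dirty)
    print(f"{state:12s} ~ {ACE_TO_MOESI[state]}")
```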
Figure: big cluster (Cortex A57) and LITTLE cluster (Cortex A53), each with cores 1–4 and a Cache, attached through AXI ACE to the Cache Coherent Interconnect (CCI), which connects to main memory
1. LITTLE load(A)
2. CCI snoop(A)
3. big resp(miss)
4. CCI load_mem(A)
5. CCI return(A)
Result: A is held in the LITTLE cluster (core 1 and cluster Cache), marked u (unique)
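The five steps above can be replayed with a small Python model of two clusters and an interconnect. This is a sketch under stated assumptions, not the CCI's actual behaviour: the class names, the value stored at A and the log format are hypothetical, dirty lines and write-backs are ignored, and the snoop-hit branch is an assumption that goes beyond what the slides show.

```python
class Cluster:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # address -> (value, state)

    def snoop(self, addr):
        """Answer a snoop: the cached value, or None on a miss."""
        entry = self.cache.get(addr)
        return entry[0] if entry else None


class Interconnect:
    """Toy cache coherent interconnect that snoops before going to memory."""

    def __init__(self, clusters, memory):
        self.clusters = clusters
        self.memory = memory
        self.log = []

    def load(self, requester, addr):
        self.log.append(f"{requester.name} load({addr})")
        for other in self.clusters:
            if other is requester:
                continue
            self.log.append(f"CCI snoop({addr})")
            value = other.snoop(addr)
            if value is not None:
                # Assumed hit handling: both copies become shared.
                self.log.append(f"{other.name} resp(hit)")
                other.cache[addr] = (value, "shared")
                requester.cache[addr] = (value, "shared")
                self.log.append(f"CCI return({addr})")
                return value
            self.log.append(f"{other.name} resp(miss)")
        # No other cluster holds the line: fetch it from main memory.
        self.log.append(f"CCI load_mem({addr})")
        value = self.memory[addr]
        requester.cache[addr] = (value, "unique")
        self.log.append(f"CCI return({addr})")
        return value


big, little = Cluster("big"), Cluster("LITTLE")
cci = Interconnect([big, little], memory={"A": 7})
cci.load(little, "A")
for step, entry in enumerate(cci.log, start=1):
    print(f"{step}. {entry}")
print(little.cache["A"])
```

After the load, only the LITTLE cluster holds A, so the line is marked unique; a later load of A from the big cluster would take the assumed hit branch and leave both copies shared.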