
ARM big.LITTLE Technology

Advanced Seminar Computer Engineering, Philipp Gsching, 08.12.2015


  1. Advanced Seminar: Computer Engineering. Philipp Gsching, 08.12.2015. ARM big.LITTLE Technology

  2. Outline
     1. Introduction
     2. ARM Architecture
        1. Instruction Set
        2. Microarchitecture
        3. CPUs
     3. big.LITTLE
        1. Cache Coherency
        2. Distributed Virtual Memory
        3. Performance
     4. Conclusion

  3. Smartphone/tablet use cases:
     1. Idle most of the time: calls for a low-power CPU
     2. High-performance requirements: call for a high-performance CPU
     - Difficult to achieve with one CPU

  4. Idea: ARM big.LITTLE, fusing a low-power and a high-performance CPU in one chip
     - big (Cortex-A57, 4 cores, own L2 cache): gaming, HD videos, rich web services, …
     - LITTLE (Cortex-A53, 4 cores, own L2 cache): OS, UI, internet, e-mail, …
     - Both clusters are connected by a Cache Coherent Interconnect

  5. Basics

  6. ARM (Advanced RISC Machines)
     - Founded: 1990 by Acorn, Apple and VLSI
     - Origin: microcontrollers / embedded systems
     - Business model: design and licensing of Intellectual Property (IP)
     - Revenue: 1.2 billion USD (Intel: 55.8 billion USD)
     - Employees: 3,300 (Intel: 106,700)
     - Market share: > 90% (2014, smartphone/tablet)

  7. ARM Instruction Set: RISC (Reduced Instruction Set Computing)

  8. RISC (ARM) vs. CISC (IA-32): the same operation in both instruction sets
     - RISC (ARM): MOV r2, #8; MUL r1, r1, r2; ADD r0, r0, r1; ADD r0, r0, #4; LDR r3, [r0]; ADD r3, r3, #1; STR r3, [r0]
     - CISC (IA-32): ADD $1, 4(%eax, %ebx, 8)

  9. ARM Instruction Set: RISC (Reduced Instruction Set Computing)
     - 16 general purpose registers + 2 status registers
     - 32-bit fixed-size instructions
     - Condition codes for (almost) all instructions
     - Barrel shifter for ALU
     - 16-bit fixed-size THUMB instructions
     - Digital Signal Processing (DSP) instructions
     - Cryptography extension instructions
     - Slide annotation: not strictly RISC

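  10. RISC (ARM) vs. CISC (IA-32), revisited
     - RISC (ARM), plain: MOV r2, #8; MUL r1, r1, r2; ADD r0, r0, r1; ADD r0, r0, #4; LDR r3, [r0]; ADD r3, r3, #1; STR r3, [r0]
     - RISC (ARM), using the barrel shifter and pre-indexed addressing: ADD r0, r0, r1, LSL #3; LDR r3, [r0, #4]!; ADD r3, #1; STR r3, [r0]
     - CISC (IA-32): ADD $1, 4(%eax, %ebx, 8) (slide annotation: Microcode)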
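The two ARM sequences on slides 8 and 10 can be checked against each other with a small register-level model. The following Python sketch is an illustration only: the function names, the dictionary used as memory, and the sample address are assumptions, and 32-bit wraparound is ignored. Both functions increment the word at `r0 + r1 * 8 + 4`, which is what the single IA-32 instruction `ADD $1, 4(%eax, %ebx, 8)` does.

```python
def plain_sequence(r0, r1, mem):
    """Seven-instruction ARM version from slide 8 (hypothetical model)."""
    r2 = 8            # MOV r2, #8
    r1 = r1 * r2      # MUL r1, r1, r2
    r0 = r0 + r1      # ADD r0, r0, r1
    r0 = r0 + 4       # ADD r0, r0, #4
    r3 = mem[r0]      # LDR r3, [r0]
    r3 = r3 + 1       # ADD r3, r3, #1
    mem[r0] = r3      # STR r3, [r0]
    return r0


def shifted_sequence(r0, r1, mem):
    """Four-instruction ARM version from slide 10 (hypothetical model)."""
    r0 = r0 + (r1 << 3)   # ADD r0, r0, r1, LSL #3 (barrel shifter scales the index)
    r0 = r0 + 4           # LDR r3, [r0, #4]! (pre-index, base register is written back)
    r3 = mem[r0]          #   ... the load itself
    r3 = r3 + 1           # ADD r3, #1
    mem[r0] = r3          # STR r3, [r0]
    return r0


if __name__ == "__main__":
    base, index = 0x1000, 3            # assumed sample values
    target = base + index * 8 + 4      # effective address used by both versions
    mem_a = {target: 41}
    mem_b = {target: 41}
    plain_sequence(base, index, mem_a)
    shifted_sequence(base, index, mem_b)
    print(hex(target), mem_a[target], mem_b[target])
```

Because the pre-indexed load writes the updated address back to `r0`, the final `STR r3, [r0]` hits the same location in both versions.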

  11. Instruction Set Architecture (ISA) has no significant impact on performance and power consumption
     - [Chart: average power (normalized, axis 0 to 4) for four CPUs]
     - A8 (ARM, 0.6 GHz, 65 nm, iPhone 4)
     - A15 (ARM, 1.66 GHz, 32 nm, Galaxy S4)
     - Atom (x86, 1.66 GHz, 45 nm, Netbook)
     - i7 (x86, 3.4 GHz, 32 nm, Desktop)
     - Technology-independent: scaled to 1 GHz, 45 nm process, normalized to A8

  12. After the ARM instruction set, the microarchitecture:
     - Technology node and feature size
     - Voltage and frequency scaling
     - Power domains
     - Clock gating
     - Power modes
     - Pipelining
     - Caches
     - SoC (System-on-a-Chip) design

  13. Technology node and feature size: reducing capacitance.

  14. Voltage and frequency scaling: dynamically adjusting supply voltage and clock speed according to need.

  15. Power domains: the power supply for different sections of the core can be turned on/off independently.

  16. Clock gating: the clock for different sections of the core can be turned on/off independently.

  17. Power modes: predefined low-power modes utilizing the features mentioned above.

  18. Pipelining: reducing idle time of different parts of the core.

  19. Caches: reducing time- and power-intensive accesses to main memory.

  20. SoC (System-on-a-Chip) design: adjusting all components of a processor to one another.

  21. Across all of these microarchitecture features, ARM's emphasis is on power consumption and size, which gives it momentum in the mobile market.
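The slides name the power-saving techniques without giving a formula. As a rough illustration, the Python sketch below uses the common first-order CMOS model for dynamic power, P ≈ α · C · V² · f. This model is an assumption brought in here, not something stated on the slides, and all numeric values (capacitance, voltages, activity factor) are hypothetical; only the two clock frequencies reuse the 1.3 GHz and 1.9 GHz figures quoted later for the Exynos 5433.

```python
def dynamic_power(c_eff, voltage, freq_hz, activity=1.0):
    """First-order dynamic power estimate in watts (assumed model).

    c_eff    -- effective switched capacitance in farads (shrinks with the technology node)
    voltage  -- supply voltage in volts (lowered by voltage scaling)
    freq_hz  -- clock frequency in hertz (lowered by frequency scaling)
    activity -- fraction of the logic that toggles (0.0 models a fully clock-gated block)
    """
    return activity * c_eff * voltage ** 2 * freq_hz


if __name__ == "__main__":
    c_eff = 1e-9  # hypothetical capacitance, chosen only to get readable numbers
    high = dynamic_power(c_eff, voltage=1.0, freq_hz=1.9e9)
    low = dynamic_power(c_eff, voltage=0.8, freq_hz=1.3e9)
    gated = dynamic_power(c_eff, voltage=1.0, freq_hz=1.9e9, activity=0.0)
    print(f"high: {high:.3f} W, low: {low:.3f} W, ratio: {low / high:.2f}, gated: {gated:.1f} W")
```

The quadratic voltage term is why scaling voltage together with frequency saves more than lowering the clock alone. Leakage power, which power domains address by cutting the supply, is not part of this model.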

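  22. [Block diagram: SoC containing a cluster. Core 1 contains an L1 cache (instruction and data), an MMU with µTLBs (instruction and data) and a TLB, an arbiter and a bus. At cluster level: Snoop Controller Unit and shared L2 cache.]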

  23. Core pipelines (same block diagram as the previous slide)
     - Cortex-A53: 8-stage (integer), in-order
     - Cortex-A57: 15-stage (integer), out-of-order

  24. LITTLE vs. big CPU

     |                        | LITTLE (Cortex-A53) | big (Cortex-A57) |
     |------------------------|---------------------|------------------|
     | 64-bit                 | Yes                 | Yes              |
     | Cores                  | 1 – 4               | 1 – 4            |
     | Frequency*             | 1.3 GHz             | 1.9 GHz          |
     | L1 cache               | 8 – 64 kB           | 48/32 kB         |
     | L2 cache               | 128 – 2,048 kB      | 512 – 2,048 kB   |
     | Pipeline integer depth | 8                   | 15               |
     | Pipeline out-of-order  | No                  | Yes              |
     | Performance            | 2.3 DMIPS/MHz       | 4.1 DMIPS/MHz    |
     | Technology node*       | 20 nm               | 20 nm            |
     | Core size*             | 0.70 mm²            | 2.05 mm²         |
     | Cluster size*          | 4.58 mm²            | 15.10 mm²        |

     \* Values for SoC Samsung Exynos 5433 (Galaxy Note 4)
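The table's figures can be combined into a rough per-core comparison. The Python sketch below is illustrative arithmetic only: it multiplies DMIPS/MHz by the quoted clock to get a peak DMIPS estimate and relates it to core area. The dictionary layout and function names are assumptions, and DMIPS is a synthetic benchmark, so the ratios indicate a trend rather than real application performance.

```python
# Per-core figures copied from the slide's table (Exynos 5433 values where starred).
CORES = {
    "Cortex-A53": {"dmips_per_mhz": 2.3, "freq_mhz": 1300, "core_mm2": 0.70},
    "Cortex-A57": {"dmips_per_mhz": 4.1, "freq_mhz": 1900, "core_mm2": 2.05},
}


def peak_dmips(name):
    """Peak single-core DMIPS at the quoted clock frequency."""
    core = CORES[name]
    return core["dmips_per_mhz"] * core["freq_mhz"]


def dmips_per_mm2(name):
    """Peak DMIPS divided by core area, a crude area-efficiency measure."""
    return peak_dmips(name) / CORES[name]["core_mm2"]


if __name__ == "__main__":
    for name in CORES:
        print(f"{name}: {peak_dmips(name):.0f} DMIPS, {dmips_per_mm2(name):.0f} DMIPS/mm²")
    speedup = peak_dmips("Cortex-A57") / peak_dmips("Cortex-A53")
    area = CORES["Cortex-A57"]["core_mm2"] / CORES["Cortex-A53"]["core_mm2"]
    print(f"A57 vs. A53: {speedup:.2f}x peak DMIPS for {area:.2f}x core area")
```

Under these figures the big core delivers more peak throughput but needs disproportionately more silicon per core, which matches the motivation for pairing the two core types.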

  25. Cortex-A power consumption
     - [Chart: power consumption (W, axis 0 to 8) over frequency (MHz, axis 400 to 1900)]
     - Series: A53 (1 core), A53 (4 cores), A57 (1 core), A57 (4 cores)
     - SoC: Samsung Exynos 5433 (Galaxy Note 4)

  26. Heterogeneous multi-processing

  27. Connecting two heterogeneous clusters: big (Cortex-A57, 4 cores, own L2 cache) and LITTLE (Cortex-A53, 4 cores, own L2 cache). The two are binary compatible.

  28. Each cluster's L2 cache is attached via AXI (Advanced eXtensible Interface).

  29. AXI channels: Read_Address, Read_Data, Write_Address, Write_Data, Write_Ack.

  30. AXI plus ACE: in addition to the five AXI channels, ACE provides C_Address, C_Data and C_Response.

  31. ACE = AXI Coherency Extension, with the channels C_Address, C_Data and C_Response.

  32. Both clusters connect through AXI/ACE to the Cache Coherent Interconnect.

  33. [SoC diagram: the two four-core clusters with their L2 caches attach to the Cache Coherent Interconnect. Also shown: memory controller, periphery, bus, display and GPU.]

  34. Coherency states (per cache line, in both the big Cortex-A57 and the LITTLE Cortex-A53 cluster)
     - Valid: Unique Dirty, Shared Dirty, Unique Clean, Shared Clean
     - Invalid
     - Analogous to the MOESI protocol: Modified, Owned, Exclusive, Shared, Invalid
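The correspondence between the five coherency states and MOESI can be written down as a small lookup. The Python sketch below is a minimal illustration: the class and attribute names are assumptions, and the pairing follows the usual reading of the two axes, unique vs. shared (may other caches hold a copy?) and clean vs. dirty (does this cache owe a write-back to memory?).

```python
from enum import Enum


class AceState(Enum):
    """Coherency states from the slide, paired with their MOESI counterparts."""

    #               MOESI name   valid  unique dirty
    UNIQUE_DIRTY = ("Modified", True, True, True)
    SHARED_DIRTY = ("Owned", True, False, True)
    UNIQUE_CLEAN = ("Exclusive", True, True, False)
    SHARED_CLEAN = ("Shared", True, False, False)
    INVALID = ("Invalid", False, False, False)

    def __init__(self, moesi, valid, unique, dirty):
        self.moesi = moesi
        self.valid = valid
        self.unique = unique
        self.dirty = dirty

    @property
    def can_store_without_snoop(self):
        # Only a line held exclusively can be written without informing other caches.
        return self.valid and self.unique

    @property
    def needs_writeback_on_evict(self):
        # A dirty line is the only up-to-date copy relative to main memory.
        return self.valid and self.dirty


if __name__ == "__main__":
    for state in AceState:
        print(f"{state.name:13} ~ {state.moesi:9} "
              f"store w/o snoop: {state.can_store_without_snoop}, "
              f"write-back: {state.needs_writeback_on_evict}")
```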

  35. [Diagram: big (Cortex-A57) and LITTLE (Cortex-A53) clusters, each with its cache, connected via AXI/ACE to the Cache Coherent Interconnect (CCI).]

  36. Example read, step 1: LITTLE issues load(A).

  37. Step 2: the CCI issues snoop(A) to the big cluster.

  38. Step 3: big answers with resp(miss).

  39. Step 4: the CCI issues load_mem(A) to main memory.

  40. Step 5: the CCI issues return(A) to the LITTLE cluster.

  41. Result: A is now held in the LITTLE cluster and marked "u" (Unique) in the diagram, since no other cache holds a copy.
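The five-step read sequence on slides 36 to 41 can be replayed with a toy model. The Python sketch below is a simplification under stated assumptions: the `Cluster` and `Interconnect` classes are hypothetical, only the states "unique" and "shared" are tracked (clean/dirty is ignored), and a snoop hit simply shares the line. It is not a description of how ARM's interconnect is actually implemented.

```python
class Cluster:
    """A CPU cluster with a single cache, modelled as address -> (value, state)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}

    def snoop(self, addr):
        # Returns the cached line, or None on a miss.
        return self.cache.get(addr)


class Interconnect:
    """Toy cache coherent interconnect (CCI) that records each protocol step."""

    def __init__(self, clusters, memory):
        self.clusters = clusters
        self.memory = memory
        self.log = []

    def load(self, requester, addr):
        self.log.append(f"{requester.name}: load({addr})")
        if addr in requester.cache:
            return requester.cache[addr][0]  # local hit, no snoop needed
        for other in self.clusters:
            if other is requester:
                continue
            self.log.append(f"CCI: snoop({addr}) -> {other.name}")
            line = other.snoop(addr)
            if line is None:
                self.log.append(f"{other.name}: resp(miss)")
                continue
            # Snoop hit: both clusters now hold the line as shared.
            self.log.append(f"{other.name}: resp(hit)")
            value = line[0]
            other.cache[addr] = (value, "shared")
            requester.cache[addr] = (value, "shared")
            self.log.append(f"CCI: return({addr})")
            return value
        # Nobody had the line: fetch it from main memory as the only copy.
        self.log.append(f"CCI: load_mem({addr})")
        value = self.memory[addr]
        self.log.append(f"CCI: return({addr})")
        requester.cache[addr] = (value, "unique")
        return value


if __name__ == "__main__":
    big, little = Cluster("big"), Cluster("LITTLE")
    cci = Interconnect([big, little], memory={"A": 7})  # 7 is an arbitrary sample value
    cci.load(little, "A")
    print("\n".join(cci.log))
    print(little.cache["A"])
```

A subsequent `cci.load(big, "A")` in this model takes the snoop-hit path instead, avoiding the trip to main memory and leaving the line shared in both clusters.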
