Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache


  1. Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache
     Martin Zabel, Thomas B. Preußer, Rainer G. Spallek
     Faculty of Computer Science, Institute for Computer Engineering, Chair for VLSI-Design, Diagnostics and Architecture
     JTRES'12, 25.10.2012
     Martin.Zabel@tu-dresden.de, http://vlsi-eda.inf.tu-dresden.de

  2. Outline: 1 Motivation, 2 Related Work, 3 Heap-Access Analysis, 4 Implementation & Results, 5 Conclusion

  3. Motivation
     Why Java?
     - Object orientation, portability
     - Automatic memory management, security
     - Support for thread parallelization
     Why a Java (bytecode) processor?
     - Native execution of Java bytecode
     - No OS, no interpretation, no re-compilation
     - Real-time capable
     - Suited for embedded systems with limited resources
     Why multi-core processors?
     - Power consumption increases over-proportionally with clock frequency.
     - Use thread-level parallelism instead.

  4. Java Multi-Core Processor
     Examples: JopCMP, jamuth, REALJava and SHAP
     Common property: central shared heap for all cores
     SHAP Multi-Core:
     - Local stack memory per core
     - Method cache per core
     - Pipelining of heap accesses
     - Concurrent GC for real-time applications
     - Maximum speed-up of 8 for programs with an above-average number of memory accesses [1]
     - CLDC, constant-time interface method dispatch, …
     [Figure: SHAP Multi-Core Architecture (configurable). Core0 to Core n-1, each with stack and method cache; memory manager with garbage collector; memory controller (DDR: 16, SDR: 32) for SRAM or DDR-RAM; Wishbone bus with UART, DMA, graphics unit and Ethernet MAC.]

  5. Goal
     Further reduce the demands on the heap memory interface to achieve higher speed-ups through thread-level parallelism.

  6. Outline: 1 Motivation, 2 Related Work, 3 Heap-Access Analysis, 4 Implementation & Results, 5 Conclusion

  7. Related Work
     Common solution for object-oriented processors: a cache for objects, in analogy to data caches [2] [3].
     [Figure: object-cache entry. Tag: object reference and offset. Data: object content.]
     Especially for real-time systems: separate caches for different data areas [4]:
     - Classic data cache for data at static addresses (e.g. class data)
     - Object cache for data at dynamic addresses (e.g. objects)

  8. Indirect Object-Addressing
     Problem of JopCMP and SHAP:
     - Object table stored in external memory.
     - Additional latency for each heap access.
     - Additional demand on memory bandwidth.
     Solution:
     - Translation look-aside buffer (TLB) [1]
     - Virtually indexed object cache [2]
     [Figure: a reference on the stack points into the object table, whose entry points to the object on the heap.]
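The cost of the indirection on slide 8 can be illustrated with a small model. This is only a sketch under assumptions: the class `ObjectTableMemory`, the function `read_field`, and the unbounded dictionary standing in for the TLB are hypothetical names and simplifications, not part of SHAP or JopCMP.

```python
class ObjectTableMemory:
    """Toy external memory holding an object table and the heap (hypothetical)."""

    def __init__(self):
        self.object_table = {}  # object handle -> physical base address
        self.heap = {}          # physical address -> word
        self.accesses = 0       # number of external memory accesses

    def lookup_address(self, handle):
        self.accesses += 1      # the object table lives in external memory
        return self.object_table[handle]

    def read_word(self, address):
        self.accesses += 1
        return self.heap[address]


def read_field(mem, handle, offset, tlb=None):
    """Read one word of an object; `tlb` is an optional dict handle -> base address."""
    if tlb is not None and handle in tlb:
        base = tlb[handle]                 # translation served on-chip
    else:
        base = mem.lookup_address(handle)  # extra external access
        if tlb is not None:
            tlb[handle] = base             # assumption: unbounded TLB for brevity
    return mem.read_word(base + offset)


mem = ObjectTableMemory()
mem.object_table[7] = 100
mem.heap[101] = 42

read_field(mem, 7, 1)
without_tlb = mem.accesses   # table lookup plus data read

mem.accesses = 0
tlb = {}
read_field(mem, 7, 1, tlb)   # TLB miss: table lookup plus data read
read_field(mem, 7, 1, tlb)   # TLB hit: data read only
with_tlb = mem.accesses
```

In this model every heap access without a TLB costs two external accesses, while a TLB hit removes the object-table lookup.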

  9. Cache Coherence
     Problem: coherence of distributed caches
     Advantage of the Java Virtual Machine: synchronization is required only when [5]
     - entering a critical section, or
     - accessing a “volatile” variable.

  10. Outline: 1 Motivation, 2 Related Work, 3 Heap-Access Analysis, 4 Implementation & Results, 5 Conclusion

  11. Heap-Access Analysis
      Evaluation:
      - Benchmark suite JemBench (Version 2.0) [9], all except the microkernel benchmarks
      - SHAP Multi-Core with 1 core and trace unit [1]
      - Recording of executed bytecodes and memory accesses

  12. Heap-Access Analysis
      [Figure: memory bandwidth utilization [%] (axis 0 to 25) per benchmark (AES, BubbleSort, Kfl, Lift, MatrixMul, N-Queens, Sieve, UdpIp), split into data accesses (core itself), bytecode fetches (method cache) and accesses for memory management; annotated with 11%.]

  13. Heap-Access Analysis
      Evaluation:
      - Benchmark suite JemBench (Version 2.0) [9]
      - SHAP Multi-Core with 1 core and trace unit [1]
      - Recording of executed bytecodes and memory accesses
      Results:
      1. The most frequent object accesses are reads on arrays (already accounting for implicit reads of the array length) and on member variables.
      2. 80% of all object accesses concentrate on 6 objects.
      3. Frequent accesses to the first user-specific object offsets (-2 and 1).
      Further evaluation: small fully associative cache for each core
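A statistic such as "80% of all object accesses concentrate on 6 objects" could be derived from a recorded access trace roughly as follows. This is a sketch only: the function name `objects_covering` and the toy trace are assumptions, not the authors' trace-unit tooling or JemBench data.

```python
from collections import Counter


def objects_covering(trace, share):
    """Return the smallest number of objects that account for `share` of all accesses."""
    counts = Counter(trace)
    total = sum(counts.values())
    covered = 0
    # Visit objects from most to least frequently accessed.
    for n, (_, hits) in enumerate(counts.most_common(), start=1):
        covered += hits
        if covered >= share * total:
            return n
    return len(counts)


# Toy trace of object handles (hypothetical data).
trace = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
n = objects_covering(trace, 0.8)
```

For the toy trace, two objects already cover 80% of the ten accesses, which is the kind of concentration that motivates a small per-core cache.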

  14. Small Fully Associative Local Cache
      Storing only invariant data:
      - Would require no extra logic for cache coherency.
      - In general, only the class information pointer and the array size are invariant.
      - Significant reduction only for array-intensive programs: BubbleSort 33%, Sieve 20%, MatrixMul 17% of all memory accesses.
      Write-through instead of write-back:
      - No special GC interaction required.
      - Simple cache coherence: invalidate the cache when
        - entering a critical section, or
        - accessing a “volatile” variable.
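The write-through and invalidate-on-synchronization policy from slide 14 can be sketched with a toy model of two cores. The classes `SharedHeap` and `CoreCache` and the method `monitor_enter` are hypothetical; updating the local copy on a write is an assumption, since the slides do not state the allocation policy.

```python
class SharedHeap:
    """Toy shared heap, addressed by (handle, offset)."""

    def __init__(self):
        self.words = {}


class CoreCache:
    """Per-core write-through cache that is invalidated on synchronization."""

    def __init__(self, heap):
        self.heap = heap
        self.lines = {}  # (handle, offset) -> cached word

    def read(self, handle, offset):
        key = (handle, offset)
        if key not in self.lines:          # miss: fetch from the shared heap
            self.lines[key] = self.heap.words[key]
        return self.lines[key]

    def write(self, handle, offset, value):
        key = (handle, offset)
        self.heap.words[key] = value       # write-through: heap is always current
        self.lines[key] = value            # assumption: also update the local copy

    def monitor_enter(self):
        self.lines.clear()                 # invalidate when entering a critical section


heap = SharedHeap()
heap.words[(1, 0)] = 10
core0, core1 = CoreCache(heap), CoreCache(heap)

core1.read(1, 0)           # core1 now caches the old value
core0.write(1, 0, 20)      # heap is updated immediately
stale = core1.read(1, 0)   # served from core1's cache, no synchronization yet
core1.monitor_enter()      # synchronization point invalidates core1's cache
fresh = core1.read(1, 0)   # refetched from the heap
```

Because the heap is always current under write-through, discarding cached lines at a synchronization point is enough in this model; no dirty data has to be written back.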

  15. Outline: 1 Motivation, 2 Related Work, 3 Heap-Access Analysis, 4 Implementation & Results, 5 Conclusion

  16. Cache Design (1)
      Cache integration: into the memory manager port
      Core modifications:
      - Bytecode to access volatile variables
      - Microcode to invalidate the cache
      [Figure: SHAP Multi-Core with AOC (excerpt, configurable). Core0 with stack and method cache; the AOC sits at the core's data port towards the memory manager with garbage collector; memory controller (DDR: 16, SDR: 32) for SRAM or DDR-RAM; Wishbone bus.]

  17. Cache Design (2)
      Features:
      - Address & Offset-Cache (AOC) with write-through, LRU strategy
      - 1 valid bit per cached word
      - Configurable: number of cache lines, cached offsets
      Disadvantage: additional latency of 1 clock cycle during a cache miss
      [Figure: Address- and Offset-Cache (AOC). An object handle on the core's stack selects an object-table entry in external memory, which holds the physical address of the object on the heap (example object with Data W, X, Y, Z at offsets -2, -1, 0, 1). Each AOC line (0 to 3) stores the object handle as tag, the physical address in the address memory, and a valid bit plus data per cached offset (example line 0: offset -2 valid with Data W, offset 1 valid with Data Z).]
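The AOC features listed on slide 17 can be modelled in software to see which external accesses a hit saves. This is a behavioural sketch under assumptions, not the hardware design: the class names `Memory` and `AddressOffsetCache`, the dictionary-based valid bits, and the example addresses are all hypothetical.

```python
from collections import OrderedDict


class Memory:
    """Toy external memory: object table plus heap, counting accesses."""

    def __init__(self, object_table, heap):
        self.object_table = object_table  # handle -> physical base address
        self.heap = heap                  # physical address -> word
        self.accesses = 0

    def translate(self, handle):
        self.accesses += 1
        return self.object_table[handle]

    def read(self, address):
        self.accesses += 1
        return self.heap[address]

    def write(self, address, value):
        self.accesses += 1
        self.heap[address] = value


class AddressOffsetCache:
    """Caches the physical address per handle plus words at selected offsets."""

    def __init__(self, memory, num_lines=8, offsets=(-2, 1)):
        self.memory = memory
        self.num_lines = num_lines
        self.offsets = offsets
        self.lines = OrderedDict()  # handle -> [base address, {offset: word}]

    def _line(self, handle):
        if handle in self.lines:
            self.lines.move_to_end(handle)      # mark as most recently used
        else:
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)  # evict the least recently used line
            self.lines[handle] = [self.memory.translate(handle), {}]
        return self.lines[handle]

    def read(self, handle, offset):
        base, words = self._line(handle)
        if offset in words:                     # word present: valid bit set
            return words[offset]
        value = self.memory.read(base + offset)
        if offset in self.offsets:              # only configured offsets are cached
            words[offset] = value
        return value

    def write(self, handle, offset, value):
        base, words = self._line(handle)
        self.memory.write(base + offset, value)  # write-through
        if offset in self.offsets:
            words[offset] = value

    def invalidate(self):
        self.lines.clear()


# Object with Data W, X, Y, Z at offsets -2, -1, 0, 1 around base address 100.
mem = Memory({5: 100}, {98: "W", 99: "X", 100: "Y", 101: "Z"})
aoc = AddressOffsetCache(mem, num_lines=2, offsets=(-2, 1))
aoc.read(5, -2)        # full miss: address translation plus data read
first = mem.accesses
aoc.read(5, -2)        # full hit: no external access
aoc.read(5, 0)         # address hit, offset 0 not cached: data read only
second = mem.accesses
```

In this model an address hit alone already removes the object-table lookup, and a hit on a cached offset removes the data read as well.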

  18. Cache Configuration
      Huge configuration space:
      - Cache lines: l ∈ ℕ, l > 0
      - Offsets: none (address only), or range [−x, y] with x, y ∈ ℕ, x > 0
      - Cores: n = 1, 2, …, 18
      But synthesis for n cores is too expensive, so search for a good initial configuration.
      Configuration space exploration:
      - Baseline design for comparison: TLB with 2 entries
      - Benchmarks:
        - JemBench
        - JavaGrande Framework [7]: HeapSort, SparseMatMult (with integers)
      - Platform:
        - SHAP on Virtex-5 FPGA XC5VLX110T
        - Same clock frequency of 80 MHz as the baseline design

  19. Setup: 1 Core, Cache Address Only
      [Figure: relative performance (axis 0.96 to 1.14) over cache configuration (baseline design, 2, 4, 8, 16, 32, 64 lines) for SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve and BS.]
      Result: use 8 lines.

  20. Setup: 1 Core, 8 Lines
      [Figure: relative performance (axis 1 to 1.45) over cache configuration (only address, [-1, 0], [-2, 1], [-4, 3], [-8, 7], [-16, 15], [-8, 23], [-32, 31], [-16, 47], [-8, 55]) for SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve and BS.]
      Result: cache offsets -2 and 1.
