Application Requirements and Efficiency of Embedded Java Bytecode Multi-Cores
JTRES 2010
Martin Zabel, Rainer G. Spallek
Faculty of Computer Science, Institute of Computer Engineering, Chair of VLSI Design, Diagnostics and Architecture
Prague, 19.08.2010

Itinerary
• Motivation
• Application Requirements
• SHAP Multi-Core Design
• Performance Evaluation
• Summary
TU Dresden, 19.08.2010 Embedded Java Bytecode Multi-Cores slide 2 of 16

Motivation
• Complexity of applications increases:
  – Raise computational throughput.
  – Decrease latency.
• Previous approaches: smarter Java bytecode single-cores.
  – Just-in-time compilation.
  – Instruction-level parallelism: bytecode folding, VLIW packets.
  – Bit-level parallelism.
• Now: thread-level parallelism exploited by multi-cores.

Application Requirements: Address Spaces
• Code Area:
  – Accessed frequently.
  – Duplication ⇒ Chip-space intensive.
  – Sharing ⇒ Efficient method caching.
• Shared Java Heap:
  – UMA / NUMA.
  – Fast atomic operations for monitor lock/unlock.
  – Independent locks, otherwise performance is degraded.
  – Memory bus utilization.
• Shared Peripherals.

Application Requirements: Memory Bandwidth
• SHAP single-core already includes a multi-port memory manager:
  – DMA and cache-line filling.
  – Pipelined transactions using outstanding reads.
  – Maximum bandwidth with pipelined memory (ZBT SRAM).
[Figure: timing diagram over clock cycles 0–7. The request-data row and the reply-data row each carry the sequence Cache, Cache, DMA, DMA.]

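The benefit of pipelined transactions with outstanding reads can be illustrated with a small cycle count. The following is a minimal sketch under a hypothetical timing model that is not taken from the slides: a read issued in cycle t returns its data in cycle t + latency, and the class name, method names and example values are illustrative only.

```java
// Hypothetical timing model (an assumption, not SHAP's actual bus protocol):
// a read issued in cycle t delivers its data in cycle t + latency.
final class PipelinedReadModel {

    // Pipelined: one request per cycle, replies stream out 'latency' cycles later.
    // Requests occupy cycles 0..reads-1, the last reply arrives in cycle reads-1+latency.
    static int pipelinedCycles(int reads, int latency) {
        return reads == 0 ? 0 : reads + latency;
    }

    // Blocking: the next request is only issued after the previous reply arrived,
    // so every read occupies latency + 1 cycles.
    static int blockingCycles(int reads, int latency) {
        return reads * (latency + 1);
    }

    public static void main(String[] args) {
        int reads = 4;   // assumed: two cache reads plus two DMA reads
        int latency = 2; // assumed latency in cycles
        System.out.println("pipelined: " + pipelinedCycles(reads, latency) + " cycles");
        System.out.println("blocking:  " + blockingCycles(reads, latency) + " cycles");
    }
}
```

Under these assumed values the pipelined bus needs 6 cycles for four reads, the blocking bus 12. The gap grows with the number of outstanding reads, which is why a pipelined memory such as ZBT SRAM matters for the shared bus.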
Application Requirements: Memory Bandwidth Utilization
Utilization:
$$u = \frac{m_{\mathrm{Total}}}{e_{\mathrm{Total}}} = \frac{\sum_b f_b \cdot m_b}{\sum_b f_b \cdot e_b}$$
Application    ML505     DE2
Queens          3.09%    6.35%
Lift           10.18%   22.12%
FScriptME       8.46%   17.60%
SMMI           13.26%   30.43%
El-Kharashi    ≈ 10%    ≈ 21%
[Figure: timing diagram over clock cycles 0–10 with request and reply rows for ML505 and DE2. Cores C0 to C3 issue requests in turn and receive replies in the same order; the diagram marks four wait cycles (W).]

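The utilization formula on this slide is a ratio of two weighted sums over all bytecodes b. A minimal sketch of that computation follows. The slides do not define the symbols, so the reading used here is an assumption: f_b is how often bytecode b is executed, m_b the memory-bus cycles it needs, and e_b its execution cycles. The profile in `main` is made up for illustration and is not measured data.

```java
// Sketch of u = sum(f_b * m_b) / sum(f_b * e_b).
// Assumed meaning of the symbols (not stated on the slide):
//   f[b] = execution count of bytecode b
//   m[b] = memory-bus cycles per execution of b
//   e[b] = total execution cycles per execution of b
final class BusUtilization {

    static double utilization(long[] f, long[] m, long[] e) {
        long mTotal = 0;
        long eTotal = 0;
        for (int b = 0; b < f.length; b++) {
            mTotal += f[b] * m[b];
            eTotal += f[b] * e[b];
        }
        return (double) mTotal / eTotal;
    }

    public static void main(String[] args) {
        // Made-up profile with three bytecode classes.
        long[] f = {50, 30, 20};
        long[] m = {0, 1, 2};
        long[] e = {2, 6, 11};
        System.out.printf("u = %.2f%%%n", 100.0 * utilization(f, m, e));
    }
}
```

With this made-up profile the memory cycles sum to 70 and the execution cycles to 500, so u is 14%.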
Application Requirements: Conclusion
⇒ UMA setup is suitable if the whole memory subsystem can be operated in a pipelined fashion.
⇒ Bandwidth sufficient for up to 10 cores on ML505.

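One way to read the 10-core figure is as a back-of-the-envelope bound: if every core uses about 10% of the bus and the demands simply add up, the bus saturates at ten cores. The sketch below encodes only that simplification. It assumes a fully pipelined bus with no arbitration overhead, which is an idealization and not a claim about how the authors derived the number; the names are hypothetical.

```java
// Rough upper bound on the core count for a shared bus.
// Assumptions (idealized): per-core demands add linearly and the bus
// can be driven to 100% without arbitration or contention losses.
final class CoreCountBound {

    // utilizationPercent: single-core bus utilization in percent, must be > 0.
    static int maxCores(double utilizationPercent) {
        return (int) (100.0 / utilizationPercent);
    }

    public static void main(String[] args) {
        System.out.println("typical 10%  -> " + maxCores(10.0) + " cores");
        System.out.println("Queens 3.09% -> " + maxCores(3.09) + " cores");
    }
}
```

Applications with a higher single-core utilization than the typical value would hit the bound earlier, so the estimate is only as good as the assumed application mix.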
SHAP Multi-Core Microarchitecture
• MMU with variable port count.
• Full-duplex memory bus with pipelined transactions.
• Multi-threaded, real-time capable cores with local on-chip stack and method cache.
• Exact and fully concurrent non-blocking garbage collector.
• Native execution of Java bytecode.
• Fast atomic operations for independent locks.
• Synthesizable for a variable number of cores.
[Figure: SHAP multi-core architecture (configurable). Blocks shown: Core 0 to Core n−1, each with stack and method cache; memory manager with controller and garbage collector; Wishbone bus; UART, graphics unit, Ethernet MAC, DMA controller; memory (SRAM or DDR-RAM). Bus widths marked: 32 (data), 8, DDR: 16, SDR: 32.]

SHAP Implementation on ML505
• Main evaluation platform: Xilinx ML505 Development Board
  – Virtex-5 XC5VLX50T.
  – 1 MB external ZBT SRAM with 32-bit data bus.
  – 80 MHz clock frequency for up to 9 cores.
• Setup: 8 KB stack and 4 KB method cache per core. Minimum I/O.
• Chip space scales linearly with the number of cores n:
  LUTs(n) ≈ 2794 + 2831 · n
  FFs(n) ≈ 1933 + 1447 · n
  18 kbit BRAMs(n) = 1 + 2 · n
  36 kbit BRAMs(n) = 1 + 3 · n
  Multipliers(n) = 2 + 3 · n
• Other platforms available: Xilinx Spartan-3/3E Starter Kit, Altera DE2 Development Board.

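The linear resource formulas on this slide can be evaluated directly for any core count. The sketch below only restates them as code; the class and method names are hypothetical, and the LUT and flip-flop results are approximations because the slide gives those two formulas as fits.

```java
// Resource estimates for an n-core SHAP on the ML505, restating the slide's formulas.
// LUT and FF counts are approximate fits; BRAM and multiplier counts are exact per the slide.
final class ShapResourceModel {

    static int luts(int n)        { return 2794 + 2831 * n; }
    static int flipFlops(int n)   { return 1933 + 1447 * n; }
    static int bram18k(int n)     { return 1 + 2 * n; }
    static int bram36k(int n)     { return 1 + 3 * n; }
    static int multipliers(int n) { return 2 + 3 * n; }

    public static void main(String[] args) {
        System.out.println("n  LUTs   FFs    BRAM18  BRAM36  Mult");
        for (int n = 1; n <= 9; n++) {
            System.out.printf("%d  %-6d %-6d %-7d %-7d %d%n",
                    n, luts(n), flipFlops(n), bram18k(n), bram36k(n), multipliers(n));
        }
    }
}
```

For the largest configuration on the slide, n = 9, the formulas give about 28273 LUTs, about 14956 flip-flops, 19 and 28 BRAMs of the two sizes, and 29 multipliers.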
Relative Speed-Up
• Measured on: Xilinx ML505 Virtex-5 Development Board.
• Pipelined ZBT SRAM with 32-bit data bus.
[Figure: speedup S (1 to 9) over number of cores n (1 to 9) for Queens, Lift, FScriptME and SparseMatmultInt.]

Area Efficiency
• Absolute / relative area of LUTs, FFs, BRAMs and multipliers is unknown.
• Speed-up in relation to count of BRAMs on ML505.
[Figure: area efficiency AE (0 to 1.4) over number of cores n (1 to 9) for Queens, Lift, FScriptME and SparseMatmultInt.]

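The slide relates speed-up to the BRAM count but does not spell out the formula. The sketch below shows one plausible definition, labeled as an assumption: AE(n) = S(n) / (A(n) / A(1)), where A(n) is the plain number of BRAM blocks from the implementation slide, with 18 kbit and 36 kbit blocks counted alike. The speed-up passed in `main` is the ideal linear case, used purely as an illustration and not as a measured result.

```java
// Assumed definition (not confirmed by the slides):
//   AE(n) = S(n) / (A(n) / A(1)), with A(n) = number of BRAM blocks.
final class AreaEfficiency {

    // Sum of 18 kbit (1 + 2n) and 36 kbit (1 + 3n) BRAM blocks; counting
    // both sizes as one block each is a simplifying assumption.
    static int bramBlocks(int n) {
        return (1 + 2 * n) + (1 + 3 * n);
    }

    static double areaEfficiency(double speedup, int n) {
        double relativeArea = (double) bramBlocks(n) / bramBlocks(1);
        return speedup / relativeArea;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 9; n++) {
            double idealSpeedup = n; // hypothetical: perfectly linear speed-up
            System.out.printf("n=%d  AE=%.3f%n", n, areaEfficiency(idealSpeedup, n));
        }
    }
}
```

Under this definition the efficiency can exceed 1 because the shared parts (one block of each BRAM type in the formulas) are paid for only once, while each additional core adds nearly a full unit of speed-up.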
Comparison Against Related Projects
• Other Java bytecode multi-cores: JopCMP, jamuth IP multi-core, REALJava.
• Comparison of absolute performance, etc.
• Using same platform.
⇒ JopCMP on Altera DE2 Development Board.

Comparison Against JopCMP: Absolute Performance on DE2 Board
• SHAP @ 60 MHz, JopCMP @ 90 MHz.
• Asynchronous SRAM with only 16-bit data bus.
[Figure: two plots of executions / s (0 to 140000) over number of cores n (1 to 9). Series: SHAP ML505, SHAP DE2, JopCMP DE2 with 2 KB I-cache, JopCMP DE2 with 1 KB I-cache.]

Comparison Against JopCMP: Chip-Space on DE2 Board
• SHAP implements GC in hardware.
• A SHAP core requires about 23% more LEs than a JopCMP core.

Synchronization Performance
• Synchronization limits speed-up ⇒ Keep it as short / rare as possible.
• Highly application / API dependent.
• Typical: synchronized blocks.
  – Long blocking periods for field update(s).
  – Only small amount for atomic bus access.
  Example: java.util.concurrent.LinkedBlockingQueue.put()
  [Figure: timing bar for put() with segments labeled Prepare, Update, Invoke, A and R; values shown: 46, 119, 6, 118, 1, 10.]
• Alternative: compare and swap.
  – Short blocking period: only atomic bus access.
  – Complex code.

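The trade-off between the two synchronization styles can be seen on a shared counter. The following is a minimal sketch using standard Java, not code from SHAP or from LinkedBlockingQueue; the class and its methods are hypothetical. The synchronized variant holds the monitor for the whole read-modify-write, while the compare-and-swap variant blocks nobody and only the `compareAndSet` call is atomic, at the price of an explicit retry loop.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative counter with a monitor-based and a CAS-based increment.
final class CounterSketch {
    private int lockedValue;
    private final AtomicInteger casValue = new AtomicInteger();

    // Monitor-based: competing threads block for the entire update.
    synchronized int incrementLocked() {
        return ++lockedValue;
    }

    // CAS-based: read, compute, then publish atomically; retry if another
    // thread changed the value in between.
    int incrementCas() {
        while (true) {
            int current = casValue.get();
            int next = current + 1;
            if (casValue.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    // Runs both variants from several threads and returns the final values
    // as {locked, cas}.
    static int[] demo(int threads, int perThread) {
        CounterSketch c = new CounterSketch();
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    c.incrementLocked();
                    c.incrementCas();
                }
            });
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] {c.lockedValue, c.casValue.get()};
    }

    public static void main(String[] args) {
        int[] result = demo(4, 1000);
        System.out.println("locked=" + result[0] + " cas=" + result[1]);
    }
}
```

Both variants end at the same count. The difference lies in how long other threads are held up, and the retry loop hints at why the slide calls the compare-and-swap approach more complex for anything beyond a single word.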
Summary
• Multi-core Java bytecode processor with shared heap.
• Multi-port MMU with pipelined transactions.
• Chip space scales linearly.
• Application mix:
  – Typical: 10% memory bandwidth utilization.
  – Almost linear speed-up for up to 9 cores.
  – Area efficiency of ≈ 120% (> 1 core).
• Better absolute performance than related project JopCMP.
http://shap.inf.tu-dresden.de