Fakultät Informatik, Institut für Technische Informatik, Professur für VLSI-Entwurfssysteme, Diagnostik und Architektur
APPLICATION REQUIREMENTS AND EFFICIENCY OF EMBEDDED JAVA BYTECODE MULTI-CORES
JTRES 2010
Martin Zabel, Rainer G. Spallek
Prague, 19.08.2010

Itinerary
• Motivation
• Application Requirements
• SHAP Multi-Core Design
• Performance Evaluation
• Summary
TU Dresden, 19.08.2010 Embedded Java Bytecode Multi-Cores slide 2 of 16

Motivation
• Complexity of applications increases:
  – Raise computational throughput.
  – Decrease latency.
• Previous approaches: smarter Java bytecode single-cores.
  – Just-In-Time compilation.
  – Instruction-level parallelism: bytecode folding, VLIW packets.
  – Bit-level parallelism.
• Now: thread-level parallelism exploited by multi-cores.

Application Requirements: Address Spaces
• Code Area:
  – Accessed frequently.
  – Duplication ⇒ Chip-space intensive.
  – Sharing ⇒ Efficient method caching.
• Shared Java Heap:
  – UMA / NUMA.
  – Fast atomic operations for monitor lock/unlock.
  – Independent locks, otherwise performance is degraded.
  – Memory bus utilization.
• Shared Peripherals.

Application Requirements: Memory Bandwidth
• SHAP single-core already includes multi-port memory manager:
  – DMA and cache-line filling.
  – Pipelined transactions using outstanding reads.
  – Maximum bandwidth with pipelined memory (ZBT SRAM).
[Timing diagram: request data and reply data over clock cycles 0 to 7, showing two cache and two DMA transactions.]

Application Requirements: Memory Bandwidth Utilization
Utilization:
  u = \frac{m_{\mathrm{Total}}}{e_{\mathrm{Total}}} = \frac{\sum_b f_b \cdot m_b}{\sum_b f_b \cdot e_b}
Application    ML505     DE2
Queens          3.09%    6.35%
Lift           10.18%   22.12%
FScriptME       8.46%   17.60%
SMMI           13.26%   30.43%
El-Kharashi    ≈ 10%    ≈ 21%
[Timing diagram over clock cycles 0 to 10: request (Req.) and reply (Repl.) slots C0 to C3 for ML505 and DE2, including W slots.]
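The utilization formula on this slide can be sketched in a few lines of Java. The slide does not define the symbols, so the reading used here is an assumption: f_b is how often bytecode b is executed, m_b is the number of memory-bus cycles per execution, and e_b is the number of execution cycles per execution. The class name and the profile numbers are hypothetical and are not measurements from the talk.

```java
// Sketch of u = sum_b(f_b * m_b) / sum_b(f_b * e_b).
// Assumed meaning of the symbols (not stated on the slide):
//   f[b] = execution count of bytecode b
//   m[b] = memory-bus cycles per execution of b
//   e[b] = execution cycles per execution of b
final class BandwidthUtilization {

    static double utilization(long[] f, long[] m, long[] e) {
        if (f.length != m.length || f.length != e.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        long mTotal = 0; // total memory-bus cycles
        long eTotal = 0; // total execution cycles
        for (int b = 0; b < f.length; b++) {
            mTotal += f[b] * m[b];
            eTotal += f[b] * e[b];
        }
        if (eTotal == 0) {
            throw new IllegalArgumentException("no executed cycles");
        }
        return (double) mTotal / eTotal;
    }

    public static void main(String[] args) {
        // Made-up three-bytecode profile, for illustration only.
        long[] f = {700, 200, 100};
        long[] m = {0, 1, 2};
        long[] e = {1, 4, 20};
        System.out.printf("u = %.2f%%%n", 100 * utilization(f, m, e));
    }
}
```

Weighting both sums by the same frequencies means that rarely executed but memory-heavy bytecodes contribute little to u, which is why an application-level profile is needed rather than per-bytecode figures alone.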

Application Requirements: Conclusion
⇒ UMA setup is suitable if whole memory subsystem can be operated in a pipelined fashion.
⇒ Bandwidth sufficient for up to 10 cores on ML505.
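A back-of-the-envelope check of the 10-core figure, as a sketch only. It assumes (this is not stated on the slide) that the shared pipelined bus saturates once the per-core utilizations add up to 100% and that every core runs the same workload. The class and method names are hypothetical; the percentages are the ML505 values from the utilization table.

```java
// Rough upper bound on the core count: largest n with n * u <= 1.
// Assumption: identical workload per core and a bus that saturates at 100%.
final class CoreBound {

    static int maxCores(double utilizationPerCore) {
        if (utilizationPerCore <= 0 || utilizationPerCore > 1) {
            throw new IllegalArgumentException("utilization must be in (0, 1]");
        }
        // Small epsilon guards against rounding just below an integer.
        return (int) Math.floor(1.0 / utilizationPerCore + 1e-9);
    }

    public static void main(String[] args) {
        String[] apps = {"Queens", "Lift", "FScriptME", "SMMI", "El-Kharashi"};
        double[] ml505 = {0.0309, 0.1018, 0.0846, 0.1326, 0.10};
        for (int i = 0; i < apps.length; i++) {
            System.out.println(apps[i] + ": at most " + maxCores(ml505[i]) + " cores");
        }
    }
}
```

Under this simple model the typical utilization of about 10% yields the bound of 10 cores; applications with higher utilization would hit the bus limit earlier.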

SHAP Multi-Core Microarchitecture
• MMU with variable port count.
• Full-duplex memory bus with pipelined transactions.
• Multi-threaded, real-time capable cores with local on-chip stack and method cache.
• Exact and fully concurrent non-blocking garbage collector.
• Native execution of Java bytecode.
• Fast atomic operations for independent locks.
• Synthesizable for a variable number of cores.
[Block diagram: SHAP multi-core architecture (configurable). Core 0 to Core n−1, each with stack and method cache; memory manager with controller and garbage collector; memory DMA controller; Wishbone bus with UART, graphics unit, and Ethernet MAC; external memory (SRAM or DDR-RAM); bus width labels 32 and 8; labels DDR: 16, SDR: 32.]

SHAP Implementation on ML505
• Main evaluation platform: Xilinx ML505 Development Board
  – Virtex-5 XC5VLX50T
  – 1 MB external ZBT SRAM with 32-bit data bus.
  – 80 MHz clock frequency for up to 9 cores.
• Setup: 8 KB stack and 4 KB method cache per core. Minimum I/O.
• Chip-space scales linearly with the number of cores n:
  LUTs(n) ≈ 2794 + 2831 · n
  FFs(n) ≈ 1933 + 1447 · n
  18 kbit BRAMs(n) = 1 + 2 · n
  36 kbit BRAMs(n) = 1 + 3 · n
  Multipliers(n) = 2 + 3 · n
• Other platforms available: Xilinx Spartan-3/3E Starter Kit, Altera DE2 Development Board.
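The linear scaling formulas on this slide can be evaluated directly. The following sketch tabulates the estimated resource usage for 1 to 9 cores; the class and method names are hypothetical, and the LUT and FF figures are approximations, as on the slide.

```java
// Evaluates the per-resource scaling formulas from the ML505 slide.
// LUT and FF counts are approximate (the slide gives them with "≈").
final class ShapResources {

    static long luts(int n)        { return 2794 + 2831L * n; }
    static long ffs(int n)         { return 1933 + 1447L * n; }
    static int bram18(int n)       { return 1 + 2 * n; }
    static int bram36(int n)       { return 1 + 3 * n; }
    static int multipliers(int n)  { return 2 + 3 * n; }

    public static void main(String[] args) {
        for (int n = 1; n <= 9; n++) {
            System.out.printf("n=%d LUTs=%d FFs=%d BRAM18=%d BRAM36=%d MULT=%d%n",
                n, luts(n), ffs(n), bram18(n), bram36(n), multipliers(n));
        }
    }
}
```

The constant term in each formula is the shared infrastructure (memory manager, garbage collector, I/O), and the slope is the cost of one additional core.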

Relative Speed-Up
• Measured on: Xilinx ML505 Virtex-5 Development Board.
• Pipelined ZBT SRAM with 32-bit data bus.
[Plot: speed-up S versus number of cores n (1 to 9) for Queens, Lift, FScriptME, and SparseMatmultInt.]

Area Efficiency
• Absolute / relative area of LUTs, FFs, BRAMs and multipliers is unknown.
• Speed-Up in relation to count of BRAMs on ML505.
[Plot: area efficiency AE (0 to 1.4) versus number of cores n (1 to 9) for Queens, Lift, FScriptME, and SparseMatmultInt.]
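The slide does not spell out the area-efficiency formula, so the following is only one plausible reading: the speed-up divided by the growth in BRAM count relative to a single core. It further assumes that the BRAM count is the sum of the 18 kbit and 36 kbit block counts from the ML505 implementation slide. The speed-up value in `main` is a placeholder, not a measured result.

```java
// Hypothetical reading of "speed-up in relation to count of BRAMs":
//   AE(n) = S(n) / (BRAMs(n) / BRAMs(1))
// Assumption: BRAMs(n) = (1 + 2n) 18-kbit blocks + (1 + 3n) 36-kbit blocks.
final class AreaEfficiency {

    static int brams(int n) {
        return (1 + 2 * n) + (1 + 3 * n);
    }

    static double areaEfficiency(double speedup, int n) {
        double relativeArea = (double) brams(n) / brams(1);
        return speedup / relativeArea;
    }

    public static void main(String[] args) {
        double assumedSpeedup = 8.0; // placeholder, not taken from the plot
        System.out.println("AE(9) = " + areaEfficiency(assumedSpeedup, 9));
    }
}
```

Because the shared infrastructure is paid for only once, the relative area grows more slowly than n, so a near-linear speed-up gives an area efficiency above 1 under this reading.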

Comparison Against Related Projects
• Other Java bytecode multi-cores: JopCMP, jamuth IP multi-core, REALJava.
• Comparison of absolute performance, etc.
• Using same platform.
⇒ JopCMP on Altera DE2 Development Board.

Comparison Against JopCMP: Absolute Performance on DE2 Board
• SHAP @ 60 MHz, JopCMP @ 90 MHz
• Asynchronous SRAM with only 16-bit data bus.
[Plot: executions per second (0 to 140000) versus number of cores n (1 to 9) for SHAP ML505, SHAP DE2, JopCMP DE2 with 2 KB I-cache, and JopCMP DE2 with 1 KB I-cache.]

Comparison Against JopCMP: Chip-Space on DE2 Board
• SHAP implements GC in hardware.
• SHAP-core requires about 23% more LEs than a JopCMP core.

Synchronization Performance
• Sync. limits speed-up ⇒ Keep as short / rare as possible.
• Highly application / API dependent.
• Typical: synchronized blocks.
  – Long blocking periods for field update(s).
  – Only small amount for atomic bus access.
  Example: java.util.concurrent.LinkedBlockingQueue.put()
  [Timing diagram for the example: phases labeled Prepare, Update, Invoke, A, R; segment values 46, 119, 6, 118, 1, 10.]
• Alternative: compare and swap.
  – Short blocking period: only atomic bus access.
  – Complex code.
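The trade-off between the two approaches can be illustrated with a shared counter. This is a minimal sketch using the standard `java.util.concurrent.atomic.AtomicInteger` class from the JDK; whether the SHAP class library offers the same API is not stated in the slides, and the class `SyncVsCas` is hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Same counter, updated once under a monitor and once with compare-and-swap.
final class SyncVsCas {
    private int lockedCount = 0;
    private final AtomicInteger casCount = new AtomicInteger();

    // Monitor-based: other threads block for the whole read-modify-write.
    void incrementLocked() {
        synchronized (this) {
            lockedCount++;
        }
    }

    int lockedValue() {
        synchronized (this) {
            return lockedCount;
        }
    }

    // CAS-based: only the compareAndSet itself is atomic; on a conflict
    // the loop re-reads and retries, which is the "complex code" part.
    void incrementCas() {
        int seen;
        do {
            seen = casCount.get();
        } while (!casCount.compareAndSet(seen, seen + 1));
    }

    int casValue() {
        return casCount.get();
    }

    // Runs both variants from several threads and returns the counter object.
    static SyncVsCas runDemo(int threads, int perThread) {
        SyncVsCas counter = new SyncVsCas();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    counter.incrementLocked();
                    counter.incrementCas();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", ex);
        }
        return counter;
    }

    public static void main(String[] args) {
        SyncVsCas c = runDemo(4, 10_000);
        System.out.println(c.lockedValue() + " " + c.casValue());
    }
}
```

Both variants end with the same count; the difference is how long a competing thread is held up on each update, which is the quantity the slide's timing diagram breaks down.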

Summary
• Multi-core Java bytecode processor with shared heap.
• Multi-port MMU with pipelined transactions.
• Chip-space scales linearly.
• Application mix:
  – Typical: 10% memory bandwidth utilization.
  – Almost linear speed-up for up to 9 cores.
  – Area efficiency of ≈ 120% (> 1 core).
• Better absolute performance than related project JopCMP.
http://shap.inf.tu-dresden.de
