HW/SW Codesign with FPGAs (ECE 495/595)
Microprocessors/Embedded Cores
(Slides from Patrick Schaumont's course notes)

The most successful programmable component of the past decades is, without doubt, the microprocessor. Just about any electronic device more complicated than a pushbutton seems to contain one. There have been a number of drivers for the popularity of the microprocessor.

• Microprocessors come with tools (compilers and assemblers) that help a designer create applications. The availability of a compiler that automatically translates code into a binary for a microprocessor is an enormous advantage for development: it decouples the design of the application software from the application hardware. An embedded software designer can therefore be proficient in a single programming language such as C, and this alone allows him to move seamlessly across different microprocessor architectures.
No hardware development technique has ever been able to decouple, in a similar way, the design of an application from its implementation. Even micro-programming requires significant knowledge of the underlying architecture.

• Very few devices have been able to cope as efficiently with reuse as microprocessors have. A general-purpose embedded core is by itself an excellent example of reuse. Moreover, microprocessors have also dictated the rules of reuse for electronic system design in general: they have provided bus protocols that enable the physical integration of an entire system, and their compilers have enabled the development of standard software libraries.
• No other programmable component has the same scalability as the microprocessor. The same concept (the stored-program computer) has been implemented across a wide range of word-lengths (4-bit through 64-bit) and basic architecture types. In addition, microprocessors have extended their reach to entire chips containing many other components, while staying 'in command' of the system. This approach is commonly called System-on-Chip (SoC).

Given that entire courses exist on the topic of microprocessors, we need to be very selective about the topics we cover in a single lecture. We will therefore focus specifically on issues relevant to hardware/software codesign, including:
• Forms of parallelism in microprocessors
• The RISC pipeline and pipeline hazards
• A brief overview of the MicroBlaze processor
• Dynamic program analysis, with examples on the SA-110 (StrongARM)
Forms of Parallelism in Microprocessors

A microprocessor is a machine for sequential execution of operations. Internally, however, the microprocessor architecture enables parallel execution of these operations where possible. This parallelism can be classified into three categories.

• Bit-level parallelism
Standard microprocessors are 32-bit or wider, even though shorter word-lengths (4-bit, 8-bit, 16-bit) are still in use for low-performance/low-power applications. The standard operation on such a microprocessor therefore processes 32 bits at a time. In an actual application, the natural word-length may match, be larger than, or be smaller than those 32 bits. For example, multimedia applications at 8 bits per pixel and internet packet formatting are naturally formulated at byte length, while certain cryptographic applications may require hundreds of bits per number.
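To make the mismatch concrete, here is a minimal C sketch of processing four 8-bit pixels packed into one 32-bit word on a 32-bit processor. The masking trick (sometimes called "SIMD within a register") is a standard one; the function name and pixel values are invented for illustration.

    #include <stdint.h>
    #include <stdio.h>

    /* Add the same offset to each of the four 8-bit pixels packed into a
     * 32-bit word, masking bit 7 of every byte so that carries cannot
     * spill from one pixel into the next (results wrap modulo 256). */
    static uint32_t add_to_pixels(uint32_t pixels, uint8_t offset)
    {
        uint32_t off4 = offset * 0x01010101u;          /* replicate offset into all 4 bytes */
        uint32_t low  = (pixels & 0x7F7F7F7Fu) + (off4 & 0x7F7F7F7Fu);
        uint32_t high = (pixels ^ off4) & 0x80808080u; /* recombine the top bit of each byte */
        return low ^ high;
    }

    int main(void)
    {
        uint32_t word = 0x10203040u;                   /* pixels 0x10, 0x20, 0x30, 0x40 */
        printf("%08x\n", add_to_pixels(word, 5));      /* prints 15253545 */
        return 0;
    }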
In either case, it is up to the designer to match the natural word-length of the application to the word-length provided by the processor, and this is not an easy task. It is well known, for example, that microprocessors perform badly on 'bit twiddling' operations such as the one above, while custom hardware excels at them.

• Operator-level parallelism
After an application is decomposed into operations, one wants to exploit all opportunities for concurrent execution. Microprocessors offer several mechanisms to map these operations onto parallel operators. Of the three described below, most practical processors tend to use one, but not all, of them.
• SIMD: Single Instruction, Multiple Data. This type of parallelism is useful for processing multiple data values in parallel with the same instruction. High-end Intel processors provide a SIMD unit (MMX, and its successors) for multimedia processing. SIMD instructions are not a by-product of the compiler: usually the designer has to write assembly code, or call compiler intrinsics, to use these instructions explicitly (see the sketch below).

• Pipelining: pipelined operators enable overlapped execution of instructions by means of pipeline registers. Modern compilers are well equipped to deal with pipelining and its complexities.
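As an illustration of the SIMD idea, the sketch below uses x86 SSE2 intrinsics (a later generation of the MMX family mentioned above, declared in <emmintrin.h>) to add sixteen pairs of 8-bit pixels with a single vector instruction. The data values are invented for the example; on GCC or Clang, compile with -msse2 on 32-bit x86 (SSE2 is enabled by default on x86-64).

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t a[16], b[16], r[16];
        for (int i = 0; i < 16; i++) { a[i] = (uint8_t)(i * 10); b[i] = 5; }

        /* Load 16 bytes from each array, add the pairs with unsigned
         * saturation, and store the 16 results: one add instruction
         * operates on all sixteen 8-bit values at once. */
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vr = _mm_adds_epu8(va, vb);
        _mm_storeu_si128((__m128i *)r, vr);

        for (int i = 0; i < 16; i++)
            printf("%u ", r[i]);
        printf("\n");
        return 0;
    }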
• VLIW: Very Long Instruction Word. Here, a set of operators is available in the microprocessor, and each VLIW instruction invokes several of these operators in parallel. It is non-trivial to map a program onto VLIW instructions in a way that takes full advantage of the available parallelism; this is usually done by a compiler.

• Task-level parallelism
Although real systems consist of concurrent tasks, there are almost no generally accepted techniques for implementing parallel tasks in a microprocessor, with the exception of hyper-threading. As a result, microprocessors rely on threading software to provide a sequential implementation of concurrent tasks.
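As a small illustration of threading software standing in for true task-level parallelism, the POSIX-threads sketch below runs two independent "tasks"; on a single-core processor the operating system interleaves them in time, so the concurrency is logical rather than physical. The task names and workloads are invented for the example; link with -lpthread.

    #include <pthread.h>
    #include <stdio.h>

    /* A trivial task body: each task just prints a few progress messages. */
    static void *task(void *arg)
    {
        const char *name = (const char *)arg;
        for (int i = 0; i < 3; i++)
            printf("%s: step %d\n", name, i);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        /* Two concurrent tasks; the scheduler time-slices them on one core. */
        pthread_create(&t1, NULL, task, "sensor_task");
        pthread_create(&t2, NULL, task, "control_task");

        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }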
On the other hand, it is generally agreed that task-level parallelism is the next big target for microprocessors. This type of parallelism will leverage multi-processor system-on-chip (MPSoC) technology.

A very convincing argument for this hypothesis is provided by Dezso Sima.

[Figure: microarchitecture efficiency of Intel processors (SPECint92 per 100 MHz, roughly 0.5 to 2.0) versus year of first volume shipment, 1985-1999, for the i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III.]

The figure shows the efficiency of Intel processors after normalization to their clock frequency.
It can be seen that since the mid-1990s this efficiency (i.e. the number of operations completed per clock cycle) has remained relatively constant. In other words, processor performance improvements of the past decade are not due to better architectures, but rather to increases in processor clock speed. Multi-processor architectures, which implement task-level parallelism, are a solution to this problem.

RISC Pipeline: Operation and Hazards

We now focus on one particular form of processor parallelism: the RISC pipeline. What follows is an overview of a generic RISC pipeline structure. The figure below shows a five-stage pipeline in which standard instructions take 5 clock cycles to complete (this is the instruction latency).
[Figure: five-stage RISC pipeline -- Instruction Fetch (instruction memory read), Instruction Decode (instruction decode, register read, next-PC evaluation), Execute (datapath / custom datapath operation), Buffer/Memory (data memory read/write), and Write-back (register write) -- with a pipeline register, drawn as a dashed line, between consecutive stages.]

Each dashed line represents a pipeline register.

Instruction Fetch: an instruction is retrieved from memory or the instruction cache.

Instruction Decode: the instruction is decoded and its register operands are fetched. Branch instructions modify the PC during this phase.
Execute: the operands are fed into the datapath operators and the operation is carried out.

Buffer (Memory): the data memory is accessed using an address generated in the execute phase.

Write-back: the registers are updated to reflect the final result of the instruction execution.

In the ideal case, the architecture above completes 1 instruction per clock cycle (this is the instruction throughput). Even though the instruction latency is 5 clock cycles, the pipeline enables overlapped execution of instructions to increase throughput.

The clock cycle time is limited by the slowest stage in the pipeline, plus the overhead of the pipeline registers (clock skew and setup time). If a pipeline stage is too slow, additional pipeline stages can be added, spreading the computation over multiple clock cycles; doing so also extends the instruction latency.
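These relationships can be made concrete with a little arithmetic: under ideal conditions (no stalls), N instructions on a k-stage pipeline finish in k + (N - 1) cycles, and the clock period is the slowest stage delay plus the pipeline-register overhead. The C sketch below works through one example; the stage delays and register overhead are invented numbers, not taken from any particular processor.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical per-stage delays for IF, ID, EX, MEM, WB (in ns). */
        double stage_ns[5] = { 2.0, 1.5, 3.0, 2.5, 1.0 };
        double reg_ns      = 0.3;      /* pipeline-register skew + setup */
        long   n_instr     = 1000;

        double slowest = 0.0, total = 0.0;
        for (int i = 0; i < 5; i++) {
            total += stage_ns[i];
            if (stage_ns[i] > slowest)
                slowest = stage_ns[i];
        }

        double t_clk  = slowest + reg_ns;              /* pipelined clock period            */
        double t_pipe = (5 + (n_instr - 1)) * t_clk;   /* 5-cycle latency, then 1 per cycle */
        double t_seq  = n_instr * (total + reg_ns);    /* unpipelined: one instruction at a time */

        printf("clock period    : %.1f ns\n", t_clk);
        printf("pipelined time  : %.1f ns\n", t_pipe);
        printf("sequential time : %.1f ns\n", t_seq);
        printf("speedup         : %.2fx\n", t_seq / t_pipe);
        return 0;
    }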