15/04/2015
Giorgio Buttazzo
Scuola Superiore Sant’Anna, Pisa

The transition

On May 17th, 2004, Intel, the world’s largest chip maker, canceled the development of the Tejas processor, the successor of the Pentium 4-style Prescott processor.

On July 27th, 2006, Intel announced the official release of the Core 2 Duo processor family.

Since then, all major chip producers have switched from single-core to multicore platforms. This phenomenon is known as the multicore revolution.

The reason why this happened has to do with a market law, predicted by Gordon Moore, Intel’s co-founder, in 1965, and known as Moore’s Law.

Moore’s Law

The number of transistors per chip doubles every 24 months.

[Figure: number of transistors per chip (log scale, 1 K to 10 G) versus year (1970 to 2015), from the 4004 through the 8086, 286, 386, 486 and Pentium generations up to the dual-core Itanium 2.]

Gate reduction

Moore’s Law was made possible by the progressive reduction of transistor dimensions.

[Figure: gate length (nm, 0 to 500) versus year (1990 to 2020).]

Benefits of size reduction

There are two main benefits of reducing transistor size:
1. a higher number of gates can fit on a chip;
2. devices can operate at a higher frequency. In fact, if the distance between gates is reduced, signals have to cover a shorter path and the time for a state transition decreases, allowing a higher clock speed.

However…

At the launch of the Pentium 4, Intel expected single-core chips to scale up to 10 GHz using gates below 90 nm. However, the fastest Pentium 4 never exceeded 4 GHz.

Why did that happen?

Power dissipation

The main reason is related to power dissipation in CMOS integrated circuits, which is mainly due to two causes:
- Dynamic power ($P_d$), consumed during operation;
- Static power ($P_s$), consumed even when the circuit is not switching.

[Figure: CMOS inverter, with a P-MOS transistor connected to $V_{dd}$, an N-MOS transistor connected to Gnd, input $V_{in}$, output $V_{out}$ and load capacitance $C_L$.]
Dynamic power

Dynamic power is mainly consumed during logic state transitions to charge and discharge the load capacitance $C_L$. It can be expressed by:

$$P_d = C_L \, f \, V_{dd}^2$$

where $f$ is the clock frequency.

[Figure: CMOS inverter showing the switching current $I_{sw}$ and the short-circuit current $I_{sc}$.]

Static power

Static power is due to a quantum phenomenon in which mobile charge carriers (electrons or holes) tunnel through an insulating region, creating a leakage current $I_{lk}$:

$$P_s = V_{dd} \, I_{lk}$$

It is independent of the switching activity and is always present if the circuit is on.

As devices scale down in size, the gate oxide thickness decreases, resulting in a larger leakage current.

[Figure: CMOS inverter showing the leakage current $I_{lk}$.]

Dynamic vs. static power

[Figure: normalized power (log scale, 10^-6 to 10^2) versus year (1990 to 2020) and gate length (500, 350, 250, 180, 130, 90, 65, 45, 22 nm), with one curve for dynamic power and one for static power (leakage). Static power becomes significant at 90 nm.]

Power and Heat

A side effect of power consumption is heat, which, if not properly dissipated, can damage the chip.

$$P = C_L \, f \, V_{dd}^2 + V_{dd} \, I_{lk}$$

Scaling down, both $f$ and $I_{lk}$ increased.

If processor performance had kept improving through higher clock frequencies, the chip temperature would have reached levels beyond the capability of current cooling systems.

Heating problem

[Figure: power density (W/cm^2, log scale, 0.1 to 1000) versus year (1972 to 2008) for Intel processors from the 4004 to the Pentium 4, with the power density of a nuclear reactor as a reference level. Annotations: Pentium Tejas cancelled; clock speed limited to less than 4 GHz.]

Keeping Moore’s Law alive

The solution followed by the industry to keep Moore’s Law alive was to use a higher number of slower logic gates, building parallel devices that work at lower clock frequencies.

In other words…

Switch to Multicore Systems!
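The power model above can be made concrete with a short numerical sketch. It evaluates $P = C_L f V_{dd}^2 + V_{dd} I_{lk}$ for one fast core and for two slower cores. The component values (capacitance, voltages, leakage current), the function names, and the assumption that a lower clock frequency allows a lower supply voltage are all illustrative assumptions, not figures from the slides.

```python
def dynamic_power(c_load, freq, vdd):
    """P_d = C_L * f * V_dd^2, in watts."""
    return c_load * freq * vdd ** 2


def static_power(vdd, i_leak):
    """P_s = V_dd * I_lk, in watts."""
    return vdd * i_leak


def total_power(c_load, freq, vdd, i_leak, cores=1):
    """Total power of `cores` identical cores, each modeled as P_d + P_s."""
    per_core = dynamic_power(c_load, freq, vdd) + static_power(vdd, i_leak)
    return cores * per_core


# Hypothetical values, chosen only to keep the arithmetic easy to follow.
C_LOAD = 10e-9  # effective switched capacitance per core, farads (assumed)
I_LEAK = 2.0    # leakage current per core, amperes (assumed)

# One core at 4 GHz and 1.2 V.
single_fast = total_power(C_LOAD, 4e9, 1.2, I_LEAK)

# Two cores at 2 GHz each; the lower frequency is assumed to allow 0.9 V.
dual_slow = total_power(C_LOAD, 2e9, 0.9, I_LEAK, cores=2)

print(f"one core  @ 4 GHz, 1.2 V: {single_fast:.1f} W")
print(f"two cores @ 2 GHz, 0.9 V: {dual_slow:.1f} W")
```

With these assumed numbers the aggregate clock rate is the same in both configurations, but the two slower cores dissipate less in total, because dynamic power grows with the square of $V_{dd}$. The saving is only useful if the software can keep both cores busy.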
Keeping Moore’s Law alive

- The number of transistors continued to increase according to Moore’s Law;
- clock speed and performance experienced a saturation effect.

[Figure: number of transistors (1 K to 10 G) and clock speed (100 kHz to 10 GHz), both on log scales, versus year (1970 to 2015).]

How to exploit multiple cores?

The efficient exploitation of multicore platforms poses a number of new problems that are still being addressed by the research community.

When porting a real-time application from a single core to a multicore platform, the following key issues have to be addressed:
- How to split the code into parallel segments that can be executed simultaneously?
- How to allocate such segments to the different cores?

Expressing parallelism

In a multicore system, sequential languages (such as C/C++) are no longer appropriate to specify the application programs.

In fact, a sequential language hides the intrinsic concurrency that must be exploited to improve the performance of the system.

To really exploit hardware redundancy, most of the code has to be parallelized.

A big problem for industry

Parallelizing legacy code implies a tremendous cost and effort for industries, mainly due to:
- re-designing the application
- re-writing the source code
- updating the operating system
- writing new documentation
- testing the system
- software certification

To avoid such costs, the cheapest solution is to port the software to a multicore platform but run it on a single core, disabling all the other cores.

A big problem for industry

However, due to the clock speed saturation effect, a core in a multicore chip is slower than a single-core chip:

[Figure: an Intel Pentium 4 Prescott (clock: 3.8 GHz) next to an Intel Core i7 (clock: 2.5 GHz) with one core ON and the other three cores OFF.]

If the application workload was already high, running the application on a single core of a multicore chip creates an overload condition.

To avoid such problems, avionic industries buy in advance enough components to ensure maintenance for 30 years!

Other problems

In a single-core system, concurrent tasks are sequentially executed on the processor, hence the access to physical resources is implicitly serialized (e.g., two tasks can never cause a contention for a simultaneous memory access).

In a multicore platform, different tasks can run simultaneously on different cores, hence several conflicts can arise while accessing physical resources.

Such conflicts not only introduce interference on task execution but also increase the Worst-Case Execution Time (WCET) of each task.
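The overload condition that arises when legacy software is run on a single, slower core can be illustrated with a simple utilization check. The sketch assumes that execution times scale inversely with the clock frequency (a simplification that ignores differences in microarchitecture and memory speed) and uses a hypothetical task set. Only the two clock frequencies, 3.8 GHz and 2.5 GHz, come from the slides. The optional `wcet_inflation` parameter is also an assumption, standing in for the WCET increase caused by contention on shared resources.

```python
def utilization(tasks, slowdown=1.0, wcet_inflation=1.0):
    """Total processor utilization, i.e. the sum of C_i / T_i.

    tasks          -- list of (wcet, period) pairs in the same time unit
    slowdown       -- factor by which every WCET grows on a slower core
    wcet_inflation -- extra factor modeling interference on shared resources
    """
    return sum(wcet * slowdown * wcet_inflation / period
               for wcet, period in tasks)


F_OLD = 3.8e9  # Pentium 4 Prescott clock (from the slides)
F_NEW = 2.5e9  # clock of one core of the multicore chip (from the slides)
slowdown = F_OLD / F_NEW  # assumes execution time scales as 1/f

# Hypothetical task set: (WCET, period) in milliseconds.
tasks = [(2.0, 10.0), (5.0, 20.0), (10.0, 40.0)]

u_old = utilization(tasks)
u_new = utilization(tasks, slowdown=slowdown)

print(f"utilization on the 3.8 GHz single core: {u_old:.2f}")
print(f"utilization on one 2.5 GHz core:        {u_new:.2f}")
print("overload" if u_new > 1.0 else "no overload")
```

With the assumed task set, a load of about 70% on the older chip exceeds 100% on one slower core, so no scheduling algorithm can meet all deadlines there. Under this scaling assumption, any workload above 2.5/3.8 (about 66%) of the old core would overload the new one.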
The WCET issue

The fundamental assumption

Existing RT analysis assumes that the worst-case execution time (WCET) of a task is constant, whether it is executed alone or together with other tasks.

While this assumption is correct for single-core chips, it is NOT true for multicore chips!

WCET in multicore

Test by Lockheed Martin Space Systems on an 8-core platform:
- competing with 1 core can double the WCET;
- the WCET can be 6 times larger.

[Figure: normalized WCET (0 to 6) versus number of active cores (1 to 8), with two curves: "Benchmark" and "Cache locked (255 pages)".]

Questions

[Figure: normalized WCET (0 to 6) versus number of active cores (1 to 8), with two curves: "Benchmark" and "Cache locked (255 pages)".]

- Why does the WCET increase up to 6 times?
- Why is the WCET on 8 cores lower than the WCET on 7 cores?
- What does this mean for system development, integration and certification?

There are multiple reasons

The WCET increases because of the competition among cores in using shared resources:
- Main memory
- Memory bus
- Last-level cache
- I/O devices

Competition creates extra delays:
- waiting for other tasks to release the resource;
- waiting for accessing the resource.

On a single CPU, only one task can run at a time, so applications cannot saturate memory and I/O bandwidth.

To better understand the causes of interference, we need to take a quick look at modern computer architectures.

Types of Memory

There are typically three types of memory used in a computer:
- Primary storage (DRAM)
- Secondary storage (Disk)
- Cache (SRAM)

[Figure: block diagram with CPU, Cache (SRAM), BUS, Primary storage (DRAM) and Secondary storage (Disk).]

Primary Storage

It is referred to as main memory or internal memory, and is directly accessible to the CPU.

It is volatile, which means that it loses its content if power is removed.

Primary storage includes RAM (based on DRAM technology), Cache and CPU registers (based on SRAM technology):
- DRAM (Dynamic random-access memory) needs to be periodically refreshed (re-read and re-written), otherwise its content would vanish.
- SRAM (Static random-access memory) never needs to be refreshed as long as power is applied.