Programming of hierarchic array processors: The physical layer
  1. Programming of hierarchic array processors: The physical layer (February 18, 2013)

  2. Problems of many core
     - Cores need to communicate with the memory
     - Memory bandwidth needs to be shared between the cores in an optimal way
     - Cores need to communicate and synchronize with each other (see the sketch after this slide)
     - The solutions to these problems give rise to the different many-core architectures
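A minimal sketch of what "cores communicating and synchronizing" can look like on one concrete many-core target, assuming a CUDA-style GPU; the kernel name, the block size of 256 and the reduction-then-atomic pattern are illustrative choices, not something prescribed by the slides.

```cuda
// Threads cooperate through fast on-chip shared memory and must synchronize
// explicitly; communication between blocks goes through global memory,
// here via an atomic add.
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float partial[256];                 // per-block scratch memory
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;         // each thread loads one element
    __syncthreads();                               // synchronize threads inside the block

    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction in shared memory
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, partial[0]);      // blocks communicate via global memory
}
```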

  3-5. Why architecture matters
     General parameters
     - FLOPS: Floating-point Operations Per Second
     - MIPS: Million Instructions Per Second
     - Memory bandwidth (GB/s)
     - Chip area
     - Power consumption
     General parameters cannot give real information about how fast an algorithm can run on the architecture.
     The more general an architecture, the more inefficient it tends to be.
     A strictly serial (and irreducible) algorithm on a parallel architecture can be orders of magnitude slower!

  6. Why architecture matters
     - The memory access pattern has a huge impact on performance
     - Some architectures have a very strict preferred access pattern (see the sketch below)
     - Thread communication is limited and in many cases can become a bottleneck
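As an illustration of a "strict preferred access pattern", here is a hedged CUDA sketch: both kernels copy the same amount of data, but only the first matches the coalesced pattern GPUs prefer. The kernel names and the stride parameter are mine, purely for illustration.

```cuda
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                  // neighbouring threads touch neighbouring
                                                // addresses: one wide memory transaction
}

__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];   // same amount of data, but scattered
                                                // addresses: many partial transactions
}
```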

  7. Architectures
     Turing completeness
     - Every architecture should be Turing complete (in a sense) to be useful for general-purpose processing
     - This means it can emulate any other Turing machine (or any other machine) and run any algorithm
     - In practice this is not enough, because it tells us nothing about how fast an algorithm can run
     Maximal algorithmic speed
     - Here we are interested in the maximal achievable speed of a given algorithm on a given architecture
     - We usually allow any modification of the algorithm as long as the outputs match for all inputs
     - Computing this speed is mathematically impossible, so in practice we have to guess it
     - Knowing the fine details of the architecture is essential for any meaningful guess

  8. Programmed machines
     - We use a Universal Turing Machine, so the program can be stored on the Tape
     - We allow the machine to read from anywhere on the Tape, i.e. the Tape is addressed
     - We separate reading data from the Tape from reading instructions
     - What we get is effectively a Programmed Machine, where the Tape is the memory and the program is stored as instructions

  9. Data and Instructions
     Data
     - Data is the information we want to process, or have already processed
     - We store data in "registers" right before and after processing
     - The "register" is the fastest type of memory, nearest to the processing units (ALU, FPU); usually the most limited and least flexible, but by far the fastest memory resource we have
     Instructions
     - Instructions define what the machine should do with the data it is given
     - The instruction set is a very descriptive feature of an architecture
     - Data-flow instructions: only process data; executed on processing units (FPU, ALU)
     - Control-flow instructions: change the flow of control inside the program (branches, loops); executed on control units
     - Memory-access instructions: reads and/or writes of various kinds (see the annotated kernel below)
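To make the three instruction categories concrete, here is a trivial CUDA kernel with comments noting which category each line roughly compiles to; the kernel is a standard SAXPY, chosen by me only as an example.

```cuda
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // data-flow (integer ALU)
    if (i < n) {                                    // control-flow (branch / predication)
        float xi = x[i];                            // memory access: global load into a register
        float yi = y[i];                            // memory access: global load into a register
        y[i] = a * xi + yi;                         // data-flow (FPU multiply-add), then a store
    }
}
```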

  10. Memory hierarchy
     - Most practical architectures employ multiple levels of memory to mitigate the weak points of the physical realizations
     - The general idea is to combine small, fast memories with bigger but slower ones, where the fast memory is generally orders of magnitude more expensive than the slow one
     - Registers: innermost memory, fastest but a very limited resource; runs at core clock speed or double speed
     - Local memory: addressable but small, only accessible from nearby cores
     - L1 cache: most architectures have one; an associative memory, used with a cache-coherence algorithm to simulate global access
     - L2 cache: bigger and slower than the L1 cache
     - System RAM: usually Dynamic Random Access Memory (DRAM)
     - System storage: the slowest and biggest memory
     (A sketch that touches several of these levels follows below.)
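A CUDA sketch touching several levels of the hierarchy. The mapping of slide terms to CUDA terms is my assumption: registers correspond to local variables, the slide's "local memory" to __shared__ memory, and system RAM to global (device DRAM) memory; the kernel and its sizes are illustrative.

```cuda
__global__ void blur3(const float* in, float* out, int n)   // assumes blockDim.x == 256
{
    __shared__ float tile[258];                    // fast on-chip memory, shared by nearby cores
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    if (i < n) tile[lid] = in[i];                  // one slow global (DRAM) read per thread
    if (threadIdx.x == 0 && i > 0)                  tile[0]       = in[i - 1];   // left halo
    if (threadIdx.x == blockDim.x - 1 && i < n - 1) tile[lid + 1] = in[i + 1];   // right halo
    __syncthreads();

    if (i > 0 && i < n - 1) {
        float v = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;  // registers only
        out[i] = v;                                // one global (DRAM) write
    }
}
```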

  11. DRAM is not truly random access
     - Dynamic RAM cells are slow to read, and reading is destructive
     - Lines in the dynamic capacitor array need to be prepared for a read operation
     - DRAM can only achieve high bandwidth by reading/writing big parts of the array at the same time
     - Knowing where the next access will happen is also helpful
     - These requirements mean that we can only access DRAM efficiently in big chunks (bursts)
     - Architectures usually map these chunks into their memory address space in a predefined way
     - Any efficient access pattern must follow this mapping! (see the traversal sketch below)
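A host-side sketch of following versus fighting that mapping: both functions read the same data and do the same work, but one loop order walks the array in the linear, burst-friendly order and the other jumps by a whole row per access. Function names and the row-major layout are my illustrative choices.

```cuda
#include <cstddef>
#include <vector>

float sumRowMajor(const std::vector<float>& m, std::size_t rows, std::size_t cols)
{
    float s = 0.0f;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];      // consecutive addresses: each DRAM burst is fully used
    return s;
}

float sumColMajor(const std::vector<float>& m, std::size_t rows, std::size_t cols)
{
    float s = 0.0f;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];      // stride of `cols` floats: most of each burst is wasted
    return s;
}
```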

  12. Memory architectures
     - DRAM: dynamic RAM; bits are stored in capacitors, reading is slow and destructive, cells must be refreshed
     - SRAM: static RAM; bits are stored in bistable circuits, access is fast but SRAM is expensive
     - Multi-port: SRAM can be multi-ported, i.e. it has more than one port through which it can be read/written in parallel
     - Associative: caches are usually associative, fine-tuned to the cache-coherence algorithm; the granularity of the associativity is very important. GPUs have texture caches, where the granularity is 2D tiles of a texture. (A banked-SRAM sketch follows below.)
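A related on-chip SRAM detail on GPUs, sketched under the assumption of current NVIDIA hardware where shared memory is split into 32 banks of 4 bytes: accesses are only truly parallel when the threads of a warp hit different banks. The kernel and buffer size are illustrative.

```cuda
__global__ void bankDemo(float* out)     // assumes 32 threads (one warp) per block
{
    __shared__ float buf[32 * 32];
    int t = threadIdx.x;

    buf[t] = (float)t;                   // each thread in its own bank: conflict-free
    __syncthreads();

    float a = buf[t];                    // conflict-free read
    float b = buf[t * 32];               // all threads hit bank 0: accesses are serialized
    out[t] = a + b;
}
```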

  13. Fetch-Decode-Execute cycle
     Fetch
     - Fetch the instruction from memory into the instruction register; usually served by the instruction cache
     - This step is not always trivial, because instructions can have different sizes
     Decode
     - Often a simple lookup in a table that contains the breakdown of the instruction into parts for the various processing units (register I/O, FPU, ALU, memory I/O)
     - For x86 this stage is very complex, because it needs to translate the x86 instruction into the core's internal RISC instruction set
     - For GPUs this stage is partially missing: the instruction is wired almost directly to the processing units, with very little logic in between

  14. Fetch-Decode-Execute cycle
     Execute
     - Actually doing what the instruction instructs the core to do
     - In many cases it is preferred that this step take only a single clock cycle
     - It can usually be broken down into: read registers, read memory, do the processing, write memory and/or registers
     (A toy software model of the whole cycle follows below.)
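A toy software model of the whole cycle; the instruction set, encoding and register count are entirely made up for illustration and have nothing to do with any real architecture.

```cuda
#include <cstdint>
#include <cstdio>

enum Op : uint8_t { LOADI, ADD, JNZ, HALT };
struct Instr { Op op; int8_t a, b, c; };       // a = destination / test register

int main()
{
    int32_t reg[4] = {0};
    // Program: compute r1 = 7 + 7 + 7 (r0 counts down from 3, r3 holds -1).
    Instr program[] = {
        { LOADI, 0, 3, 0 },   // r0 = 3
        { LOADI, 2, 7, 0 },   // r2 = 7
        { LOADI, 3, -1, 0 },  // r3 = -1
        { ADD,   1, 1, 2 },   // r1 = r1 + r2
        { ADD,   0, 0, 3 },   // r0 = r0 - 1
        { JNZ,   0, 3, 0 },   // if r0 != 0 jump to instruction 3
        { HALT,  0, 0, 0 },
    };

    int pc = 0;
    for (;;) {
        Instr ins = program[pc++];          // FETCH: read the next instruction, advance pc
        switch (ins.op) {                   // DECODE: pick the unit / action
        case LOADI: reg[ins.a] = ins.b;                   break;  // EXECUTE on the "ALU"
        case ADD:   reg[ins.a] = reg[ins.b] + reg[ins.c]; break;
        case JNZ:   if (reg[ins.a] != 0) pc = ins.b;      break;  // control flow
        case HALT:  printf("r1 = %d\n", reg[1]); return 0;
        }
    }
}
```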

  15. Pipelining
     - Most of the time the steps of the FDE cycle cannot be done in a single cycle; sometimes they take more than 10 cycles
     - The basic idea of pipelining is that the different steps are done on different units (or different parts of a unit), so they can run in parallel
     - Everything happens at the same time, but on different data; the data shifts between the pipeline stages
     - Fully pipelined logic effectively does one operation per cycle, as long as it has enough operations/data to keep it filled (see the sketch below)
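A host-side sketch of "enough operations to keep the pipeline filled": a floating-point add takes several cycles, so a single dependent chain of adds leaves most pipeline stages idle, while several independent accumulators can occupy them all. The function names and the choice of four accumulators are illustrative.

```cuda
#include <cstddef>
#include <vector>

float sumChained(const std::vector<float>& v)
{
    float s = 0.0f;
    for (float x : v) s += x;               // each add must wait for the previous one
    return s;
}

float sumPipelined(const std::vector<float>& v)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // four independent dependency chains
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];      s1 += v[i + 1];    // these adds can overlap in the FPU pipeline
        s2 += v[i + 2];  s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];   // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```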

  16. Pipelining and scale-down
     - Scale-down means smaller wires and transistors: it decreases the signal speed on wires but increases the speed of transistors
     - Wires are much slower than transistors at the moment
     - Signal speed strictly limits the distance reachable within a single cycle
     - Combined with increasing clock speeds, the amount of logic reachable within a single cycle decreases
     - These factors force us to use longer and longer pipelines to fully utilize the processing power

  17. Pipelining: the smart solution
     - x86 does a HUGE amount of processing to figure out the intentions and future of the code it is running
     - Loops are tracked, and statistics are collected about every loop and branch
     - Knowing the (likely) future helps to keep the pipeline filled with instructions yet to be executed
     - It also reorders instructions on the fly, based on their data-flow graph, to run them in parallel on multiple ALUs/FPUs
     - It hides the parallelism inside its core architecture
     - The algorithms that achieve this are among the best-protected trade secrets of x86 technology
     - The programmer has almost zero worries about pipelining (though the branch statistics can still be observed, see the sketch below)
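A standard host-side demonstration (my example, not from the slides) of those branch statistics at work: the same loop runs much faster on sorted data, where every branch outcome is predictable, than on random data, where the predictor keeps missing.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

long long countBig(const std::vector<int>& v)
{
    long long n = 0;
    for (int x : v)
        if (x >= 128) ++n;                    // this is the branch the predictor has to guess
    return n;
}

int main()
{
    std::vector<int> data(1 << 24);
    for (int& x : data) x = std::rand() % 256;

    long long a = countBig(data);             // random outcomes: many mispredictions
    std::sort(data.begin(), data.end());
    long long b = countBig(data);             // sorted: the branch becomes predictable
    printf("%lld %lld\n", a, b);              // same counts, very different running time
    return 0;
}
```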

  18. Pipelining: the dumb solution
     - GPUs use a radically different approach from x86 architectures
     - Pipelines are filled in the most primitive way: by scheduling different threads into different stages of the pipeline
     - This means the minimum number of threads needed to fully utilize the architecture is the number of cores multiplied by the pipeline depth (see the sketch below)
     - It is easy to ensure a good pipeline fill this way, as long as we have enough threads
     - Memory operations can hold up threads, but if we use a few more threads we can keep feeding the pipeline while some threads are waiting for memory
     - Global memory operations are implemented, implicitly or explicitly, as messages, and we can choose when to wait for completion; this way memory access and computation can overlap even inside a single thread
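A back-of-the-envelope sketch of that thread count, plus a launch that deliberately puts far more threads in flight than there are cores. The core count and pipeline depth are made-up example numbers, not figures from the slides.

```cuda
#include <cstdio>

__global__ void work(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int cores         = 2048;                     // assumed number of execution lanes
    const int pipelineDepth = 20;                       // assumed pipeline depth in cycles
    const int minThreads    = cores * pipelineDepth;    // lower bound for full utilization
    printf("need at least %d threads in flight\n", minThreads);

    const int n     = 1 << 22;                 // launch far more threads than cores, so
    const int block = 256;                     // waiting threads can hide memory latency
    const int grid  = (n + block - 1) / block;

    float* d = nullptr;
    cudaMalloc((void**)&d, n * sizeof(float));
    work<<<grid, block>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```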

  19. Many cores: Topology vs. hierarchy
     Topological many-cores
     - Memory and inter-core communication are organized into a topology
     - Very good fit for processing data with the given topology: usually very computation-efficient, with potentially high memory bandwidth
     - Very poor fit for different or non-topological computations
     - Examples: CNNs, systolic arrays
