Large-scale Computation
Nathan Lam z5113345
Sophie Calland z5161776
Stephen Webb z5075569
Contents
● Introduction + Benefits
  ○ Sophie Calland
● Parallelisation
  ○ Nathan Lam
● Memory Access
  ○ Stephen Webb
Introduction + Benefits
Sophie Calland z5161776
● Large scale computation
  ○ Benefits of FPGAs
  ○ Disadvantages of FPGAs
  ○ Hybrid approaches
● Example: Supercomputers
  ○ CPU-based
  ○ Hybrid approach
● Pipelining review
What is Large Scale Computation?
● Tool to speed up calculation of a complex problem, or to process a large amount of data [4]
  ○ Useful in science, technology, finance, space/defence, academia, etc.
● Solutions need to be large scale + highly performant + cheap!
  ○ The amount of data requiring processing is only getting larger
● Modern FPGAs can help to meet many of these requirements
  ○ Provide acceleration for specific functions
  ○ Reprogrammable = cheap, flexible
● Hybrid CPU/FPGAs make large scale computation accessible to embedded systems
FPGA Benefits for Large Scale Computation
● Smaller devices require the ability to perform complex calculations fast
● Low latency
  ○ Deterministic, very specialised
    ■ GPU = must communicate via a CPU, buses
    ■ CPU = 50 microseconds is good
    ■ FPGA = at or below 1 microsecond
  ○ No Operating System to go through
  ○ Useful if you want quick calculation + response

Very cool: F-35s contain FPGAs. Picture from https://nationalinterest.org/blog/the-buzz/lockheed-martins-f-35-how-the-joint-strike-fighter-becoming-24259
FPGA Benefits (cont)
● Reprogrammability
  ○ Remove/fix bugs
  ○ Change accelerators per application
  ○ Reusable
● Data connections
  ○ Data sources can be connected directly to the chip
    ■ No intermediary bus or OS as required by CPU/GPU designs
  ○ Potential for much higher bandwidth (and lower latency)
FPGA Disadvantages
● Memory locality and sharing can be more complex
  ○ FPGA chips alone don’t have a lot of on-board memory
  ○ Larger data sets = might not be worth it alone
● Engineering effort is greater
  ○ Cost, time
  ○ Might not be worth it
● Power increases relative to specialised ASICs
  ○ Bitcoin mining

Bitcoin mining: specialised ASICs are better than previously used FPGAs [7]
Hybrid CPU/FPGA Approaches
● Can address some pitfalls of CPU-only or FPGA-only approaches
  ○ Latency sensitive tasks and data processing delegated to FPGA
  ○ CPU has better memory locality
  ○ Best of both worlds? Kind of …
    ■ Power and engineering effort are still concerns
● Embedded systems with high data throughput and lower space requirements can benefit
  ○ Example: space computers, smart cameras

CHREC Space Processor v1.0 board [10]
Example - High Performance Computing
● Older supercomputers = massively parallel CPU-based architecture
● Nodes communicate via an interconnected bus
● High memory throughput
● Example: IBM’s Blue Gene [9]
Example - High Performance Computing
● Modern supercomputers provide a hybrid approach
● Allows for hardware acceleration via FPGA
  ○ Faster compute time
● Example: Cygnus supercomputer [8] (pictured), OpenPOWER Foundation
Pipelining Review
● Problems that can be broken up into independent tasks can benefit
  ○ No pipelining = wasted time + slow
  ○ Pipelining = faster
● Increases throughput (see the timing sketch below)
● Can reduce latency for concurrent and independent tasks

Figures from [6]
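To make the figures' point concrete, here is a minimal timing model of a pipeline; the stage latencies and task count are hypothetical:

```python
# Sketch: unpipelined vs. pipelined execution time for a stream of tasks.
# Stage latencies (arbitrary time units) and the task count are assumptions.
STAGES = [3, 3, 3]   # e.g. fetch, compute, write-back
N_TASKS = 8

# No pipelining: each task passes through every stage before the next starts.
sequential_time = N_TASKS * sum(STAGES)

# Pipelining: once the pipe is full, one task completes per slowest-stage tick.
pipelined_time = sum(STAGES) + (N_TASKS - 1) * max(STAGES)

print(f"sequential: {sequential_time} units")   # 72
print(f"pipelined:  {pipelined_time} units")    # 30
print(f"gain: {sequential_time / pipelined_time:.1f}x")  # 2.4x
```

Note that the latency of any single task is unchanged (the pipe is still 9 units deep); it is throughput that improves, matching the bullets above.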
Parallelisation
Nathan Lam z5113345
● Types of Parallelisation
  ○ Inter-FPGA
  ○ Intra-FPGA
● Divide and Conquer Algorithms
  ○ Merge Sort
  ○ Paul’s Algorithm
  ○ Map-reduce
Parallelisation: Inter-FPGA

Advantages:
● Higher degree of parallelisation (distribute the problem to more FPGAs)
● Ability to handle larger amounts of data

Disadvantages:
● More challenging memory management (synchronisation)
● Larger overhead in coordinating the cluster of FPGAs
● More potential bottlenecks (local network, CPU-FPGA bus)
Parallelisation: Intra-FPGA

Advantages:
● No overhead in managing the FPGA (data always goes through the same bus to the same FPGA)
● Faster on smaller scales (less time pre-processing)

Disadvantages:
● Limited to the computational power of a single FPGA
● Can still be challenging to manage memory (if datasets are too large for on-chip memory)
Parallelisation: Algorithm Examples
● Merge sort
● Map-reduce
Parallelisation: Merge Sort
● Using the FPGA alone causes large overhead when transferring data to and from memory during merging
● Pipeline the algorithm into 3 parts (sketched below):
  1. CPU partitions the data into sub-blocks
  2. FPGA sorts the sub-blocks of data (using quick-sort)
  3. CPU merges the sorted sub-blocks back together
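A software-only sketch of this three-stage split; the block size is arbitrary and Python's built-in sort stands in for the on-chip quick-sort:

```python
import random
from heapq import merge

def hybrid_merge_sort(data, block_size=1024):
    # 1. CPU: partition the data into sub-blocks.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # 2. 'FPGA': sort each sub-block independently (quick-sort on chip,
    #    emulated here with Python's built-in sort).
    sorted_blocks = [sorted(b) for b in blocks]
    # 3. CPU: k-way merge of the sorted sub-blocks back together.
    return list(merge(*sorted_blocks))

data = [random.randint(0, 10**6) for _ in range(10_000)]
assert hybrid_merge_sort(data) == sorted(data)
```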
Parallelisation: Merge Sort
● The hybrid solution has the highest throughput
● For smaller datasets, a higher share of execution time is spent on the FPGA
● For larger datasets, a higher share of execution time is spent on the CPU
Parallelisation: Map-Reduce [5]
● Hadoop Map-Reduce algorithm
● One use-case is the k-means algorithm, an unsupervised machine learning model (a one-iteration sketch follows)
● Uses clusters of computers with FPGAs
● Each node in the cluster has its own CPU and FPGA resources, connected over PCIe

Y. Choi and H. K. So, "Map-reduce processing of k-means algorithm with FPGA-accelerated computer cluster," 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, Zurich, 2014, pp. 9-16.
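As a reference point, one k-means iteration expressed as a map step and a reduce step (a pure-Python sketch; in [5] the distance computations of the map stage are what the FPGAs accelerate):

```python
from collections import defaultdict

def kmeans_map(points, centroids):
    """Map: emit (nearest_centroid_index, point) for every point."""
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        yield idx, p

def kmeans_reduce(mapped):
    """Reduce: average each cluster's points to get the new centroids."""
    groups = defaultdict(list)
    for idx, p in mapped:
        groups[idx].append(p)
    return {idx: tuple(sum(c) / len(pts) for c in zip(*pts))
            for idx, pts in groups.items()}

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans_reduce(kmeans_map(points, centroids)))
# {0: (1.25, 1.5), 1: (8.5, 8.75)}
```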
Parallelisation: Map-Reduce [5]
● Measurements taken with 3 compute nodes and 1 head node
● Up to 20.6x speedup compared to the software version on Hadoop
● Up to 16.3x speedup compared to the Mahout version on Hadoop
● The same number of mappers spread across 3 FPGAs consistently outperforms 1 FPGA, attributed to the reduced bandwidth requirement for each node
Memory Access
Stephen Webb z5075569
● Overview of the issues with memory access in LSC
● Paper 1
  ○ Problem Space
  ○ Solution
● Paper 2
  ○ Problem Space
  ○ Paul’s Algorithm
  ○ Solution
● Other Paper
Overview of the issues with memory access in LSC

Need for a large amount of memory:
● LSC is all about large data
● Dealing with tasks that have datasets in the GB range
● Unfeasible to store it all in FPGA memory (usually 100s of kB)

Some requirements for LSC:
● Need to be able to fetch the data at reasonable bandwidth
● Fast random reads and writes
● Multiple parallel reads and writes
● The system bus is slow in comparison

Using a direct algorithm conversion to hardware without regard to memory bandwidth saw a slowdown of 33x compared to a pure software solution in paper 2 [2].
Paper 1: High Throughput Large Scale Sorting on a CPU-FPGA Heterogeneous Platform [1]
Paper 1: Problem Space
● CPU-FPGA heterogeneous platform
● Regular multicore CPU
● Coherent memory interfaces (both FPGA and CPU)
● High speed interconnect
● DRAM is accessed through cache lines
Paper 1: Solution

Shared memory, using the CPU last-level cache as a buffer:
● Ensure the block size lines up with the cache lines
● Fetch the block’s data from the CPU cache line
● Sort the block
● Write the block back to the cache line

A sketch of this blocking scheme follows. This technique has seen about a 2-3x improvement compared to an FPGA-only implementation [2].
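A minimal sketch of the blocking idea; the sizes are illustrative (64-byte lines, 32-bit keys) and the names are placeholders, but the point is that every fetch and write-back moves whole cache lines:

```python
CACHE_LINE_BYTES = 64                             # typical x86 line size
ELEM_BYTES = 4                                    # 32-bit sort keys
ELEMS_PER_LINE = CACHE_LINE_BYTES // ELEM_BYTES   # 16 keys per line
BLOCK_LINES = 8                                   # illustrative block size
BLOCK_ELEMS = BLOCK_LINES * ELEMS_PER_LINE        # 128 keys per block

def sort_blocks_via_llc(data):
    """Process data block-by-block; each block is a whole number of cache
    lines, so every transfer through the last-level cache is a full line."""
    runs = []
    for start in range(0, len(data), BLOCK_ELEMS):
        block = data[start:start + BLOCK_ELEMS]   # fetch block (whole lines)
        block = sorted(block)                     # 'FPGA' sorts the block
        runs.append(block)                        # write the block back
    return runs                                   # sorted runs for a merge pass
```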
Paper 2: An Efficient O(1) Priority Queue for Large FPGA-Based Discrete Event Simulations of Molecular Dynamics [2]
Paper 2: Problem Space
● Separate host computer and FPGA
● SMP configuration
● A PCI connection is sufficient
● Several 32-bit SRAM banks, accessed independently and in parallel
Paper 2: Paul’s Algorithm

Algorithm:
● Designed to speed up priority queues
● Discretely break up the time series data into different segments for different ranges of Δt
● Each segment is stored as an unsorted list
● Only sort a segment just before it is about to be used

Data structure:
● Ordered list of all segments
● Each segment is an unordered list of events
● Segments are limited to a finite size
● Both the ordered list and the segments should be stored as linked lists to allow them to be dynamic

A small sketch of this structure follows.
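A minimal sketch of the structure, using plain Python lists in place of the linked lists; the segment width and the API names are assumptions:

```python
import bisect

class PaulsQueue:
    """Ordered list of segments; each segment is an unsorted list of
    (time, event) pairs, sorted only when it is about to be consumed."""

    def __init__(self, dt=0.5):
        self.dt = dt        # time range covered by each segment
        self.segments = {}  # segment index -> unsorted event list
        self.order = []     # sorted indices of live segments

    def insert(self, time, event):
        idx = int(time // self.dt)
        if idx not in self.segments:
            self.segments[idx] = []
            bisect.insort(self.order, idx)
        self.segments[idx].append((time, event))  # O(1): no sorting on insert

    def pop_min(self):
        idx = self.order[0]
        seg = self.segments[idx]
        seg.sort()          # sort the head segment only, just in time
        item = seg.pop(0)
        if not seg:
            del self.segments[idx]
            self.order.pop(0)
        return item

q = PaulsQueue()
for t, e in [(0.9, "b"), (0.1, "a"), (1.7, "c")]:
    q.insert(t, e)
print([q.pop_min() for _ in range(3)])  # [(0.1, 'a'), (0.9, 'b'), (1.7, 'c')]
```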
Paper 2: Solution

Data structure constraints:
● Convert the linked lists into discrete arrays
● Limit the max size of both the segments and the ordered list
● Keep the size of the segments as small as possible (around 20)

Drawbacks:
● Not flexible, and cannot adapt well to change
● Need to know the segment size beforehand

This allows for prefetching and caching within the queue.
Paper 2: Solution

FIFO pre-fetch / round-robin writeback. During every cycle:
● Fetch the next segment, at the next address of the ordered list, from off-chip memory
● Store the last fetched segment into SRAM
● Retrieve the oldest segment in SRAM and send it to the queue sorter
● Retrieve the sorted queue from the queue sorter and write it to SRAM
● Write back a sorted queue from SRAM to off-chip memory

Trading off latency for bandwidth. A software model of this schedule follows.
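A software model of that per-cycle schedule (the buffer arrangement and memory model are assumptions); in hardware the three transfers in the loop body happen concurrently, so the sorter is never left waiting on off-chip memory:

```python
from collections import deque

def run_pipeline(offchip_segments):
    fetch_q = deque(offchip_segments)  # segments still in off-chip memory
    sram_in = deque()                  # SRAM: fetched, awaiting sorting
    sram_out = deque()                 # SRAM: sorted, awaiting write-back
    written_back = []                  # off-chip memory after write-back

    while fetch_q or sram_in or sram_out:
        # One 'cycle'; these three steps overlap in hardware.
        if fetch_q:
            sram_in.append(fetch_q.popleft())        # pre-fetch into SRAM
        if sram_out:
            written_back.append(sram_out.popleft())  # write back to off-chip
        if sram_in:
            sram_out.append(sorted(sram_in.popleft()))  # sort oldest segment
    return written_back

print(run_pipeline([[3, 1, 2], [9, 7], [5, 4, 6]]))
# [[1, 2, 3], [7, 9], [4, 5, 6]]
```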