Large-scale Computation
Nathan Lam z5113345
Sophie Calland z5161776
Stephen Webb z5075569
Contents
● Introduction + Benefits
  ○ Sophie Calland
● Parallelisation
  ○ Nathan Lam
● Memory Access
  ○ Stephen Webb
Introduction + Benefits
Sophie Calland z5161776
● Large scale computation
  ○ Benefits of FPGAs
  ○ Disadvantages of FPGAs
  ○ Hybrid approaches
● Example: Supercomputers
  ○ CPU-based
  ○ Hybrid approach
● Pipelining review
What is Large Scale Computation?
● Tool to speed up calculation of a complex problem, or to process a large amount of data [4]
  ○ Useful in science, technology, finance, space/defence, academia, etc.
● Solutions need to be large scale + highly performant + cheap!
  ○ The amount of data requiring processing is only getting larger
● Modern FPGAs can help to meet many of these requirements
  ○ Provide acceleration for specific functions
  ○ Reprogrammable = cheap, flexible
● Hybrid CPU/FPGAs make large scale computation accessible to embedded systems
FPGA Benefits for Large Scale Computation
● Smaller devices require the ability to perform complex calculations fast
● Low latency
  ○ Deterministic, very specialised
    ■ GPU = must communicate via a CPU, buses
    ■ CPU = 50 microseconds is good
    ■ FPGA = at or below 1 microsecond
  ○ No Operating System to go through
  ○ Useful if you want quick calculation + response

Very cool: F-35s contain FPGAs. Picture from https://nationalinterest.org/blog/the-buzz/lockheed-martins-f-35-how-the-joint-strike-fighter-becoming-24259
FPGA Benefits (cont)
● Reprogrammability
  ○ Remove/fix bugs
  ○ Change accelerators per application
  ○ Reusable
● Data connections
  ○ Data sources can be connected directly to the chip
    ■ No intermediary bus or OS as required by CPU/GPU designs
  ○ Potential for much higher bandwidth (and lower latency)
FPGA Disadvantages
● Memory locality and sharing can be more complex
  ○ FPGA chips alone don’t have a lot of on-board memory
  ○ Larger data sets = might not be worth it alone
● Engineering effort is greater
  ○ Cost, time
  ○ Might not be worth it
● Power increases relative to specialised ASICs
  ○ Bitcoin mining

Bitcoin mining: specialised ASICs are better than previously used FPGAs [7]
Hybrid CPU/FPGA Approaches
● Can address some pitfalls of CPU-only or FPGA-only approaches
  ○ Latency sensitive tasks and data processing delegated to FPGA
  ○ CPU has better memory locality
  ○ Best of both worlds? Kind of …
    ■ Power and engineering effort are still concerns
● Embedded systems with high data throughput and lower space requirements can benefit
  ○ Example: space computers, smart cameras

CHREC Space Processor v1.0 board [10]
Example - High Performance Computing
● Older supercomputers = massively parallel CPU-based architecture
● Nodes communicate via an interconnected bus
● High memory throughput
● Example: IBM’s Blue Gene [9]
Example - High Performance Computing
● Modern supercomputers provide a hybrid approach
● Allows for hardware acceleration via FPGA
  ○ Faster compute time
● Example: Cygnus supercomputer [8] (pictured), OpenPOWER Foundation
Pipelining Review
● Problems that can be broken up into independent tasks can benefit
  ○ No pipelining = wasted time + slow
  ○ Pipelining = faster
● Increases throughput (see the timing sketch below)
● Can reduce latency for concurrent and independent tasks

Figures from [6]
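To make the figures' point concrete, here is a minimal timing model of a pipeline; the stage latencies and task count are hypothetical:

```python
# Sketch: unpipelined vs. pipelined execution time for a stream of tasks.
# Stage latencies (arbitrary time units) and the task count are assumptions.
STAGES = [3, 3, 3]   # e.g. fetch, compute, write-back
N_TASKS = 8

# No pipelining: each task passes through every stage before the next starts.
sequential_time = N_TASKS * sum(STAGES)

# Pipelining: once the pipe is full, one task completes per slowest-stage tick.
pipelined_time = sum(STAGES) + (N_TASKS - 1) * max(STAGES)

print(f"sequential: {sequential_time} units")   # 72
print(f"pipelined:  {pipelined_time} units")    # 30
print(f"gain: {sequential_time / pipelined_time:.1f}x")  # 2.4x
```

Note that the latency of any single task is unchanged (the pipe is still 9 units deep); it is throughput that improves, matching the bullets above.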
Parallelisation
Nathan Lam z5113345
● Types of Parallelisation
  ○ Inter-FPGA
  ○ Intra-FPGA
● Divide and Conquer Algorithms
  ○ Merge Sort
  ○ Paul’s Algorithm
  ○ Map-reduce
Parallelisation: Inter-FPGA

Advantages:
● Higher degree of parallelisation (distribute the problem to more FPGAs)
● Ability to handle larger amounts of data

Disadvantages:
● More challenging memory management (synchronisation)
● Larger overhead in coordinating the cluster of FPGAs
● More potential bottlenecks (local network, CPU-FPGA bus)
Parallelisation: Intra-FPGA

Advantages:
● No overhead in managing the FPGA (data always goes through the same bus to the same FPGA)
● Faster on smaller scales (less time pre-processing)

Disadvantages:
● Limited to the computational power of a single FPGA
● Can still be challenging to manage memory (if datasets are too large for on-chip memory)
Parallelisation: Algorithm Examples
● Merge sort
● Map-reduce
Parallelisation: Merge Sort
● Using the FPGA alone causes large overhead when transferring data to and from memory during merging
● Pipeline the algorithm into 3 parts (sketched below):
  1. CPU partitions the data into sub-blocks
  2. FPGA sorts the sub-blocks of data (using quick-sort)
  3. CPU merges the sorted sub-blocks back together
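A software-only sketch of this three-stage split; the block size is arbitrary and Python's built-in sort stands in for the on-chip quick-sort:

```python
import random
from heapq import merge

def hybrid_merge_sort(data, block_size=1024):
    # 1. CPU: partition the data into sub-blocks.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # 2. 'FPGA': sort each sub-block independently (quick-sort on chip,
    #    emulated here with Python's built-in sort).
    sorted_blocks = [sorted(b) for b in blocks]
    # 3. CPU: k-way merge of the sorted sub-blocks back together.
    return list(merge(*sorted_blocks))

data = [random.randint(0, 10**6) for _ in range(10_000)]
assert hybrid_merge_sort(data) == sorted(data)
```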
Parallelisation: Merge Sort
● The hybrid solution has the highest throughput
● For smaller datasets, a higher share of execution time is spent on the FPGA
● For larger datasets, a higher share of execution time is spent on the CPU
Parallelisation: Map-Reduce [5]
● Hadoop Map-Reduce algorithm
● One use-case is the k-means algorithm, an unsupervised machine learning model (a one-iteration sketch follows)
● Uses clusters of computers with FPGAs
● Each node in the cluster has its own CPU and FPGA resources, connected over PCIe

Y. Choi and H. K. So, "Map-reduce processing of k-means algorithm with FPGA-accelerated computer cluster," 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, Zurich, 2014, pp. 9-16.
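As a reference point, one k-means iteration expressed as a map step and a reduce step (a pure-Python sketch; in [5] the distance computations of the map stage are what the FPGAs accelerate):

```python
from collections import defaultdict

def kmeans_map(points, centroids):
    """Map: emit (nearest_centroid_index, point) for every point."""
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        yield idx, p

def kmeans_reduce(mapped):
    """Reduce: average each cluster's points to get the new centroids."""
    groups = defaultdict(list)
    for idx, p in mapped:
        groups[idx].append(p)
    return {idx: tuple(sum(c) / len(pts) for c in zip(*pts))
            for idx, pts in groups.items()}

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans_reduce(kmeans_map(points, centroids)))
# {0: (1.25, 1.5), 1: (8.5, 8.75)}
```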
Parallelisation: Map-Reduce [5]
● Measurements taken with 3 compute nodes and 1 head node
● Up to 20.6x speedup compared to the software version on Hadoop
● Up to 16.3x speedup compared to the Mahout version on Hadoop
● The same number of mappers spread across 3 FPGAs consistently outperforms 1 FPGA, attributed to the reduced bandwidth requirement for each node
Memory Access
Stephen Webb z5075569
● Overview of the issues with memory access in LSC
● Paper 1
  ○ Problem Space
  ○ Solution
● Paper 2
  ○ Problem Space
  ○ Paul’s Algorithm
  ○ Solution
● Other Paper
Overview of the issues with memory access in LSC

Need for a large amount of memory:
● LSC is all about large data
● Dealing with tasks that have datasets in the GB range
● Unfeasible to store it all in FPGA memory (usually 100s of kB)

Some requirements for LSC:
● Need to be able to fetch the data at reasonable bandwidth
● Fast random reads and writes
● Multiple parallel reads and writes
● The system bus is slow in comparison

Using a direct algorithm conversion to hardware without regard to memory bandwidth saw a slowdown of 33x compared to a pure software solution in paper 2 [2].
Paper 1: High Throughput Large Scale Sorting on a CPU-FPGA Heterogeneous Platform [1]
Paper 1: Problem Space
● CPU-FPGA heterogeneous platform
● Regular multicore CPU
● Coherent memory interfaces (both FPGA and CPU)
● High speed interconnect
● DRAM is accessed through cache lines
Paper 1: Solution

Shared memory, using the CPU last-level cache as a buffer:
● Ensure the block size lines up with the cache lines
● Fetch the block’s data from the CPU cache line
● Sort the block
● Write the block back to the cache line

A sketch of this blocking scheme follows. This technique has seen about a 2-3x improvement compared to an FPGA-only implementation [2].
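A minimal sketch of the blocking idea; the sizes are illustrative (64-byte lines, 32-bit keys) and the names are placeholders, but the point is that every fetch and write-back moves whole cache lines:

```python
CACHE_LINE_BYTES = 64                             # typical x86 line size
ELEM_BYTES = 4                                    # 32-bit sort keys
ELEMS_PER_LINE = CACHE_LINE_BYTES // ELEM_BYTES   # 16 keys per line
BLOCK_LINES = 8                                   # illustrative block size
BLOCK_ELEMS = BLOCK_LINES * ELEMS_PER_LINE        # 128 keys per block

def sort_blocks_via_llc(data):
    """Process data block-by-block; each block is a whole number of cache
    lines, so every transfer through the last-level cache is a full line."""
    runs = []
    for start in range(0, len(data), BLOCK_ELEMS):
        block = data[start:start + BLOCK_ELEMS]   # fetch block (whole lines)
        block = sorted(block)                     # 'FPGA' sorts the block
        runs.append(block)                        # write the block back
    return runs                                   # sorted runs for a merge pass
```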
Paper 2: An Efficient O(1) Priority Queue for Large FPGA-Based Discrete Event Simulations of Molecular Dynamics [2]
Paper 2: Problem Space
● Separate host computer and FPGA
● SMP configuration
● A PCI connection is sufficient
● Several 32-bit SRAM banks, accessed independently and in parallel
Paper 2: Paul’s Algorithm

Algorithm:
● Designed to speed up priority queues
● Discretely break up the time series data into different segments for different ranges of Δt
● Each segment is stored as an unsorted list
● Only sort a segment just before it is about to be used

Data structure:
● Ordered list of all segments
● Each segment is an unordered list of events
● Segments are limited to a finite size
● Both the ordered list and the segments should be stored as linked lists to allow them to be dynamic

A small sketch of this structure follows.
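A minimal sketch of the structure, using plain Python lists in place of the linked lists; the segment width and the API names are assumptions:

```python
import bisect

class PaulsQueue:
    """Ordered list of segments; each segment is an unsorted list of
    (time, event) pairs, sorted only when it is about to be consumed."""

    def __init__(self, dt=0.5):
        self.dt = dt        # time range covered by each segment
        self.segments = {}  # segment index -> unsorted event list
        self.order = []     # sorted indices of live segments

    def insert(self, time, event):
        idx = int(time // self.dt)
        if idx not in self.segments:
            self.segments[idx] = []
            bisect.insort(self.order, idx)
        self.segments[idx].append((time, event))  # O(1): no sorting on insert

    def pop_min(self):
        idx = self.order[0]
        seg = self.segments[idx]
        seg.sort()          # sort the head segment only, just in time
        item = seg.pop(0)
        if not seg:
            del self.segments[idx]
            self.order.pop(0)
        return item

q = PaulsQueue()
for t, e in [(0.9, "b"), (0.1, "a"), (1.7, "c")]:
    q.insert(t, e)
print([q.pop_min() for _ in range(3)])  # [(0.1, 'a'), (0.9, 'b'), (1.7, 'c')]
```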
Paper 2: Solution

Data structure constraints:
● Convert the linked lists into discrete arrays
● Limit the max size of both the segments and the ordered list
● Keep the size of the segments as small as possible (around 20)

Drawbacks:
● Not flexible, and cannot adapt well to change
● Need to know the segment size beforehand

This allows for prefetching and caching within the queue.
Paper 2: Solution

FIFO pre-fetch / round-robin writeback. During every cycle:
● Fetch the next segment, at the next address of the ordered list, from off-chip memory
● Store the last fetched segment into SRAM
● Retrieve the oldest segment in SRAM and send it to the queue sorter
● Retrieve the sorted queue from the queue sorter and write it to SRAM
● Write back a sorted queue from SRAM to off-chip memory

Trading off latency for bandwidth. A software model of this schedule follows.
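A software model of that per-cycle schedule (the buffer arrangement and memory model are assumptions); in hardware the three transfers in the loop body happen concurrently, so the sorter is never left waiting on off-chip memory:

```python
from collections import deque

def run_pipeline(offchip_segments):
    fetch_q = deque(offchip_segments)  # segments still in off-chip memory
    sram_in = deque()                  # SRAM: fetched, awaiting sorting
    sram_out = deque()                 # SRAM: sorted, awaiting write-back
    written_back = []                  # off-chip memory after write-back

    while fetch_q or sram_in or sram_out:
        # One 'cycle'; these three steps overlap in hardware.
        if fetch_q:
            sram_in.append(fetch_q.popleft())        # pre-fetch into SRAM
        if sram_out:
            written_back.append(sram_out.popleft())  # write back to off-chip
        if sram_in:
            sram_out.append(sorted(sram_in.popleft()))  # sort oldest segment
    return written_back

print(run_pipeline([[3, 1, 2], [9, 7], [5, 4, 6]]))
# [[1, 2, 3], [7, 9], [4, 5, 6]]
```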