Pattern-guided Big Data Processing on Hybrid Parallel Architectures
Fahad Khalid, Frank Feinbube, and Andreas Polze
Operating Systems and Middleware Group

Motivation
• Insights from developing simulations for:
  – Enumeration of Elementary Flux Modes in Metabolic Networks
  – Prediction of aftershocks following earthquakes
  – Prediction of volcanic events
  – Adiabatic Quantum Computing
• Collaborations
  – Max Planck Institute of Molecular Plant Physiology
  – GFZ German Research Center for Geosciences
September 25, 2014 Frank Feinbube | BigSys 2014

Motivation
• Complications with Hybrid Architectures
  – Memory hierarchy per processor type
  – Designed for high FLOP/s, not Big Data
• Assuming the available hardware is hybrid:
  – How can we improve both the performance and the productivity of a simulation that must process very large data sets?

Definitions
• Performance
  – Significant speedup
• Productivity
  – Ease of development
• Hybrid Architecture
  – One or more CPUs = Host
  – One or more accelerators, e.g., GPUs = Device

Efficient Hybrid-Resource Utilization (EHRU)
• Design Approach
  – Hierarchical application of patterns for parallel programming
• Expected Outcome
  – Improved simulation performance
  – Improved productivity, by serving as a foundation for:
    • Frameworks
    • Automation tools

Parallel Pipeline Pattern
[Figure: a stream of input items (3 1 7 0 4 9 5 4 ⋯) passing through three stages 𝑇1, 𝑇2, 𝑇3. Top: serial processing of stages. Bottom: pipelined processing of stages.]

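The difference between serial and pipelined processing of stages can be estimated with a small timing model. This is a minimal sketch, assuming that every stage takes the same time per item and that the pipeline never stalls; both are simplifications that the slides do not state, and the function names are illustrative only.

```python
def serial_time(num_items: int, num_stages: int, stage_time: float) -> float:
    """Every item passes through all stages before the next item starts."""
    return num_items * num_stages * stage_time


def pipelined_time(num_items: int, num_stages: int, stage_time: float) -> float:
    """Stages work on different items at the same time once the pipeline is full."""
    if num_items == 0:
        return 0.0
    # The first item needs num_stages steps; each later item finishes one step after.
    return (num_stages + num_items - 1) * stage_time


if __name__ == "__main__":
    # Eight items and three stages, as in the figure; one time unit per stage.
    print(serial_time(8, 3, 1.0))     # 24.0
    print(pipelined_time(8, 3, 1.0))  # 10.0
```

Under these assumptions the speedup approaches the number of stages as the stream gets longer.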
Parallel Pipeline Pattern
• Simulation as Pipeline
  1. Read input data from file
  2. Analytical solutions to 3D Partial Differential Equations in Vectors
  3. Numerical solution to a System of Linear Equations
  4. Write output data to file

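The structure of such a pipeline can be sketched with one worker thread per stage and a queue between neighbouring stages. The stage functions below are hypothetical placeholders, not the simulation's actual kernels, and Python threads are used only to show the structure; they are not expected to reproduce the performance of a native implementation.

```python
import queue
import threading

_DONE = object()  # sentinel telling a stage that no more items will arrive


def run_stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, stage_funcs):
    """Run stage_funcs as a pipeline with one thread per stage; order is preserved."""
    queues = [queue.Queue() for _ in range(len(stage_funcs) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stage_funcs)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results


def compute_vectors(x):
    """Placeholder for the analytical-solution stage."""
    return [x, 2 * x]


def solve_system(vec):
    """Placeholder for the numerical-solution stage."""
    return sum(vec)


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3], [compute_vectors, solve_system]))  # [3, 6, 9]
```

Reading input and writing output would be the first and last entries of `stage_funcs` in a fuller version.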
Data Partitioning
• Motivation
  – Main memory and cache sizes are limited
• Factors affecting partitioning
  – Total memory required/available
  – Impact of partition size on pipeline performance
[Figure: partition P1 labeled "Out of Memory"; sub-partitions P1,1, P1,2, and P1,3 labeled "OK". Complete dataset split into Partition 0, Partition 1, ⋯, which are in turn split into chunks.]

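Partitioning by available memory can be sketched as follows. This is a minimal illustration, assuming fixed-size items and a known memory budget in bytes; the helper names and the numbers in the usage example are hypothetical and not taken from the slides.

```python
def max_items_for(mem_bytes: int, item_bytes: int) -> int:
    """Largest number of fixed-size items that fits into mem_bytes."""
    return mem_bytes // item_bytes


def split_range(start: int, stop: int, max_items: int):
    """Split [start, stop) into contiguous ranges of at most max_items items."""
    if max_items < 1:
        raise ValueError("a single item does not fit into the memory budget")
    return [(lo, min(lo + max_items, stop)) for lo in range(start, stop, max_items)]


if __name__ == "__main__":
    # 10 items of 8 bytes each with a 32-byte budget: 4 items per partition.
    partitions = split_range(0, 10, max_items_for(32, 8))
    print(partitions)  # [(0, 4), (4, 8), (8, 10)]
    # Each partition is split again into chunks of at most 2 items.
    chunks = [split_range(lo, hi, 2) for lo, hi in partitions]
    print(chunks)  # [[(0, 2), (2, 4)], [(4, 6), (6, 8)], [(8, 10)]]
```

The second factor on the slide, the impact of partition size on pipeline performance, is not modeled here; in practice the chunk size would be tuned by measurement.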
EHRU Pattern Hierarchy

Hybrid Pipeline
• Uses of Hybrid Pipelining
  – Overlapping computation and communication
  – Load balancing and optimal resource utilization
  – Kernel placement based on architecture

Hybrid Pipeline Framework (HyPi)
• HyPi Stages
  – DeviceFilter: CUDA device kernel
  – CallbackFilter: D2H (device-to-host) communication
  – PostProcessFilter: Host processing
[Figure: three pipeline rows, each consisting of ⋯ DeviceFilter, CallbackFilter, PostProcessFilter ⋯]

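The way these three stages let device work overlap with host work can be sketched in plain Python. The functions below are hypothetical stand-ins named after the HyPi stages; they are not HyPi's actual classes or signatures, and no GPU is involved. The "device" is simulated by a single background worker thread.

```python
from concurrent.futures import ThreadPoolExecutor


def device_filter(chunk):
    """Stand-in for a CUDA kernel; here it squares each value on the CPU."""
    return [x * x for x in chunk]


def callback_filter(device_result):
    """Stand-in for the device-to-host copy; here it copies a list."""
    return list(device_result)


def post_process_filter(host_data):
    """Stand-in for host-side processing; here it sums the chunk."""
    return sum(host_data)


def device_side(chunk):
    """Kernel followed by the copy back to the host."""
    return callback_filter(device_filter(chunk))


def run_hybrid(chunks):
    """Submit chunk i+1 to the 'device' before post-processing chunk i on the host."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as device:
        pending = None
        for chunk in chunks:
            upcoming = device.submit(device_side, chunk)
            if pending is not None:
                results.append(post_process_filter(pending.result()))
            pending = upcoming
        if pending is not None:
            results.append(post_process_filter(pending.result()))
    return results


if __name__ == "__main__":
    print(run_hybrid([[1, 2], [3]]))  # [5, 9]
```

The point of the sketch is the ordering: the next chunk is already in flight while the host post-processes the previous one.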
HyPi & EHRU – Evaluation
[Figure: time in seconds (axis from 0 to 60) versus number of candidate vectors generated (500 million, 2 billion, 2.5 billion, 3.5 billion, 4.5 billion, 6.3 billion, 8.1 billion) for three variants: CPU-only Parallel, Custom Pipeline, and HPF Pipeline.]

Feasibility and Limitations of EHRU
• Suitable for
  – Dense Linear Algebra
  – Structured Grids
  – Monte Carlo
• Not suitable for
  – Sparse Linear Algebra
  – Unstructured Grids
  – Graph Traversal

Architecture-based Algorithm Decomposition
• Decompose the algorithm into two parts:
  1. Suitable for execution on the GPU
  2. Suitable for execution on the CPU
• CPUs support a diverse range of kernels
  – Everything goes, except for massive parallelism
• How do we decide which part of the algorithm is suitable for GPUs?
[Figure: an algorithm decomposed into Pattern 1, Pattern 2, ⋮, Pattern 𝑜 − 1, Pattern 𝑜, with the labels Accelerator and CPU.]

Characteristics of Computational Kernels
• Degree of Parallelism (DoP)
  – The amount of parallelism exposed by the kernel
• Arithmetic Intensity
  – Ratio of the number of arithmetic instructions to the number of memory access instructions
• Control Divergence
  – Number and complexity of conditional statements

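Arithmetic intensity, as defined above, can be worked through for a small kernel. The example is an assumption chosen for illustration, not one from the slides: one iteration of `y[i] = a * x[i] + y[i]`, counted by hand at source level with `a` held in a register and no fused multiply-add.

```python
def arithmetic_intensity(arithmetic_instr: int, memory_instr: int) -> float:
    """Ratio of arithmetic instructions to memory access instructions."""
    if memory_instr <= 0:
        raise ValueError("memory_instr must be positive")
    return arithmetic_instr / memory_instr


if __name__ == "__main__":
    # y[i] = a * x[i] + y[i]
    # arithmetic: one multiply, one add      -> 2
    # memory:     load x[i], load y[i], store y[i] -> 3
    print(round(arithmetic_intensity(2, 3), 3))  # 0.667
```

A ratio below one means the kernel issues more memory accesses than arithmetic operations per element.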
Design Patterns and Algorithm Decomposition
• Patterns suitable for GPUs
  – Map
  – Stencil
• Patterns NOT suitable for GPUs
  – Reduce
  – Scan
  – Dynamic Programming
• This categorization is based on Degree of Parallelism

Program Flow with Algorithm Decomposition
[Figure: << Map >> GPU Kernel → Intermediate Result → << Reduce >> CPU Kernel]

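This program flow can be sketched with a dot product split into a Map part and a Reduce part. The function names are hypothetical, and the "GPU kernel" is a CPU stand-in so that the example stays self-contained; only the split into an element-wise Map, an intermediate result, and a Reduce on the host follows the slide.

```python
from functools import reduce
import operator


def gpu_map_kernel(xs, ys):
    """Stand-in for the Map kernel: each output element is independent of the others."""
    return [x * y for x, y in zip(xs, ys)]


def cpu_reduce_kernel(intermediate):
    """Reduce on the host: combine the intermediate result into a single value."""
    return reduce(operator.add, intermediate, 0)


def dot_product(xs, ys):
    intermediate = gpu_map_kernel(xs, ys)   # would run on the device
    return cpu_reduce_kernel(intermediate)  # runs on the host


if __name__ == "__main__":
    print(dot_product([1, 2, 3], [4, 5, 6]))  # 32
```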
Tool-guided Parallelization for Hybrid Architectures
• Motivation
  – Automatically discerning patterns from serial code
  – Efficient mapping of parallel code with EHRU
• How?
  – Dependence Analysis to discern patterns
  – Developer feedback to improve affine transformations
• This is work in progress

Future Work
• Information Theoretic approach to improve serial to parallel transformations
• Partitioning for Complex Data structures
• Automated tool for architecture-based algorithm decomposition

Thank You!