Handling the data deluge Data stored in data centers is growing at a - PowerPoint PPT Presentation

Shredder: GPU-Accelerated Incremental Storage and Computation. Pramod Bhatotia§, Rodrigo Rodrigues§, Akshat Verma¶ (§MPI-SWS, Germany; ¶IBM Research-India). USENIX FAST 2012

Handling the data deluge • Data stored in data centers is growing at a fast pace • Challenge: How to store and process this data? • Key technique: Redundancy elimination • Applications of redundancy elimination • Incremental storage: data de-duplication • Incremental computation: selective re-execution Pramod Bhatotia 2

Redundancy elimination is expensive [Diagram: File -> Chunking -> Chunks -> Hashing -> Hash -> Matching -> Duplicate? (Yes / No)] Content-based chunking [SOSP’01] [Figure: a fingerprint is computed over the file content, and a marker identifies chunk boundaries] For large-scale data, chunking easily becomes a bottleneck
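The content-based chunking step named on this slide can be sketched as follows. This is a minimal illustration, not Shredder's implementation: a simple polynomial rolling hash stands in for the fingerprint function, and the names `chunk_boundaries`, `window`, and `mask` and the base 257 are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns the end offsets (exclusive) of content-defined chunks in `data`.
// A boundary is declared wherever the hash of the last `window` bytes has
// its low bits equal to zero (playing the role of the slide's "marker").
// ASSUMPTION: a polynomial rolling hash replaces the real fingerprint, and
// `window`, `mask`, and the base 257 are illustrative values only.
std::vector<std::size_t> chunk_boundaries(const std::string& data,
                                          std::size_t window = 16,
                                          std::uint64_t mask = 0x3F) {
    assert(window > 0);
    std::vector<std::size_t> ends;
    if (data.empty()) return ends;

    const std::uint64_t base = 257;
    // base^(window-1): weight of the byte that is about to leave the window.
    std::uint64_t top = 1;
    for (std::size_t i = 1; i < window; ++i) top *= base;

    std::uint64_t hash = 0;  // unsigned arithmetic wraps modulo 2^64
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i >= window) {
            // Drop the contribution of the byte sliding out of the window.
            hash -= top * static_cast<unsigned char>(data[i - window]);
        }
        hash = hash * base + static_cast<unsigned char>(data[i]);
        // Only test for a marker once a full window has been seen.
        if (i + 1 >= window && (hash & mask) == 0) {
            ends.push_back(i + 1);
        }
    }
    if (ends.empty() || ends.back() != data.size()) {
        ends.push_back(data.size());  // close the final chunk
    }
    return ends;
}
```

Because each boundary decision depends only on the bytes inside the window, inserting data near the start of a file shifts later boundaries without changing which content they enclose. The per-byte hashing over the whole input is the work that this slide identifies as the bottleneck.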

Accelerate chunking using GPUs GPUs have been successfully applied to compute-intensive tasks [Chart: chunking throughput in GBps (axis 0 to 3) for Multicore and for a GPU-based design (marked "?"), against the 20 Gbps (2.5 GBps) line for storage servers; annotated 2X] Using GPUs for data-intensive tasks presents new challenges

Rest of the talk • Shredder design • Basic design • Background: GPU architecture & programming model • Challenges and optimizations • Evaluation • Case studies • Computation: Incremental MapReduce • Storage: Cloud backup

Shredder basic design [Diagram: on the CPU (Host), a Reader supplies data for chunking and a Transfer stage moves it to the GPU (Device), where the chunking kernel runs; chunked data goes to a Store stage]

GPU architecture [Diagram: the CPU (Host) with host memory connects over PCI to the GPU (Device), which has device global memory and multi-processors 1 to N, each with shared memory and several SPs]

GPU programming model [Diagram: the same architecture, with input moving from host memory over PCI to device global memory, threads running on the SPs of each multi-processor, and output returning to the host]

Scalability challenges 1. Host-device communication bottlenecks 2. Device memory conflicts (See paper for details) 3. Host bottlenecks

Challenge #1: Host-device communication bottleneck [Diagram: Reader (I/O) -> main memory on the CPU (Host) -> Transfer over PCI -> device global memory -> chunking kernel on the GPU (Device)] Synchronous data transfer and kernel execution • Cost of data transfer is comparable to kernel execution • For large-scale data it involves many data transfers

Asynchronous execution [Diagram: Transfer copies asynchronously from main memory on the CPU (Host) into Buffer 1 and Buffer 2 in device global memory on the GPU (Device); on the timeline, the copy to Buffer 2 overlaps with compute on Buffer 1] Pros: + Overlaps communication with computation + Generalizes to multi-buffering Cons: - Requires page-pinning of buffers at host side
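The benefit of overlapping copies with kernel execution can be estimated with a small timing model. This is a sketch under stated assumptions, not a measurement of Shredder: every copy and every kernel run is assumed to take a fixed time, there is one copy engine and one compute engine, and the function name `pipeline_finish_time` is hypothetical. In real CUDA code the overlap would come from asynchronous copies on streams, which are not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Finish time of `batches` copy+compute rounds when the host may copy into
// one device buffer while the kernel works on another.
// ASSUMPTIONS: constant `copy_time` and `compute_time` per batch, a single
// copy engine, and a single compute engine. With buffers == 1 the model
// degenerates to synchronous transfer and execution.
double pipeline_finish_time(int batches, double copy_time,
                            double compute_time, int buffers) {
    assert(batches >= 0 && buffers >= 1);
    std::vector<double> copy_end(static_cast<std::size_t>(batches), 0.0);
    std::vector<double> compute_end(static_cast<std::size_t>(batches), 0.0);
    for (int i = 0; i < batches; ++i) {
        // The copy engine is free once the previous copy has finished.
        double start = (i > 0) ? copy_end[i - 1] : 0.0;
        // The target buffer is free once the batch that last used it
        // (batch i - buffers) has been processed by the kernel.
        if (i >= buffers) start = std::max(start, compute_end[i - buffers]);
        copy_end[i] = start + copy_time;
        // The kernel needs both the data and a free compute engine.
        double ready = copy_end[i];
        if (i > 0) ready = std::max(ready, compute_end[i - 1]);
        compute_end[i] = ready + compute_time;
    }
    return batches > 0 ? compute_end.back() : 0.0;
}
```

For four batches with equal one-unit copy and compute costs, the model gives 8 units with a single buffer and 5 units with two buffers, which matches the slide's point that the gain is largest when transfer cost is comparable to kernel execution.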

Circular ring pinned memory buffers [Diagram: on the CPU (Host), data is moved by memcpy from pageable buffers into pinned circular ring buffers, then copied asynchronously to device global memory on the GPU (Device)]
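The ring of staging buffers on this slide can be sketched as a small host-side data structure. The class name `StagingRing` and its methods are hypothetical, and ordinary heap memory stands in for pinned memory; in a CUDA host driver the slots would be allocated once as page-locked memory (for example with `cudaHostAlloc`) so that the pinning cost is paid only at start-up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// A fixed ring of staging slots that are allocated once and then reused.
// ASSUMPTION: plain std::vector storage stands in for pinned host memory.
class StagingRing {
public:
    StagingRing(std::size_t slots, std::size_t slot_bytes)
        : slot_bytes_(slot_bytes),
          slots_(slots, std::vector<char>(slot_bytes)),
          in_use_(slots, false),
          next_(0) {
        assert(slots > 0 && slot_bytes > 0);
    }

    // Copies `len` bytes from a pageable buffer into the next slot.
    // Returns the slot index, or -1 if that slot is still in flight.
    int stage(const char* src, std::size_t len) {
        assert(src != nullptr && len <= slot_bytes_);
        if (in_use_[next_]) return -1;
        std::memcpy(slots_[next_].data(), src, len);
        in_use_[next_] = true;
        int slot = static_cast<int>(next_);
        next_ = (next_ + 1) % slots_.size();  // wrap around the ring
        return slot;
    }

    // Marks a slot reusable once its transfer to the device has completed.
    void release(int slot) {
        assert(slot >= 0 && static_cast<std::size_t>(slot) < slots_.size());
        in_use_[slot] = false;
    }

    const char* data(int slot) const { return slots_[slot].data(); }

private:
    std::size_t slot_bytes_;
    std::vector<std::vector<char>> slots_;
    std::vector<bool> in_use_;
    std::size_t next_;
};
```

A caller would `stage` each block read from disk, start the asynchronous copy from the returned slot, and `release` the slot when that copy completes; a return value of -1 signals that the reader is running ahead of the transfers.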

Challenge #2: Device memory conflicts [Diagram: Reader (I/O) -> main memory on the CPU (Host) -> Transfer over PCI -> device global memory -> chunking kernel on the GPU (Device)]

Accessing device memory [Diagram: Thread-1 to Thread-4 run on SP-1 to SP-4 of a multi-processor; accessing device global memory takes 400-600 cycles, accessing device shared memory takes a few cycles]

Memory bank conflicts Un-coordinated accesses to global memory lead to a large number of memory bank conflicts [Diagram: the SPs of a multi-processor accessing device global memory and device shared memory]

Accessing memory banks [Diagram: interleaved memory with Bank 0 to Bank 3; memory addresses 0 to 8 are spread across the banks, the address is split into MSBs and LSBs, a chip-enable signal selects a bank, and the bank outputs are combined (OR) into data out]
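The interleaving in this diagram can be made concrete with a short address-mapping sketch. It assumes standard low-order interleaving, in which the low part of a word address selects the bank and the high part selects the position inside it; the names `BankLocation`, `locate`, and `worst_bank_load` are illustrative and do not come from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct BankLocation {
    std::size_t bank;  // which bank holds the word (low-order part)
    std::size_t row;   // position inside that bank (high-order part)
};

// ASSUMPTION: low-order interleaving over word-granular addresses.
BankLocation locate(std::size_t address, std::size_t banks) {
    assert(banks > 0);
    return BankLocation{address % banks, address / banks};
}

// Largest number of simultaneous accesses that land in the same bank.
// 1 means conflict-free; k means the accesses are serialized k-way.
std::size_t worst_bank_load(const std::vector<std::size_t>& addresses,
                            std::size_t banks) {
    assert(banks > 0);
    std::vector<std::size_t> load(banks, 0);
    for (std::size_t a : addresses) {
        ++load[locate(a, banks).bank];
    }
    return *std::max_element(load.begin(), load.end());
}
```

With four banks, the addresses 0, 1, 2, 3 fall into four different banks and can be served together, whereas 0, 4, 8, 12 all map to bank 0 and must be served one after another.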

Memory coalescing [Diagram: over time, threads 1 to 4 fetch adjacent locations of device global memory into device shared memory (memory coalescing)]

Processing the data [Diagram: Thread-1 to Thread-4 each process data held in device shared memory]
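The difference between coordinated and un-coordinated access on the last few slides can be illustrated by listing which addresses the threads read at each step. This is a host-side model, not GPU code: the function names and the notion of a fixed-size memory segment served by one transaction are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Cooperative pattern: at each step, thread t reads the word right next to
// the one read by thread t-1, so a step covers one contiguous run.
std::vector<std::size_t> cooperative_reads(std::size_t step,
                                           std::size_t threads) {
    std::vector<std::size_t> addrs;
    for (std::size_t t = 0; t < threads; ++t) {
        addrs.push_back(step * threads + t);
    }
    return addrs;
}

// Un-coordinated pattern: thread t walks through its own region, so the
// reads of one step are `region_words` apart.
std::vector<std::size_t> independent_reads(std::size_t step,
                                           std::size_t threads,
                                           std::size_t region_words) {
    std::vector<std::size_t> addrs;
    for (std::size_t t = 0; t < threads; ++t) {
        addrs.push_back(t * region_words + step);
    }
    return addrs;
}

// Number of distinct memory segments touched by one step.
// ASSUMPTION: one transaction serves `segment_words` consecutive words.
std::size_t segments_touched(const std::vector<std::size_t>& addresses,
                             std::size_t segment_words) {
    assert(segment_words > 0);
    std::set<std::size_t> segments;
    for (std::size_t a : addresses) {
        segments.insert(a / segment_words);
    }
    return segments.size();
}
```

With 4 threads, 4-word segments, and 16-word regions, each cooperative step touches a single segment while each independent step touches four. Once the data has been fetched cooperatively into shared memory, each thread can process its own region there at a cost of a few cycles per access.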

Outline • Shredder design • Evaluation • Case studies

Evaluating Shredder • Goal: Determine how Shredder works in practice (See paper for details) • How effective are the optimizations? • How does it compare with multicores? • Implementation • Host driver in C++ and GPU in CUDA • GPU: NVIDIA Tesla C2050 cards • Host machine: Intel Xeon with 12 cores

Shredder vs. Multicores [Chart: throughput in GBps (axis 0 to 2.5) for Multicore, GPU Basic, GPU Async, and GPU Async + Coalescing; annotated 5X]

Outline • Shredder design • Evaluation • Case studies • Computation: Incremental MapReduce (See paper for details) • Storage: Cloud backup

Incremental MapReduce [Diagram: Read input -> Map tasks -> Reduce tasks -> Write output]

Unstable input partitions [Diagram: Read input -> Map tasks -> Reduce tasks -> Write output]
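The slide text gives only the title, so the following is one plausible reading of the problem rather than the speaker's own example: when input is cut into fixed-size splits, a small insertion shifts every later split, and no previously computed map result can be matched to the new splits. The helper names `fixed_splits` and `reusable_splits` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Cuts `data` into consecutive pieces of `split_size` bytes (the last piece
// may be shorter), mimicking fixed-size input splits.
std::vector<std::string> fixed_splits(const std::string& data,
                                      std::size_t split_size) {
    assert(split_size > 0);
    std::vector<std::string> splits;
    for (std::size_t pos = 0; pos < data.size(); pos += split_size) {
        splits.push_back(data.substr(pos, split_size));
    }
    return splits;
}

// Counts how many splits of the new input have content identical to some
// split of the old input, i.e. how many map results could be reused.
std::size_t reusable_splits(const std::vector<std::string>& old_splits,
                            const std::vector<std::string>& new_splits) {
    std::set<std::string> known(old_splits.begin(), old_splits.end());
    std::size_t reused = 0;
    for (const std::string& s : new_splits) {
        if (known.count(s) > 0) ++reused;
    }
    return reused;
}
```

For the 16-byte input `abcdefghijklmnop` with 4-byte splits, appending `qrst` leaves all four original splits reusable, but inserting a single byte at the front leaves none. Boundaries chosen from content rather than from offsets avoid this shift.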

GPU accelerated Inc-HDFS [Diagram: copyFromLocal passes the input file to the HDFS client, where Shredder performs content-based chunking to produce Split-1, Split-2, and Split-3]

Related work • GPU-accelerated systems • Storage: Gibraltar [ICPP’10], HashGPU [HPDC’10] • SSLShader [NSDI’11], PacketShader [SIGCOMM’10], … • Incremental computations • Incoop [SOCC’11], Nectar [OSDI’10], Percolator [OSDI’10], …

Conclusions • GPU-accelerated framework for redundancy elimination • Exploits massively parallel GPUs in a cost-effective manner • Shredder design incorporates novel optimizations • More data-intensive than previous usage of GPUs • Shredder can be seamlessly integrated with storage systems • To accelerate incremental storage and computation

Thank you!
