GPU-Acceleration of In-Memory Data Analytics
Evangelia Sitaridi
AWS Redshift
GPUs for Telcos
• Fast query time
  • Quickly identify network problems
  • No time to index data
  • Respond fast to customers
• Geospatial visualization
  • Take advantage of GPU visualization capabilities
[Figure: SMS Hub traffic]
*Picture taken from: http://www.vizualytics.com/Solutions/Telecom/Telecom.html
GPUs for Social Media Analytics
• Search terms: debate
• Match regexp: /\B#\w*[a-zA-Z]+\w*/
• Filter location
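The three filters on this slide can be sketched on the CPU as follows. This is only an illustration in Python: the posts, the `matching_posts` helper, and the way the filters are combined are assumptions; only the regular expression comes from the slide.

```python
import re

# Hashtag pattern from the slide, without the surrounding / delimiters.
HASHTAG = re.compile(r"\B#\w*[a-zA-Z]+\w*")

# Hypothetical posts as (text, location) pairs; not data from the talk.
posts = [
    ("Watching the #debate tonight", "New York"),
    ("debate prep with the team #2016 #Debate2016", "Boston"),
    ("email me at user#tag, no debate here", "New York"),
    ("Great weather today #sunny", "New York"),
]


def matching_posts(posts, term, location):
    """Keep posts from `location` that mention `term` and contain a hashtag."""
    selected = []
    for text, loc in posts:
        if loc != location:
            continue  # filter on location
        if term not in text.lower():
            continue  # filter on the search term
        tags = HASHTAG.findall(text)
        if tags:  # filter on the regular expression
            selected.append((text, tags))
    return selected


result = matching_posts(posts, "debate", "New York")
```

The `\B` prefix rejects a `#` that directly follows a word character (as in `user#tag`), and `[a-zA-Z]+` rejects purely numeric tags such as `#2016`.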
Challenges for GPU Databases
• Special threading model → Increased programming complexity
  • Which algorithms are more efficient for GPUs?
  • How much do multiple code paths increase the cost of code maintenance?
• Special memory architecture
  • How to adapt the data layout?
• Limited memory capacity
• Data transfer cost between CPUs and GPUs
  a) Through the PCI/E link to the GPU
  b) From the storage system to the GPU
• Fair comparison against software-based solutions
Outline
o CPU vs GPU introduction
o Accelerated wildcard string search
  o Insight: Change the layout of the strings in the GPU main memory
  o 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
o Gompresso: Massively parallel decompression
  o Insight: Trade off compression ratio for increased parallelism
  o 2X speed-ups & 1.2X energy savings against multi-core state-of-the-art CPU libraries
o GPUs on the cloud
CPU-GPU Analogies
CPU | GPU
Goal: Low latency (overlapping different instructions) | Goal: High throughput
CPU thread | GPU warp
RAM | Global memory
Tens of threads | Thousands of threads
Hundreds of GBs capacity | Few tens of GB
GPU Architecture
K40: 15 Stream Multiprocessors (SM1, SM2, …, SM15)
Each SM: warp schedulers, register file, shared memory
Global memory
Warp: unit of execution
CUDA kernel executed by each GPU thread:
if(condition) a++; else b++; endif
[Figure: active-thread masks (1 = active) for warp 1 … warp n at the branch, during a++, during b++, and at "Branch complete"]
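The branch on this slide can be mimicked with a small CPU-side simulation. This is a rough sketch of masked (SIMT-style) execution, not how the hardware is implemented; `run_branch_on_warp` and the pass counter are hypothetical names introduced for illustration. The warp size of 32 is the one quoted later in the deck.

```python
WARP_SIZE = 32  # threads per warp


def run_branch_on_warp(conditions, a, b):
    """Simulate one warp executing: if (cond) a++; else b++;

    All threads of a warp share one instruction stream, so each side of the
    branch is issued once for the whole warp, with inactive threads masked off.
    Returns the updated per-thread a and b plus the number of passes issued.
    """
    assert len(conditions) == len(a) == len(b) == WARP_SIZE
    a, b = list(a), list(b)
    if_mask = [bool(c) for c in conditions]
    else_mask = [not c for c in if_mask]
    passes = 0
    if any(if_mask):  # at least one thread takes the if path
        passes += 1
        for t in range(WARP_SIZE):
            if if_mask[t]:
                a[t] += 1
    if any(else_mask):  # at least one thread takes the else path
        passes += 1
        for t in range(WARP_SIZE):
            if else_mask[t]:
                b[t] += 1
    return a, b, passes


zeros = [0] * WARP_SIZE
# Uniform warp: every thread takes the if path, so one pass is enough.
_, _, uniform_passes = run_branch_on_warp([True] * WARP_SIZE, zeros, zeros)
# Divergent warp: even threads take the if path, odd threads the else path.
a_out, b_out, divergent_passes = run_branch_on_warp(
    [t % 2 == 0 for t in range(WARP_SIZE)], zeros, zeros
)
```

In the divergent case both paths are issued one after the other, which is why branches whose outcome differs within a warp cost more than branches that are uniform across the warp.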
Text Query Applications
GENOMIC DATA
Input: ACGTACCTGATCGTAGGATCCCAAGTACATCATTTC
Search Pattern: ACC
DATABASE COLUMNS
Wild card searches
Search Pattern: “*3rd Ave*New York*”
Id | Address
3 | “9 Front St, Washington DC, 20001”
8 | “3338 A Lockport Pl #6, Margate City, NJ, 8402”
9 | “18 3rd Ave, New York, NY, 10016”
15 | “88 Sw 28th Ter, Harrison, NJ, 7029”
16 | “87895 Concord Rd, La Mesa, CA, 91142”
Q2, 9, 13, 14, 16, 20 of TPC-H contain expensive LIKE predicates
Wildcard Search Challenges
• Approaches simplifying search cannot be applied
• String indexes, e.g. suffix trees
  • For query ‘%customer%complaints’ multiple queries need to be issued
    • ‘%customer%’ AND ‘%complaints%’
    • Confirm results
• Dictionary compression
  • Wildcard searches are not simplified using dictionaries
  • String data need to be decompressed
Background: How to Search Text Fast?
Knuth-Morris-Pratt Algorithm
Shift pattern table: -1 0 0 1 2 3 4
Step 6
  Input: ACACA T ACCTACTTTACGTACGT (i=5)
  Pattern: ACACA C G (j=5)
  Character mismatch
Step 7
  Input: ACACA T ACCTACTTTACGTACGT (i=5)
  Pattern: ACA C ACG (j=3)
  Shift pattern
Advance to the next character:
a) If the input matches the pattern
b) While there is a mismatch, shift to the left of the pattern
   Stop when the beginning of the pattern has been reached
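The steps above can be sketched as a plain sequential Python version of Knuth-Morris-Pratt. This is a textbook formulation written for illustration, not the GPU kernel from the talk; the function names are assumptions. For the slide's pattern ACACACG it yields the shift table -1 0 0 1 2 3 4.

```python
def build_shift_table(pattern):
    """table[j] is the pattern index to resume from after a mismatch at j.

    A value of -1 means the beginning of the pattern has been reached, so the
    input simply advances to the next character.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    table = [0] * len(pattern)
    table[0] = -1
    k = -1  # length of the border matched so far (-1 before the first char)
    j = 0
    while j < len(pattern) - 1:
        while k >= 0 and pattern[j] != pattern[k]:
            k = table[k]  # fall back to a shorter border
        j += 1
        k += 1
        table[j] = k
    return table


def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    table = build_shift_table(pattern)
    i = j = 0
    while i < len(text):
        # b) While there is a mismatch, shift the pattern; stop at j == -1.
        while j >= 0 and text[i] != pattern[j]:
            j = table[j]
        # a) Match (or pattern start reached): advance to the next character.
        i += 1
        j += 1
        if j == len(pattern):
            return i - j
    return -1


shift_table = build_shift_table("ACACACG")
genome_hit = kmp_search("ACGTACCTGATCGTAGGATCCCAAGTACATCATTTC", "ACC")
```

The inner `while` loop is the data-dependent part: how many times it runs for a given input character differs from string to string, which matters once many GPU threads run it in lockstep.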
GPU Limiting Factor: Cache Pressure
Threads matching different strings
Cache footprint: 32 (warp size) x 15 (Stream Multiprocessors) x 64 (warps in each SM) = 30720 cache lines
>> Tesla K40 architecture L2 capacity: 12288 cache lines
Smaller cache size per thread than CPUs: need improved locality!
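The footprint figure is simple arithmetic, reproduced below as a sanity check. The sketch assumes the worst case implied by the slide, in which every resident thread touches its own cache line; the 128-byte line size is the one quoted on the pivoting slide.

```python
WARP_SIZE = 32          # threads per warp
NUM_SMS = 15            # Stream Multiprocessors on the K40
WARPS_PER_SM = 64       # resident warps in each SM
L2_LINES = 12288        # L2 capacity in cache lines
CACHE_LINE_BYTES = 128  # bytes per cache line

# Worst case: each resident thread works on a different string and therefore
# on a different cache line.
footprint_lines = WARP_SIZE * NUM_SMS * WARPS_PER_SM
footprint_bytes = footprint_lines * CACHE_LINE_BYTES
l2_bytes = L2_LINES * CACHE_LINE_BYTES
oversubscription = footprint_lines / L2_LINES

print(f"footprint: {footprint_lines} lines ({footprint_bytes / 2**20:.2f} MiB)")
print(f"L2:        {L2_LINES} lines ({l2_bytes / 2**20:.2f} MiB)")
print(f"footprint is {oversubscription:.1f}x the L2 capacity")
```

Under these assumptions the working set is 2.5 times the L2 capacity, so lines are evicted before the threads that loaded them have finished with them.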
Adapt Memory Layout: Pivoting Strings
Baseline (contiguous) layout: String 1, String 2, String 3
CTAACCGAGTAAAGAACGTAAACTCATTCGACTAAACCGAGTAAAGA…
Pivoted layout
CTAAACGTCTAA…CCGAAAACACCG…GTAATCATAGTA…AAGATCGAAAGA…
– Split strings in equally sized pieces
– Interleave pieces in memory → Improve locality
Initially: Each warp loads a cache line (128 bytes), with threads T0, T1, T2 reading adjacent pieces
In the presence of partial matches some threads might fall “behind” → Memory divergence!
Partial solution: Threads might progress at different rates
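The pivoting transformation can be sketched on the host side as below. This is a minimal illustration that assumes all strings have the same length and that this length is a multiple of the piece size; `pivot` and `pivoted_offset` are hypothetical helper names, and the three 16-character strings are illustrative inputs.

```python
def pivot(strings, piece_size):
    """Interleave equally sized pieces: piece 0 of every string, then piece 1, ..."""
    length = len(strings[0])
    if any(len(s) != length for s in strings) or length % piece_size != 0:
        raise ValueError("sketch assumes equal lengths divisible by piece_size")
    pieces = []
    for p in range(length // piece_size):
        for s in strings:
            pieces.append(s[p * piece_size:(p + 1) * piece_size])
    return "".join(pieces)


def pivoted_offset(string_id, char_pos, num_strings, piece_size):
    """Offset in the pivoted buffer of character char_pos of string string_id."""
    piece = char_pos // piece_size
    return (piece * num_strings + string_id) * piece_size + char_pos % piece_size


strings = ["CTAACCGAGTAAAAGA", "ACGTAAACTCATTCGA", "CTAAACCGAGTAAAGA"]
pivoted = pivot(strings, piece_size=4)
```

With this layout, threads that are all working on the same piece index read neighbouring addresses, so one cache line serves several threads. Once some threads advance to a later piece than others, their reads land in different cache lines again.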
Transform Control Flow of KMP
Shift pattern table: -1 0 0 1 2 3 4
Knuth-Morris-Pratt Algorithm
Step 6
  Input: ACACA T ACCTACTTTACGTACGT (i=5)
  Pattern: ACACA C G (j=5)
  Character mismatch
While loop
  i=5, j=3: ACA C ACG, Mismatch → Shift pattern
  i=5, j=1: A C ACACG, Mismatch → Shift pattern
  …
Step 7
  i=6, j=0: ACACA C G, Shift pattern
KMP Hybrid: Advance input in pivoted piece size
GPU vs. CPU Comparison
select s_suppkey from supplier where s_comment like '%Customer%Complaints%'
– Performance Metrics
  – Price ($)
  – Performance (GB/s)
  – Performance per $
  – Estimated energy consumption
– Evaluate three systems
  – CPU only system
  – GPU only system
  – CPU+GPU combined system
GPU vs. CPU Comparison
Metric | GPU | CPU (Boost BM) | CPU (CMPISTRI) | CPU+GPU
Price ($) | 3100 | 952 | 952 | 4052
Performance (GB/s) | 98.7 | 40.75 | 43.1 | 138.7
Energy consumed (J) | 1.27 | 2.49 | 2.35 | 1.78
Performance/$ | 31.89 | 42.8 | 45.28 | 34.25
CPU: Dual-socket E5-2620, bandwidth 102.4 GB/s
GPU: Tesla K40, bandwidth 288 GB/s
Circle best column value per row
Design system by choosing the desired trade-offs
Example: Why Use Compression?
A) Reduce basic S3 costs
B) Reduce query costs
[Figure: Cloud warehouse with Amazon S3 (data lakes, databases), query engine, and database]
Decompression speed is more important than compression speed
Background: LZ77 Compression
Input characters (positions 0 1 2 3 …): ATTACTAGAATGT TACTAATCTGAT CGGGCCGGGCCTG
Output: … ATTACTAGAATGT (2,5) …
Backreferences: (Position, Length)
Literals: Unmatched characters
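Decoding such a stream can be sketched as follows. This is a minimal sequential decoder written for illustration, not Gompresso's format or algorithm. It assumes that a backreference's position is an absolute index into the already decoded output, which is consistent with (2,5) yielding TACTA in the slide's example; the token list and the name `lz77_decode` are hypothetical.

```python
def lz77_decode(tokens):
    """Decode a list of tokens into a string.

    A token is either a literal string (unmatched characters) or a
    (position, length) backreference into the output produced so far.
    """
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.extend(token)  # literals are copied through unchanged
            continue
        position, length = token
        if not 0 <= position < len(out):
            raise ValueError("backreference points outside the decoded output")
        # Copy one character at a time so that a match may overlap the
        # characters it is producing.
        for k in range(length):
            out.append(out[position + k])
    return "".join(out)


# First literal run and backreference from the slide; the trailing literal
# run is inferred from the slide's input characters.
decoded = lz77_decode(["ATTACTAGAATGT", (2, 5), "ATCTGAT"])
```

Each backreference reads output written by earlier tokens, so the tokens of one stream depend on each other; this dependency is what limits parallelism when decompressing a single stream.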