Not Just a Universal Crutch: Other Useful Things To Do With atomicCAS Mitglied der Helmholtz-Gemeinschaft S6220 - Elmar Westphal - Forschungszentrum Jülich
Overview • atomicCAS and the “Universal Crutch” • Parallel hashing on GPUs using atomicCAS • Example 1: Counting unique keys in a block • Example 2: Group by keys within a warp Mitglied der Helmholtz-Gemeinschaft • Conclusions • Addendum: Sample source codes
On atomicCAS • From the CUDA C Programming Manual: “ int atomicCAS(int* address, int compare, int val); … reads … old … located … in global or shared memory, computes (old == compare ? val : old) Mitglied der Helmholtz-Gemeinschaft and stores the result back to memory at the same address. … The function returns old (Compare And Swap).”
“The Universal Crutch” __device__ double atomicAdd(double* address, double val) { • According to said guide, unsigned long long int* address_as_ull = (unsigned long long int*)address; “any atomic operation can unsigned long long int old = *address_as_ull, assumed; do { be implemented based on assumed = old; old = atomicCAS(address_as_ull, assumed, atomicCAS()” __double_as_longlong(val + __longlong_as_double(assumed))); } while (assumed != old); return __longlong_as_double(old); • Example: double precision } atomicAdd Mitglied der Helmholtz-Gemeinschaft
“The Universal Crutch” __device__ double atomicAdd(double* address, double val) { • According to said guide, unsigned long long int* address_as_ull = (unsigned long long int*)address; “any atomic operation can unsigned long long int old = *address_as_ull, assumed; do { be implemented based on assumed = old; old = atomicCAS(address_as_ull, assumed, atomicCAS()” __double_as_longlong(val + __longlong_as_double(assumed))); } while (assumed != old); return __longlong_as_double(old); • Example: double precision } atomicAdd Mitglied der Helmholtz-Gemeinschaft “any atomic operation”, like that? Great, then we’re done here. Thank you for your time!
Mitglied der Helmholtz-Gemeinschaft There is more! But wait!
Overview • atomicCAS and the “Universal Crutch” • Parallel hashing on GPUs using atomicCAS • Example 1: Counting unique keys in a block • Example 2: Group by keys within a warp Mitglied der Helmholtz-Gemeinschaft • Conclusions • Addendum: Sample source codes
Origin & Motivation • Originally developed as part of building (partial) linked lists in shared memory (see GTC 2012, S2036) • original use became obsolete with Kepler’s faster atomics • general idea became useful again with Maxwell’s native shared Mitglied der Helmholtz-Gemeinschaft atomics
Hashing using atomicCAS • atomicCAS can be used to implement parallel hashing functions • Works very efficiently in shared memory on Maxwell • Building block for several useful counting and grouping operations • Works best at warp- or block-level with N keys << N threads Mitglied der Helmholtz-Gemeinschaft • Hashing function chosen must fit data properties: • Constantly strided keys may lead to repeated collisions
Mitglied der Helmholtz-Gemeinschaft A New Building Block
How Does It Work? • The loop maps the threads’ arbitrary keys to hash indices within the warp or block • The hash index within a scope (warp/block) is then assigned to all threads with the same key • atomicCAS tries to claim the calculated hash index for its thread’s key (“my_key”) • There are three possible outcomes for the return value of atomicCAS: 1. UNCLAIMED: this thread is the first to claim a hash index, success Mitglied der Helmholtz-Gemeinschaft 2. Same key as my_key: hash index claimed by same key from different thread, success 3. Key different from my_key: hash index claimed by different key (hash collision), try again with new hash index
Example, 13 Unique Keys in 16 Threads Slot 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6 Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE (simple to show, but ineffective for data with stride of BLOCK_SIZE!) assigned unassigned collision
Example, 13 Unique Keys in 16 Threads Slot 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Key U U U U U U U U U U U U U U U U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6 Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE (simple to show, but ineffective for data with stride of BLOCK_SIZE!) assigned unassigned collision
Example, 13 Unique Keys in 16 Threads Slot Slot 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 Key Key U U U U U 2 19 U 20 U 21 U 22 U U 7 U 8 U 9 U U 11 U U U U U 14 U 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash Hash 2 2 3 3 7 7 2 2 8 8 7 7 9 9 7 7 14 14 11 11 15 15 5 5 3 3 4 4 7 7 6 6 Key from U 19 7 2 U 7 U U U U U U U U 7 U Slot Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE (simple to show, but ineffective for data with stride of BLOCK_SIZE!) assigned unassigned collision
Example, 13 Unique Keys in 16 Threads Slot Slot 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 Key Key U U U U U 2 19 U 20 U 21 U 22 U U 7 U 8 U 9 U U 11 U U U U U 14 U 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash Hash 2 2 3 3 7 7 2 2 8 8 7 7 9 9 7 7 14 14 11 11 15 15 5 5 3 3 4 4 7 7 6 6 Key from U 19 7 2 U 7 U U U U U U U U 7 U Slot Iteration 1 Hash function: (key+5*hash_iteration)%BLOCK_SIZE (simple to show, but ineffective for data with stride of BLOCK_SIZE!) assigned unassigned collision
Example, 13 Unique Keys in 16 Threads Slot Slot 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 Key Key U U U U U 2 19 U 20 U 21 U 22 U U 7 U 8 U 9 U U 11 U U U U U 14 U 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash 2 3 7 2 8 7 9 7 14 11 15 5 3 4 7 6 Key from U 19 7 2 U 7 U U U U U U U U 7 U Slot Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE assigned unassigned collision
Example, 13 Unique Keys in 16 Threads Slot Slot 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 Key Key U U U U U 2 19 U 20 U 21 U 22 U U 7 U 8 U 9 U U 11 U U U U U 14 U 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash Hash 2 2 8 3 7 7 2 2 8 8 7 7 9 9 7 7 14 14 11 11 15 15 5 5 3 3 4 4 12 7 6 6 Key from Key from U - 19 - 7 - 2 - U - 7 - U - U - U - U - U - U - U - U - 7 - U - Slot Slot Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE assigned unassigned collision
Example, 13 Unique Keys in 16 Threads Slot Slot 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 Key Key U U U U U 2 19 U 20 U 21 U 22 U U 7 U 8 U 9 U U 11 U U U U U 14 U 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash Hash 2 2 8 3 7 7 2 2 8 8 7 7 9 9 7 7 14 14 11 11 15 15 5 5 3 3 4 4 12 7 6 6 Key from Key from U - 19 - 7 - 2 - U - 7 - U - U - U - U - U - U - U - U - 7 - U - Slot Slot Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE assigned unassigned collision
Example, 13 Unique Keys in 16 Threads Slot Slot Slot 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 Key Key Key U U U U U U U 2 2 19 19 U 20 20 U 21 21 U 22 22 U U 7 7 U 8 8 U 9 9 U U U 11 11 U 23 U U U U U 14 14 U 15 15 U Key 2 3 7 2 8 7 9 7 14 11 15 21 19 20 23 22 Mitglied der Helmholtz-Gemeinschaft Hash Hash Hash 2 2 2 8 3 8 7 7 7 2 2 2 8 8 8 7 7 7 9 9 9 7 7 7 14 14 14 11 11 11 15 15 15 5 5 5 3 3 3 4 4 4 12 12 7 6 6 6 Key from Key from Key from U - - 19 8 - 7 - - 2 - - U - - 7 - - U - - U - - U - - U - - U - - U - - U - - U - - U 7 - U - - Slot Slot Slot Iteration 2 Hash function: (key+5*hash_iteration)%BLOCK_SIZE assigned unassigned collision
Recommend
More recommend