Optimization of Scalable Concurrent Pool Based on Diffraction Trees


  1. Optimization of Scalable Concurrent Pool Based on Diffraction Trees
Anenkov Alexandr, Siberian State University of Telecommunications and Information Sciences, Novosibirsk. E-mail: alex.anenkov@outlook.com
Paznikov Alexey, Saint Petersburg Electrotechnical University "LETI"; Siberian State University of Telecommunications and Information Sciences, Novosibirsk; Rzhanov Institute of Semiconductor Physics, Siberian Branch of RAS, Novosibirsk. E-mail: apaznikov@gmail.com
The First Summer School on Practice and Theory of Concurrent Computing (SPTCC), Saint Petersburg, ITMO University, July 3-7, 2017

  2. Concurrent pool
Consider a multicore (NUMA, SMP) computer system. A set of threads executes push() and pop() operations at random moments. The goal is to maximize the efficiency of pool access for the threads.
[Figure: a concurrent pool in memory accessed by six threads issuing push/pop operations; the system comprises two NUMA nodes with eight processor cores, per-core L1/L2 caches, shared L3 caches, and RAM, interconnected via QuickPath / HyperTransport.]

  3. Concurrent pool
Consider a multicore (NUMA, SMP) computer system. A set of threads executes push() and pop() operations at random moments. The goal is to maximize the efficiency of pool access for the threads.
As the efficiency indicator we use the pool's throughput b = N / t, where N is the total number of executed push and pop operations and t is the modelling time. Throughput shows how many operations have been completed per second.
[Figure: the same system diagram as on slide 2.]
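For example, a pool that completes N = 10^7 operations in t = 2 s has throughput b = 5·10^6 operations per second. A minimal measurement harness for this metric might look as follows (a C++ sketch; the templated Pool type and its push()/pop() interface are assumptions for illustration, not from the slides):

#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Sketch: measure pool throughput b = N / t. Any pool type with a
// push()/pop() interface can be plugged in as Pool.
template <typename Pool>
double measure_throughput(Pool& pool, int num_threads,
                          std::chrono::seconds duration) {
    std::atomic<bool> stop{false};
    std::atomic<long long> total_ops{0};  // N: completed push + pop operations

    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        workers.emplace_back([&, i] {
            long long ops = 0;
            while (!stop.load(std::memory_order_relaxed)) {
                if (ops % 2 == 0) pool.push(i);  // alternate push and pop
                else              pool.pop();
                ++ops;
            }
            total_ops.fetch_add(ops, std::memory_order_relaxed);
        });
    }
    std::this_thread::sleep_for(duration);  // t: modelling time
    stop.store(true);
    for (auto& w : workers) w.join();
    return static_cast<double>(total_ops.load()) / duration.count();  // b = N / t
}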

  4. Approaches to the implementation of a concurrent pool
1. Concurrent queues based on thread locking (PThread mutex, MCS, CLH, CAS spinlocks, Flat Combining, Oyama locks, RCL).
2. Lock-free concurrent linear lists.
3. Lists based on elimination backoff, combining trees and other methods.
4. Multiple lists combined with a diffracting tree.
5. And so on.

  5. Approaches to the implementation of a concurrent pool
1. Concurrent queues based on thread locking (PThread mutex, MCS, CLH, CAS spinlocks, Flat Combining, Oyama locks, RCL). Drawbacks: lack of scalability for a large number of threads and high intensity of pool operations.
2. Lock-free concurrent linear lists. Drawbacks: the heads of the lists become bottlenecks, which increases access contention and decreases cache efficiency.
3. Lists based on elimination backoff, combining trees and other methods. Drawbacks: threads must wait for complementary operations (active waiting or use of a system timer), and the algorithms have to be tuned via parameters: waiting interval, acceptable number of collisions, etc.
4. Multiple lists combined with a diffracting tree.

  6. Diffracting tree
A diffracting tree is a binary tree of height h, each node of which contains a bit determining the direction in which threads pass.
[Figure: push requests from threads enter the root balancer b and are routed through tree levels 1 … h.]
Tree nodes (balancers) redirect the requests arriving from threads for pushing and popping of elements, in turn, to one of the child nodes:
• if b = 0, the thread proceeds to the node of the right subtree,
• if b = 1, the thread proceeds to the node of the left subtree.
After passing the node, the thread inverts the bit in it.
Shavit N., Zemach A. Diffracting Trees // ACM Transactions on Computer Systems (TOCS). 1996. Vol. 14. No. 4. Pp. 385-428.
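A minimal sketch of a balancer's routing step in C++ (the names Balancer and traverse are illustrative, not from the slides): each visit atomically inverts the node's bit, and the bit's previous value selects the child subtree, matching the rule above.

#include <atomic>

// Sketch of a diffracting-tree balancer: the node's bit is flipped
// atomically on every visit, and the previous value selects the child.
struct Balancer {
    std::atomic<int> bit{0};
    Balancer* child[2] = {nullptr, nullptr};  // [0] = right, [1] = left subtree

    // Returns the child node the visiting thread must proceed to.
    Balancer* traverse() {
        int old = bit.fetch_xor(1, std::memory_order_acq_rel);  // invert bit
        return child[old];  // old == 0 -> right subtree, old == 1 -> left
    }
};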

  7. Concurrent pool based on diffracting tree
The pool is a set of concurrent queues q = {1, 2, …, 2^h}, which are accessed by means of a diffracting tree.
[Figure: threads issue push/pop requests into the root balancer; the tree of height h routes them to one of the 2^h concurrent queues. Each node holds a push-thread bit b1 and a pop-thread bit b2, plus an array for elimination of complementary operations.]
Afek Y., Korland G., Natanzon M., Shavit N. Scalable Producer-Consumer Pools Based on Elimination-Diffraction Trees // European Conference on Parallel Processing, 2010. Pp. 151-162.

  8. Concurrent pool LocOptDTPool based on diffracting tree
Each node of the developed pool LocOptDTPool on level k of the diffracting tree contains two arrays of atomic bits (one for producer threads and one for consumer threads) of size m_k ≤ p, instead of two atomic bits.
[Figure: the tree of height h routes push/pop requests from p threads to the 2^h concurrent queues; each node holds an array of push-thread bits b_11 … b_1m and an array of pop-thread bits b_21 … b_2m.]
Here m_k is the number of atomic bits in the node's arrays, m_k = m_1 / 2^k, where k is the tree level of the node.

  9. Concurrent pool LocOptDTPool based on diffracting tree
Each node of the developed pool LocOptDTPool on level k of the diffracting tree contains two arrays of atomic bits (one for producer threads and one for consumer threads) of size m_k ≤ p, instead of two atomic bits.
[Figure: a node's array of push-thread bits b_11 … b_1m and array of pop-thread bits b_21 … b_2m; threads are hashed to cells by h(j) = j mod m before being routed to the 2^h queues.]
At each visit of a tree node, the thread chooses a cell in the array of atomic bits by means of the hash function h(j) = j mod m, where j ∈ {1, 2, …, p} is the serial number of the current thread and m is the size of the array of atomic bits.
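A sketch of such a node in C++ (the name ArrayBalancer and the exact layout are illustrative): each thread hashes its own serial number into the bit array and flips only that cell, spreading contention across m atomic words instead of a single bit.

#include <atomic>
#include <vector>

// Sketch of a LocOptDTPool-style balancer node: instead of one bit per
// direction, the node keeps arrays of m atomic bits, and each thread
// touches only the cell selected by h(j) = j mod m.
struct ArrayBalancer {
    std::vector<std::atomic<int>> prod_bits;  // bits for producer (push) threads
    std::vector<std::atomic<int>> cons_bits;  // bits for consumer (pop) threads
    ArrayBalancer* child[2] = {nullptr, nullptr};

    explicit ArrayBalancer(std::size_t m) : prod_bits(m), cons_bits(m) {}

    // j is the serial number of the visiting thread; is_producer selects
    // which bit array is used. Returns the child node to proceed to.
    ArrayBalancer* traverse(unsigned j, bool is_producer) {
        auto& bits = is_producer ? prod_bits : cons_bits;
        std::size_t cell = j % bits.size();  // h(j) = j mod m
        int old = bits[cell].fetch_xor(1, std::memory_order_acq_rel);
        return child[old];
    }
};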

  10. Concurrent pool LocOptDTPool based on diffracting tree
[Figure: processor cores 1 … c, each associated with its own subset of the 2^h concurrent queues.]
Each processor core k ∈ {1, 2, …, c} corresponds to the queues Q_k = {k·2^h/c, k·2^h/c + 1, …, (k+1)·2^h/c − 1}. All the objects pushed into (popped from) the pool by a thread j affined to core k (a(j) = k) are distributed among the queues Q_k only.

  11. Concurrent pool LocOptDTPool based on diffracting tree
[Figure: the same mapping of processor cores 1 … c to the 2^h concurrent queues as on slide 10.]
During the push operation, the queue into which the object is inserted:
q = (c · m) mod (2^h / c) + a(k),
where c is the total number of processor cores, m is the tree leaf visited by the thread, a(k) is the number of the core to which thread k is affined, and 2^h is the total number of queues in the pool.
During the pop operation, the queue from which the object is removed:
q = c·β + a(k),
where β is the shift coefficient.
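A sketch of this queue-selection step in C++ (function names are illustrative; the parenthesization of the slide's push formula, and the assumption that c divides 2^h, are my reading of the garbled original):

#include <cstddef>

// Sketch of LocOptDTPool queue selection. Symbols follow the slides:
// c - total number of processor cores,
// h - tree height (the pool has 2^h queues, assumed divisible by c),
// m - index of the tree leaf the thread reached,
// a - number of the core the thread is affined to.
std::size_t push_queue(std::size_t c, std::size_t h,
                       std::size_t m, std::size_t a) {
    std::size_t num_queues = std::size_t{1} << h;  // 2^h queues in total
    return (c * m) % (num_queues / c) + a;
}

// Pop side: queue = c * beta + a, where beta is the shift coefficient.
std::size_t pop_queue(std::size_t c, std::size_t beta, std::size_t a) {
    return c * beta + a;
}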

  12. Concurrent pool LocOptDTPool: description of data structure

LocOptDTPool {
    Node tree
    BitArray prod_bits[m], cons_bits[m]
    ThreadSafeQueue queues[n]
    AffinityManager af_mgr
    AtomicInt prod_num, cons_num
    push(data)
    pop()
}

  13. Concurrent pool LocOptDTPool: description of data structure

LocOptDTPool {
    Node tree
    BitArray prod_bits[m], cons_bits[m]
    ThreadSafeQueue queues[n]
    AffinityManager af_mgr
    AtomicInt prod_num, cons_num
    push(data)
    pop()
}

Node {
    int index, level
    Node children[2]
    int traverse(BitArray bits)
}

  14. Concurrent pool LocOptDTPool: description of data structure

LocOptDTPool {
    Node tree
    BitArray prod_bits[m], cons_bits[m]
    ThreadSafeQueue queues[n]
    AffinityManager af_mgr
    AtomicInt prod_num, cons_num
    push(data)
    pop()
}

Node {
    int index, level
    Node children[2]
    int traverse(BitArray bits)
}

BitArray {
    Bit bits_array[n][m]
    int flip(tree_level, node_index) {
        return bits_array[tree_level][node_index].flip()
    }
}

Bit {
    AtomicInt bit
    int flip() {
        return bit.atomic_xor(1)
    }
}

  15. Concurrent pool LocOptDTPool: description of data structure

LocOptDTPool {
    Node tree
    BitArray prod_bits[m], cons_bits[m]
    ThreadSafeQueue queues[n]
    AffinityManager af_mgr
    AtomicInt prod_num, cons_num
    push(data)
    pop()
}

Node {
    int index, level
    Node children[2]
    int traverse(BitArray bits)
}

BitArray {
    Bit bits_array[n][m]
    int flip(tree_level, node_index) {
        return bits_array[tree_level][node_index].flip()
    }
}

Bit {
    AtomicInt bit
    int flip() {
        return bit.atomic_xor(1)
    }
}

AffinityManager {
    thread_local int core, queue_offset
    AtomicInt next_core, next_offset
    set_core()
    get_core()
    get_queue_offset()
}
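In C++, the pseudocode's Bit.flip() maps naturally onto std::atomic's fetch_xor, which inverts the bit and returns its previous value in one atomic step (a sketch; the surrounding class layout is illustrative):

#include <atomic>

// Sketch of the pseudocode's Bit type: flip() atomically inverts the bit
// and returns its previous value, which the caller uses to pick a subtree.
struct Bit {
    std::atomic<int> bit{0};
    int flip() {
        // atomic_xor(1) in the pseudocode: invert and return the old value.
        return bit.fetch_xor(1, std::memory_order_acq_rel);
    }
};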

  16. Pool LocOptDTPool: algorithms of pushing and popping of elements

Algorithm push for inserting an element into the pool:
1. Increase the counter prod_num.
2. Set the thread's affinity to a processor core, if this has not been done yet.
3. Choose the queue into which the object will be inserted:
q = (c · m) mod (2^h / c) + a(k),
where c is the number of processor cores, m is the tree leaf visited by the thread, a(k) is the number of the core to which thread k is affined, and 2^h is the total number of queues in the pool.
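Putting the pieces together, a push operation might look roughly as follows (a C++ sketch under the assumptions above; the stub queue, traversal and affinity helpers are illustrative placeholders, not the authors' code):

#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// A trivially locked queue stands in for the pseudocode's ThreadSafeQueue.
struct ThreadSafeQueue {
    std::mutex mtx;
    std::deque<int> items;
    void push(int v) { std::lock_guard<std::mutex> lk(mtx); items.push_back(v); }
};

// Illustrative sketch of LocOptDTPool::push (steps 1-3 above).
struct LocOptDTPoolSketch {
    std::atomic<long> prod_num{0};
    std::size_t c, h;  // number of cores and tree height (2^h queues)
    std::vector<ThreadSafeQueue> queues;

    LocOptDTPoolSketch(std::size_t cores, std::size_t height)
        : c(cores), h(height), queues(std::size_t{1} << height) {}

    // Placeholders for the real diffracting-tree descent and affinity manager.
    std::size_t traverse_to_leaf(std::size_t /*thread_id*/) { return 0; }
    std::size_t core_of_current_thread() { return 0; }

    void push(int data, std::size_t thread_id) {
        prod_num.fetch_add(1, std::memory_order_relaxed);  // step 1
        std::size_t a = core_of_current_thread();          // step 2 (affinity assumed already set)
        std::size_t m = traverse_to_leaf(thread_id);       // descend the tree to a leaf
        std::size_t q = (c * m) % (queues.size() / c) + a; // step 3: choose the queue
        queues[q].push(data);
    }
};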
