A Scalable Ordering Primitive for Multicore Machines
Sanidhya Kashyap, Changwoo Min, Kangnyeon Kim, Taesoo Kim
Era of multicore machines
Scope of multicore machines
● Huge hardware thread parallelism
● How are operations executed correctly? Ordering
● Becomes scalability bottleneck ...
Example: Read Log Update (RLU)
● Extension of RCU
● Modifies objects in a thread’s local log
● Clock maintains correct snapshot (old vs new)
● Frees objects via epoch-based reclamation
Read Log Update (RLU) operation
[Figure: Global Clock (22); P owns a per-thread log/buffer to store copies, holding B’ and D’ behind an RLU header; shared objects A B C D E; Q reads the clock on start, so Q’s Local Clock is (22)]
RLU commit operation
1. P updates clocks (Global Clock: (22), (23); Write Clock: (23), (∞))
2. P executes RCU-epoch: P waits for Q to finish
● Logical clock maintains correctness/ordering
● Maintained via atomic instructions (FAA/CAS)
[Figure: P’s log holds B’, D’, C’; shared objects A B C D E; Q’s Local Clock is (22), so Q will read only old objects]
Issue with logical clock
● RLU suffers from global clock contention
– Cache-line contention due to atomic instructions
– Possible to circumvent with our approach
● How can we achieve ordering with minimal timestamping overhead?
[Figure: throughput (Ops/usec) vs. #cores on Phi and ARM, comparing Atomic and Ordo]
Our proposed ordering primitive: Ordo
● Exposes a monotonically increasing clock
– Current hardware already provides one
– rdtscp (x86), cntvct (ARM), stick (SPARC)
● Relies on a per-core invariant hardware clock
– Monotonically increases with constant skew, regardless of dynamic frequency and voltage scaling
Challenges with Ordo
● Comparing two clocks
– Clocks are not synchronized
– Cores receive the RESET signal at varying times
● Application:
– Modifying algorithms to use Ordo
– Being able to compare two timestamps
Embracing the invariant clocks
● Measure a global uncertainty window
– Ensure a new timestamp once a window is over
– Provides a notion of a globally synchronized clock
● Measured offset MUST have the invariant: measured offset is greater than the physical offset
– Physical offset: offset due to RESET signal
– Measured offset: physical offset + one-way delay
Calculating global uncertainty window: ORDO_BOUNDARY
● Add one-way delay latency on each path
1) Calculate C1 timestamp
2) Notify C2 via memory
3) Get C2 timestamp
4) Repeat steps 1-3 to get the minimum
[Figure: timeline for C1 and C2 with T(C1): 0 and T(C2): 20, giving C1→C2 = 20]
Calculating global uncertainty window: ORDO_BOUNDARY
● Add one-way delay latency on each path
● Repeat prior steps in opposite direction
● Do not know which clock is ahead of the other
[Figure: timeline for C1 and C2 with T(C2): 50 and T(C1): 80, giving C2→C1 = 30]
Calculating global uncertainty window: ORDO_BOUNDARY
● Repeat steps for each pair of cores from C1 to Cn
● The maximum offset is the ORDO_BOUNDARY
[Figure: C1→C2 = 20 (T(C1): 0, T(C2): 20); C2→C1 = 30 (T(C2): 50, T(C1): 80); C1↔C2 = 30]
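The measurement procedure on these slides can be sketched as a small simulation. This is an illustration only, not the authors' implementation: the function names, the simulated skews, and the delay model are all assumptions. The example values (C1 running 5 ticks ahead of C2, a constant one-way delay of 25 ticks) are inferred so that the output matches the slide's figures of 20 and 30.

```python
import itertools
import random


def one_way_offset(src_skew, dst_skew, min_delay, max_jitter, runs, rng):
    """Minimum of (destination timestamp - source timestamp) over several runs."""
    best = None
    for _ in range(runs):
        now = rng.randrange(1_000_000)                  # true time, unknown to the cores
        t_src = now + src_skew                          # step 1: source reads its clock
        delay = min_delay + rng.randint(0, max_jitter)  # step 2: notification via memory
        t_dst = now + delay + dst_skew                  # step 3: destination reads its clock
        diff = t_dst - t_src
        best = diff if best is None else min(best, diff)  # step 4: keep the minimum
    return best


def pair_offset(skew_a, skew_b, **kw):
    """Measure both directions, since we do not know which clock is ahead."""
    return max(one_way_offset(skew_a, skew_b, **kw),
               one_way_offset(skew_b, skew_a, **kw))


def ordo_boundary(skews, **kw):
    """Maximum pairwise offset over all pairs of cores."""
    return max(pair_offset(a, b, **kw)
               for a, b in itertools.combinations(skews, 2))


# Assumed values: C1 is 5 ticks ahead of C2, one-way delay is a constant 25 ticks.
params = dict(min_delay=25, max_jitter=0, runs=10, rng=random.Random(0))
c1_to_c2 = one_way_offset(5, 0, **params)   # 25 - 5 = 20
c2_to_c1 = one_way_offset(0, 5, **params)   # 25 + 5 = 30
boundary = ordo_boundary([5, 0], **params)  # max(20, 30) = 30
```

Because each one-way measurement includes a non-negative delay, the larger of the two directions is never smaller than the physical skew between the two clocks, which is the invariant the deck requires.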
Ordo application
● Applicable to any timestamp-based algorithm
● Expose Ordo API for these algorithms
– get_time(): Current hardware timestamp
– cmp_time(t1, t2): Compare two timestamps; the result is uncertain if |t1 - t2| < ORDO_BOUNDARY
– new_time(t): Return t_new > (t + ORDO_BOUNDARY)
● Catch: Algorithms should handle uncertainty
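A minimal sketch of the three API calls follows, under stated assumptions. The class name `Ordo`, the 1 / -1 / 0 return convention of `cmp_time`, and the use of `time.perf_counter_ns` as a stand-in clock are all choices made here for illustration; the slides describe per-core hardware counters such as rdtscp, which Python cannot read directly. The sketch also treats a difference exactly equal to the boundary as uncertain, which is slightly more conservative than the slide's condition.

```python
import time


class Ordo:
    """Illustrative sketch of the Ordo API; not the authors' implementation."""

    def __init__(self, boundary, clock=time.perf_counter_ns):
        self.boundary = boundary  # ORDO_BOUNDARY, in the clock's units
        self._clock = clock       # stand-in for a per-core hardware counter

    def get_time(self):
        """Current timestamp from the underlying clock."""
        return self._clock()

    def cmp_time(self, t1, t2):
        """1 if t1 is certainly after t2, -1 if certainly before, 0 if uncertain."""
        if t1 > t2 + self.boundary:
            return 1
        if t1 + self.boundary < t2:
            return -1
        return 0

    def new_time(self, t):
        """Spin until the clock has passed t + ORDO_BOUNDARY, then return it."""
        while True:
            now = self.get_time()
            if now > t + self.boundary:
                return now


# 276 ns is the maximum offset the deck reports for the Intel Xeon machine.
ordo = Ordo(boundary=276)
start = ordo.get_time()
later = ordo.new_time(start)
ordering = ordo.cmp_time(later, start)  # certainly after, by construction
```

Passing the clock as a parameter keeps the comparison logic testable with a fake counter, independent of the real timer's resolution.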
Algorithms with Ordo handling uncertainty
● Physical to logical timestamping:
– Rely on cmp_time() to compare two timestamps
– Either defer or revert if comparison is uncertain
– Use new_time() to guarantee new time
● Physical timestamping:
– Use new_time() to access the global clock
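The "defer if uncertain" pattern can be illustrated with a hypothetical helper that obtains a timestamp guaranteed to be ordered after an earlier one. The helper `ordered_timestamp`, the boundary value of 30 ticks, and the fake clock are assumptions made for this sketch and do not come from the deck.

```python
import itertools

ORDO_BOUNDARY = 30  # hypothetical value, in clock ticks


def cmp_time(t1, t2):
    """1 if t1 is certainly after t2, -1 if certainly before, 0 if uncertain."""
    if t1 > t2 + ORDO_BOUNDARY:
        return 1
    if t1 + ORDO_BOUNDARY < t2:
        return -1
    return 0


def new_time(t, get_time):
    """Wait until the clock has passed t + ORDO_BOUNDARY."""
    while True:
        now = get_time()
        if now > t + ORDO_BOUNDARY:
            return now


def ordered_timestamp(last_seen, get_time):
    """Return (timestamp, deferred); the timestamp is certainly after last_seen."""
    ts = get_time()
    if cmp_time(ts, last_seen) == 1:
        return ts, False                     # comparison is certain: proceed
    return new_time(last_seen, get_time), True  # uncertain: defer past the window


# Fake clock that advances 10 ticks per read: 100, 110, 120, ...
fake_clock = itertools.count(100, 10)


def get_time():
    return next(fake_clock)


first = ordered_timestamp(50, get_time)    # 100 > 50 + 30, no deferral
second = ordered_timestamp(100, get_time)  # 110 is uncertain, waits until 140
```

An algorithm that prefers to revert would abort and retry its operation at the point where this sketch calls `new_time()`.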
Read Log Update (RLU Ordo) operation
[Figure: Global offset (30); P’s local clock (22); P’s log holds B’ and D’; shared objects A B C D E; Q reads its core clock on start: Q’s core clock (50), Q Local Clock (50)]
RLU Ordo commit operation
1. P updates own clock (Write Clock: (150), (∞))
2. P executes RCU-epoch: P waits for Q to finish
[Figure: Global offset (30); P’s log holds B’, D’, C’; shared objects A B C D E; Q Local Clock (50), so Q will read only old objects]
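The numbers on this slide can be checked with a small sketch. The rule used here, that a reader takes the new copy only when its local clock is certainly after the writer's write clock, is an assumption about how the comparison is applied; treating an uncertain comparison as "read the old object" is likewise an assumption of this sketch. The function names are hypothetical.

```python
import math

GLOBAL_OFFSET = 30  # ORDO_BOUNDARY value shown on the slide


def certainly_after(t1, t2, boundary=GLOBAL_OFFSET):
    """True only if t1 is later than t2 by more than the uncertainty window."""
    return t1 > t2 + boundary


def reader_sees_new_copy(reader_local_clock, writer_write_clock):
    """Assumed rule: read the new copy only when the order is certain."""
    return certainly_after(reader_local_clock, writer_write_clock)


q_local = 50
during_commit = reader_sees_new_copy(q_local, 150)     # 50 is not after 150 + 30
no_commit = reader_sees_new_copy(q_local, math.inf)    # write clock of infinity
later_reader = reader_sees_new_copy(200, 150)          # 200 > 150 + 30
```

With Q's local clock at 50 and P's write clock at 150 or ∞, both comparisons come out false, which matches the slide's statement that Q will read only old objects.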
Algorithms modified with Ordo
● RLU
● Transactional Locking (TL2) in STM
● Database concurrency control: OCC, MVCC
● Oplog used in Linux forking functionality
See our paper
Evaluation
● Questions:
– Measured global offset (ORDO_BOUNDARY)
– Maximum scalability of Ordo
– Ordo’s impact on algorithms
● Machine configurations:
– 240 core, 8 socket Intel Xeon machine (Xeon)
– 256 core, Intel Xeon Phi (Phi)
– 96 core, 2 socket ARM machine (ARM)
– 32 core, 8 socket AMD machine (AMD)
Offset between clocks
● Empirically measured offset after reboots
● ORDO_BOUNDARY is the maximum offset
Machine          Minimum (ns)   Maximum (ns)
Intel Xeon       70             276
Intel Xeon Phi   90             270
ARM              100            1,100
AMD              93             203
Timestamping with Ordo
● Ordo relies on hardware timestamping
● 17.4 – 285.5x faster than atomic increments
[Figure: per-core timestamping throughput (Ops/usec/core) vs. #core on Xeon, Phi, ARM, and AMD, comparing Atomic and Ordo]
Scaling RLU with Ordo
● RLU Ordo is 2.1x faster on average
● Still suffers from object copy and its locking
[Figure: throughput (Ops/usec) vs. #core on Xeon, Phi, ARM, and AMD; legend: RLU 2%, RLU(Ordo) 2%]
Discussion and limitations
● Simplifies the design and understanding of algorithms
● Not a panacea
– Applicable when clock is contentious
● No skew consideration
● Thread ID-based timestamp comparison has its limitations
Conclusion
● Ordo is a scalable timestamping primitive
– Relies on invariant hardware clocks
● Exposes time-based API to the user
● Applied Ordo to five concurrent algorithms
● Improves the scalability of algorithms by up to 39.7x across architectures
Backup Slides
Offset between clocks
● Clocks are not synchronized
– 8th socket in Xeon and 2nd socket in ARM
– Results remain consistent even after reboots and measuring after a period of time
[Figure: offset between clocks vs. # core for ARM and Xeon]
Sensitivity of ORDO_BOUNDARY
● Varying ORDO_BOUNDARY from 1/8x to 8x
● Cycles increase from 32.2 to 18K on the Xeon machine
[Figure: normalized throughput for 1-core, 1-socket, and 8-sockets]
Physical timestamping: Oplog
● Improves Exim performance by 1.9x at 240 cores
[Figure: Exim throughput (Messages/sec) vs. #core, comparing Stock and Oplog(Ordo)]
Scaling database concurrency control
● Improves OCC and MVCC by 4.1–39.7x for read-only (YCSB)
● OCC (Ordo) is 1.24x faster than Tictoc and Silo (TPC-C)
[Figure: throughput (Txns/usec) vs. # core on Xeon, Phi, ARM, and AMD for OCC, MVCC, OCC (Ordo), and MVCC (Ordo)]
Cannot use clock synchronization protocols
● No information on minimum bounds on message delivery between/among clocks
● Protocols introduce various errors
● Can lead to mis-synchronized clocks
– Larger or smaller than the actual physical offset
● Leads to incorrect implementations of concurrent algorithms