Speeding Up Thread-Local Storage Access from Dynamic Libraries - PowerPoint PPT Presentation

Speeding Up Thread-Local Storage Access from Dynamic Libraries
Alexandre Oliva
http://www.lsd.ic.unicamp.br/~oliva/
aoliva@redhat.com oliva@lsd.ic.unicamp.br
Red Hat, University of Campinas
March, 2008

Summary • TLS?!? • Dynamic libraries • Thread-Local Storage • Optimizations • Performance numbers • ARM Port • Relaxations

Background • Per-thread data • Stack, automatic variables • pthread_[gs]etspecific • TLS: __thread variables

Dynamic Libraries
    extern int i, g(void);
    int f(void) { return g() + i; }

    PDC (exec)          PIC (shared lib)
                        copy next pc to %ebx
                        addl $_GOT_-., %ebx
    call g              call g@PLT
                        movl i@GOT(%ebx), %edx
    addl i, %eax        addl (%edx), %eax

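Thread-Local Storage
    __thread int x;
    extern __thread int y;
[Figure: TLS layout, showing the thread pointer (TP), the static TLS block addressed by TP offsets, and the DTV, indexed by module index, pointing to the dynamic TLS blocks addressed by offset; x, y and z mark variables in these blocks.]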

Local Exec
    __thread int x;
    int getx() { return x; }
• movl %gs:x@NTPOFF, %eax
[Figure: TLS layout, as on the Thread-Local Storage slide.]

Initial Exec
    extern __thread int y;
    int gety() { return y; }
• movl y@GOTNTPOFF(%ebx), %eax
• movl %gs:(%eax), %eax
_GOT_+y@GOTNTPOFF:
• .word y@NTPOFF

General Dynamic
    __thread int z;
    int getz() { return z; }
• leal z@TLSGD(,%ebx,1), %eax
• call __tls_get_addr@PLT
• movl (%eax), %eax
    void *__tls_get_addr(struct { long index, offset; } *);
_GOT_+z@TLSGD:
• .word index, offset

__tls_get_addr
• If generation count is not current, update() DTV
• If dtv[index] not allocated, allocate() it
• Return dtv[index] + offset
[Figure: TLS layout, as on the Thread-Local Storage slide.]
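The three steps above can be sketched as a toy model in plain C. This is only an illustration under stated assumptions: every name here (tls_index, toy_dtv, toy_tls_get_addr, update_dtv, allocate_block) is hypothetical, the DTV is passed explicitly rather than found through the thread pointer, and updating the DTV is reduced to refreshing its generation number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: none of these names come from a real C library. */
enum { MAX_MODULES = 8, BLOCK_SIZE = 64 };

/* The (module index, offset) pair stored in the GOT. */
struct tls_index { size_t index, offset; };

/* One DTV per thread: a generation stamp plus one block pointer per module. */
struct toy_dtv {
    size_t generation;
    char *block[MAX_MODULES];         /* NULL until first use */
};

/* Bumped whenever the set of loaded modules changes (simulated dlopen). */
static size_t global_generation = 1;

/* Step 1: bring a stale DTV up to date. A real loader would also
   resize the vector and release blocks of unloaded modules. */
static void update_dtv(struct toy_dtv *dtv) {
    dtv->generation = global_generation;
}

/* Step 2: allocate a zero-filled TLS block on first use. */
static char *allocate_block(void) {
    char *p = calloc(1, BLOCK_SIZE);
    assert(p != NULL);
    return p;
}

/* Step 3: return the address of the variable for this thread. */
static void *toy_tls_get_addr(struct toy_dtv *dtv, const struct tls_index *ti) {
    assert(ti->index < MAX_MODULES && ti->offset < BLOCK_SIZE);
    if (dtv->generation != global_generation)
        update_dtv(dtv);
    if (dtv->block[ti->index] == NULL)
        dtv->block[ti->index] = allocate_block();
    return dtv->block[ti->index] + ti->offset;
}
```

Calling toy_tls_get_addr twice with the same DTV and index pair returns the same address; only the first call pays for the update and the allocation, but every call still pays for the two checks and the function call itself.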

Local Dynamic
    static __thread int z1, z2;
    int getz() { return z1 + z2; }
• leal z1@TLSLDM(%ebx), %eax
• call __tls_get_addr@PLT
• movl %eax, %esi
• movl z1@DTPOFF(%eax), %eax
• addl z2@DTPOFF(%esi), %eax
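One way to inspect these access sequences yourself is to pin each variable to a model with the tls_model variable attribute that GCC and Clang provide. The sketch below is an assumption-laden illustration, not part of the original slides: the variable and function names are made up, and local-exec is left out because it is only valid in the main executable.

```c
/* Each variable requests one TLS access model explicitly.
   Names are hypothetical; the attribute is a GCC/Clang extension. */
__thread int gd_var __attribute__((tls_model("global-dynamic"))) = 1;

static __thread int ld_var1 __attribute__((tls_model("local-dynamic"))) = 2;
static __thread int ld_var2 __attribute__((tls_model("local-dynamic"))) = 3;

__thread int ie_var __attribute__((tls_model("initial-exec"))) = 4;

/* General Dynamic: one resolver call per variable. */
int get_gd(void) { return gd_var; }

/* Local Dynamic: one call for the module base can serve both variables. */
int get_ld(void) { return ld_var1 + ld_var2; }

/* Initial Exec: TP offset loaded from the GOT, no call. */
int get_ie(void) { return ie_var; }
```

Compiling this file with something like `gcc -O2 -fPIC -S` and reading the generated assembly shows the sequence chosen for each getter; the exact instructions depend on the target and compiler version, and the linker may relax them further.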

TLS Descriptor-based General Dynamic
    __thread int yz;
    int getyz() { return yz; }
• leal yz@TLSDESC(%ebx), %eax
• call *yz@TLSCALL(%eax) ;; == call *(%eax)
• movl %gs:(%eax), %eax
_GOT_+yz@TLSDESC:
• .word resolver, argument

Static Descriptor
_GOT_+y@TLSDESC:
• .word sresolver, y@NTPOFF
sresolver:
• movl 4(%eax), %eax
• ret

Dynamic Descriptor
_GOT_+z@TLSDESC:
• .word dresolver, dyndesc(z)
dresolver:
• If the generation count (GC) is current enough and dtv[index] is allocated, return dtv[index] + offset - TP
• Otherwise, call __tls_get_addr preserving registers, then subtract TP
dyndesc(z): (allocated by the dynamic loader)
• .word index, offset, generation
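The static and dynamic descriptors can be modeled together in plain C to show that the caller's sequence is the same in both cases: call the resolver, add TP. This is a minimal sketch under assumptions: all type and function names are hypothetical, the thread state is passed explicitly instead of being read from a TP register, and "current enough" is taken to mean that the thread's DTV generation is at least the generation stored in the descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model only: names are illustrative, not a real loader's. */
enum { MAX_MODULES = 4, BLOCK_SIZE = 64 };

struct thread;
struct tlsdesc;
typedef intptr_t resolver_fn(const struct tlsdesc *, struct thread *);

/* The two-word GOT entry: resolver and its argument. */
struct tlsdesc { resolver_fn *resolver; uintptr_t argument; };

/* Argument block of a dynamic descriptor. */
struct dyndesc { size_t index, offset, generation; };

struct thread {
    char static_block[BLOCK_SIZE];   /* TP points at its start in this model */
    size_t dtv_generation;
    char *dtv[MAX_MODULES];
    unsigned slow_calls;             /* instrumentation for the example */
};

static size_t global_generation = 1;

static char *tp(struct thread *t) { return t->static_block; }

/* Static descriptor: the argument already is the TP offset. */
static intptr_t sresolver(const struct tlsdesc *d, struct thread *t) {
    (void)t;
    return (intptr_t)d->argument;
}

/* Stand-in for the slow path through __tls_get_addr. */
static char *slow_get_addr(struct thread *t, const struct dyndesc *dd) {
    t->slow_calls++;
    t->dtv_generation = global_generation;
    if (t->dtv[dd->index] == NULL) {
        t->dtv[dd->index] = calloc(1, BLOCK_SIZE);
        assert(t->dtv[dd->index] != NULL);
    }
    return t->dtv[dd->index] + dd->offset;
}

/* Dynamic descriptor: fast path when the DTV entry is known to be valid. */
static intptr_t dresolver(const struct tlsdesc *d, struct thread *t) {
    const struct dyndesc *dd = (const struct dyndesc *)d->argument;
    char *addr;
    if (t->dtv_generation >= dd->generation && t->dtv[dd->index] != NULL)
        addr = t->dtv[dd->index] + dd->offset;
    else
        addr = slow_get_addr(t, dd);
    return (intptr_t)addr - (intptr_t)tp(t);   /* offset relative to TP */
}

/* What the caller does for either kind of descriptor. */
static char *tls_addr(const struct tlsdesc *d, struct thread *t) {
    return (char *)((intptr_t)tp(t) + d->resolver(d, t));
}
```

In this model the first access through a dynamic descriptor takes the slow path once per thread; later accesses return from the fast path without the call, which is where the dynamic-case savings would come from.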

Lazy Descriptor
_GOT_+yz@TLSDESC:
• .word lresolver, reloc
lresolver:
• Acquire loader lock
• If not resolved yet,
  – Apply relocation preserving registers
• Release lock
• Return into final resolver
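The lazy resolver's steps can be sketched in C as follows. This is a simplified model, not a loader implementation: the names are hypothetical, a C11 atomic_flag spinlock stands in for the loader lock, "applying the relocation" is faked by turning symbol number n into TP offset 8*n, and the memory-ordering care a real multi-threaded implementation needs for unlocked readers of the descriptor is ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy model only: names and the fake relocation rule are assumptions. */
struct tlsdesc;
typedef intptr_t resolver_fn(struct tlsdesc *);

struct tlsdesc { resolver_fn *resolver; uintptr_t argument; };

/* Stand-in for the dynamic loader's lock. */
static atomic_flag loader_lock = ATOMIC_FLAG_INIT;
static void lock_loader(void)   { while (atomic_flag_test_and_set(&loader_lock)) { } }
static void unlock_loader(void) { atomic_flag_clear(&loader_lock); }

static unsigned relocations_applied;  /* instrumentation for the example */

/* Final resolver for the static case: the argument is the TP offset. */
static intptr_t sresolver(struct tlsdesc *d) {
    return (intptr_t)d->argument;
}

static intptr_t lresolver(struct tlsdesc *d);

/* Fake relocation processing: symbol number n resolves to offset 8*n. */
static void apply_relocation(struct tlsdesc *d) {
    uintptr_t reloc = d->argument;
    d->argument = 8 * reloc;
    d->resolver = sresolver;          /* switch to the final resolver last */
    relocations_applied++;
}

/* Lazy resolver: resolve at most once, then defer to the final resolver. */
static intptr_t lresolver(struct tlsdesc *d) {
    lock_loader();
    if (d->resolver == lresolver)     /* not resolved yet */
        apply_relocation(d);
    unlock_loader();
    return d->resolver(d);            /* "return into" the final resolver */
}
```

After the first call through the descriptor, its resolver field points at the final resolver, so later accesses skip the lock entirely.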

Speedups: Static
[Figure: bar chart of t (CK) × ((MinSt, MaxSt) × (P3, A64/32, A64/64)), vertical axis from 0 to 70; speedup labels on the bars range from 1.6x to 26.0x.]

Speedups: Dynamic
[Figure: bar chart of t (CK) × ((MinSt, MaxSt) × (P3, A64/32, A64/64)), vertical axis from 0 to 70; speedup labels on the bars range from 1.0x to 1.9x.]

ARM Port
        Original                      Optimized
        ldr r0, .Lt0                  ldr r0, .Lt0
.L1:    add r0, pc, r0
        bl __tls_get_addr(PLT)        bl foo(tlscall)
        ldr r0, [r0]                  ldr r0, [$tp, r0]
.Lt0:   .word foo(tlsgd) \            .word foo(tlsdesc) \
          + (. - .L1 - 8)               + (. - .L1)

ARM Port (cont)
        Original                           Optimized
        bl tga(PLT)                        bl tramp
        .word foo(tlsgd) \                 .word foo(tlsdesc) \
          + (. - .L1 - 8)                    + (. - .L1 - 4)

        add ip, pc, tga(got)[24:31]        add r0, lr, r0
        add ip, ip, tga(got)[16:23]        ldr r1, [r0, #4]
        ldr pc, [ip, tga(got)[0:15]]!      bx r1

Relaxations
        GD                      IE                      LE
        ldr r0, .Lt0            ldr r0, .Lt0            ldr r0, .Lt0
        bl foo(tlscall)         ldr r0, [pc, r0]        nop
        .word foo(tlsdesc) \    foo(gottpoff) \         foo(tpoff)
          + (. - .L1 - 4)         + (. - .L1 - 8)         + 0

Inlining the Trampoline
        ldr rt, .Lt1            ldr rt, .Lt1            ldr rt, .Lt1
        add rx, pc, rt          add rx, pc, rt          mov rx, rt
        ldr ry, [rx, #4]        ldr ry, [rx]            nop
        mov r0, rx              mov r0, rx              mov r0, rx
        blx ry                  mov r0, ry              nop
        .word foo(tlsdesc) \    foo(gottpoff) \         foo(tpoff)
          + (. - .L1 - 8)         + (. - .L1 - 8)         + 0

Conclusions
• Major speedups in the most common case
• Small speedups even in the dlopen case
  – Compiler improvements could reduce them
  – Generation count
  – Calling conventions
• Smaller code, same data space in static case
• Lazy relocation
• Ported to x86, x86_64, ARM and FR-V
