concurrent programming
play

Concurrent Programming Performance (vs. VAX-11/780) 10000 with - PDF document

11/21/16 COMP 530: Operating Systems COMP 530: Operating Systems Uniprocessor Performance Not Scaling Concurrent Programming Performance (vs. VAX-11/780) 10000 with Threads: 20% /year 1000 Why you should care deeply 52% /year 100 Don


  1. 11/21/16 COMP 530: Operating Systems COMP 530: Operating Systems Uniprocessor Performance Not Scaling Concurrent Programming Performance (vs. VAX-11/780) 10000 with Threads: 20% /year 1000 Why you should care deeply 52% /year 100 Don Porter Portions courtesy Emmett Witchel 10 25% /year 1 1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 Graph by Dave Patterson 1 COMP 530: Operating Systems COMP 530: Operating Systems Power and Heat Lay Waste to CPU Makers What about Moore ’ s law? • Intel P4 (2000-2007) – 1.3GHz to 3.8GHz, 31 stage pipeline – “ Prescott ” in 02/04 was too hot. Needed 5.2GHz to beat 2.6GHz Athalon • Intel Pentium Core, (2006-) – 1.06GHz to 3GHz, 14 stage pipeline – Based on mobile (Pentium M) micro-architecture • Power efficient • 2% of electricity in the U.S. feeds computers • Number of transistors double every 24 months – Doubled in last 5 years – Not performance! COMP 530: Operating Systems COMP 530: Operating Systems Transistor Budget Multi-Core is Here: Plain and Simple • We have an increasing glut of transistors • Raise your hand if your laptop is single core? – (at least for a few more years) • Your phone? • But we can’t use them to make things faster – Techniques that worked in the 90s blew up heat faster • That’s what I thought than we can dissipate it • What to do? – Use the increasing transistor budget to make more cores! 5 6 1

  2. 11/21/16 COMP 530: Operating Systems COMP 530: Operating Systems Multi-Core Programming == Essential Skill Threads: OS Abstraction for Concurrency • Process abstraction combines two concepts – Concurrency • Hardware manufacturers betting big on • Each process is a sequential execution stream of instructions multicore – Protection • Each process defines an address space • Software developers are needed • Address space identifies all addresses that can be touched by the program • Threads • Writing concurrent programs is not easy – Key idea: separate the concepts of concurrency from protection • You will learn how to do it in this class – A thread is a sequential execution stream of instructions – A process defines the address space that may be shared by multiple threads – Threads can execute on different cores on a multicore CPU (parallelism for performance) and can communicate with other threads by updating memory Still treated like a bonus: Don’t graduate without it! 8 COMP 530: Operating Systems COMP 530: Operating Systems Practical Difference Programmer’s View • With processes, you coordinate through nice abstractions (relatively speaking – e.g., lab 1) void fn1(int arg0, int arg1, …) {…} – Pipes, signals, etc. • With threads, you communicate through data main() { structures in your process virtual address space … – Just read/write variables and pointers tid = CreateThread(fn1, arg0, arg1, …); … } At the point CreateThread is called, execution continues in parent thread in main function, and execution starts at fn1 in the child thread, both in parallel (concurrently) 9 COMP 530: Operating Systems COMP 530: Operating Systems Implementing Threads: Example Redux How can it help? • How can this code take advantage of 2 threads? for(k = 0; k < n; k++) Virtual Address Space a[k] = b[k] * c[k] + d[k] * e[k]; hello libc.so Linux • Rewrite this code fragment as: heap stk1 stk2 do_mult(l, m) { for(k = l; k < m; k++) 0 0xffffffff a[k] = b[k] * c[k] + d[k] * e[k]; • 2 threads requires 2 stacks in the process } main() { • No problem! CreateThread(do_mult, 0, n/2); • Kernel can schedule each thread separately CreateThread(do_mult, n/2, n); – Possibly on 2 CPUs • What did we gain? – Requires some extra bookkeeping 2

  3. 11/21/16 COMP 530: Operating Systems COMP 530: Operating Systems How Can Threads Help? Overlapping I/O and Computation • Consider a Web server Request 1 Request 2 Create a number of threads, and for each thread do Thread 1 Thread 2 v get network message from client v get network message v get URL data from disk (URL) from client v get network message v send data over network v get URL data from disk (URL) from client (disk access latency) v get URL data from disk • What did we gain? (disk access latency) v send data over network v send data over network Total time is less than request 1 + request 2 Time COMP 530: Operating Systems COMP 530: Operating Systems Why threads? (summary) Threads vs. Processes Processes Threads • Computation that can be divided into concurrent chunks • A thread has no data segment A process has code/data/heap & other or heap segments – Execute on multiple cores: reduce wall-clock exec. time • A thread cannot live on its own, There must be at least one thread in a it must live within a process – Harder to identify parallelism in more complex cases process • There can be more than one Threads within a process share • Overlapping blocking I/O with computation thread in a process, the first code/data/heap, share I/O, but each thread calls main & has the – If my web server blocks on I/O for one client, why not work has its own stack & registers process ’ s stack If a process dies, its resources are on another client’s request in a separate thread? • If a thread dies, its stack is reclaimed & all threads die reclaimed – Other abstractions we won’t cover (e.g., events) • Inter-thread communication via Inter-process communication via OS memory. and data copying. • Each thread can run on a Each process can run on a different different physical processor physical processor • Inexpensive creation and Expensive creation and context switch context switch COMP 530: Operating Systems COMP 530: Operating Systems Implementing Threads Thread Life Cycle Process ’ s TCB for address space • Processes define an address • Threads (just like processes) go through a sequence of start , Thread1 space; threads share the mapped segments ready , running , waiting , and done states address space PC DLL ’ s SP State • Process Control Block (PCB) Heap Start Done Registers contains process-specific … information – Owner, PID, heap pointer, priority, active thread, and TCB for Stack – thread2 Ready Running pointers to thread information Thread2 • Thread Control Block (TCB) PC Stack – thread1 contains thread-specific SP information State Initialized data Waiting Registers – Stack pointer, PC, thread state … (running, …), register values, a Code pointer to PCB, … 3

  4. 11/21/16 COMP 530: Operating Systems COMP 530: Operating Systems Threads have the same Threads have their own…? scheduling states as processes 1. CPU 1. True 2. Address space 2. False 3. PCB 4. Stack In fact, OSes generally schedule threads to CPUs, not processes 5. Registers Yes, yes, another white lie in this course COMP 530: Operating Systems COMP 530: Operating Systems Lecture Outline Performance: Latency vs. Throughput • What are threads? • Latency: time to complete an operation • Small digression: Performance Analysis • Throughput: work completed per unit time • Multiplying vector example: reduced latency – There will be a few more of these in upcoming lectures • Web server example: increased throughput • Why are threads hard? • Consider plumbing – Low latency: turn on faucet and water comes out – High bandwidth: lots of water (e.g., to fill a pool) • What is “ High speed Internet? ” – Low latency: needed to interactive gaming – High bandwidth: needed for downloading large files – Marketing departments like to conflate latency and bandwidth… 21 COMP 530: Operating Systems COMP 530: Operating Systems Latency and Throughput Latency, Throughput, and Threads • Can threads improve throughput? • Latency and bandwidth only loosely coupled – Yes, as long as there are parallel tasks and CPUs available – Henry Ford: assembly lines increase bandwidth without • Can threads improve latency? reducing latency – Yes, especially when one task might block on another task’s • My factory takes 1 day to make a Model-T ford. IO – But I can start building a new car every 10 minutes • Can threads harm throughput? – At 24 hrs/day, I can make 24 * 6 = 144 cars per day – Yes, each thread gets a time slice. – A special order for 1 green car, still takes 1 day – If # threads >> # CPUs, the %of CPU time each thread gets – Throughput is increased, but latency is not. approaches 0 • Latency reduction is difficult • Can threads harm latency? • Often, one can buy bandwidth – Yes, especially when requests are short and there is little I/O – E.g., more memory chips, more disks, more computers – Big server farms (e.g., google) are high bandwidth Threads can help or hurt: Understand when they help! 4

Recommend


More recommend