  1. Simon Peyton Jones Microsoft Research

  2. The free lunch is over. Multicores are here. We have to program them. This is hard. Yada-yada-yada.

Programming parallel computers:
 Plan A. Start with a language whose computational fabric is by-default sequential, and by heroic means make the program parallel.
 Plan B. Start with a language whose computational fabric is by-default parallel.

Every successful large-scale application of parallelism has been largely declarative and value-oriented:
 SQL Server
 LINQ
 Map/Reduce
 Scientific computation

Plan B will win. Parallel programming will increasingly mean functional programming.

  3. “Just use a functional language and your troubles are over”
Right idea:
 No side effects, or limited side effects
 Strong guarantees that sub-computations do not interfere
But far too starry-eyed. No silver bullet:
 One size does not fit all
 You need to “think parallel”: if the algorithm has sequential data dependencies, no language will save you!

  4. Different problems need different solutions:
 Shared memory vs distributed memory
 Transactional memory
 Message passing
 Data parallelism
 Locality
 Granularity
 Map/reduce
 ...on and on and on...

A “cost model” gives the programmer some idea of what an operation costs, without burying her in details. Examples:
• Send message: copy the data, or swing a pointer?
• Memory fetch: uniform access, or do cache effects dominate?
• Thread spawn: tens of cycles, or tens of thousands of cycles?
• Scheduling: can a thread starve?

Common theme:
 The cost model matters – you can’t just say “leave it to the system”
 No single cost model is right for all

  5. Goal: express the “natural structure” of a program involving lots of concurrent I/O (e.g. a web server, a responsive GUI, or downloading lots of URLs in parallel).
 Makes perfect sense with or without multicore
 Most threads are blocked most of the time
Usually done with:
 Thread pools
 Event handlers
 Message pumps
Really, really hard to get right, especially when combined with exceptions and error handling.
NB: Significant steps forward in F#/C# recently: Async<T>. See http://channel9.msdn.com/blogs/pdc2008/tl11

  6. Sole goal: performance using multiple cores
 …at the cost of a more complicated program
#include “StdTalk.io”
 Clock speeds are not increasing
 Transistor count is still increasing
 Delivered in the form of more cores
 Often with inadequate memory bandwidth
No alternative: the only way to ride Moore’s law is to write parallel code.

  7. Use a functional language
 But offer many different approaches to parallel/concurrent programming, each with a different cost model
 Do not force an up-front choice: better one language offering many abstractions than many languages offering one each (HPF, map/reduce, pthreads, …)

  8. This talk: lots of different concurrent/parallel programming paradigms (cost models) in Haskell. Use Haskell!
 Task parallelism: explicit threads, synchronised via locks, messages, or STM. Modest parallelism; hard to program.
 Semi-implicit parallelism: evaluate pure functions in parallel. Modest parallelism; implicit synchronisation; easy to program.
 Data parallelism: operate simultaneously on bulk data. Massive parallelism; single flow of control; implicit synchronisation; easy to program.
Slogan: no silver bullet – embrace diversity.

  9. Multicore: parallel programming is essential.
Task parallelism: explicit threads, synchronised via locks, messages, or STM.

  10. Lots of threads, all performing I/O:
 GUIs
 Web servers (and other servers, of course)
 BitTorrent clients
Non-deterministic by design. Needs:
 Lightweight threads
 A mechanism for threads to coordinate/share
Typically: pthreads/Java threads + locks/condition variables.

  11. Haskell threads:
 Very, very lightweight; explicitly spawned; can perform I/O
 Threads cost a few hundred bytes each – you can have (literally) millions of them
 I/O blocking via epoll => OK to have hundreds of thousands of outstanding I/O requests
 Pre-emptively scheduled
 Threads share memory
 Coordination via Software Transactional Memory (STM)
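A minimal sketch (mine, not from the slides) of how cheap these threads are: spawn 100,000 of them with forkIO, each signalling completion through its own MVar.

import Control.Concurrent
import Control.Monad (replicateM, forM_)

main :: IO ()
main = do
  dones <- replicateM 100000 newEmptyMVar      -- one MVar per thread
  forM_ dones (\d -> forkIO (putMVar d ()))    -- spawn 100,000 threads
  mapM_ takeMVar dones                         -- wait for all of them
  putStrLn "all threads finished"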

  12.
main = do { putStr (reverse "yes")
          ; putStr "no" }

• Effects are explicit in the type system
  – (reverse "yes") :: String   -- No effects
  – (putStr "no")   :: IO ()    -- Can have effects
• The main program is an effect-ful computation
  – main :: IO ()

  13.
newRef   :: a -> IO (Ref a)
readRef  :: Ref a -> IO a
writeRef :: Ref a -> a -> IO ()

main = do { r <- newRef 0
          ; incR r
          ; s <- readRef r
          ; print s }

incR :: Ref Int -> IO ()
incR r = do { v <- readRef r
            ; writeRef r (v+1) }

Reads and writes are 100% explicit! You can't say (r + 6), because r :: Ref Int.
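The slide's Ref is pedagogical shorthand; assuming it corresponds to GHC's IORef (from Data.IORef), here is a runnable version of the same example:

import Data.IORef

incR :: IORef Int -> IO ()
incR r = do { v <- readIORef r
            ; writeIORef r (v+1) }

main :: IO ()
main = do { r <- newIORef 0
          ; incR r
          ; s <- readIORef r
          ; print s }           -- prints 1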

  14.
forkIO :: IO () -> IO ThreadId

 forkIO spawns a thread
 It takes an action as its argument

webServer :: RequestPort -> IO ()
webServer p = do { conn <- acceptRequest p
                 ; forkIO (serviceRequest conn)
                 ; webServer p }

serviceRequest :: Connection -> IO ()
serviceRequest c = do { … interact with client … }

No event-loop spaghetti! (See the concrete sketch below.)
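RequestPort, acceptRequest, and Connection above are schematic. A concrete sketch of the same accept-and-fork pattern, assuming the network package's Network.Socket API, as a trivial echo server:

import Control.Concurrent (forkIO)
import Control.Monad (forever)
import Network.Socket
import qualified Network.Socket.ByteString as NBS

main :: IO ()
main = do
  sock <- socket AF_INET Stream defaultProtocol
  setSocketOption sock ReuseAddr 1
  bind sock (SockAddrInet 8080 0)       -- 0 = any local address
  listen sock 128
  forever $ do
    (conn, _peer) <- accept sock
    forkIO (serviceRequest conn)        -- one lightweight thread per client

serviceRequest :: Socket -> IO ()
serviceRequest conn = do
  msg <- NBS.recv conn 4096             -- read up to 4096 bytes of a request
  NBS.sendAll conn msg                  -- echo it straight back
  close conn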

  15. How do threads coordinate with each other?

main = do { r <- newRef 0
          ; forkIO (incR r)
          ; incR r
          ; ... }

incR :: Ref Int -> IO ()
incR r = do { v <- readRef r
            ; writeRef r (v+1) }

Aargh! A race: both threads can read the same value, and one increment is lost.
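A runnable demonstration of that race (my sketch, again using IORef for the slide's Ref): compile with -threaded and run with +RTS -N, and the final count usually falls well short of 200000 because increments are lost.

import Control.Concurrent
import Control.Monad (replicateM_)
import Data.IORef

incR :: IORef Int -> IO ()
incR r = do { v <- readIORef r; writeIORef r (v+1) }

main :: IO ()
main = do
  r    <- newIORef (0 :: Int)
  done <- newEmptyMVar
  let worker = replicateM_ 100000 (incR r) >> putMVar done ()
  _ <- forkIO worker
  _ <- forkIO worker
  takeMVar done
  takeMVar done
  readIORef r >>= print    -- usually < 200000: lost updates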

  16. A 10-second review:
 Races: due to forgotten locks
 Deadlock: locks acquired in the “wrong” order
 Lost wakeups: forgotten notify to a condition variable
 Diabolical error recovery: need to restore invariants and release locks in exception handlers
These are serious problems. But even worse...

  17. Scalable double-ended queue: one lock per cell.
 No interference if the ends are “far enough” apart
 But watch out when the queue is 0, 1, or 2 elements long!

  18.
Coding style     | Difficulty of concurrent queue
-----------------|-------------------------------
Sequential code  | Undergraduate

  19.
Coding style                   | Difficulty of concurrent queue
-------------------------------|-----------------------------------------------
Sequential code                | Undergraduate
Locks and condition variables  | Publishable result at international conference

  20.
Coding style                   | Difficulty of concurrent queue
-------------------------------|-----------------------------------------------
Sequential code                | Undergraduate
Locks and condition variables  | Publishable result at international conference
Atomic blocks                  | Undergraduate

  21.
atomically { ... sequential get code ... }

 To a first approximation, just write the sequential code, and wrap atomically around it
 All-or-nothing semantics: atomic commit (the A of ACID)
 The atomic block executes in isolation (the I of ACID)
 Cannot deadlock (there are no locks!)
 Atomicity makes error recovery easy (e.g. an exception thrown inside the get code)

  22.
atomically :: IO a -> IO a

main = do { r <- newRef 0
          ; forkIO (atomically (incR r))
          ; atomically (incR r)
          ; ... }

 atomically is a function, not a syntactic construct
 A worry: what stops you doing incR outside atomically?

  23. Better idea:

atomically :: STM a -> IO a
newTVar    :: a -> STM (TVar a)
readTVar   :: TVar a -> STM a
writeTVar  :: TVar a -> a -> STM ()

incT :: TVar Int -> STM ()
incT r = do { v <- readTVar r; writeTVar r (v+1) }

main = do { r <- atomically (newTVar 0)
          ; forkIO (atomically (incT r))
          ; atomically (incT r)
          ; ... }

  24.
atomically :: STM a -> IO a
newTVar    :: a -> STM (TVar a)
readTVar   :: TVar a -> STM a
writeTVar  :: TVar a -> a -> STM ()

 Can't fiddle with TVars outside an atomic block [good]
 Can't do IO inside an atomic block [sad, but also good]
 No changes to the compiler (whatsoever). Only the runtime system and primops.
 ...and, best of all...
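To make the “wrap sequential code in atomically” idea concrete, here is a small runnable sketch (mine, not from the slides) of a toy queue held in a single TVar – not the scalable one-lock-per-cell deque of slide 17:

import Control.Concurrent.STM

type Queue a = TVar [a]

-- Sequential-looking "get" code; atomicity comes from the wrapper below
getFront :: Queue a -> STM (Maybe a)
getFront q = do
  xs <- readTVar q
  case xs of
    []       -> return Nothing
    (x:rest) -> do { writeTVar q rest; return (Just x) }

main :: IO ()
main = do
  q  <- newTVarIO [1, 2, 3 :: Int]
  mx <- atomically (getFront q)   -- the whole get runs atomically
  print mx                        -- Just 1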

  25.
incT :: TVar Int -> STM ()
incT r = do { v <- readTVar r; writeTVar r (v+1) }

incT2 :: TVar Int -> STM ()
incT2 r = do { incT r; incT r }

foo :: IO ()
foo = ...atomically (incT2 r)...

Composition is THE way we build big programs that work.
 An STM computation is always executed atomically (e.g. incT2). The type tells you.
 Simply glue STMs together arbitrarily; then wrap with atomically
 No nested atomically. (What would it mean?)
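The classic illustration of this composition (a hedged sketch; withdraw, deposit, and transfer are my names, not from the slides): two STM actions glued into one, and the combination still commits all-or-nothing.

import Control.Concurrent.STM

type Account = TVar Int

withdraw :: Account -> Int -> STM ()
withdraw acc n = do { bal <- readTVar acc; writeTVar acc (bal - n) }

deposit :: Account -> Int -> STM ()
deposit acc n = withdraw acc (negate n)

-- Composed from two STM actions, then wrapped: no intermediate state
-- (money gone from one account but not yet in the other) is ever visible
transfer :: Account -> Account -> Int -> IO ()
transfer from to n = atomically (do { withdraw from n; deposit to n })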

  26.  MVars for efficiency in (very common) special cases
 Blocking (retry) and choice (orElse) in STM
 Exceptions in STM
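A brief sketch of blocking and choice (my illustration; takeItem and takeEither are made-up names): retry aborts the transaction and blocks until one of the TVars it read changes; orElse tries a second alternative when the first retries.

import Control.Concurrent.STM

-- Block until the buffer is non-empty, then remove and return its head
takeItem :: TVar [a] -> STM a
takeItem buf = do
  xs <- readTVar buf
  case xs of
    []       -> retry               -- nothing yet: abort, block, re-run later
    (x:rest) -> do { writeTVar buf rest; return x }

-- Try the first buffer; if it would block, fall through to the second
takeEither :: TVar [a] -> TVar [a] -> STM a
takeEither b1 b2 = takeItem b1 `orElse` takeItem b2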

  27. A very simple web server written in Haskell:
 Full HTTP 1.0 and 1.1 support
 Handles chunked transfer encoding
 Uses sendfile for optimized static file serving
 Allows request bodies and response bodies to be processed in constant space
 Protection against all the basic attack vectors: overlarge request headers and slow-loris attacks
 500 lines of Haskell (building on some amazing libraries: bytestring, blaze-builder, iteratee)
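This description matches the Warp web server. As a hedged illustration (using the modern WAI 3 API, which postdates this talk), a minimal application served by it:

{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Types (status200)
import Network.Wai (Application, responseLBS)
import Network.Wai.Handler.Warp (run)

app :: Application
app _request respond =
  respond (responseLBS status200 [("Content-Type", "text/plain")] "pong")

main :: IO ()
main = run 8080 app   -- Warp serves each connection on a lightweight thread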

  28.  A new thread for each user request
 Fast, fast
[Chart: “Pong” benchmark, requests per second]

  29.  Again, lots of threads: 400–600 is typical
 Significantly bigger program: 5,000 lines of Haskell – but way smaller than the competition (80,000 loc of Erlang; not shown: Vuze, at 480k lines)
 Built on STM
 Performance: roughly competitive
[Chart: lines of code, Erlang vs Haskell]

  30.  So far everything is shared memory
 Distributed memory has a different cost model
 Think message passing…
 Think Erlang…
