Simon Peyton Jones, Microsoft Research
The free lunch is over. Multicores are here. We have to program them. This is hard. Yada-yada-yada.

Programming parallel computers:
- Plan A: start with a language whose computational fabric is by-default sequential, and by heroic means make the program parallel.
- Plan B: start with a language whose computational fabric is by-default parallel.

Every successful large-scale application of parallelism has been largely declarative and value-oriented: SQL Server, LINQ, Map/Reduce, scientific computation.

Plan B will win: parallel programming will increasingly mean functional programming.
“Just use a functional language and your troubles are over.” Right idea:
- No side effects
- Limited side effects
- Strong guarantees that sub-computations do not interfere

But far too starry-eyed. No silver bullet: one size does not fit all, and you still need to “think parallel”: if the algorithm has sequential data dependencies, no language will save you!
A “cost model” gives the programmer some idea of what an operation costs, without burying her in details. Examples:
- Send message: copy data or swing a pointer?
- Memory fetch: uniform access, or do cache effects dominate?
- Thread spawn: tens of cycles or tens of thousands of cycles?
- Scheduling: can a thread starve?

Different problems need different solutions:
- Shared memory vs distributed memory
- Transactional memory
- Message passing
- Data parallelism
- Locality
- Granularity
- Map/reduce
- ...on and on and on...

Common theme: the cost model matters – you can’t just say “leave it to the system”, and no single cost model is right for all.
Goal: express the “natural structure” of a program involving lots of concurrent I/O (e.g. a web server, a responsive GUI, or downloading lots of URLs in parallel). This makes perfect sense with or without multicore: most threads are blocked most of the time.

Usually done with thread pools, event handlers, and message pumps – really, really hard to get right, especially when combined with exceptions and error handling.

NB: significant steps forward in F#/C# recently: Async<T>. See http://channel9.msdn.com/blogs/pdc2008/tl11
Sole goal: performance using multiple cores, at the cost of a more complicated program.

#include “StdTalk.io”
- Clock speeds not increasing
- Transistor count still increasing
- Delivered in the form of more cores, often with inadequate memory bandwidth
- No alternative: the only way to ride Moore’s law is to write parallel code
Use a functional language. But offer many different approaches to parallel/concurrent programming, each with a different cost model. Do not force an up-front choice: better one language offering many abstractions than many languages offering one each (HPF, map/reduce, pthreads, …).
This talk: lots of different concurrent/parallel programming paradigms (cost models) in Haskell. Use Haskell!

- Task parallelism: explicit threads, synchronised via locks, messages, or STM. Modest parallelism; hard to program.
- Semi-implicit parallelism: evaluate pure functions in parallel. Modest parallelism; easy to program. Single flow of control; implicit synchronisation.
- Data parallelism: operate simultaneously on bulk data. Massive parallelism; easy to program. Implicit synchronisation.

Slogan: no silver bullet – embrace diversity.
Multicore: parallel programming essential.

Task parallelism: explicit threads, synchronised via locks, messages, or STM.
Lots of threads, all performing I/O:
- GUIs
- Web servers (and other servers, of course)
- BitTorrent clients

Non-deterministic by design. Needs:
- Lightweight threads
- A mechanism for threads to coordinate/share

Typically: pthreads/Java threads + locks/condition variables.
Very, very lightweight threads:
- Explicitly spawned; can perform I/O
- Threads cost a few hundred bytes each
- You can have (literally) millions of them
- I/O blocking via epoll => OK to have hundreds of thousands of outstanding I/O requests
- Pre-emptively scheduled

Threads share memory. Coordination via Software Transactional Memory (STM).
main = do { putStr (reverse "yes")
          ; putStr "no" }

- Effects are explicit in the type system:
    (reverse "yes") :: String   -- No effects
    (putStr "no")   :: IO ()    -- Can have effects
- The main program is an effect-ful computation:
    main :: IO ()
newRef   :: a -> IO (Ref a)
readRef  :: Ref a -> IO a
writeRef :: Ref a -> a -> IO ()

main = do { r <- newRef 0
          ; incR r
          ; s <- readRef r
          ; print s }

incR :: Ref Int -> IO ()
incR r = do { v <- readRef r
            ; writeRef r (v+1) }

Reads and writes are 100% explicit! You can’t say (r + 6), because r :: Ref Int.
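In GHC's standard libraries this Ref interface exists as Data.IORef (newIORef, readIORef, writeIORef). A runnable version of the slide's code under that renaming:

import Data.IORef

incR :: IORef Int -> IO ()
incR r = do { v <- readIORef r
            ; writeIORef r (v+1) }

main :: IO ()
main = do { r <- newIORef 0
          ; incR r
          ; s <- readIORef r
          ; print s }        -- prints 1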
forkIO :: IO () -> IO ThreadId

forkIO spawns a thread; it takes an action as its argument.

webServer :: RequestPort -> IO ()
webServer p = do { conn <- acceptRequest p
                 ; forkIO (serviceRequest conn)
                 ; webServer p }

serviceRequest :: Connection -> IO ()
serviceRequest c = do { … interact with client … }

No event-loop spaghetti!
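The webServer sketch assumes acceptRequest and serviceRequest exist, so it is illustrative rather than runnable. To see how cheap forkIO threads actually are (the "millions of them" claim earlier), here is a minimal runnable sketch of ours that spawns 100,000 threads, each signalling completion through a shared MVar:

import Control.Concurrent
import Control.Monad

main :: IO ()
main = do
  done <- newEmptyMVar            -- completion channel shared by all threads
  let n = 100000
  replicateM_ n (forkIO (putMVar done ()))
  replicateM_ n (takeMVar done)   -- wait for all n signals
  putStrLn ("spawned and joined " ++ show n ++ " threads")

This runs in well under a second on an ordinary machine; with OS threads the same program would be hopeless.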
How do threads coordinate with each other?

main = do { r <- newRef 0
          ; forkIO (incR r)
          ; incR r
          ; ... }

incR :: Ref Int -> IO ()
incR r = do { v <- readRef r
            ; writeRef r (v+1) }

Aargh! A race.
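A runnable version of the race, substituting Data.IORef for the slide's Ref, with an MVar (introduced later) just to wait for the forked thread:

import Data.IORef
import Control.Concurrent

incR :: IORef Int -> IO ()
incR r = do { v <- readIORef r
            ; writeIORef r (v+1) }

main :: IO ()
main = do
  r    <- newIORef 0
  done <- newEmptyMVar
  _ <- forkIO (do { incR r; putMVar done () })
  incR r
  takeMVar done
  readIORef r >>= print   -- usually 2, but 1 is possible: the
                          -- read-modify-write in incR is not atomic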
A 10-second review:
- Races: due to forgotten locks
- Deadlock: locks acquired in “wrong” order
- Lost wakeups: forgotten notify to condition variable
- Diabolical error recovery: need to restore invariants and release locks in exception handlers

These are serious problems. But even worse...
Scalable double-ended queue: one lock per cell. No interference if the ends are “far enough” apart – but watch out when the queue is 0, 1, or 2 elements long!
Coding style                    Difficulty of concurrent queue
---------------------------------------------------------------------------
Sequential code                 Undergraduate
Locks and condition variables   Publishable result at international conference
Atomic blocks                   Undergraduate
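The last row of the table is the point of what follows: write the sequential code, and an atomic block makes it concurrent. A sketch using the STM operations introduced on the next slides – the two-stack representation and the names (Deque, pushLeft, popRight) are ours, not the talk's:

import Control.Concurrent.STM

-- A deque as a pair of stacks, held in one TVar. Every operation is
-- plain sequential code; running it with `atomically` (next slides)
-- makes it a correct concurrent deque with no locks.
newtype Deque a = Deque (TVar ([a], [a]))

newDeque :: STM (Deque a)
newDeque = do { v <- newTVar ([], []); return (Deque v) }

pushLeft :: Deque a -> a -> STM ()
pushLeft (Deque v) x = do
  (front, back) <- readTVar v
  writeTVar v (x:front, back)

popRight :: Deque a -> STM (Maybe a)
popRight (Deque v) = do
  (front, back) <- readTVar v
  case back of
    x:xs -> do { writeTVar v (front, xs); return (Just x) }
    []   -> case reverse front of
              []   -> return Nothing
              x:xs -> do { writeTVar v ([], xs); return (Just x) }

Keeping the whole deque in one TVar means transactions on opposite ends do conflict, unlike the one-lock-per-cell version – but it is correct by construction, which is exactly where the locks-and-condition-variables version earns its conference paper.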
atomically { ... sequential get code ... }

To a first approximation, just write the sequential code, and wrap atomically around it.
- All-or-nothing semantics: atomic commit
- Atomic block executes in isolation
- Cannot deadlock (there are no locks!)
- Atomicity makes error recovery easy (e.g. an exception thrown inside the get code)

(The A and I of ACID.)
atomically :: IO a -> IO a

main = do { r <- newRef 0
          ; forkIO (atomically (incR r))
          ; atomically (incR r)
          ; ... }

atomically is a function, not a syntactic construct. A worry: what stops you doing incR outside atomically?
Better idea:

atomically :: STM a -> IO a
newTVar    :: a -> STM (TVar a)
readTVar   :: TVar a -> STM a
writeTVar  :: TVar a -> a -> STM ()

incT :: TVar Int -> STM ()
incT r = do { v <- readTVar r; writeTVar r (v+1) }

main = do { r <- atomically (newTVar 0)
          ; forkIO (atomically (incT r))
          ; atomically (incT r)
          ; ... }
atomically :: STM a -> IO a
newTVar    :: a -> STM (TVar a)
readTVar   :: TVar a -> STM a
writeTVar  :: TVar a -> a -> STM ()

- Can’t fiddle with TVars outside an atomic block [good]
- Can’t do IO inside an atomic block [sad, but also good]
- No changes to the compiler (whatsoever); only the runtime system and primops

...and, best of all...
incT :: TVar Int -> STM ()
incT r = do { v <- readTVar r; writeTVar r (v+1) }

incT2 :: TVar Int -> STM ()
incT2 r = do { incT r; incT r }

foo :: IO ()
foo = ...atomically (incT2 r)...

Composition is THE way we build big programs that work. An STM computation is always executed atomically (e.g. incT2) – the type tells you. Simply glue STMs together arbitrarily, then wrap with atomically. No nested atomically. (What would it mean?)
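The pieces above assemble into a complete runnable program (the stm package ships with GHC; the MVar is ours, used only to wait for the forked thread):

import Control.Concurrent
import Control.Concurrent.STM

incT :: TVar Int -> STM ()
incT r = do { v <- readTVar r; writeTVar r (v+1) }

incT2 :: TVar Int -> STM ()
incT2 r = do { incT r; incT r }

main :: IO ()
main = do
  r    <- atomically (newTVar 0)
  done <- newEmptyMVar
  _ <- forkIO (do { atomically (incT2 r); putMVar done () })
  atomically (incT2 r)
  takeMVar done
  v <- atomically (readTVar r)
  print v   -- always 4: no lost updates, unlike the Ref race earlier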
- MVars for efficiency in (very common) special cases
- Blocking (retry) and choice (orElse) in STM (see the sketch below)
- Exceptions in STM
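retry and orElse have the types retry :: STM a and orElse :: STM a -> STM a -> STM a in Control.Concurrent.STM. A sketch of both, using a one-place buffer (the buffer is our example, not the talk's):

import Control.Concurrent.STM

-- A one-place buffer. `retry` abandons the transaction and blocks
-- until some TVar it read changes; the runtime then re-runs it.
type Buf a = TVar (Maybe a)

put :: Buf a -> a -> STM ()
put b x = do
  m <- readTVar b
  case m of
    Nothing -> writeTVar b (Just x)
    Just _  -> retry              -- full: block until someone empties it

take' :: Buf a -> STM a
take' b = do
  m <- readTVar b
  case m of
    Just x  -> do { writeTVar b Nothing; return x }
    Nothing -> retry              -- empty: block until someone fills it

-- `orElse` tries the first transaction; if it retries, runs the second.
takeEither :: Buf a -> Buf a -> STM a
takeEither b1 b2 = take' b1 `orElse` take' b2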
A very simple web server written in Haskell:
- Full HTTP 1.0 and 1.1 support
- Handles chunked transfer encoding
- Uses sendfile for optimized static file serving
- Allows request bodies and response bodies to be processed in constant space
- Protection against all the basic attack vectors: overlarge request headers and slow-loris attacks

500 lines of Haskell (building on some amazing libraries: bytestring, blaze-builder, iteratee).
A new thread for each user request. Fast, fast!

[Chart: “pong” benchmark, requests/sec]
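This description matches the Warp server, though the talk does not name it here. Assuming that, a minimal application against the wai/warp API (our example; the API shown is today's, not necessarily the one from the time of the talk):

{-# LANGUAGE OverloadedStrings #-}
import Network.Wai (Application, responseLBS)
import Network.Wai.Handler.Warp (run)
import Network.HTTP.Types (status200)

-- A "pong" app in the spirit of the benchmark: every request gets the
-- same constant response. Warp runs each connection on a lightweight
-- thread, exactly the forkIO-per-request pattern shown earlier.
app :: Application
app _request respond =
  respond (responseLBS status200 [("Content-Type", "text/plain")] "PONG")

main :: IO ()
main = run 3000 app   -- serve on port 3000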
Again, lots of threads: 400-600 is typical. A significantly bigger program: 5,000 lines of Haskell – but way smaller than the competition (Erlang: 80,000 loc; not shown: Vuze at 480k lines). Built on STM. Performance: roughly competitive.
So far everything has been shared memory. Distributed memory has a different cost model. Think message passing… think Erlang…