Parallel Computing Basics, Semantics
Landau's 1st Rule of Education

Rubin H. Landau
Sally Haerer, Producer-Director

Based on "A Survey of Computational Physics" by Landau, Páez, & Bordeianu,
with support from the National Science Foundation.
Course: Computational Physics II
Parallel Problems: Basic and Assigned

- Impressive parallel (∥) computing hardware advances
- Beyond ∥ I/O, memory, internal CPU
- ∥: multiple processors working on a single problem
- Software stuck in the 1960s
- Message passing = dominant, yet too elementary
- Need sophisticated compilers (OK for cores)
- Need to understand hybrid programming models
- Problem: parallelize a simple program's parameter space
- Why do it? Faster runs, bigger problems, finer resolutions, different problems
∥ Computation Example: Matrix Multiplication
Need Communication, Synchronization, Math

    [B] = [A][B]
    B_{i,j} = Σ_{k=1}^{N} A_{i,k} B_{k,j}

- Each LHS B_{i,j} requires an entire row of [A] and column of [B]
- RHS B_{k,j} = old values (before the multiplication) ⇒ must communicate [B]
- [B] = [A][B]: data dependency, order matters
- [C] = [A][B]: data parallel
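To make the dependency point concrete, here is a minimal C sketch (not from the slides; the matrix size N = 4 and the function names are made up): computing [C] = [A][B] touches only the unchanged inputs, so the element loops are data parallel, while an in-place [B] = [A][B] must first save the old values of [B].

```c
/* Sketch: data parallel vs data dependent matrix multiplication.
   N and the helper names are illustrative assumptions. */
#include <string.h>

#define N 4

/* Data parallel: every C[i][j] uses only the unchanged A and B,
   so the (i,j) iterations can be distributed freely. */
void mat_mult(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* In-place [B] = [A][B]: each new B[i][j] needs the *old* column of B,
   so the pre-multiplication values must be copied (or communicated)
   before overwriting — order matters. */
void mat_mult_inplace(double A[N][N], double B[N][N]) {
    double old[N][N];
    memcpy(old, B, sizeof old);          /* keep the old values of [B] */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            B[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                B[i][j] += A[i][k] * old[k][j];
        }
}
```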
Parallel Computer Categories
Nodes, Communications, Instructions & Data

- CPU-CPU and memory-memory networks; internal (2) & external communication
- Node = processor location; a node may hold 1-N CPUs
- Single instruction, single data (SISD)
- Single instruction, multiple data (SIMD)
- Multiple instructions, multiple data (MIMD)
- MIMD: message passing, no shared memory (cluster)
- MIMD: difficult to program, expensive

[Figure: cluster layout - I/O node, compute nodes, Gigabit / Fast Ethernet switches, FPGA, JTAG]
Relation to Multitasking
Locations in Memory

- Much ∥ processing already on PCs, Unix
- Multitasking ∼ ∥: independent programs A, B, C, D simultaneously in RAM
- Round-robin processing
- SISD: one job at a time; MIMD: multiple jobs at the same time

[Figure: independent jobs A-D resident simultaneously in memory]
Parallel Categories: Granularity

- Grain = a measure of the computational work done between CPU communications
  (the computation / communication ratio)
- Coarse grain: separate programs on separate computers, e.g. MC on 6 Linux PCs
- Medium grain: several simultaneous processors, parallel subroutines
- Fine grain: compiler-level parallelism, e.g. ∥ for loops (see the sketch below)
- Bus = communication channel

[Figure: jobs A-D sharing memory over a bus]
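To make the fine-grain case concrete, here is a small C sketch using an OpenMP parallel-for pragma; OpenMP is my choice of example here (it is not named on the slide), standing in for any compiler/runtime that splits loop iterations across threads. The array size and the reduction are illustrative.

```c
/* Sketch of fine-grain (loop-level) parallelism: with OpenMP enabled
   (e.g. a -fopenmp compile flag), the iterations of one loop are split
   across threads; without it, the pragma is ignored and the code runs
   serially. */
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double x[N];
    double sum = 0.0;

    /* Fine grain: each iteration is tiny, so the work/communication
       ratio is small and the compiler/runtime does the bookkeeping. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = (double)i;
        sum += x[i];
    }

    printf("sum = %g\n", sum);
    return 0;
}
```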
Distributed Memory ∥ via Commodity PCs
Clusters, Multicomputers, Beowulf, David

- Dominant: coarse-to-medium grain = stand-alone PCs, high-speed switch, messages & network
- Requirement: data in chunks that keep each processor independently busy
- Send data to the nodes, collect results, exchange, ...

[Figure: "Values of Parallel Processing" - mainframe, PC, Beowulf, mini, workstation, vector computer]
Parallel Performance: Amdahl's Law
Simple Accounting of Time

- Clogged ketchup bottle in a cafeteria line: the slowest step determines the overall rate
- Serial parts and communication limit the speedup S_p
- Need ∼ 90% parallel for good speedup
- Need ∼ 100% parallel for massively parallel machines
- Need new problems

[Figure: Amdahl's law - speedup S_p vs the percent of the program that is parallel, curves for p = 2 and p = ∞]
Amdahl's Law Derivation

p = number of CPUs,  T_1 = 1-CPU time,  T_p = p-CPU time

    S_p = maximum parallel speedup = T_1 / T_p → p

- Not achieved in practice: some serial code, data & memory conflicts,
  communication and synchronization of the processors
- f = ∥ (parallel) fraction of the program ⇒

    T_s = (1 - f) T_1                  (serial time)
    T_p = f T_1 / p                    (parallel time)

    Speedup  S_p = T_1 / (T_s + T_p) = 1 / (1 - f + f/p)    (Amdahl's law)
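A quick numerical check of Amdahl's law, S_p = 1 / (1 - f + f/p), as a C sketch; the parallel fractions and processor counts below are chosen only for illustration and are not from the slides.

```c
/* Sketch: evaluate Amdahl's law for a few parallel fractions f and
   processor counts p.  Note how even f = 0.90 caps the speedup near 10
   no matter how many processors are used. */
#include <stdio.h>

double amdahl(double f, int p) {       /* f = parallel fraction, p = #CPUs */
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    double fracs[] = {0.50, 0.90, 0.99};
    int    procs[] = {2, 16, 1024};

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("f = %.2f  p = %4d  ->  S_p = %7.2f\n",
                   fracs[i], procs[j], amdahl(fracs[i], procs[j]));
    return 0;
}
```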
Amdahl's Law + Communication Overhead
Include Communication Time; Simple & Profound

- Latency T_c = time to move data

    S_p ≃ T_1 / (T_1/p + T_c)  <  p

- For the communication time not to matter:

    T_1/p ≫ T_c   ⇒   p ≪ T_1/T_c

- As the number of processors p increases, T_1/p → T_c
- Then more processors ⇒ slower; a faster CPU becomes irrelevant
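The same kind of check for the communication-limited speedup S_p ≃ T_1 / (T_1/p + T_c); the run time T_1 and latency T_c below are made-up numbers, chosen only to show the saturation near T_1/T_c.

```c
/* Sketch: speedup including a fixed communication time T_c.
   T1 and Tc are illustrative assumptions, not measured values. */
#include <stdio.h>

int main(void) {
    double T1 = 100.0;   /* assumed one-CPU run time (seconds)        */
    double Tc = 0.5;     /* assumed communication (latency) time      */

    for (int p = 1; p <= 1024; p *= 4) {
        double Sp = T1 / (T1 / p + Tc);
        printf("p = %4d   S_p = %7.2f   (ideal %4d)\n", p, Sp, p);
    }
    /* As p grows, T1/p approaches Tc and S_p saturates near T1/Tc = 200;
       beyond that, adding processors no longer helps. */
    return 0;
}
```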
How to Actually Parallelize

- User creates tasks; tasks are assigned to processors/threads
- Main routine: master, controller
- Subtasks: parallel subroutines, slaves
- Avoid storage conflicts
- Handle communication and synchronization
- Don't sacrifice science for speed

[Figure: main task program - main routine calls serial subroutine a and parallel subroutines 1-3, followed by a summation task]
Practical Aspects of Message Passing; Don't Do It
More Processors = More Challenge

- Only the most numerically intensive ∥ codes are worth it
- Legacy codes are often Fortran90
- Rewrite (N months) vs modify the serial code (∼ 70%)?
- Steep learning curve, failures, hard debugging
- Preconditions: the program runs often, for days, with little change
- Need higher resolution, more bodies
- The problem itself affects the parallelism: data use, problem structure
  - Perfectly (embarrassingly) parallel: (MC) repeats (see the sketch below)
  - Fully synchronous: data ∥ (MD), tightly coupled
  - Loosely synchronous: (groundwater diffusion)
  - Pipeline parallel: (data → images → animations)
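A sketch of the "perfectly (embarrassingly) parallel" case referenced above: each of several independent workers takes its own slice of a parameter sweep and needs no communication with the others. The parameter range, the work() stand-in, and the 4-node loop are hypothetical; in a real run, myid and numnodes would come from the message-passing library introduced on the next slide.

```c
/* Sketch: embarrassingly parallel parameter-space sweep with a simple
   block decomposition.  work() and the parameter mapping are placeholders. */
#include <stdio.h>

static double work(double param) {     /* stand-in for the real simulation */
    return param * param;
}

static void run_slice(int myid, int numnodes, int ntotal) {
    /* Contiguous block decomposition of ntotal parameter values. */
    int chunk = (ntotal + numnodes - 1) / numnodes;
    int start = myid * chunk;
    int end   = (start + chunk < ntotal) ? start + chunk : ntotal;

    for (int i = start; i < end; i++) {
        double param = 0.01 * i;       /* map index -> parameter value */
        printf("node %d: param %.2f -> %.4f\n", myid, param, work(param));
    }
}

int main(void) {
    /* Pretend there are 4 nodes; no node ever needs another's data. */
    for (int id = 0; id < 4; id++)
        run_slice(id, 4, 10);
    return 0;
}
```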
High-Level View of Message Passing
4 Simple Communication Commands

- Simple basics: C or Fortran plus 4 communication commands
  - send: send a named message to a named processor
  - receive: receive a message from any sender
  - myid: this processor's ID number
  - numnodes: the number of nodes
- (see the sketch below)

[Figure: timeline of a master creating slaves 1 and 2, with compute, send, and receive steps interleaved over time]
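A minimal sketch of the four basics on this slide (send, receive, myid, numnodes), written here with the corresponding standard MPI calls (MPI_Send, MPI_Recv, MPI_Comm_rank, MPI_Comm_size); the master/slave division of labor and the message contents are invented for illustration.

```c
/* Sketch: master collects one number from each slave.
   Compile with an MPI wrapper (e.g. mpicc) and run with 2+ processes. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int myid, numnodes;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);      /* myid: this processor's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &numnodes);  /* numnodes: total processors */

    if (myid == 0) {                           /* master: receive from slaves */
        double result;
        for (int src = 1; src < numnodes; src++) {
            MPI_Recv(&result, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("master received %g from node %d\n", result, src);
        }
    } else {                                   /* slaves: compute, then send */
        double result = 10.0 * myid;           /* stand-in for real work */
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```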
∥ MP: What Can Go Wrong?
Hardware Communication = Problematic

- Task cooperation and division
- Correct division of the data
- Many low-level details
- Distributed error messages
- Messages arriving in the wrong order
- Race conditions: the result depends on the order of events
- Deadlock: processes wait forever (see the sketch below)

[Figure: same master/slave timeline as the previous slide]
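A sketch of the "wait forever" (deadlock) bullet: two ranks each issue a blocking send to the other before posting a receive. Whether a standard-mode MPI_Send actually blocks depends on internal buffering, so the message size below is an assumption chosen to make blocking likely; the fix noted in the comments is one common remedy.

```c
/* Sketch: a classic send-send deadlock between two ranks.
   Run with exactly 2 processes; NBIG is an assumed "too big to buffer" size. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NBIG (1 << 22)

int main(int argc, char *argv[]) {
    int myid, other;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    other = 1 - myid;                          /* the partner rank */

    double *out = calloc(NBIG, sizeof(double));
    double *in  = calloc(NBIG, sizeof(double));

    /* Likely DEADLOCK: both ranks block in MPI_Send, each waiting for a
       matching receive that is never reached.  A fix: reverse the
       send/receive order on one rank, or use MPI_Sendrecv. */
    MPI_Send(out, NBIG, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    MPI_Recv(in,  NBIG, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    printf("rank %d done (never printed if deadlocked)\n", myid);

    free(out);
    free(in);
    MPI_Finalize();
    return 0;
}
```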
Conclude: IBM Blue Gene = ∥ by Committee

- Designed for performance per watt
- Peak = 360 teraflops (10^12 flops)
- On-chip and off-chip memory
- Medium-speed, 2-core CPU at 5.6 Gflops (runs cool)
- 1 core computes, 1 core communicates
- 512 chips/card, 16 cards/board
- 65,536 (2^16) nodes
- Control: MPI