Pipelining (part 1)
Human pipeline: laundry
[Figure: laundry timelines from 11:00 to 14:00 with rows for the washer, dryer, and folding table, showing loads of whites, colors, and sheets moving through the three stations.]
Waste (1)
[Figure: laundry timeline from 11:00 to 14:00 (washer, dryer, folding table) with gaps in the schedule marked "wasted time!".]
Waste (2)
[Figure: laundry timeline from 11:00 to 14:00 (washer, dryer, folding table) for the whites, colors, and sheets loads.]
Latency: Time for One
[Figure: laundry timeline from 11:00 to 14:00 marking how long the colors load takes from start to finish.]
Normal latency: 1.8 h
Pipelined latency: 2.1 h
Throughput: Rate of Many
[Figure: laundry timeline from 11:00 to 14:00 marking the gap between consecutive loads.]
Time between starts: 0.83 h
Time between finishes: 0.83 h
Throughput: 1 load / 0.83 h ≈ 1.2 loads/h
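The two laundry measurements can be checked with a few lines of arithmetic. This is a minimal sketch using the figures from the slides (1.8 h normal latency, 2.1 h pipelined latency, 0.83 h between finishes). The function name and the unpipelined comparison (one load at a time, each taking the normal latency) are assumptions added for illustration.

```python
def throughput_per_hour(time_between_finishes_h: float) -> float:
    """Loads completed per hour once loads finish at a steady interval."""
    return 1.0 / time_between_finishes_h


# Figures taken from the slides.
NORMAL_LATENCY_H = 1.8
PIPELINED_LATENCY_H = 2.1
TIME_BETWEEN_FINISHES_H = 0.83

pipelined_rate = throughput_per_hour(TIME_BETWEEN_FINISHES_H)

# Assumption: without pipelining, a new load starts only after the
# previous one is completely done, so loads finish 1.8 h apart.
unpipelined_rate = throughput_per_hour(NORMAL_LATENCY_H)

print(f"pipelined:   {pipelined_rate:.2f} loads/h, latency {PIPELINED_LATENCY_H} h")
print(f"unpipelined: {unpipelined_rate:.2f} loads/h, latency {NORMAL_LATENCY_H} h")
```

Under these assumptions, pipelining makes each individual load slightly slower but more than doubles the rate at which loads finish.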
times three circuit
A -> add -> A + A = 2 × A -> add -> 2A + A = 3 × A
[Figure: two chained adders with a timeline at 0 ps, 50 ps, and 100 ps; example values 7, 14, 21.]
100 ps latency ⇒ 10 results/ns throughput
times three and repeat
A -> add -> A + A = 2 × A -> add -> 2A + A = 3 × A
[Figure: timeline from 0 ps to 500 ps in 100 ps steps as successive inputs pass through the circuit. Inputs 7, 17, 4, 1, 23 give intermediate values 14, 34, 8, 2, 46 and results 21, 51, 12, 3, 69.]
pipelined times three
[Figure: the two adders separated by registers, so different inputs occupy different stages at once. Labels: A(t + 2) at the input; A(t + 1) and 2 × A(t + 1) between the adders; 3 × A(t + 0) at the output. Example values shown include 9, 17, 34, and 21.]
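The behavior of the pipelined circuit can be sketched as a cycle-by-cycle simulation. This is an illustrative model, not the slides' circuit description: the register names (`in_reg`, `mid_reg`, `out_reg`) and the use of `None` for an empty stage are assumptions. The key point it shows is that every register updates together on each cycle, so up to three different inputs are in flight at once.

```python
def simulate_times_three(inputs):
    """Simulate a two-adder pipeline that computes 3 * A.

    Returns the value in the output register after each cycle
    (None while the pipeline is still filling).
    """
    in_reg = None    # holds A for the first adder
    mid_reg = None   # holds (A, 2 * A) for the second adder
    out_reg = None   # holds the finished 3 * A
    outputs = []

    # Two extra empty cycles let the last inputs drain out.
    for new_a in list(inputs) + [None, None]:
        # Combinational logic reads the current register contents.
        stage1 = None if in_reg is None else (in_reg, in_reg + in_reg)
        stage2 = None if mid_reg is None else mid_reg[1] + mid_reg[0]

        # All registers take their new values at the same time.
        in_reg, mid_reg, out_reg = new_a, stage1, stage2
        outputs.append(out_reg)

    return outputs


print(simulate_times_three([7, 17, 9]))
```

With inputs 7, 17, 9, the third cycle ends with 9 in the input register, (17, 34) in the middle register, and 21 in the output register. After the fill delay, one result appears per cycle.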
register tolerances
[Figure: timing diagram of a register's input and output, marking the window in which the input must not change, the point at which the output changes, and the register delay.]
times three pipeline timing
[Figure: pipelined times-three circuit with labels A(t + 2), A(t + 1), 2 × A(t + 1), 3 × A(t + 0).]
Delays: register 10 ps, add 50 ps, register 10 ps, add 50 ps, register 10 ps
Throughput: 1 / 60 ps ≈ 16.7 G operations/sec
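The throughput figure follows from the time per cycle: one adder delay plus one register delay. A minimal sketch of that arithmetic, using the slide's 50 ps and 10 ps values; the function name and the comparison against the unpipelined 100 ps circuit are added here for illustration.

```python
REGISTER_DELAY_PS = 10   # from the slide
ADD_DELAY_PS = 50        # from the slide


def throughput_gops(cycle_ps: float) -> float:
    """Results per second, in units of 10**9, for one result per cycle.

    One result per picosecond would be 1000 G results/sec.
    """
    return 1000.0 / cycle_ps


# Pipelined: each cycle covers one add plus one register.
pipelined_cycle_ps = ADD_DELAY_PS + REGISTER_DELAY_PS
pipelined = throughput_gops(pipelined_cycle_ps)

# Unpipelined: both adds back to back, no registers (100 ps per result).
unpipelined = throughput_gops(2 * ADD_DELAY_PS)

print(f"pipelined:   {pipelined_cycle_ps} ps/cycle, {pipelined:.2f} G ops/sec")
print(f"unpipelined: {unpipelined:.2f} G ops/sec")
print(f"speedup:     {pipelined / unpipelined:.2f}x")
```

Two stages give less than a 2x gain because the register delay is added to every cycle.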
deeper pipeline
[Figure: each add split into two stages that pass partial results between them. Labels: A(t + 4), A(t + 3), A(t + 2), 2 × A(t + 2), 3 × A(t + 0).]
Delays: 10 ps register, then four stages of 25 ps logic, each followed by a 10 ps register
Throughput: 1 / 35 ps ≈ 28.6 G ops/sec
Problem: Can we even do this?
Problem: How much faster can we get?
diminishing returns: register delays
stages    logic per stage    register    time per cycle
1         100 ps (all)       10 ps       110 ps
2         50 ps              10 ps       60 ps
3         33 ps              10 ps       43 ps
...
100       1 ps               10 ps       11 ps
diminishing returns: register delays
[Plot: time per completion (ps, 0 to 120) against number of stages (up to 14), with a line marking the register delay. Annotations: 1.83x speedup, 1.02x speedup.]
diminishing returns: register delays
[Plot: throughput (ops/ns, 0 to 100) against number of stages (up to 14), with a line marking the max. rate of register updates. Annotations: 1.83x throughput, 1.02x throughput.]
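The shape of both curves can be reproduced numerically. This is a sketch under one simplifying assumption, labeled in the code: the 100 ps of logic divides perfectly evenly across the stages. The function names are illustrative.

```python
TOTAL_LOGIC_PS = 100     # from the slides: logic delay of the whole circuit
REGISTER_DELAY_PS = 10   # from the slides: delay added by each register


def cycle_time_ps(stages: int) -> float:
    """Time per completion, assuming the logic splits evenly (an idealization)."""
    return REGISTER_DELAY_PS + TOTAL_LOGIC_PS / stages


def throughput_ops_per_ns(stages: int) -> float:
    """Completions per nanosecond at one completion per cycle."""
    return 1000.0 / cycle_time_ps(stages)


for n in (1, 2, 3, 4, 8, 14, 100):
    # Speedup gained by adding the n-th stage (relative to n - 1 stages).
    gain = cycle_time_ps(n - 1) / cycle_time_ps(n) if n > 1 else 1.0
    print(f"{n:3d} stages: {cycle_time_ps(n):6.1f} ps/cycle, "
          f"{throughput_ops_per_ns(n):5.1f} ops/ns, "
          f"{gain:.2f}x over {max(n - 1, 1)} stage(s)")
```

Going from one stage to two gives the 1.83x speedup from the plot, while each later stage adds less. Throughput never reaches 100 ops/ns, because the 10 ps register delay is a floor on the cycle time.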
deeper pipeline
exercise: throughput now? (didn’t split second add evenly)
[Figure: four-stage times-three pipeline with partial results between the halves of each add; the second add is split into a 20 ps stage and a 30 ps stage, with 10 ps registers between stages.]
The slowest stage sets the cycle time: 30 ps logic + 10 ps register = 40 ps
Throughput: 1 / 40 ps = 25 G ops/sec
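The exercise can be checked by taking the slowest stage as the cycle time, since every stage advances in lockstep. A minimal sketch: the function name is hypothetical, and the first add is assumed to stay split into two 25 ps halves as in the evenly split pipeline.

```python
def pipeline_throughput_gops(stage_logic_ps, register_delay_ps=10):
    """Throughput in G ops/sec for a pipeline with the given per-stage logic delays.

    All stages advance together, so the cycle must be long enough
    for the slowest stage plus one register delay.
    """
    cycle_ps = max(stage_logic_ps) + register_delay_ps
    return 1000.0 / cycle_ps


# Evenly split: every stage has 25 ps of logic.
even = pipeline_throughput_gops([25, 25, 25, 25])

# Second add split unevenly into 20 ps and 30 ps
# (assumption: the first add keeps its 25 ps / 25 ps split).
uneven = pipeline_throughput_gops([25, 25, 20, 30])

print(f"even split:   {even:.2f} G ops/sec")
print(f"uneven split: {uneven:.2f} G ops/sec")
```

The 20 ps stage finishes early and then idles, so the uneven split gains nothing from its fast stage and loses throughput to its slow one.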