Range Partition
[Figure: range-partitioning example. A table with a key column is split on the key using splitters 16 and 31; each key is compared against the splitters with "<=" and ">" to select its partition.]
Columbia University, Sunday, July 28, 2013
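The range-partition example can be sketched in software. This is a minimal Python sketch, assuming the slide's two splitters (16 and 31) define three partitions with inclusive upper bounds ("<=" on the left, ">" on the right). The function name `range_partition` and the sample key list are illustrative and not taken from the talk.

```python
from bisect import bisect_left


def range_partition(keys, splitters):
    """Assign each key to a partition bounded by sorted splitters.

    Partition i holds keys k with splitters[i-1] < k <= splitters[i];
    the last partition holds keys greater than every splitter.
    """
    partitions = [[] for _ in range(len(splitters) + 1)]
    for key in keys:
        # bisect_left counts the splitters strictly less than key,
        # which is the partition index under the "<=" convention.
        partitions[bisect_left(splitters, key)].append(key)
    return partitions


if __name__ == "__main__":
    # Hypothetical keys; splitters 16 and 31 come from the slide.
    print(range_partition([15, 8, 20, 27, 29, 52], [16, 31]))
```

A binary search per key is a natural software choice; the hardware described later compares against all splitters in a pipeline instead.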
HARP Microarchitecture
[Figure: HARP pipeline from SBin to SBout with three stages: (1) Serializer, (2) Conveyor of "=" and "<" comparators with WE signals, (3) Merge.]
• HARP ISA: set_splitter, partition_start, partition_stop
Step 1: HARP Configuration
[Figure: HARP pipeline diagram with splitter values 10, 20, 30 loaded.]
Step 2: Signal HARP to Start Processing
[Figure: HARP pipeline diagram, splitters 10, 20, 30.]
Step 3: Serialize SBin Cachelines into Records
[Figure: HARP pipeline diagram, splitters 10, 20, 30.]
Step 4: Comparator Conveyor
[Figure: HARP pipeline diagram, splitters 10, 20, 30. A record with key 15 moves along the conveyor and is tagged "15, part2".]
Step 5: Merge Output Records to SBout
[Figure: HARP pipeline diagram, splitters 10, 20, 30.]
Step 6: Drain In-Flight Records and Signal HARP to Stop Processing
[Figure: HARP pipeline diagram, splitters 10, 20, 30.]
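The six HARP steps can be modeled in software. The following is a rough Python sketch, not the hardware design. It assumes that each splitter contributes one "<" and one "=" comparison, so k splitters yield 2k+1 partitions; that reading is consistent with the odd partition counts (15 to 511) in the evaluation slides and with key 15 landing in "part2" for splitters 10, 20, 30, but the talk does not state it outright. The names `serialize`, `conveyor`, and `harp_partition` are hypothetical.

```python
def serialize(cachelines):
    """Step 3: flatten cachelines (lists of records) into a record stream."""
    for line in cachelines:
        yield from line


def conveyor(key, splitters):
    """Step 4: pass the key by each splitter's "<" and "=" comparators."""
    for i, splitter in enumerate(splitters):
        if key < splitter:
            return 2 * i          # strictly below splitter i
        if key == splitter:
            return 2 * i + 1      # equal to splitter i
    return 2 * len(splitters)     # above every splitter


def harp_partition(cachelines, splitters):
    """Steps 1-6 end to end for sorted splitters (assumed 2k+1 partitions)."""
    out = [[] for _ in range(2 * len(splitters) + 1)]   # Step 1: configure
    for key in serialize(cachelines):                    # Steps 2-3: start, serialize
        out[conveyor(key, splitters)].append(key)        # Steps 4-5: compare, merge
    return out                                           # Step 6: drained, stop


if __name__ == "__main__":
    print(conveyor(15, [10, 20, 30]))
```

The model processes one record at a time; the hardware pipelines the comparisons so that several records are in flight at once, which is why Step 6 has to drain them before stopping.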
Streaming Framework Architecture
[Figure: two cores, each with a HARP unit, SBin and SBout stream buffers, and L1 and L2 caches, connected to a memory controller and memory.]
• Software-controlled data streaming in/out
• Inspired by Jouppi's work: "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers." In ISCA, 1990.
Step 1: Issue sbload from Core
[Figure: datapath with Core, HARP, SBin, SBout, Store Buffer, L1, L2, LLC, Req Buffer, and Memory.]
• SB ISA: sbload, sbstore, sbsave, sbrestore
Step 2: Send sbload from Req Buffer to Memory
[Figure: streaming framework datapath; ✗ marks appear at three points in the diagram. Legend: C = Cache, S = SB.]
Step 3: Data Return from Memory to SBin
[Figure: streaming framework datapath with the same three ✗ marks. Legend: C = Cache, S = SB.]
Step 4: HARP Pulls Data from SBin and Pushes Data to SBout
[Figure: streaming framework datapath with the same three ✗ marks. Legend: C = Cache, S = SB.]
Step 5: Issue sbstore from Core
[Figure: streaming framework datapath.]
Step 6: Data Copied from Head of SBout to Store Buffer
[Figure: streaming framework datapath.]
Step 7: Data Written Back to Memory via Existing Store Datapath
[Figure: streaming framework datapath.]
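The seven streaming steps can be traced with a small software model. This is a hedged Python sketch that treats SBin, SBout, and the store buffer as FIFO queues and memory as a dictionary from address to cacheline. The class `StreamingCore`, its method signatures, and the `accelerate` helper are hypothetical; only the instruction names sbload and sbstore come from the slides, and the request buffer and cache hierarchy are not modeled.

```python
from collections import deque


class StreamingCore:
    """Toy model of one core's stream buffers (not the real hardware)."""

    def __init__(self, memory):
        self.memory = memory          # hypothetical: address -> cacheline
        self.sb_in = deque()
        self.sb_out = deque()
        self.store_buffer = deque()

    def sbload(self, addr):
        """Steps 1-3: request a cacheline; the data returns into SBin."""
        self.sb_in.append(self.memory[addr])

    def accelerate(self, fn):
        """Step 4: the accelerator pulls from SBin and pushes to SBout."""
        while self.sb_in:
            self.sb_out.append(fn(self.sb_in.popleft()))

    def sbstore(self, addr):
        """Steps 5-7: copy the head of SBout to the store buffer,
        then write it back to memory through the store path."""
        self.store_buffer.append((addr, self.sb_out.popleft()))
        dst, line = self.store_buffer.popleft()
        self.memory[dst] = line


if __name__ == "__main__":
    mem = {0: [3, 1, 2]}
    core = StreamingCore(mem)
    core.sbload(0)
    core.accelerate(sorted)   # stand-in for HARP's work on a cacheline
    core.sbstore(64)
    print(mem[64])
```

Keeping the accelerator between two queues mirrors the point of the design: software decides which addresses stream in and out, while the accelerator itself never sees an address.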
Interrupts and Context Switches
[Figure: streaming framework datapath carrying an "Architectural" label.]
• SB ISA: sbload, sbstore, sbsave, sbrestore
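One way to picture how stream buffers survive an interrupt or context switch is sketched below in Python. It assumes that sbsave spills a stream buffer's contents to memory and sbrestore refills the buffer from that saved image; the slide lists both instructions and labels state as architectural, but it does not spell out their semantics, so the signatures and behavior here are assumptions.

```python
from collections import deque


def sbsave(stream_buffer, memory, addr):
    """Assumed behavior: spill the buffer to memory and empty it."""
    memory[addr] = list(stream_buffer)
    stream_buffer.clear()


def sbrestore(stream_buffer, memory, addr):
    """Assumed behavior: refill the buffer from a saved image, in order."""
    stream_buffer.clear()
    stream_buffer.extend(memory[addr])


if __name__ == "__main__":
    sb_in = deque(["line0", "line1"])
    memory = {}
    sbsave(sb_in, memory, 0x1000)      # switch out: buffer is now empty
    sb_in.append("other-process")      # another context uses the buffer
    sbrestore(sb_in, memory, 0x1000)   # switch back in
    print(list(sb_in))
```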
Accelerator Integration Choice
• Tightly coupled and software controlled:
  • area/power savings
  • coherence
  • utilize hardware prefetchers
  • software-managed data layout
  • address-free domain for accelerators
Remainder of the Talk
• Brief System Overview
• HARP UArch
• Streaming Framework UArch
• HARP and Streaming Framework Evaluation
• Discussion and DSE
[Figure: streaming framework architecture diagram, repeated from the earlier slide.]
Evaluation Methodology
• HARP
  • Bluespec SystemVerilog implementation
  • Cycle-accurate simulation in Bluesim
  • Synthesis, P&R with Synopsys (32 nm std cells)
• Streaming framework
  • 3 versions of 1 GB table memcpy: c-lib, ASM (scalar), ASM (vector)
  • Conservative area/power estimates with CACTI
Area and Power Overheads
[Figure: two charts for HARP and stream buffers versus number of partitions (15, 31, 63, 127, 255, 511). Top: area as % of a Xeon core, axis 0% to 15%. Bottom: power as % of a Xeon core, axis 0% to 10%.]
SW Partitioning Performance
[Figure: partitioning throughput (GB/s, axis 0 to 8) versus number of partitions (0 to 600) for 1 thread and 16 threads.]
Performance Evaluation
[Figure: partitioning throughput (GB/s, axis 0 to 8) versus number of partitions (0 to 600) for 1 thread, 16 threads, and 1 thread + HARP, annotated with 7.8x and 8.8x.]
Streaming Framework Provides Sufficient BW to Feed HARP?
[Figure: partitioning throughput (GB/s, axis 0 to 7) versus number of partitions (0 to 600) for 1 thread + HARP.]