
Bridging the High-level and Implementation Divide: Mission Impossible?
Victor Konrad
April 2002 (4/25/2002)

Agenda
• Background
• Philosophy
• Experiments in speedup of HLM
• Conclusions
• Disclaimer: view of HLM from the (narrow) vantage point of large general-purpose microprocessors

A somewhat pessimistic report from two projects
• Yosemite (R.I.P.) HLM
  – Project cancelled
• IA-64 project X (HLM R.I.P.)
  – Project still alive, but no HLM
• Information is 1.5 – 2 yrs old
  – But still valid

The Yosemite project HLM
• Yosemite was to be next-generation Itanium
  – Was developed alongside Itanium for several years
  – Extended honeymoon period
• Yosemite HLM
  – Structural iHDL
  – Written by architects with a microarchitectural bent
  – Many thousands of lines of code
  – Clock accurate
  – Intention: match RTL on major signals
• Objectives (and wishful thinking)
  – Architects’ sandbox
  – Mixed-level simulation with RTL (plug-and-play)
  – Checkers for RTL validation developed early
  – FV…

A word about iHDL
• Developed ~15 years ago, in continuous use since
  – Slowly being phased out in favor of Verilog
• Explicitly synchronous, logic-design oriented
  – “Glorified netlist”
  – Some high-level built-in constructs (*, +)
  – Very rich bit-vector manipulation features
    » a[5:2] & b[16:11] & '1::(b-a) := (c[b:a] & '0::9) + $CVN(31);
  – Missing basic high-level capabilities (e.g. user-defined behavioral procedures)
    » De-prioritized due to lack of interest from users
  – Underlying timing paradigm: FSM (no explicit concurrency, threads etc.)
• Very slow simulation speed
  – Itanium ran at ~1 Hz
  – Yosemite HLM very slow for an “architect’s sandbox”
    » Harder to use word-level parallelism, NetBatch & other techniques from validation

Approaches to speedup
• Use C/C++-based modeling (SystemC, CynApps)
  – Was making its first strides when the project was cancelled
  – Showed good speedup, though probably insufficient
• Library of high-level models
  – Encompass ubiquitous structures
  – Use simple simulation paradigm (“Execute this code in 3 cycles”)
  – Tuned for simulation performance
• Classification of library elements
  – Simple logic design structures
  – Elaborate logic design structures
  – Micro-architectural primitives

Simple logic design primitives (compiler enhancements? Macros?)
• Paradigm: c := a+b, c := a*b etc.
  – Language built-ins
  – Software executable model bears no resemblance to the final hardware
• A ubiquitous primitive: “find first”
    In:  x = (0 1 0 0 1 1 0 1 0 0)
    Out: y = (0 0 0 0 0 0 0 1 0 0)
  Quasi-C solution:
    y = 0;
    if (x[0] == '1) y[0] = '1;
    else if (x[1] == '1) y[1] = '1;
    else if (x[2] == '1) y[2] = '1;
    …
  Better: y = (-x) & x
  Proof (using -x = (~x) + 1 in two's complement):
    ~x               = (1 0 1 1 0 0 1 0 1 1)
    (~x) + 1         = (1 0 1 1 0 0 1 1 0 0)
    x AND ((~x) + 1) = (0 1 0 0 1 1 0 1 0 0) AND (1 0 1 1 0 0 1 1 0 0)
                     = (0 0 0 0 0 0 0 1 0 0)

Elaborate logic structures: PLA
• PLA:
    y_1 = f_1(x_1, x_2, …, x_n)
    y_2 = f_2(x_1, x_2, …, x_n)
    …
    y_m = f_m(x_1, x_2, …, x_n),
  where the f_k are sums of products
• Simple-minded approach:
    Equations in iHDL → iHDL compiler → C-code
• A better approach:
    Equations → Espresso → Optimized equations (fewer literals, terms) → iHDL compiler → C-code
• Problem: Espresso is a hardware optimizer, not tuned to improve performance of the software model!
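The “find first” identity above can be checked in software. The following is a minimal sketch in Python, not code from the project: the function names are illustrative, and bit 0 is taken to be the rightmost bit of x, as in the slide's example. It relies on Python integers behaving like two's-complement values under bitwise operators.

```python
def find_first_naive(x: int, width: int) -> int:
    """Scan from bit 0 upward and return a one-hot mask of the first set bit."""
    for i in range(width):
        if (x >> i) & 1:
            return 1 << i
    return 0  # no bit set


def find_first(x: int) -> int:
    """Isolate the lowest set bit with the identity y = (-x) & x."""
    return (-x) & x


x = 0b0100110100                      # the input vector from the slide
print(format(find_first(x), "010b"))  # prints 0000000100
```

The naive version mirrors the if/else chain and costs up to one test per bit, while the identity takes a constant number of word-level operations, which is the kind of word-level parallelism a fast model wants to exploit.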

New algorithm: PLOP
    Equations → Espresso → Optimized equations (fewer literals, terms) → PLOP → Optimized C-code
• At the core of PLOP is a procedure similar to the recursive splitting which occurs in TAUTOLOGY of Espresso
  – Different heuristic for choosing the splitting variable
• Allows flexible memory/speedup tradeoffs
• Heuristic, but a very good one
• Tested on ~40 PLAs from Itanium and Banias: speedups of 3x-20x achieved
• A modification of this algorithm significantly reduces dynamic PLA power: patent pending

Microarchitectural primitives
• CAMs, FIFOs, LIFOs, pipelines
  – Modeled in (more or less) straightforward ways
• TLB
  – Structure found in all microprocessor designs
  – Used for mapping of virtual address to physical address
  – Given a virtual address, determine if the data is contained within a page which is currently in memory
  – Hardware scans a list of pages and determines if the given address is contained in any of them
    » If so, translate; if not found, page fault
  – The hardware scan is in parallel on all entries of the array, but this algorithm, if done in software, is linear in the number of entries in the page table
    » Very time-consuming in simulation: TLB activated for every memory access
• Appears to be another application of hashing, but it is not: determining if an address is within a given page requires a masking operation, and the mask is specific to each page.
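To make the TLB cost concrete, here is a minimal sketch of the straightforward software lookup, assuming power-of-two page sizes and aligned page bases. The `TlbEntry` layout and `translate_linear` are hypothetical names chosen for illustration, not the project's model. Each entry carries its own mask, which is why a single hash of the address does not work.

```python
from typing import List, NamedTuple, Optional


class TlbEntry(NamedTuple):
    vbase: int      # virtual base address, aligned to the page size
    page_bits: int  # log2 of the page size; may differ from entry to entry
    pbase: int      # physical base address of the page


def translate_linear(tlb: List[TlbEntry], vaddr: int) -> Optional[int]:
    """Return the physical address, or None to signal a page fault."""
    for e in tlb:  # software visits entries one by one; hardware checks all at once
        offset_mask = (1 << e.page_bits) - 1
        # The mask depends on the entry, so the comparison key changes per entry.
        if (vaddr & ~offset_mask) == e.vbase:
            return e.pbase | (vaddr & offset_mask)
    return None


tlb = [
    TlbEntry(vbase=0x0040_0000, page_bits=12, pbase=0x1000_0000),  # 4 KB page
    TlbEntry(vbase=0x0080_0000, page_bits=22, pbase=0x2000_0000),  # 4 MB page
]
print(hex(translate_linear(tlb, 0x0081_2345)))  # prints 0x20012345
```

Because this loop runs on every simulated memory access, its cost grows directly with the number of entries.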

Better software algorithms for TLB simulation
• Found and implemented an algorithm which performs the lookup in logarithmic time
• Discovery: the problem is isomorphic to that of fast subnet-based routing
  – A very important problem in networking
• See: “Fast Longest Prefix Matching: Algorithm, Analysis, and Applications”, Marcel Waldvogel, Ph.D. Thesis, ETH Zurich, 2000

When the project was cancelled…
• …we had pretty much convinced ourselves that, speedwise, HLM was doable
  – Combination of C/C++ programming and faster models
• But the other, more important business/methodology issues remain unresolved:
  – Can we make the HLM a clock/signal accurate golden model for RTL?
  – Can we make its modules plug-and-play interchangeable with RTL?
• And, most importantly: is there sufficient return on the investment of building and maintaining this model (and keeping it in synch with the RTL)?
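The routing analogy can be sketched as follows. This is an illustrative reconstruction of the binary-search-over-prefix-lengths idea from the longest-prefix-matching literature, not the algorithm actually implemented in the project, whose details the slides do not give. It assumes 32-bit virtual addresses and power-of-two page sizes; `ADDR_BITS`, `build` and `lookup` are hypothetical names. A page of size 2^k corresponds to a prefix of length 32 - k, and the number of hash probes is logarithmic in the number of distinct page sizes.

```python
from typing import Dict, List, Optional, Tuple

ADDR_BITS = 32               # assumed virtual address width
Page = Tuple[int, int, int]  # (vbase, page_bits, pbase)


def build(pages: List[Page]):
    """Build one hash table per distinct prefix length, plus search markers."""
    lengths = sorted({ADDR_BITS - bits for _, bits, _ in pages})
    tables: List[Dict[int, Optional[Page]]] = [{} for _ in lengths]
    for page in pages:
        vbase, bits, _ = page
        plen = ADDR_BITS - bits
        lo, hi = 0, len(lengths) - 1
        while lo <= hi:  # replay the path a lookup would follow to this page
            mid = (lo + hi) // 2
            key = vbase >> (ADDR_BITS - lengths[mid])
            if lengths[mid] < plen:
                tables[mid].setdefault(key, None)  # marker: try longer prefixes
                lo = mid + 1
            elif lengths[mid] == plen:
                tables[mid][key] = page            # the real entry
                break
            else:
                hi = mid - 1
    # A pure marker remembers the longest shorter page that covers it, if any.
    for i, table in enumerate(tables):
        for key in list(table):
            if table[key] is None:
                covering = [p for p in pages
                            if ADDR_BITS - p[1] < lengths[i]
                            and p[0] >> p[1] == key >> (lengths[i] - (ADDR_BITS - p[1]))]
                if covering:
                    table[key] = max(covering, key=lambda p: ADDR_BITS - p[1])
    return lengths, tables


def lookup(lengths, tables, vaddr: int) -> Optional[int]:
    """Return the physical address, or None to signal a page fault."""
    best, lo, hi = None, 0, len(lengths) - 1
    while lo <= hi:  # binary search over prefix lengths, one hash probe per step
        mid = (lo + hi) // 2
        key = vaddr >> (ADDR_BITS - lengths[mid])
        if key in tables[mid]:
            if tables[mid][key] is not None:
                best = tables[mid][key]
            lo = mid + 1   # hit: a longer prefix may still match
        else:
            hi = mid - 1   # miss: only shorter prefixes can match
    if best is None:
        return None
    _, bits, pbase = best
    return pbase | (vaddr & ((1 << bits) - 1))


pages = [(0x0040_0000, 12, 0x1000_0000), (0x0080_0000, 22, 0x2000_0000),
         (0x1000_0000, 28, 0x4000_0000), (0x0041_0000, 16, 0x3000_0000)]
lengths, tables = build(pages)
print(hex(lookup(lengths, tables, 0x0041_8000)))  # prints 0x30008000
```

The markers are what make the binary search sound: a hit on a marker tells the search that a longer matching prefix may exist, and the page stored with the marker preserves the best shorter match in case it does not.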

HLM lessons from project X (McKinley follow-on)
• Project X RTL based on existing McKinley, changes in certain units
• No high-level model for McKinley
• Question: Is there a sufficient ROI in retrofitting an HLM to existing RTL?
• Answer: no
  – Thus HLM was canned, but not before I had lots of fun reverse-engineering the McKinley RTL and writing a high-level model for it…

Summary
• Each successive generation of Intel microprocessors has toyed with high-level modeling, with limited or no success
  – Hope springs eternal: even as we speak, new experiments are underway
• Missing is the bridge between HLM and RTL
  – No progress unless we have an HLM model which
    » Is orders of magnitude faster than RTL
    » Is cycle- and major-signal-compatible with RTL
    » Has modules that can be plugged seamlessly into an RTL model
• No ROI for proliferations
• High-level synthesis?
  – Likely doable for narrower, special-purpose domains of applications (DSP etc.)
  – Not yet there for microprocessors
