Bespoke Processors for Applications with Ultra-low Area and Power


  1. Bespoke Processors for Applications with Ultra-low Area and Power Constraints by Cherupalli et al. ISCA '17 Jielun Tan, Tim Wesley

  2. Overview ● Motivation ● Intro to Bespoke ● Benchmarks and Results ● Discussion

  3. General Purpose CPUs in ULP Ultra-Low Power applications (IoT, wearables, implantables) typically use small, general purpose microprocessors ● Amortized cost of development ● Most capabilities of these processors are never used by the application ○ Unused gates still drain power and take up area

  4. What about ASICs and FPGAs? ● Both are expensive to develop ● ASICs ○ IPs required for different applications ○ Expensive at small scales ● FPGAs ○ Often larger than needed, to accommodate programmability ○ May still use too much power

  5. Algorithm Usage Examples

  6. Bespoke Processors--Tuning Process ● Bespoke processor design flow: ○ First use traditional module-level removal ○ Next use Input-Independent Gate Activity Analysis ○ Finally, cut-and-stitch the netlist to form the final design

  7. Input-Independent Gate Activity Analysis 1. Load binary into memory 2. Set application inputs to Xs 3. After each cycle is simulated, the toggled gates are marked “keep” 4. If an X propagates to the PC, we have a possible branch a. Explore all possible branch paths, depth-first b. Remember the most conservative state (most Xs) i. If the most conservative state does not cover every gate toggled along the branches, take the union of the gates toggled on each branch c. If a branch is re-encountered i. Skip re-exploration if the current state is a substate of the most conservative state ii. Otherwise, merge the lists of activated gates and make the result the new most conservative state 5. Lists of all gates that are never toggled, along with their constant values, are passed to the cut-and-stitch function
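The core of steps 2, 3, and 5 above can be illustrated with a small three-valued (0/1/X) simulation. This is a minimal sketch under stated assumptions, not the authors' tool: the netlist format, the gate names (`g1`, `g2`, `g3`), and the signal names (`en`, `data`) are hypothetical, the netlist is purely combinational, and branch exploration through the PC (step 4) is left out. A gate is marked "keep" if its output ever changes or is ever unknown.

```python
X = "X"  # unknown value, used for application inputs

def t_and(a, b):
    if a == 0 or b == 0:
        return 0          # a controlling 0 masks any X on the other input
    if a == 1 and b == 1:
        return 1
    return X

def t_or(a, b):
    if a == 1 or b == 1:
        return 1          # a controlling 1 masks any X on the other input
    if a == 0 and b == 0:
        return 0
    return X

def t_not(a):
    return X if a == X else 1 - a

OPS = {"AND": t_and, "OR": t_or, "NOT": t_not}

def simulate(netlist, inputs):
    """Evaluate gates in listed (topological) order; return gate -> value."""
    values = dict(inputs)
    for name, (op, srcs) in netlist.items():
        values[name] = OPS[op](*(values[s] for s in srcs))
    return {g: values[g] for g in netlist}

def activity_analysis(netlist, cycles):
    """Return (gates to keep, never-toggled gates with their constant value)."""
    keep, last = set(), {}
    for inputs in cycles:
        now = simulate(netlist, inputs)
        for gate, v in now.items():
            # An X output may toggle for some input; a changed value did toggle.
            if v == X or (gate in last and last[gate] != v):
                keep.add(gate)
        last = now
    constants = {g: v for g, v in last.items() if g not in keep}
    return keep, constants

# Hypothetical 3-gate netlist: `en` is fixed by the program, `data` is an input.
netlist = {
    "g1": ("AND", ["en", "data"]),
    "g2": ("OR", ["g1", "data"]),
    "g3": ("NOT", ["en"]),
}
cycles = [{"en": 0, "data": X}, {"en": 0, "data": X}]
keep, constants = activity_analysis(netlist, cycles)
print(keep, constants)
```

In this toy case only `g2` depends on the unknown input, so `g1` and `g3` are reported as constants that a later cut-and-stitch pass could remove.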

  8. Cutting and Stitching 1. After X propagation, untoggled gates are removed from the netlist and replaced by a constant voltage 2. Rerun logic synthesis for further optimizations a. Typically, gates that have constant inputs can be reduced to even simpler logic 3. Place and route (this is not any further optimized)
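The first two steps above amount to constant propagation over the netlist. The following is a rough sketch of that idea, assuming a hypothetical netlist of AND/OR/NOT gates listed in topological order; real flows hand this work to a synthesis tool, and the function name `cut_and_stitch` and all gate names here are illustrative only.

```python
def cut_and_stitch(netlist, constants):
    """Drop constant gates and simplify gates fed by constants.

    Returns (reduced netlist, all known constants, aliases), where an alias
    maps a gate that became a plain wire to the signal it forwards.
    """
    const = dict(constants)
    alias, reduced = {}, {}

    def resolve(sig):
        while sig in alias:
            sig = alias[sig]
        return sig

    for name, (op, srcs) in netlist.items():
        if name in const:
            continue  # cut: this gate is tied to its constant value
        srcs = [resolve(s) for s in srcs]
        vals = [const.get(s) for s in srcs]  # None means "not constant"
        if op == "NOT":
            if vals[0] is not None:
                const[name] = 1 - vals[0]
            else:
                reduced[name] = (op, srcs)
            continue
        # AND/OR: a controlling input fixes the output, an identity input drops out.
        ctrl, ident = (0, 1) if op == "AND" else (1, 0)
        if ctrl in vals:
            const[name] = ctrl
            continue
        live = [s for s, v in zip(srcs, vals) if v is None]
        if not live:
            const[name] = ident
        elif len(live) == 1:
            alias[name] = live[0]  # stitch: the gate degenerates to a wire
        else:
            reduced[name] = (op, live)
    return reduced, const, alias

# Hypothetical netlist; g1 and g3 were found to be constant by the analysis.
netlist = {
    "g1": ("AND", ["en", "data"]),
    "g2": ("OR", ["g1", "data"]),
    "g3": ("NOT", ["en"]),
    "g4": ("AND", ["g3", "g2"]),
    "g5": ("OR", ["g4", "other"]),
}
reduced, const, alias = cut_and_stitch(netlist, {"g1": 0, "g3": 1})
print(reduced, alias)
```

Here five gates collapse to one: `g2` and `g4` become wires to `data`, and only `g5` survives as real logic.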

  9. Input-independent Gate Activity Analysis Example

  10. Benchmarks ● Baseline ○ openMSP430 with TSMC 65nm ○ Operating at 1 V and 100 MHz ○ Bare metal simulation or FreeRTOS ○ Either completely general purpose, or traditionally optimized for an application by removing modules ● Each benchmark is then run on a Bespoke processor optimized for that benchmark ○ All unused modules are removed ○ X propagation and cut-and-stitch are performed

  11. Used Gates per Benchmark

  12. Results ● Reduction in gate count, area and power for a bespoke design vs. unmodified baseline

  13. Results ● Reduction in gate count, area, and power in bespoke design vs. module optimized baseline

  14. Results

  15. Multiple Programs ● Multiple programs? ○ Run bespoke tuning process on each and take the union of the results ● Ceiling at 80%... test suite does not activate all gates
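Supporting several programs by taking the union of their results can be sketched as a set operation. The gate names, program names, and gate counts below are made up for illustration and are not data from the paper.

```python
def union_design(per_program_gates):
    """Gates a multi-program bespoke design must keep: the union over programs."""
    kept = set()
    for gates in per_program_gates.values():
        kept |= gates
    return kept

def gate_reduction(kept, total_gates):
    """Fraction of the baseline's gates that can be removed."""
    return 1 - len(kept) / total_gates

# Hypothetical per-program results of the gate activity analysis.
per_program = {
    "prog_a": {"g0", "g1", "g2"},
    "prog_b": {"g2", "g3"},
}
kept = union_design(per_program)
print(sorted(kept), gate_reduction(kept, total_gates=10))
```

The more the programs overlap in the gates they exercise, the smaller the union and the closer the multi-program design stays to a single-program one.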

  16. In-Field Updates ● Bug fixes may need to be deployed, which may change the toggled gates ● Milu mutation testing tool used to emulate changes in the program for future updates ○ Type I: conditional operator changes (AND -> OR) ○ Type II: computation operator mutants (add -> multiply) ○ Type III: loop conditional operator mutants (less than -> less than or equal to)

  17. Coverage for In-Field Updates ● Between 25% and 100% of mutants for each type are covered ● 70% of all mutants of all types are covered ● If mutants are significantly different, then they can be considered as independent programs ● Overhead of between 1% and 40% ● Total area reductions between 23% and 66%, total power reductions between 13% and 53%
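One way to read "covered" is that a mutant only exercises gates the bespoke design already kept. Under that assumption (the slides do not spell out the exact criterion), coverage reduces to a subset check. The mutant names and gate sets below are hypothetical.

```python
def mutant_coverage(design_gates, mutant_gates):
    """Return (sorted names of covered mutants, covered fraction).

    A mutant counts as covered when every gate it exercises is already
    present in the bespoke design (assumed criterion).
    """
    covered = sorted(
        name for name, gates in mutant_gates.items() if gates <= design_gates
    )
    return covered, len(covered) / len(mutant_gates)

# Hypothetical design and mutants; gate sets would come from rerunning the
# gate activity analysis on each mutated binary.
design = {"g0", "g1", "g2", "g3"}
mutants = {
    "type1_and_to_or": {"g0", "g1"},
    "type2_add_to_mul": {"g1", "g4"},  # needs g4, which was removed
    "type3_lt_to_le": {"g2", "g3"},
    "type3_gt_to_ge": {"g0"},
}
covered, fraction = mutant_coverage(design, mutants)
print(covered, fraction)
```

Uncovered mutants such as the hypothetical `type2_add_to_mul` are the ones that would have to be treated as independent programs and folded into the design at some overhead.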

  18. Coverage for In-Field Updates (cont.) ● An instruction that can be executed in one program is not necessarily executable in another program ○ A particular ADD instruction may only use 16 bits out of a 32-bit ALU ● A tailored bespoke processor can support arbitrary software updates by supporting a Turing-complete instruction (e.g. subneg) or a set of them ○ A program written using a Turing-complete instruction can consist solely of that instruction
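To make the subneg point concrete, here is a small interpreter sketch. It assumes the common definition of subneg (subtract, then branch if the result is negative), a separate instruction list, and named memory cells; none of these details come from the slides. The example program performs an addition using only subneg.

```python
def run_subneg(mem, program, max_steps=1000):
    """Run a subneg program.

    Each instruction (a, b, c) does mem[b] -= mem[a], then jumps to
    instruction c if the result is negative, else falls through.
    Execution stops when the PC leaves the program or max_steps is hit.
    """
    pc, steps = 0, 0
    while 0 <= pc < len(program) and steps < max_steps:
        a, b, c = program[pc]
        mem[b] -= mem[a]
        pc = c if mem[b] < 0 else pc + 1
        steps += 1
    return mem

# b = b + a, using a scratch cell z that starts at 0.
# Every jump target is the next instruction, so the branch has no effect.
add_program = [
    ("a", "z", 1),  # z = z - a  -> -a
    ("z", "b", 2),  # b = b - z  -> b + a
    ("z", "z", 3),  # z = z - z  -> 0 (restore scratch cell)
]
mem = run_subneg({"a": 7, "b": 5, "z": 0}, add_program)
print(mem)
```

Because every operation is expressed with the same instruction, a bespoke design that keeps the gates for subneg can in principle run any updated program, at a cost in code size and execution time.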

  19. System Code ● Application analysis of system code for FreeRTOS shows 57% of the gates are never used by the OS ● When benchmarks are evaluated individually with FreeRTOS ○ 37% unused in the worst case ○ 49% unused on average ● Running 15 benchmarks on top of FreeRTOS still shows 27% of gates unused

  20. Generality and Limitations ● Hardware with non-deterministic behavior needs additional techniques to be Bespoke tuned ○ Branch predictors ○ Caches ○ Speculative operations ○ Out-of-order cores ● Xs need to be injected as the results of ○ ...branch predictions ○ ...tag checks ○ ...values where speculation may be used ● Extending the X-prop process to explore data flow graphs may allow analysis of OoO to work

  21. Discussion Points 1. All of the examples they tested are just algorithms such as binary search or FFT. But actual applications, even in IoT and smaller, typically do more than just, e.g., binary search. Do Bespoke tuned processors have any value for real-world programs? 2. Is using Milu and adding mutations representative of what in-field updates would actually change? 3. Can the Bespoke tuning process be used for lowering power consumption of high-performance accelerators? 4. Is Bespoke tuning better or worse for certain cases than technologies such as HLS, Simulate-and-Eliminate, or just making an ASIC design?

  22. Related Works ● High-Level Synthesis ○ Additional development costs ■ New high-level specs of application behavior need to be defined ■ High-level specs also need to be verified ○ C to ASICs is very difficult to do, especially to do efficiently ○ Unlikely to support multiple applications or in-field updates ● Simulate-and-Eliminate ○ Simulates the target application with a user-provided set of inputs on multiple base designs ■ Requires significant user input ■ Only considers high-level, manually identified components ■ Relies on user inputs to determine unused components; the user may forget a test case!
