Execution Time Prediction for Energy-Efficient Hardware Accelerators
Tao Chen, Alex Rucker, and G. Edward Suh
Computer Systems Laboratory, Cornell University
Accelerators in Interactive Computing Systems
• Interactive systems have response time requirements and often use hardware accelerators
• Observation: Finishing earlier than the requirement is usually not needed
• Goal: Perform DVFS for hardware accelerators to save energy while meeting response time requirements
DVFS for Interactive Computing Systems
• Save energy by running slower (lower frequency/voltage)
[Figure: timelines for Job 0 and Job 1, first run at full speed and finishing well before their deadlines, then stretched to just meet the deadlines after predicting each job's execution time and setting the DVFS level]
• Requirement: correctly predict each job's execution time
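A minimal sketch of how a DVFS level could be chosen from a predicted execution time and the deadline (the level table and function name below are illustrative assumptions, not from the paper):

    # Hypothetical sketch: pick the lowest DVFS level whose frequency still
    # finishes the predicted work before the deadline.
    DVFS_LEVELS_MHZ = [200, 400, 600, 800, 1000]  # assumed available frequencies

    def pick_dvfs_level(predicted_cycles: int, deadline_s: float) -> int:
        """Return the lowest frequency (MHz) that meets the deadline."""
        for f_mhz in DVFS_LEVELS_MHZ:
            runtime_s = predicted_cycles / (f_mhz * 1e6)
            if runtime_s <= deadline_s:
                return f_mhz
        return DVFS_LEVELS_MHZ[-1]  # fall back to the highest level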
Opportunity and Challenge
[Figure: execution time distribution of a video decoding accelerator, with the deadline marked]
• Opportunity: Most jobs finish earlier than the deadline
• Challenge: Irregular variations in job execution time
Conventional DVFS Controllers
• History-based execution time prediction
  • Example: PID controller
• Problem with history-based prediction
  • Reactive: decisions lag behind changes in job behavior
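For contrast, an illustrative sketch of a history-based predictor (not from the paper), which averages recent jobs and therefore reacts only after behavior has already changed:

    # Simple history-based predictor: estimates the next job's time from past
    # jobs, so it lags behind abrupt changes in the workload.
    from collections import deque

    class HistoryPredictor:
        def __init__(self, window: int = 8):
            self.history = deque(maxlen=window)

        def predict(self) -> float:
            # Predict the next execution time as the recent average.
            return sum(self.history) / len(self.history) if self.history else 0.0

        def observe(self, actual_time: float) -> None:
            self.history.append(actual_time)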
Predictive DVFS Framework for Accelerators
• Approach: Build predictor hardware for each accelerator that uses the job's input data to predict its execution time
• Design time: Build the predictor and train the prediction model
  • Identify features related to execution time
  • Generate a hardware slice that can compute the features quickly
  • Train a prediction model that maps features to execution time
• Run time: Run the predictor to inform DVFS decisions (sketched below)
[Flow diagram: Job Input → Hardware Slice → Job Features → Execution Time Model → DVFS Model → DVFS Level → Job Execution Time]
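A minimal sketch of the run-time flow above, with hypothetical stand-ins for the generated hardware slice and trained models:

    # Each stage (feature extraction, execution time model, DVFS model) is a
    # placeholder callable; in the real system these are hardware and trained models.
    def run_job(job_input, extract_features, time_model, dvfs_model, accelerator):
        features = extract_features(job_input)      # hardware slice
        predicted_cycles = time_model(features)     # execution time model
        level = dvfs_model(predicted_cycles)        # DVFS model -> DVFS level
        return accelerator.run(job_input, dvfs_level=level)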
Features to Capture Execution Time Variation
• Source of variation: input-dependent control decisions
[Figure: FSM with initial state S1; Job 1 follows the state sequence S1 S2 S4 S1 S2 S4 S1 S2 S3, Job 2 follows S1 S2 S4 S1 S3 S4 S1 S4]
• Feature: state transition count
  TC = [tc_{1,2}, tc_{1,3}, tc_{2,4}, tc_{3,4}, tc_{4,1}], where tc_{i,j} is the number of transitions from state Si to state Sj
  Job 1: [2, 0, 2, 0, 2]   Job 2: [1, 1, 1, 1, 2]
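An illustrative sketch of computing the state-transition-count feature from a recorded state trace (the actual design computes this in hardware via the generated slice; plain Python here for clarity):

    from collections import Counter

    def transition_counts(trace, tracked_pairs):
        """trace: list of state names; tracked_pairs: transitions used as features."""
        counts = Counter(zip(trace, trace[1:]))
        return [counts[pair] for pair in tracked_pairs]

    # Example (Job 2 from the slide):
    job2 = ["S1", "S2", "S4", "S1", "S3", "S4", "S1", "S4"]
    pairs = [("S1", "S2"), ("S1", "S3"), ("S2", "S4"), ("S3", "S4"), ("S4", "S1")]
    print(transition_counts(job2, pairs))  # -> [1, 1, 1, 1, 2]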
Features to Capture Execution Time Variation
• Variable state latency
[Figure: FSM with states S1-S4 where one state uses a down-counter; for Job 3 the FSM loops through S1 S3 S4, and on each init the counter counts down 4 3 2 1 0, then 2 1 0, until done]
• Feature: counter average initial value
  AIV = [aiv_c], where aiv_c is the average value loaded into counter c at each init
  Job 3: aiv = (4 + 2) / 2 = 3
• Other counter features in the paper
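A correspondingly small sketch of the counter average-initial-value feature, assuming the slice records each value loaded into the counter:

    def average_initial_value(init_values):
        # Average of the values loaded into a counter at each init event.
        return sum(init_values) / len(init_values) if init_values else 0.0

    print(average_initial_value([4, 2]))  # Job 3 from the slide -> 3.0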
Identifying and Extracting Features
• Automated flow based on RTL analysis
  • Identify FSM and counter features in the RTL
  • Instrument the RTL to extract the features
• More details in the paper
[Flow diagram: Job Input → Hardware Slice → Job Features → Execution Time Model → DVFS Model → DVFS Level → Job Execution Time]
Hardware Slicing
• Need to obtain the features before running the accelerator
• Create a minimal version of the accelerator
  • Program slicing on the accelerator RTL code
[Figure: accelerator logic (control unit + datapath) reduced to a hardware slice]
• Optimize the hardware slice to run fast
[Figure: the slice's FSM and counter replay the state sequence S1 S3 S4 S1 S3 S4 S1 and counter values 4 3 2 1 0, 2 1 0 ahead of the accelerator]
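A conceptual sketch of backward program slicing over a signal dependence graph (the paper's flow operates on RTL; the dependence-graph representation and signal names here are illustrative assumptions):

    # Starting from the signals that drive the features (FSM state, counter
    # loads), keep only the logic they transitively depend on.
    def backward_slice(deps, criteria):
        """deps: dict mapping each signal to the signals it depends on."""
        keep, stack = set(), list(criteria)
        while stack:
            sig = stack.pop()
            if sig in keep:
                continue
            keep.add(sig)
            stack.extend(deps.get(sig, []))
        return keep

    # Example: slice for the FSM state register in a toy dependence graph.
    deps = {"state": ["state", "start", "pixel_valid"], "pixel_valid": ["input"]}
    print(backward_slice(deps, ["state"]))  # signals kept in the hardware slice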
Execution Time Prediction Model
• Linear model: ŷ = X b, where X holds the job features, b the model coefficients, and ŷ the predicted execution time
• Train the model using convex optimization
  • Reduce the number of features
  • Prioritize meeting deadlines over saving energy
[Flow diagram: Job Input → Hardware Slice → Job Features → Execution Time Model → DVFS Model → DVFS Level → Job Execution Time]
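A hedged sketch of how such a model could be trained with convex optimization, assuming the cvxpy library: an asymmetric loss penalizes under-prediction (which risks a deadline miss) more than over-prediction, and an L1 term shrinks unneeded feature coefficients toward zero. The penalty weights and function name are illustrative, not the paper's.

    import cvxpy as cp
    import numpy as np

    def train_time_model(X: np.ndarray, y: np.ndarray, under_penalty=10.0, l1=0.01):
        b = cp.Variable(X.shape[1])
        err = y - X @ b                      # positive err => under-prediction
        loss = cp.sum(under_penalty * cp.pos(err) + cp.pos(-err))
        cp.Problem(cp.Minimize(loss + l1 * cp.norm1(b))).solve()
        return b.value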
Evaluation Methodology
• Vertically integrated evaluation methodology
  • Circuit-level simulation: obtain the voltage-frequency relationship
  • Gate-level modeling: obtain area, power, and energy numbers
  • Register-transfer-level simulation: obtain execution time
• Benchmark accelerators

  Name     Description
  h264     Video decoding
  cjpeg    Image encoding
  djpeg    Image decoding
  aes      Cryptography
  sha      Cryptography
  md       Molecular dynamics
  stencil  Image processing

• Deadline: 16.7 ms
Results: Energy and Deadline Misses
• 36.7% energy savings on average
• 0.4% deadline misses
Results: Overheads of the Slice-Based Predictor
• 5.1% area overhead
• 1.5% energy overhead
• 3.5% execution time overhead
More Evaluation Results in the Paper
• More detailed experimental results
  • Prediction accuracy analysis
  • Results with predictor overheads removed
  • Sensitivity study on varying deadlines
• Platform extensions
  • DVFS with voltage boosting
  • Results for FPGA-based accelerators
  • Results for accelerators generated by HLS
Summary
• Observation: Finishing faster than the deadline is not needed
• Goal: DVFS for accelerators with response time requirements
• Solution: Prediction-based DVFS
  • Execution time depends on input-dependent control decisions
  • Hardware features can capture these control decisions
  • Proposed a framework that generates predictors automatically
• Results: Highly accurate DVFS for accelerators (36.7% average energy savings with 0.4% deadline misses)
Questions?
Execution Time Prediction for Energy-Efficient Hardware Accelerators
Tao Chen, Alex Rucker, and G. Edward Suh
Computer Systems Laboratory, Cornell University