Load Value Approximation
Joshua San Miguel, Mario Badr, Natalie Enright Jerger
Accessing Memory
Memory hierarchy, from the core outward: processor core, L1 cache, shared caches / directory / network-on-chip, main memory.
On an L1 cache miss, accessing memory costs 10x – 100x more latency and energy than accessing the L1 cache!
Higher efficiency via approximate computing…
Approximate Computing
Not all computations need to be precise.
• Data mining
• Computer vision
• Audio and video processing
• Gaming
• Machine learning
• Dynamical simulation
Image sources: http://www.zentut.com/, http://www.cc.gatech.edu/~cnieto6/, http://themusicparlour.blogspot.ca/, http://www.businessweek.com/, http://www.analyticbridge.com/, http://www.scientific-computing.com/
Approximate Computing
[Figure: execution time, energy, and error.]
Approximate Computing
Many applications can tolerate approximate data.
40% to nearly 100% of the data footprint is approximate [Sampson, MICRO 2013].
Approximate value locality: many data values are similar to, or can be approximated from, previously seen values.
Outline
• Load Value Approximation
• Non-Speculative Operation
• Approximator Design
• Relaxed Confidence Windows
• Approximation Degree
• Methodology
• Evaluation
Load Value Approximation
Memory hierarchy, from the core outward: processor core, L1 cache with approximator, shared caches / directory / network-on-chip, main memory.
• Load A misses in the L1 cache.
• The approximator generates A_approx.
• The processor core continues with A_approx. No speculation, no rollbacks.
• A_actual is fetched from the shared caches or main memory.
• The approximator is trained with A_actual.
The approximator learns past values and estimates future values. This improves performance and saves energy.
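The load-miss flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `LastValueApproximator` is a hypothetical stand-in that predicts the last value it was trained with (not the approximator design presented later), `memory` is a plain dictionary standing in for the rest of the hierarchy, and the fetch is modeled as an ordinary lookup rather than as a request that completes later.

```python
from typing import Dict, Optional


class LastValueApproximator:
    """Hypothetical toy approximator: predicts the last trained value per PC."""

    def __init__(self) -> None:
        self.last: Dict[int, float] = {}

    def approximate(self, pc: int) -> Optional[float]:
        # Returns None when there is no history for this load instruction.
        return self.last.get(pc)

    def train(self, pc: int, actual: float) -> None:
        self.last[pc] = actual


def on_load_miss(pc: int, addr: int, memory: Dict[int, float],
                 approximator: LastValueApproximator) -> float:
    """Return the value the core continues with after an L1 miss."""
    approx = approximator.approximate(pc)   # generate A_approx
    actual = memory[addr]                   # fetch A_actual (off the critical path in hardware)
    approximator.train(pc, actual)          # train with A_actual
    # No speculation, no rollbacks: A_approx is kept even when it differs
    # from A_actual. Without history, fall back to the precise value.
    return actual if approx is None else approx
```

Note that the second miss of the same load instruction returns the previously learned value, not the value actually stored at the new address; the mismatch is never corrected, which is the point of the non-speculative design.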
Approximator Design
• Approximator table: each entry holds a tag, conf, degree and a local history buffer (LHB).
• Global history buffer (GHB): recent load values, e.g. 1.0, 2.2, 3.1.
• Hash function h: maps the instruction address (PC) and the GHB to a table entry, e.g. PC ⊕ 1.0 ⊕ 2.2 ⊕ 3.1.
• Function g: computes the approximation from the entry's LHB, e.g. for LHB = 4.1, 3.9, 4.0: A_approx = (4.1 + 3.9 + 4.0) / 3 = 4.0.
Example timeline:
• load miss A: generate A_approx = 4.0
• do_work(A_approx) while request(A_actual) is in flight
• A_actual = 4.2 arrives: the GHB becomes 2.2, 3.1, 4.2 and the LHB becomes 3.9, 4.0, 4.2
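The lookup and update in this example can be sketched as follows. Several details are assumptions made only for illustration: values are XORed as their 32-bit IEEE-754 bit patterns, the hash is reduced modulo a hypothetical `table_size`, the table is a dictionary without tags, and the conf and degree fields are left out. The averaging function and the history length of 3 follow the slide's example.

```python
import struct
from collections import deque
from typing import Deque, Dict, Optional


def float_bits(x: float) -> int:
    """32-bit IEEE-754 bit pattern of x (assumed encoding for the XOR hash)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


class Approximator:
    def __init__(self, history: int = 3, table_size: int = 256) -> None:
        self.history = history
        self.table_size = table_size              # hypothetical size
        self.ghb: Deque[float] = deque(maxlen=history)
        self.table: Dict[int, Deque[float]] = {}  # index -> LHB

    def index(self, pc: int) -> int:
        """h: XOR the PC with the GHB values, e.g. PC ^ 1.0 ^ 2.2 ^ 3.1."""
        h = pc
        for value in self.ghb:
            h ^= float_bits(value)
        return h % self.table_size

    def approximate(self, idx: int) -> Optional[float]:
        """g: average of the entry's LHB, or None if the entry is empty."""
        lhb = self.table.get(idx)
        if not lhb:
            return None
        return sum(lhb) / len(lhb)

    def train(self, idx: int, actual: float) -> None:
        """Shift A_actual into the entry's LHB and into the GHB."""
        lhb = self.table.setdefault(idx, deque(maxlen=self.history))
        lhb.append(actual)
        self.ghb.append(actual)
```

The index is computed once at miss time and reused for training, so that A_actual lands in the same entry that produced A_approx even though the GHB has shifted in the meantime.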
Approximator Design – Other Considerations
• Floating-point precision
• History buffer sizes
• Stale values
More details in paper.
Approximator Design
• Relaxed Confidence Windows: How do we avoid making bad approximations? Trade off performance and error.
• Approximation Degree: Do we need to fetch the actual value from memory every time? Trade off energy and error.
Relaxed Confidence Windows
Example timeline (GHB = 1.0, 2.2, 3.1; LHB = 4.1, 3.9, 4.0):
• load miss A: A_approx = 4.0
• do_work(A_approx) while request(A_actual) is in flight
• A_actual = 9.0! The approximation was far off.
Relaxed Confidence Windows
Each table entry: tag, conf, degree, LHB.
When approximating:
    if conf >= 0: use A_approx
    else: don't use A_approx
When updating:
    if A_approx and A_actual differ by <= CONF_WINDOW%: conf++
    else: conf--
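A minimal sketch of the confidence rule on this slide, assuming that "differ by <= CONF_WINDOW%" means the relative difference with respect to A_actual, and that conf is a saturating counter. The bounds `lo` and `hi` are hypothetical and do not come from the slides.

```python
def within_window(approx: float, actual: float, conf_window_pct: float) -> bool:
    """True if approx is within conf_window_pct percent of actual."""
    if conf_window_pct == float("inf"):
        return True                      # infinite window: always accept
    if actual == 0:
        return approx == 0               # assumption: no relative error at zero
    return abs(approx - actual) <= abs(actual) * conf_window_pct / 100.0


def should_use(conf: int) -> bool:
    """When approximating: use A_approx only if conf >= 0."""
    return conf >= 0


def update_conf(conf: int, approx: float, actual: float,
                conf_window_pct: float, lo: int = -8, hi: int = 7) -> int:
    """When updating: conf++ inside the window, conf-- outside (saturating)."""
    if within_window(approx, actual, conf_window_pct):
        return min(conf + 1, hi)
    return max(conf - 1, lo)
```

With a 10% window, the earlier example (A_approx = 4.0, A_actual = 4.2) raises conf, while A_actual = 9.0 lowers it below zero, so the next miss to that entry would not be approximated.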
Relaxed Confidence Windows – Output Error
[Figure: output error (0% to 100%) for CONF_WINDOW% = 0%, 5%, 10%, 20%, infinite.]
Relaxed Confidence Windows – L1-D MPKI
[Figure: normalized L1-D MPKI (0.0 to 1.0) versus CONF_WINDOW% = 0%, 5%, 10%, 20%, infinite.]
Approximation Degree
Example timeline (GHB = 1.0, 2.2, 3.1; LHB = 4.1, 3.9, 4.0):
• load miss A: A_approx = 4.0
• do_work(A_approx) while request(A_actual) is in flight
• A_actual = 4.0, the same as A_approx.
Approximation Degree
Each table entry: tag, conf, degree, LHB.
When approximating:
    if degree == APPROX_DEGREE: fetch A_actual
    else: don't fetch A_actual
When updating:
    if degree == APPROX_DEGREE: degree = 0
    else: degree++
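The degree counter on this slide can be sketched as below. The sketch assumes a single counter that starts at 0 and folds the "approximating" and "updating" steps into one call per miss; `DegreeCounter` and `fetch_fraction` are hypothetical names used only for this illustration.

```python
class DegreeCounter:
    """Decides, per miss, whether A_actual is fetched."""

    def __init__(self, approx_degree: int) -> None:
        self.approx_degree = approx_degree
        self.degree = 0                  # assumed initial value

    def on_miss(self) -> bool:
        """Return True if this miss should fetch A_actual."""
        fetch = self.degree == self.approx_degree
        if fetch:
            self.degree = 0              # reset after fetching
        else:
            self.degree += 1             # skip the fetch, count the reuse
        return fetch


def fetch_fraction(approx_degree: int, misses: int) -> float:
    """Fraction of misses that trigger a fetch."""
    counter = DegreeCounter(approx_degree)
    fetches = sum(counter.on_miss() for _ in range(misses))
    return fetches / misses
```

Under these assumptions, APPROX_DEGREE = 0 fetches on every miss, and APPROX_DEGREE = N fetches on one miss out of every N + 1.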
Approximation Degree – Output Error
[Figure: output error (0% to 100%) for APPROX_DEGREE = 0, 1, 2, 4, 8, 16.]
Approximation Degree – L1-D Fetches
[Figure: normalized L1-D fetches (0 to 1) versus APPROX_DEGREE = 0, 1, 2, 4, 8, 16.]
Methodology
• Multi-threaded approximate applications: PARSEC benchmark suite [Bienia, Princeton 2011]; programmer annotations and ISA extensions [Esmaeilzadeh, ASPLOS 2012]
• Approximator design space exploration: Pin dynamic binary instrumentation tool [Luk, PLDI 2005]
• Full-system simulation: FeS2 cycle-level x86 simulator [Neelakantam, ASPLOS 2008]
• Approximator, cache and memory energy consumption: CACTI modeling tool [Thoziyoor, HP 2008]
Evaluation
[Figure: application speedup and energy savings (0% to 16%) for APPROX_DEGREE = 0, 4, 16.]
Up to 28% speedup.