CMSC5743 L05: Quantization (Bei Yu, Fall 2020)
  1. CMSC5743 L05: Quantization. Bei Yu (Latest update: October 12, 2020). Fall 2020.

  2. Overview
     ◮ Fixed-Point Representation
     ◮ Non-differentiable Quantization
     ◮ Differentiable Quantization
     ◮ Reading List

  3. Overview: Fixed-Point Representation, Non-differentiable Quantization, Differentiable Quantization, Reading List

  4. These slides contain/adapt materials developed by
     ◮ Hardware for Machine Learning, Shao, Spring 2020 @ UCB
     ◮ 8-bit Inference with TensorRT
     ◮ Junru Wu et al. (2018). “Deep k-Means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions”. In: Proc. ICML
     ◮ Shijin Zhang et al. (2016). “Cambricon-X: An accelerator for sparse neural networks”. In: Proc. MICRO. IEEE, pp. 1–12
     ◮ Jorge Albericio et al. (2016). “Cnvlutin: Ineffectual-neuron-free deep neural network computing”. In: ACM SIGARCH Computer Architecture News 44.3, pp. 1–13

  5. Scientific Notation: Decimal Representation
     Example: 6.02 x 10^23 (mantissa 6.02, radix/base 10, exponent 23, with a decimal point)
     • Normalized form: no leading 0s (exactly one digit to the left of the decimal point)
     • Alternatives for representing 1/1,000,000,000:
       • Normalized: 1.0 x 10^-9
       • Not normalized: 0.1 x 10^-8, 10.0 x 10^-10

  6. Scientific Notation: Binary Representation
     Example: 1.01_two x 2^-1 (mantissa 1.01, radix/base 2, exponent -1, with a "binary point")
     • Computer arithmetic that supports this is called floating point, because it represents numbers where the binary point is not fixed, as it is for integers
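As a quick numeric check of the slide's example (my arithmetic, not from the slides): 1.01 in binary is 1.25 in decimal, so 1.01_two x 2^-1 = 0.625.

```python
# Expand the binary mantissa of the slide's example 1.01_two x 2^-1 by hand.
mantissa = 1 + 0 * 2**-1 + 1 * 2**-2   # 1.01 in base 2 -> 1.25 in decimal
print(mantissa * 2**-1)                # 0.625
```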

  7. Normalized Form
     ◮ Floating-point numbers can have multiple forms, e.g.
       0.232 x 10^4 = 2.32 x 10^3 = 23.2 x 10^2 = 2320. x 10^0 = 232000. x 10^-2
     ◮ It is desirable for each number to have a unique representation => Normalized Form
     ◮ We normalize mantissas to the range [1..R), where R is the base, e.g.:
       ◮ [1..2) for BINARY
       ◮ [1..10) for DECIMAL

  8. Floating-Point Representation
     • Normal format: +1.xxx...x_two * 2^(yyy...y_two)
     • Bit layout (bits 31..0): S (1 bit, bit 31) | Exponent (8 bits, bits 30-23) | Significand (23 bits, bits 22-0)
     • S represents the sign
     • Exponent represents the y's
     • Significand represents the x's
     • Represents numbers as small as 2.0 x 10^-38 and as large as 2.0 x 10^38

  9. Floating-Point Representation (FP32)
     • IEEE 754 Floating Point Standard
     • Called Biased Notation, where the bias is the number subtracted from the Exponent field to get the real exponent
     • IEEE 754 uses a bias of 127 for single precision: subtract 127 from the Exponent field to get the actual exponent
     • 1023 is the bias for double precision
     • Summary (single precision, or fp32):
       Bit layout (bits 31..0): S (1 bit) | Exponent (8 bits) | Significand (23 bits)
       Value = (-1)^S x (1 + Significand) x 2^(Exponent - 127)
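To make the biased notation concrete, here is a minimal Python sketch (not from the slides; the helper name decode_fp32 is my own) that decodes a normalized fp32 bit pattern with (-1)^S x (1 + Significand) x 2^(Exponent - 127) and cross-checks it against the machine's own interpretation via struct. Denormals, infinities, and NaNs are ignored.

```python
import struct

def decode_fp32(bits: int) -> float:
    """Decode a normalized IEEE 754 single-precision bit pattern by hand."""
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF      # 8-bit biased exponent
    fraction = bits & 0x7FFFFF          # 23-bit significand field
    return (-1) ** sign * (1 + fraction / 2**23) * 2 ** (exponent - 127)

pattern = 0x3F800000                    # should decode to 1.0
assert decode_fp32(pattern) == struct.unpack('>f', pattern.to_bytes(4, 'big'))[0]
print(decode_fp32(pattern))             # 1.0
```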

  10. Floating-Point Representation (FP16)
      • IEEE 754 Floating Point Standard
      • Called Biased Notation, where the bias is the number subtracted from the Exponent field to get the real exponent
      • IEEE 754 uses a bias of 15 for half precision: subtract 15 from the Exponent field to get the actual exponent
      • Summary (half precision, or fp16):
        Bit layout (bits 15..0): S (1 bit, bit 15) | Exponent (5 bits, bits 14-10) | Significand (10 bits, bits 9-0)
        Value = (-1)^S x (1 + Significand) x 2^(Exponent - 15)
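The same decoding with a bias of 15; again a sketch of mine for normalized values only, cross-checked against NumPy's float16 type.

```python
import numpy as np

def decode_fp16(bits: int) -> float:
    """Decode a normalized IEEE 754 half-precision pattern: (-1)^S * (1 + frac) * 2^(E - 15)."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F      # 5-bit biased exponent
    fraction = bits & 0x3FF             # 10-bit significand field
    return (-1) ** sign * (1 + fraction / 2**10) * 2 ** (exponent - 15)

pattern = 0x3C00                        # should decode to 1.0
print(decode_fp16(pattern))                                      # 1.0
print(np.array([pattern], dtype=np.uint16).view(np.float16)[0])  # cross-check: 1.0
```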

  11. Question: What is the IEEE single-precision number 40C0 0000 (hex) in decimal?
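A worked answer (my arithmetic; easy to verify): 40C0 0000 in hex is 0 10000001 10000000000000000000000 in binary, so S = 0, the biased exponent is 129, and the significand field is 0.5, giving (1 + 0.5) x 2^(129 - 127) = 6.0.

```python
import struct

# S = 0, Exponent = 1000_0001 = 129, Significand = 0.5:
# value = (1 + 0.5) * 2**(129 - 127) = 6.0
print(struct.unpack('>f', bytes.fromhex('40C00000'))[0])   # 6.0
```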

  12. Question: What is -0.5 (decimal) in IEEE single-precision binary floating-point format?
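A worked answer (mine): -0.5 = (-1)^1 x 1.0 x 2^-1, so S = 1, Exponent = -1 + 127 = 126 = 0111 1110, and the significand field is all zeros, i.e. the pattern 1 01111110 00000000000000000000000 = BF00 0000 in hex.

```python
import struct

# -0.5 = (-1)**1 * 1.0 * 2**(-1): S = 1, Exponent = 126, Significand = 0
print(struct.pack('>f', -0.5).hex())   # 'bf000000'
```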

  13. Fixed-Point Arithmetic
      • Integers with a binary point and a bias
      • “Slope and bias”: y = s*x + z
      • Qm.n: m = # of integer bits, n = # of fractional bits
      • 3-bit codes under four slope/bias settings (bit weights 2^2 2^1 2^0, 2^0 2^-1 2^-2, 2^4 2^3 2^2, and 2^2 2^1 2^0, respectively):

        Bits | s=1, z=0 | s=1/4, z=0 | s=4, z=0 | s=1.5, z=10
        000  |    0     |     0      |     0    | 1.5*0 + 10 = 10
        001  |    1     |    1/4     |     4    | 1.5*1 + 10 = 11.5
        010  |    2     |    2/4     |     8    | 1.5*2 + 10 = 13
        011  |    3     |    3/4     |    12    | 1.5*3 + 10 = 14.5
        100  |    4     |     1      |    16    | 1.5*4 + 10 = 16
        101  |    5     |    5/4     |    20    | 1.5*5 + 10 = 17.5
        110  |    6     |    6/4     |    24    | 1.5*6 + 10 = 19
        111  |    7     |    7/4     |    28    | 1.5*7 + 10 = 20.5
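A minimal sketch (mine, not from the slides) of the slope-and-bias mapping y = s*x + z for 3-bit codes; it reproduces the last column of the table above and shows the corresponding rounding step when quantizing a real value.

```python
def quantize(y: float, s: float, z: float, n_bits: int = 3) -> int:
    """Map a real value y to the nearest stored code x with y ~= s*x + z (slide notation)."""
    x = round((y - z) / s)
    return max(0, min(2 ** n_bits - 1, x))       # clamp to the unsigned code range

def dequantize(x: int, s: float, z: float) -> float:
    """Real value represented by the stored code x."""
    return s * x + z

# Reproduce the s = 1.5, z = 10 column for the 3-bit codes 000 .. 111.
print([dequantize(x, 1.5, 10) for x in range(8)])   # [10.0, 11.5, 13.0, ..., 20.5]
print(quantize(14.6, 1.5, 10))                      # 3  (nearest code: 1.5*3 + 10 = 14.5)
```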

  14. Hardware Implications: Multipliers
      [Figure: a fixed-point multiplier vs. a floating-point multiplier]

  15. Overview: Fixed-Point Representation, Non-differentiable Quantization, Differentiable Quantization, Reading List

  16. Greedy Layer-wise Quantization [1]
      Quantization flow
      ◮ A fixed-point number is represented as
        n = Σ_{i=0}^{bw-1} B_i · 2^(-fl) · 2^i,
        where bw is the bit width and fl is the fractional length, which is dynamic for different layers and feature map sets while static within one layer.
      ◮ Weight quantization: find the optimal fl for the weights:
        fl = argmin_fl Σ |W_float − W(bw, fl)|,
        where W is a weight and W(bw, fl) is the fixed-point format of W under the given bw and fl.

      [1] Jiantao Qiu et al. (2016). “Going deeper with embedded FPGA platform for convolutional neural network”. In: Proc. FPGA, pp. 26–35.
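A minimal NumPy sketch (mine) of the layer-wise search described above: exhaustively try candidate fractional lengths and keep the one that minimizes the L1 error between the floating-point weights and their fixed-point version. The function names and the candidate range are assumptions, not part of the paper.

```python
import numpy as np

def to_fixed_point(x, bw, fl):
    """Round x onto a signed bw-bit fixed-point grid with fl fractional bits."""
    step = 2.0 ** (-fl)
    q = np.clip(np.round(x / step), -2 ** (bw - 1), 2 ** (bw - 1) - 1)
    return q * step

def best_fl(w, bw=8, fl_candidates=range(-8, 16)):
    """Greedy layer-wise choice: the fl minimizing sum |W_float - W(bw, fl)|."""
    errors = {fl: np.abs(w - to_fixed_point(w, bw, fl)).sum() for fl in fl_candidates}
    return min(errors, key=errors.get)

w = np.random.randn(64, 3, 3, 3) * 0.1      # a toy conv-weight tensor
fl = best_fl(w, bw=8)
print(fl, np.abs(w - to_fixed_point(w, 8, fl)).sum())
```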

  17. Greedy Layer-wise Quantization
      Quantization flow (continued)
      ◮ Feature (data) quantization: find the optimal fl for the features of each layer:
        fl = argmin_fl Σ |x+_float − x+(bw, fl)|,
        where x+ represents the result of a layer when we denote the computation of a layer as x+ = A · x.
      [Figure: quantization flow. Starting from a floating-point CNN model and input images, a weight quantization phase performs weight dynamic range analysis and produces the weight quantization configuration; a data quantization phase then analyzes the dynamic range of the feature maps of Layer 1 ... Layer N and finds the optimal quantization strategy, yielding the weight and data quantization configuration and a fixed-point CNN model.]
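The same search applied to a layer's output x+ = A · x on a calibration batch, which is what the data quantization phase in the flow above does layer by layer. This is a toy sketch (mine): a random matrix and random inputs stand in for a real layer and real calibration images.

```python
import numpy as np

def to_fixed_point(x, bw, fl):
    """Round x onto a signed bw-bit fixed-point grid with fl fractional bits."""
    step = 2.0 ** (-fl)
    return np.clip(np.round(x / step), -2 ** (bw - 1), 2 ** (bw - 1) - 1) * step

A = np.random.randn(128, 256) * 0.05        # one layer, x+ = A . x (slide notation)
x = np.random.randn(256, 32)                # a batch of calibration inputs
x_plus = A @ x                              # floating-point layer output
fl = min(range(-8, 16),
         key=lambda f: np.abs(x_plus - to_fixed_point(x_plus, 8, f)).sum())
print("chosen fractional length:", fl)
```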

  18. Dynamic-Precision Data Quantization Results
      [Results table: not legible in this transcript.]

  19. Industrial Implementations: Nvidia TensorRT
      No-Saturation Quantization (INT8 Inference)
      [Figure: "No saturation: map |max| to 127"; the range -|max| .. +|max| is mapped linearly onto -127 .. 127]
      ● Significant accuracy loss, in general
      ◮ Map the maximum value to 127, with uniform step length.
      ◮ Suffers from outliers.
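A small sketch (mine) of the no-saturation scheme: calibrate the scale so that |max| maps to 127 with uniform steps. The toy example also shows why outliers hurt: a single large value stretches the scale, so almost all remaining values collapse into a handful of int8 codes.

```python
import numpy as np

def quantize_int8_max(x):
    """No-saturation calibration: map |max| of the tensor to 127 with uniform steps."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(10_000).astype(np.float32)
x[0] = 80.0                                  # one outlier stretches the scale
q, scale = quantize_int8_max(x)
print(scale, np.unique(q).size)              # large scale, only a few distinct int8 codes used
```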

  20. Industrial Implementations: Nvidia TensorRT
      Saturation Quantization (INT8 Inference)
      [Figure: "Saturate above |threshold| to 127"; values in -|T| .. +|T| map linearly onto -127 .. 127, and values beyond |T| are clipped to ±127]
      ● Weights: no accuracy improvement
      ● Activations: improved accuracy
      ● Which |threshold| is optimal?
      ◮ Set a threshold |T| and treat it as the maximum value (saturate everything above it).
      ◮ Divide the value domain into 2048 groups (histogram bins).
      ◮ Traverse all the possible thresholds to find the best one with minimum KL divergence.
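A rough, unoptimized sketch (mine) of the entropy-calibration idea: build a 2048-bin histogram of |activations|, and for each candidate threshold compare the clipped reference distribution against a 128-level re-binned version of it, keeping the threshold with the smallest KL divergence. This simplifies TensorRT's actual calibrator; every name below is an assumption.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(P || Q) between two histograms (normalized inside)."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def entropy_calibrate(x, n_bins=2048, n_levels=128):
    """Pick the saturation threshold |T| with minimum KL divergence (simplified sketch)."""
    hist, edges = np.histogram(np.abs(x), bins=n_bins)
    best_kl, best_t = np.inf, edges[-1]
    for i in range(n_levels, n_bins + 1):     # candidate threshold = edges[i]
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()               # fold the clipped tail into the last bin
        q = np.zeros(i)                       # re-bin the first i bins into n_levels levels
        for idx in np.array_split(np.arange(i), n_levels):
            occupied = hist[idx] > 0
            if occupied.any():                # spread each level's mass over its occupied bins
                q[idx[occupied]] = hist[idx].sum() / occupied.sum()
        kl = kl_div(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t

x = np.concatenate([np.random.randn(100_000), [30.0]])   # activations with one outlier
print(entropy_calibrate(x))                               # typically well below |max| = 30
```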
