New Circuit Minimization Techniques for Smaller and Faster AES SBoxes


  1. New Circuit Minimization Techniques for Smaller and Faster AES SBoxes
     Alexander Maximov and Patrik Ekdahl, Ericsson Research
     Patrik Ekdahl, Ericsson Research, 2019-08-26
     Ericsson Internal | 2018-02-21

  2. Preliminaries
     [Figure: AES round function datapath. Labels: 128-bit Plaintext, Roundkey 1, Mux, SubBytes, ShiftRows, MixColumns, Roundkey n, Registers, 128-bit Ciphertext]
     • SubBytes is the only non-linear part
     • 16 8x8 SBoxes needed for a full implementation
     • Forward only or combined SBox
     • In ASICs: look-up table or gate implementation
     What to remember:
     - New improved methods for circuit minimization.
     - New SBox architecture which improves the critical path.

  3. Preliminaries: basic flow of the AES SBox
     [Figure: Input U → Inversion in GF(2^8) → Affine transformation (linear part xM, constant +b) → Output R]
     Direct implementation of inversion over the Rijndael field is very complex.
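
The two-stage flow on this slide (inversion in GF(2^8), then the affine map) can be modeled in a few lines of Python. This is a minimal reference sketch, not the gate-level circuit the talk is about. The helper names `gf_mul`, `gf_inv`, `rotl8`, and `sbox` are illustrative, and the field polynomial 0x11B and affine constant 0x63 are the standard AES parameters rather than values read off the slide.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Inverse computed as a^254; this maps 0 to 0, as the SBox requires."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, r: int) -> int:
    """Rotate an 8-bit value left by r positions."""
    return ((x << r) | (x >> (8 - r))) & 0xFF


def sbox(u: int) -> int:
    """Inversion followed by the affine transformation (linear part, then +b)."""
    x = gf_inv(u)
    linear = x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4)
    return linear ^ 0x63


print(hex(sbox(0x53)))
```

The 254 field multiplications hidden in `gf_inv` are what make a direct hardware implementation expensive, which motivates the composite-field constructions on the following slides.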

  4. Previous work (low area)
     Rijmen [Rij00] proposed (based on Itoh and Tsujii [IT88]) to use a composite field and do the inversion in GF(2^4) instead.
     [Figure: 8-bit Input U → base conversion matrix → two 4-bit halves, with ( )^2 v, multiplier X, and ( )^-1 blocks forming the inversion over GF(2^4) → X^-1 → base back-conversion matrix, xM, +b → 8-bit Output R]
     - Satoh et al. [SMT01] reduced inversion to GF(2^2).
     - Canright [Can05] investigated the importance of subfield representation.

  5. Previous work (low depth)
     Boyar, Peralta et al. ([BP10a, BP10b, BP12, BFP18]) used a normal basis $A = a_0 Y + a_1 Y^{16}$ and $A^{-1} = (A A^{16})^{-1} A^{16}$ (also based on Itoh and Tsujii [IT88]) to derive another implementation.
     [Figure: 8-bit Input U → ( )^17 → 4-bit ( )^-1 → multiplier X → X^-1 → xM, +b → 8-bit Output R]
     Several papers followed:
     - Nogami et al. [NNT+10], looking at mixed bases.
     - Ueno et al. [UHS+15], looking at redundant bases.
     - Reyhani et al. [RMTA18a,b], improving the Boyar-Peralta (BP) search algorithm.
     - Li et al. [LSL+19], incorporating depth into the BP algorithm.
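
The identity $A^{-1} = (A A^{16})^{-1} A^{16}$ works because $A A^{16} = A^{17}$ always lies in the subfield GF(2^4), where inversion is cheap. The sketch below checks both facts numerically. It uses the standard AES polynomial basis (0x11B) rather than the normal basis of the slide, which is an assumption made for brevity; the identity itself does not depend on the basis. All function names are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_pow(a: int, e: int) -> int:
    """Square-and-multiply exponentiation in GF(2^8)."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result


def inv_itoh_tsujii(a: int) -> int:
    """A^-1 = (A * A^16)^-1 * A^16, with the inner inverse taken in GF(2^4)."""
    a16 = gf_pow(a, 16)
    norm = gf_mul(a, a16)          # A^17, an element of the subfield GF(2^4)
    norm_inv = gf_pow(norm, 14)    # z^15 = 1 in GF(2^4)*, so z^-1 = z^14
    return gf_mul(norm_inv, a16)


# A^17 is fixed by z -> z^16, i.e. it lies in GF(2^4).
in_subfield = all(gf_pow(gf_pow(a, 17), 16) == gf_pow(a, 17) for a in range(256))
# The composed expression really is the inverse for every nonzero A.
is_inverse = all(gf_mul(a, inv_itoh_tsujii(a)) == 1 for a in range(1, 256))
print(in_subfield, is_inverse)
```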

  6. Previous work (low depth), continued
     Same content as the previous slide, with one added note on the figure: collect all linear terms and push them into two matrices.

  7. Architectural starting point [BP12]
     [Figure: 8-bit Input U → Top linear (base conversion and generation of linear parts of multiplications) → 22-bit Q → Mul-Sum → 4-bit X → Inverse GF(2^4) → 4-bit Y → 2 x Mul → 18-bit N → Bottom linear (base back-conversion and the affine transformation of the AES SBox) → 8-bit Output R]
     Basic problem statement: given a binary matrix $M_{m \times n}$ and the maximum allowed depth maxD, find the circuit of depth D ≤ maxD with the minimum number of 2-input XOR gates such that it computes $Y = M \cdot X$.
     Example:
     $$M = \begin{pmatrix} 1 & 0 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 1 & 1 & 0 & 1 & 1 \end{pmatrix}, \qquad \begin{aligned} y_0 &= x_0 + x_2 + x_3 + x_4 \\ y_1 &= x_1 + x_2 + x_4 \\ y_2 &= x_0 + x_1 + x_3 + x_4 \end{aligned}$$
     • Additional Input Requirement (AIR): input signals may arrive with different delays $d_i$.
     • Additional Output Requirement (AOR): output signals may need to be ready earlier, $e_i \le \mathit{maxD}$.
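
To make the problem statement concrete, the sketch below evaluates $Y = M \cdot X$ over GF(2) for the 3x5 example matrix and reports the cost of the naive circuit, in which every output is built from its own XOR tree with no sharing. The function names are illustrative, and the depth figure assumes all inputs arrive at time 0 (no AIR).

```python
from math import ceil, log2

# Example matrix from the slide: one row per output y_i, one column per input x_j.
M = [
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
]


def apply_matrix(matrix, x):
    """Compute Y = M * X over GF(2); x is a list of input bits."""
    return [sum(m & b for m, b in zip(row, x)) % 2 for row in matrix]


def naive_xor_count(matrix):
    """XOR gates needed when each row is computed independently."""
    return sum(sum(row) - 1 for row in matrix if sum(row) > 0)


def naive_depth(matrix):
    """Depth of a balanced XOR tree for the heaviest row."""
    return max(ceil(log2(sum(row))) for row in matrix if sum(row) > 0)


print(apply_matrix(M, [1, 1, 1, 1, 1]), naive_xor_count(M), naive_depth(M))
```

The naive count is the baseline a minimizer tries to beat. Here, for instance, the term $x_0 + x_3 + x_4$ appears in both $y_0$ and $y_2$, so sharing it saves gates.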

  8. Our contributions
     - New techniques for minimizing the Top and Bottom matrices (area with delay constraints).
     - Introduced a probabilistic heuristic approach to the cancellation-free algorithm by Paar [Paa97].
     - New cancellation-allowed exhaustive search algorithm, based on the BP algorithm [BP10a].
     - Floating multiplexers for the combined SBox.
     - New generalization of the BP algorithm, allowing other types of gates.
     - New metrics, with lots of speed-up tricks for the distance function.
     - Stack algorithm with a search tree.
     - New architecture that removes the Bottom matrix and reduces the overall depth.
     - New circuit for the inverse operation.
     - Additional transformation matrices.

  9. Combined SBox with multiplexers
     [Figure: Input U feeds Top Forward and Top Inverse; a Mux selects between them for the Common part; the Common part feeds Bottom Forward and Bottom Inverse; a second Mux selects Output R]

  10. Combined SBox with multiplexers
      [Figure: Input U (X) feeds Top Forward and Top Inverse, giving $Y^F$ and $Y^I$; a Mux selects Y for the Common part; Bottom Forward and Bottom Inverse feed a second Mux that drives Output R]
      Example: $Y^F = X_0 + X_1$, $Y^I = X_0 + X_2$, so
      $Y = \mathrm{MUX}(\text{select}, X_0 + X_1, X_0 + X_2)$
      Replace with:
      $Y = \mathrm{MUX}(\text{select}, X_1, X_2) + X_0$
      Generally:
      $Y = A + \mathrm{MUX}(\text{select}, B, C) \rightarrow Y = (A + \Delta) + \mathrm{MUX}(\text{select}, B + \Delta, C + \Delta)$
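
The floating-multiplexer rewrite is an identity over GF(2): adding the same term Δ to both MUX data inputs and to the term outside the MUX changes nothing, because the two copies of Δ cancel. The sketch below checks this exhaustively for single bits. The `mux` polarity (select = 1 picks the first data input) is an assumption, since the slide does not state it; the identity holds for either polarity.

```python
from itertools import product


def mux(select: int, a: int, b: int) -> int:
    """Return a when select is 1, otherwise b (assumed polarity)."""
    return a if select else b


def fixed(select, a, b, c):
    """Y = A + MUX(select, B, C)."""
    return a ^ mux(select, b, c)


def floated(select, a, b, c, delta):
    """Y = (A + delta) + MUX(select, B + delta, C + delta)."""
    return (a ^ delta) ^ mux(select, b ^ delta, c ^ delta)


identity_holds = all(
    fixed(s, a, b, c) == floated(s, a, b, c, d)
    for s, a, b, c, d in product((0, 1), repeat=5)
)

# The slide's example is the special case A = 0, B = X0 + X1, C = X0 + X2, delta = X0.
example_holds = all(
    mux(s, x0 ^ x1, x0 ^ x2) == mux(s, x1, x2) ^ x0
    for s, x0, x1, x2 in product((0, 1), repeat=4)
)
print(identity_holds, example_holds)
```

In the example, floating the MUX replaces two XOR gates in front of the multiplexer with one XOR gate behind it.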

  11. Boyar-Peralta algorithm [BP10a]
      - Notion of a "point". In the original algorithm, this is a linear combination of input signals. Set of gates used: G = {XOR}.
      - Base set of known points S: $S_0 = \{x_0, x_1, \ldots, x_4\} = ([1,0,0,0,0], [0,1,0,0,0], \ldots, [0,0,0,0,1])$.
      - Set of target points T, the rows $y_i$ of M:
        $$\begin{pmatrix} y_0 \\ y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 1 & 1 & 0 & 1 & 1 \end{pmatrix}$$
      - Metric using a distance function $\delta_i(S, y_i)$, collected in $\Delta = (\delta_0, \delta_1, \ldots, \delta_{m-1})$.
      - Set of candidates C.
        • Try all base pairs $s_i, s_j$ in $S_r$ and form a candidate $c = g(s_i, s_j)$, in this case $c = s_i + s_j$.
        • Calculate the new distance vector $\Delta$ based on $S_r \cup c$.
        • Save the candidate c that gives the lowest distance: $S_{r+1} = S_r \cup c$.
        • Repeat until the distance vector is all-zero.
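
The loop described on this slide can be sketched as a small greedy search. This is a simplified illustration, not the authors' implementation: points are bitmasks over the inputs, the distance is found by brute force, the candidate minimizing the sum of distances is taken with ties broken by first occurrence, and depth limits (maxD, AIR, AOR) are ignored. The names `distance` and `bp_greedy` are hypothetical.

```python
from functools import reduce
from itertools import combinations
from operator import xor


def distance(base, target):
    """Fewest XOR gates needed to build target from points in base (brute force)."""
    for k in range(1, len(base) + 1):
        for subset in combinations(base, k):
            if reduce(xor, subset) == target:
                return k - 1
    raise ValueError("target is not in the span of the base")


def bp_greedy(n_inputs, targets):
    """Greedy BP-style search; returns gates as (new_point, operand_a, operand_b)."""
    base = [1 << k for k in range(n_inputs)]      # S_0: the unit vectors x_0..x_{n-1}
    gates = []
    dist = [distance(base, t) for t in targets]   # initial distance vector
    while any(dist):
        best = None
        for a, b in combinations(base, 2):        # try all base pairs
            c = a ^ b                             # candidate c = s_i + s_j
            if c in base:
                continue
            new_dist = [distance(base + [c], t) for t in targets]
            if best is None or sum(new_dist) < sum(best[1]):
                best = (c, new_dist, a, b)
        c, dist, a, b = best                      # keep the best candidate
        base.append(c)                            # S_{r+1} = S_r with c added
        gates.append((c, a, b))
    return gates


# Rows of the example matrix as bitmasks (bit k stands for x_k).
targets = [0b11101, 0b10110, 0b11011]
gates = bp_greedy(5, targets)
for c, a, b in gates:
    print(f"{c:05b} = {a:05b} + {b:05b}")
```

Each step lowers the summed distance by at least one, so the search never uses more gates than the naive circuit (8 for this matrix).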

  12. BP for Linear Circuits with Floating Multiplexers
      - Include MUX and NMUX in the set of gates. The six gates: MUX(v,w), MUX(w,v), NMUX(v,w), NMUX(w,v), XOR(v,w), XNOR(v,w).
      - A point is now a tuple $p = (F, I)$.
      - F and I are linear combinations of input signals.
      - Translated into $\mathrm{MUX}(ZF, F \cdot X, I \cdot X)$.
      - Input points $X_k = (2^k, 2^k)$, $k = 0, \ldots, n-1$.
      - Target points $Y_k = (Y_k^F, Y_k^I)$, $k = 0, \ldots, m-1$.
      - Improved metrics and new algorithm (with lots of speed-up) to calculate $\delta_i(S, y_i \mid \mathit{Dmax})$.
      - We keep track of AIR and AOR at each stage.
      - For the full affine transformation, define the point as $p = (f, F, i, I) \rightarrow \mathrm{MUX}(ZF, F \cdot X + f, I \cdot X + i)$.
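
One way to read the tuple representation is sketched below: a point $(F, I)$ stores the forward and inverse linear combinations as bitmasks, and a gate combines two points component-wise. The combination rules for XOR and MUX are an interpretation of the slide, not something it states explicitly, and the polarity (ZF = 1 selects the forward part) is likewise an assumption. NMUX and XNOR are left out because complements need the affine four-tuple form.

```python
def parity(x: int) -> int:
    """Sum of the bits of x modulo 2."""
    return bin(x).count("1") & 1


def evaluate(point, x: int, zf: int) -> int:
    """Value of MUX(ZF, F*X, I*X) for input bitmask x."""
    f, i = point
    return parity((f if zf else i) & x)


def xor_gate(v, w):
    """XOR(v, w): both components are added."""
    return (v[0] ^ w[0], v[1] ^ w[1])


def mux_gate(v, w):
    """MUX(ZF, v, w): forward part comes from v, inverse part from w."""
    return (v[0], w[1])


def input_point(k: int):
    """Input signal x_k looks the same in forward and inverse mode."""
    return (1 << k, 1 << k)


# Target with forward part x0 + x1 and inverse part x0 + x2, built with one MUX and one XOR.
target = xor_gate(input_point(0), mux_gate(input_point(1), input_point(2)))
print(target)
```

With this encoding a floating multiplexer is just another candidate gate in the search, at the cost of tracking two bitmasks per point instead of one.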

  13. BP for any Nonlinear Circuit
      - Allow all kinds of gates in G (XOR, AND, MUX, ...; 2-input, 3-input, ...).
      - A point is now the truth table of a Boolean function.
      - Combine points using truth tables and gate functionality.
      - Target points are the truth tables for every output signal of the nonlinear block.
      - Applicable to circuits of at most 4-5 input signals; the number of output signals is not limited.
      - Used to derive a smaller inversion circuit over GF(2^4).
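
For four input signals a truth table fits in a 16-bit integer, so combining two points with a gate is a single bitwise operation. The sketch below shows that encoding and one candidate-generation step. The row ordering (row t holds inputs $x_k$ = bit k of t), the gate selection, and the helper names are all assumptions made for illustration.

```python
from itertools import combinations

MASK = 0xFFFF  # 16 truth-table rows for 4 input signals

# Truth tables of the inputs: bit t of X[k] equals (t >> k) & 1.
X = [0xAAAA, 0xCCCC, 0xF0F0, 0xFF00]

# Two-input gates acting on whole truth tables at once.
GATES = {
    "AND": lambda a, b: a & b,
    "XOR": lambda a, b: a ^ b,
    "NAND": lambda a, b: ~(a & b) & MASK,
    "NOR": lambda a, b: ~(a | b) & MASK,
    "XNOR": lambda a, b: ~(a ^ b) & MASK,
}


def mux(s: int, a: int, b: int) -> int:
    """Three-input MUX on truth tables: a where s is 1, b elsewhere."""
    return (s & a) | (~s & b & MASK)


def table_of(fn) -> int:
    """Reference truth table built row by row from a function of four bits."""
    table = 0
    for t in range(16):
        bits = [(t >> k) & 1 for k in range(4)]
        table |= (fn(*bits) & 1) << t
    return table


def candidates(points):
    """All new truth tables reachable from the known points with one 2-input gate."""
    known = set(points)
    found = set()
    for a, b in combinations(points, 2):
        for gate in GATES.values():
            c = gate(a, b)
            if c not in known:
                found.add(c)
    return found


print(len(candidates(X)))
```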

  14. Search Tree
      [Figure: search tree growing from state $S_r$ through $S_{r+1}$, $S_{r+2}$, $S_{r+3}$ to leaves $S_{r+TD}$. Labels: 20-50 children, ~400 total children]
      - Try to keep leaves from as many different branches as possible

  15. Search Tree
      [Figure: the same search tree from $S_r$ to leaves $S_{r+TD}$, with the tree depth labeled TD]
      - Try to keep leaves from as many different branches as possible

  16. New architecture for lower depths
      [Figure, Architecture A: 8-bit Input U → Top linear → 18-bit Q → Mul-Sum → 4-bit X → Inverse GF(2^4) → 4-bit Y → 2xMul (with 18-bit Q) → 18-bit N → Bottom linear → 8-bit output R]
      The Bottom matrix only depends on the multiplication of the 4-bit signal Y with some linear combination of the input signal U:
      $$R = Y_0 \cdot M_0 \cdot U + \cdots + Y_3 \cdot M_3 \cdot U$$
      where $M_i$ is an 8x8 matrix to be scalar multiplied by the $Y_i$ bit. Calculate the $M_i \cdot U$ products in parallel in the Top matrix. Assembling requires 56 gates (32 NAND, 24 XOR).
      [Figure, Architecture D: 32-bit L and 4-bit Y → 32 nand2 + 8 xor4 → 8-bit output R]
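
The assembling step can use NAND gates in place of AND gates because each output bit is the XOR of exactly four products, and an even number of inversions cancels. The sketch below checks that the 32 NAND2 + 8 XOR4 structure reproduces $R = \sum_i Y_i \cdot (M_i \cdot U)$. The matrices here are random placeholders, since the actual $M_i$ are not given on the slide, and all names are illustrative.

```python
import random

rng = random.Random(0)
# Four hypothetical 8x8 binary matrices, each stored as 8 row bitmasks.
MATS = [[rng.getrandbits(8) for _ in range(8)] for _ in range(4)]


def matvec(rows, u: int) -> int:
    """8-bit product M * U over GF(2); bit j of the result uses row j."""
    return sum((bin(row & u).count("1") & 1) << j for j, row in enumerate(rows))


def assemble_direct(parts, y: int) -> int:
    """R as the XOR of the parts L_i whose selector bit Y_i is 1."""
    r = 0
    for i in range(4):
        if (y >> i) & 1:
            r ^= parts[i]
    return r


def assemble_gates(parts, y: int) -> int:
    """Same result from 32 NAND2 gates feeding 8 XOR4 gates."""
    r = 0
    for j in range(8):
        nands = [1 ^ (((y >> i) & 1) & ((parts[i] >> j) & 1)) for i in range(4)]
        r |= (nands[0] ^ nands[1] ^ nands[2] ^ nands[3]) << j
    return r


def sbox_bottom(u: int, y: int) -> int:
    """The 32-bit signal L is the four products M_i * U; Y then selects among them."""
    parts = [matvec(m, u) for m in MATS]
    return assemble_gates(parts, y)


mismatches = sum(
    assemble_direct([matvec(m, u) for m in MATS], y) != sbox_bottom(u, y)
    for u in range(256)
    for y in range(16)
)
print(mismatches)
```

Counting each XOR4 as three 2-input XORs gives the 32 + 24 = 56 gates quoted on the slide.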

  17. New circuit for the inversion in GF(2^4)
      $$\begin{aligned} Y_0 &= X_1 X_2 X_3 + X_0 X_2 + X_1 X_2 + X_2 + X_3 \\ Y_1 &= X_0 X_2 X_3 + X_0 X_2 + X_1 X_2 + X_1 X_3 + X_3 \\ Y_2 &= X_0 X_1 X_3 + X_0 X_2 + X_0 X_3 + X_0 + X_1 \\ Y_3 &= X_0 X_1 X_2 + X_0 X_2 + X_0 X_3 + X_1 X_3 + X_1 \end{aligned}$$
      - In [BP12] they found a circuit of 17 gates and depth 4 (with base gates {AND, XOR}).
      - By applying the BP algorithm for general non-linear circuits, we managed to achieve 9 gates and depth 3:
          T0 = NAND(X0, X2)
          T1 = NOR(X1, X3)
          T2 = XNOR(T0, T1)
          T3 = MUX(X1, X2, 1)
          T4 = MUX(X3, X0, 1)
          Y0 = MUX(X2, T2, X3)
          Y1 = MUX(T2, X3, T3)
          Y2 = MUX(X0, T2, X1)
          Y3 = MUX(T2, X1, T4)
      We also found a small conventional (no MUXes) circuit of 15 gates and depth 3.
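
The 9-gate circuit can be checked against the four output equations by trying all 16 inputs. The sketch below does that, assuming the convention MUX(s, a, b) = a when s = 1 and b otherwise. The slide does not spell this convention out; the circuit matches the equations under it, which supports that reading.

```python
from itertools import product


def mux(s: int, a: int, b: int) -> int:
    """Assumed convention: returns a when s is 1, otherwise b."""
    return a if s else b


def inv_equations(x0, x1, x2, x3):
    """The four output equations of the GF(2^4) inversion (AND is &, + is ^)."""
    y0 = (x1 & x2 & x3) ^ (x0 & x2) ^ (x1 & x2) ^ x2 ^ x3
    y1 = (x0 & x2 & x3) ^ (x0 & x2) ^ (x1 & x2) ^ (x1 & x3) ^ x3
    y2 = (x0 & x1 & x3) ^ (x0 & x2) ^ (x0 & x3) ^ x0 ^ x1
    y3 = (x0 & x1 & x2) ^ (x0 & x2) ^ (x0 & x3) ^ (x1 & x3) ^ x1
    return y0, y1, y2, y3


def inv_circuit(x0, x1, x2, x3):
    """The 9-gate, depth-3 circuit from the slide."""
    t0 = 1 ^ (x0 & x2)      # NAND
    t1 = 1 ^ (x1 | x3)      # NOR
    t2 = 1 ^ t0 ^ t1        # XNOR
    t3 = mux(x1, x2, 1)
    t4 = mux(x3, x0, 1)
    y0 = mux(x2, t2, x3)
    y1 = mux(t2, x3, t3)
    y2 = mux(x0, t2, x1)
    y3 = mux(t2, x1, t4)
    return y0, y1, y2, y3


circuit_matches = all(
    inv_circuit(*bits) == inv_equations(*bits)
    for bits in product((0, 1), repeat=4)
)
print(circuit_matches)
```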
