Intel’s New AES Instructions: Enhanced Performance and Security
Shay Gueron
- Intel Corporation, Israel Development Center, Haifa, Israel
- University of Haifa, Israel

Overview
• AES basics
• Performance hungry applications
• The security issue
• The AES instructions
• Performance scalability
• Basic usage
• Software flexibility
• Software tools
• Performance and optimizations
• More on software flexibility
• And more…

AES Basics

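AES Overview
[Figure: AES block diagram. Plain Text enters, 10-14 “Rounds” follow, and Cipher Text leaves. Each round combines SubByte (Sbox), Shift Row, Mix Columns, and Add Round Key with the round key. Two regions of the diagram are labeled “Fast Software Encryption” and “Slow Software Encryption”.]
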
AES Transformations
• AddRoundKey: 128b xor of State and round key
• SubBytes: nonlinear bytewise substitution (repeated 16x)
• ShiftRows: bytewise permutation
• MixColumns: matrix multiplication in GF(2^8)
• InvSubBytes, InvShiftRows, InvMixColumns
• SubWord: 4 x SubBytes
• RotWord: [a0, a1, a2, a3] → [a1, a2, a3, a0]
• Rcon: in round i equals [{02}^(i-1), {00}, {00}, {00}]

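To make the GF(2^8) arithmetic behind MixColumns concrete, here is a minimal Python sketch. It multiplies bytes modulo the AES polynomial x^8 + x^4 + x^3 + x + 1 and applies the MixColumns matrix to a single 4-byte column. The helper names (`xtime`, `gf_mul`, `mix_column`) are chosen for illustration and do not come from the slides.

```python
def xtime(a: int) -> int:
    """Multiply a byte by {02} in GF(2^8), reducing by the AES polynomial 0x11B."""
    a <<= 1
    if a & 0x100:          # overflow past 8 bits: subtract (xor) the polynomial
        a ^= 0x11B
    return a


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def mix_column(col):
    """Apply the MixColumns matrix (rows are rotations of 02 03 01 01) to one column."""
    a0, a1, a2, a3 = col
    return [
        gf_mul(2, a0) ^ gf_mul(3, a1) ^ a2 ^ a3,
        a0 ^ gf_mul(2, a1) ^ gf_mul(3, a2) ^ a3,
        a0 ^ a1 ^ gf_mul(2, a2) ^ gf_mul(3, a3),
        gf_mul(3, a0) ^ a1 ^ a2 ^ gf_mul(2, a3),
    ]
```

The same `xtime` step also generates the Rcon constants: the round-i constant is `xtime` applied i-1 times to {01}.
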
AES Encryption
40/48/56 steps (for AES-128/192/256)
  Tmp = AddRoundKey (Data, Round_Key_Encrypt [0])
  For round = 1-9 or 1-11 or 1-13:
      Tmp = ShiftRows (Tmp)
      Tmp = SubBytes (Tmp)
      Tmp = MixColumns (Tmp)
      Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [round])
  end loop
  Tmp = ShiftRows (Tmp)
  Tmp = SubBytes (Tmp)
  Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [10 or 12 or 14])
  Result = Tmp

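The encryption flow above can be sketched end to end for AES-128. This is a slow reference model for checking the step order, not production code and not the implementation the slides describe. The S-box is derived from its definition (field inverse followed by the affine map) instead of being copied from a table. The state is assumed to be a flat list of 16 bytes in column-major order, and all function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r


def _sbox_entry(x):
    # Multiplicative inverse by brute force ({00} maps to {00}), then the affine map.
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    s = inv
    for shift in range(1, 5):
        s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return s ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def expand_key_128(key):
    """Return the 11 round keys of AES-128 as lists of 16 bytes."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = [SBOX[b] for b in temp[1:] + temp[:1]]  # SubWord(RotWord(temp))
            temp[0] ^= rcon
            rcon = gf_mul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return [sum(words[4 * r:4 * r + 4], []) for r in range(11)]


def add_round_key(s, k):
    return [a ^ b for a, b in zip(s, k)]


def sub_bytes(s):
    return [SBOX[b] for b in s]


def shift_rows(s):
    # Byte (row r, column c) sits at index r + 4*c; row r rotates left by r columns.
    return [s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def mix_columns(s):
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        out += [
            gf_mul(2, a[0]) ^ gf_mul(3, a[1]) ^ a[2] ^ a[3],
            a[0] ^ gf_mul(2, a[1]) ^ gf_mul(3, a[2]) ^ a[3],
            a[0] ^ a[1] ^ gf_mul(2, a[2]) ^ gf_mul(3, a[3]),
            gf_mul(3, a[0]) ^ a[1] ^ a[2] ^ gf_mul(2, a[3]),
        ]
    return out


def aes128_encrypt_block(block, key):
    """Encrypt one 16-byte block, following the slide's step order."""
    rk = expand_key_128(key)
    tmp = add_round_key(list(block), rk[0])
    for rnd in range(1, 10):
        tmp = shift_rows(tmp)
        tmp = sub_bytes(tmp)
        tmp = mix_columns(tmp)
        tmp = add_round_key(tmp, rk[rnd])
    tmp = shift_rows(tmp)
    tmp = sub_bytes(tmp)
    tmp = add_round_key(tmp, rk[10])
    return bytes(tmp)
```

ShiftRows and SubBytes commute, so applying ShiftRows first (as on the slide) gives the same result as the SubBytes-first order used in FIPS-197.
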
AES Decryption (Equivalent Inverse Cipher)
  Tmp = AddRoundKey (Data, Round_Key_Decrypt [0])
  For round = 1-9 or 1-11 or 1-13:
      Tmp = InvShiftRows (Tmp)
      Tmp = InvSubBytes (Tmp)
      Tmp = InvMixColumns (Tmp)
      Tmp = AddRoundKey (Tmp, Round_Key_Decrypt [round])
  end loop
  Tmp = InvShiftRows (Tmp)
  Tmp = InvSubBytes (Tmp)
  Tmp = AddRoundKey (Tmp, Round_Key_Decrypt [10 or 12 or 14])
  Result = Tmp

AES-128 and AES-256 Key Expansion (Encrypt)

AES-128 Key Expansion
  for (i = 0 .. 3) { w[i] = Cipher Key[i] }
  for (i = 4 .. 43) {
      temp = w[i-1]
      if (i mod 4 = 0) {
          temp = SubWord(RotWord(temp)) xor Rcon
      }
      w[i] = w[i-4] xor temp
  }

AES-256 Key Expansion
  for (i = 0 .. 7) { w[i] = Cipher Key[i] }
  for (i = 8 .. 59) {
      temp = w[i-1]
      if (i mod 8 = 0) {
          temp = SubWord(RotWord(temp)) xor Rcon
      }
      else if (i mod 8 = 4) {
          temp = SubWord(temp)
      }
      w[i] = w[i-8] xor temp
  }

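The two expansion loops differ only in the key length Nk (4 or 8 words) and in the extra SubWord step of AES-256, so a single routine can cover both. The sketch below is one possible rendering, assuming each word is held as a 4-byte `bytes` object in memory order. The names `expand_key`, `rot_word` and `sub_word` are illustrative, and the S-box is derived from its definition instead of being copied from a table.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r


def _sbox_entry(x):
    # Multiplicative inverse by brute force ({00} maps to {00}), then the affine map.
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    s = inv
    for shift in range(1, 5):
        s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return s ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def rot_word(w: bytes) -> bytes:
    """[a0, a1, a2, a3] -> [a1, a2, a3, a0]"""
    return w[1:] + w[:1]


def sub_word(w: bytes) -> bytes:
    """Apply the S-box to each of the four bytes."""
    return bytes(SBOX[b] for b in w)


def expand_key(key: bytes):
    """Expand a 16-byte or 32-byte cipher key into 44 or 60 words."""
    nk = len(key) // 4
    if len(key) % 4 or nk not in (4, 8):
        raise ValueError("this sketch covers AES-128 and AES-256 only")
    total = 4 * (nk + 7)                 # 44 words for AES-128, 60 for AES-256
    w = [key[4 * i:4 * i + 4] for i in range(nk)]
    rcon = 1
    for i in range(nk, total):
        temp = w[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp))
            temp = bytes([temp[0] ^ rcon]) + temp[1:]   # xor Rcon into the first byte
            rcon = gf_mul(rcon, 2)
        elif nk == 8 and i % 8 == 4:
            temp = sub_word(temp)                        # extra step, AES-256 only
        w.append(bytes(a ^ b for a, b in zip(w[i - nk], temp)))
    return w
```

Round key r is then the concatenation of words `w[4*r]` to `w[4*r + 3]`.
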
Preparing the decryption key schedule

  Encrypt Keys (words)    Encrypt Round Keys    Decrypt Round Keys
  K0  K1  K2  K3          Key0
  K4  K5  K6  K7          Key1                  InvMixCols
  K8  K9  K10 K11         Key2                  InvMixCols
  K12 K13 K14 K15         Key3                  InvMixCols
  K16 K17 K18 K19         Key4                  InvMixCols
  K20 K21 K22 K23         Key5                  InvMixCols
  K24 K25 K26 K27         Key6                  InvMixCols
  K28 K29 K30 K31         Key7                  InvMixCols
  K32 K33 K34 K35         Key8                  InvMixCols
  K36 K37 K38 K39         Key9                  InvMixCols
  K40 K41 K42 K43         Key10

For the Equivalent Inverse cipher: apply InvMixCols to Encrypt Round keys

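A sketch of this preparation step follows. It assumes the round keys are 16-byte `bytes` objects in column-major order and that the decryption keys are consumed in reverse order, so that `Round_Key_Decrypt[0]` is the last encryption round key, as the decryption pseudocode implies. The first and last keys pass through unchanged; the ones in between go through InvMixColumns. The function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r


def inv_mix_columns(block: bytes) -> bytes:
    """Apply the InvMixColumns matrix (rotations of 0e 0b 0d 09) to each 4-byte column."""
    out = bytearray()
    for c in range(4):
        a = block[4 * c:4 * c + 4]
        out += bytes([
            gf_mul(0x0E, a[0]) ^ gf_mul(0x0B, a[1]) ^ gf_mul(0x0D, a[2]) ^ gf_mul(0x09, a[3]),
            gf_mul(0x09, a[0]) ^ gf_mul(0x0E, a[1]) ^ gf_mul(0x0B, a[2]) ^ gf_mul(0x0D, a[3]),
            gf_mul(0x0D, a[0]) ^ gf_mul(0x09, a[1]) ^ gf_mul(0x0E, a[2]) ^ gf_mul(0x0B, a[3]),
            gf_mul(0x0B, a[0]) ^ gf_mul(0x0D, a[1]) ^ gf_mul(0x09, a[2]) ^ gf_mul(0x0E, a[3]),
        ])
    return bytes(out)


def decrypt_round_keys(enc_keys):
    """Derive Equivalent Inverse Cipher round keys from the encryption round keys."""
    nr = len(enc_keys) - 1                      # 10, 12 or 14 rounds
    dec = [enc_keys[nr]]                        # used first, left unchanged
    dec += [inv_mix_columns(enc_keys[nr - i]) for i in range(1, nr)]
    dec.append(enc_keys[0])                     # used last, left unchanged
    return dec
```
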
Performance Hungry Applications

Performance hungry AES usage models
Relevant to client and server platforms
• SSL/TLS for HTTPS
• IPSec
• OS Based Disk Encryption
  – E.g., Microsoft Bitlocker
  – Similar in Linux
• File encryption utilities
• Storage Encryption
• Voice Over IP Security (VOIP)

The Security Issue

CPU cache
Memory is a tradeoff between capacity and latency (and cost)
Most instructions involve memory (load and store)
Cache = small and fast memory
• works close to the CPU’s frequency
• hides the latency of larger memories
• speculative: holds the “next” required data
Problem: in a multitasking environment, memory accesses can be made implicitly data-dependent

Cache-based attacks (among others)
Theoretical attacks by Page:
• Time-driven: execution time as a function of the number of cache hits/misses
  – 2003: Tsunoo et al. on DES
  – 2004: Bernstein on first round of AES
  – 2006: Neve et al. on first and second round of AES
• Trace-driven: sequence of cache hits/misses
  – 2005: Bertoni et al. on AES through SimpleScalar
  – 2005: Lauradoux et al. on AES
  – 2006: Acıiçmez et al. on AES
• Access-driven: cache line accesses of the crypto process
  – 2005: Percival on RSA with multithreaded processors
  – 2005-06: Osvik, Shamir et al. on AES with multithreaded processors
  – 2005-06: Neve and Seifert on AES with single-threaded processors and last round attack

Table based AES (e.g., OpenSSL)
Table-based implementations make accesses and operations easier on 32-bit processors
For AES encryption, 5 precomputed tables: [1-byte] → [4-byte]
Composed from two tables S and S’: [1-byte] → [1-byte]
  T0 = [S’, S, S, S⊕S’]
  T1 = [S⊕S’, S’, S, S]
  T2 = [S, S⊕S’, S’, S]
  T3 = [S, S, S⊕S’, S’]
  T4 = [S, S, S, S]

  /* round 1: */
  t0 = T0[s0 >> 24] ^ T1[(s1 >> 16) & 0xff] ^ T2[(s2 >> 8) & 0xff] ^ T3[s3 & 0xff] ^ rcon[4];
  t1 = T0[s1 >> 24] ^ T1[(s2 >> 16) & 0xff] ^ T2[(s3 >> 8) & 0xff] ^ T3[s0 & 0xff] ^ rcon[5];
  t2 = T0[s2 >> 24] ^ T1[(s3 >> 16) & 0xff] ^ T2[(s0 >> 8) & 0xff] ^ T3[s1 & 0xff] ^ rcon[6];
  t3 = T0[s3 >> 24] ^ T1[(s0 >> 16) & 0xff] ^ T2[(s1 >> 8) & 0xff] ^ T3[s2 & 0xff] ^ rcon[7];
  /* round 2: */
  …

Table based AES
T4 is used for the last round (no MixColumns) and for Key Expansion
T4 = (each value repeated 4x)

  msb\lsb   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   0       63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
   1       ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
   2       b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
   3       04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
   4       09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
   5       53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
   6       d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
   7       51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
   8       cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
   9       60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
  10       e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
  11       e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
  12       ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
  13       70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
  14       e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
  15       8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16

Exploiting OS scheduling
AES rounds are short compared with the time between context switches
Preemptive scheduling: a process can yield the CPU before the end of its OS quantum
2 processes
• spy continuously watches the cache accesses
• crypto runs for small amounts of time
[Figure: timeline of one OS quantum, from “start of OS quantum” to “end of OS quantum”, with phases labeled “accessing tables” and “(re)loading table and wait”.]

Cache sharing leakages
Two processes on the same processor: crypto and spy
1. spy loads a (large) table
2. crypto runs on the processor
3. spy reloads and times each table line:
   • short loading time → line not evicted
   • long loading time → line evicted
[Figure: cache lines]

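The three steps can be played out on a toy model. The sketch below is a deliberately simplified simulation and not a working attack: it assumes a direct-mapped cache with 16 sets, a 256-entry lookup table with 16 entries per cache line, and a victim whose first-round table index is `plaintext_byte ^ key_byte`. A hit/miss flag stands in for the load-time measurement. All names and parameters are hypothetical.

```python
NUM_SETS = 16           # toy direct-mapped cache: one line per set (assumption)
ENTRIES_PER_LINE = 16   # table entries sharing one cache line (assumption)


class ToyCache:
    """Tracks which process owns the single line held in each cache set."""

    def __init__(self):
        self.owner = [None] * NUM_SETS

    def access(self, who, set_index):
        """Touch a set; return True on a hit (fast load), False on a miss (slow load)."""
        hit = self.owner[set_index] == who
        self.owner[set_index] = who      # a miss evicts whatever was there
        return hit


def crypto_step(cache, plaintext_byte, key_byte):
    """Victim: one data-dependent table lookup, as in the first AES round."""
    index = plaintext_byte ^ key_byte
    cache.access("crypto", index // ENTRIES_PER_LINE)


def recover_key_high_nibble(plaintext_byte, key_byte):
    """Spy: prime the cache, let the victim run, then probe for the evicted line."""
    cache = ToyCache()
    for s in range(NUM_SETS):                        # step 1: spy loads its table
        cache.access("spy", s)
    crypto_step(cache, plaintext_byte, key_byte)     # step 2: crypto runs
    evicted = [s for s in range(NUM_SETS)            # step 3: spy reloads each line
               if not cache.access("spy", s)]
    table_line = evicted[0]                          # the "slow" line
    return table_line ^ (plaintext_byte >> 4)        # line = (p ^ k) >> 4
```

In this model a single observation reveals the upper four bits of the key byte, because the evicted line number equals the upper bits of the table index and the plaintext is known to the spy.
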
Mitigation
• There are ways to write AES software that avoid data-dependent memory accesses
  – But they severely degrade performance

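One such technique, shown here only as an illustration and not as the approach the slides have in mind, is to read every table entry on every lookup and select the wanted one with a mask, so that the sequence of memory accesses no longer depends on the index. Python gives no timing guarantees, so the sketch demonstrates the access pattern only. The function name is hypothetical.

```python
def lookup_full_scan(table, index):
    """Return table[index] while touching every entry of a byte table."""
    result = 0
    for i, value in enumerate(table):
        mask = -int(i == index) & 0xFF   # 0xFF for the wanted entry, 0x00 otherwise
        result |= value & mask
    return result
```

The cost is visible immediately: a 256-entry table needs 256 reads per lookup instead of one, which is the kind of slowdown the slide refers to.
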
Intel’s AES Instructions

AES New Instructions (AES-NI)
• Will be introduced into the Intel instruction set starting from 2009
Four instructions to perform AES encryption and decryption
• AESENC – Perform one round of AES encryption
• AESENCLAST – Perform the last round of AES encryption
• AESDEC – Perform one round of AES decryption
• AESDECLAST – Perform the last round of AES decryption
Two instructions to perform AES Key Expansion
• AESKEYGENASSIST – Used for round key expansion
• AESIMC – Convert encryption round keys to a form usable for decryption
• Intel’s architecture uses the equivalent inverse cipher

AES Data Structure
State and Round Key in xmm0 and xmm2/m128

  bits         127-96   95-64   63-32   31-0
  xmm0         X3       X2      X1      X0
  xmm2/m128    X7       X6      X5      X4

The State (xmm0) in matrix representation

  S(0,0) S(0,1) S(0,2) S(0,3)
  S(1,0) S(1,1) S(1,2) S(1,3)
  S(2,0) S(2,1) S(2,2) S(2,3)
  S(3,0) S(3,1) S(3,2) S(3,3)

  X0 = S(3,0) S(2,0) S(1,0) S(0,0)
  X1 = S(3,1) S(2,1) S(1,1) S(0,1)
  X2 = S(3,2) S(2,2) S(1,2) S(0,2)
  X3 = S(3,3) S(2,3) S(1,3) S(0,3)

The 4 AES Round Instructions

AESENC xmm0, xmm2/m128
  Tmp := xmm0;
  Round Key := xmm2/m128;
  Tmp := Shift Rows (Tmp);
  Tmp := Substitute Bytes (Tmp);
  Tmp := Mix Columns (Tmp);
  xmm0 := Tmp xor Round Key

AESDEC xmm0, xmm2/m128
  Tmp := xmm0;
  Round Key := xmm2/m128;
  Tmp := Inverse Shift Rows (Tmp);
  Tmp := Inverse Substitute Bytes (Tmp);
  Tmp := Inverse Mix Columns (Tmp);
  xmm0 := Tmp xor Round Key

AESENCLAST xmm0, xmm2/m128
  Tmp := xmm0;
  Round Key := xmm2/m128;
  Tmp := Shift Rows (Tmp);
  Tmp := Substitute Bytes (Tmp);
  xmm0 := Tmp xor Round Key

AESDECLAST xmm0, xmm2/m128
  State := xmm0;
  Round Key := xmm2/m128;
  Tmp := Inverse Shift Rows (State);
  Tmp := Inverse Substitute Bytes (Tmp);
  xmm0 := Tmp xor Round Key

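The AESENC and AESENCLAST definitions can be modeled in software to check them against published round values. The sketch below is a behavioral model under stated assumptions, not a description of the hardware: registers are 16-byte `bytes` objects in memory order (byte 0 is the least significant byte of X0, that is S(0,0)), the S-box is derived from its definition, and the Python function names mirror the mnemonics for readability only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r


def _sbox_entry(x):
    # Multiplicative inverse by brute force ({00} maps to {00}), then the affine map.
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    s = inv
    for shift in range(1, 5):
        s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return s ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def shift_rows(s: bytes) -> bytes:
    # Byte (row r, column c) sits at index r + 4*c; row r rotates left by r columns.
    return bytes(s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4))


def sub_bytes(s: bytes) -> bytes:
    return bytes(SBOX[b] for b in s)


def mix_columns(s: bytes) -> bytes:
    out = bytearray()
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        out += bytes([
            gf_mul(2, a[0]) ^ gf_mul(3, a[1]) ^ a[2] ^ a[3],
            a[0] ^ gf_mul(2, a[1]) ^ gf_mul(3, a[2]) ^ a[3],
            a[0] ^ a[1] ^ gf_mul(2, a[2]) ^ gf_mul(3, a[3]),
            gf_mul(3, a[0]) ^ a[1] ^ a[2] ^ gf_mul(2, a[3]),
        ])
    return bytes(out)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def aesenc(state: bytes, round_key: bytes) -> bytes:
    """Model of one full encryption round."""
    tmp = shift_rows(state)
    tmp = sub_bytes(tmp)
    tmp = mix_columns(tmp)
    return xor_bytes(tmp, round_key)


def aesenclast(state: bytes, round_key: bytes) -> bytes:
    """Model of the last encryption round (no Mix Columns)."""
    tmp = shift_rows(state)
    tmp = sub_bytes(tmp)
    return xor_bytes(tmp, round_key)
```

A full AES-128 encryption in this model is one xor with round key 0, nine `aesenc` calls, and one `aesenclast` call.
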
Two instructions for Key Expansion

AESIMC xmm0, xmm2/m128
  RoundKey := xmm2/m128;
  xmm0 := InvMixColumns (RoundKey)

AESKEYGENASSIST xmm0, xmm2/m128, imm8
  Tmp := xmm2/m128
  RCON[31-8] := 0; RCON[7-0] := imm8;
  X3[31-0] := Tmp[127-96]; X2[31-0] := Tmp[95-64];
  X1[31-0] := Tmp[63-32]; X0[31-0] := Tmp[31-0];
  xmm0 := [RotWord (SubWord (X3)) XOR RCON, SubWord (X3),
           RotWord (SubWord (X1)) XOR RCON, SubWord (X1)]

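AESKEYGENASSIST xmm0, xmm2/m128, imm8
[Figure: data flow of AESKEYGENASSIST. Of the input words X0, X1, X2, X3, the words X1 and X3 are each duplicated. All four copies pass through the S-box, giving X1’, X1’, X3’, X3’. One copy of each is rotated (X1’’, X3’’) and then XORed with RCON (X1’’’, X3’’’). The output is X1’, X1’’’, X3’, X3’’’.]
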
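The AESKEYGENASSIST data flow can be modeled in a few lines. The sketch below assumes the source operand is a 16-byte `bytes` object in memory order, so X0 occupies bytes 0-3 and X3 bytes 12-15, and that RotWord moves the lowest byte of a word to the top as in the AES Transformations list. The S-box is derived from its definition, and the Python function name mirrors the mnemonic for readability only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r


def _sbox_entry(x):
    # Multiplicative inverse by brute force ({00} maps to {00}), then the affine map.
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    s = inv
    for shift in range(1, 5):
        s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return s ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def aeskeygenassist(src: bytes, imm8: int) -> bytes:
    """Behavioral model: returns the 16 result bytes, lowest dword first."""
    x1, x3 = src[4:8], src[12:16]

    def sub_word(w):
        return bytes(SBOX[b] for b in w)

    def rot_word(w):
        return w[1:] + w[:1]                 # [a0, a1, a2, a3] -> [a1, a2, a3, a0]

    def xor_rcon(w):
        return bytes([w[0] ^ imm8]) + w[1:]  # RCON has imm8 in its low byte only

    s1, s3 = sub_word(x1), sub_word(x3)
    # Memory order: SubWord(X1), RotWord(SubWord(X1)) xor RCON,
    #               SubWord(X3), RotWord(SubWord(X3)) xor RCON
    return s1 + xor_rcon(rot_word(s1)) + s3 + xor_rcon(rot_word(s3))
```

For AES-128, the top dword of the result is the `SubWord(RotWord(temp)) xor Rcon` term of the key expansion loop; XORing it with the first word of the previous round key yields the first word of the next one.
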