Implementing AES on a Bunch of Processors ECRYPT AES Day – Bruges, Belgium Tim Güneysu Hardware Security Group 10/23/2012 Horst Görtz Institute for IT-Security
Outline • Introduction • Processor Platforms • Tricks, Tweaks and Codes • Benchmarks and Results • Conclusions
The AES Crib Sheet
AES Implementation: General Representation
• Two different representations of AES in the original proposal
  – 8-bit standard implementation
  – 32-bit T-table implementation, e.g.,

$$
\begin{pmatrix} E_{0,j} \\ E_{1,j} \\ E_{2,j} \\ E_{3,j} \end{pmatrix}
= T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15}) \oplus
\begin{pmatrix} k_0 \\ k_1 \\ k_2 \\ k_3 \end{pmatrix}
$$

• ¼ round: 4 table look-ups (TLU) + 4 × 32-bit XOR
• 1 round: 4 × 4 = 16 TLU + XOR
• AES: 160 TLU + XOR per block encryption (AES-128, 10 rounds)
• Memory: 4 T-boxes, 1 kB each
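The quarter-round relation above can be sketched in C as follows. This is a minimal illustration, not the reference implementation: the helper names (`sbox`, `build_t_tables`, `quarter_round`) are hypothetical, the S-box is computed algorithmically instead of being stored, and the word layout (row 0 in the most significant byte) is an assumption. The table generation uses data-dependent branches and is meant for one-time precomputation only.

```c
#include <stdint.h>

/* Multiply by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t x) {
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0x00));
}

/* Generic GF(2^8) multiplication (shift-and-add). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, int s) {
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* AES S-box: field inverse (a^254, with 0 mapped to 0), then affine map. */
static uint8_t sbox(uint8_t a) {
    uint8_t inv = 1;
    for (int i = 0; i < 254; i++) inv = gf_mul(inv, a);
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

static uint32_t ror32(uint32_t x, int s) {
    return (x >> s) | (x << (32 - s));
}

/* Four T-tables with 256 entries of 32 bit each: 4 x 1 kB. */
static uint32_t T[4][256];

static void build_t_tables(void) {
    for (int a = 0; a < 256; a++) {
        uint8_t s  = sbox((uint8_t)a);
        uint8_t s2 = xtime(s);              /* {02} * S(a) */
        uint8_t s3 = (uint8_t)(s2 ^ s);     /* {03} * S(a) */
        uint32_t t0 = ((uint32_t)s2 << 24) | ((uint32_t)s << 16) |
                      ((uint32_t)s << 8) | s3;
        T[0][a] = t0;
        T[1][a] = ror32(t0, 8);
        T[2][a] = ror32(t0, 16);
        T[3][a] = ror32(t0, 24);
    }
}

/* One output column: 4 table look-ups and 4 XORs.
 * A holds the 16 state bytes in column-major order, k is the round-key word. */
static uint32_t quarter_round(const uint8_t A[16], uint32_t k) {
    return T[0][A[0]] ^ T[1][A[5]] ^ T[2][A[10]] ^ T[3][A[15]] ^ k;
}
```

After calling `build_t_tables()`, `T[0][0x00]` equals `0xc66363a5`. The other three output columns of a round use the same function with rotated byte indices.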
AES Implementation: Choice of Processor • A dedicated AES processor in HW is not always the preferred option – AES is often just a supplementary function to a software application – HW development is too costly or the necessary skills are not available • But when doing AES in software: which processor is the best?
AES Implementation: Parameters • Key size of AES – 128, 192, 256 bit • Applied mode of operation – ECB, CBC, GCM, CTR, … • Blocks concurrently processed – Single block (limited data transfers) – Multiple blocks (overhead reduction, bitslicing) • Round key computation – Precomputed (when processing bulk data) – On-the-fly (when changing keys frequently)
Processors and Platforms
• Native bit sizes of General-Purpose Processors (GPP)
  – 4-bit, e.g., MARC-4, NEC uPD75X in pocket calculators/washing machines
  – 8-bit, e.g., Atmel ATMegaXX, Intel 8051 in (many) embedded systems
  – 16-bit, e.g., TI MSP430, DEC PDP-11 in (fewer) embedded systems
  – 32-bit, e.g., ARM, TriCore in smart phones and automobiles
  – 64-bit, e.g., Intel i3/5/7, AMD A-Series in PCs and workstations
  – 128-bit, e.g., IBM Cell in PS3 (actually, only 128-bit SIMD on SPEs)
• Myth or fact: AES is always most efficient on native 8-bit and 32-bit processors!?
Processor Architectures
• General processor design
  – RISC vs. CISC (Reduced/Complex Instruction Set Computer)
  – Single-Instruction Multiple-Data (SIMD) operation
  – Super-scalar devices processing more than one instruction per cycle
• Processor interface to memory
  – Von Neumann vs. Harvard: shared memory for data and program?
  – Cache for data and/or program? (Cache attacks!)
  – Static/dynamic external or built-in RAM?
• Additional processor extensions
  – Multimedia/integer co-processor
  – Special/native Instruction Set Extensions (ISE)
Other Processor Architectures • Streaming processors, such as GPUs – Multi-processors run hundreds of concurrent threads – High memory bandwidth, but high latency to global memory • Digital Signal Processors (DSP) – Support fast combined arithmetic instructions – Are improved arithmetic instructions useful for AES? • Other array/tile-based processors – Synchronous/asynchronous processing cores – Processor-based systolic array cores (Tilera, GreenArrays)
AES Software Optimization • General requirements for secure implementation in software – Disable (or control) cache to prevent cache attacks – Avoid conditional branches to counter timing attacks • Common tweaks to achieve high performance – Make particular use of specialized instructions – Unroll rounds and loops to reduce instruction cycle count – Optimize register allocation – Precompute and store values in tables (e.g., T-tables, round keys and constants) • Common tweaks to minimize code size – Reuse code through functions to minimize instruction count – Limit the amount of precomputed and stored values • Common tweaks for low energy consumption – Reduce the number of costly load and store operations to memory – General approach often similar to the optimization for high performance
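As a small illustration of the "avoid conditional branches" requirement, the following sketch contrasts a branching and a branch-free multiplication by {02} in GF(2^8). The function names are hypothetical, and whether a compiler actually emits branch-free machine code for either variant has to be checked on the target platform.

```c
#include <stdint.h>

/* Branching variant: the reduction depends on a secret-dependent condition. */
static uint8_t xtime_branching(uint8_t x) {
    if (x & 0x80)
        return (uint8_t)((x << 1) ^ 0x1b);
    return (uint8_t)(x << 1);
}

/* Branch-free variant: the top bit is expanded into a mask
 * (0x00 or 0xff) that selects the reduction constant 0x1b. */
static uint8_t xtime_branchless(uint8_t x) {
    uint8_t mask = (uint8_t)(-(x >> 7));
    return (uint8_t)((x << 1) ^ (mask & 0x1b));
}
```

Both variants return the same values for all 256 inputs, e.g., `0xae` for input `0x57`.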
Coding Intermezzo: Have you ever tried to implement AES on a Commodore C64?
• Commodore C64: 8-bit CPU with 64 KB RAM
• AES-256 in ACME Assembler [Extract of source at http://www.robos.org/prog]

    encrypt
            ldx #$07
    .addfirst
            lda aesblock+0,x        ; 4
            eor expkey+0,x          ; 8
            sta tmpblock+0,x        ; 13
            lda aesblock+8,x        ; 17
            eor expkey+8,x          ; 21
            sta tmpblock+8,x        ; 26
            dex                     ; 28
            bpl .addfirst           ; 31

            ldy #$10
    .round
            lda expkey+$00,y        ; 4
            ldx tmpblock+4*0+0      ; 7
            eor ssm0,x              ; 11
            ldx tmpblock+4*1+1      ; 14
            eor ssm3,x              ; 18
            ldx tmpblock+4*2+2      ; 21
            eor ssm2,x              ; 25
            ldx tmpblock+4*3+3      ; 28
            eor ssm1,x              ; 32
            sta aesblock+$00        ; 36

            lda expkey+$01,y
            ldx tmpblock+4*0+0
            eor ssm1,x
            ldx tmpblock+4*1+1
            eor ssm0,x
            ldx tmpblock+4*2+2
            eor ssm3,x
            ldx tmpblock+4*3+3
            eor ssm2,x
            sta aesblock+$01

            lda expkey+$02,y
            ldx tmpblock+4*0+0
            eor ssm2,x
            ldx tmpblock+4*1+1
            eor ssm1,x
            ldx tmpblock+4*2+2
            eor ssm0,x
            ldx tmpblock+4*3+3
            eor ssm3,x
            sta aesblock+$02

            lda expkey+$03,y
            ldx tmpblock+4*0+0
            eor ssm3,x
            ldx tmpblock+4*1+1
            eor ssm2,x
            ldx tmpblock+4*2+2
            eor ssm1,x
            ldx tmpblock+4*3+3
            eor ssm0,x
            sta aesblock+$03

            lda expkey+$04,y
            ldx tmpblock+4*1+0
            eor ssm0,x
            ldx tmpblock+4*2+1
            eor ssm3,x
            ldx tmpblock+4*3+2
            eor ssm2,x
            ldx tmpblock+4*0+3
            eor ssm1,x
            sta aesblock+$04

            lda expkey+$05,y
            ldx tmpblock+4*1+0
            eor ssm1,x
            ldx tmpblock+4*2+1
            eor ssm0,x
            ldx tmpblock+4*3+2
            eor ssm3,x
            ldx tmpblock+4*0+3
            eor ssm2,x
            sta aesblock+$05
Real Coding: Sample T-Table AES in C (Reference code by Brian Gladman)
• High-performance AES for processors ≥ 32 bit with interleaved T-tables (32-bit entries)
• Per round, 4 instances of the code snippet are required
• AES has 720 instructions (INS)
  – 208 loads
  – 4 stores
  – 508 integer instructions
    • 160 shifts
    • 176 masks (+16 for last round)
    • 168 XORs
    • 4 overhead for CTR mode
• Code snippet (only ¼ round shown):

    /* Read round keys */
    z0 = roundkeys[i * 4 + 0];
    z1 = roundkeys[i * 4 + 1];
    z2 = roundkeys[i * 4 + 2];
    z3 = roundkeys[i * 4 + 3];

    /* Extract and mask input bytes */
    p00 = (uint32) y0 >> 20;
    p01 = (uint32) y0 >> 12;
    p02 = (uint32) y0 >> 4;
    p03 = (uint32) y0 << 4;
    p00 &= 0xff0;
    p01 &= 0xff0;
    p02 &= 0xff0;
    p03 &= 0xff0;

    /* Perform TLU */
    p00 = *(uint32 *) (table0 + p00);
    p01 = *(uint32 *) (table1 + p01);
    p02 = *(uint32 *) (table2 + p02);
    p03 = *(uint32 *) (table3 + p03);

    /* Add TLU to round keys */
    z0 ^= p00;
    z3 ^= p01;
    z2 ^= p02;
    z1 ^= p03;
    …

• Memory layout (interleaved); table0 to table3 point to the first entry of the respective table

    Offset (byte)   Content
     0              Table 0 | Table 1 | Table 2 | Table 3
    16              Table 0 | Table 1 | Table 2 | Table 3
    32              Table 0 | Table 1 | Table 2 | Table 3
    48              Table 0 | Table 1 | Table 2 | Table 3

• Access j-th table entry of table i via table<i>+16j
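The interleaved layout, in which entry j of table i is reached via table<i>+16j, can be tried out with the following sketch. It assumes that the four 32-bit entries with the same index j share one 16-byte group, so that table<i> is the base address plus 4i. The function names and the dummy table contents used for testing are hypothetical, and `memcpy` replaces the pointer cast of the slide code to stay within strict C aliasing rules.

```c
#include <stdint.h>
#include <string.h>

/* Interleave four 256-entry tables into one 4 kB buffer:
 * entry j of table i is stored at byte offset 16*j + 4*i. */
static void interleave_tables(uint32_t t[4][256], uint8_t out[4096]) {
    for (int j = 0; j < 256; j++)
        for (int i = 0; i < 4; i++)
            memcpy(out + 16 * j + 4 * i, &t[i][j], sizeof(uint32_t));
}

/* Table look-up at a byte offset that is already a multiple of 16. */
static uint32_t tlu(const uint8_t *table_i, uint32_t offset) {
    uint32_t v;
    memcpy(&v, table_i + offset, sizeof v);
    return v;
}

/* Process one input word y0 as in the quarter-round snippet:
 * 4 shifts, 4 masks, 4 loads, 4 XORs into z[0], z[3], z[2], z[1]. */
static void process_word(const uint8_t *base, uint32_t y0, uint32_t z[4]) {
    uint32_t p00 = (y0 >> 20) & 0xff0;   /* 16 * byte 3 (most significant) */
    uint32_t p01 = (y0 >> 12) & 0xff0;   /* 16 * byte 2 */
    uint32_t p02 = (y0 >> 4)  & 0xff0;   /* 16 * byte 1 */
    uint32_t p03 = (y0 << 4)  & 0xff0;   /* 16 * byte 0 */

    z[0] ^= tlu(base + 0,  p00);         /* table0 */
    z[3] ^= tlu(base + 4,  p01);         /* table1 */
    z[2] ^= tlu(base + 8,  p02);         /* table2 */
    z[1] ^= tlu(base + 12, p03);         /* table3 */
}
```

Because each offset is already scaled by 16, no further multiplication is needed before the load, which is why the masks use `0xff0` instead of `0xff`.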
Optimizing AES for High-Performance
• Special instruction: Combined Shift-and-Mask
  – On PPC, rlwinm is available as a single instruction
  – Extract and mask input bytes, separate shift and mask:

    p00 = (uint32) y0 >> 20;
    p01 = (uint32) y0 >> 12;
    p02 = (uint32) y0 >> 4;
    p03 = (uint32) y0 << 4;
    p00 &= 0xff0;
    p01 &= 0xff0;
    p02 &= 0xff0;
    p03 &= 0xff0;

  – Extract and mask input bytes, combined shift-and-mask:

    p00 = (uint32) y0 >> 20 & 0xff0;
    p01 = (uint32) y0 >> 12 & 0xff0;
    p02 = (uint32) y0 >> 4 & 0xff0;
    p03 = (uint32) y0 << 4 & 0xff0;

  – Saves 160 instructions for separate masking [BS08]
  – AES on PPC has now 540 instructions
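The combined shift-and-mask step can be modelled in portable C as a rotate followed by an AND, which is the general shape of a rotate-and-mask instruction such as rlwinm. The sketch below is an illustration only: `rot_and_mask` is a hypothetical helper, not a compiler intrinsic, and whether a compiler maps it to a single instruction depends on the target and the optimization settings.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by s bits (s taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned s) {
    s &= 31;
    return (x << s) | (x >> ((32 - s) & 31));
}

/* Hypothetical model of a rotate-left-then-mask operation. */
static uint32_t rot_and_mask(uint32_t x, unsigned sh, uint32_t mask) {
    return rotl32(x, sh) & mask;
}

/* The four table offsets of one input word, one operation each.
 * A right shift by n equals a left rotation by 32 - n once the
 * wrapped-around bits are removed by the mask 0xff0. */
static void extract_offsets(uint32_t y0, uint32_t p[4]) {
    p[0] = rot_and_mask(y0, 12, 0xff0);  /* (y0 >> 20) & 0xff0 */
    p[1] = rot_and_mask(y0, 20, 0xff0);  /* (y0 >> 12) & 0xff0 */
    p[2] = rot_and_mask(y0, 28, 0xff0);  /* (y0 >> 4)  & 0xff0 */
    p[3] = rot_and_mask(y0, 4,  0xff0);  /* (y0 << 4)  & 0xff0 */
}
```

For `y0 = 0x11223344` the offsets are `0x110`, `0x220`, `0x330`, and `0x440`, i.e., 16 times each input byte.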
Optimizing AES for High-Performance [cont.]
• Special instruction: Scaled Index Loads
  – On x86, shift and load instructions can be combined
  – Separate steps, extract and mask input bytes, then perform TLU:

    p03 = (uint32) y0 << 4;
    p03 &= 0xff0;
    p03 = *(uint32 *) (table3 + p03);

  – Combined, mask first and then do a shifted TLU:

    p03 = y0 & 0xff;
    p03 = *(uint32 *) (table3 + (p03 << 4));

  – Saves 80 instructions for separate shifting of top and bottom bytes [BS08]
  – AES on x86 has 640 instructions (not to be combined with the previous method!)
Optimizing AES for High-Performance [cont.]
• Availability of 64-bit Registers
  – On AMD64 and UltraSparcV9, use padded values in 64-bit registers: 0xc66363a5 → 0x0c60063006300a50
  – Padding implicitly includes the shift by 4 bit (aka multiplication by 16)
  – Padding is applied consistently through the entire AES
  – Saves 80 instructions (no need to mask top bytes anymore) [BS08]
  – AES now has 640 instructions (again, not to be combined)
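The padded representation can be reproduced with a few lines of C. This is a sketch under the assumption that each byte of the 32-bit word is placed in its own 16-bit lane and pre-shifted by 4 bit, which matches the example value on the slide; the function names are hypothetical.

```c
#include <stdint.h>

/* Spread the 4 bytes of w over four 16-bit lanes, each shifted left by 4. */
static uint64_t pad32(uint32_t w) {
    uint64_t p = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t byte = (w >> (8 * i)) & 0xff;
        p |= byte << (16 * i + 4);
    }
    return p;
}

/* Table offset (16 * byte) of the top lane: a single shift, no mask. */
static uint32_t top_offset(uint64_t p) {
    return (uint32_t)(p >> 48);
}

/* Table offset of any lane 0..3; lower lanes still need the mask. */
static uint32_t lane_offset(uint64_t p, int lane) {
    return (uint32_t)(p >> (16 * lane)) & 0xff0;
}
```

`pad32(0xc66363a5)` yields `0x0c60063006300a50`, and `top_offset` of that value is `0xc60`, i.e., 16 × `0xc6`. Since padding only moves bits, it commutes with XOR, so padded table entries and padded round keys can be combined directly.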