Fast symmetric crypto on embedded CPUs
Peter Schwabe
Radboud University Nijmegen, The Netherlands
June 5, 2014
Summer School on the design and security of cryptographic algorithms and devices for real-world applications

Embedded CPUs
4-bit CPUs
◮ TMS 1000
◮ Intel 4004
◮ Atmel MARC4
◮ Toshiba TLCS-47
8-bit CPUs
◮ Atmel AVR
◮ Intel 8051
◮ Microchip Technology PIC
◮ STMicroelectronics STM8
16-bit CPUs
◮ TI MSP430
◮ Microchip Technology PIC24
32-bit CPUs
◮ ARM11
◮ ARM Cortex-M∗
◮ ARM Cortex-A∗
◮ Atmel AVR32
◮ MIPS32
◮ AIM 32-bit PowerPC
◮ STMicroelectronics STM32

Symmetric crypto

Optimizing crypto
◮ This talk: optimize for speed
◮ Implement algorithms in assembly
◮ Available instructions and registers are determined by the target architecture
◮ Throughput: number of instructions (of a certain type) we can do per cycle
◮ Latency of an instruction: number of cycles we have to wait before using the result
◮ Latency and throughput are determined by the microarchitecture
◮ Optimizing software in assembly means:
  ◮ Find a good representation of the data
  ◮ Choose suitable instructions that implement the algorithm
  ◮ Schedule those instructions to hide latencies
  ◮ Assign registers efficiently (avoid spills)

Keccak on ARM11
Joint work with Bo-Yin Yang and Shang-Yi Yang

The ARM11
◮ 16 32-bit integer registers (one used as PC, one used as SP): 14 freely available
◮ Executes at most one instruction per cycle
◮ 1 cycle latency for all relevant arithmetic instructions, 3 cycles for loads from cache
◮ Standard 32-bit RISC instruction set; two exceptions:
  ◮ One input of arithmetic instructions can be rotated or shifted for free as part of the instruction
    ◮ This input is needed one cycle earlier in the pipeline ⇒ “backwards latency” + 1
  ◮ Loads and stores can move 64 bits between memory and 2 adjacent 32-bit registers (same cost as a 32-bit load/store)
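
The two exceptions can be modeled in a few lines of Python. This is only a behavioral sketch, not ARM code: the function names `rotl32`, `eor_rotated` and `load64` are made up for illustration, and the model says nothing about cycle counts. On ARM the built-in rotation is a rotate right, so a left rotation by n would be written as a right rotation by 32 − n.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (n is taken modulo 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def eor_rotated(b, c, n):
    """Model of ONE instruction: a = b XOR (c rotated left by n).

    The rotation of the second input is part of the instruction itself.
    """
    return b ^ rotl32(c, n)

def load64(memory, index):
    """Model of ONE 64-bit load: fill two adjacent 32-bit registers."""
    return memory[index], memory[index + 1]
```

Rotating the result instead, as in `rotl32(b ^ c, n)`, would take a second instruction in this model, which is why the later slides keep rotations on the input side.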

Keccak
◮ State is a 5 × 5 matrix of 64-bit lanes
◮ Absorb message in blocks of 128 bytes
◮ Perform state transformation in 24 rounds; each round:
  ◮ Compute b_0, ..., b_4 as XORs of columns
  ◮ Compute c_0, ..., c_4, each as b_i ⊕ (b_j ≪ 1)
  ◮ Update state columnwise
    ◮ Pick up 5 lanes from a diagonal
    ◮ XOR each lane with one of the c_i
    ◮ Rotate each lane by a different fixed distance
    ◮ Obtain each new lane as l_i ⊕ ((¬l_j) & l_k)
  ◮ One lane per column is additionally XORed with a round constant
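
The round structure above can be sketched in Python as follows. This is the conventional step-by-step formulation of the Keccak-f[1600] permutation, not the instruction ordering used in the ARM11 assembly; the names `keccak_f1600`, `rotl64`, `b` and `c` are chosen here to match the slide. Rotation distances and round constants are generated on the fly rather than tabulated, and in this formulation the round constant is XORed into the single lane at position (0, 0).

```python
MASK64 = (1 << 64) - 1

def rotl64(x, n):
    """Rotate a 64-bit lane left by n bits (n is taken modulo 64)."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64

def keccak_f1600(state):
    """Apply 24 rounds to state[x][y] (5 x 5 lanes); returns a new state."""
    a = [list(column) for column in state]
    lfsr = 1  # 8-bit LFSR that produces the round-constant bits
    for _ in range(24):
        # b_0..b_4: XOR of the five lanes in each column
        b = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
        # c_i = b_(i-1) XOR (b_(i+1) rotated left by 1)
        c = [b[(x + 4) % 5] ^ rotl64(b[(x + 1) % 5], 1) for x in range(5)]
        # XOR every lane with the c value of its column
        a = [[a[x][y] ^ c[x] for y in range(5)] for x in range(5)]
        # Rotate each lane by its fixed distance and move it to its new place
        t = [[0] * 5 for _ in range(5)]
        t[0][0] = a[0][0]
        x, y = 1, 0
        for i in range(24):
            nx, ny = y, (2 * x + 3 * y) % 5
            t[nx][ny] = rotl64(a[x][y], (i + 1) * (i + 2) // 2)
            x, y = nx, ny
        # New lane = l_i XOR ((NOT l_j) AND l_k) along each row
        a = [[t[x][y] ^ (~t[(x + 1) % 5][y] & t[(x + 2) % 5][y])
              for y in range(5)] for x in range(5)]
        # XOR the round constant into lane (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                a[0][0] ^= 1 << ((1 << j) - 1)
    return a
```

Absorbing a message block means XORing it into the first lanes of the state before calling `keccak_f1600`; the padding and block size depend on the Keccak instance in use.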

A 64-bit hash function on a 32-bit CPU
◮ Represent each lane in two registers; XOR and AND are trivial
◮ How about 64-bit rotate with 32-bit registers?
◮ Answer by the Keccak implementation guide: bit interleaving
  ◮ Put all bits from even positions into one 32-bit register, all odd bits into the other
  ◮ Perform all rotates for free on 32-bit registers
◮ a ← b ⊙ (c ≪ n) is free rotation, but a ← (b ⊙ c) ≪ n is not
◮ Don’t rotate output, rotate for free when the value is used as input
◮ When both inputs of an instruction need to be rotated: a ← (b ≪ n_1) ⊙ (c ≪ n_2)
  ◮ Compute: a ← b ⊙ (c ≪ (n_2 − n_1)) and set the implicit rotation distance of a to n_1
◮ Need to keep implicit rotation distances invariant over loop iterations
◮ Full unrolling essentially makes all rotates free
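
Both ideas on this slide can be checked with a small Python model. It is a sketch under assumptions: all function names are hypothetical, the bit-by-bit `interleave` loop is written for clarity rather than speed, and `xor_lazy` uses XOR as the example for the generic operation ⊙.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (n is taken modulo 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def interleave(lane):
    """Split a 64-bit lane: bit 2k goes to even bit k, bit 2k+1 to odd bit k."""
    even = odd = 0
    for k in range(32):
        even |= ((lane >> (2 * k)) & 1) << k
        odd |= ((lane >> (2 * k + 1)) & 1) << k
    return even, odd

def deinterleave(even, odd):
    """Inverse of interleave: rebuild the 64-bit lane from its two halves."""
    lane = 0
    for k in range(32):
        lane |= ((even >> k) & 1) << (2 * k)
        lane |= ((odd >> k) & 1) << (2 * k + 1)
    return lane

def rotl64_interleaved(even, odd, n):
    """Rotate the 64-bit lane left by n using only 32-bit rotations."""
    n %= 64
    if n % 2 == 0:
        return rotl32(even, n // 2), rotl32(odd, n // 2)
    # Odd distance: even-position bits land on odd positions and vice versa
    return rotl32(odd, (n + 1) // 2), rotl32(even, (n - 1) // 2)

def xor_lazy(b, nb, c, nc):
    """XOR two words that carry pending left rotations nb and nc.

    Returns (a, na) with rotl32(a, na) == rotl32(b, nb) ^ rotl32(c, nc).
    Only the second input is rotated; the result inherits the implicit
    rotation distance nb instead of being rotated itself.
    """
    return b ^ rotl32(c, nc - nb), nb
```

In unrolled code the implicit distance `na` is only bookkeeping known at assembly time; it never needs to exist in a register.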

Memory access overhead
◮ 200-byte state is way too large for 56 register bytes
◮ Simple structure of main transformations:
  ◮ Load 5 half-lanes
  ◮ Load 5 values c_i
  ◮ Perform arithmetic (10 XOR, 5 AND)
  ◮ Store 5 result lanes
◮ This means 50% load/store overhead
◮ Even worse for computation of b_i and c_i
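
The 50% figure can be reproduced by counting operations, assuming that every load, store and arithmetic operation listed above is one instruction and that "overhead" means the share of memory instructions among all instructions. The variable names are illustrative only.

```python
loads = 5 + 5         # 5 half-lanes plus 5 values c_i
stores = 5            # 5 results written back
arithmetic = 10 + 5   # 10 XOR plus 5 AND

memory_ops = loads + stores
overhead = memory_ops / (memory_ops + arithmetic)  # 15 / 30

register_bytes = 14 * 4  # 14 freely available 32-bit registers
state_bytes = 25 * 8     # 5 x 5 lanes of 64 bits
```

With these counts `overhead` is 0.5, and the 200-byte state is more than three times the 56 bytes of register space.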
