cryptomaniac


  1. CryptoManiac
     Slides borrowed with permission from Todd Austin and Lisa Wu, University of Michigan Advanced Computer Architecture Laboratory

     Architecture's Diminishing Return
     • Staples of value we strive for…
       • High speed
       • Low power
       • Low cost
     • Tricks of the trade
       • Faster clock rates, via pipelining
       • Higher instruction throughput, via ILP extraction
     • Strong evidence of diminishing return, PIII vs. P4
       • 22% less P4 instruction throughput (0.35 vs. 0.45 SPECInt/MHz)
     • Less return ⇒ less value

  2. A Powerful Solution: Eschew Generality
     [Figure: design spectrum trading speed and efficiency against flexibility and programmability, spanning H/W designs, application-specific processors, general-purpose processors + ISA extensions, and general-purpose processors]
     • Specialization limits the scope of a device's operation
       • Produces stronger properties and invariants
       • Results in higher-return optimizations
     • Programmability preserves the flexibility afforded by GPPs
     • A natural fit for embedded designs
       • Where application domains are more likely restrictive
       • Where cost and power are first-order concerns

     Cryptography
     • Definitions:
       • encryption vs. decryption
       • public-key cipher vs. secret-key cipher
     • Public-secret key ciphers are the most commonly used
     [Figure, public-key cipher: plaintext -> f(x) with public key -> ciphertext -> g(x) with private key -> plaintext]
     [Figure, secret-key cipher: plaintext -> g(x) with private key -> ciphertext -> g(x) with private key -> plaintext]

  3. SSL Session Breakdown
     Focus: Secret-Key Ciphers
     [Figure: client/server SSL session timeline: authenticate (public key), https get (private key), https recv (private key), ..., close]
     [Figure: "SSL Characterization by Session Length": relative contribution to run time (0% to 100%) of Public, Private, and Other versus SSL session length (1k to 32k bytes); annotation: average size of a single web object (21k)]

     Benchmark Suite
     Cipher     Key Size  Blk Size  Rnds/Blk  Author        Application
     3DES       112       64        48        CryptSoft     SSL, SSH
     Blowfish   128       64        16        CryptSoft     Norton Utilities
     IDEA       128       64        8         Ascom         PGP, SSH
     Mars       128       128       16        IBM           AES Candidate
     RC4        128       8         1         CryptSoft     SSL
     RC6        128       128       18        RSA Security  AES Candidate
     Rijndael   128       128       10        Rijmen        AES Standard
     Twofish    128       128       16        Counterpane   AES Candidate

  4. Cipher Throughput Analysis
     [Figure: bar chart of per-cipher throughput (axis 0 to 350) for Alpha 21264, 4W, and DF across Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, and Twofish]
     • Alpha 21264 vs. 4W
       • All except Mars and Twofish were within 10% of the actual machine tests
       • Mars 11%, Twofish 15%
     • Alpha 21264 vs. DF
       • Blowfish, IDEA, and RC6 run within 20% of DF performance
       • Mars 29%, Twofish 76%
       • RC4 and Rijndael are outliers

     Characteristics of Cipher Kernels
     • Diffusion (goal of cryptography)
       • Goal is to randomly impress upon each group of output bits some information from each of the input bits
       • Process needs to be reversible
       • Should result in a random perturbation of each output bit with a probability > 50%
     • Cipher kernel loops run about 16 times on each block of data, mixing the data more and more each round
     • Cipher kernels have little to no parallelism
       • Usually a very long recurrence

  5. Breakdown of Cipher Operations
     • Rotates
       • Rotate the bits in a register
     • Modular addition
     • Modular multiplication (2^N + 1 prime modulus operations)
     • Substitutions
       • Table-based substitutions
       • SBOX: a table of values indexed with plaintext (a byte) that produces the result of the key-parameterized function
     • General permutations
       • XBOX: maps N bits onto N bits with any arbitrary exchange of individual bits

     Blowfish Cipher Kernel
     for (ii = 0; ii < BF_ROUNDS; ii++) {
         register BF_LONG tmp;
         r ^= p[ii+1];                    /* mix in the round subkey */
         r ^= (((sbox[(int)(l >> 24L)] +
                 sbox[0x0100 + ((int)(l >> 16L) & 0xff)]) ^
                 sbox[0x0200 + ((int)(l >> 8L) & 0xff)]) +
                 sbox[0x0300 + ((int)(l) & 0xff)]) & 0xffffffffL;
         tmp = r; r = l; l = tmp;         /* swap halves for the next round */
     }
     r ^= p[BF_ROUNDS+1];
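     The arithmetic operations listed above can be sketched in portable C as follows. This is only an illustrative model: the function names are hypothetical, and the modular multiply uses the prime 2^16 + 1 = 0x10001 (the N = 16 case) as a plain remainder, without the IDEA convention of treating 0 as 2^16.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (n is taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Modular addition: the result is reduced modulo 2^32 by the cast. */
static uint32_t addmod32(uint32_t a, uint32_t b)
{
    return (uint32_t)(a + b);
}

/* Modular multiplication modulo the prime 2^16 + 1 = 0x10001.
 * The product is formed in 64 bits so it cannot overflow. */
static uint32_t mulmod_10001(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) % 0x10001u);
}

/* Table-based substitution: one byte of y selects a 32-bit table entry.
 * shift must be less than 32. */
static uint32_t sbox_lookup(const uint32_t table[256], uint32_t y, unsigned shift)
{
    return table[(y >> shift) & 0xffu];
}
```

     On a general-purpose processor without a rotate instruction, rotl32 alone costs two shifts and an OR, which is one reason rotates show up as a target for ISA support.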

  6. Cipher Bottleneck Analysis
     [Figure: "Analysis of Bottlenecks in Cipher Kernels": relative run time (0 to 1) per cipher for the Alias, Branch, Issue, Mem, Res, Window, and All configurations]
     • Alias - impact of stalling loads in the pipeline until all earlier store addresses have been resolved
     • Branch - effects of mispredictions
     • Issue - impact of reducing issue width
     • Mem - impact of introducing a realistic memory system
     • Res - impact of limited functional unit resources
     • Window - impact of a limited-size instruction window

     Focus: Kernel Loop
     [Figure: relative run time cost (0 to 100) per cipher versus session length in bytes (16 to 1M)]
     • 3DES and IDEA are small even for 16-byte sessions
     • Mars, RC4, RC6, Rijndael, and Twofish drop well below 10% for 4k+ byte sessions
     • Blowfish is the outlier; it drops below 10% only for 64k+ byte sessions

  7. Cipher Kernel Characterization
     [Figure: "Characterization of Cipher Kernel Operations": share (0% to 100%) of Branch, Mov, Ld/St, Xbox, Sbox, Mult, Rotates, Logical, and Arith operations per cipher]
     • SBOX - substitutions
     • XBOX - permutations
     • IDEA, Mars, RC4, and RC6 rely on arithmetic computations; they benefit from more resources (multiplies) and from faster operations (rotates)
     • Blowfish, 3DES, Rijndael, and Twofish rely on substitutions; they benefit from increased memory bandwidth and accesses

     Architectural Extensions
     • All instructions are limited to two register input operands and one register output
     • ROL and ROR (rotates) for 64-bit and 32-bit data types
     • ROLX and RORX support a constant rotate of a register input, followed by an XOR with another register input
     • MULMOD computes the modular multiplication of two register values modulo the value 0x10001
     • SBOX speeds the accessing of substitution tables with 256-entry tables and 32-bit contents
     • SBOXSYNC synchronizes the SBOX table with memory
     • XBOX implements a portion of a full 64-bit permutation
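     As a rough software model of the ROLX and RORX extensions described above, the sketch below rotates one input by a constant and XORs the result with a second input. The function names, the operand order, and the 32-bit width are assumptions made for illustration; the slides do not give the instruction encoding.

```c
#include <stdint.h>

/* Plain 32-bit rotates, with the count taken modulo 32. */
static uint32_t rol32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}

/* ROLX model (hypothetical operand order): rotate src left by the
 * constant k, then XOR with a second register value. */
static uint32_t rolx32(uint32_t src, unsigned k, uint32_t other)
{
    return rol32(src, k) ^ other;
}

/* RORX model: the same, with a right rotate. */
static uint32_t rorx32(uint32_t src, unsigned k, uint32_t other)
{
    return ror32(src, k) ^ other;
}
```

     Folding the XOR into the rotate removes one instruction from the dependence chain of each round, which matters because the kernels are dominated by a long recurrence.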

  8. Crypto-Specific ISA
     [Figure: SBOX address formation: the opcode selects a byte of the index register; the table base occupies bits 31..10, the extracted byte forms the index, and the low two bits are 00, addressing the SBOX table]
     • Frequent SBOX substitutions
       • X = sbox[(y >> c) & 0xff]
       • X = sbox[ m[ j^c] [1] ]
     • SBOX instruction eliminates address generation
       • All SBOX tables are aligned to a 1k-byte boundary
       • Address generation becomes zero-latency bit concatenation
     • Stores to SBOX storage are not visible to later SBOXs until
       • An SBOXSYNC is executed
       • An alias bit is set
     • SBOX instruction
       • Incorporates byte extract
       • Speeds address generation
       • Original 4-cycle operation becomes a 1-cycle CryptoManiac instruction

     Crypto-Specific ISA (cont.)
     • Ciphers often mix logical and arithmetic operations
       • Excellent diffusion properties plus resistance to attacks
     • ISA supports instruction combining
       • Logical + ALU op, ALU op + Logical
       • Eliminates dangling XORs
       • Reduces kernel loop critical paths by nearly 25%
       • Small (< 5%) increase in clock cycle time
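     The claim that a 1k-byte-aligned table turns address generation into bit concatenation can be checked with a small sketch. The two functions below are hypothetical models, not the hardware: one forms the address with the usual shift, mask, scale, and add, the other by placing the extracted byte in bits 9..2 under the table base. The byte_sel parameter (0 to 3) is an assumed stand-in for the byte-extract field.

```c
#include <stdint.h>

/* Conventional address generation: extract a byte, scale by the
 * 4-byte entry size, and add to the table base. */
static uint32_t sbox_addr_add(uint32_t table_base, uint32_t y, unsigned byte_sel)
{
    uint32_t index = (y >> (8u * byte_sel)) & 0xffu;
    return table_base + index * 4u;
}

/* Concatenation: keep bits 31..10 of the base, put the 8-bit index in
 * bits 9..2, and leave bits 1..0 as 00. No adder is needed. */
static uint32_t sbox_addr_concat(uint32_t table_base, uint32_t y, unsigned byte_sel)
{
    uint32_t index = (y >> (8u * byte_sel)) & 0xffu;
    return (table_base & ~0x3ffu) | (index << 2);
}
```

     The two agree whenever table_base is a multiple of 1024, because 256 entries of 4 bytes fill exactly the low 10 address bits and the add can never carry into the base.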

  9. Performance of ISA Extensions
     [Figure: bar chart (axis 0 to 4.5) comparing Orig/4W, Opt/4W, Opt/4W+, Opt/8W+, and Opt/DF across Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, and Twofish]

     CryptoManiac ISA
     • bundle := <inst><inst><inst><inst>
     • inst := <operation pair><dest><operand 1><operand 2><operand 3>
     • operation pair := <short><tiny> | <tiny><short> | <tiny><tiny> | <long><nop>
     • tiny := <xor> | <and> | <inc> | <signext> | <nop>
     • short := <add> | <sub> | <rot> | <sbox> | <nop>
     • long := <mul> | <mulmod>

     Instruction               Semantics
     Add-Xor r4, r1, r2, r3    r4 <- (r1 + r2) ⊕ r3
     And-Rot r4, r1, r2, r3    r4 <- (r1 & r2) <<< r3
     And-Xor r4, r1, r2, r3    r4 <- (r1 & r2) ⊕ r3
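     The three combined instructions in the table can be modeled in C as below. This is a sketch of the stated semantics only: it assumes 32-bit registers, reads the AND in the table as a bitwise AND, reads <<< as a left rotate, and uses hypothetical function names.

```c
#include <stdint.h>

/* 32-bit left rotate, with the count taken modulo 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Add-Xor r4, r1, r2, r3: a short op (add) followed by a tiny op (xor). */
static uint32_t add_xor(uint32_t r1, uint32_t r2, uint32_t r3)
{
    return (uint32_t)(r1 + r2) ^ r3;
}

/* And-Rot r4, r1, r2, r3: a tiny op (and) followed by a short op (rot). */
static uint32_t and_rot(uint32_t r1, uint32_t r2, uint32_t r3)
{
    return rotl32(r1 & r2, r3);
}

/* And-Xor r4, r1, r2, r3: two tiny ops back to back. */
static uint32_t and_xor(uint32_t r1, uint32_t r2, uint32_t r3)
{
    return (r1 & r2) ^ r3;
}
```

     Each function consumes three inputs and produces one result, which is why the combined forms need a third operand field in the instruction format.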

  10. The CryptoManiac Processor
     • A 4-wide 32-bit VLIW machine with no cache and a simple branch predictor
     • Supports a triadic (three input operands) ISA that permits combining of most cryptographic operation pairs for better clock cycle utilization
     • Can be combined into chip multiprocessor configurations for improved performance on workloads with inter-session and inter-packet parallelism

     A Case Study: CryptoManiac
     [Figure: encrypt/decrypt requests enter an In Q, a request scheduler dispatches them to multiple CM Proc units that share a Key Store, and ciphertext/plaintext results leave through an Out Q. Request format: id, session, action, data…; result format: id, session, result…]
     • Efficient crypto-processor for private-key ciphers
     • Chip-multiprocessor design extracts inter-session parallelism
     • A highly specialized and efficient design
       • Crypto-specific microarchitecture, ISA, compiler, and circuits

  11. Crypto-Specific Microarchitecture
     [Figure: four-stage pipeline (IF, ID/RF, EX/MEM, WB) with BTB, IMEM, RF, four FUs, InQ/OutQ interface, data memory, and keystore interface]
     • Simple 4-wide 32-bit statically scheduled VLIW
     • No caches needed, small instruction and data RAMs
     • 16-entry BTB predicts branches
     • Resulting design is small and efficient

     Crypto-Specific Functional Unit
     [Figure: functional unit built from a logical unit {tiny: XOR, AND}, a pipelined 32-bit MUL {long}, a {short} stage with a 1K-byte SBOX cache, 32-bit adder, and 32-bit rotator, and a second logical unit {tiny: XOR, AND}]
