CryptoManiac

Slides borrowed with permission from Todd Austin and Lisa Wu
University of Michigan, Advanced Computer Architecture Laboratory

Architecture's Diminishing Return

• Staples of value we strive for…
  • High speed
  • Low power
  • Low cost
• Tricks of the trade
  • Faster clock rates, via pipelining
  • Higher instruction throughput, via ILP extraction
• Strong evidence of diminishing return, PIII vs. P4
  • 22% less P4 instruction throughput (0.35 vs. 0.45 SPECInt/MHz)
  • Less return ⇒ less value ⇒
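The 22% figure can be checked from the two SPECInt/MHz values quoted on the slide. This is only the arithmetic, assuming 0.35 is the P4 value and 0.45 the PIII value, as the slide's wording suggests:

```latex
\[
1 - \frac{0.35}{0.45} \approx 1 - 0.778 = 0.222 \approx 22\%
\]
```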
A Powerful Solution: Eschew Generality

[Figure: design spectrum. Axis labels: Speed, Efficiency, Flexibility, Programmability. Points: H/W designs, Application Specific Processors, General Purpose Processors + ISA Extensions, General Purpose Processors]

• Specialization limits the scope of a device's operation
  • Produces stronger properties and invariants
  • Results in higher return optimizations
• Programmability preserves the flexibility associated with GPPs
• A natural fit for embedded designs
  • Where application domains are more likely restrictive
  • Where cost and power are 1st-order concerns

Cryptography

• Definitions:
  • encryption vs. decryption
  • public-key cipher vs. secret-key cipher
• Public-secret key ciphers are the most commonly used

[Figure: two cipher pipelines. First: plaintext → f(x) with Public Key → ciphertext → g(x) with Private Key → plaintext. Second: plaintext → g(x) with Private Key → ciphertext → g(x) with Private Key → plaintext]
SSL Session Breakdown
Focus: Secret-Key Ciphers

[Figure: SSL session between client and server. Steps: authenticate, https get, https recv, ..., close. Annotations: public, private key, private]

[Chart: SSL Characterization by Session Length. x-axis: SSL Session Length (bytes), 1k to 32k. y-axis: Relative Contribution to Run Time, 0% to 100%. Series: Public, Other, Private. Annotation: average size of a single web object (21k)]

Benchmark Suite

Cipher     Key Size (bits)  Blk Size (bits)  Rnds/Blk  Author        Application
3DES       112              64               48        CryptSoft     SSL, SSH
Blowfish   128              64               16        CryptSoft     Norton Utilities
IDEA       128              64               8         Ascom         PGP, SSH
Mars       128              128              16        IBM           AES Candidate
RC4        128              8                1         CryptSoft     SSL
RC6        128              128              18        RSA Security  AES Candidate
Rijndael   128              128              10        Rijmen        AES Standard
Twofish    128              128              16        Counterpane   AES Candidate
Cipher Throughput Analysis

• Alpha 21264 vs. 4W
  • All except Mars and Twofish were within 10% of the actual machine tests
  • Mars 11%, Twofish 15%
• Alpha 21264 vs. DF
  • Blowfish, IDEA, and RC6 are running within 20% of DF performance
  • Mars 29%, Twofish 76%
• RC4 and Rijndael are outliers

[Chart: throughput per cipher (Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish). Series: Alpha 21264, 4W, DF. y-axis: 0.00 to 350.00]

Characteristics of Cipher Kernels

• Diffusion (goal of cryptography)
  • Goal is to randomly impress upon each group of output bits some information from each of the input bits
  • Process needs to be reversible
  • Should result in a random perturbation of each output bit with a probability > 50%
• Cipher kernel loops run about 16 times on each block of data, mixing the data more and more each round
• Cipher kernels have little to no parallelism
  • Usually a very long recurrence
Breakdown of Cipher Operations

• Rotates
  • Rotate the bits in a register
• Modular Addition
• Modular Multiplication (2^N + 1 prime modulus operations)
• Substitutions
  • Table-based substitutions
  • SBOX - a table of values indexed with plaintext (a byte) that produces the result of the key-parameterized function
• General Permutations
  • XBOX - map N bits onto N bits with any arbitrary exchange of individual bits

Blowfish Cipher Kernel

    for (ii = 0; ii < BF_ROUNDS; ii++) {
        register BF_LONG tmp;

        /* Mix in the round subkey. */
        r ^= p[ii+1];

        /* Four SBOX lookups, one per byte of l, combined with add and XOR. */
        r ^= (((sbox[(int)(l >> 24L)] +
                sbox[0x0100 + ((int)(l >> 16L) & 0xff)]) ^
                sbox[0x0200 + ((int)(l >> 8L) & 0xff)]) +
                sbox[0x0300 + ((int)(l) & 0xff)]) & 0xffffffffL;

        /* Swap the two halves for the next round. */
        tmp = r; r = l; l = tmp;
    }
    r ^= p[BF_ROUNDS+1];
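The rotate and general-permutation operations listed above can be modeled in a few lines of portable C. This is a minimal sketch, not code from the slides: the names `rotl32` and `permute32` and the 32-entry `src` map are hypothetical, and the 32-bit width is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bit positions (n is reduced modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* General permutation: output bit i takes the value of input bit src[i].
 * src is a hypothetical 32-entry map; every entry must be in 0..31. */
static uint32_t permute32(uint32_t x, const uint8_t src[32])
{
    uint32_t out = 0;
    for (unsigned i = 0; i < 32; i++) {
        assert(src[i] < 32);
        out |= ((x >> src[i]) & 1u) << i;
    }
    return out;
}
```

On a machine without a rotate instruction, `rotl32` costs two shifts and an OR, and `permute32` costs a loop over every bit, which gives some intuition for why the deck treats both as candidates for hardware support.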
Cipher Bottleneck Analysis

• Alias - impact of stalling loads in the pipeline until all earlier store addresses have been resolved
• Branch - effects of mispredictions
• Issue - impact of reducing issue width
• Mem - impact of introducing a realistic memory system
• Res - impact of limited functional unit resources
• Window - impact of a limited-size instruction window

[Chart: Analysis of Bottlenecks in Cipher Kernels. Series: Alias, Branch, Issue, Mem, Res, Window, All. Cipher labels shown: 3DES, Mars, RC4, Rijndael, Twofish. y-axis: 0 to 1]

Cipher Relative Run Time Cost
Focus: Kernel Loop

• 3DES and IDEA are small even for 16 byte sessions
• Mars, RC4, RC6, Rijndael, and Twofish drop well below 10% for 4k+ byte sessions
• Blowfish is an outlier; it drops below 10% only for 64k+ byte sessions

[Chart: one curve per cipher (Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish). x-axis: Session Length (in bytes), 16 to 1M. y-axis: 0 to 100]
Cipher Kernel Characterization

• SBOX - substitutions
• XBOX - permutations
• IDEA, Mars, RC4, and RC6 rely on arithmetic computations; benefit from more resources (multiplies) and from faster operations (rotates)
• Blowfish, 3DES, Rijndael and Twofish rely on substitutions; benefit from increased memory bandwidth and accesses

[Chart: Characterization of Cipher Kernel Operations. Stacked bars, 0% to 100%. Categories: Branch, Mov, Ld/St, Xbox, Sbox, Mult, Rotates, Logical, Arith]

Architectural Extensions

• All instructions are limited to two register input operands and one register output
• ROL and ROR (rotates) for 64 and 32-bit data types
• ROLX and RORX support a constant rotate of a register input, followed by an XOR with another register input
• MULMOD computes the modular multiplication of two register values modulo the value 0x10001
• SBOX speeds the accessing of substitution tables with 256-entry tables and 32-bit contents
• SBOXSYNC synchronizes the SBOX table with memory
• XBOX implements a portion of a full 64-bit permutation
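The semantics of two of these extensions, ROLX/RORX and MULMOD, can be sketched as plain C functions. This is an illustrative software model only: the function names are hypothetical, the 32-bit width is assumed, and `idea_mul` follows the IDEA cipher's convention that a 16-bit operand of 0 stands for 2^16. Whether the MULMOD hardware handles that convention itself is not stated in the slides.

```c
#include <assert.h>
#include <stdint.h>

/* ROLX model: rotate a left by the constant c, then XOR with b. */
static uint32_t rolx32(uint32_t a, unsigned c, uint32_t b)
{
    assert(c < 32);
    uint32_t rot = c ? (a << c) | (a >> (32u - c)) : a;
    return rot ^ b;
}

/* RORX model: rotate a right by the constant c, then XOR with b. */
static uint32_t rorx32(uint32_t a, unsigned c, uint32_t b)
{
    assert(c < 32);
    uint32_t rot = c ? (a >> c) | (a << (32u - c)) : a;
    return rot ^ b;
}

/* MULMOD model: (a * b) mod 0x10001, using a 64-bit intermediate product. */
static uint32_t mulmod_10001(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) % 0x10001u);
}

/* IDEA-style multiply on 16-bit values, where 0 represents 2^16. */
static uint16_t idea_mul(uint16_t a, uint16_t b)
{
    uint32_t x = a ? a : 0x10000u;
    uint32_t y = b ? b : 0x10000u;
    /* A result of 0x10000 truncates to 0, which again represents 2^16. */
    return (uint16_t)mulmod_10001(x, y);
}
```

In software each of these takes several dependent instructions (two shifts, an OR and an XOR for `rolx32`; a multiply and a remainder for `mulmod_10001`), which is the cost the single-instruction forms are meant to remove.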
Crypto-Specific ISA

• Frequent SBOX substitutions
  • X = sbox[(y >> c) & 0xff]
  • X = sbox[ m[ j^c] [1] ]
• SBOX instruction eliminates address generation
  • All SBOX tables are aligned to a 1k byte boundary
  • Address generation becomes zero-latency bit concatenation
• Stores to SBOX storage are not visible to later SBOX instructions until
  • An SBOXSYNC is executed
  • An alias bit is set
• SBOX instruction
  • Incorporates byte extract
  • Speeds address generation
  • Original 4-cycle operation becomes a 1-cycle CryptoManiac instruction

[Figure: SBOX address generation. Labels: Table, Index, opcode, 00, SBOX Table. Bit positions shown: 31, 10, 0 and 24, 16, 8, 0]

Crypto-Specific ISA (cont.)

• Ciphers often mix logical/arithmetic operations
  • Excellent diffusion properties plus resistance to attacks
• ISA supports instruction combining
  • Logical + ALU op, ALU op + Logical
  • Eliminates dangling XORs
  • Reduces kernel loop critical paths by nearly 25%
  • Small (< 5%) increase in clock cycle time
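The claim that 1k-byte alignment turns address generation into bit concatenation can be illustrated in C. The sketch below is an assumption-laden model, not the hardware: `sbox_addr_add`, `sbox_addr_concat`, and the `byte_sel` parameter are hypothetical names, and addresses are modeled as 32-bit integers. With 256 entries of 4 bytes each, a table spans exactly 1024 bytes, so an aligned base has its low 10 bits clear and the scaled index fits entirely within them.

```c
#include <assert.h>
#include <stdint.h>

#define SBOX_TABLE_BYTES 1024u  /* 256 entries x 4 bytes per entry */

/* Conventional address generation: extract a byte, scale it, add the base. */
static uint32_t sbox_addr_add(uint32_t base, uint32_t y, unsigned byte_sel)
{
    assert(byte_sel < 4);
    uint32_t index = (y >> (8u * byte_sel)) & 0xffu;
    return base + (index << 2);
}

/* With a 1k-aligned base, the low 10 bits of base are zero, so the scaled
 * index can be OR-ed in (concatenated) and no carry-propagating add is needed. */
static uint32_t sbox_addr_concat(uint32_t base, uint32_t y, unsigned byte_sel)
{
    assert((base & (SBOX_TABLE_BYTES - 1u)) == 0);
    assert(byte_sel < 4);
    uint32_t index = (y >> (8u * byte_sel)) & 0xffu;
    return base | (index << 2);
}
```

For any aligned base the two functions return the same address; the second form needs only wiring in hardware, which is consistent with the slide's description of zero-latency address generation.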
Performance of ISA Extensions

[Chart: performance per cipher (Blowfish, 3DES, IDEA, Mars, RC4, RC6, Rijndael, Twofish). Series: Orig/4W, Opt/4W, Opt/4W+, Opt/8W+, Opt/DF. y-axis: 0 to 4.5]

CryptoManiac ISA

• bundle := <inst><inst><inst><inst>
• inst := <operation pair><dest><operand 1><operand 2><operand 3>
• operation pair := <short><tiny> | <tiny><short> | <tiny><tiny> | <long><nop>
• tiny := <xor> | <and> | <inc> | <signext> | <nop>
• short := <add> | <sub> | <rot> | <sbox> | <nop>
• long := <mul> | <mulmod>

Instruction                Semantics
Add-Xor r4, r1, r2, r3     r4 <- (r1 + r2) ⊕ r3
And-Rot r4, r1, r2, r3     r4 <- (r1 & r2) <<< r3
And-Xor r4, r1, r2, r3     r4 <- (r1 & r2) ⊕ r3
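The three combined instructions in the semantics table can be modeled as C functions, reading the slide's AND as bitwise AND, its XOR symbol as bitwise exclusive-or, and `<<<` as rotate left. This is a sketch under those readings: the function names are hypothetical, registers are assumed to be 32 bits wide, and using only the low 5 bits of r3 as the rotate amount is an assumption the slides do not confirm.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left helper; callers must pass an amount in 0..31. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Add-Xor r4, r1, r2, r3:  r4 <- (r1 + r2) XOR r3  (addition wraps mod 2^32) */
static uint32_t add_xor(uint32_t r1, uint32_t r2, uint32_t r3)
{
    return (r1 + r2) ^ r3;
}

/* And-Rot r4, r1, r2, r3:  r4 <- (r1 AND r2) rotated left by r3.
 * Assumption: only the low 5 bits of r3 select the rotate amount. */
static uint32_t and_rot(uint32_t r1, uint32_t r2, uint32_t r3)
{
    return rotl32(r1 & r2, r3 & 31u);
}

/* And-Xor r4, r1, r2, r3:  r4 <- (r1 AND r2) XOR r3 */
static uint32_t and_xor(uint32_t r1, uint32_t r2, uint32_t r3)
{
    return (r1 & r2) ^ r3;
}
```

Each function reads three source values and produces one result, matching the `<dest><operand 1><operand 2><operand 3>` shape of `inst` in the grammar above.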
The CryptoManiac Processor

• A 4-wide 32-bit VLIW machine with no cache and a simple branch predictor
• Supports a triadic (three input operands) ISA that permits combining of most cryptographic operation pairs for better clock cycle utilization
• Can be combined into chip multiprocessor configurations for improved performance on workloads with inter-session and inter-packet parallelism

A Case Study: CryptoManiac

[Figure: encrypt/decrypt requests enter an In Q, a Request Scheduler dispatches them to multiple CM Proc units that share a Key Store, and ciphertext/plaintext results leave through an Out Q. Request Format: id, session, action, data… Result Format: id, session, result…]

• Efficient crypto-processor for private-key ciphers
• Chip-multiprocessor design extracts inter-session parallelism
• A highly specialized and efficient design
  • Crypto-specific microarchitecture, ISA, compiler, and circuits
Crypto-Specific Microarchitecture

[Figure: pipeline with stages IF, ID/RF, EX/MEM, WB. Blocks: BTB, IMEM, RF, four FUs, Data Mem, InQ/OutQ Interface, Keystore Interface]

• Simple 4-wide 32-bit statically scheduled VLIW
• No caches needed, small instruction and data RAMs
• 16-entry BTB predicts branches
• Resulting design is small and efficient

Crypto-Specific Functional Unit

[Figure: functional unit datapath. Logical Unit {tiny}: XOR, AND. {long}: Pipelined 32-Bit MUL. {short}: 32-Bit Adder, 32-Bit Rotator, 1K Byte SBOX Cache. Logical Unit {tiny}: XOR, AND]