Design Automation for Cryptography
Anupam Chattopadhyay
Assistant Professor, School of Computer Science and Engineering, School of Physical and Mathematical Sciences, Nanyang Technological University
June 7, 2017
Motivation
• Security is not a feature but a design metric
• Cryptography is highly dynamic
  – Custom cryptanalysis
  – Cryptanalysis
  – Lightweight cryptography
▪ Timeline of cryptographic competitions
[Figure: timeline from 1996 to 2017 with submission counts per competition: AES (15), NESSIE (42), CRYPTREC, eSTREAM (35), SHA-3 (59), CAESAR (58), PHC (24). Annotation: "All proposals attacked!" Legend: block ciphers, stream ciphers, hash functions, AE.]
Motivation
• Design metrics
• Security kernel developers face a huge design space
  – Variety of constraints: area footprint, power utilization, latency, operating frequency, cost, ...
  – Variety of target platforms: GPPs, DSPs, GPUs, ASICs, ASIPs, FPGAs, CPLDs, microcontrollers, ...
  – Architectural customization: word size, instruction set, memory, microarchitectural template, interfaces, ...
  – Variety of requirements: throughput, security, thermal limitations, distribution, scalability, flexibility, ...
Contents • Custom Optimization Examples • Domain-specific High Level Synthesis • Fault-resistant Design by Physical Synthesis
HC-128: Parallelization by State Splitting
• P and Q each hold 512 words of 32 bits
• 5 reads, 1 write
[Figure: memory organization of the three designs, feeding the "Update P & Key Gen." step]
• Design 1: unsplit state (P0; Q0)
• Design 2: even-odd splitting (P0, P1; Q0, Q1)
• Design 3: 4-way splitting (P0, P1, P2, P3; Q0, Q1, Q2, Q3)
A. Khalid, et al. One Word/Cycle HC-128 Accelerator via State-Splitting Optimization, in INDOCRYPT 2014
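The effect of splitting on memory port pressure can be illustrated with a small sketch. It assumes the five P reads per step are P[j], P[j-3], P[j-10], P[j-511] and P[j-12], as in the published HC-128 keystream step, and it assumes a simple interleaved mapping (word i goes to bank i mod ways). That mapping and the names `bank_of`, `reads_per_bank` and `worst_case_reads` are illustrative assumptions, not necessarily what the cited accelerator uses.

```python
from collections import Counter

STATE_WORDS = 512                    # words per table (P or Q), from the slide
READ_OFFSETS = (0, 3, 10, 511, 12)   # assumed reads: P[j], P[j-3], P[j-10], P[j-511], P[j-12]


def bank_of(index, ways):
    """Hypothetical interleaved mapping: word i lives in bank i mod ways."""
    return (index % STATE_WORDS) % ways


def reads_per_bank(j, ways):
    """Count how many of the five P reads of step j land in each bank."""
    return Counter(bank_of(j - off, ways) for off in READ_OFFSETS)


def worst_case_reads(ways):
    """Largest number of same-bank reads over all 512 positions of j."""
    return max(max(reads_per_bank(j, ways).values()) for j in range(STATE_WORDS))


for ways in (1, 2, 4):
    print(ways, "bank(s): at most", worst_case_reads(ways), "reads hit one bank")
```

Under these assumptions the worst-case number of reads competing for one memory drops from 5 (unsplit) to 3 (even-odd) to 2 (4-way), which is the kind of relief that makes a higher word rate per cycle plausible.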
HC-128: Parallelization by State Splitting
• Pipeline for Design 3
• Faraday 65 nm standard cell library, typical case
AES: Technology Mapping
• AES MixColumns multiplies the AES state byte matrix by a constant matrix over $GF(2^8)$, given by
  $$M = \begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix}$$
• The smallest circuit in the literature requires 108 XOR gates to implement this.
• This function is four instances of the following equation over $GF(2^8)$, requiring 41 LUTs in LUT6 FPGA technology:
  $$b_i = 02 \cdot a_i \oplus 03 \cdot a_{i+1} \oplus a_{i+2} \oplus a_{i+3} \quad (\text{indices mod } 4)$$
• Instead, we view the operation as a Boolean function rather than one over $GF(2^8)$, and we optimize it towards an implementation of 36 LUTs.
• Inverse MixColumns can similarly be reduced from 72 to 60 LUTs.
Joint work with Mustafa Khairallah and Thomas Peyrin, unpublished
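As a reference point for the Boolean-function view, here is a minimal sketch of MixColumns on one column, using the standard AES matrix with rows that are rotations of (02, 03, 01, 01). It also checks that the map is linear over GF(2), which is what allows treating it as a pure XOR network on 32 input bits. The helper names are illustrative; this is not the LUT-optimized mapping from the slide.

```python
def xtime(a):
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def mix_column(col):
    """Apply the MixColumns matrix to one 4-byte column."""
    out = []
    for i in range(4):
        a0, a1, a2, a3 = (col[(i + k) % 4] for k in range(4))
        # 02*a0 xor 03*a1 xor a2 xor a3, with 03*x = 02*x xor x
        out.append(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3)
    return out


def xor_cols(x, y):
    """Bytewise XOR of two columns."""
    return [p ^ q for p, q in zip(x, y)]


col = [0xDB, 0x13, 0x53, 0x45]       # widely used test column
other = [0x01, 0x02, 0x03, 0x04]     # arbitrary second column
print([hex(b) for b in mix_column(col)])   # 0x8e, 0x4d, 0xa1, 0xbc

# GF(2)-linearity: M(x xor y) == M(x) xor M(y)
print(mix_column(xor_cols(col, other)) == xor_cols(mix_column(col), mix_column(other)))
```

Because every output bit is an XOR of a fixed subset of the 32 input bits, a technology mapper is free to regroup those XORs into 6-input LUTs without respecting byte or field boundaries.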
FPGA-Aware Pipelining
[Figure panels: FPGA-aware partitioning; logic-aware partitioning]
Joint work with Mustafa Khairallah and Thomas Peyrin, unpublished
High-Level Synthesis
• Focuses on the algorithm-to-RTL flow
  × Dependent on user proficiency, varies widely from tool to tool
  × Unaware of technology platforms
  × Hard to reuse design knowledge
  × Storage allocation optimizations missing
• Domain specialization?
Berkeley Dwarfs for Parallel Computing [1]
• How apps relate to the 13 dwarfs (red: hot, blue: cool)
[Figure: heat map of the 13 dwarfs (rows) against application areas (columns: Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser)]
1. Finite State Machine
2. Combinational
3. Graph Traversal
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral (FFT)
8. Dynamic Programming
9. N-Body
10. MapReduce
11. Backtrack / B&B
12. Graphical Models
13. Unstructured Grid
[1] K. Asanovic et al., The Landscape of Parallel Computing Research: A View from Berkeley, Technical Report, 2006
SoC Processing Elements
[Figure: processing elements GPP, DSP, ASIP and FPGA (programmable versus configurable) placed on logarithmic axes for power dissipation, flexibility and performance; marked ranges 10^3 ... 10^4 and 10^5 ... 10^6]
Source: T. Noll, RWTH Aachen
Domain-specific High Level Synthesis: Lessons from Wireless Communication IP
CRYKET: Overview
• CRYKET (Cryptographic Kernels Toolkit): domain-specific HLS
  – Language-independent, GUI-based design capture
  – Domain-specific expertise, well-understood kernels
[Figure: CRYKET flow. Inputs: algorithmic specifications, architectural specifications, test vectors, CRYKET library. Generated outputs: synthesis scripts, test bench, Verilog RTL model, ANSI C model, verification. Downstream steps: logic synthesis, RTL simulation, system simulation, software integration, system validation.]
A. Khalid, et al. RAPID-FeinSPN: A Rapid Prototyping Framework for Feistel and SPN-Based Block Ciphers. ICISS 2013
RunFein: Feistel and SPN Block Cipher
▪ Block/key/word sizes, rounds, mode of operation, test vectors
▪ Layers of operation: S/P-box, bitwise/arithmetic/Boolean/field operations, compound popular cipher operations
[Figure: two cipher templates. Feistel network cipher: key and plaintext inputs, key register, L and R data registers, round key K_i, key update, layers 0 to 3 with rearrange steps and S-boxes S_1 ... S_n. SPN cipher: key and plaintext inputs, key register, data register, round key K_i, key update, layers 0 to 4 with S-boxes S_1 ... S_n, GF Mul and P-box.]
A. Khalid, et al. RunFein: A Rapid Prototyping Framework for Feistel and SPN Based Block Ciphers, JCEN 2016
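To make the layered view concrete, the sketch below describes an SPN cipher as an ordered list of layer functions applied once per round, using PRESENT-80 as the example (S-box layer, bit permutation layer, and the 80-bit key update). The list `ROUND_LAYERS` and the function names are illustrative assumptions; this is not RunFein's input format or generated code.

```python
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
MASK80 = (1 << 80) - 1


def sbox_layer(state):
    """Apply the 4-bit S-box to each of the 16 nibbles of the 64-bit state."""
    out = 0
    for nib in range(16):
        out |= SBOX[(state >> (4 * nib)) & 0xF] << (4 * nib)
    return out


def p_layer(state):
    """Move bit i to position 16*i mod 63 (bit 63 stays in place)."""
    out = 0
    for i in range(64):
        target = 63 if i == 63 else (16 * i) % 63
        out |= ((state >> i) & 1) << target
    return out


def key_update(key, round_counter):
    """80-bit key update: rotate left by 61, S-box on top nibble, add counter."""
    key = ((key << 61) | (key >> 19)) & MASK80
    key = (SBOX[key >> 76] << 76) | (key & ((1 << 76) - 1))
    return key ^ (round_counter << 15)


# Data-path layers applied after the round key addition, in order.
ROUND_LAYERS = [sbox_layer, p_layer]


def encrypt(plaintext, key, rounds=31):
    """Run the layered round function, then the final key addition."""
    state = plaintext
    for rnd in range(1, rounds + 1):
        state ^= key >> 16            # round key = 64 leftmost key bits
        for layer in ROUND_LAYERS:
            state = layer(state)
        key = key_update(key, rnd)
    return state ^ (key >> 16)


print(hex(encrypt(0, 0)))
```

Swapping entries in `ROUND_LAYERS`, or changing the block, key and round parameters, is the kind of variation a layer-based description makes cheap to explore.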
RunFein: Fast Design Space Exploration
[Figure: microarchitecture options for the round datapath between the plaintext input and the data registers: loop folded (no unrolling), N times unrolled, N times unrolled with pipelining, bit slicing (round split into steps 1a, 1b, 1c), sub-pipelined round]
A. Khalid, et al. RunFein: A Rapid Prototyping Framework for Feistel and SPN Based Block Ciphers, JCEN 2016
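A first-order cycle count shows how these options trade latency against hardware. The sketch below is a rough model, not RunFein's estimator: it assumes a fixed clock frequency regardless of unrolling (optimistic, since unrolling lengthens the critical path), and the parameter names `unroll`, `steps_per_round` and `pipelined` are hypothetical. The 32-round, 64-bit example values are placeholders.

```python
import math


def cycles_per_block(rounds, unroll=1, steps_per_round=1):
    """Cycles to process one block.

    unroll: number of rounds computed per cycle (1 = loop folded).
    steps_per_round: cycles spent on one round when it is bit-sliced.
    """
    return math.ceil(rounds * steps_per_round / unroll)


def throughput_bps(block_bits, clock_hz, rounds, unroll=1,
                   steps_per_round=1, pipelined=False):
    """Throughput in bits per second under the fixed-clock assumption."""
    if pipelined:
        # Fully unrolled with pipeline registers: one block per cycle
        # once the pipeline is full.
        return block_bits * clock_hz
    return block_bits * clock_hz / cycles_per_block(rounds, unroll, steps_per_round)


CLOCK_HZ = 100_000   # example clock
for name, kwargs in [
    ("loop folded", {}),
    ("4x unrolled", {"unroll": 4}),
    ("bit sliced, 3 steps per round", {"steps_per_round": 3}),
    ("fully unrolled, pipelined", {"pipelined": True}),
]:
    print(name, throughput_bps(64, CLOCK_HZ, 32, **kwargs))
```

A real exploration would pair each option with the synthesized area, power and achievable clock, which this model leaves out.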
RunFein: GUI
RunFein Cipher Design Capture
[Figure: design capture and generation flow]
• GUI entry: known cipher?
  – Yes: load a known configuration (PRESENT-80/128, AES-128, KLEIN-64/80/96, LED-64/128, SEA, TEA, XTEA, XXTEA, ...)
  – No: enter algorithmic parameters (basic, round layers and kround layers), microarchitecture configuration and test vectors (.xml)
• Validate (pass/fail), using the kernel library, controller state and validation checks
• Cipher model creation after validation of layers: d_state and k_state update, layers 0 ... n and 0 ... m, controller and datapath
• Generate, using the nodes library
  – Software (.c, .h, scripts): ANSI C algorithmic implementation, verification environment, profiling switches
  – Hardware (.v, .h, scripts): HDL for controller and datapath, testbench, synthesis scripts
• Downstream use
  – Software: test vector verification, throughput profiling, NIST test suite
  – Hardware: RTL/gate-level simulation, synthesis, switching power estimation
RunFein: PRESENT-80 Bitslicing
• Faraday 65 nm standard cell library, typical case, operating frequency 100 kHz
• Faraday 65 nm standard cell library, typical case, Synopsys Design Compiler F-2011.09
RunStream: Analysis and Results
[Figure: result plot, with an arrow marking the "better" direction]
A. Khalid, et al. RunStream: A High-level Rapid Prototyping Framework for Stream Ciphers. In ACM TECS, 2016
Preventing Differential Fault Analysis Attack
• Attacker assumptions
  – Ability to induce a fault with a given time (𝑇) and space (𝑆) precision
  – Ability to infer/solve a system of equations based on the observed faulty (𝐶𝑇∗) and correct (𝐶𝑇) ciphertexts
• Prevention
  – Redundancy, concurrent error detection
  – Attack-specific mounted sensor
DFARPA: Differential Fault Attack Resistant Physical Design Automation
• Exemplary fault attack on AES [1]
  – Multi-byte fault attack model
  – Fault is induced in at least one of the four diagonals of the AES state:
      D0 D1 D2 D3
      D3 D0 D1 D2
      D2 D3 D0 D1
      D1 D2 D3 D0
• Solution: generate a floorplan for the 16 blocks so that the fault becomes unexploitable in the presence of a fault cluster of radius 𝑟 units
  – Place the blocks (elements) of each diagonal at least 𝑟 units apart
  – Formulated as a constrained placement problem
[1] D. Saha, D. Mukhopadhyay, and D. RoyChowdhury. A diagonal fault attack on the advanced encryption standard. Cryptology ePrint Archive, Report 2009/581, 2009. http://eprint.iacr.org/2009/581
Joint work with Debdeep Mukhopadhyay and Shivam Bhasin, unpublished
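The placement constraint can be checked mechanically. The sketch below is a checker only, not the DFARPA placer: it assumes each of the 16 blocks is represented by the coordinates of its center, that distance is Euclidean, and that the block at state position (row, col) belongs to diagonal (col - row) mod 4, as read off the diagonal layout on the slide. The names `diagonal_of` and `violations` are hypothetical.

```python
from itertools import combinations
from math import dist


def diagonal_of(row, col):
    """Diagonal index of the state byte at (row, col): D0 is the main diagonal."""
    return (col - row) % 4


def violations(placement, radius):
    """Return same-diagonal block pairs placed closer than the cluster radius.

    placement maps (row, col) -> (x, y) block-center coordinates.
    """
    bad = []
    for a, b in combinations(sorted(placement), 2):
        same_diagonal = diagonal_of(*a) == diagonal_of(*b)
        if same_diagonal and dist(placement[a], placement[b]) < radius:
            bad.append((a, b))
    return bad


# Naive floorplan: blocks laid out exactly like the 4x4 state, unit pitch.
grid = {(r, c): (float(c), float(r)) for r in range(4) for c in range(4)}

print(len(violations(grid, 1.4)))   # same-diagonal neighbours are sqrt(2) apart
print(len(violations(grid, 1.5)))   # a slightly larger radius breaks the naive layout
```

A constrained placer would search for coordinates that make `violations` empty for the target radius while still meeting area and wirelength goals.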
DFARPA: Reactive Countermeasure
• The sensor is composed of two key components: a watchdog ring oscillator (WRO) and a phase detection (PD) circuit.
• High-energy injections affect signal propagation delay, which disturbs the phase of the WRO. The PD circuit detects this phase change, raises an alarm and halts sensitive computation.
• Two metal layers are reserved for WRO routing.
Joint work with Debdeep Mukhopadhyay and Shivam Bhasin, unpublished
Conclusion and Outlook
• Conclusion
  – Domain-specific HLS can improve design efficiency and productivity
  – Different phases of EDA can integrate cryptographic and cryptanalytic know-how to improve the resulting designs
• Outlook
  – Integrating (more) custom optimizations
  – Diverse technology/platform-specific constraints
  – Diverse cipher families
  – Integrating automated SCA and DFA protection
Thank you! Questions?