ACACES 2018 Summer School GPU Architectures: Basic to Advanced Concepts Adwait Jog, Assistant Professor College of William & Mary (http://adwaitjog.github.io/)
Course Outline
• Lectures 1 and 2: Basic Concepts
  ● Basics of GPU Programming
  ● Basics of GPU Architecture
• Lecture 3: GPU Performance Bottlenecks
  ● Memory Bottlenecks
  ● Compute Bottlenecks
  ● Possible Software and Hardware Solutions
• Lecture 4: GPU Security Concerns
  ● Timing channels
  ● Possible Software and Hardware Solutions
Era of Heterogeneous Architectures Intel Coffee Lake and AMD Raven Ridge Kaby Lake
Discrete GPUs
Discrete GPUs + Intel Processors
Security Concerns
• GPUs may be accelerating applications that use sensitive user data (e.g., genomics, financial).
• GPUs may be accelerating cryptographic applications (e.g., AES, RSA) and authentication algorithms on behalf of CPUs.
• Given the popularity of GPUs, it is imperative to keep GPUs secure against a variety of side-channel attacks and other security vulnerabilities.
Security Attacks
• A user’s web activity on the GPU can be tracked by a malicious attacker who is co-located on the same card [Oakland’14]
• AES keys can be recovered by correlation timing attacks [HPCA’16]
• Accelerating attacks via GPUs [Oakland’18]
  ● Glitch: Accelerating row hammer attacks
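Correlation Timing Attacks
[Figure: an outside attacker sends plaintexts to a server running on the GPU, receives the ciphertexts, and records the time duration of each encryption.]
  Plaintext # 1 → Ciphertext # 1, time 1
  Plaintext # 2 → Ciphertext # 2, time 2
  Plaintext # 3 → Ciphertext # 3, time 3
  …
Each duration is measured as time stop - time start (e.g., time 1). The attacker then tests key guesses K_1, K_2, …, K_i, … against the recorded ciphertexts and times to find the correct key.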
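Memory Access Coalescing in GPUs
[Figure: a computing unit with a wavefront pool, a scheduler, and an LD/ST unit containing the coalescing unit. A wavefront consists of Thread # 1 to Thread # 32; its memory requests pass through the coalescing unit before reaching global memory.]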
Memory Access Coalescing in GPUs
[Figure: a wavefront of four threads (tid = thread id), each issuing its own block request.]
  tid = 0 → address 0x00 → Block Address # 0 (0x00 to 0x03)
  tid = 1 → address 0x04 → Block Address # 1 (0x04 to 0x07)
  tid = 2 → address 0x07 → Block Address # 1 (0x04 to 0x07)
  tid = 3 → address 0x09 → Block Address # 2 (0x08 to 0x0B)
Memory Access Coalescing in GPUs
[Figure: with the coalescing unit, the same four requests (0x00, 0x04, 0x07, 0x09) are merged into three block accesses: Block Address # 0 (0x00 to 0x03), Block Address # 1 (0x04 to 0x07), and Block Address # 2 (0x08 to 0x0B).]
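The coalescing example can be sketched in a few lines of Python. This is only an illustration of the counting, not a model of real hardware: the function name `count_coalesced_accesses` is hypothetical, and the 4-byte block size is taken from the figure (real GPUs use much larger blocks).

```python
def count_coalesced_accesses(addresses, block_size=4):
    """Count the distinct memory blocks touched by one wavefront.

    block_size=4 mirrors the figure (blocks 0x00-0x03, 0x04-0x07, ...);
    it is an illustrative value, not a real GPU parameter.
    """
    blocks = {addr // block_size for addr in addresses}
    return len(blocks)


# The four threads of the example wavefront.
wavefront = [0x00, 0x04, 0x07, 0x09]

uncoalesced = len(wavefront)                     # one request per thread
coalesced = count_coalesced_accesses(wavefront)  # one request per block
print(uncoalesced, coalesced)  # 4 3
```

Threads 1 and 2 fall into the same block, which is why the coalesced count is one lower than the thread count.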
AES implementation on GPU
• Symmetric encryption with a 128-bit key and 10 rounds.
• S-box implementation involves table lookups.
• [Jiang/Fei/Kaeli, HPCA’16] demonstrated that the last round is vulnerable.
Last Round of AES on GPU
$$c_j^{tid} = T_4[t_i^{tid}] \oplus k_j$$
[Figure: the input text to the last round has one line per thread. Thread # n (n = 1 to 32) sends Request # n for the table lookup $T_4[t_{in}]$ through the coalescing unit, receives Reply # n, and XORs it with the key byte $k_j$ to produce the ciphertext byte $c_{jn}$ of LINE # n.]
Correlation Timing Attack on GPU
• Goal of the attack: recover the AES key, byte by byte
• The last round of AES is vulnerable: $c_j^{tid} = T_4[t_i^{tid}] \oplus k_j$
• The last round is invertible: $t_i^{tid} = T_4^{-1}[c_j^{tid} \oplus k_j]$, which gives the memory access of thread tid
How can an attacker calculate the number of coalesced accesses?
Attacker calculates the # of coalesced accesses
$$t_i^{tid} = T_4^{-1}[c_j^{tid} \oplus k_j]$$
[Figure: for a key-byte guess $k_{jm}$, the attacker XORs each ciphertext byte $c_{j1}, c_{j2}, \ldots, c_{j32}$ with $k_{jm}$ and applies $T_4^{-1}$ to obtain the guessed table lookup indices $t_{i1,m}, t_{i2,m}, \ldots, t_{i32,m}$. Coalescing the guessed indices gives the number of coalesced accesses $A_{jm,n}$, which is used to test whether $k_{jm}$ is the correct value of the key byte.]
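A minimal sketch of this step, under stated assumptions: the byte-level effect of the last-round lookup is modelled as the standard AES S-box (so the inverse lookup is the inverse S-box), and 16 table entries are assumed to share one memory block. The names `guessed_accesses` and `ENTRIES_PER_BLOCK` are hypothetical, not taken from the slides.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Build the AES S-box: field inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the multiplicative inverse of x
                inv = gf_mul(inv, x)
        s = 0x63
        for shift in range(5):  # inv XOR its left-rotations by 1..4
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s)
    return sbox


SBOX = build_sbox()
INV_SBOX = [0] * 256
for index, value in enumerate(SBOX):
    INV_SBOX[value] = index

ENTRIES_PER_BLOCK = 16  # assumption: 16 table entries per memory block


def guessed_accesses(cipher_bytes, key_guess):
    """Estimate coalesced accesses for one key-byte guess.

    cipher_bytes holds byte j of each thread's ciphertext line.
    """
    indices = [INV_SBOX[c ^ key_guess] for c in cipher_bytes]
    return len({t // ENTRIES_PER_BLOCK for t in indices})
```

With the correct key byte, the guessed indices equal the real lookup indices, so the estimate matches what the coalescing unit actually did; a wrong guess yields an unrelated count.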
Coalesced Accesses and Execution Time Associate the number of coalesced accesses with execution time
Finding the Correct Key Value
• Attacker encrypts N plaintexts on the server
  ● Records the ciphertext and execution time of each
• Recorded execution times: E_1, E_2, ..., E_N

  Key guess    # of coalesced accesses                   Correlation
  0            A_{j0,1}, A_{j0,2}, ..., A_{j0,N}         Corr_{j0}
  1            A_{j1,1}, A_{j1,2}, ..., A_{j1,N}         Corr_{j1}
  ...          ...                                       ...
  α            A_{jα,1}, A_{jα,2}, ..., A_{jα,N}         Corr_{jα}
  ...          ...                                       ...
  255          A_{j255,1}, A_{j255,2}, ..., A_{j255,N}   Corr_{j255}

The key guess α with the maximum correlation is taken as the correct key byte.
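The selection step can be sketched as follows. The helper names `pearson` and `recover_key_byte` are hypothetical, and the demo data is synthetic: it assumes execution time grows linearly with the number of coalesced accesses plus Gaussian noise, which is a simplification for illustration rather than a measured result.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def recover_key_byte(access_counts, exec_times):
    """access_counts[m][n]: estimated accesses for key guess m, sample n.

    Returns the guess with the maximum correlation and all correlations.
    """
    correlations = [pearson(row, exec_times) for row in access_counts]
    best = max(range(len(correlations)), key=correlations.__getitem__)
    return best, correlations


# Synthetic demo (assumed timing model, not measured data).
rng = random.Random(0)
num_samples = 200
secret_guess = 42  # hypothetical correct key byte
counts = [[rng.randint(1, 16) for _ in range(num_samples)] for _ in range(256)]
times = [10.0 * a + rng.gauss(0.0, 5.0) for a in counts[secret_guess]]

best_guess, corrs = recover_key_byte(counts, times)
print(best_guess, round(corrs[best_guess], 3))
```

In a real attack each row of `access_counts` would come from inverting the last round under that key guess, and `exec_times` from the recorded encryption times.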
Simulating Timing Attack on our Set-up
[Figure labels: Correct guess, Incorrect guesses]
Why is Correlation Timing Attack possible?
• The baseline attack leverages the deterministic nature of the coalescing mechanism
• AES key value affects the coalesced accesses
• # coalesced accesses affects the execution time
How to mitigate Correlation Timing Attacks on GPU?
Answer: By making it harder for the attacker to correctly calculate the number of coalesced accesses
Naïve Solution
• Disable coalescing altogether?
  ● Correlation drops to ~0
  ● Correct key byte is indistinguishable
  [Figure label: Correct guess]
• Up to 178% performance degradation
  ● Degradation increases with plaintext size
Naïve solution is good for security, bad for performance. It offers no tradeoff.
RCoal to mitigate the correlation timing attacks
• Targets the deterministic nature of the coalescing mechanism
• Fixed number of subwarps (or subwavefronts)
• Fixed sizes of subwarp (or subwavefronts)
• Deterministic mapping of the thread elements to subwarps (or subwavefronts)
RCoal: Fixed Sized Subwarp (FSS)
[Figure: threads tid = 0 to 3 access 0x00, 0x04, 0x07, 0x09; sid = subwarp id.]
  DEFAULT (number of subwarps = 1): all four threads are in sid = 0, and the coalescing unit issues 3 block accesses (0x00 to 0x03, 0x04 to 0x07, 0x08 to 0x0B).
  FSS (number of subwarps = 2): each subwarp is coalesced separately, giving 4 block accesses (0x00 to 0x03 and 0x04 to 0x07 for sid = 0; 0x04 to 0x07 and 0x08 to 0x0B for sid = 1).
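The FSS example can be reproduced with a short sketch. The function name `fss_accesses` is hypothetical, consecutive threads are assumed to share a subwarp, and the 4-byte block size again follows the figure rather than real hardware.

```python
def fss_accesses(addresses, num_subwarps, block_size=4):
    """Count block accesses when each fixed-size subwarp coalesces separately.

    Assumes consecutive threads are grouped into the same subwarp.
    """
    if len(addresses) % num_subwarps != 0:
        raise ValueError("wavefront size must be divisible by num_subwarps")
    size = len(addresses) // num_subwarps
    total = 0
    for start in range(0, len(addresses), size):
        subwarp = addresses[start:start + size]
        total += len({addr // block_size for addr in subwarp})
    return total


wavefront = [0x00, 0x04, 0x07, 0x09]
for n in (1, 2, 4):
    print(n, fss_accesses(wavefront, n))
# 1 3
# 2 4
# 4 4
```

With one subwarp the result is the default coalescing behaviour; with as many subwarps as threads, coalescing is effectively disabled.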
FSS Security against Baseline Attack • Correlation between the number of coalesced accesses and the execution time drops • Correct key byte is harder to find • Improved security
FSS Performance
• Memory accesses increase with the number of subwarps
• Execution time increases with the number of subwarps
• Performance degrades as the number of subwarps increases
Can the attacker still recover the AES key?
FSS against FSS attack
• Attacker can figure out the number of subwarps
• Attacker can calculate per-subwarp accesses
[Figure label: Correct guess]
FSS against FSS attack
• Attack is possible when the attacker can figure out the number of subwarps!
  ● Coalescing is still deterministic
RCoal to mitigate the correlation timing attacks
• Targets the deterministic nature of the coalescing mechanism
• Fixed number of subwarps
• Fixed sizes of subwarp
• Deterministic mapping of the thread elements to subwarps
RCoal: Random Sized Subwarp (RSS)
• Size distribution
  Skewed Distribution
  • Mean of the distribution is different from FSS
  • Large subwarp offers better coalescing
  • Improved security compared to FSS
  • Improved performance compared to FSS
  Normal Distribution
  • Mean of the distribution is the same as FSS
  • Security and performance similar to FSS
We select RSS with Skewed Distribution
RCoal to mitigate the correlation timing attacks
• Targets the deterministic nature of the coalescing mechanism
• Fixed number of subwarps
• Fixed sizes of subwarp
• Deterministic mapping of the thread elements to subwarps
RCoal: Random-Threaded Subwarp (RTS)
[Figure: threads tid = 0 to 3 access 0x00, 0x01, 0x06, 0x07; sid = subwarp id.]
  FSS (number of subwarps = 2): consecutive threads share a subwarp (sid = 0, 0, 1, 1), so the coalescing unit issues one access to block 0x00 to 0x03 and one to block 0x04 to 0x07.
  FSS+RTS (number of subwarps = 2): threads are mapped to the two equal-sized subwarps at random, so a subwarp can touch both blocks and the number of block accesses can increase.
RCoal: Random-Threaded Subwarp (RTS)
[Figure: threads tid = 0 to 3 access 0x00, 0x01, 0x06, 0x08; sid = subwarp id; memory blocks 0x00 to 0x03, 0x04 to 0x07, 0x08 to 0x0B.]
  RSS (number of subwarps = 2): the two subwarps have random sizes, and threads are assigned in order.
  RSS+RTS (number of subwarps = 2): the two subwarps have random sizes, and threads are also mapped to them at random.
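The four RCoal variants can be sketched with one function and two switches. This is an illustration under stated assumptions: the names `subwarp_sizes` and `rcoal_accesses` are hypothetical, the random sizes are drawn by picking uniform cut points (a stand-in, not the skewed distribution selected in the slides), and the 4-byte block size follows the figures.

```python
import random


def subwarp_sizes(num_threads, num_subwarps, random_sizes, rng):
    """Return subwarp sizes that sum to num_threads.

    Random sizes use uniform cut points; this is an assumed stand-in,
    not the skewed distribution from the slides.
    """
    if not random_sizes:
        if num_threads % num_subwarps != 0:
            raise ValueError("num_threads must be divisible by num_subwarps")
        return [num_threads // num_subwarps] * num_subwarps
    cuts = sorted(rng.sample(range(1, num_threads), num_subwarps - 1))
    bounds = [0] + cuts + [num_threads]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def rcoal_accesses(addresses, num_subwarps, random_sizes=False,
                   random_threads=False, rng=None, block_size=4):
    """Count block accesses for FSS, RSS, FSS+RTS, or RSS+RTS."""
    rng = rng or random.Random()
    order = list(range(len(addresses)))
    if random_threads:          # RTS: shuffle the thread-to-subwarp mapping
        rng.shuffle(order)
    sizes = subwarp_sizes(len(addresses), num_subwarps, random_sizes, rng)
    total, start = 0, 0
    for size in sizes:          # each subwarp is coalesced on its own
        members = order[start:start + size]
        total += len({addresses[t] // block_size for t in members})
        start += size
    return total


wavefront = [0x00, 0x01, 0x06, 0x07]
rng = random.Random(1)
print("FSS     ", rcoal_accesses(wavefront, 2))
print("FSS+RTS ", rcoal_accesses(wavefront, 2, random_threads=True, rng=rng))
print("RSS     ", rcoal_accesses(wavefront, 2, random_sizes=True, rng=rng))
print("RSS+RTS ", rcoal_accesses(wavefront, 2, random_sizes=True,
                                 random_threads=True, rng=rng))
```

Because the randomized variants change the access count from run to run, an attacker who replays the deterministic calculation no longer obtains the count the hardware actually produced.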
Evaluation Set-up
• AES-128
• Plaintext with 32 lines
• GPGPU-Sim
  ● 15 SMs, 32 threads/warp, one subwarp per coalescing unit (base case)
  ● GDDR5 memory with 6 MCs, 16 DRAM banks, 4 bank groups/MC
• Enhanced Attack Algorithms
  ● Corresponding Attacks
Performance/Security Trade-off
[Figure, panel 1 (Security): Correlation (lower is better) vs. Number of Subwarps (1, 2, 4, 8, 16, 32) for FSS, FSS+RTS, RSS, RSS+RTS.]
[Figure, panel 2 (Execution Time): Execution Time (lower is better) vs. Number of Subwarps (1, 2, 4, 8, 16, 32) for FSS, FSS+RTS, RSS, RSS+RTS.]
Offers Security/Performance Trade-off