
Profiled Side-Channel Analysis: Guilherme Perin, Lukasz Chmielewski



  1. Conference on Cryptographic Hardware and Embedded Systems (CHES) 2020. Strength in Numbers: Improving Generalization with Ensembles in Machine Learning-based Profiled Side-Channel Analysis. Guilherme Perin, Lukasz Chmielewski, Stjepan Picek.

  2. Contributions
  • Analysis of output class probabilities (predictions)
  • Using proper metrics for profiled SCA with deep learning
  • Improving generalization in DL-based profiled SCA:
    – Ensembles: combining multiple NN models into a stronger model

  3. DL-based profiled SCA
  [Diagram] Profiling traces and validation traces (known key) from Device A (AES) feed the learning algorithm (DNN). With good (enough) generalization, the trained DNN applied to attack traces (unknown key) from Device B (AES) recovers the key.

  4. “... Improving Generalization ...”
  • If (n-order) SCA leakages are there, we can improve generalization by:
    – Using a small NN model (implicitly regularized)
    – Using a large NN model with (explicit) regularization (dropout, data augmentation, noise layers, batch normalization, weight decay, etc.)
    – Being precise about training time/epochs (early stopping)
    – Or using ensembles.

  5. DL-based SCA is (mostly) about hyperparameters
  • No points-of-interest selection
  • Less sensitive to trace desynchronization (CNN)
  • Implements high-order profiled SCA
  • Allows visualization techniques
  • Enables evaluation of more secure products
  • Work in progress:
    – Creating a good DL model is difficult: efficient and automated hyperparameter tuning is not solved yet for SCA
    – SCA is already costly by itself: adding hyperparameter tuning can render DL-based SCA impractical

  6. DL-based SCA is (also) about metrics
  • Accuracy, loss, recall, precision: not very consistent for SCA, which aggregates over multiple test traces
  • Success Rate (SR) and Guessing Entropy (GE): SCA-specific metrics, requiring a custom loss/error function in Keras/TensorFlow
  [Diagram] Traces → Predictions → SCA key rank → metrics (GE, SR). What can we learn here?
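As a sketch of how these two SCA metrics behave, guessing entropy and success rate can be computed from per-trace key log-likelihoods. The array shapes and function names below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def key_rank(scores, true_key):
    """Position of the true key when all 256 guesses are sorted by score (0 = best)."""
    order = np.argsort(scores)[::-1]          # key guesses, best score first
    return int(np.where(order == true_key)[0][0])

def guessing_entropy(log_probs, true_key, n_traces, n_attacks=50, seed=0):
    """Average rank of the true key over randomly drawn attack sets.

    log_probs: (n_total_traces, 256) array of log p(key | trace).
    """
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(n_attacks):
        idx = rng.choice(len(log_probs), size=n_traces, replace=False)
        ranks.append(key_rank(log_probs[idx].sum(axis=0), true_key))
    return float(np.mean(ranks))

def success_rate(log_probs, true_key, n_traces, n_attacks=50, seed=0):
    """Fraction of attack sets for which the true key is ranked first."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_attacks):
        idx = rng.choice(len(log_probs), size=n_traces, replace=False)
        hits += key_rank(log_probs[idx].sum(axis=0), true_key) == 0
    return hits / n_attacks
```

Unlike per-trace accuracy, both metrics operate on the accumulated scores of many attack traces, which is why they track attack success more faithfully.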

  7. Results on Masked AES (MLP)
  Attacking 1 key byte with the HW model. [Figure: predictions, i.e., output class probabilities]

  8. Output Class Probabilities
  Example: HW model of 1 byte on AES (S-box output). The classifier outputs a matrix Q of class probabilities with one row per test trace and one column per class/label (HW = 0, ..., HW = 8):

  Q = [ q_{j,l} ],  j = 0, ..., N-1 (test traces),  l = 0, ..., 8 (HW classes)

  q_{j,l} = probability that trace j has label (HW) l
  l = HW(Sbox(pt_j ⊕ k)) (leakage or selection function, for key byte guess k)

  9. Summation: Key Rank
  For a key guess k, each trace j gets a label l_j according to the selection function:
  Label(0) = HW(Sbox(pt_0 ⊕ k)) = 3
  Label(1) = HW(Sbox(pt_1 ⊕ k)) = 6
  Label(2) = HW(Sbox(pt_2 ⊕ k)) = 2
  ...
  Label(N-1) = HW(Sbox(pt_{N-1} ⊕ k)) = 4
  The corresponding entry is selected from each row of Q and the log probabilities are summed:
  Q(k) = Σ_{j=0}^{N-1} log q_{j,l_j} = log q_{0,3} + log q_{1,6} + log q_{2,2} + ... + log q_{N-1,4}
  Recovered key: argmax_k [Q(0), Q(1), ..., Q(255)]
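The labeling and log-probability summation above can be sketched in NumPy. The S-box below is a stand-in bijective table and the prediction matrix Q is simulated; a real attack would use the AES S-box and the softmax outputs of a trained model:

```python
import numpy as np

# Stand-in bijective 8-bit S-box (a real attack would use the AES S-box table).
SBOX = np.array([(7 * x + 3) % 256 for x in range(256)], dtype=np.uint8)

def hw(x):
    """Hamming weight of each byte: the 9-class label."""
    return np.unpackbits(np.asarray(x, dtype=np.uint8)[..., None], axis=-1).sum(axis=-1)

def key_guess_scores(Q, plaintexts):
    """Q(k) = sum_j log q_{j, l_j} for every key guess k, with l_j = HW(Sbox(pt_j XOR k))."""
    n_traces = len(plaintexts)
    scores = np.empty(256)
    for k in range(256):
        labels = hw(SBOX[plaintexts ^ k])             # l_j under this key guess
        scores[k] = np.log(Q[np.arange(n_traces), labels]).sum()
    return scores

# Toy demo: simulate a classifier that concentrates probability on the true label.
rng = np.random.default_rng(1)
true_key = 0x2B
pts = rng.integers(0, 256, size=200).astype(np.uint8)
true_labels = hw(SBOX[pts ^ true_key])
Q = np.full((200, 9), 0.05)                            # illustrative probabilities
Q[np.arange(200), true_labels] = 0.6
recovered = int(np.argmax(key_guess_scores(Q, pts)))   # should equal true_key
```

Only the correct key guess selects the high-probability entry in (most of) the rows, so its summed log probability dominates all 255 wrong guesses.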

  10. Summation: Key Rank
  Test accuracy is 100%: the entry selected by the key guess (in brackets) is always the highest value in its row.

              HW=0   HW=1   HW=2   HW=3   HW=4   HW=5   HW=6   HW=7   HW=8
  trace 0     0.01   0.02   0.08  [0.51]  0.20   0.25   0.01   0.02   0.01
  trace 1     0.02   0.01   0.06   0.14   0.15   0.20  [0.46]  0.02   0.05
  trace 2     0.01   0.01  [0.64]  0.08   0.22   0.10   0.02   0.02   0.01
  ...
  trace N-1   0.01   0.01   0.02   0.25  [0.51]  0.08   0.20   0.02   0.01

  Q(k) = Σ_{j=0}^{N-1} log q_{j,l_j} = log 0.51 + log 0.46 + log 0.64 + ... + log 0.51

  11. Summation: Key Rank
  Test accuracy is 27%: the entry selected by the key guess (in brackets) is NOT always the highest value in its row.

              HW=0   HW=1   HW=2   HW=3   HW=4   HW=5   HW=6   HW=7   HW=8
  trace 0     0.01   0.02  [0.36]  0.08   0.20   0.51   0.01   0.02   0.01
  trace 1     0.02   0.01   0.06   0.14   0.46   0.20   0.02  [0.26]  0.05
  trace 2     0.01   0.01   0.08  [0.64]  0.22   0.10   0.02   0.02   0.01
  ...
  trace N-1   0.01   0.01   0.02   0.51   0.08  [0.36]  0.20   0.02   0.01

  Q(k) = Σ_{j=0}^{N-1} log q_{j,l_j} = log 0.36 + log 0.26 + log 0.64 + ... + log 0.36

  12. Rank of Class Probabilities
  For each test trace, the class probability selected by a key guess has a rank within its row: rank 1 for max(q_{j,0}, ..., q_{j,8}) down to rank 9 for min(q_{j,0}, ..., q_{j,8}).
  [Figure: histograms of class-probability ranks for the correct key candidate (accuracy 0.29) vs. incorrect key candidates (accuracy 0.27), keys ordered by accuracy]
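The per-trace rank statistic described above can be sketched directly; the function name and shapes are illustrative, not the paper's code:

```python
import numpy as np

def class_probability_ranks(Q, labels):
    """For each trace j, the rank (1 = largest in the row, 9 = smallest) of the
    class probability q_{j, labels_j} selected by a key guess."""
    order = np.argsort(-Q, axis=1)             # per row: column indices, largest prob first
    ranks = np.empty(len(Q), dtype=int)
    for j, lab in enumerate(labels):
        ranks[j] = int(np.where(order[j] == lab)[0][0]) + 1
    return ranks
```

Plotting a histogram of these ranks for the correct key guess versus the wrong ones reproduces the kind of comparison shown on this slide.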

  13. Rank of Class Probabilities
  Low ranks: the summation for the correct key guess k is pushed up. High ranks: the summation is pushed down, and these few small probabilities have a large influence on the correct Q(k), while the bulk of low-ranked entries each have a small influence.
  [Figure: same rank histograms, correct key candidate (accuracy 0.29) vs. incorrect key candidates (accuracy 0.27)]

  14. Results on Leaky AES (MLP)
  Attacking 1 key byte with the HW model. Output class probabilities are pushed towards ranks 1 and 2; there are no test traces with high-ranked probabilities.
  [Figure: rank histogram, accuracy 0.48]

  15. Results on Masked AES (MLP)
  Attacking 1 key byte with the HW model. Output class probabilities are pushed towards ranks 1 and 2; only a few test traces have high-ranked probabilities, and key recovery is successful.
  [Figure: rank histogram, accuracy 0.22]

  16. Two CNN models on masked AES
  [Figures: a CNN with 4 hidden layers vs. a CNN with 3 hidden layers]

  17. Common story
  • Deep learning analysis requires a large number of hyperparameter experiments:

    h_best = argmin_{m ∈ M} Loss(θ_m, T_train, T_val)

    Selecting a proper metric (GE instead of loss):

    h_best = argmin_{m ∈ M} GE(θ_m, T_train, T_val)

  • From multiple models, we elect a single best one. Why not benefit from multiple models instead of a single best model?

  18. Ensembles
  • Boosting
  • Stacking
  • Bootstrap Aggregating (Bagging)

  Select the M_best < M models with the lowest GE, argmin_m GE(θ_m, T_train, T_val), where θ_m are the hyperparameters, T_train the training traces, and T_val the validation traces. The ensemble sums log probabilities over the selected models:

  Q(k) = Σ_{m=0}^{M_best-1} Σ_{j=0}^{N-1} log q_{j,l_j,m}
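The selection and double summation above can be sketched as follows, assuming each trained model contributes a (N, 9) matrix of log class probabilities and the labels under every key guess are precomputed; names and shapes are illustrative:

```python
import numpy as np

def select_best_models(val_ge_per_model, m_best):
    """Indices of the m_best models with the lowest validation guessing entropy."""
    return np.argsort(val_ge_per_model)[:m_best]

def ensemble_scores(log_q_per_model, labels_per_key):
    """Q(k) = sum_m sum_j log q_{j, l_j, m} over the selected models.

    log_q_per_model: (M_best, N, 9) log class probabilities, one matrix per model.
    labels_per_key:  (256, N) label l_j of each trace under each key guess k.
    """
    n = log_q_per_model.shape[1]
    scores = np.zeros(256)
    for k in range(256):
        labels = labels_per_key[k]
        for model_lq in log_q_per_model:           # accumulate every model's evidence
            scores[k] += model_lq[np.arange(n), labels].sum()
    return scores
```

Because the per-model log probabilities simply add, a model whose predictions fluctuate slightly cannot flip the ensemble's decision on its own, which is the generalization benefit the talk argues for.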

  19. Ensembles
  Single best model, argmin_m GE(θ_m, T_train, T_val):
  Q(k) = Σ_{j=0}^{N-1} log q_{j,l_j}
  Ensemble (M_best = 10 out of M = 50):
  Q(k) = Σ_{m=0}^{M_best-1} Σ_{j=0}^{N-1} log q_{j,l_j,m}

  20. Datasets

  Dataset        Training               Validation  Test   Features  Countermeasures
  Pinata SW AES  6,000 (fixed key)      1,000       1,000  400       No
  DPAv4          34,000 (fixed key)     1,000       1,000  2,000     RSM
  ASCAD          200,000 (random keys)  500         500    1,400     Masking
  CHES CTF 2018  43,000 (fixed key)     1,000       1,000  2,000     Masking

  21. Range of Hyperparameters (optimal ranges based on literature)

  MLP:
  Hyperparameter       min     max    step
  Learning Rate        0.0001  0.001  0.0001
  Mini-batch           100     1000   100
  Dense Layers         2       8      1
  Neurons              100     1000   100
  Activation Function  Tanh, ReLU, ELU or SELU

  CNN:
  Hyperparameter          min     max    step
  Learning Rate           0.0001  0.001  0.0001
  Mini-batch              100     1000   100
  Convolution Layers (i)  1       2      1
  Filters                 8*i     32*i   4
  Kernel Size             10      20     2
  Stride                  5       10     1
  Dense Layers            2       8      1
  Neurons                 100     1000   100
  Activation Function     Tanh, ReLU, ELU or SELU
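Drawing one random configuration from the MLP grid above can be sketched as follows; the dictionary layout and function name are illustrative assumptions, not the paper's code:

```python
import random

# MLP ranges from the table above; values are sampled on the stated grids.
MLP_SPACE = {
    "learning_rate": [round(0.0001 * i, 4) for i in range(1, 11)],  # 0.0001 .. 0.001
    "mini_batch":    list(range(100, 1100, 100)),                   # 100 .. 1000
    "dense_layers":  list(range(2, 9)),                             # 2 .. 8
    "neurons":       list(range(100, 1100, 100)),                   # 100 .. 1000
    "activation":    ["tanh", "relu", "elu", "selu"],
}

def sample_mlp_config(rng=random):
    """Draw one hyperparameter configuration uniformly from the grid."""
    return {name: rng.choice(values) for name, values in MLP_SPACE.items()}
```

Repeating this draw M times yields the pool of models from which the ensemble later selects its M_best members.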

  22. Results on ASCAD (Hamming Weight)
  [Figures: MLP and CNN]

  23. Results on ASCAD (Identity)
  [Figures: MLP and CNN]

  24. Conclusions
  • Output class probabilities are a valid distinguisher for side-channel analysis.
  • Output class probabilities are sensitive to small changes in hyperparameters: ensembles remove the effect of small variations, improving generalization results.
  • Ensembles do not replace hyperparameter search; they relax the fine-tuning of hyperparameters: the GE or SR of an ensemble tends to be superior to the GE or SR of a single best model.
  • Ensembles do not improve learnability: they improve on what single models already learn.
  • A limited number of models can be enough to build a strong ensemble.
  Future work:
  • Explore other ensemble methods (e.g., stacking).
  • Verify how ensembles work in combination with other regularization methods and other metrics (SR, MI).
  • Formalize the density distribution of output class probabilities (a new metric).

  25. Thank you!
  • Our code is available at: https://github.com/AISyLab/EnsembleSCA
