Learning to Reconstruct: Statistical Learning Theory and Encrypted Database Attacks
Paul Grubbs, Marie-Sarah Lacharité, Brice Minaud, Kenny Paterson
pag225@cornell.edu, @pag_crypto
Outsourced Databases
(Diagram: a client outsources its database to a server.)
Encrypting Outsourced Databases?
Encryption prevents querying!
Encrypted Databases
Efficient ones leak access patterns: the set of records matching each query.
What can an attacker learn from access pattern leakage?
Database Reconstruction (DR)
With enough queries, an attacker can learn the data from access patterns! [KKNO], [LMP], [KPT]
Prior work needed huge numbers of queries ([KKNO]: ~10²⁶ for salaries), strong assumptions ([LMP]: dense database), or specific query types ([KPT]: kNN queries only).
Our Contributions
• Enabling insight: access pattern leakage is a binary classification problem. Use statistical learning theory (SLT) to build and analyze attacks.
• New DR attacks on range queries: generalize and improve [KKNO], [LMP] with SLT + PQ trees. On real data: with only 50 queries, predict salaries to within 2% error.
• Generic reduction from DR with known queries to PAC learning.
• A "minimal" attack for all query types via ε-nets, instantiated as the first DR attack for prefix queries.
• First general lower bound on the number of queries needed for DR.
Full version: https://eprint.iacr.org/2019/011
Notation and Terminology
N: number of possible values; wlog the domain is [1, …, N]. E.g., N = 125 for age data.
Range query: a pair [a, b] where 1 ≤ a ≤ b ≤ N.
Database: composed of records, each with a value in [1, …, N].
Access pattern: which records match a query. (Example diagram: a small database and the records matching the query [7, 10].)
Full database reconstruction (DR): recovering exact record values.
Approximate DR: recovering all record values to within εN. ε = 0.05 is recovery within 5%; ε = 1/N is full DR.
Scale-free: query complexity independent of the number of records or of N.
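The access-pattern leakage can be made concrete with a toy sketch (the database values and record indices below are illustrative, not from the talk): per query, the server learns which encrypted records match, but not the values themselves.

```python
def access_pattern(db, query):
    """Return the indices of records whose value lies in the range [a, b].
    This set of indices is exactly what an efficient encrypted database
    leaks for each query."""
    a, b = query
    return {i for i, v in enumerate(db) if a <= v <= b}

# Hypothetical plaintext values, hidden from the server behind encryption.
db = [2, 4, 12, 8, 14, 9]
print(access_pattern(db, (7, 10)))  # the records holding 8 and 9 match
```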
DR For Range Queries: Our Work
[KKNO16]: full DR in O(N⁴ log N) queries. [LMP18]: full DR for a dense DB in O(N log N) queries.
Three attacks, each generalizing to a full-DR bound:
‣ GeneralizedKKNO: O(ε⁻⁴ log ε⁻¹) queries for approx. DR; lower bound Ω(ε⁻⁴); full DR in O(N⁴ log N)
‣ ApproxValue: O(ε⁻² log ε⁻¹) queries for approx. DR*; lower bound Ω(ε⁻²); full DR in O(N² log N)
‣ ApproxOrder: O(ε⁻¹ log ε⁻¹) queries for approx. order reconstruction*; lower bound Ω(ε⁻¹ log ε⁻¹); O(N log N) in total
With DB distribution info, approx. order reconstruction gives approx. DR.
* Requires a mild hypothesis about the data; bypasses the [LMP] lower bound by relaxing to "sacrificial" reconstruction.
GeneralizedKKNO
Assume a uniform distribution on range queries and a static database. This induces a distribution f: the probability that each value is accessed by a query. (Plot of f over [1, N]: values near the middle are more probable, values near the endpoints less probable.)
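The induced distribution f has a simple closed form: of the N(N+1)/2 possible ranges [a, b], exactly v·(N − v + 1) contain the value v (v choices for a, N − v + 1 choices for b). A small sketch (the choice of N is illustrative):

```python
def f(v, N):
    """Probability that a uniformly random range [a, b], 1 <= a <= b <= N,
    contains v: v choices for a times (N - v + 1) choices for b, out of
    N(N + 1)/2 ranges in total."""
    return v * (N - v + 1) / (N * (N + 1) / 2)

N = 125
# Endpoints are least likely to be accessed; the middle peaks near 1/2.
print(f(1, N), f((N + 1) // 2, N))
# f is symmetric about the middle, which is why "inverting" it
# leaves two candidate values per record.
assert abs(f(10, N) - f(N + 1 - 10, N)) < 1e-12
```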
GeneralizedKKNO
Idea: for each record,
1. Count the number of accesses to estimate f(value).
2. Find the value by "inverting" the estimate of f.
Since f is symmetric, inverting yields two candidate values; more work is needed to break the symmetry. See paper for details.
How many queries do we need for the estimate to suffice for ε-approximate DR?
Estimating a Probability
Let X be a set with probability distribution D, and let C ⊆ X be a subset.
Sample complexity: to estimate Pr(C) to within ε, you need O(1/ε²) samples.
Estimating a Set of Probabilities
Now: a set of sets 𝓓. Goal: estimate all sets' probabilities simultaneously.
A set S of samples drawn from X is an ε-sample iff for all C in 𝓓:
| Pr_D(C) − |S ∩ C| / |S| | ≤ ε
The ε-sample Theorem
How many points do we need to draw to get an ε-sample w.h.p.?
Vapnik & Chervonenkis 1971: if 𝓓 has VC dimension d, then O(ε⁻² (d log ε⁻¹ + log δ⁻¹)) points give an ε-sample with probability at least 1 − δ.
The bound does not depend on |𝓓|!
GeneralizedKKNO
Idea: for each record,
1. Count the number of accesses to estimate f(value).
2. Find the value by "inverting" the estimate of f.
This is an ε-sample! X = range queries, 𝓓 = {{range queries ∋ x} : x ∈ [1, N]}, VC dim. = 2.
We need O(ε⁻⁴ log ε⁻¹) queries (inverting f adds a square). Can we get rid of the squaring?
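The count-and-invert idea can be simulated end to end in a toy setting (everything here is a simplification: queries are drawn uniformly, N is known, the database values are made up, and each record is recovered only up to the symmetry of f, i.e. two candidates):

```python
import random

random.seed(0)
N = 100
db = [20, 35, 80]                 # hidden record values (illustrative)
Q = 20000                         # number of observed queries

# Sample exactly uniformly over all N(N+1)/2 ranges.
ranges = [(a, b) for a in range(1, N + 1) for b in range(a, N + 1)]
counts = [0] * len(db)
for _ in range(Q):
    a, b = random.choice(ranges)
    for i, v in enumerate(db):
        if a <= v <= b:
            counts[i] += 1

def invert_f(p, N):
    """Value v in [1, N/2] whose access probability is closest to p;
    v and N + 1 - v are the two symmetric candidates."""
    f = lambda v: v * (N - v + 1) / (N * (N + 1) / 2)
    return min(range(1, N // 2 + 1), key=lambda v: abs(f(v) - p))

candidates = [(invert_f(c / Q, N), N + 1 - invert_f(c / Q, N)) for c in counts]
for (lo, hi), true in zip(candidates, db):
    print(f"candidates {lo}/{hi}, true value {true}")
```

With these parameters, each record's true value is typically within a few units of one of its two candidates; breaking the remaining symmetry is the extra work the slide alludes to.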
ApproxValue
Idea: for each record,
1. Count the number of accesses to estimate f(value).
2. Find the value by "inverting" the estimate of f.
Assume there exists at least one record with value in [N/8, 3N/8].
This is an ε-sample! X = range queries, 𝓓 = {{range queries ∋ x} : x ∈ [1, N]}, VC dim. = 2.
We need only O(ε⁻² log ε⁻¹) queries! More complex attack; see paper.
DR For Range Queries: Our Work
Three attacks:
‣ GeneralizedKKNO: O(ε⁻⁴ log ε⁻¹) for approx. DR; lower bound Ω(ε⁻⁴); full DR in O(N⁴ log N)
‣ ApproxValue: O(ε⁻² log ε⁻¹) for approx. DR*; lower bound Ω(ε⁻²); full DR in O(N² log N)
‣ ApproxOrder: O(ε⁻¹ log ε⁻¹) for approx. order reconstruction*; lower bound Ω(ε⁻¹ log ε⁻¹); O(N log N)
With DB distribution info, get approx. DR.
These attacks require iid uniform queries and an adversary who knows the query distribution. What can we do without making these assumptions?
DR For Range Queries: Our Work
Three attacks:
‣ GeneralizedKKNO: O(ε⁻⁴ log ε⁻¹) for approx. DR; lower bound Ω(ε⁻⁴); full DR in O(N⁴ log N)
‣ ApproxValue: O(ε⁻² log ε⁻¹) for approx. DR*; lower bound Ω(ε⁻²); full DR in O(N² log N)
‣ ApproxOrder: O(ε⁻¹ log ε⁻¹) for approx. order reconstruction*; lower bound Ω(ε⁻¹ log ε⁻¹); O(N log N)
With DB distribution info, get approx. DR.
ApproxOrder reveals the order with no assumptions on the query distribution. See paper for details.
Conclusion
• Enabling insight: access pattern leakage is a binary classification problem. Use statistical learning theory (SLT) to build and analyze attacks.
• New DR attacks on range queries: generalize and improve [KKNO], [LMP] with SLT + PQ trees. On real data: with only 50 queries, predict salaries to within 2% error.
• Generic reduction from DR with known queries to PAC learning.
• A "minimal" attack for all query types via ε-nets, instantiated as the first DR attack for prefix queries.
• First general lower bound on the number of queries needed for DR.
Full version: https://eprint.iacr.org/2019/011
Thanks for listening! Any questions?
Attack Simulation
ApproxValue experimental results with R = 1000 records, compared to the theoretical ε-sample bound ε⁻² log ε⁻¹.
(Plot: maximum sacrificed symmetric value and maximum symmetric error, as a fraction of N, against the number of queries (0 to 500), for N = 100, 1000, 10000, 100000.)
Effective constants are ~1!
DR As Learning a Binary Classifier
This formulation is not specific to range queries! Record values are binary classifiers: X = range queries, 𝓓 = {{range queries ∋ x} : x ∈ [1, N]}.
Approximately learning the classifier => approximate DR.
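The classifier view can be sketched directly (the queries, labels, and domain size below are made up for illustration): a record with hidden value x acts as the classifier h_x(q) = 1 iff x lies in q, and each leaked access pattern supplies one labeled example for it.

```python
def h(x):
    """The binary classifier over queries induced by a record with value x."""
    return lambda q: int(q[0] <= x <= q[1])

# Each observed query, together with whether the record appeared in its
# access pattern, is one labeled example (query, h_x(query)).
labeled = [((3, 9), 1), ((1, 4), 0), ((8, 12), 1)]

# Hypotheses consistent with every example: the surviving candidate values.
consistent = [x for x in range(1, 16)
              if all(h(x)(q) == y for q, y in labeled)]
print(consistent)  # more queries shrink this set, i.e. approximate DR
```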