Privacy guarantees in statistical estimation: How to formalize the problem?

Martin Wainwright
UC Berkeley, Departments of Statistics and EECS
van Dantzig Seminar, University of Leiden
Martin Wainwright (UC Berkeley), Privacy and statistics, October 2015

The modern landscape

Modern data sets are often very large:
- biological data (genes, proteins, etc.)
- medical imaging (MRI, fMRI, etc.)
- astronomy datasets
- social network data
- recommender systems (Amazon, Netflix, etc.)

Statistical considerations interact with:
1. Computational constraints: (low-order) polynomial time is essential!
2. Communication/storage constraints: distributed implementations are often needed
3. Privacy constraints: tension between hiding and sharing data

From Classical Minimax Risk...

Choose the estimator to minimize the worst-case risk:

\[ \text{Classical minimax risk} = \inf_{\widehat{\theta}_n} \sup_{\theta \in \Omega} \mathbb{E}\big[ L(\widehat{\theta}_n, \theta) \big]. \]

[Photo: Abraham Wald, 1902–1950]

Two-party game:
- Nature chooses the parameter $\theta \in \Omega$ in a potentially adversarial manner
- Statistician takes the infimum over all estimators, i.e., arbitrary measurable functions $(X_1, \ldots, X_n) \mapsto \widehat{\theta}_n \in \Omega$

Classical questions about minimax risk:
- how fast does it decay as a function of sample size $n$?
- dependence on dimensionality, smoothness, etc.?
- characterization of optimal estimators?

...to Constrained Minimax Risk

The classical framework imposes no constraints on the choice of estimators $\widehat{\theta}_n$:
- Unbounded memory and computational power.
- Provided centralized access to all $n$ samples.
- Data is fully revealed: no privacy-preserving properties.

Ongoing research: statistical minimax with constraints
- Computationally constrained estimators (e.g., Rigollet & Berthet, 2013; Ma & Wu, 2014; Zhang, W. & Jordan, 2014)
- Communication constraints (e.g., Zhang et al., 2013; Ma et al., 2014; Braverman et al., 2015)
- Privacy constraints (e.g., Dwork, 2006; Hardt & Rothblum, 2010; Hall et al., 2011; Duchi, W. & Jordan, 2013)

Why be concerned with privacy?

Many sources of data have both statistical utility and privacy concerns.

[Figures: (a) Personal genome project; (b) Privacy breach (Scientific American, August 2013)]

Question: How to obtain principled tradeoffs between these competing criteria?
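
Basic model of local privacy

[Diagram: private data $X_1, X_2, X_3, \ldots, X_n$ pass through the channel $Q(Z_1^n \mid X_1^n)$ to give public data $Z_1^n$, from which the estimate $\widehat{\theta}$ is computed]

- each individual $i \in \{1, 2, \ldots, n\}$ has personal data $X_i \sim P_{\theta^*}$
- conditional distribution $Q$ between private data $X_1^n$ and public data $Z_1^n$
- estimator $Z_1^n \mapsto \widehat{\theta}$ of the unknown parameter $\theta^*$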

Local privacy at level α

[Figure: log-likelihoods $\log Q(\cdot \mid x)$ and $\log Q(\cdot \mid \bar{x})$ plotted against $z$]

Definition: The conditional distribution $Q$ is locally α-differentially private if

\[ e^{-\alpha} \le \sup_{z} \frac{Q(z \mid x_1^n)}{Q(z \mid \bar{x}_1^n)} \le e^{\alpha} \quad \text{for all } x_1^n \text{ and } \bar{x}_1^n \text{ such that } d_{\mathrm{HAM}}(x_1^n, \bar{x}_1^n) = 1. \]

(Dwork et al., 2006)
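To make the definition concrete, the sketch below computes the privacy level of a channel by brute force. It assumes a simplified setting that is not on the slide: a single individual with a finite input and output alphabet, so that the channel is a matrix with entries Q(z | x). The function names `privacy_level` and `randomized_response` are hypothetical, and the binary channel with keep-probability e^α / (1 + e^α) is used only as a convenient test case.

```python
import math

def privacy_level(channel):
    """Smallest alpha for which a finite channel is alpha-differentially private.

    channel[x][z] holds Q(z | x); every row is assumed to be a strictly
    positive probability vector over the same output alphabet.
    """
    worst = 0.0
    for row_x in channel:
        for row_xbar in channel:
            for q, q_bar in zip(row_x, row_xbar):
                # |log Q(z | x) / Q(z | xbar)| for this (x, xbar, z) triple
                worst = max(worst, abs(math.log(q / q_bar)))
    return worst

def randomized_response(alpha):
    """Binary channel that keeps the true bit with probability e^a / (1 + e^a)."""
    keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    return [[keep, 1.0 - keep],
            [1.0 - keep, keep]]

alpha = 0.5
level = privacy_level(randomized_response(alpha))
print(f"target alpha = {alpha}, achieved level = {level:.6f}")
```

For this binary channel the likelihood ratio is exactly e^α or e^{-α}, so the computed level should match the target up to floating-point error.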

Illustration of Laplacian mechanism

Add α-Laplacian noise (Dwork et al., 2006):

\[ Z = x + W, \quad \text{where } W \text{ has density} \propto e^{-\alpha |w|}. \]

For all $x, x' \in [-1/2, 1/2]$:

\[ \sup_{z \in \mathbb{R}} \left| \log \frac{Q(z \mid x)}{Q(z \mid x')} \right| = \alpha \sup_{z \in \mathbb{R}} \Big| |z - x| - |z - x'| \Big| \le \alpha. \]
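The bound above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `laplace_mechanism` and `log_density_ratio` are hypothetical. It relies on the fact that the difference of two independent Exponential(rate α) variables has density proportional to e^{-α|w|}, and it evaluates the log-density ratio on a finite grid of z values rather than over all of R.

```python
import math
import random

def laplace_mechanism(x, alpha, rng):
    """Return Z = x + W, where W has density proportional to exp(-alpha * |w|).

    The difference of two independent Exponential(rate=alpha) draws has
    exactly this Laplace distribution (scale 1 / alpha).
    """
    return x + rng.expovariate(alpha) - rng.expovariate(alpha)

def log_density_ratio(z, x, x_prime, alpha):
    """log Q(z | x) - log Q(z | x') for the Laplace channel."""
    return alpha * (abs(z - x_prime) - abs(z - x))

alpha = 0.8
x, x_prime = -0.5, 0.5                      # the two most distant inputs in [-1/2, 1/2]
grid = [i / 100.0 for i in range(-500, 501)]  # z values in [-5, 5]
worst = max(abs(log_density_ratio(z, x, x_prime, alpha)) for z in grid)
print(f"worst |log ratio| on grid = {worst:.4f} (bound alpha = {alpha})")

rng = random.Random(0)
released = [laplace_mechanism(0.25, alpha, rng) for _ in range(5)]
print("five privatized releases of x = 0.25:", released)
```

With the two inputs at opposite ends of the interval, the bound α is attained for every z outside the interval, which is why the noise scale cannot be reduced without weakening the guarantee.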

Various mechanisms for α-privacy

Choices from past work:
- randomized response in survey questions (Warner, 1965)
- Laplacian noise (Dwork et al., 2006)
- exponential mechanism (McSherry & Talwar, 2007)

Some past work on privacy and estimation:
- local differential privacy and PAC learning (Kasiviswanathan et al., 2008)
- linear queries over discrete-valued data sets (Hardt & Rothblum, 2010)
- global differential privacy and histogram estimators (Hall et al., 2011)
- lower bounds for certain 1-D statistics (Chaudhuri & Hsu, 2012)

Questions:
- Can we provide a general characterization of trade-offs between α-privacy and statistical utility?
- Can we identify optimal "mechanisms" for privacy?
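As an illustration of the first mechanism in the list, here is a minimal sketch of randomized response used to estimate a proportion. The parametrization (keep the true bit with probability e^α / (1 + e^α)) is an assumption chosen so that the likelihood ratio is exactly e^α; it is not taken from the slides. The function names and the simulated population are hypothetical.

```python
import math
import random

def privatize_bit(bit, alpha, rng):
    """Report the true bit with probability e^alpha / (1 + e^alpha), else flip it."""
    keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    return bit if rng.random() < keep else 1 - bit

def debiased_proportion(reports, alpha):
    """Unbiased estimate of the true proportion of ones from privatized bits.

    E[Z] = (2 * keep - 1) * theta + (1 - keep), so we invert this affine map.
    """
    keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    mean_report = sum(reports) / len(reports)
    return (mean_report - (1.0 - keep)) / (2.0 * keep - 1.0)

rng = random.Random(1)
alpha, theta_true, n = 1.0, 0.3, 50_000
truth = [1 if rng.random() < theta_true else 0 for _ in range(n)]   # simulated private bits
reports = [privatize_bit(b, alpha, rng) for b in truth]             # what is released
estimate = debiased_proportion(reports, alpha)
print(f"estimate = {estimate:.3f}, truth = {theta_true}")
```

The division by 2·keep − 1 inflates the variance of the estimate as α shrinks, which is one simple way to see the privacy/utility tension raised in the questions above.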
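
Minimax optimality with α-privacy

- family of distributions $\{P \in \mathcal{F}\}$, and functional $P \mapsto \theta(P)$
- samples $X_1^n \equiv \{X_1, \ldots, X_n\} \sim P$ and estimator $X_1^n \mapsto \widehat{\theta}(X_1^n)$
- loss function (e.g., squared error, 0-1 error, $\ell_1$-error) $(\widehat{\theta}, \theta) \mapsto L(\widehat{\theta}, \theta)$, measuring the quality of $\widehat{\theta}$ as an estimate of $\theta$

Ordinary minimax risk:

\[ M_n(\mathcal{F}) := \underbrace{\inf_{\widehat{\theta}}}_{\text{best estimator}} \; \underbrace{\sup_{P \in \mathcal{F}}}_{\text{worst-case distribution}} \mathbb{E}\Big[ L\big(\widehat{\theta}(X_1^n), \theta(P)\big) \Big] \]

Minimax risk with α-privacy (estimators now depend on the privatized samples $Z_1^n$):

\[ M_n(\alpha; \mathcal{F}) := \underbrace{\inf_{Q \in \mathcal{Q}_\alpha}}_{\text{best } \alpha\text{-private channel}} \; \inf_{\widehat{\theta}} \; \sup_{P \in \mathcal{F}} \mathbb{E}\Big[ L\big(\widehat{\theta}(Z_1^n), \theta(P)\big) \Big] \]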
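The gap between the two risks can be illustrated, though not computed, by simulation. The sketch below fixes one distribution (uniform on [-1/2, 1/2]), one channel (additive Laplace noise with density proportional to e^{-α|w|}), one estimator (the sample mean), and squared-error loss; all of these are assumptions made for illustration. It therefore shows the cost of privacy for this particular choice only, not the infimum and supremum in the definitions.

```python
import random

def mse_of_mean(n, trials, alpha, rng):
    """Monte Carlo squared error of the sample mean, with and without privacy.

    X_i is uniform on [-1/2, 1/2] (true mean 0). The private estimator only
    sees Z_i = X_i + W_i, with W_i Laplace noise of scale 1 / alpha.
    """
    err_plain = err_private = 0.0
    for _ in range(trials):
        xs = [rng.uniform(-0.5, 0.5) for _ in range(n)]
        zs = [x + rng.expovariate(alpha) - rng.expovariate(alpha) for x in xs]
        err_plain += (sum(xs) / n) ** 2
        err_private += (sum(zs) / n) ** 2
    return err_plain / trials, err_private / trials

rng = random.Random(0)
n, trials, alpha = 200, 400, 1.0
plain, private = mse_of_mean(n, trials, alpha, rng)
predicted_private = (1.0 / 12.0 + 2.0 / alpha**2) / n   # Var(X)/n + Var(W)/n
print(f"non-private MSE ~ {plain:.5f}")
print(f"private MSE     ~ {private:.5f} (predicted {predicted_private:.5f})")
```

The predicted value uses Var(X) = 1/12 for the uniform distribution and Var(W) = 2/α² for the Laplace noise; the noise term dominates unless α is large.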

Vignette A: α-private location estimation

Consider estimation of the mean functional $\theta(P) = \mathbb{E}[X]$ over the family

\[ \mathcal{F}_k := \big\{ \text{distributions } P \text{ such that } \mathbb{E}[X] \in [-1, 1] \text{ and } \mathbb{E}[|X|^k] \le 1 \big\}. \]

For $k \ge 2$ and the non-private setting, the sample mean $\widehat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i$ achieves rate $1/n$.
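The 1/n rate for the sample mean can be verified in one line, assuming squared-error loss and i.i.d. samples (the slide states neither explicitly). For $k \ge 2$, Jensen's inequality gives $\mathbb{E}[X^2] \le (\mathbb{E}[|X|^k])^{2/k}$, so for every $P \in \mathcal{F}_k$:

```latex
\[
\mathbb{E}\Big[ \big(\widehat{\theta} - \theta(P)\big)^2 \Big]
  = \frac{\operatorname{Var}(X)}{n}
  \le \frac{\mathbb{E}[X^2]}{n}
  \le \frac{\big(\mathbb{E}[|X|^k]\big)^{2/k}}{n}
  \le \frac{1}{n}.
\]
```

The first equality uses unbiasedness of the sample mean and independence of the samples; the last step uses the moment constraint $\mathbb{E}[|X|^k] \le 1$ that defines $\mathcal{F}_k$.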