Truth Inference on Sparse Crowdsourcing Data with Local Differential Privacy. IEEE BIG DATA '18. Haipei Sun (Stevens Institute of Technology, Hoboken, NJ), Boxiang Dong (Montclair State University, Montclair, NJ), Hui (Wendy) Wang (Stevens Institute of Technology, Hoboken, NJ), Ting Yu (Qatar Computing Research Institute, Doha, Qatar), Zhan Qin (The University of Texas at San Antonio, San Antonio, Texas). December 12, 2018
Crowdsourcing Workers Data Curator Tasks • Data curator releases tasks on a crowdsourcing platform. 2 / 36
Crowdsourcing Workers Answers Data Curator Answers Answers • Data curator releases tasks on a crowdsourcing platform. • The workers provide their answers to these tasks in exchange for a reward. 3 / 36
Privacy Concern Collecting answers from individual workers poses privacy risks. • Crowdsourcing-related applications collect sensitive personal information from workers. • By combining answers across a sequence of surveys, a data curator (DC) could determine the identities of workers. 4 / 36
Differential Privacy Differential privacy (DP) provides a rigorous privacy guarantee. [Figure: workers send their raw answers $x_1, x_2, \ldots, x_m$ to a trusted data curator, who publishes the noisy statistic $\frac{1}{m}\sum_{i=1}^{m} x_i + \xi$ to the public.] However, classical DP requires a trusted data curator to publish privatized statistical information. 5 / 36
Local Differential Privacy Local differential privacy (LDP) is the state-of-the-art approach for privacy-preserving data collection. [Figure: each worker sends a locally perturbed answer $\hat{x}_i = x_i + \xi_i$ to an untrusted data curator, who computes $f(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_m)$.] Before sending the answer to the data curator, each worker perturbs his/her private data locally. 6 / 36
Challenges I - Data Sparsity • Most workers only provide answers to a very small portion of the tasks. • We use NULL to represent the answer if a worker does not provide a response for a specific task.

Dataset            # of Workers   # of Tasks   Average Sparsity
Web [1]            34             177          0.705882
AdultContent [2]   825            11,040       0.993666

• NULL values should also be protected. • Careless perturbation of NULL values may significantly alter the original answer distribution. [1] http://dbgroup.cs.tsinghua.edu.cn/ligl/crowddata/ [2] https://github.com/ipeirotis/Get-Another-Label/tree/master/data 7 / 36
Challenges II - Data Utility • Truth inference estimates the true results from answers provided by workers of different quality. • Most truth inference algorithms iterate until convergence. • We aim to preserve the accuracy of truth inference on the perturbed worker answers. This is challenging because even a slight amount of initial noise in the worker answers may be propagated and amplified across iterations. 8 / 36
Our Contributions Extension to Existing Approaches • Laplace perturbation (LP) approach • Randomized response (RR) approach • Large expected error in the truth inference results Novel Approach We design a new matrix factorization (MF) perturbation algorithm to satisfy LDP, and guarantee small error. 9 / 36
Outline 1 Introduction 2 Related Work 3 Preliminaries 4 Perturbation Schemes • Laplace Perturbation (LP) • Randomized Response (RR) • Matrix Factorization (MF) 5 Experiments 6 Conclusion 10 / 36
Related Work Local differential privacy • Count, heavy hitters [HILM02, HIM02] • Graph synthesization [QYY + 17] • Linear regression [NXY + 16] Privacy-preserving crowdsourcing • Mutual information [KOV14] • Truth discovery on complete data [LMS + 18] Differentially private recommendation • Perturbation on categories [Can02, SJ14] • Iterative factorization [SKSX18] 11 / 36
Preliminaries - Local Differential Privacy (LDP) Definition ($\epsilon$-Local Differential Privacy) A randomized privatization mechanism $M$ satisfies $\epsilon$-local differential privacy ($\epsilon$-LDP) iff for any pair of answer vectors $\vec{a}$ and $\vec{a}'$ that differ in one cell, we have: $\forall \vec{z}_p \in Range(M): \frac{\Pr[M(\vec{a}) = \vec{z}_p]}{\Pr[M(\vec{a}') = \vec{z}_p]} \le e^{\epsilon}$, where $Range(M)$ denotes the set of all possible outputs of the algorithm $M$. 12 / 36
Preliminaries - Truth Inference • Associate each worker $W_i$ with a quality $q_i$. • For each task, estimate the truth by taking the quality-weighted average of the worker answers: $\hat{\mu}_j = \frac{\sum_{W_i \in \mathcal{W}_j} q_i \times a_{i,j}}{\sum_{W_i \in \mathcal{W}_j} q_i}$. • For each worker, estimate the quality by measuring the difference between his answers and the estimated truths: $q_i \propto \frac{1}{\sigma_i}$, where $\sigma_i = \sqrt{\frac{1}{|\mathcal{T}_i|} \sum_{t_j \in \mathcal{T}_i} (a_{i,j} - \hat{\mu}_j)^2}$. 13 / 36
Preliminaries - Truth Inference Iteratively update the estimated truth and worker quality until convergence [LLG + 14].

Algorithm 1 Truth inference
Require: The workers' answers {a_{i,j}}
Ensure: The estimated true answer (i.e., the truth) of tasks {μ̂_j} and the quality of workers {q_i}
1: Initialize worker quality q_i = 1/m for each worker W_i ∈ W;
2: while the convergence condition is not met do
3:   Estimate {μ̂_j};
4:   Estimate {q_i};
5: end while
6: return {μ̂_j} and {q_i};
14 / 36
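The alternating updates of Algorithm 1 can be sketched in Python. This is a minimal illustration, not the authors' code: `truth_inference` is a hypothetical helper name, NaN encodes a NULL answer, and the small floor on sigma (to avoid division by zero) is an added safeguard the slide does not mention.

```python
import numpy as np

def truth_inference(answers, n_iter=50, tol=1e-6):
    """Iterative truth inference as in Algorithm 1.
    answers: m x n array of worker answers; np.nan marks NULL."""
    m, n = answers.shape
    mask = ~np.isnan(answers)          # which worker answered which task
    q = np.full(m, 1.0 / m)            # step 1: initial qualities q_i = 1/m
    mu = np.zeros(n)
    for _ in range(n_iter):
        # estimate truths: quality-weighted average over available answers
        new_mu = np.array([
            np.average(answers[mask[:, j], j], weights=q[mask[:, j]])
            for j in range(n)
        ])
        # estimate qualities: inverse of each worker's deviation from the truths
        sigma = np.array([
            np.sqrt(np.mean((answers[i, mask[i]] - new_mu[mask[i]]) ** 2))
            for i in range(m)
        ])
        q = 1.0 / np.maximum(sigma, 1e-9)   # floor avoids division by zero
        if np.max(np.abs(new_mu - mu)) < tol:
            mu = new_mu
            break
        mu = new_mu
    return mu, q
```

In practice the loop exits either on the convergence test or after a fixed iteration budget, mirroring the "while the convergence condition is not met" guard.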
Preliminaries - Matrix Factorization Given $M \in \mathbb{R}^{m \times n}$, find $U \in \mathbb{R}^{m \times d}$ and $V \in \mathbb{R}^{n \times d}$ s.t. $L(M, U, V) = \sum_{(i,j) \in \Omega} (M_{i,j} - \vec{u}_i^T \vec{v}_j)^2$ is minimized. Each entry $M_{i,j}$ can be approximated by the inner product of $\vec{u}_i$ and $\vec{v}_j$, i.e., $M_{i,j} \approx \vec{u}_i^T \vec{v}_j$. 15 / 36
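Minimizing $L(M, U, V)$ over the observed cells $\Omega$ can be sketched with plain gradient descent. The slide does not prescribe an optimizer, so the learning rate, initialization scale, and `factorize` name here are all illustrative assumptions; NaN entries mark unobserved cells.

```python
import numpy as np

def factorize(M, d=2, steps=2000, lr=0.01, seed=0):
    """Minimize sum over observed (i,j) of (M_ij - u_i^T v_j)^2
    by gradient descent. Observed cells are the non-NaN entries of M."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = 0.1 * rng.standard_normal((m, d))   # assumed small random init
    V = 0.1 * rng.standard_normal((n, d))
    obs = ~np.isnan(M)
    M0 = np.where(obs, M, 0.0)
    for _ in range(steps):
        # residuals, restricted to the observed cells only
        R = np.where(obs, U @ V.T - M0, 0.0)
        # simultaneous gradient step on U and V
        U, V = U - lr * (R @ V), V - lr * (R.T @ U)
    return U, V
```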
Problem Statement Input: A set of workers $\{W_i\}$ and their answer vectors $A = \{\vec{a}_i\}$, and a privacy parameter $\epsilon$. Output: The perturbed answer vectors $A^P = \{M(\vec{a}_i) \mid \forall \vec{a}_i \in A\}$. Requirement • Privacy: $A^P$ satisfies $\epsilon$-LDP. • Utility: Accurate truth inference results from $A^P$, i.e., minimize $MAE(A^P) = \frac{\sum_{T_j \in \mathcal{T}} |\mu_j - \hat{\mu}_j|}{n}$. 16 / 36
Laplace Perturbation (LP) Step 1: Replace NULL values with some value $v$ in the answer domain $\Gamma$: $g(a_{i,j}) = v$ if $a_{i,j} = \text{NULL}$, and $g(a_{i,j}) = a_{i,j}$ if $a_{i,j} \ne \text{NULL}$. Step 2: Add Laplace noise to each answer: $L(\vec{a}_i) = \left( g(a_{i,1}) + Lap\!\left(\tfrac{|\Gamma|}{\epsilon}\right),\ g(a_{i,2}) + Lap\!\left(\tfrac{|\Gamma|}{\epsilon}\right),\ \ldots,\ g(a_{i,n}) + Lap\!\left(\tfrac{|\Gamma|}{\epsilon}\right) \right)$. 17 / 36
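The two LP steps can be sketched as follows; `laplace_perturb` and its parameter names are illustrative, and NaN stands in for NULL.

```python
import numpy as np

def laplace_perturb(a, gamma_size, eps, fill_value, rng=None):
    """LP scheme: (1) replace NULLs (np.nan) with a default value v from
    the domain, (2) add Laplace(|Gamma|/eps) noise to every answer."""
    rng = rng or np.random.default_rng(0)
    a = np.asarray(a, dtype=float)
    g = np.where(np.isnan(a), fill_value, a)            # step 1: fill NULLs
    noise = rng.laplace(loc=0.0, scale=gamma_size / eps, size=g.shape)
    return g + noise                                    # step 2: add noise
```

Note that the noise scale $|\Gamma|/\epsilon$ applies to every cell, including the formerly NULL ones, which is what makes the NULL pattern itself indistinguishable.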
Laplace Perturbation (LP) Theorem 1 (Expected MAE of LP) Given a set of answer vectors $A = \{\vec{a}_i\}$, let $A^P = \{\hat{\vec{a}}_i\}$ be the answer vectors after applying LP on $A$. Then the expected error $E\left[MAE(A^P)\right]$ of the estimated truth on $A^P$ must satisfy $E\left[MAE(A^P)\right] \le \frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{m} (q_i \times e^{LP}_{i,j})$, where $e^{LP}_{i,j} = (1 - s_i)\left(\phi_j + \frac{|\Gamma|}{\epsilon}\right) + s_i\left(\sqrt{\frac{2}{\pi}}\,\sigma_i + \frac{|\Gamma|}{\epsilon}\right)$, $\mu_j$ is the ground truth of task $T_j$, $\sigma_i$ is the standard deviation of worker $W_i$'s error, $s_i$ is the fraction of the tasks for which $W_i$ returns non-NULL values, and $\phi_j$ is the deviation between $\mu_j$ and the expected value $E(v)$ of $v$. 18 / 36
Laplace Perturbation (LP) Simple Setting • $q_i = \frac{1}{m}$, $\sigma_i = 1$, i.e., all workers have the same quality. • $\mu_j = 1$, i.e., all ground truths are 1. • $s_i = 0.1$, i.e., 10% of answers are not NULL. • $|\Gamma| = 10$. • $\epsilon = 1$. Expected Error: $E\left[MAE(A^P)\right] \le 14.13$. 19 / 36
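The 14.13 figure can be reproduced by plugging the simple setting into Theorem 1's per-answer error term. One assumption is made here that the slide leaves implicit: the deviation $\phi_j$ is taken to be 4.5, which is the value consistent with the stated bound.

```python
import math

# Simple setting from the slide; phi_j = 4.5 is an assumption
# (the slide does not state phi_j explicitly).
s_i, sigma_i, gamma, eps, phi_j = 0.1, 1.0, 10, 1.0, 4.5

e_lp = (1 - s_i) * (phi_j + gamma / eps) \
     + s_i * (math.sqrt(2 / math.pi) * sigma_i + gamma / eps)
# With q_i = 1/m and identical workers and tasks, the double sum
# over i and j collapses to e_lp itself.
print(round(e_lp, 2))  # 14.13
```

The bound is dominated by the $(1 - s_i)$ term: 90% of the cells are filled-in NULLs, each contributing both the fill deviation and the full Laplace noise.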
Randomized Response (RR) • Add NULL to the answer domain $\Gamma$. • For each answer $a_{i,j}$, apply randomized response: $\forall y \in \Gamma$, $\Pr[M(a_{i,j}) = y] = \frac{e^{\epsilon}}{|\Gamma| + e^{\epsilon}}$ if $y = a_{i,j}$, and $\frac{1}{|\Gamma| + e^{\epsilon}}$ if $y \ne a_{i,j}$. Each original answer either • remains unchanged with probability $\frac{e^{\epsilon}}{|\Gamma| + e^{\epsilon}}$, or • is replaced with a different value, each with probability $\frac{1}{|\Gamma| + e^{\epsilon}}$. 20 / 36
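A minimal sketch of this RR mechanism, under the assumption that `domain` already includes the NULL symbol (so the true value is kept with probability $e^{\epsilon}/(|\Gamma| + e^{\epsilon})$ and each of the $|\Gamma|$ alternatives is chosen with probability $1/(|\Gamma| + e^{\epsilon})$); the function name is illustrative.

```python
import math
import numpy as np

def randomized_response(answer, domain, eps, rng=None):
    """RR scheme: keep `answer` w.p. e^eps / (|Gamma| + e^eps), otherwise
    replace it with one of the |Gamma| other domain values uniformly.
    `domain` includes NULL as an explicit symbol."""
    rng = rng or np.random.default_rng()
    k = len(domain) - 1                      # |Gamma| alternative values
    p_keep = math.exp(eps) / (k + math.exp(eps))
    if rng.random() < p_keep:
        return answer                        # truthful report
    others = [v for v in domain if v != answer]
    return others[rng.integers(len(others))] # uniform random lie
```

Because NULL is an ordinary domain symbol here, the mechanism protects the fact of non-response with the same guarantee as any other answer value.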
Randomized Response (RR) Theorem 2 (Expected MAE of RR) Given a set of answer vectors $A = \{\vec{a}_i\}$, let $A^P = \{\hat{\vec{a}}_i\}$ be the answer vectors after applying RR on $A$. Then the expected error $E\left[MAE(A^P)\right]$ of the estimated truth on $A^P$ must satisfy $E\left[MAE(A^P)\right] \le \frac{1}{n} \sum_{j=1}^{n} \frac{\sum_{W_i \in \mathcal{W}_j} q_i \times e^{RR}_{i,j}}{\sum_{W_i \in \mathcal{W}_j} q_i}$, where $e^{RR}_{i,j} = (1 - s_i)\left|\mu_j - \frac{\sum_{y \in \Gamma} y}{e^{\epsilon} + |\Gamma|}\right| + s_i \sum_{x \in \Gamma} N(x; \mu_j, \sigma_i)\left|\mu_j - \sum_{y \in \Gamma} y P_{xy}\right|$, $s_i$ is the fraction of tasks for which worker $W_i$ returns non-NULL values, and $P_{xy}$ is the probability that value $x$ is replaced with $y$. 21 / 36
Randomized Response (RR) Simple Setting • $q_i = \frac{1}{m}$, $\sigma_i = 1$, i.e., all workers have the same quality. • $\mu_j = 0$, i.e., all ground truths are 0. • $s_i = 0.1$, i.e., 10% of answers are not NULL. • $\Gamma = [0, 9]$. • $\epsilon = 1$. Expected Error: $E\left[MAE(A^P)\right] \le 3.551$. 22 / 36
Matrix Factorization (MF) • DC randomly generates the task profile matrix $V \in \mathbb{R}^{n \times d}$, and sends both $V$ and the tasks $\mathcal{T}$ to the workers. [Figure: DC broadcasts $(V, \mathcal{T})$ to all workers.] 23 / 36
Matrix Factorization (MF) • DC randomly generates the task profile matrix $V \in \mathbb{R}^{n \times d}$, and sends both $V$ and the tasks $\mathcal{T}$ to the workers. • Every worker generates the answers $\vec{a}_i$, and returns the differentially private answer profile vector $\vec{u}_i$. [Figure: each worker sends $\vec{u}_i$ to the DC, who reconstructs the answers as $\vec{a}_i \approx V \vec{u}_i$.] 24 / 36
Matrix Factorization (MF) Instead of directly adding noise to $\vec{u}_i$, we design a novel approach based on objective perturbation to reduce the distortion: $\vec{u}_i = \arg\min_{\vec{u}_i} L^{DP}(\vec{a}_i, \vec{u}_i, V)$, where $L^{DP}(\vec{a}_i, \vec{u}_i, V) = \sum_{T_j \in \mathcal{T}_i} (a_{i,j} - \vec{u}_i^T \vec{v}_j)^2 + 2\,\vec{u}_i^T \vec{\eta}_i$, and $\vec{\eta}_i = \left( Lap\!\left(\tfrac{|\Gamma|}{\epsilon}\right), \ldots, Lap\!\left(\tfrac{|\Gamma|}{\epsilon}\right) \right)$ is a $d$-dimensional noise vector. 25 / 36
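Since the perturbed objective is quadratic in $\vec{u}_i$, setting its gradient to zero gives a closed-form minimizer. The sketch below derives and solves it per worker; the closed form, the `private_profile` name, and the small ridge term (added for invertibility when a worker answered few tasks) are all this writeup's assumptions, not details from the slides.

```python
import numpy as np

def private_profile(a_i, V, eps, gamma_size, rng=None):
    """Objective perturbation for one worker (sketch).
    Minimizes sum_j (a_ij - u^T v_j)^2 + 2 u^T eta over u, where eta is a
    d-vector of Lap(|Gamma|/eps) noise and only the worker's answered
    tasks (non-NaN entries of a_i) enter the loss.
    Gradient = 0 gives  (Vo^T Vo) u = Vo^T ao - eta."""
    rng = rng or np.random.default_rng(0)
    obs = ~np.isnan(a_i)
    Vo, ao = V[obs], a_i[obs]            # rows for answered tasks only
    d = V.shape[1]
    eta = rng.laplace(scale=gamma_size / eps, size=d)
    # tiny ridge keeps Vo^T Vo invertible when few tasks are answered
    A = Vo.T @ Vo + 1e-8 * np.eye(d)
    return np.linalg.solve(A, Vo.T @ ao - eta)
```

Because the noise enters the objective rather than the output, the distortion of $\vec{u}_i$ shrinks as the worker answers more tasks (the $V_o^T V_o$ term grows while $\vec{\eta}_i$ does not).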