Truth Inference on Sparse Crowdsourcing Data with Local Differential Privacy. IEEE BIG DATA '18. Haipei Sun (Stevens Institute of Technology, Hoboken, NJ), Boxiang Dong (Montclair State University, Montclair, NJ), Hui (Wendy) Wang (Stevens Institute of Technology, Hoboken, NJ), Ting Yu (Qatar Computing Research Institute, Doha, Qatar), Zhan Qin (The University of Texas at San Antonio, San Antonio, Texas). December 12, 2018
Crowdsourcing Workers Data Curator Tasks • Data curator releases tasks on a crowdsourcing platform. 2 / 36
Crowdsourcing Workers Answers Data Curator Answers Answers • Data curator releases tasks on a crowdsourcing platform. • The workers provide their answers to these tasks in exchange for a reward. 3 / 36
Privacy Concern Collecting answers from individual workers poses privacy risks. • Crowdsourcing-related applications collect sensitive personal information from workers. • By combining answers across a sequence of surveys, a data curator (DC) could determine the identities of workers. 4 / 36
Differential Privacy Differential privacy (DP) provides a rigorous privacy guarantee. [Figure: workers send their raw answers $x_1, x_2, \ldots, x_m$ to a trusted data curator, who publishes the noisy statistic $\frac{1}{m}\sum_{i=1}^{m} x_i + \xi$ to the public.] However, classical DP requires a trusted data curator to publish privatized statistical information. 5 / 36
Local Differential Privacy Local differential privacy (LDP) is the state-of-the-art approach for privacy-preserving data collection. [Figure: each worker sends a locally perturbed answer $\hat{x}_i = x_i + \xi_i$ to an untrusted data curator, who computes $f(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_m)$.] Before sending the answer to the data curator, each worker perturbs his/her private data locally. 6 / 36
Challenges I - Data Sparsity • Most workers only provide answers to a very small portion of the tasks. • We use NULL to represent the answer if a worker does not provide a response for a specific task.

Dataset            # of Workers   # of Tasks   Average Sparsity
Web [1]            34             177          0.705882
AdultContent [2]   825            11,040       0.993666

• NULL values should also be protected. • Careless perturbation of NULL values may significantly alter the original answer distribution. [1] http://dbgroup.cs.tsinghua.edu.cn/ligl/crowddata/ [2] https://github.com/ipeirotis/Get-Another-Label/tree/master/data 7 / 36
Challenges II - Data Utility • Truth inference estimates the true results from answers provided by workers of different quality. • Most truth inference algorithms iterate until convergence. • We aim to preserve the accuracy of truth inference on the perturbed worker answers. This is challenging because even a slight amount of initial noise in the worker answers may be propagated and amplified across iterations. 8 / 36
Our Contributions Extension to Existing Approaches • Laplace perturbation (LP) approach • Randomized response (RR) approach • Large expected error in the truth inference results Novel Approach We design a new matrix factorization (MF) perturbation algorithm to satisfy LDP, and guarantee small error. 9 / 36
Outline 1 Introduction 2 Related Work 3 Preliminaries 4 Perturbation Schemes • Laplace Perturbation (LP) • Randomized Response (RR) • Matrix Factorization (MF) 5 Experiments 6 Conclusion 10 / 36
Related Work Local differential privacy • Count, heavy hitters [HILM02, HIM02] • Graph synthesization [QYY + 17] • Linear regression [NXY + 16] Privacy-preserving crowdsourcing • Mutual information [KOV14] • Truth discovery on complete data [LMS + 18] Differentially private recommendation • Perturbation on categories [Can02, SJ14] • Iterative factorization [SKSX18] 11 / 36
Preliminaries - Local Differential Privacy (LDP) Definition ($\epsilon$-Local Differential Privacy) A randomized privatization mechanism $M$ satisfies $\epsilon$-local differential privacy ($\epsilon$-LDP) iff for any pair of answer vectors $\vec{a}$ and $\vec{a}'$ that differ in one cell, we have: $\forall \vec{z}_p \in Range(M): \frac{\Pr[M(\vec{a}) = \vec{z}_p]}{\Pr[M(\vec{a}') = \vec{z}_p]} \le e^{\epsilon}$, where $Range(M)$ denotes the set of all possible outputs of the algorithm $M$. 12 / 36
Preliminaries - Truth Inference • Associate each worker $W_i$ with a quality $q_i$. • For each task, estimate the truth by taking the quality-weighted average of the worker answers: $\hat{\mu}_j = \frac{\sum_{W_i \in \mathcal{W}_j} q_i \times a_{i,j}}{\sum_{W_i \in \mathcal{W}_j} q_i}$. • For each worker, estimate the quality by measuring the difference between his answers and the estimated truths: $q_i \propto \frac{1}{\sigma_i}$, where $\sigma_i = \sqrt{\frac{1}{|\mathcal{T}_i|} \sum_{t_j \in \mathcal{T}_i} (a_{i,j} - \hat{\mu}_j)^2}$. 13 / 36
Preliminaries - Truth Inference Iteratively update the estimated truth and worker quality until convergence [LLG + 14].

Algorithm 1 Truth inference
Require: The workers' answers {a_{i,j}}
Ensure: The estimated true answer (i.e., the truth) of tasks {μ̂_j} and the quality of workers {q_i}
1: Initialize worker quality q_i = 1/m for each worker W_i ∈ W;
2: while the convergence condition is not met do
3:   Estimate {μ̂_j};
4:   Estimate {q_i};
5: end while
6: return {μ̂_j} and {q_i};
14 / 36
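The alternating updates of Algorithm 1 can be sketched in Python. This is a minimal illustration, not the authors' code: `truth_inference` is a hypothetical helper name, NaN encodes a NULL answer, and the small floor on sigma (to avoid division by zero) is an added safeguard the slide does not mention.

```python
import numpy as np

def truth_inference(answers, n_iter=50, tol=1e-6):
    """Iterative truth inference as in Algorithm 1.
    answers: m x n array of worker answers; np.nan marks NULL."""
    m, n = answers.shape
    mask = ~np.isnan(answers)          # which worker answered which task
    q = np.full(m, 1.0 / m)            # step 1: initial qualities q_i = 1/m
    mu = np.zeros(n)
    for _ in range(n_iter):
        # estimate truths: quality-weighted average over available answers
        new_mu = np.array([
            np.average(answers[mask[:, j], j], weights=q[mask[:, j]])
            for j in range(n)
        ])
        # estimate qualities: inverse of each worker's deviation from the truths
        sigma = np.array([
            np.sqrt(np.mean((answers[i, mask[i]] - new_mu[mask[i]]) ** 2))
            for i in range(m)
        ])
        q = 1.0 / np.maximum(sigma, 1e-9)   # floor avoids division by zero
        if np.max(np.abs(new_mu - mu)) < tol:
            mu = new_mu
            break
        mu = new_mu
    return mu, q
```

In practice the loop exits either on the convergence test or after a fixed iteration budget, mirroring the "while the convergence condition is not met" guard.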
Preliminaries - Matrix Factorization Given $M \in \mathbb{R}^{m \times n}$, find $U \in \mathbb{R}^{m \times d}$ and $V \in \mathbb{R}^{n \times d}$ s.t. $L(M, U, V) = \sum_{(i,j) \in \Omega} (M_{i,j} - \vec{u}_i^T \vec{v}_j)^2$ is minimized. Each entry $M_{i,j}$ can be approximated by the inner product of $\vec{u}_i$ and $\vec{v}_j$, i.e., $M_{i,j} \approx \vec{u}_i^T \vec{v}_j$. 15 / 36
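Minimizing $L(M, U, V)$ over the observed cells $\Omega$ can be sketched with plain gradient descent. The slide does not prescribe an optimizer, so the learning rate, initialization scale, and `factorize` name here are all illustrative assumptions; NaN entries mark unobserved cells.

```python
import numpy as np

def factorize(M, d=2, steps=2000, lr=0.01, seed=0):
    """Minimize sum over observed (i,j) of (M_ij - u_i^T v_j)^2
    by gradient descent. Observed cells are the non-NaN entries of M."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = 0.1 * rng.standard_normal((m, d))   # assumed small random init
    V = 0.1 * rng.standard_normal((n, d))
    obs = ~np.isnan(M)
    M0 = np.where(obs, M, 0.0)
    for _ in range(steps):
        # residuals, restricted to the observed cells only
        R = np.where(obs, U @ V.T - M0, 0.0)
        # simultaneous gradient step on U and V
        U, V = U - lr * (R @ V), V - lr * (R.T @ U)
    return U, V
```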
Problem Statement Input: A set of workers $\{W_i\}$ and their answer vectors $A = \{\vec{a}_i\}$, and a privacy parameter $\epsilon$. Output: The perturbed answer vectors $A^P = \{M(\vec{a}_i) \mid \forall \vec{a}_i \in A\}$. Requirement • Privacy: $A^P$ satisfies $\epsilon$-LDP. • Utility: Accurate truth inference results from $A^P$, i.e., minimize $MAE(A^P) = \frac{\sum_{T_j \in \mathcal{T}} |\mu_j - \hat{\mu}_j|}{n}$. 16 / 36
Laplace Perturbation (LP) Step 1: Replace NULL values with some value $v$ in the answer domain $\Gamma$: $g(a_{i,j}) = v$ if $a_{i,j} = \text{NULL}$, and $g(a_{i,j}) = a_{i,j}$ if $a_{i,j} \ne \text{NULL}$. Step 2: Add Laplace noise to each answer: $L(\vec{a}_i) = \left( g(a_{i,1}) + Lap\!\left(\tfrac{|\Gamma|}{\epsilon}\right),\ g(a_{i,2}) + Lap\!\left(\tfrac{|\Gamma|}{\epsilon}\right),\ \ldots,\ g(a_{i,n}) + Lap\!\left(\tfrac{|\Gamma|}{\epsilon}\right) \right)$. 17 / 36
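The two LP steps can be sketched as follows; `laplace_perturb` and its parameter names are illustrative, and NaN stands in for NULL.

```python
import numpy as np

def laplace_perturb(a, gamma_size, eps, fill_value, rng=None):
    """LP scheme: (1) replace NULLs (np.nan) with a default value v from
    the domain, (2) add Laplace(|Gamma|/eps) noise to every answer."""
    rng = rng or np.random.default_rng(0)
    a = np.asarray(a, dtype=float)
    g = np.where(np.isnan(a), fill_value, a)            # step 1: fill NULLs
    noise = rng.laplace(loc=0.0, scale=gamma_size / eps, size=g.shape)
    return g + noise                                    # step 2: add noise
```

Note that the noise scale $|\Gamma|/\epsilon$ applies to every cell, including the formerly NULL ones, which is what makes the NULL pattern itself indistinguishable.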
Laplace Perturbation (LP) Theorem 1 (Expected MAE of LP) Given a set of answer vectors $A = \{\vec{a}_i\}$, let $A^P = \{\hat{\vec{a}}_i\}$ be the answer vectors after applying LP on $A$. Then the expected error $E\left[MAE(A^P)\right]$ of the estimated truth on $A^P$ must satisfy $E\left[MAE(A^P)\right] \le \frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{m} (q_i \times e^{LP}_{i,j})$, where $e^{LP}_{i,j} = (1 - s_i)\left(\phi_j + \frac{|\Gamma|}{\epsilon}\right) + s_i\left(\sqrt{\frac{2}{\pi}}\,\sigma_i + \frac{|\Gamma|}{\epsilon}\right)$, $\mu_j$ is the ground truth of task $T_j$, $\sigma_i$ is the standard deviation of worker $W_i$'s error, $s_i$ is the fraction of the tasks for which $W_i$ returns non-NULL values, and $\phi_j$ is the deviation between $\mu_j$ and the expected value $E(v)$ of $v$. 18 / 36
Laplace Perturbation (LP) Simple Setting • $q_i = \frac{1}{m}$, $\sigma_i = 1$, i.e., all workers have the same quality. • $\mu_j = 1$, i.e., all ground truths are 1. • $s_i = 0.1$, i.e., 10% of answers are not NULL. • $|\Gamma| = 10$. • $\epsilon = 1$. Expected Error: $E\left[MAE(A^P)\right] \le 14.13$. 19 / 36
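The 14.13 figure can be reproduced by plugging the simple setting into Theorem 1's per-answer error term. One assumption is made here that the slide leaves implicit: the deviation $\phi_j$ is taken to be 4.5, which is the value consistent with the stated bound.

```python
import math

# Simple setting from the slide; phi_j = 4.5 is an assumption
# (the slide does not state phi_j explicitly).
s_i, sigma_i, gamma, eps, phi_j = 0.1, 1.0, 10, 1.0, 4.5

e_lp = (1 - s_i) * (phi_j + gamma / eps) \
     + s_i * (math.sqrt(2 / math.pi) * sigma_i + gamma / eps)
# With q_i = 1/m and identical workers and tasks, the double sum
# over i and j collapses to e_lp itself.
print(round(e_lp, 2))  # 14.13
```

The bound is dominated by the $(1 - s_i)$ term: 90% of the cells are filled-in NULLs, each contributing both the fill deviation and the full Laplace noise.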
Randomized Response (RR) • Add NULL to the answer domain $\Gamma$. • For each answer $a_{i,j}$, apply randomized response: $\forall y \in \Gamma$, $\Pr[M(a_{i,j}) = y] = \frac{e^{\epsilon}}{|\Gamma| + e^{\epsilon}}$ if $y = a_{i,j}$, and $\frac{1}{|\Gamma| + e^{\epsilon}}$ if $y \ne a_{i,j}$. Each original answer either • remains unchanged with probability $\frac{e^{\epsilon}}{|\Gamma| + e^{\epsilon}}$, or • is replaced with a different value, each with probability $\frac{1}{|\Gamma| + e^{\epsilon}}$. 20 / 36
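A minimal sketch of this RR mechanism, under the assumption that `domain` already includes the NULL symbol (so the true value is kept with probability $e^{\epsilon}/(|\Gamma| + e^{\epsilon})$ and each of the $|\Gamma|$ alternatives is chosen with probability $1/(|\Gamma| + e^{\epsilon})$); the function name is illustrative.

```python
import math
import numpy as np

def randomized_response(answer, domain, eps, rng=None):
    """RR scheme: keep `answer` w.p. e^eps / (|Gamma| + e^eps), otherwise
    replace it with one of the |Gamma| other domain values uniformly.
    `domain` includes NULL as an explicit symbol."""
    rng = rng or np.random.default_rng()
    k = len(domain) - 1                      # |Gamma| alternative values
    p_keep = math.exp(eps) / (k + math.exp(eps))
    if rng.random() < p_keep:
        return answer                        # truthful report
    others = [v for v in domain if v != answer]
    return others[rng.integers(len(others))] # uniform random lie
```

Because NULL is an ordinary domain symbol here, the mechanism protects the fact of non-response with the same guarantee as any other answer value.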
Randomized Response (RR) Theorem 2 (Expected MAE of RR) Given a set of answer vectors $A = \{\vec{a}_i\}$, let $A^P = \{\hat{\vec{a}}_i\}$ be the answer vectors after applying RR on $A$. Then the expected error $E\left[MAE(A^P)\right]$ of the estimated truth on $A^P$ must satisfy $E\left[MAE(A^P)\right] \le \frac{1}{n} \sum_{j=1}^{n} \frac{\sum_{W_i \in \mathcal{W}_j} q_i \times e^{RR}_{i,j}}{\sum_{W_i \in \mathcal{W}_j} q_i}$, where $e^{RR}_{i,j} = (1 - s_i)\left|\mu_j - \frac{\sum_{y \in \Gamma} y}{e^{\epsilon} + |\Gamma|}\right| + s_i \sum_{x \in \Gamma} N(x; \mu_j, \sigma_i)\left|\mu_j - \sum_{y \in \Gamma} y P_{xy}\right|$, $s_i$ is the fraction of tasks for which worker $W_i$ returns non-NULL values, and $P_{xy}$ is the probability that value $x$ is replaced with $y$. 21 / 36
Randomized Response (RR) Simple Setting • $q_i = \frac{1}{m}$, $\sigma_i = 1$, i.e., all workers have the same quality. • $\mu_j = 0$, i.e., all ground truths are 0. • $s_i = 0.1$, i.e., 10% of answers are not NULL. • $\Gamma = [0, 9]$. • $\epsilon = 1$. Expected Error: $E\left[MAE(A^P)\right] \le 3.551$. 22 / 36
Matrix Factorization (MF) • DC randomly generates the task profile matrix $V \in \mathbb{R}^{n \times d}$, and sends both $V$ and the tasks $\mathcal{T}$ to the workers. [Figure: DC broadcasts $(V, \mathcal{T})$ to all workers.] 23 / 36
Matrix Factorization (MF) • DC randomly generates the task profile matrix $V \in \mathbb{R}^{n \times d}$, and sends both $V$ and the tasks $\mathcal{T}$ to the workers. • Every worker generates the answers $\vec{a}_i$, and returns the differentially private answer profile vector $\vec{u}_i$. [Figure: each worker sends $\vec{u}_i$ to the DC, who reconstructs the answers as $\vec{a}_i \approx V \vec{u}_i$.] 24 / 36
Matrix Factorization (MF) Instead of directly adding noise to $\vec{u}_i$, we design a novel approach based on objective perturbation to reduce the distortion: $\vec{u}_i = \arg\min_{\vec{u}_i} L^{DP}(\vec{a}_i, \vec{u}_i, V)$, where $L^{DP}(\vec{a}_i, \vec{u}_i, V) = \sum_{T_j \in \mathcal{T}_i} (a_{i,j} - \vec{u}_i^T \vec{v}_j)^2 + 2\,\vec{u}_i^T \vec{\eta}_i$, and $\vec{\eta}_i = \left( Lap\!\left(\tfrac{|\Gamma|}{\epsilon}\right), \ldots, Lap\!\left(\tfrac{|\Gamma|}{\epsilon}\right) \right)$ is a $d$-dimensional noise vector. 25 / 36
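Since the perturbed objective is quadratic in $\vec{u}_i$, setting its gradient to zero gives a closed-form minimizer. The sketch below derives and solves it per worker; the closed form, the `private_profile` name, and the small ridge term (added for invertibility when a worker answered few tasks) are all this writeup's assumptions, not details from the slides.

```python
import numpy as np

def private_profile(a_i, V, eps, gamma_size, rng=None):
    """Objective perturbation for one worker (sketch).
    Minimizes sum_j (a_ij - u^T v_j)^2 + 2 u^T eta over u, where eta is a
    d-vector of Lap(|Gamma|/eps) noise and only the worker's answered
    tasks (non-NaN entries of a_i) enter the loss.
    Gradient = 0 gives  (Vo^T Vo) u = Vo^T ao - eta."""
    rng = rng or np.random.default_rng(0)
    obs = ~np.isnan(a_i)
    Vo, ao = V[obs], a_i[obs]            # rows for answered tasks only
    d = V.shape[1]
    eta = rng.laplace(scale=gamma_size / eps, size=d)
    # tiny ridge keeps Vo^T Vo invertible when few tasks are answered
    A = Vo.T @ Vo + 1e-8 * np.eye(d)
    return np.linalg.solve(A, Vo.T @ ao - eta)
```

Because the noise enters the objective rather than the output, the distortion of $\vec{u}_i$ shrinks as the worker answers more tasks (the $V_o^T V_o$ term grows while $\vec{\eta}_i$ does not).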