
Expectation-Maximization Tensor Factorization for Practical Location Privacy Attacks



  1. Expectation-Maximization Tensor Factorization for Practical Location Privacy Attacks
     Takao Murakami (AIST*, Japan)
     *AIST: National Institute of Advanced Industrial Science & Technology

  2. Outline
     • Markov Chain Model-based Attacks [Shokri+,S&P11] [Gambs+,JCSS14] [Mulder+,WPES08] [Xue+,ICDE13] etc.
       An attacker models each user's movements between regions x_i as a transition matrix (time interval: e.g. 30 min or 1 hour) and can de-anonymize traces (or infer locations) with high accuracy when the amount of training data is very large.
     • In reality, training data can be sparsely distributed over time: many users disclose a small number of locations not continuously but "sporadically" via SNS (e.g. one or two check-ins per day/week/month), so training traces contain many missing locations ("?").
     [Figure: an anonymized mobility trace (pseudonym 63427) is de-anonymized using trained transition matrices; a sporadic training trace leaves most elements of the trained transition matrix unobserved]

  3. Outline
     • Worst-case scenario for attackers (= reality?): no elements are observed in P2 and P3, so the attacker cannot de-anonymize u2 and u3.
     [Figure: training traces of u1-u4 and their ML-estimated (ML: Maximum Likelihood Estimation) transition matrices P1-P4 over regions x1-x5; P2 and P3 consist entirely of unobserved elements ("?")]
     Q. Is it possible to de-anonymize traces using such training data?
     • Our Contributions
       • We show the answer is "yes".
       • We propose a training method that outperforms a random guess even when no elements are observed in more than 70% of cases.

  4. Contents
     Introduction (Location Privacy, Related Work)
     Our Proposal (EMTF: Expectation-Maximization Tensor Factorization)
     Experiments

  5. Location Privacy
     • Location-based Services (LBS)
       • Many people use LBS (e.g. maps, route finding, check-ins).
       • "Spatial Big Data" can be provided to a third party for analysis (e.g. finding popular places) or made public to provide traffic information.
     • Privacy Issues
       • A mobility trace can contain sensitive locations (e.g. homes, hospitals).
       • An anonymized trace may be de-anonymized (e.g. using a Markov chain model).
     [Figure: users send mobility traces to the LBS provider, which aggregates them into Spatial Big Data; an anonymized trace (pseudonym 63427) is de-anonymized back to its mobility trace]

  6. Related Work [Shokri+,S&P11] [Gambs+,JCSS14] [Mulder+,WPES08] etc.
     • Markov Chain Model for De-anonymization
       • Attacker = anyone who has the anonymized traces (except the LBS provider).
       • The attacker obtains training locations that are made public (e.g. via SNS) and trains a transition matrix P_n for each user u_n, whose element p_ij is the probability of moving from region x_i to region x_j.
       • The attacker then de-anonymizes the traces using the trained transition matrices.
     [Figure: training traces yield matrices P1 ... PN for users u1 ... uN; the LBS provider anonymizes the users' mobility traces under pseudonyms 32091 and 29619 and passes them to a third party for analysis, and the attacker de-anonymizes them]
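To make the two steps concrete, here is a minimal Python sketch (an illustration, not the authors' code): traces are assumed to be lists of region indices, each matrix is the maximum-likelihood estimate from one user's trace, and an anonymized trace is assigned to the user whose matrix gives it the highest log-likelihood, a standard matching rule assumed here for illustration.

```python
import numpy as np

def train_transition_matrix(trace, n_regions):
    """Maximum-likelihood estimate of one user's transition matrix
    from a fully observed trace of region indices."""
    counts = np.zeros((n_regions, n_regions))
    for a, b in zip(trace, trace[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    # Regions never left in the trace get a uniform row.
    return np.where(rows > 0, counts / np.maximum(rows, 1), 1.0 / n_regions)

def deanonymize(anon_trace, matrices):
    """Match an anonymized trace to the user whose matrix
    gives it the highest log-likelihood."""
    eps = 1e-12  # avoid log(0) for transitions unseen in training
    scores = [
        sum(np.log(P[a, b] + eps) for a, b in zip(anon_trace, anon_trace[1:]))
        for P in matrices
    ]
    return int(np.argmax(scores))
```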

  7. Related Work
     • Sporadic Training Data (training data are sparsely distributed over time)
       • Many users disclose a small number of locations "sporadically" (via SNS).
       • If we don't estimate missing locations, we cannot train P2 and P3, and therefore cannot de-anonymize the traces of u2 and u3 using these matrices.
     • We need to "somehow" estimate the missing locations.
     [Figure: sporadic training traces of u1-u4 with missing locations ("?"); ML estimation leaves P2 and P3 completely unobserved, so the anonymized traces 32091 (u2) and 29619 (u3) cannot be de-anonymized]

  8. Related Work
     • Gibbs Sampling Method [Shokri+,S&P11]
       • Alternates between estimating P_n and estimating the missing locations of u_n, independently of other users (a sketch of this alternation follows below).
     • Challenge: when there are few continuous locations in training traces,
       (1) P_n cannot be estimated accurately, and
       (2) missing locations cannot be estimated accurately using P_n (because of (1)).
     • We address this challenge by estimating P_n with the help of "other users" (instead of estimating P_n independently).
     [Figure: a training trace of u_n with missing locations ("?"); the method loops between estimating the matrix P_n from the trace and estimating the missing locations using P_n]
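As a rough illustration of the alternation mentioned above, here is a schematic Python sketch under simplifying assumptions (one user in isolation, missing entries as None, add-one smoothing); it is not the exact sampler of [Shokri+,S&P11].

```python
import numpy as np

def gibbs_style_training(obs, n_regions, n_iters=50, rng=None):
    """Alternate between (a) re-estimating one user's transition matrix
    from the currently completed trace and (b) resampling each missing
    location given its neighbors. `obs` holds region indices or None."""
    rng = np.random.default_rng() if rng is None else rng
    trace = [x if x is not None else int(rng.integers(n_regions)) for x in obs]
    P = np.full((n_regions, n_regions), 1.0 / n_regions)
    for _ in range(n_iters):
        # (a) ML estimate with add-one smoothing so sampling stays defined
        counts = np.ones((n_regions, n_regions))
        for a, b in zip(trace, trace[1:]):
            counts[a, b] += 1
        P = counts / counts.sum(axis=1, keepdims=True)
        # (b) resample each missing location from Pr(z_t | z_{t-1}, z_{t+1})
        for t, x in enumerate(obs):
            if x is not None:
                continue
            p = np.ones(n_regions)
            if t > 0:
                p *= P[trace[t - 1], :]
            if t < len(trace) - 1:
                p *= P[:, trace[t + 1]]
            trace[t] = int(rng.choice(n_regions, p=p / p.sum()))
    return P, trace
```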

  9. Contents
     Introduction (Location Privacy, Related Work)
     Our Proposal (EMTF: Expectation-Maximization Tensor Factorization)
     Experiments

  10. Overview of EMTF
     We use the help of "similar users" (other users who have similar behavior). Each matrix captures a unique feature of each user's behavior, since each trace is accurate and user-specific.
     • (1) Training Transition Matrices via TF (Tensor Factorization) [Murakami+,TIFS16]: we estimate unobserved elements ("?") with the help of "similar users", and substitute the average matrix over all users for completely unobserved matrices.
     • (2) Estimating Missing Locations via EM (Expectation-Maximization): we estimate the missing locations in the training traces (which we can do with the help of "similar users"), then go back to (1).
     [Figure: per-user transition matrices over regions x1-x5 with unobserved elements ("?") completed using similar users' matrices; estimated locations fill the gaps in the training traces]
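The average-matrix fallback in step (1) is simple enough to state directly; a minimal sketch, assuming the matrices are stacked in a (users x regions x regions) NumPy array and at least one user has observed data:

```python
import numpy as np

def fill_unobserved_matrices(matrices, has_observation):
    """Substitute the average matrix over users with data for users
    whose matrix has no observed element at all.
    `has_observation[n]` is True if user n's matrix has any observed entry."""
    avg = matrices[has_observation].mean(axis=0)  # population-level behavior
    out = matrices.copy()
    out[~has_observation] = avg
    return out
```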

  11. Details of EMTF
     • TF (Tensor Factorization)
       • Used for item recommendation; factorizes a tensor into low-rank matrices.
       • Estimates unobserved elements ("?") with the help of "similar users".
     • EM (Expectation-Maximization)
       • Trains a parameter Θ from observed data x while estimating missing data z.
       • Each EM cycle is guaranteed to increase the posterior probability Pr(Θ|x).
     • In EMTF, the parameter Θ is the set of all users' transition matrices (a 3rd-order tensor), x = (x2, x3, x1, x4, x3, x5) is the observed locations, and z = (x3, x4, x2, x4, x1, x4) is the missing locations. We can find the most probable Θ and z with the help of "similar users".
     [Figure: the E-step estimates the missing data z in the traces; the M-step trains the parameter Θ via TF]
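To show how a low-rank factorization lets "similar users" fill in one another's unobserved elements, here is a generic CP-style sketch fitted only on the observed entries of the (user x region x region) tensor; the rank, squared loss, and gradient updates are illustrative assumptions, not the factorization model of [Murakami+,TIFS16].

```python
import numpy as np

def cp_complete(tensor, mask, rank=8, n_iters=500, lr=0.01, lam=0.1, seed=0):
    """Fit rank-`rank` CP factors to the observed cells (mask == True) of a
    3rd-order tensor, then predict every cell from the factors. Because the
    user factors U live in a shared low-rank space, users with similar
    behavior inform each other's unobserved elements."""
    rng = np.random.default_rng(seed)
    n_u, n_i, n_j = tensor.shape
    U = 0.1 * rng.standard_normal((n_u, rank))
    V = 0.1 * rng.standard_normal((n_i, rank))
    W = 0.1 * rng.standard_normal((n_j, rank))
    for _ in range(n_iters):
        recon = np.einsum('ur,ir,jr->uij', U, V, W)
        err = np.where(mask, recon - tensor, 0.0)  # unobserved cells: no loss
        U -= lr * (np.einsum('uij,ir,jr->ur', err, V, W) + lam * U)
        V -= lr * (np.einsum('uij,ur,jr->ir', err, U, W) + lam * V)
        W -= lr * (np.einsum('uij,ur,ir->jr', err, U, V) + lam * W)
    # Rows can be clipped to nonnegative values and renormalized afterwards
    # to obtain proper transition matrices.
    return np.einsum('ur,ir,jr->uij', U, V, W)
```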

  12. EMTF Algorithm
     • E-step: Estimate the distribution of the missing location vector z via the forward-backward algorithm:
       Q(z) := Pr(z | x, Θ̂)
     • M-step: Estimate the parameter Θ̂ via TF, given Q(z):
       Θ̂ = argmax_{Θ≥0} Σ_z Q(z) log Pr(x, z | Θ)
          = argmin_{Θ≥0} Σ_z Q(z) (||A − Â||_F² + λ||Θ||_F²)
       (a quadratic problem w.r.t. one parameter: the maximum of the log-posterior is the minimum of a regularized squared error, where A is the tensor formed from x and z, and Â is its low-rank reconstruction from Θ)
     [Figure: the E-step estimates locations in the traces; the M-step factorizes the tensor A; x = (x2, x3, x1, x4, x3, x5), z = (x3, x4, x2, x4, x1, x4)]
     • Time complexity is exponential in the number of missing locations.
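The E-step's forward-backward pass is the standard HMM recursion; a textbook sketch for one trace, where obs[t] is a region index or None for a missing location, P is the current transition matrix, and pi is an assumed initial distribution:

```python
import numpy as np

def forward_backward(obs, P, pi):
    """Posterior marginals Pr(z_t = x_i | observed locations) for a Markov
    chain whose states are directly observed wherever obs[t] is not None."""
    n, T = len(P), len(obs)
    # Observed step: one-hot evidence; missing step: uninformative evidence.
    ev = lambda t: np.ones(n) if obs[t] is None else np.eye(n)[obs[t]]
    alpha = np.zeros((T, n))              # forward pass
    alpha[0] = pi * ev(0)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ P) * ev(t)
    beta = np.ones((T, n))                # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = P @ (ev(t + 1) * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```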

  13. Approximation of EMTF
     • Time Complexity of EMTF
       • The number of possible missing location vectors z is exponential in its length: e.g. with #(regions) = 256 and #(missing locations) = 8, there are 256^8 = 2^64 possible values of z.
     • Two Approximation Methods (both reduce the time complexity from exponential to linear):
       • [Method I] Viterbi: approximates Q(z) by its most probable value z*.
       • [Method II] FFBS (Forward Filtering Backward Sampling): approximates Q(z) by random samples z_1, ..., z_S, which approximates Q(z) more accurately.
     [Figure: a training trace over regions such as x140-x256 with missing locations; Viterbi keeps only the mode z* of Q(z), while FFBS keeps S random samples z_1, ..., z_S]
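A minimal sketch of [Method II]: forward filtering computes the same alpha messages as in the forward-backward sketch, and one backward pass then samples a complete z; drawing S such samples approximates Q(z) in time linear in the trace length (again with obs, P, and pi as assumed inputs).

```python
import numpy as np

def ffbs_sample(obs, P, pi, rng):
    """Draw one sample z ~ Pr(z | x, Theta) by Forward Filtering
    Backward Sampling; average over S calls to approximate Q(z)."""
    n, T = len(P), len(obs)
    ev = lambda t: np.ones(n) if obs[t] is None else np.eye(n)[obs[t]]
    alpha = np.zeros((T, n))              # forward filtering
    alpha[0] = pi * ev(0)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ P) * ev(t)
    z = np.zeros(T, dtype=int)            # backward sampling
    p = alpha[-1] / alpha[-1].sum()
    z[-1] = rng.choice(n, p=p)
    for t in range(T - 2, -1, -1):
        p = alpha[t] * P[:, z[t + 1]]     # Pr(z_t | z_{t+1}, x_{1:t})
        z[t] = rng.choice(n, p=p / p.sum())
    return z
```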

  14. Contents
     Introduction (Location Privacy, Related Work)
     Our Proposal (EMTF: Expectation-Maximization Tensor Factorization)
     Experiments

  15. Experimental Set-up (here we explain only the most important part; please see our paper for details)
     • Gowalla Dataset
       • We used traces in New York & Philadelphia (16 x 16 regions).
       • Training: 250 users x 1 trace x 10 locations (time interval: more than 30 min).
       • Testing: 250 users x 9 traces x 10 locations.
       • We randomly deleted each training location with probability 80%, producing extremely sporadic training data (the worst-case scenario for attackers): no elements in a matrix were observed in more than 70% of cases.
     [Figure: a training trace reduced to isolated locations and the resulting ML-estimated (ML: Maximum Likelihood Estimation) transition matrix, which is completely unobserved in more than 70% of cases]
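The 80% deletion step can be reproduced as follows (a hypothetical helper, assuming traces are lists of region indices and missing locations are represented as None):

```python
import numpy as np

def sporadify(trace, p_delete=0.8, rng=None):
    """Independently delete each training location with probability
    p_delete, mimicking the paper's extremely sporadic setting."""
    rng = np.random.default_rng() if rng is None else rng
    return [None if rng.random() < p_delete else x for x in trace]
```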
