ANALYZING MASSIVE DATASETS WITH MISSING ENTRIES: MODELS AND ALGORITHMS
Nithin Varma
Thesis Advisor: Sofya Raskhodnikova
Algorithms for massive datasets
■ Cannot read the entire dataset
  – Sublinear-time algorithms
■ Performance metrics
  – Speed
  – Memory efficiency
  – Accuracy
  – Resilience to faults in data
Faults in datasets
■ Wrong entries (errors)
  – sublinear algorithms
  – machine learning
  – error detection and correction
■ Missing entries (erasures): our focus
Occurrence of erasures: reasons
■ Hidden friend relations on social networks
■ Data collection
■ Adversarial deletion
■ Accidental deletion
Large dataset with erasures: access
■ Algorithm queries the oracle for dataset entries (functions, codewords, graphs)
■ Algorithm does not know in advance what's erased
■ Oracle returns:
  – the nonerased entry, or
  – the special symbol ⊥ if the queried point is erased
[Diagram: algorithm ↔ oracle interaction; image source: CC BY-SA]
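To make this access model concrete, here is a minimal Python sketch of such an oracle. The function names and the use of None to stand in for ⊥ are illustrative choices, not something from the slides.

```python
def make_erasure_oracle(data, erased):
    """Oracle access to a dataset with erasures: a query returns the entry at
    position `pos`, or None (standing in for ⊥) if that entry is erased.
    The algorithm never sees `data` or `erased` directly, only query answers."""
    erased = set(erased)

    def oracle(pos):
        return None if pos in erased else data[pos]

    return oracle


# Usage: the algorithm interacts only with `oracle`.
# oracle = make_erasure_oracle([3, 1, 4, 1, 5], erased={2})
# oracle(0) -> 3        oracle(2) -> None  (i.e., ⊥)
```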
Overview of our contributions
■ Functions: Erasure-resilient testing [Dixit, Raskhodnikova, Thakurta & Varma '18; Kalemaj, Raskhodnikova & Varma]
■ Codewords: Local erasure-decoding [Raskhodnikova, Ron-Zewi & Varma '19]
  – Application to property testing
■ Graphs:
  – Erasure-resilient sublinear-time algorithms for graphs [Levi, Pallavoor, Raskhodnikova & Varma]
  – Sensitivity of graph algorithms to missing edges [Varma & Yoshida]
Outline
■ Erasures in property testing
■ Erasures in local decoding
■ Average sensitivity of graph algorithms
  – Definition
  – Main results
■ Average sensitivity of approximate maximum matching
■ Current and future directions
Outline
■ Erasures in property testing (this section)
■ Erasures in local decoding
■ Average sensitivity of graph algorithms
  – Definition
  – Main results
■ Average sensitivity of approximate maximum matching
■ Current and future directions
Decision problem
■ Can't solve nontrivial decision problems without full access to the input
■ Need a notion of approximation
[Diagram: universe of inputs; NO instances must be rejected w.p. ≥ 2/3, YES instances accepted w.p. ≥ 2/3]
Property testing [Rubinfeld & Sudan '96; Goldreich, Goldwasser & Ron '98]
■ ε-far from a property:
  – ≥ ε fraction of the values must be changed to satisfy the property
[Diagram: inputs ε-far from the property → Reject w.p. ≥ 2/3; inputs with the property → Accept w.p. ≥ 2/3; an ε-tester distinguishes the two]
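As a toy illustration of the definition (not a property from the talk), here is a one-sided-error ε-tester for the property "f is identically 0" on f: {0, …, n−1} → {0,1}. If f is ε-far from this property, at least εn values are nonzero, so ⌈2/ε⌉ uniform samples find one with probability more than 2/3.

```python
import math
import random


def test_all_zeros(f, n, eps):
    """Toy ε-tester for "f is identically 0" on f: {0,...,n-1} -> {0,1}.
    Accepts all-zero functions with probability 1; rejects functions that are
    ε-far (≥ εn nonzero values) with probability ≥ 1 - e^(-2) > 2/3."""
    for _ in range(math.ceil(2 / eps)):
        if f(random.randrange(n)) != 0:
            return False  # reject: found a point violating the property
    return True  # accept: no violation seen
```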
(Error-)tolerant testing [Parnas, Ron & Rubinfeld '06]
■ ≤ α fraction of the input is wrong
■ α-close to a property:
  – ≤ α fraction of the values can be changed to satisfy the property
[Diagram: inputs ε-far from the property → Reject w.p. ≥ 2/3; inputs α-close to the property → Accept w.p. ≥ 2/3; an (α, ε)-tolerant tester distinguishes the two]
Erasure-resilient testing [Dixit, Raskhodnikova, Thakurta & Varma '16]
■ ≤ α fraction of the input is erased
■ Worst-case erasures, made before the tester queries
■ Completion: fill in values at the erased points
[Diagram: inputs whose every completion is ε-far → Reject w.p. ≥ 2/3; inputs that can be completed to satisfy the property → Accept w.p. ≥ 2/3; an (α, ε)-erasure-resilient tester distinguishes the two]
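Continuing the toy example from above (again an illustration, not a result from the talk): the same sampling tester becomes erasure-resilient by resampling whenever a query returns ⊥. If every completion is ε-far from all-zeros, then the completion setting all erased points to 0 already has ≥ εn nonzero values, all of them non-erased, so uniform sampling over non-erased points still finds a violation quickly; the erasure fraction α only affects the expected number of oracle calls.

```python
import math
import random


def er_test_all_zeros(oracle, n, eps):
    """Toy (α, ε)-erasure-resilient tester for "f is identically 0".
    `oracle(i)` returns f(i) in {0,1} or None (⊥) if position i is erased.
    Resampling on ⊥ makes the effective query distribution uniform over the
    non-erased points, where an input whose every completion is ε-far still
    has at least an ε fraction of violations."""
    samples = 0
    while samples < math.ceil(2 / eps):
        value = oracle(random.randrange(n))
        if value is None:
            continue  # erased point: query again with fresh randomness
        samples += 1
        if value != 0:
            return False  # reject
    return True  # accept
```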
Relationship between models
[Diagram: nested testing models; tolerant testing ⊆ erasure-resilient testing ⊆ standard testing]
Erasure-resilient testing: our results [Dixit, Raskhodnikova, Thakurta & Varma '18]
■ Blackbox transformations
■ Efficient erasure-resilient testers for other properties
■ Separation of standard and erasure-resilient testing
Our blackbox transformations
■ Make certain classes of uniform testers erasure-resilient
■ Work by simply repeating the original tester (a simplified sketch of the repetition idea follows below)
■ Query complexity of the (α, ε)-erasure-resilient tester equals that of the ε-tester, for all α ∈ (0,1), ε ∈ (0,1)
■ Apply to:
  – Monotonicity over general partial orders [FLNRRS02]
  – Convexity of black-and-white images [BMR15]
  – Boolean functions with at most k alternations in values
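The slides only state that the transformation works by repetition; the Python sketch below shows one plausible way to wire that up and is not the exact transformation or analysis from [Dixit, Raskhodnikova, Thakurta & Varma '18]. The interface `uniform_tester(query, n)` returning an accept/reject boolean is an assumption made for illustration.

```python
def make_erasure_resilient(uniform_tester, repetitions=10):
    """Hedged sketch of the repetition idea: run a uniform tester repeatedly
    against the erasure oracle and trust the first run whose queries all landed
    on non-erased points. Since the tester's queries are uniformly distributed
    and only an α fraction of the points is erased, a constant number of
    repetitions suffices in expectation when the query complexity is constant."""

    def erasure_resilient_tester(oracle, n):
        for _ in range(repetitions):
            hit_erasure = False

            def query(i):
                nonlocal hit_erasure
                value = oracle(i)
                if value is None:
                    hit_erasure = True
                    return 0  # dummy value; this run will be discarded anyway
                return value

            verdict = uniform_tester(query, n)
            if not hit_erasure:
                return verdict  # this run never saw ⊥: use its decision
        return True  # every run hit an erasure (unlikely for small α): accept by default

    return erasure_resilient_tester
```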
Main properties that we study
■ Monotonicity, Lipschitz properties, and convexity of real-valued functions
■ Widely studied in property testing [EKKRV00, DGLRRS99, LR01, FLNRRS02, PRR03, AC04, F04, HK04, BRW05, PRR06, ACCL07, BGJRW12, BCGM10, BBM11, AJMS12, DJRT13, JR13, CS13a, CS13b, BlRY14, CST14, BB15, CDJS15, CDST15, BB16, CS16, KMS18, BCS18, PRV18, B18, CS19, …]
■ Optimal testers for these properties are not uniform testers
  – Our blackbox transformation does not apply
Optimal erasure-resilient testers
■ For functions f: [n] → ℝ:
  – Monotonicity
  – Lipschitz properties
  – Convexity
  Query complexity of the (α, ε)-erasure-resilient tester equals that of the ε-tester, for all α ∈ (0,1), ε ∈ (0,1)
■ For functions f: [n]^d → ℝ:
  – Monotonicity
  – Lipschitz properties
  Query complexity of the (α, ε)-erasure-resilient tester equals that of the ε-tester, for ε ∈ (0,1) and α = O(ε/d)
Separation of erasure-resilient and standard testing
Theorem: There exists a property P of inputs of size n such that:
• P is testable in the standard model with a constant number of queries, but
• every erasure-resilient tester for P needs Ω̃(n) queries.
Relationship between models
[Diagram: nested testing models; tolerant testing ⊆ erasure-resilient testing ⊆ standard testing]
Some containments are strict:
• [Fischer & Fortnow 05]: standard vs. tolerant
• [Dixit, Raskhodnikova, Thakurta & Varma 18]: standard vs. erasure-resilient
Outline
■ Erasures in property testing
■ Erasures in local decoding (this section)
■ Average sensitivity of graph algorithms
  – Definition
  – Main results
■ Average sensitivity of approximate maximum matching
■ Current and future directions
Local decoding
■ Error-correcting code C: Σ^n → Σ^N, for N > n
  Message x → Encoder → C(x) → Channel → Received word w
■ Decoding: recover x from w if there are not too many errors or erasures
■ Local decoder: sublinear-time algorithm for decoding (recovers one symbol of x from few queries to w)
Local decoding is extensively studied and has many applications [GL89, BFLS91, BLR93, GLRSW91, GS92, PS94, BIKR93, KT00, STV01, Y08, E12, DGY11, BET10, …]
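The slides do not fix a particular code. As a standard illustration of local decoding under erasures, here is the classic two-query local decoder for the Hadamard code, adapted to retry when a query hits ⊥. When there are only erasures and no errors, every answer it does obtain is correct, which is one way to see why erasures are easier than errors.

```python
import random


def int_to_bits(a, k):
    return [(a >> j) & 1 for j in range(k)]


def bits_to_int(bits):
    return sum(bit << j for j, bit in enumerate(bits))


def hadamard_encode(x):
    """Hadamard codeword of x in {0,1}^k: one parity <a, x> mod 2 per a in {0,1}^k."""
    k = len(x)
    return [sum(ai * xi for ai, xi in zip(int_to_bits(a, k), x)) % 2
            for a in range(2 ** k)]


def local_decode_bit(oracle, k, i, max_tries=1000):
    """Recover bit i of the message with two queries to an erasure oracle for the
    codeword; oracle(pos) returns the codeword bit or None (⊥) if pos is erased."""
    for _ in range(max_tries):
        a = [random.randint(0, 1) for _ in range(k)]
        b = a.copy()
        b[i] ^= 1  # b = a XOR e_i
        va, vb = oracle(bits_to_int(a)), oracle(bits_to_int(b))
        if va is None or vb is None:
            continue  # a query hit an erasure: retry with fresh randomness
        return (va + vb) % 2  # <a, x> + <a XOR e_i, x> = x_i  (mod 2)
    raise RuntimeError("queries kept hitting erased positions")
```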
Local decoding and property testing: our results [Raskhodnikova, Ron-Zewi & Varma '19]
■ Initiate the study of erasures in the context of local decoding
■ Erasures are easier than errors in local decoding
■ Separation between erasure-resilient and (error-)tolerant testing
Separation of erasure-resilient and tolerant testing [Raskhodnikova, Ron-Zewi & Varma '19]
Theorem: There exists a property P of inputs of size n such that:
• P is erasure-resiliently testable with a constant number of queries, but
• every (error-)tolerant tester for P needs n^Ω(1) queries.
Relationship between models
[Diagram: nested testing models; tolerant testing ⊆ erasure-resilient testing ⊆ standard testing]
All containments are strict:
• [Fischer & Fortnow 05]: standard vs. tolerant
• [Dixit, Raskhodnikova, Thakurta & Varma 18]: standard vs. erasure-resilient
• [Raskhodnikova, Ron-Zewi & Varma 19]: erasure-resilient vs. tolerant
Outline
■ Erasures in property testing
■ Erasures in local decoding
■ Average sensitivity of graph algorithms (this section)
  – Definition
  – Main results
■ Average sensitivity of approximate maximum matching
■ Current and future directions
Motivation
■ Want to solve optimization problems on large graphs
  – Maximum matching, minimum vertex cover, minimum cut, …
■ Cannot assume that we get access to the true graph
  – A fraction of the edges, say 1%, might be missing
■ Need algorithms that are robust to missing edges
Towards average sensitivity
■ Want to solve the problem on G = (V, E); only have access to G' = (V, E') with E' ⊆ E
■ Want the outputs of Algorithm A to be close: A(G) ≈ A(G')
■ Similar to robustness notions in differential privacy [Dwork, Kenthapadi, McSherry, Mironov & Naor 06; Dwork, McSherry, Nissim & Smith 06], learning theory [Bousquet & Elisseeff 02], …
Average sensitivity: deterministic algorithms [Varma & Yoshida]
■ A: deterministic graph algorithm outputting a set of edges or vertices
  – e.g., A outputs a maximum matching
■ Average sensitivity of a deterministic algorithm A:
  s_A(G) = avg_{e ∈ E} [ Ham(A(G), A(G − e)) ]
■ s_A : 𝒢 → ℝ, where 𝒢 denotes the universe of input graphs
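A direct way to unpack this definition is a brute-force evaluation over all single-edge deletions. This is a sketch: it assumes networkx for graph handling and uses a simple deterministic greedy matching as a stand-in for "an algorithm outputting a set of edges"; neither choice comes from the slides.

```python
import networkx as nx


def average_sensitivity(algorithm, G):
    """Brute-force s_A(G) = avg over edges e of Ham(A(G), A(G - e)), where the
    Hamming distance between two output sets is the size of their symmetric
    difference."""
    out_G = set(algorithm(G))
    total = 0
    for e in G.edges:
        H = G.copy()
        H.remove_edge(*e)
        total += len(out_G ^ set(algorithm(H)))
    return total / G.number_of_edges()


def greedy_matching(G):
    """Deterministic greedy maximal matching (an illustrative choice of A)."""
    matched, matching = set(), set()
    for u, v in sorted(G.edges()):
        if u not in matched and v not in matched:
            matching.add(tuple(sorted((u, v))))  # normalize edge orientation
            matched.update((u, v))
    return matching


# Example: average_sensitivity(greedy_matching, nx.path_graph(6))
```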
Average sensitivity: randomized algorithms [Varma & Yoshida]
■ Average sensitivity of a randomized algorithm A:
  s_A(G) = avg_{e ∈ E} [ Dist(A(G), A(G − e)) ]
  where Dist measures the distance between the output distributions of A on G and on G − e
■ s_A : 𝒢 → ℝ, where 𝒢 denotes the universe of input graphs
■ An algorithm with low average sensitivity is stable-on-average
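The slides leave Dist abstract. As one concrete stand-in (an assumption made for illustration; the paper's choice of distance between output distributions may differ, and empirical total variation estimates are noisy for small trial counts), here is a Monte-Carlo estimate built on the brute-force routine above.

```python
from collections import Counter

import networkx as nx


def empirical_tv_distance(algorithm, G1, G2, trials=200):
    """Estimate the total variation distance between the output distributions of a
    randomized algorithm on G1 and on G2, from `trials` independent runs each."""
    def empirical(G):
        counts = Counter(frozenset(algorithm(G)) for _ in range(trials))
        return {out: c / trials for out, c in counts.items()}

    p, q = empirical(G1), empirical(G2)
    return 0.5 * sum(abs(p.get(out, 0.0) - q.get(out, 0.0)) for out in set(p) | set(q))


def randomized_average_sensitivity(algorithm, G, trials=200):
    """Plug the estimated distribution distance into the average-sensitivity formula."""
    total = 0.0
    for e in G.edges:
        H = G.copy()
        H.remove_edge(*e)
        total += empirical_tv_distance(algorithm, G, H, trials)
    return total / G.number_of_edges()
```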