Algorithms for Querying Noisy Distributed/Streaming Datasets
Qin Zhang, Indiana University Bloomington
Sublinear Algorithms Workshop @ JHU, Jan 9, 2016
The “big data” models
The streaming model (Alon, Matias and Szegedy 1996):
– high-speed online data
– limited storage
[Figure: a CPU with limited RAM processing a stream]
The k-site model:
– data is stored distributedly across sites
– limited network bandwidth
[Figure: a coordinator C connected to sites S_1, …, S_k]
k-site model
k sites and 1 coordinator.
– Each site has a 2-way communication channel with the coordinator.
– Each site S_i holds a piece of data x_i; the coordinator starts with nothing (∅).
Task: compute f(x_1, …, x_k) together via communication.
– The coordinator reports the answer.
– Computation is divided into rounds.
Goal: minimize both
• the total #bits of communication (o(input size); ideally polylog(input size)),
• and the #rounds (O(1) or polylog(input size)).
– There is no constraint on the #bits each site can send or receive in a round (messages are usually balanced).
– Local computation is not counted (it is usually linear).
[Figure: coordinator C (initially ∅) connected to sites S_1, …, S_k holding x_1, …, x_k; one round of communication illustrated]
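The round structure above can be illustrated with a toy one-round protocol. This is only a simulation sketch; the helper names (`run_protocol`, `site_msg`, `combine`) are illustrative, not from the talk.

```python
def run_protocol(site_inputs, site_msg, combine):
    """Simulate one round of the k-site model: every site S_i sends one
    message computed from its local data x_i; the coordinator (which
    starts with nothing) combines the k messages and reports the answer."""
    messages = [site_msg(x_i) for x_i in site_inputs]  # one communication round
    return combine(messages)

# Toy instance: f(x_1, ..., x_k) = total sum; one round, k short messages.
total = run_protocol([[1, 2], [3], [4, 5]], site_msg=sum, combine=sum)
```

Communication here is k numbers rather than the full input, which is the o(input) pattern the model asks for.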
k-site model (cont.)
Communication → time, energy, bandwidth, …
Also models network monitoring, sensor networks, etc.
The k-site model abstracts both the MapReduce model (Input → Map → Shuffle → Reduce → Output) and the BSP model.
[Figure: the coordinator topology C ↔ S_1, …, S_k as an abstraction of MapReduce/BSP]
We will start with the k-site model, and will mention the streaming model at the end.
Sketching
Q: How many distinct elements (F_0) are there in the union of the k bags?
global sketch = merge{local sketches}
[Figure: each site S_i computes a local sketch of its bag; the coordinator C merges them into a global sketch]
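One classical way to realize merge{local sketches} for F_0 is a k-minimum-values (KMV) sketch; this is a standard generic construction, not the algorithm of the talk. A minimal sketch, where the parameter t and the SHA-256-based hash are illustrative choices:

```python
import hashlib

def h(item):
    """Hash an item to a value in (0, 1]."""
    d = hashlib.sha256(str(item).encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / 2**64

def local_sketch(bag, t=64):
    """A site's local sketch: the t smallest hash values of its bag."""
    return sorted({h(x) for x in bag})[:t]

def merge_sketches(sketches, t=64):
    """Coordinator: the global sketch is the t smallest values of the union."""
    return sorted(set().union(*sketches))[:t]

def estimate_f0(sketch, t=64):
    """Estimate #distinct elements from the merged sketch."""
    if len(sketch) < t:            # fewer than t distinct items seen: exact
        return len(sketch)
    return int((t - 1) / sketch[t - 1])
```

Each site ships at most t hash values to the coordinator, so total communication is O(k·t) values regardless of the bag sizes.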
Linear sketching
A random linear mapping M : R^n → R^k, where k ≪ n, such that g(Mx) ≈ f(x).
[Figure: Mx (sketching vector) = M (linear mapping) applied to x (the data, e.g., a frequency vector)]
Perfect for distributed and streaming computation.
Simple and useful: used for many statistical/graph/algebraic problems in streaming, compressive sensing, …
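A concrete example of such an M is the classical AMS-style ±1 sketch for F_2 = ‖x‖₂². This is a standard construction sketched under illustrative parameters (n, k, seed), not code from the talk:

```python
import numpy as np

def make_sketch_matrix(n, k, seed=0):
    """Random ±1 matrix M : R^n -> R^k (AMS-style sketch for F_2)."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(k, n))

def estimate_f2(Mx):
    """Each coordinate satisfies E[(Mx)_i^2] = ||x||_2^2, so averaging the
    k squared coordinates estimates F_2; the error shrinks as k grows."""
    return float(np.mean(Mx ** 2))

n, k = 1000, 400
M = make_sketch_matrix(n, k)
x = np.zeros(n)
x[:10] = 3.0                 # frequency vector with F_2 = 10 * 9 = 90
sketch = M @ x

# Linearity is what makes it ideal here: M(x + y) = Mx + My, so sites can
# sketch locally and the coordinator simply adds the k sketches, and a
# stream can maintain Mx under updates to x.
```

The same linearity argument is why a single sketch works in both the k-site and the streaming model.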
But what if the data is noisy?
Real-world distributed datasets are often noisy!
We (have to) consider similar items as one element. Then how do we compute F_0?
Cannot use linear sketches :(
[Figure: sites holding noisy variants of the same record, e.g. “Joseph Smith, 800 Mt. Road, Springfield”, “Joe Smith, 800 Mount Springfield Av”, “Joe Smith, 800 Mountain Av, springfield”]
Noisy data is universal
Music, images, …: after compression, resizing, reformatting, etc.
Queries of the same meaning sent to Google:
“sublinear algorithm workshop 2016”
“JHU sublinear algorithm”
“sublinear John Hopkins”
Related to Entity Resolution
Entity resolution: identify and link/group different manifestations of the same real-world object. Very important in data cleaning/integration. It has been studied for 40 years in databases, and also in AI and other fields. See, e.g., [Gill & Goldacre’03, Koudas et al.’06, Elmagarmid et al.’07, Herzog et al.’07, Dong & Naumann’09, Willinger et al.’09, Christen’12] for introductions, and [Getoor and Machanavajjhala’12] for a tutorial.
Classical entity resolution is centralized: detect items representing the same entity, then merge/output all distinct entities.
In the big data models, we want communication/space-efficient algorithms (o(input size)); we cannot afford a comprehensive de-duplication.
Our problems and goal
Problem: how do we perform robust statistical estimation communication-efficiently in the k-site model?
Assume all parties are provided with an oracle (e.g., a distance function together with a threshold) that determines whether two items u, v represent the same entity (denoted u ∼ v) or not.
We will design a framework so that users can plug in any “distance function” at run time.
Goal: minimize communication & #rounds.
[Figure: coordinator C connected to sites S_1, …, S_k]
Remarks
Remark 1. We do not specify the distance function in our algorithms, for two reasons:
(1) It allows our algorithms to work with any distance function.
(2) It is sometimes very hard to assume that similarities between items can be expressed by a well-known distance function: “AT&T Corporation” is closer to “IBM Corporation” than to “AT&T Corp” under the edit distance!
Remark 2. We assume transitivity: if u ∼ v and v ∼ w, then u ∼ w. In other words, the noise is “well-shaped”. One may come up with the following problematic situation: a ∼ b, b ∼ c, …, y ∼ z, and yet a ≁ z. For many specific metric spaces, our algorithms still work as long as the number of such “outliers” is small.
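Under transitivity, the oracle induces a partition of the items into groups, and membership can be tested against a single representative per group. A centralized baseline sketch; the `same_entity` oracle below (equality up to case/whitespace) is a hypothetical stand-in for whatever distance function + threshold the user plugs in:

```python
def group_items(items, same_entity):
    """Partition items into entity groups. Transitivity means comparing a
    new item against one representative per group suffices."""
    groups = []
    for x in items:
        for g in groups:
            if same_entity(x, g[0]):   # compare to the representative only
                g.append(x)
                break
        else:
            groups.append([x])
    return groups

def same_entity(u, v):
    """Illustrative oracle: same entity iff equal up to case/whitespace."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(u) == norm(v)

records = ["Joe Smith", "joe  smith", "Jane Doe"]
groups = group_items(records, same_entity)
f0 = len(groups)   # the noisy F_0 = number of groups
```

This baseline is exactly the comprehensive de-duplication the big data models cannot afford: it touches every item, which is what the communication-efficient algorithms avoid.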
Remarks (cont.)
Remark 3. Would clustering help? No: the number of clusters can be linear in the input size.
Remark 4. Does there exist a magic hash function that (1) maps (only) items in the same group into the same bucket and (2) can be described succinctly? No. For specific metrics, tools such as locality-sensitive hashing (LSH) may help.
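As a concrete instance of Remark 4's caveat: for the specific metric of Jaccard similarity on token sets, MinHash is a standard LSH family in which two signatures agree coordinate-wise with probability equal to the items' Jaccard similarity. A sketch (the signature length 50 and the salted SHA-256 hash are illustrative choices):

```python
import hashlib

def minhash_signature(tokens, num_hashes=50):
    """MinHash: for each of num_hashes salted hash functions, keep the
    minimum hash value over the item's tokens."""
    def h(i, t):
        return hashlib.sha256(f"{i}:{t}".encode()).hexdigest()
    return tuple(min(h(i, t) for t in tokens) for i in range(num_hashes))

def jaccard_estimate(sig_a, sig_b):
    """Fraction of agreeing coordinates estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Note this helps only for this one metric; it is not the universe-wide “magic hash” that Remark 4 rules out.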
A few notations
• We have k sites (machines), each holding a multiset of items S_i.
• Let the multiset S = ⋃_{i∈[k]} S_i, and let m = |S|.
• Under the transitivity assumption, S can be partitioned into a set of groups G = {G_1, …, G_n}; each group G_i represents a distinct universe element.
• Õ(·) hides poly log(m/ε) factors.
[Figure: coordinator C connected to sites S_1, …, S_k]
Our results

                  noisy data                          noise-free data (comm.)
                  bits                     rounds
F_0               Õ(min{k/ε³, k²/ε²})      O(1)       Ω̃(k/ε²) [WZ12, WZ14]
L_0-sampling      Õ(k)                     O(1)       Ω̃(k)
F_p (p ≥ 1)       Õ((k^{p−1} + k³)/ε³)     O(1)       Ω(k^{p−1}/ε²) [WZ12]
(φ, ε)-HH         Õ(min{√k/ε, 1/ε²})       O(1)       Ω(min{√k/ε, 1/ε²}) [HYZ12, WZ12]
Entropy           Õ(k/ε²)                  O(1)       Ω(k/ε²) [WZ12]

1. p-th frequency moment: F_p(S) = Σ_{i∈[n]} |G_i|^p. We consider F_0 and F_p (p ≥ 1), and allow a (1+ε)-approximation.
2. L_0-sampling on S: return a group G_i (or an arbitrary item in G_i) chosen uniformly at random from G.
3. (φ, ε)-heavy-hitters of S (0 < ε ≤ φ ≤ 1) (definition omitted).
4. Empirical entropy: Entropy(S) = Σ_{i∈[n]} (|G_i|/m) log(m/|G_i|). We allow a (1+ε)-approximation.
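Once the items are resolved into groups G_1, …, G_n, every quantity in the table is a simple function of the group sizes |G_i|. For reference, the exact (centralized, non-communication-efficient) formulas:

```python
import math

def frequency_moment(group_sizes, p):
    """F_p(S) = sum_i |G_i|^p; F_0 is just the number of groups."""
    if p == 0:
        return len(group_sizes)
    return sum(g ** p for g in group_sizes)

def empirical_entropy(group_sizes):
    """Entropy(S) = sum_i (|G_i|/m) * log(m/|G_i|), where m = |S|."""
    m = sum(group_sizes)
    return sum((g / m) * math.log(m / g) for g in group_sizes)
```

The algorithms in the table approximate these same quantities, but with Õ(poly(k, 1/ε)) communication rather than by collecting all m items.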