Communication-Efficient Computation on Distributed Noisy Datasets - PowerPoint PPT Presentation

Communication-Efficient Computation on Distributed Noisy Datasets
Qin Zhang, Indiana University Bloomington
SPAA’15, June 15, 2015

Model of computation
The coordinator model: k sites and 1 coordinator.
– Each site has a 2-way communication channel with the coordinator.
– Each site S_i has a piece of data x_i. The coordinator has ∅.
– Task: compute f(x_1, ..., x_k) together via communication. The coordinator reports the answer.
– Computation is divided into rounds.
– Goal: minimize both
  • total #bits of communication (o(Input); at best polylog(Input))
  • and #rounds (O(1) or polylog(Input)).
– No constraint on the #bits that can be sent by each site in each round (usually balanced).
– Local computation is not counted (usually linear).
[Figure: coordinator C, holding ∅, connected to sites S_1, S_2, S_3, ..., S_k holding x_1, x_2, x_3, ..., x_k; one round of communication is marked.]
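To make the two cost measures concrete, here is a minimal sketch of the coordinator model running a naive protocol: every site ships its whole input in a single round and the coordinator computes f locally. The class and function names (`Coordinator`, `naive_protocol`), the newline-joined message encoding, and the choice of f (number of distinct items in the union) are assumptions made for illustration; they do not come from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """Starts with no data and tallies everything it receives."""
    bits_received: int = 0
    rounds: int = 0
    inbox: list = field(default_factory=list)

    def run_round(self, messages):
        """One round: every site sends one message (bytes) to the coordinator."""
        self.rounds += 1
        for msg in messages:
            self.bits_received += 8 * len(msg)
            self.inbox.append(msg)


def naive_protocol(site_data):
    """Each site sends its whole input x_i; the coordinator computes f locally.

    f is the number of distinct items in the union (illustrative choice).
    """
    coord = Coordinator()
    messages = ["\n".join(items).encode("utf-8") for items in site_data]
    coord.run_round(messages)
    union = set()
    for msg in coord.inbox:
        # Skip empty strings so a site with no items contributes nothing.
        union.update(s for s in msg.decode("utf-8").split("\n") if s)
    return len(union), coord.bits_received, coord.rounds


sites = [["a", "b"], ["b", "c"], ["c", "d", "e"]]
answer, bits, rounds = naive_protocol(sites)
print(answer, bits, rounds)
```

The naive protocol uses one round but its communication grows linearly with the input, which is exactly the cost the goal above asks to beat.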

The coordinator model (cont.)
Communication → time, energy, bandwidth, ...
The coordinator model is an abstraction of:
– The MapReduce model (Input → Map → Shuffle → Reduce → Output).
– The BSP model.
– Also network monitoring, sensor networks, etc.
[Figure: coordinator C connected to sites S_1, S_2, S_3, ..., S_k.]

The distributed distinct elements (F_0) problem
Function f can be: How many distinct elements (F_0) are in the union of the k bags?
Important in: traffic monitoring, query optimization, ...
Almost always allow a (1 + ε)-approximation.
[Figure: coordinator C connected to sites S_1, S_2, S_3, ..., S_k, each holding a bag of items.]

Existing solution – linear sketches
How many distinct elements (F_0) are in the union of the k bags?
Each site computes a local linear sketch: global sketch = Σ local sketches.
[Figure: coordinator C connected to sites S_1, S_2, S_3, ..., S_k, each sending its local linear sketch.]

Linear sketches
Random linear mapping M : R^n → R^k where k ≪ n.
– x: the data, e.g., a frequency vector.
– M: the linear mapping.
– Mx: the sketching vector, from which f(x) is (approximately) recovered.
Simple and useful: statistical/graph/algebraic problems in data streams, compressive sensing, ...
Perfect for distributed computation: the data is distributed as x = x_1 + ... + x_k, with x_i on site i. Merge using linearity: Mx_1 + ... + Mx_k = M(x_1 + ... + x_k)
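The merge-by-linearity identity can be checked numerically with a small sketch. The random ±1 matrix, the shared seed, and the helper names (`make_sketch_matrix`, `sketch`, `add_vectors`) are illustrative assumptions; the snippet only demonstrates linearity and does not implement any particular F_0 estimator.

```python
import random


def make_sketch_matrix(rows, n, seed):
    """Random +/-1 matrix M with `rows` rows and n columns.

    All sites are assumed to share the seed, so they all use the same M.
    """
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]


def sketch(M, x):
    """Compute the sketching vector Mx."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def add_vectors(vectors):
    """Coordinate-wise sum of equal-length vectors."""
    return [sum(coords) for coords in zip(*vectors)]


n, rows = 8, 3
M = make_sketch_matrix(rows, n, seed=2015)

# Local frequency vectors x_1, x_2, x_3 held by three sites.
local = [
    [1, 0, 2, 0, 0, 0, 1, 0],
    [0, 3, 0, 0, 1, 0, 0, 0],
    [2, 0, 0, 0, 0, 4, 0, 1],
]

merged = add_vectors([sketch(M, x_i) for x_i in local])  # Mx_1 + ... + Mx_k
direct = sketch(M, add_vectors(local))                   # M(x_1 + ... + x_k)
print(merged == direct)
```

Each site only needs to send its short vector Mx_i, so the communication depends on the sketch length rather than on n.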

Linear sketches cannot work for noisy datasets
Real world distributed datasets are often noisy!
We (have to) consider similar items as one element. Then how to compute F_0?
Cannot use linear sketches: the frequencies of items representing the same entity may be mapped into different coordinates of the sketching vector.
[Figure: coordinator C connected to sites S_1, S_2, S_3, ..., S_k holding noisy variants of a record: names "Joseph Smith", "Joe Smith", "John Smith"; addresses "800 Mt. Road", "800 Mount ... Av", "800 Mountain Av"; city "Springfield" / "springfield".]
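A small sketch of why exact-match coordinates fail on noisy data: any mapping from an item's exact bytes to a coordinate treats every spelling variant as a separate item. The records below are hypothetical, loosely modeled on the slide's example, and the `coordinate` helper (SHA-256 reduced modulo a width) is an assumption standing in for whatever indexing a sketch would use.

```python
import hashlib


def coordinate(item, width):
    """Map an item to a coordinate in [0, width) from a hash of its exact bytes."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % width


# Hypothetical records; assume for illustration they all describe one person.
records = [
    "Joe Smith, 800 Mt. Road, Springfield",
    "Joseph Smith, 800 Mount Road, Springfield",
    "Joe Smith, 800 Mt. Road, Springfield",
    "John Smith, 800 Mountain Av, springfield",
]

exact_distinct = len(set(records))              # counts spellings, not entities
coords = [coordinate(r, 2**32) for r in records]
print(exact_distinct, coords)
```

Only byte-identical records share a coordinate; the other variants will typically land elsewhere, so exact-match counting reports three distinct items where the intended answer under this assumption is one.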

Noisy data is universal
Music, images, ...: after compression, resizing, reformatting, etc.
Queries of the same meaning sent to Google:
– “SPAA 2015”
– “27th ACM Symposium on Parallelism in Algorithms and Architectures”
– “ACM FCRC SPAA’15”

Related to Entity Resolution
Entity Resolution: identify and link/group different manifestations of the same real world object.
Very important in data cleaning / integration. Has been studied for 40 years in DB, also in AI, NT. E.g. [Gill & Goldacre’03, Koudas et al.’06, Elmagarmid et al.’07, Herzog et al.’07, Dong & Naumann’09, Willinger et al.’09, Christen’12] for introductions, and [Getoor and Machanavajjhala’12] for a tutorial.
Centralized: detect items representing the same entity, merge/output all distinct entities.
This work: distributed, statistical estimations. We want more communication-efficient algorithms (o(input size)), without a comprehensive de-duplication.

Our goal and problem
Goal: minimize communication & #rounds.
Problem: how can we perform noise-resilient statistical estimation in the coordinator model communication-efficiently?
Assume all parties are provided with a pairwise distance metric and a threshold determining whether two items u, v represent the same entity (denoted by u ∼ v) or not.
The distance metric design is a separate issue. We will design a framework so that users can plug in any “distance metric” at run time.
[Figure: coordinator C connected to sites S_1, S_2, S_3, ..., S_k.]
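One way to picture the plug-in interface is a function that turns a user-supplied distance and threshold into the predicate u ∼ v. The sketch below is a centralized baseline only, not the talk's distributed algorithm; the names `make_same_entity` and `count_entities`, the use of `<=` for the threshold test, and the numeric example are all assumptions. The counting step relies on ∼ being transitive, which the talk assumes in its remarks.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def make_same_entity(dist: Callable[[T, T], float],
                     threshold: float) -> Callable[[T, T], bool]:
    """Build the predicate u ~ v from a plug-in distance and a threshold."""
    def same_entity(u: T, v: T) -> bool:
        return dist(u, v) <= threshold  # assumption: inclusive threshold
    return same_entity


def count_entities(items: Iterable[T],
                   same_entity: Callable[[T, T], bool]) -> int:
    """Centralized baseline: keep one representative per entity.

    Correct when ~ is transitive; compares each item to every representative.
    """
    representatives: List[T] = []
    for item in items:
        if not any(same_entity(item, rep) for rep in representatives):
            representatives.append(item)
    return len(representatives)


# Hypothetical plug-in: readings within 0.5 of each other are the same entity.
readings = [1.0, 1.2, 5.0, 5.3, 9.9, 1.1]
same = make_same_entity(lambda u, v: abs(u - v), threshold=0.5)
n_entities = count_entities(readings, same)
print(n_entities)
```

Because the predicate is passed in as a value, swapping the distance (edit distance, Jaccard, a learned matcher) needs no change to the counting code.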

Remarks
Remark 1. We do not specify the distance function in our algorithms, for two reasons:
(1) It allows our algorithms to work with any distance function.
(2) Sometimes it is very hard to assume that similarities between items can be expressed by a well-known distance function: “AT&T Corporation” is closer to “IBM Corporation” than to “AT&T Corp” under the edit distance!
Remark 2. We assume transitivity: if u ∼ v and v ∼ w, then u ∼ w. In other words, the noise is “well-shaped”.
One may come up with the following problematic situation: we have a ∼ b, b ∼ c, ..., y ∼ z, however, a ≁ z. Our algorithms still work if the number of “outliers” is small.
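The edit-distance claim in Remark 1 can be verified with a short sketch. It assumes the standard Levenshtein distance with unit-cost insertions, deletions, and substitutions; the function name `edit_distance` is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


d_ibm = edit_distance("AT&T Corporation", "IBM Corporation")
d_corp = edit_distance("AT&T Corporation", "AT&T Corp")
print(d_ibm, d_corp)  # 4 7
```

Turning “AT&T” into “IBM” costs 4 edits, while dropping “oration” costs 7, so a threshold on edit distance alone would group the wrong pair.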

Remarks (cont.)
Remark 3. There do exist approaches that do not assume transitivity, e.g., assuming the so-called ICAR properties [BGM+09], or using clustering-based approaches [ACN08]. These are unlikely to yield communication-efficient algorithms in our setting.
