Nap 2019-09-25
Nap: Network-Aware Data Partitions for Efficient Distributed Processing
Mr. Or Raz, Prof. Chen Avin, Prof. Stefan Schmid
School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Faculty of Computer Science, University of Vienna, Vienna, Austria
September 26, 2019

Hello everyone, my name is Or Raz. I am a Master's graduate of the School of Electrical and Computer Engineering at Ben-Gurion University of the Negev, Israel. This research was done with the support of Professors Chen Avin and Stefan Schmid, and my thesis is mainly about this work. Today I will talk about Nap, a scheme that takes the network into consideration when partitioning the data and thereby minimizes the completion time in distributed processing frameworks such as Hadoop.
Outline
1 Introduction and Motivation
2 Model and Problem
3 Nap
4 Proof-of-Concept and Conclusion

First, I introduce the motivation in general, then with a join example, and then give some empirical motivation. Next, I cover the model and the problem itself. Then I go over the Nap scheme and its relation to Young's lattice. Finally, I go over the implementation and its difficulties, and introduce some points for future work.

O. Raz (BGU - ECE), September 26, 2019
Introduction
Nowadays, we are living in the Big Data era.
Data is processed and stored in geographically distributed datacenters.
Traditional query optimizations neglect the network.

• The amount of data queried and processed by emerging applications is growing explosively in many fields, such as health, business, and science.
• Traditionally, data processing frameworks were designed to run in homogeneous environments or within a single datacenter. Today that setting is less common, and more processing is geographically distributed.
• This is because of the scale of the data and because the data itself is generated in a geographically distributed fashion (IoT).
• Therefore, to maximize performance, we need to consider the available network resources, which have been neglected in the optimization analysis; otherwise performance can be poor (wide-area analytics).
Contribution
Nap, a network-aware and adaptive mechanism for fast large-scale data processing based on MapReduce, such as joins.

• Our contribution is Nap, a mechanism that minimizes the completion time in a network-aware manner and is optimized for the current network conditions. In addition, it requires no modifications to the application logic: it only fools the application into a better partitioning of the data.
• We are particularly interested in workloads based on relational databases and consider the most fundamental operation in distributed data processing: joins.
Multiway Join
ACM Tables Example
Consider a small database of Papers, Papers-Authors, and Authors that we want to join: X(v, p) ⋈ Y(p, a) ⋈ Z(a, n).

X(v, p)
Venue     Paper
SIGMOD    SkewTune
EuroSys   Riffle
OSDI      MapReduce

Y(p, a)
Paper      Author
MapReduce  1
MapReduce  2
HaLoop     5
S2RDF      3
Riffle     4
Kraken     6

Z(a, Name)
Author  Name
1       J. Dean
7       Y. Kwon
4       H. Zhang
8       D. Ullman
2       S. Ghemawat

First, let's take a look at these three tables, which have two join attributes, p and a.
We consider an operation that joins all of these tables, X(v, p) ⋈ Y(p, a) ⋈ Z(a, n), where ⋈ denotes the join operator. Attributes: v - the Venue, p - the Paper ID, a - the Author ID, and n - the Author name.
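To make the join concrete, here is a minimal single-machine sketch of X(v, p) ⋈ Y(p, a) ⋈ Z(a, n) on the example tables. It is only an illustration of what the join computes, not the distributed MapReduce execution or the Nap partitioning from the talk. The row contents are read off the slide's tables, and the function name `multiway_join` and the hash-index approach are assumptions made for this example.

```python
from collections import defaultdict

# Example tables as read from the slide (row contents are an assumption
# insofar as the slide layout had to be reconstructed).
X = [("SIGMOD", "SkewTune"), ("EuroSys", "Riffle"), ("OSDI", "MapReduce")]
Y = [("MapReduce", 1), ("MapReduce", 2), ("HaLoop", 5),
     ("S2RDF", 3), ("Riffle", 4), ("Kraken", 6)]
Z = [(1, "J. Dean"), (7, "Y. Kwon"), (4, "H. Zhang"),
     (8, "D. Ullman"), (2, "S. Ghemawat")]


def multiway_join(x_rows, y_rows, z_rows):
    """Hypothetical helper: join X(v, p), Y(p, a), Z(a, n) on p and a."""
    # Index X by paper p and Z by author a so each Y row is matched by lookup.
    venues_by_paper = defaultdict(list)
    for venue, paper in x_rows:
        venues_by_paper[paper].append(venue)

    names_by_author = defaultdict(list)
    for author, name in z_rows:
        names_by_author[author].append(name)

    # Y carries both join attributes, so it drives the join.
    result = []
    for paper, author in y_rows:
        for venue in venues_by_paper.get(paper, []):
            for name in names_by_author.get(author, []):
                result.append((venue, paper, author, name))
    return result


if __name__ == "__main__":
    for row in multiway_join(X, Y, Z):
        print(row)
```

On these tables, only papers that appear in both X and Y, with an author that appears in Z, survive the join. SkewTune, for example, has no row in Y and therefore drops out.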