
Optimizing Data Partitioning for Data-Parallel Computing (PowerPoint presentation)

Qifa Ke, Vijayan Prabhakaran, Jingyue Wu, Junfeng Yang, Yinglian Xie, Yuan Yu. Microsoft Research Silicon Valley; Columbia University.


  1. Optimizing Data Partitioning for Data-Parallel Computing
     Qifa Ke, Vijayan Prabhakaran, Jingyue Wu, Junfeng Yang, Yinglian Xie, Yuan Yu
     Microsoft Research Silicon Valley; Columbia University

  2. Partition Data for Data-Parallel Computing?
     …… // 270 GB input data
     var output = …… input.GroupBy(x => x.UserId)
                          .Select(g => GetStats(g))
     • Data partitioning controls the degree of parallelism
     • What partition function to choose?
       – Hash partition, range partition, …?
     • How many partitions to generate?
       – 100, 1000, 10000, …?
     Data partitioning: performance and costs
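The two partition functions named on this slide can be sketched in a few lines. This is a minimal Python illustration, not the deck's DryadLINQ code: the helper names (`hash_partition`, `range_partition`, `partition_sizes`), the CRC32 hash, the split points, and the user IDs are all assumptions chosen for the example.

```python
import bisect
import zlib
from collections import Counter


def hash_partition(key, num_partitions):
    """Assign a key to a partition via a stable hash (CRC32 is an arbitrary choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, boundaries):
    """Assign a key to a partition using sorted split points.

    Keys <= boundaries[0] go to partition 0; keys above the last
    boundary go to the final partition.
    """
    return bisect.bisect_left(boundaries, key)


def partition_sizes(keys, assign, num_partitions):
    """Count how many records land in each partition."""
    counts = Counter(assign(k) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical workload: 100 distinct user IDs, with user 7 a heavy hitter.
user_ids = [7] * 50 + list(range(100))

hash_sizes = partition_sizes(user_ids, lambda k: hash_partition(k, 4), 4)
range_sizes = partition_sizes(
    user_ids, lambda k: range_partition(k, [24, 49, 74]), 4
)

print("hash :", hash_sizes)
print("range:", range_sizes)
```

Under either function, every record for user 7 lands in a single partition, so a group-by on the key cannot be balanced merely by changing the hash. Changing the split points or the partition count shifts which partition is overloaded, which is the choice the slide asks about.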

  3. Problem 1: Do We Have a Skew?
     • Data skew and computation skew
     // process 20 GB images in 100 partitions
     var output = Imgs.Select(x => ProcessImages(x))
     [Figure: partition size as a fraction of data/computation (y-axis, 0 to 0.14) vs. partition ID (x-axis, 0 to 100)]

  4. Problem 1: Do We Have a Skew?
     • Data skew and computation skew
     // process 20 GB images in 100 partitions
     var output = Imgs.Select(x => ProcessImages(x))
     • Image processing time depends on both the images and ProcessImages():
       – Number of images
       – Image features that ProcessImages() is targeting to compute
     [Figure: partition size and computation time, each as a fraction of data/computation (y-axis, 0 to 0.14) vs. partition ID (x-axis, 0 to 100)]
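One way to make the distinction between data skew and computation skew concrete is to compute the same imbalance measure on partition sizes and on per-partition running times. The sketch below is an illustration only: the max-over-mean ratio is one common measure, not necessarily the one the authors use, and the four-partition numbers are invented.

```python
def fractions(values):
    """Normalize per-partition values so they sum to 1 (the y-axis of the plot)."""
    total = sum(values)
    return [v / total for v in values]


def skew_ratio(values):
    """Largest partition load divided by the mean load; 1.0 means perfectly balanced."""
    mean = sum(values) / len(values)
    return max(values) / mean


# Hypothetical measurements for four partitions of equal size.
sizes_mb = [200, 200, 200, 200]   # bytes are balanced ...
times_s = [30, 30, 30, 150]       # ... but one partition is far slower

data_skew = skew_ratio(sizes_mb)
compute_skew = skew_ratio(times_s)

print("data skew   :", data_skew)
print("compute skew:", compute_skew)
print("time shares :", fractions(times_s))
```

Here the data is evenly split while the computation is not, which is the situation the slide describes: equal-sized partitions can still hold images that are much more expensive to process.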

  5. Problem 2: What’s Optimal?
     • Balanced workload ≠ optimal performance
       – Tradeoff: workload vs. cross-node traffic
     // construct a user-user graph for botnet detection
     var records = input1.Apply(x => SelectRecords(x)).HashPartition(x => x.label, nump);
     var output = input1.Apply(records, (x, y) => ConstructGraph(x, y));

  6. Optimal Data Partitioning
     Given code and data, can we generate a data partitioning scheme to optimize performance, without running the code on the whole data set?
     • Performance and cost metrics
       – Job latency
       – Number of processes
       – Memory consumption
       – Disk and network I/O

  7. Why Not DB Solutions?
     • Need to understand both code and data
     • Programming model
       – Predefined operators (e.g., select, join) vs. arbitrary user-defined functions (UDF)
     • Data model
       – Structured tables vs. unstructured data
       – Static, indexed data vs. dynamic dataset
       – Minimize intermediate disk writes vs. using disk as communication channel

  8. Code Analysis
     - Data processing flow
     - Computational & I/O complexity
     - Relevant data features
     • Challenges: user-defined functions (UDF)
       – How data is accessed, processed, and transformed
     IEnumerable<stats> ProcessRecord(IEnumerable<record> users) {
         foreach (var u in users) {
             if (NumRecipients(u) > 10) {
                 yield return GetStat(u);
             } else {
                 yield return GetSimpleStat(u);
             }
         }
     }
     • Number of recipients is a relevant feature
     • Different records take different code paths to process

  9. Data Analysis
     Statistics of relevant data features
     • Challenge: compact data representation
       – Representative samples of input data
       – Data summarizations
       – Approximate histogram
       – Approximate number of distinct keys
     • Streaming algorithms in a distributed setting
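As an illustration of a compact, mergeable summary for the "approximate number of distinct keys" item, here is a minimal K-minimum-values (KMV) sketch in Python. KMV is a standard technique swapped in for the example; the slides do not say which algorithm the authors use. The function names, the SHA-1 hash, and the value of `K` are assumptions, and for brevity each sketch is built from an in-memory set rather than a bounded heap over a stream.

```python
import hashlib
import heapq


def _unit_hash(key):
    """Map a key to a pseudo-uniform float in [0, 1] (SHA-1 is an arbitrary choice)."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(keys, k):
    """Return the k smallest distinct hash values among keys, in ascending order."""
    return heapq.nsmallest(k, {_unit_hash(x) for x in keys})


def kmv_merge(a, b, k):
    """Combine two sketches; the result equals the sketch of the combined data."""
    return heapq.nsmallest(k, set(a) | set(b))


def kmv_estimate(sketch, k):
    """Estimate the number of distinct keys from a sketch."""
    if len(sketch) < k:
        return float(len(sketch))   # fewer than k distinct keys seen: exact count
    return (k - 1) / sketch[-1]     # standard KMV estimator


K = 256
# Two hypothetical partitions with overlapping keys: 10,000 distinct keys overall.
part1 = list(range(0, 6000))
part2 = list(range(4000, 10000))

merged = kmv_merge(kmv_sketch(part1, K), kmv_sketch(part2, K), K)
estimate = kmv_estimate(merged, K)
print("estimated distinct keys:", round(estimate))
```

Each node only ships `K` floats, and merging is a union followed by truncation, which is what makes a summary like this usable in the distributed streaming setting the slide mentions.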

  10. Cost Modeling and Optimization
      • Modeling: compare different partitioning schemes
      • Estimation: predict the potential cost
        – White-box approach
          • Analytically based on code/data analysis
        – Black-box approach
          • Sampling + regression analysis
      • Optimization: search for best partitioning scheme
      [Figure: pipeline with stages Data Analysis (input data → data statistics & samples), Code Analysis (code, EPG → computational & I/O complexity), Cost Modeling & Estimation, and Cost Optimization (→ updated/optimized EPG)]
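The black-box route (sampling plus regression) followed by a search over partition counts might look like the sketch below. Everything numeric here is invented for illustration: the sample timings, the 100-worker cluster, the 5-second per-partition startup overhead, and the "waves" latency model are assumptions, not the authors' cost model. Only the 270 GB input size and the candidate counts come from the earlier slide.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def estimated_latency(num_partitions, total_gb, a, b, startup_s, workers):
    """Toy cost model (an assumption): partitions run in waves over a fixed
    pool of workers; each partition pays a startup cost plus the
    regression-predicted processing time for its share of the data."""
    per_partition_s = a + b * (total_gb / num_partitions)
    waves = math.ceil(num_partitions / workers)
    return waves * (startup_s + per_partition_s)


# Hypothetical timings from running the job on small samples: (GB, seconds).
sample_gb = [0.1, 0.2, 0.4]
sample_s = [3.0, 4.0, 6.0]
a, b = fit_line(sample_gb, sample_s)

# Search the candidate partition counts for the lowest predicted latency.
candidates = [10, 100, 1000, 10000]
costs = {
    n: estimated_latency(n, total_gb=270, a=a, b=b, startup_s=5.0, workers=100)
    for n in candidates
}
best = min(costs, key=costs.get)

for n in candidates:
    print(f"{n:>6} partitions -> {costs[n]:8.1f} s (predicted)")
print("best:", best)
```

With these made-up numbers, too few partitions leave workers idle and too many pay the startup overhead repeatedly, so the search settles on an intermediate count. A real optimizer would also need the other metrics from slide 6 (memory, disk and network I/O) in the objective.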

  11. Conclusion
      • Prepare your input before you start
        – Data partitioning is critical to performance
      • New research opportunities in different fields
        – Programming language analysis
        – Data analysis
        – Optimization
        – Distributed systems
