
Mrs: High Performance MapReduce for Iterative and Asynchronous Algorithms in Python

Jeff Lund, Chace Ashcraft, Andrew McNabb and Kevin Seppi, Brigham Young University. November 14, 2016. Outline: Mrs; Iterative MapReduce; Performance and Case Studies.


  1. Mrs: High Performance MapReduce for Iterative and Asynchronous Algorithms in Python
     Jeff Lund, Chace Ashcraft, Andrew McNabb and Kevin Seppi
     Brigham Young University
     November 14, 2016

  2. What is Mrs?
     - Simple and easy-to-use MapReduce framework
     - Implemented in pure Python
     - Designed with scientific computing in mind

  3. MapReduce
     [Diagram: MapReduce dataflow with five Input and Map nodes and three Reduce nodes.]

  4. Example: WordCount (wordcount.py)

         import mrs

         class WordCount(mrs.MapReduce):
             def map(self, line_num, line_text):
                 for word in line_text.split():
                     yield (word, 1)

             def reduce(self, word, counts):
                 yield sum(counts)

         if __name__ == '__main__':
             mrs.main(WordCount)
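To make the roles of the two methods concrete, here is a minimal in-memory sketch of what a MapReduce run does with them: call map on every input record, group the emitted values by key, then call reduce once per key. This is not how Mrs is implemented. `run_local`, `wc_map`, and `wc_reduce` are hypothetical names used only for illustration, and the standalone functions simply mirror the method signatures on the slide.

```python
from collections import defaultdict


def wc_map(line_num, line_text):
    # Emit (word, 1) for every word on the line, as in WordCount.map.
    for word in line_text.split():
        yield (word, 1)


def wc_reduce(word, counts):
    # Emit the total count for one word, as in WordCount.reduce.
    yield sum(counts)


def run_local(lines, map_func, reduce_func):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for line_num, line_text in enumerate(lines):
        for key, value in map_func(line_num, line_text):
            groups[key].append(value)
    # Each key is reduced independently, which is what allows the reduce
    # phase to be spread across machines in a real framework.
    return {key: list(reduce_func(key, iter(groups[key])))
            for key in sorted(groups)}


result = run_local(["a b a", "b c"], wc_map, wc_reduce)
```

With the two input lines above, `result` maps each word to a one-element list holding its count.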

  5. Why Python?
     - Python is nearly ubiquitous
     - Mrs needs no dependencies outside the standard library
     - Familiarity and readability
     - Easy interoperability
     - Debugging and testing

  6. Iterative MapReduce
     [Diagram: parallel rows, each a chain Input -> Map -> Reduce -> Map -> Reduce.]
     Performance challenges:
     - CPU-bound problems
     - Communication time
     - Task management

  7. Proposed Solutions
     - Infrequent checkpointing
     - Reduce-map task
     - Generator-callback model
     - Asynchronous scheduling model

  8. How Often to Checkpoint
     Let $X$ be a random variable indicating that a failure occurred during an iteration. Then

     $$X \sim \mathrm{Bernoulli}\left(\frac{1}{f}\left(t + \frac{c}{n}\right)\right)$$

     - $n$: number of iterations between checkpoints
     - $t$: time to perform each iteration
     - $c$: extra time required for a checkpointed iteration
     - $f$: failures in a cluster

  9. How Often to Checkpoint
     If $Y \sim \mathrm{Uniform}(n)$ indicates the number of iterations since the last checkpoint, then the expected number of seconds of extra work in an iteration is

     $$E[X(r + Yt)] = \frac{1}{f}\left(t + \frac{c}{n}\right)\left(r + \frac{n}{2}t\right)$$

     and the breakeven number of iterations is

     $$n = \max\left(1,\ \frac{1}{t}\left(\sqrt{\left(\frac{c}{2} + r\right)^2 - 2c(r - f)} - \left(\frac{c}{2} + r\right)\right)\right).$$
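The closed form can be checked numerically. Algebraically it is the positive root of the equation that sets the per-iteration checkpoint overhead, c/n, equal to the expected extra work per iteration, so the sketch below evaluates both sides at the computed n. Several things here are assumptions rather than statements from the slides: r is taken to be a restart or recovery time (the slides do not define it), f is read as a mean time between failures in the same units as t, c, and r (which is what the 1/f factor suggests), and the numeric values are invented for illustration.

```python
import math


def failure_cost(n, t, c, r, f):
    """Expected extra work per iteration: (1/f) * (t + c/n) * (r + n*t/2)."""
    return (1.0 / f) * (t + c / n) * (r + n * t / 2.0)


def breakeven_interval(t, c, r, f):
    """Closed-form break-even number of iterations between checkpoints."""
    half = c / 2.0 + r
    # The radicand equals (c/2 - r)**2 + 2*c*f, so it is never negative
    # for positive c and f.
    root = math.sqrt(half ** 2 - 2.0 * c * (r - f))
    return max(1.0, (root - half) / t)


# Hypothetical values in seconds: 10 s iterations, 30 s checkpoint overhead,
# 60 s restart time, one failure per 100,000 s on average.
t, c, r, f = 10.0, 30.0, 60.0, 100_000.0
n = breakeven_interval(t, c, r, f)
checkpoint_cost = c / n                      # checkpoint overhead per iteration
expected_loss = failure_cost(n, t, c, r, f)  # expected rework per iteration
```

At the returned n the two costs agree. When failures are so frequent that the root falls below one, the `max` clamps the result to a checkpoint on every iteration.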

  10. Iterative MapReduce: ReduceMap
      [Diagram: top, parallel rows of Input -> Map -> Reduce -> Map -> Reduce; bottom, the same rows with each Reduce and the following Map fused into one task: Input -> Map -> ReduceMap -> ReduceMap.]
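The ReduceMap idea can be sketched as plain function composition: the output of a reduce for one key is passed directly to the next iteration's map inside the same task, rather than being written out and picked up by a separate map task. This is an illustrative sketch, not the Mrs implementation. `fuse_reduce_map`, `sum_reduce`, and `halve_map` are hypothetical names, and the calling conventions (reduce yields values, map yields key-value pairs) are assumed from the WordCount example.

```python
def fuse_reduce_map(reduce_func, map_func):
    """Return a task that reduces one key and immediately maps each result."""
    def reducemap(key, values):
        for reduced in reduce_func(key, values):
            # Feed the reduce output straight into the next map step instead
            # of storing it as a separate intermediate dataset.
            yield from map_func(key, reduced)
    return reducemap


# Toy iteration: reduce sums the values for a key, map halves the total
# and emits it under the same key.
def sum_reduce(key, values):
    yield sum(values)


def halve_map(key, value):
    yield (key, value / 2)


reducemap = fuse_reduce_map(sum_reduce, halve_map)
fused_output = list(reducemap("k", [2, 4, 6]))
```

The fused task produces the same pairs as running the reduce and then the map separately, with one task to schedule per key instead of two.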

  11. Generator-Callback Model

          def run_batches():
              data_path = input_path
              for iteration in range(MAX_ITERATIONS):
                  output_path = make_temp_path()
                  job = new_job(data_path, map_func, reduce_func, output_path)
                  # The driver blocks here until the whole job has finished.
                  job.wait_for_completion()
                  data_path = output_path
                  if iteration % CHECK_FREQUENCY == 0:
                      data = read_all(data_path)
                      perform_output(data)
                      if converged(data):
                          break

  12. Generator-Callback Model

          def generator(queue):
              dataset = input_data
              for iteration in range(MAX_ITERATIONS):
                  output_path = make_temp_path()
                  dataset = mapreduce(dataset, map_func, reduce_func, output_path)
                  # Only every CHECK_FREQUENCY-th dataset gets a callback.
                  if iteration % CHECK_FREQUENCY == 0:
                      queue.submit(dataset, callback)
                  else:
                      queue.submit(dataset, None)

          def callback(data):
              data.read_all()
              perform_output(data)
              # Return True to keep iterating, False once converged.
              return not converged(data)

  13. Task Dependencies: Synchronous MapReduce

  14. Task Dependencies: Asynchronous MapReduce

  15. Task Execution Traces
      [Figures: synchronous trace; asynchronous trace.]
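One way to see why relaxing the per-iteration barrier helps when task times vary is a small simulation. This is an illustrative sketch, not the Mrs scheduler. It assumes one worker per task, and the ring-shaped dependency pattern (`ring`) and the randomly drawn durations are hypothetical.

```python
import random


def synchronous_makespan(durations):
    """Every iteration waits for the slowest task of the previous one."""
    return sum(max(row) for row in durations)


def asynchronous_makespan(durations, neighbors):
    """Each task starts as soon as the tasks it depends on have finished."""
    finish = [0.0] * len(durations[0])
    for row in durations:
        finish = [max(finish[j] for j in neighbors(i, len(row))) + row[i]
                  for i in range(len(row))]
    return max(finish)


def ring(i, size):
    """Hypothetical pattern: a task needs itself and both ring neighbors."""
    return [(i - 1) % size, i, (i + 1) % size]


rng = random.Random(0)
# 20 iterations of 16 tasks with uneven, randomly drawn task durations.
durations = [[rng.uniform(0.5, 1.5) for _ in range(16)] for _ in range(20)]
sync_time = synchronous_makespan(durations)
async_time = asynchronous_makespan(durations, ring)
```

Under these assumptions the asynchronous total can never exceed the synchronous one, because each task waits only for the tasks it depends on. If every task depends on every task of the previous iteration, the two schedules coincide.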

  16. Performance and Case Studies
      We demonstrate on two different problems:
      - Particle Swarm Optimization: minimize the 250-dimensional Rosenbrock function
      - Expectation Maximization: mixture of multinomials model in the context of clustering text documents

  17. Particle Swarm Optimization
      - Inspired by simulations of flocking birds
      - Particles interact while exploring
      - Map: motion and function evaluation
      - Reduce: communication
      - CPU-bound problem
      [Figure: plot with x-axis 0 to 10 and y-axis 0 to 40.]
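The split between map and reduce can be sketched for a single particle. This is a minimal illustration, not the authors' implementation. It uses the common constriction form of the PSO velocity update with the usual constants (`chi`, `phi`), which the slides do not specify. The particle dictionary layout, the `pso_map` and `pso_reduce` names, and the 5-dimensional toy problem are all hypothetical.

```python
import random


def rosenbrock(x):
    """Rosenbrock function; the global minimum is 0 at x = (1, ..., 1)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))


def pso_map(p, rng, chi=0.72984, phi=2.05):
    """Map step: move one particle, evaluate it, update its personal best."""
    vel = [chi * (v + phi * rng.random() * (pb - x)
                  + phi * rng.random() * (nb - x))
           for x, v, pb, nb in zip(p["pos"], p["vel"], p["pbest"], p["nbest"])]
    pos = [x + v for x, v in zip(p["pos"], vel)]
    moved = dict(p, pos=pos, vel=vel)
    value = rosenbrock(pos)
    if value < p["pbest_val"]:
        moved["pbest"], moved["pbest_val"] = pos, value
    return moved


def pso_reduce(p, messages):
    """Reduce step: adopt the best (value, position) message from neighbors."""
    best_val, best_pos = min(messages + [(p["nbest_val"], p["nbest"])],
                             key=lambda m: m[0])
    return dict(p, nbest=best_pos, nbest_val=best_val)


rng = random.Random(1)
start = [rng.uniform(-2.0, 2.0) for _ in range(5)]
particle = {"pos": start, "vel": [0.0] * 5,
            "pbest": start, "pbest_val": rosenbrock(start),
            "nbest": start, "nbest_val": rosenbrock(start)}
# A neighbor reports that it found the optimum; the particle then moves.
particle = pso_reduce(particle, [(0.0, [1.0] * 5)])
particle = pso_map(particle, rng)
```

The function evaluation inside `pso_map` is where the CPU time goes, while `pso_reduce` only compares a handful of messages, which fits the slide's description of the problem as CPU bound.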

  18. Particle Swarm Optimization
      [Figure: parallel efficiency (y-axis, 0 to 1) versus number of subiterations (x-axis, log scale, 10^0 to 10^3). Series: Reduce-map tasks; Rare checks; Concurrent checks; No redundant storage; Redundant storage.]

  19. Particle Swarm Optimization: Asynchronous
      [Figure: average tasks per second (y-axis, 0 to 140) versus standard deviation of subiterations (x-axis, 0 to 20). Series: Asynchronous; Synchronous.]

  20. Particle Swarm Optimization: Asynchronous
      [Figure: average tasks per second (y-axis, 0 to 80) versus number of processors (x-axis: 16, 64, 128, 256, 512, 768). Series: Synchronous; Asynchronous.]

  21. Expectation Maximization
      Parallel efficiency per iteration of EM for various feature set sizes:

      Feature set size     80      252     8000    25298
      Reduce-map tasks     0.411   0.357   0.277   0.193
      Rare checks          0.362   0.314   0.253   0.18
      Redundant storage    0.013   0.013   0.013   0.012

  22. Conclusion
      By taking the following approaches, we have considerably improved performance for iterative parallel algorithms in Mrs:
      - Infrequent checkpointing
      - Reduce-map task
      - Generator-callback model
      - Asynchronous model

  23. Where to find Mrs
      Mrs homepage with links to source, documentation, mailing list, etc.: https://github.com/byu-aml-lab/mrs-mapreduce
      In case you forget the URL, just google "mrs mapreduce" :)
