Roomy: A New Approach to Parallel Disk-based Computation, Dan Kunkle


  1. Roomy: A New Approach to Parallel Disk-based Computation. Dan Kunkle, Thesis Proposal, College of Computer and Information Science, Northeastern University. Committee: Gene Cooperman (Advisor), Panagiotis Manolios, Mirek Riedewald, Fan Yang (Google). November 9, 2009. (Slide footer: Dan Kunkle, Roomy, disk-based computation, November 9, 2009, 1 / 25)

  2. Outline: (1) Overview of Parallel Disk-based Computation; (2) Roomy: Programming Model, Goals, and Design; (3) Related Work; (4) Research Goals and Applications; (5) Example Application: Pancake Sorting Problem.

  4. Problem Statement. Goal: to solve space-limited problems without significantly increasing hardware costs or radically altering existing algorithms and data structures. A space-limited problem is one where existing solutions quickly exceed available memory. This could be solved by significantly increasing available RAM, but that is expensive. New algorithmic techniques that reduce space usage (e.g., Bloom filters) may help in certain cases, but not always. Our approach is to use parallel disk-based computation.

  5. Definition: Parallel Disk-based Computation. Parallel disk-based computation: using disks, instead of RAM, as the main working memory of a computation. This provides several orders of magnitude more space for the same price. Performance Issues and Solutions. Bandwidth: the bandwidth of a disk is roughly 50 times lower than that of RAM (100 MB/s versus 5 GB/s). Solution: use many disks in parallel. Latency: even worse, the latency of disk is many orders of magnitude higher than that of RAM. Solution: avoid latency penalties by using streaming access.
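The bandwidth argument on this slide can be checked with a few lines of arithmetic. The sketch below uses only the two figures quoted on the slide (100 MB/s per disk, 5 GB/s for RAM); the function names are illustrative, and the assumption that bandwidth scales linearly with the number of disks is an idealization, not a measured result.

```python
import math

# Figures quoted on the slide; 1 GB is taken as 1000 MB.
DISK_MB_PER_S = 100
RAM_MB_PER_S = 5_000


def disks_to_match_ram(disk_mb_s=DISK_MB_PER_S, ram_mb_s=RAM_MB_PER_S):
    """Smallest number of disks whose combined streaming bandwidth reaches RAM's.

    Assumes ideal linear scaling across disks (an assumption of this sketch).
    """
    return math.ceil(ram_mb_s / disk_mb_s)


def aggregate_bandwidth_mb_s(num_disks, disk_mb_s=DISK_MB_PER_S):
    """Combined streaming bandwidth of num_disks disks read in parallel."""
    return num_disks * disk_mb_s


if __name__ == "__main__":
    n = disks_to_match_ram()
    print(n, aggregate_bandwidth_mb_s(n))
```

Under these assumptions, the 50x bandwidth gap corresponds to about 50 disks streaming in parallel.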

  6. Implications of Disk-based Computation. By replacing RAM with disks: a cluster of 50 computers, each with 8 cores and 1 TB of disk space, can substitute for a shared-memory computer with 400 cores and a single 50 TB memory subsystem. Algorithm and Software Engineering Issues: unfortunately, writing programs that use many disks in parallel and avoid random access is often difficult. Our group has five years of case histories applying this to computational group theory, but each case requires months of development and debugging. Example: Rubik’s Cube in 26 moves, 2007, 8 TB of aggregate storage (CACM Viewpoint, April 2008).
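The cluster figures on this slide follow from simple multiplication. The sketch below reproduces them and adds one derived quantity: how long a node would take to stream through its whole disk once. That estimate assumes a single disk per node running at the 100 MB/s rate quoted earlier in the talk, and decimal units (1 TB = 1,000,000 MB); both are assumptions of this sketch rather than statements from the slides.

```python
NODES = 50
CORES_PER_NODE = 8
DISK_TB_PER_NODE = 1
DISK_MB_PER_S = 100  # assumption: one disk per node at the quoted per-disk rate


def cluster_totals(nodes=NODES, cores=CORES_PER_NODE, disk_tb=DISK_TB_PER_NODE):
    """Return (total cores, total disk space in TB) for the cluster."""
    return nodes * cores, nodes * disk_tb


def full_scan_hours(disk_tb=DISK_TB_PER_NODE, mb_per_s=DISK_MB_PER_S):
    """Hours for one node to stream its entire disk once (decimal units)."""
    megabytes = disk_tb * 1_000_000
    return megabytes / mb_per_s / 3600


if __name__ == "__main__":
    print(cluster_totals())
    print(round(full_scan_hours(), 2))
```

Because every node scans its own disk at the same time, the whole 50 TB can in principle be streamed in roughly the time one node needs for 1 TB.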

  8. Roomy. Roomy is: a new programming model that extends a programming language with transparent disk-based computing support; and an open source library for C/C++ implementing this new programming language extension. The primary goals of Roomy are: Minimally invasive: common data structures in the user's sequential code are replaced by Roomy data structures (lists, arrays, and hash tables). Performance: the interface biases programmers toward approaches with high-performance parallel disk-based implementations. Choice of architectures: can use shared or distributed memory; locally attached disks or storage area networks (SAN). Fault tolerance: can be combined with our group’s distributed checkpointing tool DMTCP.

  9. Roomy Programming Model. The Roomy programming model: provides basic data structures (arrays, lists, and hash tables); transparently distributes data structures across many disks and performs operations on that data in parallel; immediately processes streaming access operators; delays processing random access operators until they can be performed efficiently in batch (e.g., collecting and sorting updates to an array).
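The batching idea in the last point (collecting and sorting updates to an array) can be illustrated with a small in-memory sketch. `DelayedArray`, `update`, and `sync` are hypothetical names chosen for this example and are not the Roomy API; a Python list stands in for storage that would live on disk.

```python
class DelayedArray:
    """Toy array whose random-access updates are queued, then applied in one pass."""

    def __init__(self, size, fill=0):
        self.data = [fill] * size  # stands in for on-disk storage
        self.pending = []          # queued (index, issue order, value) triples

    def update(self, index, value):
        """Queue an update; nothing is written until sync() is called."""
        if not 0 <= index < len(self.data):
            raise IndexError(index)
        self.pending.append((index, len(self.pending), value))

    def sync(self):
        """Sort queued updates by index and apply them in a single ordered sweep.

        Ties on index are broken by issue order, so the last update wins.
        Returns the number of updates applied.
        """
        self.pending.sort()
        for index, _, value in self.pending:
            self.data[index] = value
        applied = len(self.pending)
        self.pending.clear()
        return applied


if __name__ == "__main__":
    arr = DelayedArray(5)
    arr.update(3, "c")
    arr.update(0, "a")
    arr.update(3, "d")
    arr.sync()
    print(arr.data)
```

Sorting by index turns scattered writes into one front-to-back sweep, which is the access pattern a disk handles well.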

  10. Example: Delayed Processing of Hash Table Insertions. [Figure not recoverable from the source.]
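Since the slide's figure did not survive, the following is only a guess at the general idea its title names: insertions are buffered, grouped by destination bucket, and then applied one bucket at a time. `DelayedHashTable` and its methods are hypothetical names for this sketch, not the Roomy API, and in-memory dictionaries stand in for on-disk buckets.

```python
from collections import defaultdict


class DelayedHashTable:
    """Toy hash table that buffers insertions and applies them bucket by bucket."""

    def __init__(self, num_buckets=4):
        # Each dict stands in for a bucket that would be stored on disk.
        self.buckets = [dict() for _ in range(num_buckets)]
        self.pending = defaultdict(list)  # bucket number -> queued (key, value)

    def _bucket_of(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        """Queue an insertion; the table itself is not touched yet."""
        self.pending[self._bucket_of(key)].append((key, value))

    def sync(self):
        """Apply queued insertions, visiting each affected bucket exactly once."""
        for bucket_number in sorted(self.pending):
            self.buckets[bucket_number].update(self.pending[bucket_number])
        self.pending.clear()

    def lookup(self, key):
        """Return the stored value, or None if the key is absent or still queued."""
        return self.buckets[self._bucket_of(key)].get(key)


if __name__ == "__main__":
    table = DelayedHashTable()
    table.insert(1, "a")
    table.insert(5, "b")
    table.sync()
    print(table.lookup(1), table.lookup(5))
```

Grouping by bucket means each bucket is loaded and written once per batch, however many insertions target it.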

  11. Design of Roomy (layers, top to bottom). Applications: A.I. search (pancake sorting, Rubik’s Cube); SAT solver; binary decision diagrams; explicit state model checking. Algorithm Library: breadth-first search; parallel depth-first search; dynamic programming. API: RoomyList (add, remove; addAll, removeAll; removeDupes; map, reduce) and RoomyArray (update, predicates; delayed read; map, reduce). Foundation: file management; remote I/O; external sorting; synchronization and barriers.
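Two of the list operations named in the API layer, removeDupes and removeAll, are natural fits for sort-then-scan processing, which is presumably why external sorting appears in the foundation layer. The sketch below shows that technique on in-memory lists. The function names `remove_dupes` and `remove_all` are illustrative Python stand-ins, not the signatures of the C library, and `sorted` stands in for an external sort.

```python
def remove_dupes(items):
    """Return the distinct elements of items in sorted order.

    After sorting, duplicates are adjacent, so one sequential scan removes them.
    """
    out = []
    for x in sorted(items):  # stand-in for an external (disk-based) sort
        if not out or out[-1] != x:
            out.append(x)
    return out


def remove_all(sorted_a, sorted_b):
    """Return elements of sorted_a that do not occur in sorted_b.

    Both inputs must already be sorted; the result comes from a single
    merge-style pass that only ever moves forward through each list.
    """
    out = []
    j = 0
    for x in sorted_a:
        while j < len(sorted_b) and sorted_b[j] < x:
            j += 1
        if j == len(sorted_b) or sorted_b[j] != x:
            out.append(x)
    return out


if __name__ == "__main__":
    print(remove_dupes([3, 1, 3, 2, 1]))
    print(remove_all([1, 2, 3, 5], [2, 5, 7]))
```

Both functions touch their inputs strictly in order, so the same logic would work on data streamed from disk.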

  13. Types of Disk-based Computing Systems. Current approaches to disk-based computing can be classified into a few broad categories. Large-scale data processing: primarily motivated by a need to process very large data sets, such as in web search. Focus on scalability and fault tolerance. → MapReduce (Google), Hadoop (open source MapReduce), Dryad (Microsoft Research). Libraries of theoretically optimal algorithms: motivated by the development of external memory complexity models and algorithms. → TPIE, STXXL. Roomy.

  14. Delayed Random Operations. Three ways to handle random access operations: (1) Eliminate random access operations (e.g., MapReduce) → limits the range of algorithms that can be used. (2) Process random access operations immediately (e.g., STXXL) → may suffer large latency penalties. (3) Delay processing until they can be performed efficiently. The delayed processing of random access operations is one of the features that differentiates Roomy from other approaches to disk-based computation.
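A rough cost model shows why immediate random access can be so expensive compared with a delayed, streaming pass. The 10 ms per-operation seek time below is an assumed value typical of hard disks of that era and does not come from the slides; the 100 MB/s streaming rate is the per-disk figure quoted earlier in the talk. The model ignores the cost of sorting the delayed operations, so it flatters the delayed case somewhat.

```python
SEEK_SECONDS = 0.010          # assumption: about 10 ms per random disk access
STREAM_BYTES_PER_S = 100e6    # 100 MB/s streaming rate, as quoted in the talk


def immediate_seconds(num_ops, seek_s=SEEK_SECONDS):
    """Time to perform num_ops random accesses one at a time, one seek each."""
    return num_ops * seek_s


def streaming_seconds(total_bytes, bytes_per_s=STREAM_BYTES_PER_S):
    """Time to read total_bytes sequentially at the streaming rate."""
    return total_bytes / bytes_per_s


if __name__ == "__main__":
    # One million random updates versus one sequential pass over 8 GB.
    print(immediate_seconds(1_000_000))
    print(streaming_seconds(8e9))
```

Under these assumptions, a million scattered accesses cost hours, while a full sequential pass over several gigabytes costs on the order of a minute or two.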

  16. Research Goals. Using Roomy as a development platform, the two central research questions we seek to answer are: (1) What is the class of applications for which parallel disk-based computing is practical? (2) How can existing sequential algorithms and software be adapted to take advantage of parallel disk-based computing? We will answer these questions by using Roomy to extend existing algorithms and software.

  17. Applications of Parallel Disk-based Computation. Previous disk-based computing projects: 26 moves suffice for Rubik’s Cube; search and enumeration problems from computational group theory; general breadth-first search (e.g., pancake sorting problem). Target applications of Roomy from formal verification: bounded model checking using SAT solvers; binary decision diagrams (BDDs); explicit state model checking. We will implement one or more of the above applications by integrating existing open source software with the Roomy library.
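Breadth-first search on the pancake sorting problem can be written so that each level is produced by batch operations (generate all neighbors, remove duplicates, remove states already seen), which is the style that suits disk-based data structures. The sketch below is an in-memory illustration of that structure only; it is not the implementation from the thesis work, and the function names are chosen for this example.

```python
from itertools import chain


def flips(state):
    """All states reachable by reversing a prefix of length 2..n (one pancake flip)."""
    return [state[:k][::-1] + state[k:] for k in range(2, len(state) + 1)]


def bfs_level_sizes(n):
    """Number of pancake stacks of size n at each flip distance from the sorted stack."""
    start = tuple(range(n))
    visited = {start}
    frontier = [start]
    sizes = []
    while frontier:
        sizes.append(len(frontier))
        # Batch step 1: generate every neighbor of the current level.
        neighbors = chain.from_iterable(flips(s) for s in frontier)
        # Batch step 2: remove duplicates within the new level.
        candidates = sorted(set(neighbors))
        # Batch step 3: remove states seen at earlier levels.
        frontier = [s for s in candidates if s not in visited]
        visited.update(frontier)
    return sizes


if __name__ == "__main__":
    print(bfs_level_sizes(3))
```

The length of the returned list minus one is the largest number of flips ever needed for a stack of that size.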

  18. Potential Applications of Roomy (discipline: example application)
      - A.I. Search: Rubik’s Cube
      - Group Theory: Search and Enumeration in Mathematical Structures
      - Verification: SAT Solvers (as used in Bounded Model Checking)
      - Verification: Symbolic Computation using BDDs
      - Verification: Explicit State Verification
      - Coding Theory: Search for New Codes
      - Security: Exhaustive Search for Passwords
      - Semantic Web: RDF query language; OWL Web Ontology Language
      - Artificial Intelligence: Planning
      - Proteomics: Protein folding via a kinetic network model
      - Operations Research: Branch-and-Bound
      - Operations Research: Integer Programming (application of Branch-and-Bound)
      - Economics: Dynamic Programming
      - Numerical Analysis: ATLAS, PHiPAC, FFTW, and other adaptive software
      - Engineering: Sensor Data
