  1. Data Intensive Scalable Computing
     Randal E. Bryant, Carnegie Mellon University
     http://www.cs.cmu.edu/~bryant

  2. Examples of Big Data Sources
     Wal-Mart
     - 267 million items/day, sold at 6,000 stores
     - HP built them a 4 PB data warehouse
     - Mine data to manage the supply chain, understand market trends, formulate pricing strategies
     LSST
     - Chilean telescope will scan the entire sky every 3 days
     - A 3.2-gigapixel digital camera
     - Generates 30 TB/day of image data

  3. Why So Much Data?
     We can get it
     - Automation + Internet
     We can keep it
     - Seagate Barracuda: 1.5 TB @ $150 (10¢/GB)
     We can use it
     - Scientific breakthroughs
     - Business process efficiencies
     - Realistic special effects
     - Better health care
     Could we do more?
     - Apply more computing power to this data

  4. Google Data Center
     The Dalles, Oregon
     - Hydroelectric power @ 2¢/kWh
     - 50 megawatts: enough to power 6,000 homes

  5. Varieties of Cloud Computing
     "I've got terabytes of data. Tell me what they mean."
     - Very large, shared data repository
     - Complex analysis
     - Data-intensive scalable computing (DISC)
     "I don't want to be a system administrator. You handle my data & applications."
     - Hosted services
     - Documents, web-based email, etc.
     - Can access from anywhere
     - Easy sharing and collaboration

  6. Oceans of Data, Skinny Pipes
     1 terabyte: easy to store, hard to move.

     Disks                 MB/s      Time to move 1 TB
     Seagate Barracuda     115       2.3 hours
     Seagate Cheetah       125       2.2 hours

     Network connections   MB/s      Time to move 1 TB
     Home Internet         < 0.625   > 18.5 days
     Gigabit Ethernet      < 125     > 2.2 hours
     PSC Teragrid          < 3,750   > 4.4 minutes
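
     The times in the table are just size divided by bandwidth. A minimal sketch in Python, using the rates from the table and decimal units (1 TB = 10^6 MB), so the Barracuda row lands at 2.4 rather than 2.3 hours due to unit conventions:

```python
# Time to move 1 TB at each sustained rate (rates taken from the table).
links = {
    "Seagate Barracuda": 115,     # MB/s, local disk
    "Seagate Cheetah":   125,     # MB/s, local disk
    "Home Internet":     0.625,   # MB/s (5 Mb/s)
    "Gigabit Ethernet":  125,     # MB/s
    "PSC Teragrid":      3750,    # MB/s
}

for name, rate in links.items():
    hours = 1_000_000 / rate / 3600          # 1 TB = 1,000,000 MB
    label = f"{hours / 24:.1f} days" if hours > 48 else f"{hours:.1f} hours"
    print(f"{name:18s} {label}")
```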

  7. Data-Intensive System Challenge
     For computation that accesses 1 TB in 5 minutes:
     - Data distributed over 100+ disks (assuming uniform data partitioning)
     - Compute using 100+ processors
     - Connected by gigabit Ethernet (or equivalent)
     System requirements:
     - Lots of disks
     - Lots of processors
     - Located in close proximity, within reach of a fast local-area network
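
     A back-of-the-envelope check of that requirement, reusing the Barracuda rate from slide 6 (a sketch; the headroom reasoning in the final comment is an inference, not from the slides):

```python
# Aggregate bandwidth needed to scan 1 TB in 5 minutes.
DATA_MB = 1_000_000                 # 1 TB in MB (decimal)
WINDOW_S = 5 * 60                   # 5 minutes

needed_mb_per_s = DATA_MB / WINDOW_S            # ~3,333 MB/s aggregate
per_disk_mb_per_s = 115                         # one Barracuda, sequential
min_disks = needed_mb_per_s / per_disk_mb_per_s

print(f"aggregate bandwidth: {needed_mb_per_s:,.0f} MB/s")
print(f"disks at full sequential speed: {min_disks:.0f}")
# ~29 disks would suffice if every disk streamed at peak sequential rate;
# seeks, skewed partitions, and headroom push the practical count to 100+.
```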

  8. Desiderata for DISC Systems
     Focus on data
     - Terabytes, not tera-FLOPS
     Problem-centric programming
     - Platform-independent expression of data parallelism
     Interactive access
     - From simple queries to massive computations
     Robust fault tolerance
     - Component failures are handled as routine events
     Contrast to existing supercomputer / HPC systems.

  9. System Comparison: Programming Models
     Conventional supercomputers (stack: application programs, software packages, machine-dependent programming model, hardware):
     - Programs described at a very low level: specify detailed control of processing & communications
     - Rely on a small number of software packages, written by specialists
     - Limits classes of problems & solution methods
     DISC (stack: application programs, machine-independent programming model, runtime system, hardware):
     - Application programs written in terms of high-level operations on data
     - Runtime system controls scheduling, load balancing, ...
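
     To make the contrast concrete, a small illustrative sketch (the file name and the partitioning scheme are invented for the example):

```python
# The same word count two ways.

# DISC style: a high-level operation on the data; partitioning, placement,
# and scheduling are the runtime's problem.
total = sum(len(line.split()) for line in open("docs.txt"))

# Low-level HPC style: the programmer pins down which worker handles
# which piece of the data and how partial results are combined.
def worker(rank, nworkers):
    lines = open("docs.txt").read().splitlines()
    return sum(len(l.split()) for l in lines[rank::nworkers])

partials = [worker(r, 4) for r in range(4)]  # really 4 parallel processes
total_low = sum(partials)                    # explicit combination step
```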

  10. System Comparison: Reliability
      Runtime errors are commonplace in large-scale systems:
      - Hardware failures
      - Transient errors
      - Software bugs
      Conventional supercomputers: "brittle" systems
      - Main recovery mechanism is to recompute from the most recent checkpoint
      - Must bring down the system for diagnosis, repair, or upgrades
      DISC: flexible error detection and recovery
      - Runtime system detects and diagnoses errors
      - Selective use of redundancy and dynamic recomputation
      - Replace or upgrade components while the system is running
      - Requires a flexible programming model & runtime environment

  11. Exploring Parallel Computation Models
      Models span a spectrum from low-communication, coarse-grained (e.g., SETI@home, MapReduce) to high-communication, fine-grained (e.g., MPI, threads, PRAM).
      DISC + MapReduce provides coarse-grained parallelism:
      - Computation done by independent processes
      - File-based communication (see the word-count sketch below)
      Observations:
      - Relatively "natural" programming model
      - Research issue to explore full potential and limits: the Dryad project at MSR, the Pig project at Yahoo!
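
      A minimal sketch of the model in plain Python (not Hadoop's API; the splits/ directory and file names are invented): each map task is independent, and all communication between the map and reduce phases goes through intermediate files:

```python
# MapReduce-style word count with file-based intermediates.
import collections
import glob

def map_task(in_path, out_path):
    """Independent map task: reads one input split, writes (word, 1) pairs."""
    with open(in_path) as f, open(out_path, "w") as out:
        for line in f:
            for word in line.split():
                out.write(f"{word}\t1\n")

def reduce_task(intermediate_paths):
    """Reduce: merges all intermediate files and sums counts per word."""
    counts = collections.Counter()
    for path in intermediate_paths:
        with open(path) as f:
            for line in f:
                word, n = line.rsplit("\t", 1)
                counts[word] += int(n)
    return counts

# Since the only coupling between tasks is through files, a scheduler is
# free to place, reorder, and re-run them anywhere.
for i, split in enumerate(glob.glob("splits/*.txt")):
    map_task(split, f"intermediate-{i}.txt")
print(reduce_task(glob.glob("intermediate-*.txt")).most_common(5))
```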

  12. Message Passing: Existing HPC Machines
      (Diagrams: message passing among processes P1..P5 with private memories, and shared memory with P1..P5 attached to a common memory.)
      Characteristics:
      - Long-lived processes
      - Make use of spatial locality
      - Hold all program data in memory
      - High-bandwidth communication
      Strengths:
      - High utilization of resources
      - Effective for many scientific applications
      Weaknesses:
      - Very brittle: relies on everything working correctly and in close synchrony (see the MPI-style sketch below)
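
      The "long-lived processes, data held in memory, close synchrony" pattern, sketched with mpi4py (assumes an MPI installation; run under, e.g., `mpirun -n 4 python sum.py`):

```python
# Long-lived message-passing processes in the style the slide describes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()        # this process's id, fixed for the whole run
size = comm.Get_size()

# Each process holds its slice of the data in memory for the entire job.
local_data = list(range(rank * 1000, (rank + 1) * 1000))
local_sum = sum(local_data)

# Tight, synchronous collective: every process must participate, so the
# loss of any one process stalls the whole computation ("brittle").
total = comm.allreduce(local_sum, op=MPI.SUM)
if rank == 0:
    print("global sum:", total)
```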

  13. HPC Fault Tolerance
      Checkpoint:
      - Periodically store the state of all processes
      - Significant I/O traffic
      Restore:
      - When a failure occurs, reset state to that of the last checkpoint
      - All intervening computation is wasted
      Performance scaling:
      - Very sensitive to the number of failing components
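
      The checkpoint/restore cycle in miniature (a single-process sketch using pickle; real HPC checkpointing snapshots every process at once, hence the heavy I/O traffic):

```python
# Periodic checkpointing with restart from the last saved state.
import os
import pickle

CKPT = "state.ckpt"

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)          # restore: rewind to last checkpoint
    return {"step": 0, "acc": 0}           # fresh start

state = load_checkpoint()
for step in range(state["step"], 1_000):
    state["acc"] += step                   # the "computation"
    state["step"] = step + 1
    if step % 100 == 0:                    # periodic checkpoint (I/O cost)
        with open(CKPT, "wb") as f:
            pickle.dump(state, f)
# A crash between checkpoints wastes (and must redo) up to 100 steps of work.
```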

  14. Map/Reduce Operation
      (Diagram: many parallel map tasks feeding reduce tasks through intermediate storage.)
      Characteristics:
      - Computation broken into many short-lived tasks: mapping, reducing
      - Use disk storage to hold intermediate results
      Strengths:
      - Great flexibility in placement, scheduling, and load balancing
      - Handle failures by recomputation
      - Can access large data sets
      Weaknesses:
      - Higher overhead
      - Lower raw performance
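
      In contrast to slide 13's global rollback, a failed map task is short-lived, deterministic, and independent, so the runtime simply runs it again. A toy sketch (the flaky task and retry loop are invented stand-ins for a real scheduler):

```python
# "Handle failures by recomputation": just redo the lost task.
import random

def flaky_map_task(split):
    if random.random() < 0.3:              # simulate a worker dying
        raise RuntimeError("worker lost")
    return [(w, 1) for w in split.split()]

def run_with_retries(task, arg, attempts=5):
    for _ in range(attempts):
        try:
            return task(arg)               # could be rescheduled anywhere
        except RuntimeError:
            continue                       # no rollback needed: just redo it
    raise RuntimeError("task failed repeatedly")

print(run_with_retries(flaky_map_task, "the quick brown fox"))
```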

  15. Generalizing Map/Reduce (e.g., the Microsoft Dryad project)
      (Diagram: inputs x1..xn feeding successive layers of operators Op1 through Opk.)
      Computational model:
      - Acyclic graph of operators, but expressed as a textual program
      - Each operator takes a collection of objects and produces objects
      - Purely functional model
      Implementation concepts:
      - Objects stored in files or memory
      - Any object may be lost; any operator may fail
      - Replicate & recompute for fault tolerance
      - Dynamic scheduling: # operators >> # processors
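
      A toy version of the idea (the node names and graph are invented, and this is not Dryad's actual API): operators are pure functions over their inputs, cached results stand in for files or memory, and anything lost is recomputed from the graph:

```python
# Acyclic graph of purely functional operators with recompute-on-loss.
deps = {                                  # node -> (function, input nodes)
    "x":   (lambda: list(range(10)), []),
    "op1": (lambda x: [v * v for v in x], ["x"]),
    "op2": (lambda y: sum(y), ["op1"]),
}
cache = {}                                # stands in for files or memory

def materialize(node):
    """Return the node's objects, recomputing anything that was lost."""
    if node not in cache:                 # missing: never built, or lost
        fn, inputs = deps[node]           # rebuild it from its inputs,
        cache[node] = fn(*[materialize(i) for i in inputs])  # recursively
    return cache[node]

print(materialize("op2"))                 # 285
cache.pop("op1", None)                    # simulate losing intermediates...
cache.pop("op2", None)
print(materialize("op2"))                 # ...transparently recomputed: 285
```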

  16. Concluding Thoughts
      Data-intensive computing is becoming commonplace:
      - Facilities available from Google/IBM, Yahoo!, ...
      - Hadoop is becoming the platform of choice
      - Lots of applications are fairly straightforward: use Map for embarrassingly parallel execution, and rely on Hadoop's load balancing and reliable file system (a Streaming-style sketch follows below)
      What remains:
      - Integrating more demanding forms of computation: computations over large graphs, sparse numerical applications
      - Challenges: programming, implementation efficiency
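
      For the "fairly straightforward" case, a sketch of a Hadoop Streaming mapper (Streaming pipes each input split to the script on stdin and expects tab-separated key/value lines on stdout; paired with a summing reducer this gives word count, with Hadoop supplying the scheduling, re-execution, and HDFS storage):

```python
#!/usr/bin/env python
# mapper.py: a Hadoop Streaming mapper for an embarrassingly parallel map.
# Hadoop feeds this task its input split on stdin; the emitted "key\tvalue"
# lines are shuffled to reducers, which sum the counts per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```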
