Data-Intensive Scalable Computing
Randal E. Bryant, Carnegie Mellon University
http://www.cs.cmu.edu/~bryant

Examples of Big Data Sources
- Wal-Mart
  - 267 million items/day, sold at 6,000 stores
  - HP built them a 4 PB data warehouse
  - Mines the data to manage its supply chain, understand market trends, and formulate pricing strategies
- LSST
  - Chilean telescope will scan the entire sky every 3 days
  - A 3.2-gigapixel digital camera
  - Will generate 30 TB/day of image data

Why So Much Data?
- We can get it
  - Automation + the Internet
- We can keep it
  - Seagate Barracuda: 1.5 TB @ $150 (10¢/GB)
- We can use it
  - Scientific breakthroughs
  - Business process efficiencies
  - Realistic special effects
  - Better health care
- Could we do more?
  - Apply more computing power to this data

Google Data Center (The Dalles, Oregon)
- Hydroelectric power @ 2¢ / kWh
- 50 megawatts: enough to power 6,000 homes

Varieties of Cloud Computing
- Hosted services: “I don’t want to be a system administrator. You handle my data & applications.”
  - Documents, web-based email, etc.
  - Can access from anywhere
  - Easy sharing and collaboration
- Data-intensive scalable computing (DISC): “I’ve got terabytes of data. Tell me what they mean.”
  - Very large, shared data repository
  - Complex analysis

Oceans of Data, Skinny Pipes
- 1 terabyte: easy to store, hard to move

  Disks                     MB/s       Time to move 1 TB
  Seagate Barracuda         115        2.3 hours
  Seagate Cheetah           125        2.2 hours

  Networks                  MB/s       Time to move 1 TB
  Home Internet             < 0.625    > 18.5 days
  Gigabit Ethernet          < 125      > 2.2 hours
  PSC Teragrid connection   < 3,750    > 4.4 minutes

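A quick sanity check on the table, as a small Python sketch: time to move 1 TB at each quoted rate, assuming decimal units (1 TB = 10^6 MB). The rates are the ones on the slide; results roughly reproduce the times shown.

```python
# Time to move 1 TB at the sustained rates quoted on the slide.
# Assumes decimal units: 1 TB = 1,000,000 MB.
TERABYTE_MB = 1_000_000

rates_mb_per_s = {
    "Seagate Barracuda (disk)": 115,
    "Seagate Cheetah (disk)": 125,
    "Home Internet": 0.625,
    "Gigabit Ethernet": 125,
    "PSC Teragrid connection": 3_750,
}

for name, rate in rates_mb_per_s.items():
    seconds = TERABYTE_MB / rate
    if seconds >= 86_400:
        print(f"{name}: {seconds / 86_400:.1f} days")
    elif seconds >= 3_600:
        print(f"{name}: {seconds / 3_600:.1f} hours")
    else:
        print(f"{name}: {seconds / 60:.1f} minutes")
```
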
Data-Intensive System Challenge
- For computation that accesses 1 TB in 5 minutes:
  - Data distributed over 100+ disks, assuming uniform data partitioning
  - Compute using 100+ processors
  - Connected by gigabit Ethernet (or equivalent)
- System requirements
  - Lots of disks
  - Lots of processors
  - Located in close proximity, within reach of a fast local-area network

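A back-of-envelope check of the 1 TB in 5 minutes target. The per-disk sustained rate below is an illustrative assumption, not a figure from the slide; with it, the aggregate bandwidth requirement lands in the neighborhood of the slide's "100+ disks".

```python
# How much aggregate bandwidth does 1 TB in 5 minutes require,
# and how many disks/links does that imply?
data_mb = 1_000_000          # 1 TB, decimal units
window_s = 5 * 60            # 5 minutes

aggregate = data_mb / window_s   # ~3,333 MB/s of aggregate bandwidth
per_disk = 35                    # assumed sustained MB/s per disk (illustrative)
per_link = 125                   # ideal gigabit Ethernet, MB/s

print(f"aggregate bandwidth needed: {aggregate:,.0f} MB/s")
print(f"disks at {per_disk} MB/s each: {aggregate / per_disk:.0f}")    # on the order of 100
print(f"gigabit links at {per_link} MB/s: {aggregate / per_link:.0f}") # ~27 at best
```
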
Desiderata for DISC Systems
- Focus on data: terabytes, not tera-FLOPS
- Problem-centric programming: platform-independent expression of data parallelism
- Interactive access: from simple queries to massive computations
- Robust fault tolerance: component failures handled as routine events
- Contrast to existing supercomputer / HPC systems

System Comparison: Programming Models
- Conventional supercomputers (layers: application programs → software packages → machine-dependent programming model → hardware)
  - Programs described at a very low level: specify detailed control of processing & communications
  - Rely on a small number of software packages, written by specialists
  - Limits the classes of problems & solution methods
- DISC (layers: application programs → machine-independent programming model → runtime system → hardware)
  - Application programs written in terms of high-level operations on data
  - Runtime system controls scheduling, load balancing, …

System Comparison: Reliability
- Runtime errors are commonplace in large-scale systems: hardware failures, transient errors, software bugs
- Conventional supercomputers: “brittle” systems
  - Main recovery mechanism is to recompute from the most recent checkpoint
  - Must bring the system down for diagnosis, repair, or upgrades
- DISC: flexible error detection and recovery
  - Runtime system detects and diagnoses errors
  - Selective use of redundancy and dynamic recomputation
  - Replace or upgrade components while the system is running
  - Requires a flexible programming model & runtime environment

Exploring Parallel Computation Models
- Models such as SETI@home, MapReduce, MPI, threads, and PRAM span a spectrum from low communication / coarse-grained to high communication / fine-grained
- DISC + MapReduce provides coarse-grained parallelism
  - Computation done by independent processes
  - File-based communication
- Observations
  - Relatively “natural” programming model
  - Research issue to explore full potential and limits (Dryad project at MSR, Pig project at Yahoo!)

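To make the “natural” programming model concrete, here is a minimal single-machine word-count sketch of the MapReduce model in Python. The names map_fn, reduce_fn, and run_mapreduce are illustrative; real frameworks run the same structure across many machines with file-based intermediate results.

```python
from collections import defaultdict

def map_fn(_, line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Sum all counts emitted for the same word.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)          # "shuffle": group by intermediate key
    results = []
    for k, vs in sorted(groups.items()):
        results.extend(reduce_fn(k, vs))
    return results

lines = ["the quick brown fox", "the lazy dog"]
print(run_mapreduce(enumerate(lines), map_fn, reduce_fn))
```
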
Existing HPC Machines
- Two organizations: message passing (processes P1 … P5, each with its own memory) and shared memory (P1 … P5 over a common memory)
- Characteristics
  - Long-lived processes
  - Make use of spatial locality
  - Hold all program data in memory
  - High-bandwidth communication
- Strengths
  - High utilization of resources
  - Effective for many scientific applications
- Weaknesses
  - Very brittle: relies on everything working correctly and in close synchrony

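A minimal sketch of the message-passing style described above, using mpi4py; the library choice is an assumption, since the slide names no specific API. Each rank is a long-lived process that holds its share of the data in memory and communicates explicitly; if any rank dies, the whole job typically fails.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process owns a slice of the data for the life of the job.
local_data = range(rank * 1000, (rank + 1) * 1000)
local_sum = sum(local_data)

# Explicit, tightly synchronized communication: combine partial results at rank 0.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print(f"global sum over {size} processes: {total}")

# Run with, e.g.: mpirun -n 4 python mpi_sum.py
```
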
HPC Fault Tolerance
- (Timeline figure: processes P1 … P5 with periodic checkpoints, a restore after a failure, and the intervening computation wasted.)
- Checkpoint
  - Periodically store the state of all processes
  - Significant I/O traffic
- Restore
  - When a failure occurs, reset state to that of the last checkpoint
  - All intervening computation is wasted
- Performance scaling
  - Very sensitive to the number of failing components

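A minimal sketch of the checkpoint/restart scheme described above, for a single process with pickle-based snapshots; the file name, interval, and stand-in computation are illustrative assumptions.

```python
import os
import pickle

CHECKPOINT = "state.ckpt"
INTERVAL = 100                          # iterations between checkpoints

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)       # resume from the last saved state
    return {"step": 0, "accum": 0.0}    # fresh start

def save_checkpoint(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)           # I/O cost paid at every checkpoint

state = load_checkpoint()
while state["step"] < 10_000:
    state["accum"] += state["step"] * 0.5   # stand-in for real computation
    state["step"] += 1
    if state["step"] % INTERVAL == 0:
        save_checkpoint(state)

# A crash between checkpoints wastes up to INTERVAL iterations of work.
```
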
Map/Reduce Operation
- Characteristics
  - Computation broken into many short-lived tasks: mapping, reducing
  - Use disk storage to hold intermediate results
- Strengths
  - Great flexibility in placement, scheduling, and load balancing
  - Handle failures by recomputation
  - Can access large data sets
- Weaknesses
  - Higher overhead
  - Lower raw performance

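A sketch of what the short-lived, stream-based tasks look like, in the style of a Hadoop Streaming word count; mapper and reducer are combined in one script here for brevity, and the grouping between them is done by an external sort, which the framework normally provides.

```python
import sys

def mapper(stream):
    for line in stream:
        for word in line.split():
            print(f"{word.lower()}\t1")       # emit intermediate (key, value)

def reducer(stream):
    current, count = None, 0
    for line in stream:                       # input arrives sorted by key
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    {"map": mapper, "reduce": reducer}[sys.argv[1]](sys.stdin)
```

Local usage, simulating the framework's sort step: `python wc.py map < input.txt | sort | python wc.py reduce`
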
Generalizing Map/Reduce (e.g., the Microsoft Dryad project)
- (Figure: inputs x1 … xn flowing through layers of operators Op1, Op2, …, Opk.)
- Computational model
  - Acyclic graph of operators, but expressed as a textual program
  - Each operator takes a collection of objects and produces objects
  - Purely functional model
- Implementation concepts
  - Objects stored in files or memory
  - Any object may be lost; any operator may fail
  - Replicate & recompute for fault tolerance
  - Dynamic scheduling: # operators >> # processors

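A toy sketch of the acyclic-operator idea, not the actual Dryad API: each node is a pure operator over its inputs, so any lost output can simply be recomputed from its upstream nodes. The Node class and the eviction method are illustrative inventions.

```python
class Node:
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs
        self._cached = None        # stands in for a file or in-memory object

    def value(self):
        if self._cached is None:   # lost or never computed: (re)compute
            args = [n.value() for n in self.inputs]
            self._cached = self.op(*args)
        return self._cached

    def evict(self):
        self._cached = None        # simulate losing the stored object

# x1..x3 -> Op1 (square) -> Op2 (sum): a tiny acyclic graph
sources = [Node(lambda v=v: v) for v in (1, 2, 3)]
squares = [Node(lambda a: a * a, s) for s in sources]
total = Node(lambda *a: sum(a), *squares)

print(total.value())                 # 14
squares[1].evict(); total.evict()    # "lose" stored objects after a failure
print(total.value())                 # recomputed from upstream nodes: still 14
```
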
Concluding Thoughts
- Data-intensive computing is becoming commonplace
  - Facilities available from Google/IBM, Yahoo!, …
  - Hadoop becoming the platform of choice
- Lots of applications are fairly straightforward
  - Use Map to do embarrassingly parallel execution
  - Make use of Hadoop's load balancing and reliable file system
- What remains
  - Integrating more demanding forms of computation: computations over large graphs, sparse numerical applications
  - Challenges: programming, implementation efficiency