CS 744: Big Data Systems Shivaram Venkataraman Fall 2018

Aug 02, 2023

CS 744: Big Data Systems Shivaram Venkataraman Fall 2018
ADMINISTRIVIA
- Assignment 1
- Projects
- Piazza
MOTIVATION
Storing large amounts of semi-structured data
- Traditionally done using database systems
Varied processing needs
- Low latency to bulk processing
- Data size
- Schema
BIGTABLE: HIGHLIGHTS
1. Scalability: petabytes of data, thousands of machines
2. Wide applicability: handles > 60 applications
3. Fault tolerant: high availability
4. High performance
OUTLINE
- Data Model and API
- Architecture
- Master, tablet server functionality
- Optimizations
DATA MODEL
- Rows
- Column Families
- Versions (“Timestamps”)
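The three dimensions named on this slide (rows, column families, timestamped versions) can be pictured as a sparse, nested map. The sketch below is only an illustration under assumptions: the class name `SparseTable`, the `family:qualifier` column-key format, and the sample row key are hypothetical and are not taken from the slides.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of the data model: (row, "family:qualifier", timestamp) -> value."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        # Each write adds a new version; older versions are kept.
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (latest if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(eligible)] if eligible else None


table = SparseTable()
table.put("com.example.www", "contents:", 3, "<html>v3</html>")
table.put("com.example.www", "contents:", 5, "<html>v5</html>")
latest = table.get("com.example.www", "contents:")
as_of_4 = table.get("com.example.www", "contents:", timestamp=4)
```

Here `latest` is the version written at timestamp 5, while `as_of_4` falls back to the version written at timestamp 3.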
WRITE API
- Single row at a time!
- Set a number of columns or delete some
- Apply is atomic
- Support for read-modify-write transactions
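A minimal sketch of the write pattern described above: a mutation collects set and delete operations for one row, and a single apply step publishes them together. The names `RowMutation`, `Table`, and `apply` are hypothetical stand-ins rather than the real client API, and one process-wide lock is assumed for simplicity.

```python
import threading


class RowMutation:
    """Collects set/delete operations that all target a single row."""

    def __init__(self, row):
        self.row = row
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column, None))


class Table:
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # assumption: one lock for the whole table

    def apply(self, mutation):
        """Apply every operation in the mutation, or none of them become visible."""
        with self._lock:
            row = dict(self._rows.get(mutation.row, {}))  # work on a copy
            for op, column, value in mutation.ops:
                if op == "set":
                    row[column] = value
                else:
                    row.pop(column, None)
            self._rows[mutation.row] = row  # publish all changes at once

    def read(self, row):
        with self._lock:
            return dict(self._rows.get(row, {}))


table = Table()
first = RowMutation("row1")
first.set("anchor:a", "A")
first.set("anchor:b", "B")
table.apply(first)

second = RowMutation("row1")
second.set("anchor:c", "C")
second.delete("anchor:a")
table.apply(second)
```

Readers in this sketch take the same lock, so they see either the row before a mutation or the row after it, never a partial update.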
SCAN API
- Fetch any number of columns, column families
- Filter rows by regex
- Iterator pattern, rows arriving in sorted order
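The scan behaviour listed above can be sketched as a generator: it yields rows in sorted key order, optionally keeps only rows whose key matches a regular expression, and optionally restricts the columns to chosen families. The function name `scan`, its parameters, and the `family:qualifier` column-key format are assumptions made for illustration.

```python
import re


def scan(rows, families=None, row_regex=None):
    """Yield (row_key, columns) in sorted row order.

    rows: dict mapping row key -> {"family:qualifier": value}
    families: optional set of column family names to keep
    row_regex: optional pattern that a row key must match
    """
    pattern = re.compile(row_regex) if row_regex else None
    for key in sorted(rows):
        if pattern and not pattern.search(key):
            continue
        columns = {
            name: value
            for name, value in rows[key].items()
            if families is None or name.split(":", 1)[0] in families
        }
        if columns:
            yield key, columns


rows = {
    "b": {"anchor:x": 1, "contents:": 2},
    "a": {"anchor:y": 3},
    "c": {"contents:": 4},
}
anchors_only = list(scan(rows, families={"anchor"}))
b_and_c = [key for key, _ in scan(rows, row_regex="^[bc]$")]
```

Because the function is a generator, a caller can stop iterating early without materializing the remaining rows.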
TABLETS
SYSTEM ARCHITECTURE
- BigTable Master: metadata ops, rebalancing
- BigTable TabletServers (multiple): serve data from tablets
- GFS: store tablets
- Chubby: leader election, replicated storage of metadata
CHUBBY: A LOCK SERVICE
Leader election: classic problem in distributed systems
Approach: build a separate service to handle leader election
Properties:
- Uses Paxos algorithm
- Low write throughput
- Stores small amounts of data
TABLET LOCATION
- Hierarchical metadata
- Root of metadata in Chubby
- Client library caches tablet locations
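The lookup hierarchy above can be sketched in two steps: a root index points to a metadata tablet, which in turn points to the tablet server for a row key, and the client remembers the answer. Everything named here (`TabletIndex`, `Client`, the `"\uffff"` sentinel end key, per-key caching) is a simplifying assumption; a real client would cache key ranges and handle stale entries.

```python
import bisect

MAX_KEY = "\uffff"  # assumed sentinel: end key of the last tablet


class TabletIndex:
    """Maps a row key to the entry whose end key is the first one >= the key."""

    def __init__(self, entries):
        # entries: list of (end_row_key, location), sorted by end_row_key
        self._ends = [end for end, _ in entries]
        self._locations = [location for _, location in entries]

    def lookup(self, key):
        return self._locations[bisect.bisect_left(self._ends, key)]


class Client:
    def __init__(self, root, metadata_tablets):
        self.root = root  # stands in for the root location kept in Chubby
        self.metadata_tablets = metadata_tablets  # name -> TabletIndex
        self.cache = {}
        self.lookups = 0  # counts trips through the hierarchy

    def locate(self, key):
        if key in self.cache:
            return self.cache[key]
        self.lookups += 1
        metadata_name = self.root.lookup(key)
        server = self.metadata_tablets[metadata_name].lookup(key)
        self.cache[key] = server
        return server


root = TabletIndex([("m", "meta0"), (MAX_KEY, "meta1")])
metadata = {
    "meta0": TabletIndex([("f", "ts1"), ("m", "ts2")]),
    "meta1": TabletIndex([(MAX_KEY, "ts3")]),
}
client = Client(root, metadata)
servers = [client.locate(k) for k in ("apple", "kiwi", "zebra", "apple")]
```

The second lookup of `"apple"` is served from the cache, so `client.lookups` counts three trips through the hierarchy rather than four.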
MASTER FUNCTIONALITIES
Tablet assignment
- Master tracks tablet → tablet server mapping
- METADATA has the complete list of tablets
- Each tablet server has a list of the tablets it is serving
- Uses heartbeat + Chubby to detect tablet server failures
- On master failure, scan METADATA and list tablet servers
WORKER FUNCTIONALITY
Tablets stored in GFS
Writes
- Commit log
- Insert into memtable
Read
- Merge SSTables and memtable
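The write and read paths above can be sketched with plain Python containers. This is a toy model under assumptions: the class name `Tablet` is hypothetical, the commit log is an in-memory list standing in for a file in GFS, and each SSTable is modelled as an immutable dict.

```python
class Tablet:
    def __init__(self):
        self.commit_log = []  # stands in for the commit log stored in GFS
        self.memtable = {}    # recent writes, held in memory
        self.sstables = []    # older data, oldest table first

    def write(self, key, value):
        # Log first so the write can be replayed after a crash,
        # then make it visible through the memtable.
        self.commit_log.append((key, value))
        self.memtable[key] = value

    def read(self, key):
        # Merge view: the memtable holds the newest data,
        # then SSTables are consulted from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


tablet = Tablet()
tablet.sstables.append({"a": "old", "b": "b1"})
tablet.write("a", "new")
```

A read of `"a"` returns the memtable value, while a read of `"b"` falls through to the SSTable.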
WORKER FUNCTIONALITY
Challenge: Memtable keeps growing over time
Minor Compaction
- Freeze memtable, write it as SSTable to disk
- But now need to merge more SSTables
Major Compaction
- Read memtable + all SSTables for this tablet
- Write out new SSTable. Handles garbage collection
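The two compaction types can be sketched as pure functions over a memtable and a list of SSTables (oldest first). The function names and the `TOMBSTONE` deletion marker are assumptions introduced for illustration; the slide only says that major compaction handles garbage collection.

```python
TOMBSTONE = object()  # assumed marker for a deleted key


def minor_compaction(memtable, sstables):
    """Freeze the memtable into a new sorted SSTable; return a fresh memtable."""
    frozen = dict(sorted(memtable.items()))
    return {}, sstables + [frozen]


def major_compaction(memtable, sstables):
    """Merge the memtable and all SSTables into one SSTable, dropping deletions."""
    merged = {}
    for table in sstables + [memtable]:  # oldest first, so newer values win
        merged.update(table)
    live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
    return {}, [live]


memtable, sstables = minor_compaction({"b": 2}, [{"a": 1, "b": 0}])
after_minor = len(sstables)  # one more SSTable to merge on reads
memtable = {"a": TOMBSTONE, "c": 3}
memtable, sstables = major_compaction(memtable, sstables)
```

After the major compaction a single SSTable remains, the overwritten value of `"b"` is gone, and the deleted key `"a"` has been garbage collected.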
NOTABLE OPTIMIZATIONS
Caching
- Scan Cache: key-value pairs returned by the SSTable
- Block Cache: SSTable blocks that were read from GFS
Bloom filter
- Probabilistic data structure: answers "definitely not" or "maybe in it"
- Use this to reduce the number of SSTables that need to be read
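A minimal Bloom filter sketch showing the "definitely not or maybe" behaviour and how it can prune SSTable reads. The bit-array size, the number of hash functions, the SHA-256-based hashing, and the helper `sstables_to_read` are arbitrary choices for illustration, not details from the slides.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        # False means "definitely not present"; True means "maybe present".
        return all(self.bits[position] for position in self._positions(key))


def sstables_to_read(key, sstables):
    """Keep only the SSTables whose filter says the key might be present."""
    return [name for name, bloom in sstables if bloom.might_contain(key)]


with_row = BloomFilter()
with_row.add("row1")
empty = BloomFilter()
candidates = sstables_to_read("row1", [("sstable-1", with_row), ("sstable-2", empty)])
```

False positives are possible, so a "maybe" still requires reading the SSTable, but a "definitely not" lets the read skip it entirely.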
OTHER OPTIMIZATIONS
- Single commit log per tablet server
- Sort commit log entries during recovery
- Tablet Splitting
- Tablet server records changes in METADATA table
- Child tablets share SSTables with parent
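With one commit log shared by all tablets on a server, recovery has to separate the interleaved entries. The sketch below assumes each entry is a `(tablet, sequence_number, mutation)` tuple and sorts by tablet and then by sequence number; the slide does not spell out the sort key, so both the tuple layout and the function name are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_log_by_tablet(log_entries):
    """Sort a shared commit log so each tablet's mutations are contiguous."""
    ordered = sorted(log_entries, key=itemgetter(0, 1))
    return {
        tablet: [mutation for _, _, mutation in entries]
        for tablet, entries in groupby(ordered, key=itemgetter(0))
    }


shared_log = [
    ("tablet2", 1, "set a"),
    ("tablet1", 2, "set b"),
    ("tablet1", 1, "set c"),
    ("tablet2", 3, "set d"),
]
per_tablet = group_log_by_tablet(shared_log)
```

After sorting, a server that takes over `tablet1` can replay only that tablet's mutations, in order, instead of scanning the whole log.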
LADIS (2009)
BIGTABLE: DISCUSSION
- Generality vs. Specificity
- Simplicity, Layering
- Scalability
- User overheads
QUESTIONS / DISCUSSION?
