CS 245: Principles of Data-Intensive Systems Instructor: Matei Zaharia cs245.stanford.edu
Outline Why study data-intensive systems? Course logistics Key issues and themes A bit of history CS 245 2
My Background PhD in 2013 Open source distributed data processing framework Cofounder of analytics company Research in systems for ML CS 245 3
Why Study Data-Intensive Systems? Most important computer applications must manage, update and query datasets » Bank, store, fleet controller, search app, … Data quality, quantity & timeliness becoming even more important with AI » Machine learning = algorithms that generalize from data CS 245 4
What Are Data-Intensive Systems? Relational databases: most popular type of data-intensive system (MySQL, Oracle, etc) Many systems facing similar concerns: message queues, key-value stores, streaming systems, ML frameworks, your custom app? Goal: learn the main issues and principles that span all data-intensive systems CS 245 5
Typical System Challenges Reliability in the face of hardware crashes, bugs, bad user input, etc Concurrency: access by multiple users Performance: throughput, latency, etc Access interface from many, changing apps Security and data privacy CS 245 6
Practical Benefits of Studying These Systems Learn how to select & tune data systems Learn how to build them Learn how to build apps that have to tackle some of these same challenges » E.g. cross-geographic-region billing app, custom search engine, etc CS 245 7
Scientific Interest Interesting algorithmic and design ideas In many ways, data systems are the highest-level successful programming abstractions CS 245 8
Programming: The Dream High-level spec → Working application CS 245 9
Programming: The Reality CS 245 11
Programming with Databases High-level spec Relational algebra Actually manages: • Durability • Concurrency • Query optimization • Security • … CS 245 12
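To make the relational-algebra idea concrete, here is a minimal sketch of two operators, selection and projection, over relations modeled as lists of dicts. The helper names `select` and `project` and the `accounts` data are hypothetical illustrations, not part of the course material or any database API.

```python
# Hypothetical helpers: a relation is modeled as a list of dicts (one dict per row).

def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, columns):
    """Projection: keep only the named columns and drop duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result


# Illustrative data only.
accounts = [
    {"id": 1, "owner": "ann", "balance": 50},
    {"id": 2, "owner": "bob", "balance": 120},
    {"id": 3, "owner": "ann", "balance": 300},
]

# "Owners of accounts with balance over 100", stated as a composition of operators.
rich_owners = project(select(accounts, lambda r: r["balance"] > 100), ["owner"])
```

The expression says what result is wanted; a real database would additionally choose how to evaluate it and handle durability, concurrency and security.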
Outline Why study data-intensive systems? Course logistics Key issues and themes A bit of history CS 245 13
Teaching Assistants Ben Braun, Edward Gan, Leo Mehr, Deepak Narayanan, Pratiksha Thaker, James Thomas CS 245 14
Course Format Lectures in class Assigned paper readings (Q&A in class) 3 programming assignments Midterm and final This is the 1st run of my version of the course, so we’re still figuring some things out CS 245 16
Paper Readings A few classic or recent research papers Read the paper before the class: we want to discuss it together! We’ll post discussion questions on the class website a week before lecture CS 245 17
How Should You Read a Paper? Read: “How to Read a Paper” TLDR: don’t just go through end to end; focus on key ideas/sections CS 245 18
Our First Paper We’ll be reading part of “A History and Evaluation of System R” for next class! Find instructions and questions on website CS 245 19
Programming Assignments Three assignments implemented in Java or Scala, and submitted online 1. Storage and access methods 2. Query optimization 3. Transactions and recovery Done individually; A1 posted next week CS 245 20
Midterm and Final Written tests based on material covered in lectures, assignments and readings Final will cover the entire course but focus on the second half CS 245 21
Grading 45% Assignments (15% each) 25% Midterm 30% Final CS 245 22
Keeping in Touch Sign up for Piazza on the course website to receive announcements! cs245.stanford.edu CS 245 23
Outline Why study data-intensive systems? Course logistics Key issues and themes A bit of history CS 245 24
Recall: Examples of Data-Intensive Systems Relational databases: most popular type of data-intensive system (MySQL, Oracle, etc) Many systems facing similar concerns: message queues, key-value stores, streaming systems, ML frameworks, your custom app? CS 245 25
Basic Components [Diagram] Clients / users; Queries; Data mgmt. system; Logical dataset (e.g. table, graph); Physical storage (data structures); Administrator CS 245 26
Examples
System | Logical Data Model | Physical Storage | API | Other Features
Relational databases | Relations (i.e. tables) | B-trees, column stores, indexes, … | SQL, ODBC | Durability, transactions, query planning, migrations, …
TensorFlow | Tensors | NCHW, NHWC, sparse arrays, … | Python DAG construction | Query planning, distribution, specialized HW
Apache Kafka | Streams of opaque records | Partitions, compaction | Publish, subscribe | Durability, rescaling
Apache Spark RDDs | Collections of Java objects | Read external systems, cache | Functional API, SQL | Distribution, query planning, transactions*
CS 245 36
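The split between logical data model and physical storage can be illustrated with the NCHW and NHWC layouts listed for TensorFlow. The sketch below is plain Python and does not use TensorFlow; the function names and the tiny tensor shape are assumptions chosen for illustration. It stores the same logical 4-D tensor in two flat buffers that differ only in element order.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat offset when the width index varies fastest and the batch index slowest."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat offset when the channel index varies fastest and the batch index slowest."""
    return ((n * H + h) * W + w) * C + c


# Illustrative shape: 1 image, 2 channels, 2x2 pixels.
N, C, H, W = 1, 2, 2, 2


def value(n, c, h, w):
    """Logical tensor contents: the digits encode (channel, row, column)."""
    return 100 * c + 10 * h + w


def materialize(offset_fn):
    """Lay the logical tensor out in a flat buffer using the given offset function."""
    buf = [None] * (N * C * H * W)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    buf[offset_fn(n, c, h, w, C, H, W)] = value(n, c, h, w)
    return buf


nchw = materialize(offset_nchw)
nhwc = materialize(offset_nhwc)

# The same logical element is found at different physical positions.
same_element = (
    nchw[offset_nchw(0, 1, 0, 1, C, H, W)],
    nhwc[offset_nhwc(0, 1, 0, 1, C, H, W)],
)
```

Code written against the logical indices `(n, c, h, w)` does not change when the layout changes, which is what lets a system pick the layout that suits the hardware.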
Some Typical Concerns Access interface from many, changing apps Performance: throughput, latency, etc Reliability in the face of hardware crashes, bugs, bad user input, etc Concurrency: access by multiple users Security and data privacy CS 245 37
Example Message queue system Producers Consumers What should happen if two consumers read() at the same time? CS 245 38
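One possible answer, sketched under the assumption that each message should go to exactly one consumer: guard the queue with a lock so that concurrent `read()` calls never return the same message. The `MessageQueue` class is a hypothetical in-memory illustration; real systems may instead let every consumer see every message.

```python
import threading
from collections import deque


class MessageQueue:
    """Hypothetical in-memory queue where each message is delivered once."""

    def __init__(self):
        self._messages = deque()
        self._lock = threading.Lock()

    def put(self, message):
        with self._lock:
            self._messages.append(message)

    def read(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        with self._lock:
            if self._messages:
                return self._messages.popleft()
            return None


queue = MessageQueue()
for i in range(1000):
    queue.put(i)

received = [[], []]  # one list per consumer thread


def consume(slot):
    """Drain the queue, recording what this consumer received."""
    while True:
        message = queue.read()
        if message is None:
            break
        received[slot].append(message)


threads = [threading.Thread(target=consume, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

How the 1000 messages are split between the two consumers depends on scheduling, but no message is lost or delivered twice.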
Example Message queue system Producers Consumers What should happen if a consumer reads a message but then immediately crashes? CS 245 39
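A common way to handle this case is to keep a message "in flight" until the consumer acknowledges it, and to hand it out again if the consumer is presumed dead. The sketch below is single-threaded and entirely hypothetical (`AckQueue`, `ack` and `recover_unacked` are illustrative names); how a real system detects the crash is left out.

```python
from collections import deque


class AckQueue:
    """Hypothetical queue that redelivers messages that were read but never acked."""

    def __init__(self):
        self._ready = deque()
        self._in_flight = {}  # delivery id -> message awaiting acknowledgement
        self._next_id = 0

    def put(self, message):
        self._ready.append(message)

    def read(self):
        """Return (delivery_id, message) for the oldest message, or None if empty."""
        if not self._ready:
            return None
        message = self._ready.popleft()
        delivery_id = self._next_id
        self._next_id += 1
        self._in_flight[delivery_id] = message
        return delivery_id, message

    def ack(self, delivery_id):
        """The consumer finished processing; forget the message."""
        self._in_flight.pop(delivery_id)

    def recover_unacked(self):
        """Requeue all unacked messages at the front, preserving their order."""
        for delivery_id in sorted(self._in_flight, reverse=True):
            self._ready.appendleft(self._in_flight.pop(delivery_id))


q = AckQueue()
q.put("a")
q.put("b")

first_id, first = q.read()    # consumer reads "a", then crashes before acking
q.recover_unacked()           # the queue decides that consumer is gone
second_id, second = q.read()  # "a" is delivered again
q.ack(second_id)
third_id, third = q.read()    # next message in order
```

This gives at-least-once delivery: if the first consumer had processed "a" before crashing, the message would be processed twice, so consumers need to tolerate duplicates.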
Example Message queue system Producers Consumers Can a producer put in 2 messages atomically? CS 245 40
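One way to offer this, sketched with a hypothetical `BatchQueue`: accept a whole batch under a single lock and reject it entirely if it cannot be stored, so consumers never observe only one of the two messages. The capacity limit is an assumption added here to give the batch a way to fail.

```python
import threading
from collections import deque


class BatchQueue:
    """Hypothetical bounded queue with an all-or-nothing batch put."""

    def __init__(self, capacity):
        self._messages = deque()
        self._capacity = capacity
        self._lock = threading.Lock()

    def put_all(self, batch):
        """Enqueue every message in the batch, or none; return True on success."""
        with self._lock:
            if len(self._messages) + len(batch) > self._capacity:
                return False
            self._messages.extend(batch)
            return True

    def snapshot(self):
        """Return a copy of the current queue contents."""
        with self._lock:
            return list(self._messages)


q = BatchQueue(capacity=3)
ok_first = q.put_all(["debit:100", "credit:100"])  # fits, both are enqueued
ok_second = q.put_all(["debit:5", "credit:5"])     # would exceed capacity, neither is
contents = q.snapshot()
```

The lock only gives atomicity with respect to other threads in one process; staying atomic across a crash would additionally need some form of durable logging.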
Two Big Ideas Declarative interfaces » Apps specify what they want, not how to do it » Example: “store a table with 2 integer columns”, but not how to encode it on disk » Example: “count records where column1 = 5” Transactions » Encapsulate multiple app actions into one atomic request (fails or succeeds as a whole) » Concurrency models for multiple users » Clear interactions with failure recovery CS 245 41
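Both ideas can be tried with Python's built-in `sqlite3` module. This is a minimal sketch: the table name `t` and the sample rows are assumptions, while `column1` and the two-integer-column table come from the slide's examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declarative: say what to store, not how to encode it on disk.
conn.execute("CREATE TABLE t (column1 INTEGER, column2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(5, 1), (5, 2), (7, 3)])
conn.commit()

# Declarative: say what to count, not how to scan or index.
count = conn.execute("SELECT COUNT(*) FROM t WHERE column1 = 5").fetchone()[0]

# Transaction: the update below is rolled back because the block fails.
try:
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("UPDATE t SET column2 = column2 + 10 WHERE column1 = 5")
        raise RuntimeError("simulated application failure")
except RuntimeError:
    pass

values = [row[0] for row in conn.execute("SELECT column2 FROM t ORDER BY column2")]
```

After the failed block, `values` still holds the original `column2` contents, so the application never sees a half-applied change.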