
CS 245: Principles of Data-Intensive Systems, Instructor: Matei Zaharia


  1. CS 245: Principles of Data-Intensive Systems
     Instructor: Matei Zaharia
     cs245.stanford.edu

  2. Outline
     Why study data-intensive systems?
     Course logistics
     Key issues and themes
     A bit of history
     CS 245 2

  3. My Background
     PhD in 2013
     Open source distributed data processing framework
     Cofounder of analytics company
     Research in systems for ML

  4. Why Study Data-Intensive Systems?
     Most important computer applications must manage, update and query datasets
       » Bank, store, fleet controller, search app, …
     Data quality, quantity & timeliness becoming even more important with AI
       » Machine learning = algorithms that generalize from data

  5. What Are Data-Intensive Systems?
     Relational databases: most popular type of data-intensive system (MySQL, Oracle, etc.)
     Many systems facing similar concerns: message queues, key-value stores, streaming systems, ML frameworks, your custom app?
     Goal: learn the main issues and principles that span all data-intensive systems

  6. Typical System Challenges
     Reliability in the face of hardware crashes, bugs, bad user input, etc.
     Concurrency: access by multiple users
     Performance: throughput, latency, etc.
     Access interface from many, changing apps
     Security and data privacy

  7. Practical Benefits of Studying These Systems
     Learn how to select & tune data systems
     Learn how to build them
     Learn how to build apps that have to tackle some of these same challenges
       » E.g. cross-geographic-region billing app, custom search engine, etc.

  8. Scientific Interest
     Interesting algorithmic and design ideas
     In many ways, data systems are the highest-level successful programming abstractions

  9. Programming: The Dream
     High-level spec (formula not recoverable from the slide text)
     Working application

  11. Programming: The Reality

  12. Programming with Databases
      High-level spec
      Relational algebra
      Actually manages:
        • Durability
        • Concurrency
        • Query optimization
        • Security
        • …
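To make the idea of relational algebra as a high-level spec concrete, here is a minimal sketch of two of its operators over in-memory rows. The `select` and `project` helpers and the `accounts` data are hypothetical, written only for illustration; a real database would also handle the durability, concurrency, optimization and security concerns listed on the slide.

```python
# Hypothetical mini relational algebra over lists of dicts (illustration only).

def select(rows, predicate):
    """Selection: keep the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]

def project(rows, columns):
    """Projection: keep only the named columns, dropping duplicate rows."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out

# Made-up example relation.
accounts = [
    {"owner": "ann", "branch": "north", "balance": 120},
    {"owner": "bob", "branch": "south", "balance": 40},
    {"owner": "cy", "branch": "north", "balance": 75},
]

# The spec says what is wanted (owners with balance > 50), not how to scan or index.
rich_owners = project(select(accounts, lambda r: r["balance"] > 50), ["owner"])
branches = project(accounts, ["branch"])
```

The caller composes operators without saying how rows are stored or scanned, which is the property the slide attributes to programming with databases.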

  13. Outline
      Why study data-intensive systems?
      Course logistics
      Key issues and themes
      A bit of history

  14. Teaching Assistants
      Ben Braun
      Edward Gan
      Leo Mehr
      Deepak Narayanan
      Pratiksha Thaker
      James Thomas

  15. Course Format
      Lectures in class
      Assigned paper readings (Q&A in class)
      3 programming assignments
      Midterm and final
      This is the 1st run of my version of the course, so we’re still figuring some things out

  16. Paper Readings
      A few classic or recent research papers
      Read the paper before the class: we want to discuss it together!
      We’ll post discussion questions on the class website a week before lecture

  17. How Should You Read a Paper?
      Read: “How to Read a Paper”
      TLDR: don’t just go through end to end; focus on key ideas/sections

  18. Our First Paper
      We’ll be reading part of “A History and Evaluation of System R” for next class!
      Find instructions and questions on website

  19. Programming Assignments
      Three assignments implemented in Java or Scala, and submitted online
        1. Storage and access methods
        2. Query optimization
        3. Transactions and recovery
      Done individually; A1 posted next week

  20. Midterm and Final
      Written tests based on material covered in lectures, assignments and readings
      Final will cover the entire course but focus on the second half

  21. Grading
      45% Assignments (15% each)
      25% Midterm
      30% Final

  22. Keeping in Touch
      Sign up for Piazza on the course website to receive announcements!
      cs245.stanford.edu

  23. Outline
      Why study data-intensive systems?
      Course logistics
      Key issues and themes
      A bit of history

  24. Recall: Examples of Data-Intensive Systems
      Relational databases: most popular type of data-intensive system (MySQL, Oracle, etc.)
      Many systems facing similar concerns: message queues, key-value stores, streaming systems, ML frameworks, your custom app?

  25. Basic Components
      Clients / users
      Queries
      Data mgmt. system
      Logical dataset (e.g. table, graph)
      Physical storage (data structures)
      Administrator

  26-35. Examples
      System | Logical data model | Physical storage | API | Other features
      Relational databases | Relations (i.e. tables) | B-trees, column stores, indexes, … | SQL, ODBC | Durability, transactions, query planning, migrations, …
      TensorFlow | Tensors | NCHW, NHWC, sparse arrays, … | Python DAG construction | Query planning, distribution, specialized HW
      Apache Kafka | Streams of opaque records | Partitions, compaction | Publish, subscribe | Durability, rescaling
      Apache Spark RDDs | Collections of Java objects | Read external systems, cache | Functional API, SQL | Distribution, query planning, transactions*

  36. Some Typical Concerns
      Access interface from many, changing apps
      Performance: throughput, latency, etc.
      Reliability in the face of hardware crashes, bugs, bad user input, etc.
      Concurrency: access by multiple users
      Security and data privacy

  37. Example: Message queue system
      Producers → queue → Consumers
      What should happen if two consumers read() at the same time?
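One possible answer to the concurrent read() question is that each message goes to exactly one consumer. The sketch below assumes a hypothetical in-memory `MessageQueue` class (not any real system's API) and uses a lock so that the empty check and the removal happen as one step.

```python
import threading
from collections import deque

class MessageQueue:
    """Hypothetical in-memory queue; a lock makes each read() atomic."""

    def __init__(self):
        self._messages = deque()
        self._lock = threading.Lock()

    def put(self, message):
        with self._lock:
            self._messages.append(message)

    def read(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        with self._lock:
            return self._messages.popleft() if self._messages else None

def drain(queue, received):
    # Each consumer keeps reading until the queue reports it is empty.
    while (message := queue.read()) is not None:
        received.append(message)

queue = MessageQueue()
for i in range(1000):
    queue.put(i)

# Two consumers read at the same time, each into its own result list.
results = [[], []]
consumers = [threading.Thread(target=drain, args=(queue, r)) for r in results]
for t in consumers:
    t.start()
for t in consumers:
    t.join()

all_received = sorted(results[0] + results[1])
```

Other policies are equally defensible, for example delivering every message to every consumer; the point is that the system has to pick one and state it.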

  38. Example: Message queue system
      Producers → queue → Consumers
      What should happen if a consumer reads a message but then immediately crashes?
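A common way to handle a consumer that crashes after reading is to require an explicit acknowledgement and redeliver anything left unacknowledged. The sketch below assumes a hypothetical `AckQueue` class and a `redeliver_unacked()` hook that some failure detector would call; both are illustrative, and the result is at-least-once delivery, so a message may be processed twice.

```python
import itertools
from collections import deque

class AckQueue:
    """Hypothetical queue with explicit acknowledgements (at-least-once delivery)."""

    def __init__(self):
        self._ready = deque()    # (id, message) pairs waiting for delivery
        self._in_flight = {}     # id -> message, delivered but not yet acked
        self._ids = itertools.count()

    def put(self, message):
        self._ready.append((next(self._ids), message))

    def read(self):
        """Deliver the oldest message and remember it until it is acked."""
        if not self._ready:
            return None
        msg_id, message = self._ready.popleft()
        self._in_flight[msg_id] = message
        return msg_id, message

    def ack(self, msg_id):
        """The consumer finished processing; the message can be forgotten."""
        self._in_flight.pop(msg_id, None)

    def redeliver_unacked(self):
        """Called when a consumer is presumed dead: requeue its unacked messages."""
        for msg_id, message in sorted(self._in_flight.items(), reverse=True):
            self._ready.appendleft((msg_id, message))
        self._in_flight.clear()

q = AckQueue()
q.put("charge card")
q.put("send receipt")

first_id, first = q.read()     # the consumer crashes here, before calling ack()
q.redeliver_unacked()          # the queue gives the message back
second_id, second = q.read()   # another consumer sees the same message again
q.ack(second_id)
third = q.read()
```

Because the same message can arrive twice, consumers in such a design typically need to make their processing idempotent.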

  39. Example: Message queue system
      Producers → queue → Consumers
      Can a producer put in 2 messages atomically?
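If the answer to the atomic put question is yes, the queue needs an all-or-nothing batch operation. The sketch below assumes a hypothetical `BatchQueue` with a `put_all()` method and a capacity limit as the only failure cause; it covers atomicity with respect to other threads in one process only, not atomicity across crashes, which would need durable logging.

```python
import threading

class BatchQueue:
    """Hypothetical queue whose put_all() adds a batch all-or-nothing."""

    def __init__(self, max_size):
        self._messages = []
        self._max_size = max_size
        self._lock = threading.Lock()

    def put_all(self, batch):
        """Append every message in batch, or none of them if they do not all fit."""
        with self._lock:
            if len(self._messages) + len(batch) > self._max_size:
                return False
            self._messages.extend(batch)
            return True

    def snapshot(self):
        """Return a copy of the current contents."""
        with self._lock:
            return list(self._messages)

q = BatchQueue(max_size=3)
ok_first = q.put_all(["debit A", "credit B"])    # fits: both messages are added
ok_second = q.put_all(["debit C", "credit D"])   # would overflow: neither is added
contents = q.snapshot()
```

Holding the lock for the whole check-and-append means no reader can observe "debit A" without "credit B".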

  40. Two Big Ideas
      Declarative interfaces
        » Apps specify what they want, not how to do it
        » Example: “store a table with 2 integer columns”, but not how to encode it on disk
        » Example: “count records where column1 = 5”
      Transactions
        » Encapsulate multiple app actions into one atomic request (fails or succeeds as a whole)
        » Concurrency models for multiple users
        » Clear interactions with failure recovery
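Both ideas can be tried with Python's built-in `sqlite3` module. This is a minimal sketch, assuming an in-memory database and a made-up table name `t`: the SQL states what to store and count without saying how, and the `with conn:` block groups an update into a transaction that is rolled back when a simulated failure occurs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declarative: "store a table with 2 integer columns", with no on-disk encoding specified.
conn.execute("CREATE TABLE t (column1 INTEGER, column2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(5, 1), (5, 2), (7, 3)])
conn.commit()

# Declarative: "count records where column1 = 5".
count = conn.execute("SELECT COUNT(*) FROM t WHERE column1 = 5").fetchone()[0]

# Transaction: the update either takes effect as a whole or not at all.
try:
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("UPDATE t SET column2 = column2 + 10 WHERE column1 = 5")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# The failed transaction left no trace in the table.
values = [row[0] for row in conn.execute("SELECT column2 FROM t ORDER BY column2")]
```

The application never says whether to scan the table or use an index for the count, and it never writes its own undo logic for the failed update.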
