  1. Systems for Data Science Marco Serafini COMPSCI 532 Lecture 1

  2. Course Structure • Fundamentals you need to know about systems • Caching, virtual memory, concurrency, etc. • Review of several “Big-data” systems • Learn how they work • Principles of systems design: why systems are designed that way • Hands-on experience • No electronic devices during classes (not even in airplane mode)

  3. Course Assignments • Reading research papers • 2-3 projects • Coding assignments • Midterm + final exam • http://marcoserafini.github.io/teaching/systems-for-data-science/fall19/

  4. Course Grades • Midterm exam: 20% • Final exam: 30% • Projects: 50%

  5. Questions • Teaching Assistant • Nathan Ng <kwanhong@umass.edu> • Office hours: Tuesday 4:30-5:30 PM @ CS 207 • Piazza website • https://piazza.com/umass/fall2019/compsci532/home • Ask questions there rather than emailing me or Nathan • Credit if you are active • Well-thought-out questions and answers: be curious (but don’t just show off) • I will never penalize you for saying or asking something wrong

  6. Projects • Groups of two people • See course website for details • High-level discussions with other colleagues: ok • “What are the requirements of the project?” • Low-level discussions with other colleagues: ok • “How do threads work in Java?” • Mid-level discussions: not ok • “How do I design a solution for the project?” • Project delivery includes a short oral exam

  7. What are “systems for data science”?

  8. Systems + Data Science • Data science research • New algorithms • New applications of existing algorithms • Validation: take small representative dataset, show accuracy • Systems research • Run these algorithms efficiently • Scale them to larger datasets • End-to-end pipelines • Applications of ML to system design and software engineering (seminar next Spring!) • Validation: build a prototype that others can use • These are ends of a spectrum

  9. Overview • What types of systems will we target? • Storage systems • Data processing systems • Cloud analytics • Systems for machine learning • Goal: Hide complexity of underlying hardware • Parallelism: multi-core, distributed systems • Fault-tolerance: hardware fails • Focus on scalable systems • Scale to large datasets • Scale to computationally complex problems

  10. Transactional vs. Analytical Systems • Transactional data management system • Real-time response • Concurrency • Updates • Analytical data management system • Non-real-time responses • No concurrency • Read-only • These are ends of a spectrum

  11. Example: Search Engine • Crawlers: download the Web • Hadoop file system (HDFS): store the Web • MapReduce: run massively parallel indexing • Key-value store: store the index • Front-end • Serve client requests • Ranking → this is actually the data science • Q: Scalability issues? • Q: Which component is transactional / analytical? • Q: Where are storage/data processing/cloud/ML involved?
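
To make the indexing step concrete, here is a minimal, hedged sketch of a MapReduce-style inverted index in Python. It is illustrative only: the map_page/reduce_term functions and the in-memory “shuffle” dictionary stand in for what Hadoop/MapReduce would actually run across many machines over data stored in HDFS.

  # Minimal MapReduce-style inverted index (illustrative sketch, not the real pipeline).
  from collections import defaultdict

  def map_page(url, text):
      # Map phase: emit a (term, url) pair for every word on the page.
      for term in text.lower().split():
          yield term, url

  def reduce_term(term, urls):
      # Reduce phase: collapse all postings for a term into a sorted, de-duplicated list.
      return term, sorted(set(urls))

  def mapreduce_index(pages):
      shuffle = defaultdict(list)          # simulates the shuffle between map and reduce
      for url, text in pages.items():
          for term, u in map_page(url, text):
              shuffle[term].append(u)
      return dict(reduce_term(t, us) for t, us in shuffle.items())

  pages = {"a.html": "systems for data science", "b.html": "data systems"}
  print(mapreduce_index(pages))            # e.g. {'systems': ['a.html', 'b.html'], ...}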

  12. Design goals

  13. Ease of Use • Good APIs / abstractions are key in a system • High-level API • Easier to use, better productivity, safer code • It makes some implementation choices for you • These choices are based on assumptions about the use cases • Are these choices really what you need? • Low-level API • Harder to use, lower productivity, less safe code • More flexible
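
A small, hypothetical illustration of this trade-off (the example is mine, not from the slides): the high-level NumPy call hides looping, memory layout, and vectorization choices, while the explicit loop exposes them.

  import numpy as np

  data = np.random.rand(1_000_000)

  # High-level API: one call. NumPy decides how to traverse memory and whether to
  # use vectorized kernels. Easy, safe, and fast, but the choices are made for you.
  total_high = float(np.sum(data * data))

  # Low-level style: explicit loop. You control every step (you could fuse
  # operations, stop early, etc.), but it is easier to get wrong and, in pure
  # Python, much slower.
  total_low = 0.0
  for x in data:
      total_low += x * x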

  14. Scalability • Ideal world • Linear scalability • Reality • Bottlenecks • For example: a central coordinator • When do we stop scaling? • [Figure: speedup vs. parallelism; the ideal curve grows linearly, the real curve flattens out]
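
One standard way to reason about such bottlenecks (Amdahl’s law, not named on the slide) is sketched below: if a fraction s of the work is serial, e.g. a central coordinator, speedup flattens out no matter how much parallelism you add.

  def speedup(parallelism, serial_fraction):
      # Amdahl's law: total time = serial part + parallel part / number of workers.
      return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallelism)

  for p in (1, 2, 4, 8, 16, 64, 256):
      print(p, round(speedup(p, serial_fraction=0.05), 1))
  # With only 5% serial work, the speedup never exceeds 20x, however large p gets.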

  15. Latency vs. Throughput • Pipe metaphor • System is a pipe • Requests are small marbles • Low load • Minimal latency • Increased load • Higher throughput • Latency stable • High load • Saturation: no more throughput • Latency skyrockets • [Figure: latency vs. throughput; latency stays flat as load grows (1x, 10x, 50x, 100x requests), then shoots up as throughput approaches its maximum]
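
The shape of this curve can be approximated with a simple queueing formula. This is an assumption of mine (an M/M/1-style model), not something stated on the slide:

  service_time = 1.0                        # time to process one request (arbitrary units)
  max_throughput = 1.0 / service_time

  for load in (0.1, 0.5, 0.8, 0.9, 0.99):   # offered load as a fraction of max throughput
      latency = service_time / (1.0 - load) # average time in system for an M/M/1 queue
      print(f"throughput={load * max_throughput:.2f}  latency={latency:.1f}")
  # Latency stays near its minimum at low load, then explodes as the load approaches saturation.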

  16. Fault Tolerance • Assume that your system crashes every month • If you run Python scripts on your laptop, that’s fine • But imagine you run a cluster • 10 nodes = a crash every 3 days • 100 nodes = a crash every 7 hours • 1000 nodes = a crash every 50 minutes • Some computations run for more than one hour • Cannot simply restart when something goes wrong • Even when restarting, we need to keep metadata safe
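
The numbers above follow from simple arithmetic: if one node crashes about once a month and failures are independent, a cluster of n nodes sees a crash roughly every month/n. A quick check (assuming a 30-day month):

  HOURS_PER_MONTH = 30 * 24                 # assumption: one node crashes about every 30 days

  for nodes in (1, 10, 100, 1000):
      hours_between_crashes = HOURS_PER_MONTH / nodes
      print(f"{nodes:5d} nodes -> a crash roughly every {hours_between_crashes:.1f} hours")
  # Roughly reproduces the slide's figures: ~3 days at 10 nodes, ~7 hours at 100, ~45-50 minutes at 1000.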

  17. Why do we need parallelism?

  18. Maximum Clock Rate is Stagnating • Two major “laws” are collapsing • Moore’s law • Dennard scaling • [Figure: maximum clock rate over time; source: https://queue.acm.org/detail.cfm?id=2181798]

  19. Moore’s Law • “Density of transistors in an integrated circuit doubles every two years” • Smaller transistors → changes propagate faster • So far so good, but the trend is slowing down and it won’t last for long (Intel’s prediction: the exponential trend holds until 2021, unless new technologies arise) [1] • [1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/

  20. Dennard Scaling • “Reducing transistor size does not increase power density → power consumption is proportional to chip area” • Stopped holding around 2006 • The assumptions break when the physical system gets close to its limits • Post-Dennard-scaling world of today • Huge cooling and power consumption issues • If we had kept the same clock frequency trends, today a CPU would have the power density of a nuclear reactor
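
In symbols (a standard formulation, not spelled out on the slide): dynamic power is roughly P ≈ C·V²·f. Under ideal Dennard scaling, shrinking feature size by a factor κ scales capacitance and voltage down by κ and frequency up by κ, so per-transistor power falls exactly as fast as per-transistor area and power density stays constant; once voltage could no longer keep shrinking, this stopped holding:

  P \approx C\,V^2 f, \qquad
  C \to \frac{C}{\kappa},\; V \to \frac{V}{\kappa},\; f \to \kappa f
  \;\Rightarrow\; P \to \frac{P}{\kappa^2}, \qquad
  \text{power density} = \frac{P/\kappa^2}{A/\kappa^2} = \frac{P}{A} \quad \text{(unchanged)}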

  21. Heat Dissipation Problem • Large datacenters consume energy like large cities • Cooling is the main cost factor • [Photos: Google’s datacenter in the Columbia River valley (2006); Facebook’s datacenter in Luleå (2015)]

  22. Where is Luleå?

  23. Single-Core Solutions • Dynamic Voltage and Frequency Scaling (DVFS) • E.g., Intel’s Turbo Boost • Only works under low load • Use part of the chip for coprocessors (e.g., graphics) • Lower power consumption • Limited number of generic functionalities to offload

  24. Multi-Core Processors • [Diagram: several processors (chips), each containing multiple cores; each chip plugs into a socket on the motherboard, and all sockets share the main memory]

  25. Multi-Core Processors • Idea: scale computational power linearly • Instead of a single 5 GHz core, 2 × 2.5 GHz cores • Scale heat dissipation linearly • k cores have ~k times the heat dissipation of a single core • Increasing the frequency of a single core by k times creates a superlinear increase in heat dissipation
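
A back-of-the-envelope sketch of the last claim, under a rough assumption of mine (not stated on the slide) that dynamic power scales as V²·f and that higher frequency also needs roughly proportionally higher voltage, so per-core power grows like f³:

  def relative_power(cores, freq):
      # Rough model: per-core power ~ V^2 * f with V proportional to f, i.e. ~ f^3.
      return cores * freq ** 3

  baseline = relative_power(1, 1.0)          # one core at the base frequency
  print(relative_power(2, 1.0) / baseline)   # two cores, same frequency: ~2x the power
  print(relative_power(1, 2.0) / baseline)   # one core at double frequency: ~8x the power
  # Same nominal compute (2x), very different heat dissipation.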

  26. How to Leverage Multicores • Run multiple tasks in parallel • Multiprocessing • Multithreading • E.g., PCs run many background apps in parallel • OS, music, antivirus, web browser, … • Parallelizing a single app is not trivial • Embarrassingly parallel tasks • Can be run by multiple threads • No coordination
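
A minimal sketch of an embarrassingly parallel task in Python (the function and inputs are made up for illustration): every input is processed independently, so a process pool can use all cores with no coordination between workers.

  from multiprocessing import Pool

  def process(x):
      # Purely independent work: no shared state, so no coordination is needed.
      return x * x

  if __name__ == "__main__":
      with Pool() as pool:                   # by default, one worker process per core
          results = pool.map(process, range(16))
      print(results)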

  27. Memory Bandwidth Bottleneck • Cores compete for the same main memory bus • Solution: caches help in two ways • They reduce latency (as we have discussed) • They also increase throughput by avoiding bus contention

  28. SIMD Processors • Single Instruction Multiple Data (SIMD) processors • Examples • Graphics Processing Units (GPUs) • Intel Xeon Phi coprocessors • Q: Can the following snippets run as SIMD? • Snippet 1: for i in [0, n-1] do v[i] = v[i] * pi • Snippet 2: for i in [0, n-1] do if v[i] < 0.01 then v[i] = 0
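
The first snippet maps directly onto SIMD; the second has a data-dependent branch, which SIMD hardware typically handles with masking/predication rather than jumps. In Python, the closest everyday analogue (an illustration, not literal SIMD assembly) is NumPy’s vectorized operations:

  import numpy as np

  v = np.random.rand(1_000_000)

  # Snippet 1: the same multiplication applied to every element (classic SIMD pattern).
  v = v * np.pi

  # Snippet 2: the per-element "if" becomes a mask (predication), which is how
  # SIMD hardware typically handles data-dependent branches.
  v[v < 0.01] = 0.0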

  29. Other Approaches • SIMD • Single Instruction Multiple Data • A massive number of simpler cores • FPGAs • Dedicated hardware designed for a specific task

  30. Automatic Parallelization? • Holy grail in the multi-processor era • Approaches • Programming languages • Systems with APIs that help express parallelism • Efficient coordination mechanisms

  31. Homework • Read: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page, Computer Science Department, Stanford University, Stanford, CA 94305, USA (sergey@cs.stanford.edu, page@cs.stanford.edu) • Abstract (excerpt): “In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of […]”
