Systems for Data Science Marco Serafini COMPSCI 532 Lecture 1
Course Structure • Fundamentals you need to know about systems • Caching, Virtual memory, concurrency, etc… • Review of several “Big-data” systems • Learn how they work • Principles of systems design: Why systems are designed that way • Hands-on experience • No electronic devices during classes (not even in airplane mode) 2
Course Assignments • Reading research papers • 2-3 projects • Coding assignments • Midterm + final exam http://marcoserafini.github.io/teaching/systems-for-data-science/fall19/ 3
Course Grades • Midterm exam: 20% • Final exam: 30% • Projects: 50% 4
Questions • Teaching Assistant • Nathan Ng <kwanhong@umass.edu> • Office hours: Tuesday 4.30-5.30 PM @ CS 207 • Piazza website • https://piazza.com/umass/fall2019/compsci532/home • Ask questions there rather than emailing me or Nathan • Credit if you are active • Well-thought-out questions and answers: be curious (but don’t just show off) • I will never penalize you for saying or asking something wrong 5
Projects • Groups of two people • See course website for details • High-level discussions with other colleagues: ok • “What are the requirements of the project?” • Low-level discussions with other colleagues: ok • “How do threads work in Java?” • Mid-level discussions: not ok • “How to design a solution for the project?” • Project delivery includes short oral exam 6
What are “systems for data science”?
Systems + Data Science • Data science research • New algorithms • New applications of existing algorithms • Validation: take a small, representative dataset and show accuracy • Systems research • Run these algorithms efficiently • Scale them to larger datasets • End-to-end pipelines • Applications of ML to system design and software engineering (seminar next Spring!) • Validation: build a prototype that others can use • These are ends of a spectrum
Overview • What type of systems will we target? • Storage systems • Data processing systems • Cloud analytics • Systems for machine learning • Goal: Hide complexity of underlying hardware • Parallelism: multi-core, distributed systems • Fault-tolerance: hardware fails • Focus on scalable systems • Scale to large datasets • Scale to computationally complex problems
Transactional vs. Analytical Systems • Transactional data management system • Real-time response • Concurrency • Updates • Analytical data management system • Non real-time responses • No concurrency • Read-only • These are ends of a spectrum
Example: Search Engine • Crawlers: download the Web • Hadoop file system (HDFS): store the Web • MapReduce: run massively parallel indexing • Key-value store: store the index • Front-end • Serve client requests • Ranking → this is the actual data science • Q: Scalability issues? • Q: Which components are transactional / analytical? • Q: Where are storage / data processing / cloud / ML involved?
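To make the MapReduce indexing step concrete, here is a minimal single-process sketch of the map/reduce pattern behind building an inverted index. This is not the Hadoop API; the class and the Page record are illustrative only, and a real job would run the map and reduce phases across many machines.

    import java.util.*;
    import java.util.stream.*;

    // Single-process sketch of map/reduce for inverted-index construction.
    // Not the Hadoop API; names (Page, the example urls) are illustrative only.
    public class InvertedIndexSketch {
        record Page(String url, String text) {}

        // map: emit a (term, url) pair for every term on the page
        static Stream<Map.Entry<String, String>> map(Page p) {
            return Arrays.stream(p.text().toLowerCase().split("\\W+"))
                         .filter(t -> !t.isEmpty())
                         .map(t -> Map.entry(t, p.url()));
        }

        // reduce: group the urls of each term into its posting list
        static Map<String, Set<String>> reduce(Stream<Map.Entry<String, String>> pairs) {
            return pairs.collect(Collectors.groupingBy(
                    Map.Entry::getKey,
                    Collectors.mapping(Map.Entry::getValue, Collectors.toSet())));
        }

        public static void main(String[] args) {
            List<Page> pages = List.of(
                    new Page("a.com", "big data systems"),
                    new Page("b.com", "data science systems"));
            System.out.println(reduce(pages.stream().flatMap(InvertedIndexSketch::map)));
            // e.g. {big=[a.com], data=[a.com, b.com], science=[b.com], systems=[a.com, b.com]}
        }
    }

The map phase emits (term, url) pairs page by page; the reduce phase groups them into posting lists, which is what a key-value store would then serve to the front-end.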
Design goals
Ease of Use • Good APIs / abstractions are key in a system • High-level API • Easier to use, better productivity, safer code • It makes some implementation choices for you • These choices are based on assumptions about the use cases • Are these choices really what you need? • Low-level API • Harder to use, lower productivity, less safe code • More flexible 13
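A hedged illustration of the trade-off (the task and the file are made up): counting the lines of a file with a high-level API versus a low-level one, both in Java.

    import java.io.*;
    import java.nio.file.*;
    import java.util.stream.*;

    public class ApiLevels {
        // High-level API: concise and hard to get wrong, but the library
        // chooses buffering, character decoding, and iteration for you.
        static long countLinesHighLevel(Path file) throws IOException {
            try (Stream<String> lines = Files.lines(file)) {
                return lines.count();
            }
        }

        // Low-level API: more code and more ways to get it wrong, but full
        // control over buffer size, decoding, early termination, etc.
        static long countLinesLowLevel(Path file) throws IOException {
            long count = 0;
            try (BufferedReader r = new BufferedReader(
                    new FileReader(file.toFile()), 1 << 16)) {  // explicit 64 KB buffer
                while (r.readLine() != null) {
                    count++;
                }
            }
            return count;
        }
    }

Whether the high-level version is really what you need depends on the choices it hides, such as the character encoding and buffer size it picks for you.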
Scalability • Ideal world • Linear scalability • Reality • Bottlenecks • For example: central coordinator • When do we stop scaling? [Figure: speedup vs. parallelism; the ideal curve is linear, the real curve flattens] 14
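One standard way to answer "when do we stop scaling?" is Amdahl's law (not named on the slide, but the usual model): if a fraction s of the work is serial, for instance a central coordinator, speedup is capped at 1/s no matter how much parallelism you add. A minimal sketch:

    public class Amdahl {
        // Amdahl's law: speedup with p parallel workers when a fraction
        // 'serial' of the work cannot be parallelized (e.g. a central coordinator).
        static double speedup(double serial, int p) {
            return 1.0 / (serial + (1.0 - serial) / p);
        }

        public static void main(String[] args) {
            for (int p : new int[] {1, 2, 4, 16, 64, 1024}) {
                // With just 5% serial work the curve flattens well below 20x.
                System.out.printf("p=%4d  speedup=%.1f%n", p, speedup(0.05, p));
            }
        }
    }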
Latency vs. Throughput • Pipe metaphor • System is a pipe • Requests are small marbles • Low load • Minimal latency • Increased load • Higher throughput • Latency stable • High load • Saturation: no more throughput • Latency skyrockets [Figure: latency vs. throughput; latency stays flat at low load, then skyrockets as throughput approaches its maximum (curves annotated with 1x, 10x, 50x, 100x request load)] 15
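The pipe picture can be made quantitative with Little's law (a standard queueing result, not on the slide): in steady state, requests in flight = throughput × latency. Once the number in flight reaches the pipe's capacity, extra load cannot raise throughput and only shows up as queueing, i.e. latency. A small sketch under these assumptions:

    public class LittlesLaw {
        // Little's law: L = lambda * W
        //   L      = average number of requests inside the system (marbles in the pipe)
        //   lambda = throughput (requests per second)
        //   W      = average latency (seconds)
        static double inFlight(double throughputRps, double latencySec) {
            return throughputRps * latencySec;
        }

        public static void main(String[] args) {
            // Low load: 1000 req/s at 5 ms latency -> only 5 requests in the pipe.
            System.out.println(inFlight(1000, 0.005));
            // At saturation the pipe is full: throughput cannot grow, so any
            // extra requests wait in queues and latency skyrockets.
            System.out.println(inFlight(1000, 0.5));   // 500 requests in flight
        }
    }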
Fault Tolerance • Assume that your system crashes every month • If you run Python scripts on your laptop, that’s fine • But imagine you run a cluster • 10 nodes = a crash every 3 days • 100 nodes = a crash every seven hours • 1000 nodes = a crash every 50 minutes • Some computations run for more than one hour • Cannot simply restart when something goes wrong • Even when restarting, we need to keep metadata safe 16
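The numbers on the slide follow from a simple model (the one-crash-per-month rate is the slide's assumption): with independent failures, a cluster of n nodes fails roughly n times as often as a single node. A quick check:

    public class ClusterFailures {
        public static void main(String[] args) {
            double nodeMtbfHours = 30 * 24;   // assumption: one crash per node per month (~720 h)
            for (int nodes : new int[] {1, 10, 100, 1000}) {
                // Independent failures: the cluster crashes about 'nodes' times as often.
                double clusterMtbfHours = nodeMtbfHours / nodes;
                System.out.printf("%4d nodes -> a crash every %.1f hours%n",
                                  nodes, clusterMtbfHours);
            }
            // 10 nodes -> ~72 h (3 days), 100 nodes -> ~7 h, 1000 nodes -> ~0.7 h (about 45 minutes)
        }
    }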
Why do we need parallelism? 17
Maximum Clock Rate is Stagnating • Two major “laws” are collapsing • Moore’s law • Dennard scaling [Figure: maximum CPU clock rate over time (source: https://queue.acm.org/detail.cfm?id=2181798)]
Moore’s Law • “The density of transistors in an integrated circuit doubles every two years” • Smaller transistors → changes propagate faster • So far so good, but the trend is slowing down and it won’t last much longer (Intel’s prediction: the trend holds until 2021 unless new technologies arise) [1] [Figure: transistor density over time, exponential axis] [1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/
Dennard Scaling • “Reducing transistor size does not increase power density → power consumption stays proportional to chip area” • Stopped holding around 2006 • Assumptions break when the physical system gets close to its limits • Post-Dennard-scaling world of today • Huge cooling and power consumption issues • If we had kept the same clock frequency trends, today a CPU would have the power density of a nuclear reactor
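A hedged sketch of the arithmetic behind this (a standard first-order model, not from the slide). Dynamic power is roughly

    P_{\mathrm{dyn}} \approx \alpha \, C \, V^2 \, f

Under Dennard scaling, shrinking transistors also shrank the capacitance C and the voltage V, so power per unit of chip area stayed constant even as the frequency f rose. Once V could no longer be lowered (leakage currents take over), and since the achievable f roughly tracks V, pushing frequency alone gives approximately P \propto f^3, a superlinear increase in heat, whereas adding a second core at the same frequency only doubles P.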
Heat Dissipation Problem • Large datacenters consume energy like large cities • Cooling is the main cost factor [Photos: Google datacenter @ Columbia River valley (2006), Facebook datacenter @ Luleå (2015)]
Where is Luleå?
Single-Core Solutions • Dynamic Voltage and Frequency Scaling (DVFS) • E.g. Intel’s TurboBoost • Only works under low load • Use part of the chip for coprocessors (e.g. graphics) • Lower power consumption • Limited number of generic functionalities to offload
Multi-Core Processors [Diagram: several processor chips, each containing multiple cores; every chip plugs into a socket on the motherboard, and all sockets share the same main memory]
Multi-Core Processors • Idea: scale computational power linearly • Instead of a single 5 GHz core, 2 * 2.5 GHz cores • Scale heat dissipation linearly • k cores have ~ k times the heat dissipation of a single core • Increasing the frequency of a single core by k times creates a superlinear increase in heat dissipation
How to Leverage Multicores • Run multiple tasks in parallel • Multiprocessing • Multithreading • E.g. PCs run many background apps in parallel • OS, music, antivirus, web browser, … • Parallelizing a single app is not trivial • Embarrassingly parallel tasks • Can be run by multiple threads • No coordination
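A minimal sketch of an embarrassingly parallel task in Java (the array and the chunking are made up for illustration): each thread squares its own disjoint slice of an array, so no coordination is needed beyond the final join.

    public class EmbarrassinglyParallel {
        public static void main(String[] args) throws InterruptedException {
            double[] v = new double[1_000_000];
            java.util.Arrays.fill(v, 2.0);

            int nThreads = Runtime.getRuntime().availableProcessors();
            Thread[] workers = new Thread[nThreads];
            int chunk = v.length / nThreads;

            for (int t = 0; t < nThreads; t++) {
                final int from = t * chunk;
                final int to = (t == nThreads - 1) ? v.length : from + chunk;
                // Each thread owns a disjoint slice: no locks, no shared writes.
                workers[t] = new Thread(() -> {
                    for (int i = from; i < to; i++) {
                        v[i] = v[i] * v[i];
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) {
                w.join();   // wait for every slice to finish
            }
            System.out.println(v[0] + " " + v[v.length - 1]);   // 4.0 4.0
        }
    }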
Memory Bandwidth Bottleneck • Cores compete for the same main memory bus • Solution: caches, which help in two ways • They reduce latency (as we have discussed) • They also increase throughput by avoiding bus contention
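A hedged illustration of how caches cut memory-bus traffic (the matrix size is arbitrary): summing a large matrix row by row touches each cache line once, while column-by-column traversal keeps missing in cache and goes back to the shared memory bus.

    public class CacheTraversal {
        static final int N = 4_096;
        static final double[][] m = new double[N][N];

        // Row-major traversal: consecutive elements share cache lines, so most
        // accesses are served from cache instead of the shared memory bus.
        static double sumRowMajor() {
            double s = 0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    s += m[i][j];
            return s;
        }

        // Column-major traversal: every access jumps to a different row (in Java,
        // a different inner array), defeating the cache; on large matrices this is
        // typically several times slower even though it does the same arithmetic.
        static double sumColMajor() {
            double s = 0;
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    s += m[i][j];
            return s;
        }
    }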
SIMD Processors • Single Instruction Multiple Data (SIMD) processors • Examples • Graphical Processing Units (GPUs) • Intel Phi coprocessors • Q: Possible SIMD snippets?
    Snippet 1:  for i in [0, n-1] do v[i] = v[i] * pi
    Snippet 2:  for i in [0, n-1] do if v[i] < 0.01 then v[i] = 0
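A hedged note on the two snippets: the first applies the same multiplication to every element and maps directly onto SIMD lanes; the second has a data-dependent branch, which SIMD hardware expresses with per-lane masks (select/blend) rather than real branches. In plain Java, without vector intrinsics, the branch-free shape looks like this, and it is the shape auto-vectorizing compilers look for:

    public class SimdFriendly {
        // Snippet 1: the same operation on every element, trivially SIMD-friendly.
        static void scale(double[] v, double pi) {
            for (int i = 0; i < v.length; i++) {
                v[i] = v[i] * pi;
            }
        }

        // Snippet 2 rewritten as a per-element select instead of an if statement:
        // this is the form that maps onto SIMD masked/blend instructions.
        static void clampSmall(double[] v) {
            for (int i = 0; i < v.length; i++) {
                v[i] = (v[i] < 0.01) ? 0.0 : v[i];
            }
        }
    }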
Other Approaches • SIMD • Single Instruction Multiple Data • A massive number of simpler cores • FPGAs • Dedicated hardware designed for a specific task
Automatic Parallelization? • Holy grail in the multi-processor era • Approaches • Programming languages • Systems with APIs that help express parallelism • Efficient coordination mechanisms
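One example of "systems with APIs that help express parallelism", in Java: parallel streams, where the programmer states what to compute and the runtime decides how to split the work across cores. A minimal sketch:

    import java.util.stream.IntStream;

    public class ParallelApi {
        public static void main(String[] args) {
            // The API expresses *what* to compute; the runtime (a fork/join pool)
            // decides how to partition the range across the available cores.
            double sum = IntStream.rangeClosed(1, 10_000_000)
                                  .parallel()
                                  .mapToDouble(i -> 1.0 / i)
                                  .sum();
            System.out.println(sum);   // the 10,000,000th harmonic number, about 16.7
        }
    }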
Homework • Read: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page, Computer Science Department, Stanford University [Slide shows the first page of the paper, including the abstract]