CS6453 Big Data Systems: Trends and Challenges
Rachit Agarwal
Instructor — Rachit Agarwal • Assistant Professor , Cornell • Previously: Postdoc, UC Berkeley • PhD, UIUC • Research interests : Systems, networking, theory • Conferences of interest : OSDI, NSDI, SIGCOMM, SODA • Non-research interests : • I am an assistant professor ;-)
Instructor — Rachit Agarwal
• Interactive queries: [NSDI'16] [NSDI'15] [EuroSys'17] …
• Resource disaggregation: systems for post-Moore's-law hardware [OSDI'16] …
• Graph distance oracles: improvements over several decade-old results [SODA'13] [ESA'14] [PODC'13] …
• Network debugging: debugging the data plane [OSDI'16] [SIGCOMM'11] [SOSR'15] …
• Network protocols: new routing and scheduling mechanisms [NSDI'16] [SIGMETRICS'11] [INFOCOM'11] …
• Coding theory: gap between linear and non-linear codes [ISIT'11] [ISIT'07] [INFOCOM'10]
Big data systems — what is big?
• Billion-dollar datacenters
• Number of servers:
  • Google, Microsoft: ~1 million each
  • Facebook, Yahoo!, IBM, HP: several 100,000s each
  • Amazon, eBay, Intel, Akamai: >50,000 each
• If each server stores 1TB of data: 10s of petabytes to exabytes of data (e.g., 50,000 servers × 1TB = 50PB; 1 million servers × 1TB = 1EB)
Big data — disrupting businesses
Big data — what is fundamentally new? • Scale? • Applications? Complexity? • Workloads? Performance metrics? • Hardware? Or, are there fundamentally new “technical” problems?
Scale and Complexity
• Scale: TBs of semi-structured data a norm! Also, data growing faster than Moore's law
• Complexity:
  • Batch analytics: collect, scan, index (e.g., Google)
  • Interactive queries: e.g., search all tweets that mention "Cornell"
  • Streaming: e.g., wall/timeline updates
• Performance constraints unchanged: low latency, high throughput (e.g., #queries per second)
Performance + Scale + Complexity ⟹ Fundamentally New Problems
Example 1 — Search
• Search queries over customer logs from a video company
• Setup: single Amazon EC2 server, single core, 60GB RAM
• Log schema: Session(recordID, userID, start_time, end_time, …, tags)
[Figure: search throughput (queries per second, 1–1000, log scale) vs. "raw" input data size (1–128 GB) for Elasticsearch, MongoDB, and Cassandra]
• In-memory query execution is key to query performance
  • Secondary storage is 100x slower
  • 10% of queries executed off-memory: throughput reduces by 10x
  • 1% of queries executed off-memory: throughput reduces by 2x
Example 1 — Search (traditional solutions fail)
[Figure: Search(term) answered either by scanning the raw data or via precomputed indexes (postings lists mapping each term to record IDs)]
• Data scans: low storage, but low throughput
• Indexes: high storage, and still low throughput once they spill out of memory
(A minimal sketch of the two designs follows.)
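The sketch below illustrates the two traditional designs; the records, terms, and function names are hypothetical, and it only demonstrates the storage/throughput tradeoff, not any production engine.

```python
from collections import defaultdict

# Hypothetical toy dataset standing in for the video company's logs.
records = ["cornell big data", "berkeley systems", "cornell networks"]

# Design 1: data scan. No extra storage, but every query touches every
# record, so throughput collapses as data grows.
def search_scan(term):
    return [i for i, rec in enumerate(records) if term in rec.split()]

# Design 2: inverted index. Postings lists (term -> record IDs, as in the
# slide's figure) make lookups fast, but can rival the raw data in size,
# pushing the working set out of memory.
index = defaultdict(list)
for i, rec in enumerate(records):
    for term in set(rec.split()):
        index[term].append(i)

def search_index(term):
    return index.get(term, [])

assert search_scan("cornell") == search_index("cornell") == [0, 2]
```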
Example 2 — Ranking
• Problem: how to rank search results on a social graph?
• [LinkedIn, 2008] [Facebook, 2009]:
  • Want rankings based on "expected interest" …
  • Expected interest: distance on the social graph
• Challenge:
  • #hops is not the right distance measure (small-world)
  • Assign edge weights (e.g., #messages exchanged)
  • Rank search results according to "shortest path distance on a weighted graph"
  • Perhaps one of the oldest problems
Example 2 — Ranking (traditional solutions fail)
• Problem: how to compute shortest paths on a social graph?
• Run a shortest-path algorithm: 10s of seconds on a billion-node graph
• Pre-compute and store shortest distances: 277 exabytes of storage
• Approximate distances?
[Figure: space vs. worst-case stretch for distance oracles, with Θ(n^2) space at stretch 1, Θ(n√n) at stretch 3, and Θ̃(kn^{1+1/k}) in general; the region below this curve is unattainable]
Example 3 — Fault Tolerance
• Problem: how to recover failed data?
• Traditional technique: 3x replication
  • Problem? 3x storage overhead
• Erasure codes (e.g., Reed-Solomon codes):
  • Reduce storage to 1.2x
  • But require 10x larger bandwidth!
  • Simply moved the bottleneck from storage to network
(A toy coding example follows.)
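To make the repair-bandwidth point concrete, here is a toy single-parity (XOR) code. It is a deliberate simplification of the Reed-Solomon codes named on the slide (real deployments code over finite fields and tolerate multiple failures), but the tradeoff it shows is the same in spirit.

```python
from functools import reduce

def xor_blocks(blocks):
    # Bytewise XOR of equal-length blocks.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def encode(data_blocks):
    # k data blocks + 1 parity block: (k+1)/k storage overhead,
    # versus 3x for triple replication.
    return data_blocks + [xor_blocks(data_blocks)]

def recover(stripe, lost):
    # Repairing ONE lost block reads ALL k surviving blocks; replication
    # would read just 1. This is the "bottleneck moved from storage to
    # network" point on the slide.
    return xor_blocks([b for i, b in enumerate(stripe) if i != lost])

stripe = encode([b"AAAA", b"BBBB", b"CCCC"])
assert recover(stripe, lost=1) == b"BBBB"
```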
Big data — From problems to solutions Insight: Exploit the structure in the problem
Example 1 — Search
• Structure: do not need to support arbitrary computations
• A distributed data store becomes a distributed "compressed" data store
• Queries executed directly on compressed data!
  ➡ Complex queries: search, range, random access, RegEx
  ✓ Scale: in-memory data sizes >= memory capacity
  ✓ Interactivity: avoid data scans and decompression
(A minimal sketch of scan-free search follows.)
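The scan-free claim can be made concrete with a sketch. Note the simplifying assumptions: this builds a plain, uncompressed suffix array, whereas the actual systems in this line of work query compressed suffix-array representations; the text and offsets are made up for illustration.

```python
# Hypothetical document; real deployments store a compressed representation.
text = "big data systems at cornell"

# Suffix array: offsets of all suffixes of `text`, in sorted order.
sa = sorted(range(len(text)), key=lambda i: text[i:])

def search(pattern):
    # Binary search over the suffix array: O(|pattern| * log n) character
    # comparisons, with no scan (and, in the compressed variant, no
    # decompression) of the underlying data.
    lo, hi = 0, len(sa)
    while lo < hi:  # find the leftmost suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    matches = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
        matches.append(sa[lo])  # all matches are contiguous in sa
        lo += 1
    return sorted(matches)

assert search("data") == [4]  # "data" starts at offset 4
```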
Impact • Adoption in industry : • Elsevier • Databricks • 19 other companies • Academic impact : • New techniques • Text, Graphs, Images • Very active area of research
Example 2 — Ranking
• Structure: do not need to support arbitrary graphs
• Real-world social graphs are sparse: m = Õ(n) edges
[Figure: the same space vs. worst-case stretch tradeoff as before (Θ(n^2), Θ(n√n), Θ̃(kn^{1+1/k}), unattainable region), which the sparsity assumption sidesteps]
(A minimal landmark-based sketch follows.)
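As one illustration of exploiting graph structure, here is a hedged sketch of a landmark-based approximate distance oracle; this is a standard technique, not necessarily the construction from the cited papers. It is unweighted (BFS) for brevity, whereas the ranking example would swap in Dijkstra for weighted graphs; the graph and landmark choice are toy examples.

```python
from collections import deque
import random

def bfs_distances(graph, src):
    # Unweighted single-source distances (use Dijkstra for weighted graphs).
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def build_oracle(graph, num_landmarks=2, seed=0):
    # Precompute distances from a few landmarks:
    # O(num_landmarks * n) space instead of Θ(n^2) for all pairs.
    landmarks = random.Random(seed).sample(list(graph), num_landmarks)
    return [bfs_distances(graph, L) for L in landmarks]

def estimate(oracle, u, v):
    # Triangle inequality: d(u,v) <= d(u,L) + d(L,v) for every landmark L.
    # Exact whenever some landmark lies on a shortest u-v path.
    return min(d[u] + d[v] for d in oracle)

# Toy path graph 1-2-3-4 (connected, so every landmark sees all nodes).
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
oracle = build_oracle(graph)
assert estimate(oracle, 1, 4) == 3  # every node lies on the unique 1-4 path
```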
Impact • Adoption in industry : • LinkedIn • Apple Maps • Academic impact : • New routing protocols • New “compact” graph data structures • Still a very active area of research
Example 3 — Fault tolerance Structure: ? A lot of work, but still an unresolved problem
Big data — From problems to solutions Approach? Co-design systems and techniques
[Figure: co-design loop between "Big Data Problems", "Scalable Algorithms and Techniques", "Scalable Systems", and "System Resources"]
• Systems that fail to leverage the structure in the problem
• Techniques that ignore advances in systems and hardware
Big data systems — Trends & Challenges • Scale — need new algorithms & techniques • Applications — need new abstractions & systems • Workloads — need insights that enable new solutions • Hardware — need to co-design systems with hardware Welcome to 6453!
6453 — Plan • Learn about state-of-the-art research • 2-4 papers every week • Work on an exciting project • Hopefully, start the next generation of impactful directions
6453 — Reading papers Submit reviews before the lecture starts • Summary of problem being solved • Why is the problem interesting? • What are the main insights and technical contributions? • How does the paper advance the state-of-the-art? • Where may the solution not work well? • What are the next few problems you would solve? • What do you think is the holy grail in this direction?
6453 — Presenting papers Slides for Tuesday and Thursday lectures due by Saturday night and Monday night, respectively • 5-6 papers (depending on enrollment) • Similar to reading papers and writing reviews • but also provide broader overview of the sub-area • Please see course webpage • Please sign up for papers by Tuesday (next lecture)
6453 — Research Project • In groups of *maximum* 2 people • Interdisciplinary teams encouraged • New problem (may be in your sub-area) • Several deadlines: • Weekly project meetings • Survey — 02/14 • Mid-term report — 03/15 • Final report — 05/10 • Final presentation — 05/16 (does that work?)
6453 — Grade • Last thing you should worry about :-) • Paper reviews: 20% • Class participation: 10% • Research project: 70%