CS226 Big-Data Management. Instructor: Ahmed Eldawy


  1. CS226 Big-Data Management. Instructor: Ahmed Eldawy

  2. Welcome (back) to UCR!

  3. Class information: Classes: Monday, Wednesday, Friday 1:00 – 1:50 PM at Humanities and Social Sciences 1501; Instructor: Ahmed Eldawy; TA: Saheli Ghosh; Office hours: TBD; Website: http://www.cs.ucr.edu/~eldawy/19FCS226/; iLearn (Any UCRX students?); Email: eldawy@ucr.edu, Subject: “[CS226] …”

  4. Course work: Active participation in the class (5%); Reading and review tasks (10%); Assignments (20%); Mid-term (15%); Project (50%)

  5. Project: Groups of 4-5 students. Milestones: Group selection; Project proposal (5%); Literature survey (10%); Report outline (5%); Class presentation (5%); Final report (15%); Poster presentation (10%)

  6. Course goals: What are your goals? Understand what big data means; Identify the internal components of big data platforms; Recognize the differences between different big data platforms; Explain how a distributed query runs on big data

  7. Super Hero

  8. Big-data Expert: Understand how the big-data platforms really work; Control those thousands of processors efficiently to carry out your task

  9. Syllabus: Overview of big data; Big-data storage; Big-data processing; Big-data indexing; Big-SQL processing; Programming packages

  10. Introduction

  11. [No text on slide]

  12. [No text on slide]

  13. Jan 2012: World Economic Forum Report

  14. Interest in Big Data in the US: March 2012: Obama administration unveils BIG DATA initiative: $200 Million in R&D investment; June 2013: Washington Post is calling Obama “The Big Data President”

  15. Interest in Big Data in Europe: March 2014: David Cameron and Angela Merkel talking about Big Data at a computer expo in Hannover, Germany

  16. The Market of Big Data

  17. Four (originally Three) V’s of Big Data

  18. Big Data vs. Big Computation: Full scans (e.g., log processing); Range scans; Point lookups; Iterations; Joins (self, binary, or multiway); Proximity queries; Closures and graph traversals

  19. Big Data Applications: Web search; Marketing and advertising; Data cleaning; Knowledge base; Information retrieval; Internet of Things (IoT); Visualization; Behavioral studies

  20. Publicly Available Datasets: Data.gov; Data.gov.uk; Twitter Streaming API; Yahoo! Webscope [http://webscope.sandbox.yahoo.com/]; GDELT [http://www.gdeltproject.org/]; Instagram API

  21. Big Data Landscape 2012: http://mattturck.com/2012/06/29/a-chart-of-the-big-data-ecosystem/

  22. Big Data Landscape 2014: http://mattturck.com/2014/05/11/the-state-of-big-data-in-2014-a-chart/

  23. Big Data Landscape 2016: http://mattturck.com/2016/02/01/big-data-landscape/

  24. Big Data Landscape 2018

  25. Components of Big Data

  26. Storage of Big Data: Data is growing faster than Moore’s Law; Too much data to fit on a single machine; Partitioning; Replication; Fault-tolerance

  27. Hadoop Distributed File System (HDFS): The most widely used distributed file system; Fixed-sized partitioning; 3-way replication; Write-once read-many [Figure: a file split into a sequence of 128 MB blocks]
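
The fixed-size partitioning and 3-way replication on this slide can be turned into a back-of-the-envelope calculation. The sketch below is only an illustration under the slide's numbers (128 MB blocks, 3 replicas); the function name `hdfs_footprint` is hypothetical and is not part of any HDFS API.

```python
import math

BLOCK_SIZE_MB = 128  # block size shown on the slide
REPLICATION = 3      # 3-way replication, as on the slide


def hdfs_footprint(file_size_mb):
    """Estimate block count and raw storage for one file (illustrative only)."""
    # Fixed-size partitioning: every block is full except possibly the last one.
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": num_blocks,
        "block_replicas": num_blocks * REPLICATION,
        # Each byte of the file is stored once per replica.
        "raw_storage_mb": file_size_mb * REPLICATION,
    }


print(hdfs_footprint(1000))
```

For a 1000 MB file this reports 8 blocks, 24 block replicas, and 3000 MB of raw storage across the cluster.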

  28. Indexing: Data-aware organization; Global index partitions the records into blocks; Local indexes organize the records in a partition. Challenges: Global index; Big volume; HDFS limitation; New programming paradigms; Local indexes; Ad-hoc indexes
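
The global/local split on this slide can be illustrated in memory. The sketch below assumes a simple range-partitioned layout: the global index is a list of partition boundaries, and each local index is just a sorted list searched with binary search. The class name `TwoLevelIndex` and this layout are assumptions made for illustration, not the design of any particular system.

```python
from bisect import bisect_left, bisect_right


class TwoLevelIndex:
    """Toy two-level index: a global index over partitions, a local index inside each."""

    def __init__(self, keys, num_partitions):
        ordered = sorted(keys)
        size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
        # Local indexes: each partition keeps its own keys in sorted order.
        self.partitions = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        # Global index: the smallest key stored in each partition.
        self.boundaries = [part[0] for part in self.partitions]

    def lookup(self, key):
        """Return (partition id, position) for key, or None if it is absent."""
        # Global step: pick the only partition whose key range can hold the key.
        pid = bisect_right(self.boundaries, key) - 1
        if pid < 0:
            return None
        # Local step: binary search inside that single partition.
        part = self.partitions[pid]
        pos = bisect_left(part, key)
        if pos < len(part) and part[pos] == key:
            return pid, pos
        return None


index = TwoLevelIndex([42, 7, 19, 3, 88, 61], num_partitions=3)
print(index.lookup(42))
```

A point lookup touches one partition instead of scanning all of them, which is the benefit the global index is meant to provide.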

  29. Fault Tolerance: Replication; Redundancy; Multiple masters

  30. Streaming: [Figure: a bit stream with a processing window]; Sub-second latency for queries; One scan over the data; (Partial) preprocessing; Continuous queries; Eviction strategies; In-memory indexes
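
A processing window with an eviction strategy can be sketched in a few lines. The example below keeps a running sum over a time-based sliding window in a single pass; it assumes events arrive in timestamp order, and the class name `SlidingWindowSum` is hypothetical.

```python
from collections import deque


class SlidingWindowSum:
    """Continuous query: sum of the values seen in the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0

    def add(self, timestamp, value):
        """Ingest one event and return the up-to-date window sum."""
        self.events.append((timestamp, value))
        self.total += value
        # Eviction: drop events that have fallen out of the processing window.
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total


query = SlidingWindowSum(window=10)
for ts, value in [(0, 1), (5, 2), (10, 4), (20, 1)]:
    print(ts, query.add(ts, value))
```

Each event is read once and the answer is maintained incrementally, so the query never rescans the stream.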

  31. Task Execution: MapReduce: Map-Shuffle-Reduce with mappers M1, M2, …, Mm and reducers R1, R2, …, Rn; Resiliency through materialization. Resilient Distributed Datasets (RDD): Directed-Acyclic-Graph (DAG); In-memory processing; Resiliency through lineages. Hyracks. Stragglers. Load balance
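
The Map-Shuffle-Reduce pattern can be sketched in a single process. The word-count example below is a toy model and not Hadoop's API: the function names are hypothetical, and the "reducers" are just in-memory buckets chosen by hashing the key.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs, num_reducers):
    """Shuffle: route each key to one reducer and group its values together."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def run_job(lines, num_reducers=2):
    mapped = [pair for line in lines for pair in map_phase(line)]
    result = {}
    for bucket in shuffle(mapped, num_reducers):
        for key, values in bucket.items():
            word, count = reduce_phase(key, values)
            result[word] = count
    return result


print(run_job(["big data", "big compute"]))
```

In a real cluster the mapped pairs would be written to disk before the shuffle, which is the "resiliency through materialization" the slide refers to; this sketch skips that step.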

  32. Query Optimization: Finding the most efficient query plan, e.g., grouped aggregation [Figure: alternative plans built from Agg, Partition, and Merge operators]; Cost model (CPU, Disk, Network)
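
The slide's plan diagram did not survive extraction, so the sketch below compares two generic plans for a grouped count rather than the exact plans shown in class. It uses the number of records sent over the network as a stand-in for the network term of the cost model; both function names are hypothetical.

```python
from collections import Counter


def plan_partition_then_aggregate(partitions):
    """Plan A: shuffle every raw record by key, then aggregate."""
    shuffled_records = sum(len(part) for part in partitions)
    result = Counter(key for part in partitions for key in part)
    return result, shuffled_records


def plan_aggregate_partition_merge(partitions):
    """Plan B: aggregate locally, shuffle the partial results, then merge them."""
    partials = [Counter(part) for part in partitions]
    shuffled_records = sum(len(partial) for partial in partials)
    result = Counter()
    for partial in partials:
        result.update(partial)  # merging adds the partial counts per key
    return result, shuffled_records


data = [["a", "a", "b"], ["a", "b", "b", "b"]]
print(plan_partition_then_aggregate(data))
print(plan_aggregate_partition_merge(data))
```

Both plans return the same counts, but Plan B shuffles one record per distinct key per partition instead of one per input record. Local aggregation costs extra CPU, so an optimizer would weigh that against the network savings.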

  33. Provenance: Debugging in distributed systems is painful; We need to keep track of transformations on each record
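
One simple way to keep track of transformations on each record is to carry a history alongside the value. The sketch below is a minimal in-memory illustration; `TracedRecord` and `traced_map` are hypothetical names, and real provenance systems use far more compact representations.

```python
from dataclasses import dataclass, field


@dataclass
class TracedRecord:
    value: object
    # One (step name, input value) entry per transformation applied so far.
    history: list = field(default_factory=list)


def traced_map(records, step_name, fn):
    """Apply fn to every record and log the step in each record's history."""
    return [
        TracedRecord(fn(rec.value), rec.history + [(step_name, rec.value)])
        for rec in records
    ]


raw = [TracedRecord(" 42 "), TracedRecord("7")]
cleaned = traced_map(raw, "strip", str.strip)
parsed = traced_map(cleaned, "to_int", int)
print(parsed[0])
```

When an output looks wrong, its history shows which steps produced it and what each step received as input.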

  34. Big Graphs: Motivated by social networks; Billions of nodes and trillions of edges; Tens of thousands of insertions per second; Complex queries with graph traversals
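
A typical traversal query is "who is within k hops of this node". The sketch below answers it with a breadth-first search over an in-memory adjacency list; at the scale on the slide the graph would be partitioned across machines, which this single-machine example ignores. The function name `within_k_hops` is hypothetical.

```python
from collections import deque


def within_k_hops(adjacency, source, k):
    """Return {node: hop distance} for every node at most k hops from source."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


follows = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(within_k_hops(follows, "a", 2))
```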

  35. Hadoop Ecosystem: Administration; Pig; MapReduce Query Engine; Yet Another Resource Negotiator (YARN); Hadoop Distributed File System (HDFS)

  36. Spark Ecosystem: Spark SQL; Spark DataFrames; MLlib; GraphX; SparkR; Streaming; Resilient Distributed Dataset (RDD), a.k.a. Spark Core; Yet Another Resource Negotiator (YARN); Kubernetes; Hadoop Distributed File System (HDFS)

  37. [Figure: software stack] Inputs: AsterixQL; HiveQL; PigLatin; MapReduce jobs; Pregel jobs; Hyracks jobs. Middle layer: AsterixDB, Hivesterix, and other compilers on the Algebricks Algebra Layer; Hadoop MapReduce Compatibility; Pregelix. Base: Hyracks Data-parallel Platform

  38. Impala: Query Parser; Query Planner; Query Executor; Yet Another Resource Negotiator (YARN); Hadoop Distributed File System (HDFS)

  39. SpatialHadoop: Pig Latin + Pigeon; Spatial Visualization; MapReduce Processing + Spatial Query Processing; Yet Another Resource Negotiator (YARN); Hadoop Distributed File System (HDFS) + Spatial Indexing

  40. Reading Material: “The Age of Analytics in a Data-driven World” [Executive Summary] by McKinsey & Company
