
Components of Big Data, 10/01/2018



  1. Components of Big Data

  2. Storage of Big Data
       - Data is growing faster than Moore’s Law
       - Too much data to fit on a single machine
       - Partitioning
       - Replication
       - Fault-tolerance

  3. Hadoop Distributed File System (HDFS)
       - The most widely used distributed file system
       - Fixed-sized partitioning (128 MB blocks)
       - 3-way replication
       - Write-once read-many
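Fixed-size partitioning and 3-way replication can be illustrated with a minimal sketch. It is not HDFS code: the helper names `split_into_blocks` and `place_replicas`, the node names, and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size shown on the slide
REPLICATION = 3                 # 3-way replication


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs; only the last block may be shorter."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement (hypothetical, not the HDFS policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
            for block in range(num_blocks)}


# A 300 MB file becomes two full blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"])
```

Each block ends up on three distinct nodes, so the loss of any single node leaves at least two copies of every block.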

  4. Indexing
       - Data-aware organization
       - Global index partitions the records into blocks
       - Local indexes organize the records in a partition
       - Challenges: global index, big volume, HDFS limitation, new programming paradigms, local indexes, ad-hoc indexes
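The split between a global index and local indexes can be sketched in a few lines. This is a single-process illustration under stated assumptions: the class name `TwoLevelIndex`, range partitioning by split points, and sorted lists as the local index are all choices made here, not something the slide prescribes.

```python
from bisect import bisect_left, bisect_right


class TwoLevelIndex:
    """Global index: split points that route a key to a partition.
    Local index: a sorted key list inside each partition."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # One (keys, values) pair of parallel sorted lists per partition.
        self.partitions = [([], []) for _ in range(len(self.boundaries) + 1)]

    def partition_of(self, key):
        # Global lookup: binary search over the split points.
        return bisect_right(self.boundaries, key)

    def insert(self, key, value):
        keys, values = self.partitions[self.partition_of(key)]
        pos = bisect_left(keys, key)
        keys.insert(pos, key)
        values.insert(pos, value)

    def lookup(self, key):
        # Local lookup: binary search inside a single partition.
        keys, values = self.partitions[self.partition_of(key)]
        pos = bisect_left(keys, key)
        if pos < len(keys) and keys[pos] == key:
            return values[pos]
        return None


index = TwoLevelIndex(boundaries=[100, 200])
for key, value in [(5, "a"), (150, "b"), (250, "c"), (120, "d")]:
    index.insert(key, value)
```

A lookup touches only one partition, which is the point of the global index: the other partitions never need to be read.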

  5. Fault Tolerance
       - Replication
       - Redundancy
       - Multiple masters

  6. Streaming
       - Processing window
       - Sub-second latency for queries
       - One scan over the data
       - (Partial) preprocessing
       - Continuous queries
       - Eviction strategies
       - In-memory indexes
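A processing window with an eviction strategy might look like the following sketch. The class name `SlidingWindowAverage`, the time-based window, and the running-average query are assumptions chosen for illustration; each event is seen once, and expired events are evicted as new ones arrive.

```python
from collections import deque


class SlidingWindowAverage:
    """Continuous average over the last `window_seconds` of events."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, value), oldest first
        self.total = 0.0        # partial preprocessing: running sum

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self.total += value
        self._evict(timestamp)

    def _evict(self, now):
        # Eviction strategy: drop events that fell out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def average(self):
        return self.total / len(self.events) if self.events else None


window = SlidingWindowAverage(window_seconds=10)
for timestamp, value in [(0, 1.0), (5, 3.0), (12, 5.0)]:
    window.add(timestamp, value)
```

After the event at time 12 arrives, the event at time 0 has been evicted, so the query is answered from the two remaining events without rescanning the stream.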

  7. Task Execution
       - MapReduce
           - Map-Shuffle-Reduce (mappers M1 to Mm, reducers R1 to Rn)
           - Resiliency through materialization
       - Resilient Distributed Datasets (RDD)
           - Directed-Acyclic-Graph (DAG)
           - In-memory processing
           - Resiliency through lineages
       - Hyracks
       - Stragglers
       - Load balance
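The Map-Shuffle-Reduce pattern can be sketched in a single process. This is not Hadoop code: the function names and the word-count example are assumptions for illustration, and a real engine would run the mappers and reducers on different machines and materialize intermediate results.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the shuffle step does across the network."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    for word in line.lower().split():
        yield (word, 1)


def word_count_reducer(word, counts):
    return sum(counts)


lines = ["big data is big", "data is growing"]
counts = reduce_phase(shuffle(map_phase(lines, word_count_mapper)),
                      word_count_reducer)
```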

  8. Query Optimization
       - Finding the most efficient query plan
       - e.g., grouped aggregation: alternative plans built from Agg, Partition, and Merge Agg operators
       - Cost model (CPU, disk, network)
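Why two plans for the same grouped aggregation can differ in cost is easiest to see by counting the records each one sends over the network. The sketch below is an assumption about what such a comparison looks like, not a reconstruction of the slide's diagram: one plan partitions the raw records and aggregates once, the other aggregates locally, partitions the partial results, and merges them. Both function names are hypothetical.

```python
from collections import Counter


def plan_partition_then_agg(partitions):
    """Ship every raw record, then aggregate once at the destination."""
    shuffled = [record for part in partitions for record in part]
    result = Counter()
    for key, value in shuffled:
        result[key] += value
    return dict(result), len(shuffled)


def plan_local_agg_then_merge(partitions):
    """Aggregate inside each partition, ship the partials, then merge."""
    partials = []
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value
        partials.extend(local.items())
    result = Counter()
    for key, value in partials:
        result[key] += value
    return dict(result), len(partials)


partitions = [
    [("a", 1), ("a", 2), ("b", 1)],
    [("a", 4), ("b", 5), ("b", 6)],
]
sums_1, shipped_1 = plan_partition_then_agg(partitions)
sums_2, shipped_2 = plan_local_agg_then_merge(partitions)
```

Both plans return the same sums, but the second ships four partial records instead of six raw ones. It pays for that with extra CPU work in each partition, which is the kind of trade-off a CPU, disk, and network cost model has to weigh.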

  9. Provenance
       - Debugging in distributed systems is painful
       - We need to keep track of transformations on each record
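One simple way to keep track of transformations on each record is to carry the history along with the value. The sketch below is an illustration only: the `Traced` wrapper and the `apply_step` helper are hypothetical names, and production systems usually store provenance far more compactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traced:
    """A record value together with the steps that produced it."""
    value: object
    lineage: tuple = ()


def apply_step(step_name, fn, records):
    """Transform each record and append (step name, input value) to its lineage."""
    return [Traced(fn(r.value), r.lineage + ((step_name, r.value),))
            for r in records]


records = [Traced(raw) for raw in [" 42", "7 "]]
records = apply_step("strip", str.strip, records)
records = apply_step("to_int", int, records)
records = apply_step("double", lambda x: x * 2, records)
```

When an output looks wrong, `records[0].lineage` shows every intermediate input that led to it, which is the information a debugger needs in a pipeline spread over many machines.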

  10. Big Graphs
        - Motivated by social networks
        - Billions of nodes and trillions of edges
        - Tens of thousands of insertions per second
        - Complex queries with graph traversals
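A typical traversal query on a social graph is "who is within k hops of this user". The breadth-first sketch below runs in memory on a toy adjacency list. The function name `within_hops` and the sample graph are assumptions for illustration, and a graph with billions of nodes would need a distributed engine rather than a single dictionary.

```python
from collections import deque


def within_hops(adjacency, source, max_hops):
    """Return {node: distance} for every node at most max_hops from source."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}
reachable = within_hops(friends, "ann", max_hops=2)
```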

  11. Hadoop Ecosystem
        - Administration
        - Pig
        - MapReduce Query Engine
        - Yet Another Resource Negotiator (YARN)
        - Hadoop Distributed File System (HDFS)

  12. Spark Ecosystem
        - Spark SQL
        - Spark DataFrames
        - MLlib
        - GraphX
        - SparkR
        - Streaming
        - Resilient Distributed Dataset (RDD), a.k.a. Spark Core
        - Yet Another Resource Negotiator (YARN) / Kubernetes
        - Hadoop Distributed File System (HDFS)

  13. Diagram of layers, top to bottom:
        - Inputs: AsterixQL, HiveQL, PigLatin, MapReduce jobs, Pregel jobs, Hyracks jobs
        - AsterixDB, Hivesterix, other compilers
        - Algebricks Algebra Layer, Hadoop MapReduce Compatibility, Pregelix
        - Hyracks Data-parallel Platform

  14. Impala
        - Query Parser
        - Query Planner
        - Query Executor
        - Yet Another Resource Negotiator (YARN)
        - Hadoop Distributed File System (HDFS)

  15. SpatialHadoop
        - Pig Latin + Pigeon
        - Spatial Visualization
        - MapReduce Processing + Spatial Query Processing
        - Yet Another Resource Negotiator (YARN)
        - Hadoop Distributed File System (HDFS) + Spatial Indexing

  16. Reading Material
        - “The Age of Analytics in a Data-driven World” [Executive Summary] by McKinsey & Company
