Components of Big Data 10/01/2018 25
Storage of Big Data
- Data is growing faster than Moore's Law
- Too much data to fit on a single machine
- Partitioning
- Replication
- Fault-tolerance
Hadoop Distributed File System (HDFS)
- The most widely used distributed file system
- Fixed-sized partitioning
- 3-way replication
- Write-once read-many
[Figure: a file split into a sequence of 128 MB blocks]
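The fixed-size partitioning and 3-way replication listed on the HDFS slide can be illustrated with a small sketch. It is not HDFS code: the round-robin placement, the node names, and the 1 GB file size are assumptions made for illustration, and real HDFS uses a rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size shown on the slide
REPLICATION = 3                 # 3-way replication


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs; every block is full except possibly the last."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement (an assumption, not the real HDFS policy).

    Each block is assigned to `replication` distinct nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Hypothetical file of 10^9 bytes stored on four hypothetical nodes.
blocks = split_into_blocks(1_000_000_000)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
```

Losing any single node in this sketch still leaves at least two copies of every block, which is the point of replication.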
Indexing
- Data-aware organization
- Global index partitions the records into blocks
- Local indexes organize the records in a partition
- Challenges:
  - Global index
    - Big volume
    - HDFS limitation
    - New programming paradigms
  - Local indexes
    - Ad-hoc indexes
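The two-level scheme on the Indexing slide can be sketched as follows. This is a minimal illustration, assuming range partitioning for the global index and a sorted list per partition for the local index; the function names and the sample records are hypothetical.

```python
import bisect


def build_index(records, boundaries):
    """Global index: `boundaries` split the key space into len(boundaries) + 1 partitions.

    Local index: each partition keeps its (key, value) records sorted by key.
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        partitions[bisect.bisect_right(boundaries, key)].append((key, value))
    for partition in partitions:
        partition.sort(key=lambda kv: kv[0])
    return partitions


def lookup(partitions, boundaries, key):
    """Use the global index to pick one partition, then binary-search inside it."""
    partition = partitions[bisect.bisect_right(boundaries, key)]
    keys = [k for k, _ in partition]  # rebuilt on each call to keep the sketch short
    i = bisect.bisect_left(keys, key)
    if i < len(partition) and partition[i][0] == key:
        return partition[i][1]
    return None


boundaries = [10, 50]  # three partitions: key < 10, 10 <= key < 50, key >= 50
records = [(5, "a"), (42, "b"), (17, "c"), (99, "d")]
partitions = build_index(records, boundaries)
```

A lookup touches only one partition, which is what makes the global index useful when partitions live on different machines.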
Fault Tolerance
- Replication
- Redundancy
- Multiple masters
Streaming
[Figure: a bit stream with a processing window over part of it]
- Sub-second latency for queries
- One scan over the data
- (Partial) preprocessing
- Continuous queries
- Eviction strategies
- In-memory indexes
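Several items on the Streaming slide (processing window, one scan over the data, continuous queries, eviction) fit together in a sliding-window aggregate. The sketch below is one possible illustration, assuming a time-based window and a running sum; the class name and the sample events are hypothetical.

```python
from collections import deque


class SlidingWindowSum:
    """Continuous query: sum of the values seen in the last `width` time units."""

    def __init__(self, width):
        self.width = width
        self.items = deque()  # (timestamp, value) pairs currently in the window
        self.total = 0

    def add(self, timestamp, value):
        """Process one event exactly once and return the updated answer."""
        self.items.append((timestamp, value))
        self.total += value
        # Eviction strategy: drop events that fell out of the window.
        while self.items and self.items[0][0] <= timestamp - self.width:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total


window = SlidingWindowSum(width=10)
answers = [window.add(t, v) for t, v in [(0, 1), (5, 2), (10, 4), (16, 8)]]
```

Each event is added once and evicted at most once, so the query is answered in a single scan without storing the full stream.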
Task Execution
- MapReduce
  - Map-Shuffle-Reduce
  - Resiliency through materialization
  [Figure: map tasks M1, M2, ..., Mm feeding reduce tasks R1, R2, ..., Rn]
- Resilient Distributed Datasets (RDD)
  - Directed-Acyclic-Graph (DAG)
  - In-memory processing
  - Resiliency through lineages
- Hyracks
- Stragglers
- Load balance
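The Map-Shuffle-Reduce flow named on the Task Execution slide can be sketched on a single machine with a word count. This is an illustration of the programming model only, not Hadoop's API; the function names and input documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, sum(values)


def run_job(documents):
    mapped = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = run_job(["big data", "Big graphs big"])
```

In a real cluster the map calls run on different machines and the shuffle moves data over the network, which is where stragglers and load imbalance become visible.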
Query Optimization
- Finding the most efficient query plan
- e.g., grouped aggregation
[Figure: alternative plans for grouped aggregation, built from Agg, Partition, and Merge operators]
- Cost model (CPU, disk, network)
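A cost model for the grouped-aggregation example might look like the sketch below. The two plans compared here (shuffle every record, or aggregate locally before shuffling) and the unit costs are assumptions chosen for illustration; they are not necessarily the plans drawn on the slide, and disk cost is left out for brevity.

```python
def shuffled_without_preagg(records_per_node, num_nodes):
    """Every input record crosses the network."""
    return records_per_node * num_nodes


def shuffled_with_preagg(records_per_node, num_nodes, num_groups):
    """Each node sends at most one partial aggregate per group."""
    return min(records_per_node, num_groups) * num_nodes


def choose_plan(records_per_node, num_nodes, num_groups,
                cpu_cost=0.1, network_cost=1.0):
    """Pick the cheaper plan under hypothetical per-record unit costs."""
    total_records = records_per_node * num_nodes
    direct = network_cost * shuffled_without_preagg(records_per_node, num_nodes)
    # Local pre-aggregation costs one extra CPU pass over the input.
    preagg = (cpu_cost * total_records
              + network_cost * shuffled_with_preagg(records_per_node, num_nodes, num_groups))
    return "pre-aggregate" if preagg < direct else "direct"


few_groups = choose_plan(1_000_000, 10, num_groups=100)
many_groups = choose_plan(1_000_000, 10, num_groups=1_000_000)
```

With few groups, pre-aggregation shrinks the shuffle dramatically; when almost every key is distinct, it only adds CPU work, so the optimizer needs the cost model to decide.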
Provenance
- Debugging in distributed systems is painful
- We need to keep track of transformations on each record
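Keeping track of the transformations applied to each record can be sketched by carrying a history alongside every value. This is a minimal illustration, not a description of any particular provenance system; the `Traced` class and `apply_step` helper are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Traced:
    """A record value plus the list of (step name, input value) that produced it."""
    value: object
    history: list = field(default_factory=list)


def apply_step(step_name, fn, records):
    """Apply `fn` to every record and append the step to that record's history."""
    return [
        Traced(fn(record.value), record.history + [(step_name, record.value)])
        for record in records
    ]


raw = [Traced(" 42 "), Traced("7")]
stripped = apply_step("strip", str.strip, raw)
parsed = apply_step("parse_int", int, stripped)
```

When an output looks wrong, its `history` shows which step received which input, without rerunning the whole pipeline.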
Big Graphs
- Motivated by social networks
- Billions of nodes and trillions of edges
- Tens of thousands of insertions per second
- Complex queries with graph traversals
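A typical traversal query on a social graph is "who is within k hops of this user". The sketch below answers it with a breadth-first search over an in-memory adjacency list; the tiny graph and the function name are hypothetical, and a graph with billions of nodes would need a distributed engine rather than a single dictionary.

```python
from collections import deque


def within_hops(adjacency, source, max_hops):
    """Breadth-first search: return {node: hop distance} for nodes within `max_hops`."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


follows = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
reachable = within_hops(follows, "a", 2)
```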
Hadoop Ecosystem
- Administration
- Pig
- MapReduce Query Engine
- Yet Another Resource Negotiator (YARN)
- Hadoop Distributed File System (HDFS)
Spark Ecosystem
- Spark SQL, Spark Data Frames, MLlib, GraphX, SparkR, Streaming
- Resilient Distributed Dataset (RDD), a.k.a. Spark Core
- Yet Another Resource Negotiator (YARN), Kubernetes
- Hadoop Distributed File System (HDFS)
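[Figure: the software stack built on Hyracks, from top to bottom]
- Inputs: AsterixQL, HiveQL, PigLatin, MapReduce jobs, Pregel jobs, Hyracks jobs
- Compilers and layers: AsterixDB, Hivesterix, and other compilers on the Algebricks Algebra Layer; Hadoop MapReduce Compatibility; Pregelix
- Hyracks Data-parallel Platform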
Impala
- Query Parser
- Query Planner
- Query Executor
- Yet Another Resource Negotiator (YARN)
- Hadoop Distributed File System (HDFS)
SpatialHadoop
- Pig Latin + Pigeon
- Spatial Visualization
- MapReduce Processing + Spatial Query Processing
- Yet Another Resource Negotiator (YARN)
- Hadoop Distributed File System (HDFS) + Spatial Indexing
Reading Material
- “The Age of Analytics in a Data-driven World” [Executive Summary] by McKinsey & Company