CS226 Big-Data Management Instructor: Ahmed Eldawy 1

Welcome (back) to UCR! 2

Class information Classes: Monday, Wednesday, Friday 1:00 – 1:50 PM at Humanities and Social Sciences 1501 Instructor: Ahmed Eldawy TA: Saheli Ghosh Office hours: TBD Website: http://www.cs.ucr.edu/~eldawy/19FCS226/ iLearn (Any UCRX students?) Email: eldawy@ucr.edu Subject: “[CS226] …” 3

Course work Active participation in the class (5%) Reading and review tasks (10%) Assignments (20%) Mid-term (15%) Project (50%) 4

Project Groups of 4-5 students Milestones Group Selection Project proposal (5%) Literature survey (10%) Report outline (5%) Class presentation (5%) Final report (15%) Poster presentation (10%) 5

Course goals What are your goals? Understand what big data means Identify the internal components of big data platforms Recognize the differences between different big data platforms Explain how a distributed query runs on big data 6

Super Hero 7

Big-data Expert Understand how the big-data platforms really work Control those thousands of processors efficiently to carry out your task 8

Syllabus Overview of big data Big-data storage Big-data processing Big-data indexing Big-SQL processing Programming packages 9

Introduction 10

Jan 2012: World Economic Forum Report 13

Interest in Big Data in the US ■ March 2012: Obama administration unveils BIG DATA initiative: $200 Million in R&D investment ■ June 2013: Washington Post is calling Obama “The Big Data President” 14

Interest in Big Data in Europe March 2014: David Cameron and Angela Merkel talking about Big Data in a Computer Expo in Hannover, Germany 15

The Market of Big Data 16

Four (formerly Three) V’s of Big Data 17

Big Data Vs Big Computation Full scans (e.g., log processing) Range scans Point lookups Iterations Joins (self, binary, or multiway) Proximity queries Closures and graph traversals 18

Big Data Applications Web search Marketing and advertising Data cleaning Knowledge base Information retrieval Internet of Things (IoT) Visualization Behavioral studies 19

Publicly Available Datasets Data.gov Data.gov.uk Twitter Streaming API Yahoo! Webscope [http://webscope.sandbox.yahoo.com/] GDELT [http://www.gdeltproject.org/] Instagram API 20

Big Data Landscape 2012 http://mattturck.com/2012/06/29/a-chart-of-the-big-data-ecosystem/ 21

Big Data Landscape 2014 http://mattturck.com/2014/05/11/the-state-of-big-data-in-2014-a-chart/ 22

Big Data Landscape 2016 http://mattturck.com/2016/02/01/big-data-landscape/ 23

Big Data Landscape 2018 24

Components of Big Data 25

Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a single machine Partitioning Replication Fault-tolerance 26

Hadoop Distributed File System (HDFS) The most widely used distributed file system Fixed-size partitioning 3-way replication Write-once read-many [Figure: a file split into a sequence of 128MB blocks] 27
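
The fixed-size partitioning and 3-way replication listed on this slide can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: the function name `hdfs_footprint` is hypothetical and not part of any HDFS API, and the 128 MB block size and replication factor of 3 are simply the values shown on the slide.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size shown on the slide
REPLICATION = 3                 # 3-way replication, as listed on the slide


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, total bytes stored across all replicas).

    Hypothetical helper for illustration; it does not talk to a real cluster.
    """
    # Ceiling division: a partially filled last block still counts as a block.
    num_blocks = (file_size_bytes + block_size - 1) // block_size
    # The last block only occupies its actual size, so raw storage is
    # simply the file size multiplied by the replication factor.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes


one_gb = 1024 ** 3
blocks, raw = hdfs_footprint(one_gb)
print(blocks, raw // one_gb)  # 8 blocks, 3 GB of raw storage
```

Under these assumptions a 1 GB file occupies 8 blocks and 3 GB of raw disk space across the cluster.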

Indexing Data-aware organization Global Index partitions the records into blocks Local Indexes organize the records in a partition Challenges: Global index Big volume HDFS limitation New programming paradigms Local indexes Ad-hoc indexes 28
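
One way to picture the global/local split on this slide is a two-level lookup. The sketch below is a single-machine illustration under simple assumptions (range partitioning on a sortable key, sorted lists as local indexes); the names `build_index` and `lookup` are hypothetical, and a real system would store the partitions as blocks in a distributed file system.

```python
from bisect import bisect_left, bisect_right


def build_index(keys, num_partitions):
    """Range-partition the keys (global index) and sort each partition (local index)."""
    ordered = sorted(keys)
    if not ordered:
        return [], []
    size = -(-len(ordered) // num_partitions)  # ceiling division
    partitions = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    # Global index: the smallest key stored in each partition.
    boundaries = [part[0] for part in partitions]
    return boundaries, partitions


def lookup(boundaries, partitions, key):
    """Point lookup: the global index picks a partition, the local index searches it."""
    idx = bisect_right(boundaries, key) - 1
    if idx < 0:
        return False  # key is smaller than every indexed key
    part = partitions[idx]
    pos = bisect_left(part, key)  # binary search inside the sorted partition
    return pos < len(part) and part[pos] == key


boundaries, partitions = build_index([5, 1, 9, 3, 7, 2], num_partitions=3)
print(boundaries)                          # [1, 3, 7]
print(lookup(boundaries, partitions, 5))   # True
print(lookup(boundaries, partitions, 4))   # False
```

Only one partition is touched per lookup, which is the point of the global index: the rest of the data never has to be read.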

Fault Tolerance Replication Redundancy Multiple masters 29

Streaming [Figure: an unbounded stream of bits] Processing window Sub-second latency for queries One scan over the data (Partial) preprocessing Continuous queries Eviction strategies In-memory indexes 30
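
A processing window with an eviction strategy can be sketched in a few lines. The class below is a hypothetical, single-machine illustration that assumes records arrive with non-decreasing timestamps; it answers one continuous query (a running sum) in a single pass by evicting records as they fall out of the window.

```python
from collections import deque


class SlidingWindowSum:
    """Continuous query: sum of the values seen in the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.items = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0

    def add(self, timestamp, value):
        """Ingest one record and return the up-to-date window sum."""
        self.items.append((timestamp, value))
        self.total += value
        # Eviction: drop records that fell out of the processing window.
        while self.items and self.items[0][0] <= timestamp - self.window:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total


w = SlidingWindowSum(window=10)
for ts, value in [(0, 5), (4, 3), (10, 2), (15, 1)]:
    print(ts, w.add(ts, value))
```

Each record is inserted once and evicted at most once, so the state kept in memory is bounded by the number of records in the window.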

Task Execution MapReduce: Map-Shuffle-Reduce, Resiliency through materialization [Figure: mappers M1, M2, …, Mm feeding reducers R1, R2, …, Rn] Resilient Distributed Datasets (RDD): Directed-Acyclic-Graph (DAG), In-memory processing, Resiliency through lineages Hyracks Stragglers Load balance 31
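
The Map-Shuffle-Reduce pattern named on this slide can be simulated on a single machine. The sketch below uses word counting as the example task; the function names are hypothetical and nothing here uses the Hadoop API, so it only shows the data flow between the three phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as happens between mappers and reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big platforms"]))
```

In a real cluster each call to `map_phase` and `reduce_phase` would run as a separate task on a different machine, and the shuffle would move data over the network.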

Query Optimization Finding the most efficient query plan e.g., grouped aggregation [Figure: alternative plans built from Agg, Partition, and Merge operators] Cost model (CPU – Disk – Network) 32
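
To see why the choice of plan matters for grouped aggregation, the sketch below compares two generic plans for a count-per-key query and reports how many records each one sends over the network. The two plans are an assumption chosen for illustration, not necessarily the exact plans drawn on the slide, and the function names are hypothetical.

```python
from collections import Counter


def plan_partition_then_agg(machines):
    """Plan A: ship every raw record to its target partition, then aggregate."""
    shuffled = sum(len(records) for records in machines)
    result = Counter()
    for records in machines:
        for key in records:
            result[key] += 1
    return dict(result), shuffled


def plan_local_agg_then_merge(machines):
    """Plan B: aggregate locally, ship one partial result per key, then merge."""
    partials = [Counter(records) for records in machines]
    shuffled = sum(len(partial) for partial in partials)
    result = Counter()
    for partial in partials:
        result.update(partial)  # adds the partial counts together
    return dict(result), shuffled


# Each inner list holds the grouping keys of the records stored on one machine.
machines = [["a", "a", "b"], ["a", "b", "b", "b"]]
print(plan_partition_then_agg(machines))
print(plan_local_agg_then_merge(machines))
```

Both plans return the same counts, but here Plan B shuffles 4 records instead of 7. A cost model would weigh that network saving against the extra CPU work of the local aggregation.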

Provenance Debugging in distributed systems is painful We need to keep track of transformations on each record 33
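
Keeping track of the transformations applied to each record can be as simple as carrying a history alongside the value. The sketch below is a toy, single-machine illustration; `apply_with_lineage` and the `input[i]` labels are hypothetical names, and real provenance systems have to do this at scale and across machines.

```python
def apply_with_lineage(records, steps):
    """Run named transformations and keep, per record, the list of steps applied.

    `steps` is a list of (name, function) pairs applied in order.
    """
    # Start each record's history with a label for its position in the input.
    tracked = [(value, [f"input[{i}]"]) for i, value in enumerate(records)]
    for name, fn in steps:
        tracked = [(fn(value), history + [name]) for value, history in tracked]
    return tracked


steps = [("double", lambda x: x * 2), ("increment", lambda x: x + 1)]
for value, history in apply_with_lineage([1, 2], steps):
    print(value, history)
```

When an output looks wrong, the history shows which input record it came from and which steps it went through.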

Big Graphs Motivated by social networks Billions of nodes and trillions of edges Tens of thousands of insertions per second Complex queries with graph traversals 34
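
A typical graph-traversal query on a social network is "who is within k hops of this user?". The sketch below answers it with a breadth-first search over an in-memory adjacency list; the function name and the tiny example graph are hypothetical, and a graph with billions of nodes would need a distributed engine rather than a single dictionary.

```python
from collections import deque


def within_hops(adjacency, start, max_hops):
    """Return the nodes reachable from `start` in at most `max_hops` edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    # Exclude the start node itself from the answer.
    return {node for node, hops in distance.items() if hops > 0}


friends = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
}
print(sorted(within_hops(friends, "ann", 2)))  # ['bob', 'cat']
```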

Hadoop Ecosystem Administration Pig MapReduce Query Engine Yet Another Resource Negotiator (YARN) Hadoop Distributed File System (HDFS) 35

Spark Ecosystem Spark SQL Spark Data Frames MLlib GraphX SparkR Streaming Resilient Distributed Dataset (RDD) a.k.a. Spark Core Yet Another Resource Negotiator (YARN) Kubernetes Hadoop Distributed File System (HDFS) 36

Query languages and jobs: AsterixQL, HiveQL, PigLatin, MapReduce Jobs, Pregel Jobs, Other compilers, Hyracks jobs Systems: AsterixDB, Hivesterix, Algebricks Algebra Layer, Hadoop MapReduce Compatibility, Pregelix Hyracks Data-parallel Platform 37

Impala Query Parser Query Planner Query Executor Yet Another Resource Negotiator (YARN) Hadoop Distributed File System (HDFS) 38

SpatialHadoop Pig Latin + Pigeon Spatial Visualization MapReduce Processing + Spatial Query Processing Yet Another Resource Negotiator (YARN) Hadoop Distributed File System (HDFS) + Spatial Indexing 39

Reading Material “The Age of Analytics in a Data-driven World” [Executive Summary] by McKinsey & Company 40
