Introduction to Big-data Management: Review and Next Steps
What We Covered
- Storage (HDFS)
- Query processing (MapReduce, RDD, Hyracks)
- Higher-level data flow engines (Pig, SparkSQL)
- Storage formats (row, column, Parquet, LSM indexing)
- Document databases (MongoDB)
- Machine learning (MLlib)
HDFS
[Diagram: a single name node holding metadata for many data nodes; each file is split into 128 MB blocks that are distributed across the data nodes]
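As a quick refresher on the block model, here is a minimal sketch in plain Python (the file size and replication factor are invented for illustration) of how a file maps onto 128 MB blocks:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def block_layout(file_size_bytes, replication=3):
    """Return (number of blocks, total stored bytes) for a file.

    The name node records the block-to-data-node mapping as metadata;
    the data nodes hold the actual block replicas.
    """
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return num_blocks, file_size_bytes * replication

# A hypothetical 1 GB file: 8 blocks, 3 GB of raw storage at 3x replication.
print(block_layout(1024 * 1024 * 1024))
```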
Logical View of MapReduce
During MapReduce, the input and output are both modeled as sets of key-value pairs ⟨k, v⟩.
[Diagram: input data as ⟨k₁, v₁⟩ pairs → Map → intermediate ⟨k₂, v₂⟩ pairs → Reduce → output ⟨k₃, v₃⟩ pairs]
Map and Reduce Functions
- Map function: maps a single input record to a set (possibly empty) of intermediate records.
  Map: ⟨k₁, v₁⟩ → {⟨k₂, v₂⟩}
- Combine function: locally pre-aggregates intermediate records that share a key before the shuffle.
  Combine: ⟨k₂, {v₂}⟩ → {⟨k₂, v₂⟩}
- Reduce function: reduces a set of intermediate records with the same key to a set (possibly empty) of output records.
  Reduce: ⟨k₂, {v₂}⟩ → {⟨k₃, v₃⟩}
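To make the signatures concrete, here is a minimal word-count sketch in plain Python (the driver loop that groups keys is simplified; a real framework shuffles the intermediate pairs across machines):

```python
from collections import defaultdict

def map_fn(key, value):
    # ⟨k1, v1⟩ = (line number, line text) → one ⟨word, 1⟩ pair per word
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # ⟨k2, {v2}⟩ = (word, list of counts) → ⟨word, total⟩
    yield key, sum(values)

lines = {0: "big data is big", 1: "data is data"}

# Map phase, then a simplified shuffle that groups by key.
intermediate = defaultdict(list)
for k1, v1 in lines.items():
    for k2, v2 in map_fn(k1, v1):
        intermediate[k2].append(v2)

# Reduce phase.
for k2, v2s in intermediate.items():
    for k3, v3 in reduce_fn(k2, v2s):
        print(k3, v3)  # big 2, data 3, is 2
```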
Job Execution Overview
Driver → job submission → job preparation → Map + Combine → Shuffle → Reduce → cleanup
Resilient Distributed Dataset (RDD)
- An RDD is a pointer to a distributed dataset
- Stores information about how to compute the data rather than where the data is
- Transformation: converts an RDD to another RDD
- Action: returns the answer of an operation over an RDD
- Narrow vs. wide dependencies
- How RDD operations work
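A minimal PySpark sketch (assuming a local SparkContext; the input values are invented) showing that transformations are lazy and only an action triggers computation:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

nums = sc.parallelize(range(10))          # RDD: nothing computed yet
squares = nums.map(lambda x: x * x)       # transformation (narrow): still lazy
big = squares.filter(lambda x: x > 10)    # transformation: still lazy

print(big.collect())                      # action: triggers the actual computation
# [16, 25, 36, 49, 64, 81]
sc.stop()
```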
SparkSQL

DataFrame (SparkSQL)                 | RDD
-------------------------------------|------------------------------------------------
Lazy execution                       | Lazy execution
Spark is aware of the data model     | The data model is hidden from Spark
Spark is aware of the query logic    | The transformations and actions are black boxes
Can optimize the query               | Cannot optimize the query
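A small sketch (assuming a local SparkSession; the column names and rows are invented) of a filter expressed on a DataFrame, where Spark can see both the schema and the predicate and can therefore optimize the query:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 17), ("carol", 25)],
    ["name", "age"],
)

# Spark sees the schema and the predicate, so the optimizer can, e.g.,
# push the filter down and prune the unused columns.
adults = df.filter(F.col("age") >= 18).select("name")
adults.show()
spark.stop()
```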
Storage Formats
- Difference between row and column formats
- How attributes map to disk
- Major applications for each of them
- Parquet files
  - A column-store file format
  - Handles nesting and repetition
  - Schema → maximum definition and repetition levels
  - Record → definition and repetition level for each attribute
  - Do not forget to account for null (non-existent) attributes
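A minimal Parquet sketch using pyarrow (the file name and columns are invented); note that reading can select individual columns, which is the main point of a column format:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "city": ["Riverside", "Irvine", None],   # a null attribute still gets a slot
})
pq.write_table(table, "users.parquet")

# Column projection: only the bytes of the requested column are read.
cities = pq.read_table("users.parquet", columns=["city"])
print(cities)
```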
Document Databases
- How a document database compares to a relational database (RDBMS)
- Normalization (nesting and repetition)
- ACID compliance
- How MongoDB compares attributes
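A short MongoDB sketch with pymongo (the connection string, database, and collection names are invented) showing a denormalized, nested document in place of a join:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local server
db = client["store"]

# Nesting and repetition replace the normalized orders/items tables.
db.orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Alice", "city": "Riverside"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
})

# Dot notation queries into nested attributes.
print(db.orders.find_one({"customer.city": "Riverside"}))
```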
MLlib
Main components of MLlib:
- Transformers, e.g., feature extraction
- Estimators, e.g., clustering or regression
- Evaluators, e.g., precision and recall calculation
- Validators, e.g., k-fold cross-validation
- Pipeline: transformation(s) + an estimator
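A minimal MLlib pipeline sketch (assuming an existing SparkSession named `spark`; the toy text data is invented) that chains two transformers with an estimator:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

train = spark.createDataFrame(
    [("spark is fast", 1.0), ("slow old disk", 0.0)],
    ["text", "label"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")   # transformer
tf = HashingTF(inputCol="words", outputCol="features")      # transformer
lr = LogisticRegression(maxIter=10)                         # estimator

model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)     # fit the whole pipeline
model.transform(train).select("text", "prediction").show()
```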
Did we cover everything?
2019 Big Data & AI Landscape
[Image: the 2019 Big Data & AI Landscape chart]
Topics Not Covered
- Key-value stores
- Big graph analytics
- Visualization
- Streaming
- Coordination
- Cloud platforms
Key-value Stores
- Provide a simple API to insert/delete/update/search key-value pairs
- Records are indexed by key (typically a string)
- The internal structure is typically a log-structured merge (LSM) tree
- Not generally suitable for large-scale analytics
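A toy sketch of the LSM idea (all names invented; real systems add write-ahead logs, compaction, and bloom filters): writes go to an in-memory memtable that is flushed to sorted, immutable runs, and reads check newest-to-oldest:

```python
class ToyLSM:
    """Toy log-structured merge tree: memtable + sorted immutable runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}       # in-memory writes
        self.runs = []           # newest-first list of sorted runs on "disk"
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the memtable into a sorted, immutable run.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:             # newest data first
            return self.memtable[key]
        for run in self.runs:                # then runs, newest to oldest
            for k, v in run:
                if k == key:
                    return v
        return None

store = ToyLSM()
for i in range(10):
    store.put(f"k{i}", i)
print(store.get("k3"))  # 3, found in an older flushed run
```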
Big Graph Analytics
- Graphs are usually processed using a node-centric processing model
- Nodes and edges are both treated as first-class citizens
- Processing is normally iterative, often needing many iterations to converge
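A toy sketch of the node-centric, iterative style (the graph and damping factor are invented for illustration): in each iteration, every node recomputes its value from what its neighbors sent in the previous one, as in Pregel-style PageRank:

```python
# Toy vertex-centric PageRank: each node repeatedly recomputes its rank
# from the ranks its in-neighbors "send" along their out-edges.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # adjacency: node -> out-edges
ranks = {v: 1.0 / len(graph) for v in graph}
damping = 0.85

for _ in range(20):                                  # many short iterations
    incoming = {v: 0.0 for v in graph}
    for v, out in graph.items():                     # each vertex sends rank/out-degree
        for u in out:
            incoming[u] += ranks[v] / len(out)
    ranks = {v: (1 - damping) / len(graph) + damping * incoming[v]
             for v in graph}

print(ranks)
```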
Visualization
- Sometimes called Business Intelligence (BI)
- Focuses on the end-user interface, producing polished charts (e.g., bar charts and line graphs)
- Internally, the data is managed using the common big-data platforms, but the systems are tuned for fast response to ad-hoc queries
Streaming
- Some applications need to process data in real time with very small latency
- Examples: Twitter search, IoT applications, and social-network trends
- Works primarily off main memory
- Keeps only the latest records to ensure real-time response
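A toy sketch of the keep-only-recent-records idea (the window size and events are invented): a fixed-size sliding window over a stream, kept entirely in memory:

```python
from collections import deque

class SlidingWindow:
    """Keep only the latest N events in memory for real-time queries."""

    def __init__(self, size=1000):
        self.events = deque(maxlen=size)   # old events fall off automatically

    def ingest(self, event):
        self.events.append(event)

    def trending(self, top=3):
        counts = {}
        for e in self.events:
            counts[e] = counts.get(e, 0) + 1
        return sorted(counts, key=counts.get, reverse=True)[:top]

w = SlidingWindow(size=5)
for tag in ["#spark", "#hdfs", "#spark", "#ai", "#spark", "#ai"]:
    w.ingest(tag)
print(w.trending())   # only the last 5 events count toward the trends
```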
Coordination
- Most big-data systems are designed for shared-nothing, large-scale analytics; by design, the machines do little direct coordination with one another
- Coordination systems provide an easy way to coordinate work in these distributed platforms, e.g., a catalog of information, a work queue, or global system status
Machine Learning
- ML is on the rise
- The increasing amount of data makes it a big-data problem
- Big ML systems are emerging to provide scalable processing
Cloud Platforms
- Maintaining your own cluster is costly, and it can sit underutilized most of the time
- Cloud platforms let you rent virtual machines to do your work and dispose of them afterward
- They are well integrated with big-data platforms (such as Hadoop and Spark) for the best user experience
- All you need is an internet connection and a credit card
What is next?
What is Next?
- Real big data is widely available
- Big data is like gold: only a few people know how to deal with it, and you're now one of them
- Applications: keep your hands dirty
- Consider using the public cloud (e.g., AWS, Google Cloud, or Microsoft Azure)
Job Market
[Image: programming languages for data science]
Source: https://www.techicy.com/5-best-programming-languages-to-watch-out-in-2019-for-data-science.html
Data Science
[Image: data science Venn diagram. Credits: Drew Conway]
Data Science
[Image: big data, data science, and machine learning explained]
Source: https://mashimo.wordpress.com/2016/05/28/big-data-data-science-and-machine-learning-explained/
Next Steps
- CS: big-data tools; Python/R/Scala
- Math/Stats: linear algebra, correlation analysis, hypothesis tests (see the sketch below)
- Collaboration with domain experts
- Visualization
- Prototyping
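As a taste of the Math/Stats toolkit, a minimal sketch using scipy (the sample data is invented): a correlation coefficient and a two-sample hypothesis test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)   # y correlates with x

r, p = stats.pearsonr(x, y)                      # correlation analysis
print(f"Pearson r = {r:.2f}, p-value = {p:.3g}")

a = rng.normal(loc=0.0, size=100)
b = rng.normal(loc=0.3, size=100)
t, p = stats.ttest_ind(a, b)                     # two-sample hypothesis test
print(f"t = {t:.2f}, p-value = {p:.3g}")
```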
[Image slides: "CS", "CS/Big Data", "Math/Stats", "Online Courses", and "Data Analytics"]
Source: https://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize
Thank You! Good luck! ☺