
  1. Introduction to Big Data Systems CS 448 - Spring 2019 March 18th Thamir Qadah

  2. Overview • Discussion on: • Motivation for Big Data • The MapReduce Model • The Hadoop Distributed File System • The Spark data processing framework • Think-Pair-Share sessions, given a few discussion questions: • 2 minutes of thinking • 2-4 minutes discussing with a partner • 2-4 minutes of class-wide discussion

  3. Discussion on Big Data What are the characteristics of Big Data? How are they different from traditional database applications? Why do we need different data management systems for them?

  4. What are the characteristics of Big Data? Volume: size of data. Velocity: rate of data. Variety: types of data. Veracity: quality of data.

  5. How are they different from traditional database applications? Traditional: structured data (e.g., database tables). Big Data: semi- or unstructured data (e.g., JSON, XML, images, videos, …)

  6. Why do we need different data management systems for Big Data? Traditional DBMSs require some form of ETL. Not ideal for certain use cases (e.g., building an inverted index of web pages, computing the PageRank of web pages). One size does not fit all.

  7. Discussion on MapReduce What are the main pieces of logic a programmer needs to specify? What are the benefits of MapReduce and Hadoop?

  8. What are the main pieces of logic a programmer needs to specify?

  9. MapReduce Model map(K1,V1) : List[K2,V2] reduce(K2, List[V2]) : List[K3,V3]

  10. MapReduce Example What does this code compute?
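The code from the original slide is not reproduced in this transcript. The canonical example matching the map/reduce signatures on slide 9 is word count; here is a minimal pure-Python sketch (not the Hadoop API — the shuffle/group-by-key phase is simulated in the driver):

```python
from collections import defaultdict

def map_fn(key, value):
    # map(K1, V1) -> List[(K2, V2)]: emit (word, 1) for each word in a line
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # reduce(K2, List[V2]) -> List[(K3, V3)]: sum all counts for one word
    return [(key, sum(values))]

def run_job(records):
    # Simulate the framework: map phase, shuffle (group by key), reduce phase
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2, vs in sorted(groups.items()):
        output.extend(reduce_fn(k2, vs))
    return output

# Input records are (line offset, line text), as in the classic word count
counts = run_job([(0, "big data big systems"), (1, "data systems")])
# counts == [('big', 2), ('data', 2), ('systems', 2)]
```

The framework, not the programmer, handles partitioning, shuffling, and fault tolerance; the programmer supplies only `map_fn` and `reduce_fn`.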

  11. What are the benefits of MapReduce and Hadoop? Simple distributed programming. Allows for highly parallel, distributed, and reliable data processing. Free and open source.

  12. Discussion on HDFS What are the design goals for HDFS? What are the main architectural components of HDFS?

  13. What are the design goals for HDFS? Fault-tolerance Throughput-optimized Support for large files Append-only data write model
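Two of these goals map directly onto standard HDFS configuration knobs: replication (fault tolerance) and large block size (throughput on large files). The property names below are from standard Hadoop configuration (`hdfs-site.xml`); the values are illustrative defaults, not course-specific settings:

```xml
<configuration>
  <!-- Each block is stored on 3 DataNodes: tolerates node failures -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- 128 MB blocks: few large blocks favor sequential-read throughput -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>
```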

  14. What are the main architectural components of HDFS? NameNode (+ secondary NameNode) DataNodes

  15. Discussion on YARN What is the key concept behind YARN? What are the benefits?

  16. Discussion on YARN Separation of concerns. Improved resource utilization. Allows other applications to run on the cluster.

  17. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012. Shi et al. Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. VLDB 2015.

  18. What are the elements of the vision behind Spark? What is the key feature introduced in Spark 2.0?

  19. What are the elements of the vision behind Spark? A functional, high-level API to support data scientists' workflows; unified data processing. What is the key feature introduced in Spark 2.0? Structured APIs.
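The RDD abstraction from the Zaharia et al. paper combines lazy transformations with recorded lineage, so lost partitions can be recomputed rather than replicated. A toy pure-Python illustration of that idea (this is not the Spark API — class and method names here are invented for illustration):

```python
class ToyRDD:
    """Toy stand-in for an RDD: records lineage instead of computing eagerly."""

    def __init__(self, source=None, op=None, parent=None):
        self.source = source  # base data (only set on the root RDD)
        self.op = op          # transformation to apply to the parent's data
        self.parent = parent  # lineage pointer, used for (re)computation

    def map(self, f):
        # Transformation: returns a new RDD; no work happens yet
        return ToyRDD(op=lambda data: [f(x) for x in data], parent=self)

    def filter(self, pred):
        return ToyRDD(op=lambda data: [x for x in data if pred(x)], parent=self)

    def collect(self):
        # Action: walk the lineage back to the source and compute the result.
        # Replaying this same chain is how a lost partition would be rebuilt.
        if self.parent is None:
            return list(self.source)
        return self.op(self.parent.collect())

nums = ToyRDD(source=range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has executed yet; the action triggers the whole lineage chain:
result = evens_squared.collect()  # [0, 4, 16, 36, 64]
```

Spark 2.0's Structured APIs (DataFrames/Datasets) keep this lazy model but add schema information, which lets the engine optimize the query plan before execution.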

  20. What technology is better? Comparing Parallel Databases vs. MapReduce along: structured data, unstructured data, fault tolerance, query expressiveness, simplicity of use, support for novel applications.

  21. Project 4 Use a real cluster environment (RCAC Scholar) Practice with HDFS Practice with Spark and Spark-SQL (possibly Spark-Streaming too!)
