Scalable Tools - Part I: Introduction to Scalable Tools
Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu
http://web.cs.iastate.edu/~adisak/MBDS2018/
Scalable Tools session
• Before we begin:
• Option 1: do you have a VirtualBox Ubuntu VM created? You can copy it from a USB disk.
• Option 2: run in the cloud (if you can't run it locally): set up Google Cloud or Amazon EC2 with Python and Spark.
• We will be using Spark, Python, and PySpark.
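If you want to sanity-check a local install before the session, a minimal smoke test like the following should run without errors (this assumes PySpark is on your Python path, e.g. after `pip install pyspark`):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "setup-check")   # run Spark locally on all cores
print(sc.parallelize(range(100)).sum())        # prints 4950 if Spark works
sc.stop()
```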
What is Big Data?
Reference (for the figures in this section): Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Hadoop Training | Edureka, https://www.youtube.com/watch?v=mafw2-CVYnA&t=1995s
Why scale?
• In the early 2000s, companies had to pay more and more to DBMS vendors as their data grew.
Scalable tools for Big Data
• MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
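As an illustration of the model (a pure-Python sketch, not Hadoop's actual implementation), here are the three MapReduce phases applied to a hypothetical "year,temperature" dataset to find the maximum temperature per year:

```python
from collections import defaultdict

def map_phase(records, mapper):
    for record in records:
        yield from mapper(record)      # the mapper emits (key, value) pairs

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)      # group all values by key
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(values) for key, values in groups.items()}

# Hypothetical records; on a cluster each phase runs in parallel on many nodes.
records = ["1950,22", "1950,31", "1951,28"]
mapper = lambda rec: [(rec.split(",")[0], int(rec.split(",")[1]))]

print(reduce_phase(shuffle_phase(map_phase(records, mapper)), max))
# {'1950': 31, '1951': 28}
```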
What are the problems with Big Data in a traditional system?
Traditional scenario
• A manageable workload (in the video's restaurant analogy: one cook keeps up with the orders)
As data increased, traditional systems began to fail:
• Data comes in too fast (high velocity)
• Data comes in unstructured (high variety)
How do we solve this problem?
• Issue 1: too many orders per hour?
• Answer: hire more cooks! (distributed workers)
• The same thing happens with servers and storage.
• Issue 2: the food shelf becomes the bottleneck.
• Now, how do we solve that? With a distributed and parallel approach: the data-locality concept in Hadoop, where data is locally available to each processing unit.
• Sounds good?
• How do we solve Big Data problems (storing and processing Big Data) using a distributed and parallel approach like that?
• We can use Hadoop!
• Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
Who came up with the MapReduce concept?
• Google: Jeffrey Dean and Sanjay Ghemawat introduced it in the 2004 paper "MapReduce: Simplified Data Processing on Large Clusters".
Hadoop Master/Slave Architecture
Hadoop Master/Slave Architecture (cont. 1)
Hadoop Master/Slave Architecture (cont. 2)
• There is a backup worker for every project
How this translates to the actual architecture
Let’s play a game
• Split into four groups; each group assigns 1 manager and 1 assistant
• The assistant collects the results and times the process
• Group A: everybody reads the whole paper (5 pages); the manager combines (averages) the results
• Group B: each person reads one page; the manager combines the results
• Second round: repeat with the Page 2 result missing, and see how each group copes
• Task for each team member:
• Read the paper
• Count these words (not case-sensitive): Year, Dream, Will, Describe, Soul
HDFS Data Block
• HDFS splits every file into fixed-size blocks (128 MB by default in recent Hadoop versions) and distributes them across DataNodes.
Fault tolerance
Fault tolerance: Replication Factor
• Each block is replicated (3 copies by default) on different DataNodes, so the data survives the failure of any single node.
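To make the block and replication arithmetic concrete, here is a back-of-the-envelope sketch. The 128 MB block size and replication factor of 3 are common HDFS defaults, assumed here purely for illustration:

```python
import math

BLOCK_SIZE_MB = 128        # common HDFS default (assumption)
REPLICATION_FACTOR = 3     # common HDFS default (assumption)

def hdfs_footprint(file_size_mb):
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)   # blocks the file is split into
    replicas = blocks * REPLICATION_FACTOR             # total block copies stored
    raw_storage = file_size_mb * REPLICATION_FACTOR    # cluster-wide storage consumed
    return blocks, replicas, raw_storage

print(hdfs_footprint(500))  # a 500 MB file -> (4, 12, 1500)
```

So a 500 MB file becomes 4 blocks and 12 stored block copies; losing any one DataNode still leaves two copies of every block.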
Example: the MapReduce word-count process
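One common way to run a word count like this on Hadoop is Hadoop Streaming, which lets plain Python scripts act as the mapper and reducer by reading and writing standard input/output. A minimal sketch (the file names are our own):

```python
# mapper.py -- the Map step: emit one "word<TAB>1" line per word
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- the Reduce step: sum counts per word. Hadoop Streaming
# delivers mapper output sorted by key, so all lines for one word are adjacent.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

You can simulate the whole pipeline locally with `cat input.txt | python mapper.py | sort | python reducer.py`; the `sort` stands in for Hadoop's shuffle phase.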
Apache Spark
Apache Spark
• Spark is a lightning-fast, real-time processing framework. It performs in-memory computations to analyze data in real time.
• It came into the picture because Apache Hadoop MapReduce performed batch processing only and lacked a real-time processing feature. Hence, Apache Spark was introduced: it can perform stream processing in real time and can also handle batch processing.
Apache Spark
• It can leverage the Hadoop ecosystem for storage and cluster management (HDFS and YARN).
• It uses HDFS (Hadoop Distributed File System) for storage.
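A minimal sketch of that division of labor, with Spark doing the processing and HDFS doing the storage. The NameNode host, port, and file path below are placeholders, not real endpoints:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("hdfs-demo")
sc = SparkContext(conf=conf)

# Read a file stored in HDFS; the URL is a placeholder for illustration.
lines = sc.textFile("hdfs://namenode:9000/data/sample.txt")
print(lines.count())  # number of lines, computed in memory across the cluster

sc.stop()
```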
Spark is fast!
But it can cost more, depending on the cost of memory.
PySpark
• With PySpark, you can work with RDDs in the Python programming language as well. This is possible thanks to a library called Py4J.
• PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the SparkContext. The majority of data scientists and analytics experts today use Python because of its rich library ecosystem, so integrating Python with Spark is a boon to them.
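For example, in the PySpark shell (where `sc`, the SparkContext, is created for you), the classic word count is a few chained RDD operations:

```python
rdd = sc.parallelize(["big data", "big tools", "data data"])

counts = (rdd.flatMap(lambda line: line.split())   # split lines into words
             .map(lambda word: (word, 1))          # emit (word, 1) pairs
             .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.collect())  # [('big', 2), ('data', 3), ('tools', 1)] in some order
```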
Spark benchmark (PySpark and pandas)
• "Benchmarking Apache Spark on a Single Node Machine": https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
• The benchmark involves running SQL queries over the table "store_sales" (scale factors 10 to 260) stored in Parquet file format.
• What do we learn from this?

def new_data_project(dataset_is_large):
    # Rule of thumb from the benchmark: pandas wins while the data fits on
    # one machine; Spark or Hadoop wins once it no longer does.
    if dataset_is_large:
        return "use Spark or Hadoop"
    return "use Python pandas"