CSC 369: Distributed Computing Alex Dekhtyar May 6 Day 14: Java Hadoop API

CSC 369: Distributed Computing Alex Dekhtyar May 6 Day 14: Java Hadoop API HAPPY EQUATOR DAY!

Housekeeping
Lab 4 (mini-project): due Sunday night
Lab 5: due tonight (grace period tomorrow)
Lab 6: full lab coming out Friday
Grading: slowly happening…

Hadoop Java API

Hadoop API
Current version is 3.2.1.
Command-line tools: hadoop, hdfs, yarn
We limit ourselves to hadoop jar

Hadoop Java API
org.apache.hadoop
Let’s concentrate on things we absolutely need

Hadoop Java API
org.apache.hadoop
Core MapReduce classes: org.apache.hadoop.mapreduce
Input/Output parsing: org.apache.hadoop.mapreduce.lib.input, org.apache.hadoop.mapreduce.lib.output
Atomic type wrappers: org.apache.hadoop.io
Job configuration: org.apache.hadoop.conf
File system classes: org.apache.hadoop.fs

org.apache.hadoop.mapreduce
org.apache.hadoop.mapreduce.Job: MapReduce Job
org.apache.hadoop.mapreduce.Mapper: Extensible Mapper
org.apache.hadoop.mapreduce.Reducer: Extensible Reducer
org.apache.hadoop.mapreduce.Partitioner: Parent class for Partitioning tasks
org.apache.hadoop.mapreduce.InputFormat, org.apache.hadoop.mapreduce.OutputFormat: Parent classes for Input/Output Formats
org.apache.hadoop.mapreduce.InputSplit: Parent class for Input Split

How it works (diagram)
The Input File is divided into InputSplits.
The Job assigns each InputSplit to a Mapper running on a compute node (Compute Node1, Compute Node2, Compute Node3).
Each Mapper's output goes through a Combiner (a Reducer applied locally on that node) before it reaches the Reducer.
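
The dataflow in the diagram can be sketched in plain Java. This is only a simulation under stated assumptions: it uses no Hadoop classes, the class name SplitPipelineSketch and its methods are hypothetical, each split is modeled as a list of lines, and word counting is chosen purely as an illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-JDK simulation of Input File -> InputSplits -> Mapper -> Combiner -> Reducer.
// All names here are hypothetical; nothing below is part of the Hadoop API.
class SplitPipelineSketch {

    // "Mapper": emits (word, 1) for every word found in one split.
    static List<Map.Entry<String, Integer>> map(List<String> split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : split) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return out;
    }

    // Sums values per key. Used twice: once per split as the "Combiner"
    // (a reducer applied locally) and once at the end as the "Reducer".
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // Runs the whole pipeline over a list of splits.
    static Map<String, Integer> run(List<List<String>> splits) {
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        for (List<String> split : splits) {
            // Mapper output is combined locally before being handed on.
            combined.addAll(sumByKey(map(split)).entrySet());
        }
        return sumByKey(combined);
    }
}
```

The same summing function serves as combiner and reducer here because addition is associative and commutative; a combiner that changed the final answer would not be safe to use.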

Timeline (diagram)
MAP STAGE: Mapper, Combiner, Partitioner on each of Compute Node1, Compute Node2, Compute Node3
Shuffle STAGE: between the map stage and the reduce stage
Reduce STAGE: Reducer on each of Compute Node1, Compute Node2, Compute Node3
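
The Partitioner and Shuffle steps on the timeline can be sketched as follows. This is a plain-JDK simulation, not Hadoop code: ShuffleSketch and its methods are hypothetical names, and keys are assumed to be strings. The partition arithmetic mirrors what Hadoop's default HashPartitioner computes (hash code with the sign bit masked off, modulo the number of reduce tasks).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-JDK simulation of partitioning and shuffling; names are hypothetical.
class ShuffleSketch {

    // Picks the reducer for a key: non-negative hash modulo the reducer count.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Routes every (key, value) pair to its reducer's bucket. Inside a bucket,
    // values are grouped by key and keys are kept in sorted order (TreeMap).
    static List<Map<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<Map<String, List<Integer>>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new TreeMap<String, List<Integer>>());
        }
        for (Map.Entry<String, Integer> p : mapOutput) {
            buckets.get(partition(p.getKey(), numReducers))
                   .computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                   .add(p.getValue());
        }
        return buckets;
    }
}
```

Because the reducer is chosen from the key alone, all values for one key end up in the same bucket regardless of which mapper produced them.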

Mapper in a nutshell
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
protected void map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)
protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context)
void run(org.apache.hadoop.mapreduce.Mapper.Context context)

run(InputSplit s, Context c):
    setup(s,c);                    Run setup() once
    for each record in s do:       Run map() for each record
        map(record, c);
    end for;
    cleanup(s,c)                   Run cleanup() once
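
The setup / map / cleanup sequence above can be sketched as a template method in plain Java. This is a simulation under stated assumptions, not the Hadoop Mapper class: MapperLifecycleSketch, SketchMapper, Context, and LineCounter are hypothetical names, a split is modeled as a list of lines, and the offset passed to map() assumes one byte per character and a one-byte line terminator.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-JDK simulation of the mapper lifecycle; names are hypothetical.
class MapperLifecycleSketch {

    // Hypothetical stand-in for a context object: collects emitted records.
    static class Context {
        final List<String> output = new ArrayList<>();

        void write(String key, int value) {
            output.add(key + "\t" + value);
        }
    }

    // Hypothetical stand-in for a mapper base class. run() fixes the order
    // of calls; subclasses override only the hooks they need.
    abstract static class SketchMapper {
        protected void setup(Context c) { }

        protected abstract void map(long offset, String line, Context c);

        protected void cleanup(Context c) { }

        public void run(List<String> split, Context c) {
            setup(c);                          // once, before any record
            long offset = 0;
            for (String line : split) {
                map(offset, line, c);          // once per record
                offset += line.length() + 1;   // assumes 1-byte chars and '\n'
            }
            cleanup(c);                        // once, after the last record
        }
    }

    // Example subclass: counts lines. State is initialized in setup(),
    // updated in map(), and emitted a single time in cleanup().
    static class LineCounter extends SketchMapper {
        private int count;

        @Override
        protected void setup(Context c) {
            count = 0;
        }

        @Override
        protected void map(long offset, String line, Context c) {
            count++;
        }

        @Override
        protected void cleanup(Context c) {
            c.write("lines", count);
        }
    }
}
```

Emitting from cleanup() rather than from map() means one output record per split instead of one per input line, which is one reason the two extra hooks are worth having.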

Reducer in a nutshell
protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
protected void reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)
protected void cleanup(org.apache.hadoop.mapreduce.Reducer.Context context)
void run(org.apache.hadoop.mapreduce.Reducer.Context context)

Reducer phases: Shuffle, Sort, SecondarySort
run(Context c):
    setup(c);                              Run setup() once
    for each key k in sorted order do:     Run reduce() for each key
        reduce(k, values of k, c);
    end for;
    cleanup(c)                             Run cleanup() once
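
The reducer side can be sketched the same way in plain Java. This is a simulation, not the Hadoop Reducer class: ReducerLifecycleSketch, SketchReducer, and CollectReducer are hypothetical names. Sorting each key's value list here only imitates the effect of a secondary sort; it makes no claim about how Hadoop implements one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-JDK simulation of sort, grouping, and the reducer lifecycle.
class ReducerLifecycleSketch {

    // Sort: group values by key, with keys in sorted order. Ordering the
    // values inside each group imitates the effect of a secondary sort.
    static SortedMap<String, List<Integer>> sortAndGroup(
            List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                   .add(p.getValue());
        }
        for (List<Integer> values : grouped.values()) {
            Collections.sort(values);
        }
        return grouped;
    }

    // Hypothetical stand-in for a reducer base class.
    abstract static class SketchReducer {
        final List<String> output = new ArrayList<>();

        protected void setup() { }

        protected abstract void reduce(String key, Iterable<Integer> values);

        protected void cleanup() { }

        public void run(SortedMap<String, List<Integer>> grouped) {
            setup();                                   // once
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                reduce(e.getKey(), e.getValue());      // once per key
            }
            cleanup();                                 // once
        }
    }

    // Example subclass: records each key with the values it received,
    // in the order it received them.
    static class CollectReducer extends SketchReducer {
        @Override
        protected void reduce(String key, Iterable<Integer> values) {
            List<Integer> seen = new ArrayList<>();
            for (Integer v : values) {
                seen.add(v);
            }
            output.add(key + "\t" + seen);
        }
    }
}
```

Unlike map(), which sees one record at a time, reduce() is called once per key and receives every value for that key together.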

org.apache.hadoop.mapreduce.lib.input
Single File Input Format
FileInputFormat: Generic Input File format (others extend it)
TextInputFormat: Text Input
KeyValueTextInputFormat: User-defined Key-Value Pairs
FixedLengthInputFormat: Fixed Length Records in input
NLineInputFormat: Controls the size of split (in terms of #lines)
Other Important Classes
MultipleInputs: Multiple Files as inputs to a single Mapper
FileSplit: File Partitions
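
What three of these input formats hand to a mapper can be sketched in plain Java. This is a simulation only: InputFormatSketch and its methods are hypothetical names and no Hadoop classes are used. It assumes one byte per character and '\n' line endings, and it assumes the tab character as the key-value separator (the default for KeyValueTextInputFormat).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-JDK simulation of the records produced by three input formats.
class InputFormatSketch {

    // TextInputFormat view: key = offset of the line start, value = the line.
    static List<Map.Entry<Long, String>> textRecords(String fileContents) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.add(new AbstractMap.SimpleEntry<>(offset, line));
            offset += line.length() + 1;   // assumes 1-byte chars and '\n'
        }
        return records;
    }

    // KeyValueTextInputFormat view: split at the first tab. With no tab,
    // the whole line is the key and the value is empty.
    static Map.Entry<String, String> keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new AbstractMap.SimpleEntry<>(line, "");
        }
        return new AbstractMap.SimpleEntry<>(
                line.substring(0, tab), line.substring(tab + 1));
    }

    // NLineInputFormat view: every split holds at most n lines.
    static List<List<String>> nLineSplits(List<String> lines, int n) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, end)));
        }
        return splits;
    }
}
```

The choice of input format therefore decides the KEYIN and VALUEIN types a mapper must declare: an offset and a line in the first case, two strings in the second.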