Hadoop Performance Evaluation
Advanced Practical Course (Praktikum für Fortgeschrittene)
Name: Tien Duc Dinh
Supervisors: Olga Mordvinova, Julian Kunkel
Date: 04-12-2007
Outline
1. Introduction
   - Motivation
   - Basic notations
2. HDFS Overview
   - Architecture
   - MapReduce
3. HDFS Performance
   - Test scenarios
   - Write
   - Read
   - Comparison with local FS
What is Hadoop?
- Hadoop is an open-source, Java-based programming framework (an Apache project)
- supports the processing of large data sets in a distributed computing environment
- was inspired by Google MapReduce and the Google File System (GFS)
- currently used by many well-known IT enterprises, e.g. Google, Yahoo, IBM
Basic notations
- HDFS = Hadoop Distributed File System
  - distributed file system
  - contains mechanisms for job scheduling/execution
  - for instance, allows moving jobs to the data
- Job/Task = MapReduce job/task
- Metadata
  - data that describes other data
  - e.g. file name, block location
- Block
  - part of a logical file
  - contiguous data stored on one server
  - 64 MB by default, configurable (see the sketch below)
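The block size is an ordinary configuration value and can also be chosen per file. A minimal sketch of how it could be set from Java, assuming the classic org.apache.hadoop.fs API and the dfs.block.size property name used by Hadoop releases of this period (file name and values are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default block size for files created with this configuration
        // (property name is an assumption based on the classic site settings): 128 MB
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // The block size can also be given per file via the create() overload:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out =
            fs.create(new Path("/tmp/example.dat"), true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeUTF("hello HDFS");
        out.close();
        fs.close();
    }
}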
HDFS Overview
[Architecture diagram: a Client submits MapReduce jobs into a queue handled by the JobTracker, sends metadata requests to the Namenode (backed by a Secondary Namenode holding the metadata), and exchanges operation requests/responses directly with the Datanodes; each Datanode runs a TaskTracker on top of its local filesystem]
Client
- is the API side of an HDFS application
- communicates with the Namenode for metadata and runs the operations directly on the Datanodes
- if it is a MapReduce operation, the client creates a job and sends it into the queue; the JobTracker handles this queue
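As a hedged illustration of the client role (standard org.apache.hadoop.fs.FileSystem API; paths and contents are made up), the metadata lookups at the Namenode and the data transfer to/from the Datanodes are hidden behind ordinary stream operations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up the site configuration files
        FileSystem fs = FileSystem.get(conf);       // client handle to HDFS

        // Write: the client asks the Namenode for target Datanodes (metadata),
        // then streams the data directly to those Datanodes.
        FSDataOutputStream out = fs.create(new Path("/user/test/hello.txt"));
        out.writeBytes("Hello HDFS\n");
        out.close();

        // Read: the client asks the Namenode for the block locations,
        // then reads the blocks directly from the Datanodes.
        FSDataInputStream in = fs.open(new Path("/user/test/hello.txt"));
        byte[] buf = new byte[64];
        int n = in.read(buf);
        System.out.println(new String(buf, 0, n));
        in.close();
        fs.close();
    }
}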
Namenode
- is the master server which manages all system metadata, like the namespace, access control information, and the mapping from files to blocks and block locations
- executes file system namespace operations like opening, closing, and renaming files and directories
- gives instructions to the Datanodes to perform system operations, e.g. block creation, deletion and replication
- having only one Namenode simplifies the design
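The namespace operations listed above correspond to FileSystem calls issued by the client and executed on the Namenode; a small sketch with hypothetical paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/user/test/dir"));                                  // create a directory
        fs.rename(new Path("/user/test/dir"), new Path("/user/test/renamed")); // rename
        boolean exists = fs.exists(new Path("/user/test/renamed"));            // namespace lookup
        System.out.println("exists: " + exists);
        fs.delete(new Path("/user/test/renamed"), true);                       // recursive delete
        fs.close();
    }
}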
Datanode
- one per node
- stores HDFS data in its local file system
- performs operations requested by clients and system operations upon instruction from the Namenode
Secondary Namenode
- modifications to the file system are stored as a log file by the Namenode
- while starting up, the Namenode reads the HDFS state from an image file (fsimage) and then applies the modifications from the log file
- after the Namenode has finished writing the new HDFS state to the image file, it empties the log file
- the Secondary Namenode merges fsimage and the log file periodically and keeps the log size within a limit
TaskTracker
- is a node in the cluster that accepts MapReduce tasks from the JobTracker
- is configured with a set of slots; these indicate the number of tasks it can accept (see the configuration sketch below)
- spawns a separate JVM process for each task to do the actual work; this helps to ensure that a process failure does not take down the TaskTracker
- monitors the processes and reports their state to the JobTracker
- contacts the JobTracker through heartbeat messages
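The slot counts are plain configuration properties read by each TaskTracker at startup; a minimal sketch that reads them back, where the property names and defaults are assumptions based on the classic mapred configuration keys:

import org.apache.hadoop.conf.Configuration;

public class SlotConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads the site configuration files
        // Property names are assumptions (classic mapred keys); fallback defaults are 2 slots each.
        int mapSlots    = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        System.out.println("map slots: " + mapSlots + ", reduce slots: " + reduceSlots);
    }
}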
JobTracker (1)
- is the MapReduce master
- normally runs on a separate node
- uses a queue for the I/O scheduling
- talks to the Namenode to determine the location of the data
- submits the work to the chosen TaskTracker nodes and monitors them through periodic heartbeat messages
JobTracker (2)
- if a task fails, it may be resubmitted elsewhere
- when the work is completed, the JobTracker updates its status
- client applications can poll the JobTracker for information (see the sketch below)
- the JobTracker is a single point of failure for the MapReduce infrastructure: if it goes down, all running jobs are lost, but the filesystem remains live
- there is currently no checkpointing or recovery within a single MapReduce job
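A minimal sketch of a client polling the JobTracker through the old org.apache.hadoop.mapred API; the actual job configuration (input/output paths, mapper, reducer) is omitted and assumed to be filled in:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class JobPollingSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();              // ... set input/output paths, mapper and reducer here ...
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);   // hands the job to the JobTracker queue

        // Poll the JobTracker for progress until the job finishes.
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%, reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println("successful: " + job.isSuccessful());
    }
}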
MapReduce (1)
- is a programming model and an associated implementation for processing and generating large data sets
- its functions map and reduce are supplied by the user
- Map
  - processes a key/value pair to generate a set of intermediate key/value pairs
  - all intermediate values with the same key are grouped together and passed to the Reducer
- Reduce
  - merges all intermediate values associated with the same intermediate key, typically producing a smaller set of values
MapReduce (2)
[Figure]
MapReduce (3)
[Figure]
Example: Word count occurrences (1)

map(String key, String value):
  // key: document name (usually the key isn't used)
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
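For comparison, a sketch of the same word count written against the old org.apache.hadoop.mapred API shipped with Hadoop releases of this period; the input folder "data" matches the next slide, while the class name and the output folder "out" are made up:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);               // EmitIntermediate(w, "1")
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();              // result += ParseInt(v)
            }
            output.collect(key, new IntWritable(sum));   // Emit(AsString(result))
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path("data"));
        FileOutputFormat.setOutputPath(conf, new Path("out"));
        JobClient.runJob(conf);
    }
}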
Example: Word count occurrences (2)
- the folder "data" contains 2 files, a and b, with the following contents:
  - a: Hello World Bye World
  - b: Hello Hadoop Goodbye Hadoop
- the following shell command will solve this problem:

  perl -p -e 's/\s+/\n/g' data/* | sort | uniq -c

- the output looks like:

  1 Bye
  1 Goodbye
  2 Hadoop
  2 Hello
  2 World