Introduction to Hadoop

Distributed Data Processing
The idea of distributed databases is older than you might think
  Richard Peebles, Eric G. Manning: A Computer Architecture for Large (Distributed) Data Bases. VLDB 1975: 405-427
Distributed data structures and algorithms have always been around
So, what is new?

Distributed Data Processing
[Figure: big input data → a cluster of machines → final output]
Concerns the cluster has to handle:
  Data partitioning
  Load balancing
  Fault tolerance
  Synchronization

MapReduce
A programming paradigm for expressing distributed algorithms
Introduced by Google in 2004
  Google File System for distributed storage
  Google MapReduce for distributed processing
Hadoop is the open-source counterpart, released in 2007 with Yahoo! as the main contributor
  HDFS
  Hadoop MapReduce

Hadoop Overview
Master node
  Resource manager
  Name node
Slave nodes (six shown), each running
  Node manager
  Data node

HDFS Loading
Input file (600 MB) is split into blocks when loaded into HDFS:
  128 MB, 128 MB, 128 MB, 128 MB, 88 MB

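The block arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration of the splitting rule (full blocks followed by one shorter final block), not HDFS code; the function name `split_into_blocks` and the 128 MB default are assumptions taken from the slide's example.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the block sizes (in MB) for a file of the given size.

    Hypothetical helper: it only models the arithmetic shown on the slide.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    # divmod gives the number of full blocks and the size of the last partial block.
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block only holds what is left
    return sizes


if __name__ == "__main__":
    print(split_into_blocks(600))  # [128, 128, 128, 128, 88]
```

Note that the last block is smaller than the block size: the 600 MB file needs four full blocks (512 MB) plus one 88 MB block.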
HDFS Storage
[Figure: the file's blocks (B) stored across the data nodes]

Hadoop MapReduce
A kind of functional programming
A program is expressed in two functions only, map and reduce
Map: r → {(k,v)}
  Takes one record as input and returns zero or more <key, value> pairs
Reduce: (k,{v}) → a
  Takes one key and all its associated values and returns zero or more output values

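The two signatures above can be tried out with a small in-memory simulation. The sketch below is not the Hadoop API: `run_mapreduce`, `length_map` and `length_reduce` are hypothetical names, and everything runs in one Python process. It only shows how the values emitted by map are grouped by key before reduce sees them.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the map, group-by-key, reduce pipeline."""
    groups = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: each key is handed over with all of its values.
    results = []
    for key in sorted(groups):
        results.extend(reduce_fn(key, groups[key]))
    return results


def length_map(record):
    # Toy map function: emits nothing for empty records (the "zero" case).
    if record:
        yield len(record), record


def length_reduce(key, values):
    # Toy reduce function: counts how many records share this length.
    yield key, len(values)


if __name__ == "__main__":
    print(run_mapreduce(["fly", "run", "", "walk", "crawl"], length_map, length_reduce))
    # [(3, 2), (4, 1), (5, 1)]
```

Both functions are written as generators so that "zero or more" outputs falls out naturally: a function that yields nothing contributes nothing.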
Example: Word Count

Map(line) {
  split line into words
  for each word w
    output (w,1)
}

Reduce(w, c[]) {
  s = Sum(c)
  output(w, s)
}

Input text file:
  if you cannot fly, then run, if you cannot run, then walk, if you cannot walk, then crawl, but whatever you do you have to keep moving forward

Output:
  you: 5
  cannot: 3
  walk: 2
  if: 3
  …

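The word count pseudocode can be written as a runnable Python sketch. It runs in a single process rather than on Hadoop, and it assumes that words are lowercased and stripped of punctuation, which the slide does not state but which is needed to get "walk: 2" from "walk,". The names `map_line`, `reduce_word` and `word_count` are illustrative.

```python
import re
from collections import defaultdict

# The input text from the slide, one list entry per line of the file.
LINES = [
    "if you cannot fly,",
    "then run, if you",
    "cannot run, then",
    "walk, if you cannot",
    "walk, then crawl,",
    "but whatever you",
    "do you have to",
    "keep moving",
    "forward",
]


def map_line(line):
    # Assumption: a word is a run of letters, compared case-insensitively.
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def reduce_word(word, counts):
    yield word, sum(counts)


def word_count(lines):
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_line(line):
            grouped[word].append(one)  # group all the 1s emitted for a word
    return dict(
        pair
        for word, counts in grouped.items()
        for pair in reduce_word(word, counts)
    )


if __name__ == "__main__":
    counts = word_count(LINES)
    for word in ("you", "cannot", "walk", "if"):
        print(f"{word}: {counts[word]}")
```

With the slide's input this prints the same four counts shown in the output column: you 5, cannot 3, walk 2, if 3.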
Hadoop Operation Modes
Standalone mode: one JRE instance
Pseudo-distributed mode
Cluster mode
[Figure: the daemons shown are the resource manager (RM), name node (NN), node manager (NM) and data node (DN)]