MapReduce Introduction and Hadoop Overview
Lab Course: Databases & Cloud Computing, SS 2012
13 June 2012
Martin Przyjaciel-Zablocki, Alexander Schätzle, Georg Lausen
University of Freiburg, Databases & Information Systems

Agenda
1. Why MapReduce?
   a. Data Management Approaches
   b. Distributed Processing of Data
2. Apache Hadoop
   a. HDFS
   b. MapReduce
   c. Pig
   d. Hive
   e. HBase
3. Programming MapReduce
   a. MapReduce with Java
   b. Moving into the Cloud

MapReduce: Why?
Large datasets
◦ Amount of data increases constantly
◦ “Five exabytes of data are generated every two days” (corresponds to the whole amount of data generated up to 2003), Eric Schmidt
Facebook
◦ >800 million active users, interacting with >1 billion objects
◦ 2.5 petabytes of user data per day!
How to explore and analyze such large datasets?

MapReduce: Why?
Processing a 100 TB dataset
On 1 node
◦ Scanning @ 50 MB/s = 23 days
On a 1000-node cluster
◦ Scanning @ 50 MB/s = 33 min
Current development
◦ Companies often can't cope with logged user behavior and throw away data after some time, which means lost opportunities
◦ Growing cloud-computing capacities
◦ Price/performance advantage of low-end servers increases to about a factor of twelve
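
The scan times above follow from dividing the dataset size by the aggregate read rate. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^6 MB) and data spread perfectly evenly across nodes; the function name `scan_time_seconds` is illustrative and not taken from the slides:

```python
def scan_time_seconds(dataset_tb, mb_per_second_per_node, nodes):
    """Time to read the whole dataset once, assuming every node scans
    an equal share in parallel at the same rate (an idealization)."""
    dataset_mb = dataset_tb * 1_000_000  # decimal units: 1 TB = 10^6 MB
    return dataset_mb / (mb_per_second_per_node * nodes)


single = scan_time_seconds(100, 50, 1)      # one node
cluster = scan_time_seconds(100, 50, 1000)  # 1000-node cluster

print(f"1 node:     {single / 86_400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Real clusters will not reach this ideal, since the sketch ignores coordination overhead, stragglers and network transfer.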

Data Management Approaches
High-performance single machines
◦ “Scale-up” with limits (hardware, software, costs)
◦ Workloads today are beyond the capacity of any single machine
◦ I/O bottleneck
Parallel databases
◦ Fast and reliable
◦ “Scale-out” restricted to a few hundred machines
◦ Maintaining and administering parallel databases is hard
Specialized cluster of powerful machines
◦ “Specialized” = powerful hardware satisfying individual software needs
◦ Fast and reliable but also very expensive
◦ For data-intensive applications, scaling “out” is superior to scaling “up”: the performance gap is insufficient to justify the price

Data Management Approaches (2)
Clusters of commodity servers (with MapReduce)
“Commodity servers” = not individually adjusted
◦ e.g. 8 cores, 16 GB of RAM
◦ Cost & energy efficiency
MapReduce
◦ Designed around clusters of commodity servers
◦ Widely used in a broad range of applications
◦ By many organizations
◦ Scaling “out”, e.g. Yahoo! uses > 40,000 machines
◦ Easy to maintain & administrate

Distributed Processing of Data
Problem: How to compute the PageRank for a crawled set of websites on a cluster of machines? MapReduce!
Main challenges:
◦ How to break up a large problem into smaller tasks that can be executed in parallel?
◦ How to assign tasks to machines?
◦ How to partition and distribute data?
◦ How to share intermediate results?
◦ How to coordinate synchronization, scheduling, fault-tolerance?

Big ideas behind MapReduce
Scale “out”, not “up”
◦ Large number of commodity servers
Assume failures are common
◦ In a cluster of 10000 servers, expect 10 failures a day.
Move processing to the data
◦ Take advantage of data locality and avoid transferring large datasets through the network
Process data sequentially and avoid random access
◦ Random disk access causes seek times
Hide system-level details from the application developer
◦ Developers can focus on their problems instead of dealing with distributed programming issues
Seamless scalability
◦ Scaling “out” improves the performance of an algorithm without any modifications

MapReduce
MapReduce
◦ Popularized by Google & widely used
◦ Algorithms that can be expressed as (or mapped to) a sequence of Map() and Reduce() functions are automatically parallelized by the framework
Distributed File System
◦ Data is split into equally sized blocks and stored in a distributed fashion
◦ Clusters of commodity hardware: fault tolerance by replication
◦ Very large files / write-once, read-many pattern
Advantages (all of this is done automatically!)
◦ Partitioning + distribution of data
◦ Parallelization and assignment of tasks
◦ Scalability, fault-tolerance, scheduling, …
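
To illustrate the division of labour described above, here is a minimal sketch in which the developer supplies only a map and a reduce function and a toy driver takes care of parallel execution. Everything in it is an assumption made for illustration: `run_job` is a hypothetical stand-in for the framework, threads stand in for cluster nodes, and the temperature records are made-up sample data.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_job(records, map_fn, reduce_fn, workers=4):
    """Toy stand-in for the framework: runs map_fn on all records in
    parallel, groups the emitted pairs by key, then runs reduce_fn
    once per key. Threads replace cluster nodes here (an assumption)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        map_outputs = list(pool.map(lambda r: list(map_fn(r)), records))
        groups = defaultdict(list)
        for pairs in map_outputs:
            for key, value in pairs:
                groups[key].append(value)
        keys = sorted(groups)
        reduced = pool.map(lambda k: reduce_fn(k, groups[k]), keys)
        return dict(zip(keys, reduced))


# Everything below is what the application developer writes.
def max_temp_map(record):
    """Emit (year, temperature) for one 'year,temperature' record."""
    year, temperature = record.split(",")
    yield year, int(temperature)


def max_temp_reduce(year, temperatures):
    """Keep the highest temperature seen for a year."""
    return max(temperatures)


readings = ["2010,31", "2011,28", "2010,35", "2011,33", "2012,30"]
hottest = run_job(readings, max_temp_map, max_temp_reduce)
print(hottest)
```

Note that `max_temp_map` and `max_temp_reduce` contain no threading or distribution logic; changing `workers` does not require touching them.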

What is MapReduce used for?
At Google
◦ Index construction for Google Search (replaced in 2010 by Caffeine)
◦ Article clustering for Google News
◦ Statistical machine translation
At Yahoo!
◦ “Web map” powering Yahoo! Search
◦ Spam detection for Yahoo! Mail
At Facebook
◦ Data mining, Web log processing
◦ SearchBox (with Cassandra)
◦ Facebook Messages (with HBase)
◦ Ad optimization
◦ Spam detection

What is MapReduce used for?
In research
◦ Astronomical image analysis (Washington)
◦ Bioinformatics (Maryland)
◦ Analyzing Wikipedia conflicts (PARC)
◦ Natural language processing (CMU)
◦ Particle physics (Nebraska)
◦ Ocean climate simulation (Washington)
◦ Processing of Semantic Data (Freiburg)
◦ <Your application here>

2. Apache Hadoop
“Open-source software for reliable, scalable, distributed computing”

Apache Hadoop: Why?
Apache Hadoop
◦ Well-known open-source implementation of Google’s MapReduce and Google File System (GFS) papers
◦ Enriched by many subprojects
◦ Used by Yahoo, Facebook, Amazon, IBM, Last.fm, eBay, …
◦ Cloudera’s Distribution with VMware images, tutorials and further patches

Hadoop Ecosystem
◦ Pig (Data Flow)
◦ Hive (SQL)
◦ Chukwa (Managing)
◦ ZooKeeper (Coordination)
◦ Avro (Serialization)
◦ MapReduce (Job Scheduling/Execution System)
◦ HBase (NoSQL)
◦ HDFS (Hadoop Distributed File System)
◦ Hadoop Common (supporting utilities, libraries)

Yahoo’s Hadoop Cluster
Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf

Hadoop Distributed File System
◦ Files split into 64 MB blocks
◦ Blocks replicated across several DataNodes (usually 3)
◦ Single NameNode stores metadata (file names, block locations, etc.)
◦ Optimized for large files, sequential reads
◦ Files are append-only
[Figure: the NameNode holds the block list of File1 (blocks 1 to 4); each block appears three times across the DataNodes]
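
The block splitting and replication described above can be sketched as follows. The 64 MB block size and the replication factor of 3 come from the slide; the round-robin placement, the function names and the DataNode names `dn1` to `dn4` are assumptions made for illustration and do not reflect the actual HDFS placement policy.

```python
import math

BLOCK_SIZE_MB = 64  # block size stated on the slide
REPLICATION = 3     # "usually 3" on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; only the last one may be smaller."""
    if file_size_mb <= 0:
        return []
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * (n_blocks - 1)
    sizes.append(file_size_mb - block_size_mb * (n_blocks - 1))
    return sizes


def place_replicas(n_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement (an assumption, not the HDFS policy):
    block i goes to `replication` consecutive DataNodes starting at i.
    Replicas are distinct only if replication <= len(datanodes)."""
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


blocks = split_into_blocks(200)  # a 200 MB file
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])

for block, nodes in block_map.items():
    print(f"block {block} ({blocks[block]} MB) -> {nodes}")
```

The dictionary `block_map` plays the role of the NameNode metadata: it records where each block lives, while the block contents would sit on the DataNodes.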

Hadoop Architecture
◦ Master & Slaves architecture
◦ JobTracker schedules and manages jobs
◦ TaskTracker executes individual map() and reduce() tasks on each cluster node
◦ JobTracker and NameNode, as well as TaskTrackers and DataNodes, are placed on the same machines
[Figure: one Master (JobTracker + NameNode) above three Slaves (each TaskTracker + DataNode)]
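
Placing TaskTrackers on the same machines as DataNodes is what makes it possible to move processing to the data. The slides do not spell out the scheduling policy, so the following is only a sketch under assumptions: a greedy rule that prefers a node already holding the block and falls back to any free node otherwise. The function `assign_map_tasks`, the block ids and the node names are hypothetical.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that already
    stores the block. Toy greedy policy, not Hadoop's actual scheduler.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots:      {node: number of map tasks it can still accept}
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fall back to any node with a free slot: data must travel.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free task slots left")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


locations = {"b1": ["dn1", "dn2"], "b2": ["dn1", "dn2"], "b3": ["dn1"]}
plan = assign_map_tasks(locations, {"dn1": 1, "dn2": 1, "dn3": 1})

for block, (node, is_local) in plan.items():
    print(block, "->", node, "(local)" if is_local else "(remote read)")
```

In this example the third block ends up on a node without a replica because the only node holding it has no free slot left, which is the case where data has to cross the network.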

MapReduce Workflow
(1) Map Phase
◦ Raw data is read and converted to key/value pairs
◦ Map() function is applied to every pair
(2) Shuffle Phase
◦ All key/value pairs are sorted and grouped by their keys
(3) Reduce Phase
◦ All values with the same key are processed within the same reduce() call
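
The three phases can be traced on a small word count, run here in a single process so the intermediate data is visible at each step. This is a sketch for illustration only: the function names, the use of the line index as input key, and the sample sentences are assumptions, and a real job would run the phases across many machines.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(_offset, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts that share the same key."""
    return word, sum(counts)


lines = ["to be or not to be", "to do is to be"]

# (1) Map phase: raw input becomes key/value pairs (line index, line text),
#     and map_fn runs on every pair.
mapped = [kv for offset, line in enumerate(lines) for kv in map_fn(offset, line)]

# (2) Shuffle phase: sort by key, then group the values of equal keys.
mapped.sort(key=itemgetter(0))
grouped = [
    (key, [value for _, value in pairs])
    for key, pairs in groupby(mapped, key=itemgetter(0))
]

# (3) Reduce phase: one reduce_fn call per distinct key.
result = dict(reduce_fn(key, values) for key, values in grouped)
print(result)
```

Sorting before grouping matters because `groupby` only merges adjacent items with equal keys, which mirrors why the shuffle phase sorts before handing data to the reducers.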