
Distributed Computing, 17.7.–22.7.2011, Wolf-Tilo Balke & Pierre Senellart

German-French Summer University for Young Researchers 2011 (Deutsch-Französische Sommeruniversität für Nachwuchswissenschaftler / Université d'été franco-allemande pour jeunes chercheurs): Cloud Computing: Challenges and Opportunities


  1. Software Architecture • Similar for uniprocessors and multiprocessors – But for multiprocessors, the kernel is designed to handle multiple CPUs, and the number of CPUs is transparent to applications [Figure: Applications on top of Operating System Services on top of the Kernel] 26

  2. Software Architecture • For multicomputers there are several possibilities – Network OS – Middleware – Distributed OS [Figure: three stacks under the distributed applications: per-node OS and services (network OS); a common middleware layer over per-node OSs (middleware); a single distributed OS whose services span all kernels (distributed OS)] 27

  3. This Course • Not about architectural issues – A lot of open discussions that would fill our time slot completely … • Our main focus: scalability and time 28

  4. Response Time Models • “Classic” cost models focus on the total resource consumption of a task – Leads to good results for heavy computational load and slow network connections • If an execution plan saves resources, many threads can be executed in parallel on different machines – However, algorithms can also be optimized for short response times • “Waste” some resources to get first results earlier • Take advantage of lightly loaded machines and fast connections • Utilize intra-thread parallelism – Parallelize within one thread instead of running multiple concurrent threads 29

  5. Response Time Models • Response time models are needed! – “When does the first piece of the result arrive?” • Important for Web search, query processing, … – “When has the final result arrived?” 30

  6. Distributed Query Processing • Example – Assume relations or fragments A, B, C, and D – All relations/fragments are available on all nodes • Full replication – Compute A ⋈ B ⋈ C ⋈ D – Assumptions • Each join costs 20 time units (TU) • Transferring an intermediate result costs 10 TU • Accessing relations is free • Each node has one computation thread 31

  7. Distributed Query Processing • Two plans: – Plan 1: Execute all operations on one node • Total costs: 60 – Plan 2: Join on different nodes, ship results • Total costs: 80 [Figure: Plan 1 runs A ⋈ B, C ⋈ D, and the final join all on node 1; Plan 2 runs A ⋈ B on node 1 and C ⋈ D on node 2, both send their results to node 3 for the final join] 32

  8. Distributed Query Processing • With respect to total costs, plan 1 is better • Example (cont.) – Plan 2 is better wrt. response time, as operations can be carried out in parallel; see the sketch below [Timeline: Plan 1 executes A ⋈ B, C ⋈ D, and AB ⋈ CD sequentially, finishing at 60 TU; Plan 2 executes A ⋈ B and C ⋈ D in parallel (0–20), transfers both results (20–30), and computes AB ⋈ CD (30–50)] 33
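As a quick illustration of the two metrics, the following Python sketch (not from the slides; constants taken from the assumptions above) computes total cost and response time for both plans:

```python
# Minimal sketch: total cost vs. response time for the two example plans,
# using the cost constants from the slides (join = 20 TU, transfer = 10 TU,
# accessing base relations is free).

JOIN, SHIP = 20, 10

# Plan 1: A⋈B, C⋈D and AB⋈CD all run sequentially on one node.
plan1_total = 3 * JOIN                            # 60 TU
plan1_response = 3 * JOIN                         # 60 TU, nothing runs in parallel

# Plan 2: A⋈B on node 1 and C⋈D on node 2 run in parallel,
# both results are shipped to node 3, which computes the final join.
plan2_total = 3 * JOIN + 2 * SHIP                 # 80 TU of consumed resources
plan2_response = max(JOIN, JOIN) + SHIP + JOIN    # 20 + 10 + 20 = 50 TU

print(plan1_total, plan1_response)                # 60 60
print(plan2_total, plan2_response)                # 80 50
```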

  9. Distributed Query Processing • Response Time – Two types of response times • First Tuple & Full Result Response Time • Computing response times – Sequential execution parts • Full response time is sum of all computation times of all used operations – Multiple parallel threads • Maximal costs of all parallel sequences 34

  10. Response Time Models • Considerations: – How much speedup is possible due to parallelism? • Or: does “kill it with iron” work for parallel problems? – The performance speed-up of algorithms is limited by Amdahl’s Law • Gene Amdahl, 1967 • Algorithms are composed of parallel and sequential parts • Sequential code fragments severely limit the potential speedup of parallelism! 35

  11. Response Time Models – Possible maximal speed-up: speedup ≤ q / (1 + t·(q − 1)) • q is the number of parallel threads • t is the fraction of single-threaded (sequential) code – e.g. if 10% of an algorithm is sequential (t = 0.1), the maximum speed-up regardless of parallelism is 10x – For maximally efficient parallel systems, all sequential bottlenecks have to be identified and eliminated! 36
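A minimal sketch of the reconstructed formula, useful for checking the 10x bound quoted above:

```python
# Amdahl's law as stated above: t is the sequential fraction of the code,
# q the number of parallel threads.

def max_speedup(t: float, q: int) -> float:
    """Upper bound on the speedup with q threads and sequential fraction t."""
    return q / (1 + t * (q - 1))

print(max_speedup(0.1, 16))      # 6.4: 16 threads give far less than 16x
print(max_speedup(0.1, 10**6))   # ~10: approaches 1/t no matter how many threads
```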

  12. Response Time Models 37

  13. Response Time Models • Good first-item response benefits from operations executed in a pipelined fashion – Not pipelined: • Each operation is fully completed and an intermediate result is created • The next operation reads the intermediate result and is then fully completed – Reading and writing intermediate results costs resources! – Pipelined: • Operations do not create intermediate results • Each finished tuple is fed directly into the next operation • Tuples “flow” through the operations 38

  14. Response Time Models • Usually, the result flow is controlled by iterator interfaces implemented by each operation – “Next” command – If the execution speeds of operations in the pipeline differ, results are either cached or the pipeline blocks • Some operations are more suitable than others for pipelining – Good: selections, filtering, unions, … – Tricky: joining, intersecting, … – Very hard: sorting 39

  15. Pipelined Query Processing • Simple pipeline example: tablescan, selection, projection (see the sketch below) • 1,000 tuples are scanned, selectivity is 0.1 – Costs: • Accessing one tuple during tablescan: 2 TU (time unit) • Selecting (testing) one tuple: 1 TU • Projecting one tuple: 1 TU
Pipelined (tablescan → selection → projection): time 2 – first tuple finished tablescan; 3 – first tuple finished selection (if selected); 4 – first tuple in final result; 3,098 – last tuple finished tablescan; 3,099 – last tuple finished selection; 3,100 – all tuples in final result
Non-pipelined (intermediate results IR1, IR2): time 2 – first tuple in IR1; 2,000 – all tuples in IR1; 2,001 – first tuple in IR2; 3,000 – all tuples in IR2; 3,001 – first tuple in final result; 3,100 – all tuples in final result 40
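The following Python sketch (illustrative only; operator and table names are made up) shows how such a pipeline can be expressed with pull-based iterators, so the first result tuple is produced long before the scan completes:

```python
# A pipelined scan -> selection -> projection chain built from generators.
# Each operator pulls tuples from its input on demand ("next"), so no
# intermediate result is materialized.

def table_scan(table):
    for row in table:                # cost model above: 2 TU per tuple
        yield row

def selection(rows, predicate):
    for row in rows:                 # 1 TU per tested tuple
        if predicate(row):
            yield row

def projection(rows, columns):
    for row in rows:                 # 1 TU per surviving tuple
        yield {c: row[c] for c in columns}

table = [{"id": i, "val": i % 10} for i in range(1000)]
pipeline = projection(selection(table_scan(table), lambda r: r["val"] == 0),
                      ["id"])
first = next(pipeline)               # available long before the full result
rest = list(pipeline)                # remaining ~99 result tuples
```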

  16. Pipelined Query Processing • Consider the following example: joining two table subsets – Each input is produced by a pipeline (tablescan → selection → projection); both pipelines work in parallel – The join itself is a non-pipelined BNL join – Costs: • 1,000 tuples are scanned in each pipeline, selectivity 0.1 • Joining 100 ⋈ 100 tuples: 10,000 TU (1 TU per tuple combination) – Response time • The first tuple can arrive at the end of either pipeline after 4 TU – Stored in an intermediate result • All tuples have arrived at the ends of the pipelines after 3,100 TU • The final result is available after 13,100 TU – No benefit from pipelining wrt. response time – The first result tuple arrives at some time 3,100 < t ≤ 13,100 41

  17. Pipelined Query Processing • The suboptimal result of the previous example is due to the unpipelined join – Most traditional join algorithms are unsuitable for pipelining • Pipelining is not a necessary feature in a strictly single-threaded environment – The join is fed by two input pipelines – Only one pipeline can be executed at a time – Thus, at least one intermediate result has to be created – The join may be performed single-/semi-pipelined • In parallel / distributed DBs, fully pipelined joins are beneficial 42

  18. Pipelined Query Processing • Single-Pipelined Hash Join – One of the “classic” join algorithms – Base idea for A ⋈ B • One input relation is read from an intermediate result (B), the other is pipelined through the join operation (A) • All tuples of B are stored in a hash table – The hash function is applied to the join attribute – i.e. all tuples showing the same value for the join attribute are in one bucket » Careful: hash collisions! Tuples with different join attribute values might end up in the same bucket! • Every incoming tuple a (via pipeline) of A is hashed by join attribute • Compare a to each tuple in the respective B bucket – Return those tuples which show matching join attributes 43
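A compact sketch of this idea, assuming dictionary-shaped tuples and hypothetical attribute names; B is materialized into a hash table, while A streams through the join:

```python
# Single-pipelined hash join sketch: build on B, probe with streaming A.
from collections import defaultdict

def single_pipelined_hash_join(a_stream, b_relation, key_a, key_b):
    # Build phase: hash all B tuples on the join attribute.
    buckets = defaultdict(list)
    for b in b_relation:
        buckets[hash(b[key_b])].append(b)

    # Probe phase: each incoming A tuple is hashed and compared against the
    # tuples in its bucket (the equality check handles hash collisions).
    for a in a_stream:
        for b in buckets[hash(a[key_a])]:
            if a[key_a] == b[key_b]:
                yield {**a, **b}        # emit the combined AB tuple immediately
```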

  19. Pipelined Query Processing • Double-Pipelined Hash Join for A ⋈ B – Dynamically build one hash table for A tuples and one for B tuples • Memory intensive! – Process tuples on arrival • Cache tuples if necessary • Balance between A and B tuples for better performance • Rely on statistics for a good A:B ratio – If a new A tuple a arrives • Insert a into the A-table • Check in the B-table whether there are join partners for a • If yes, return all combined AB tuples – If a new B tuple arrives, process it analogously [Figure: output feed AB; A hash table {17: A1, A2; 31: A3}, B hash table {29: B1}; the input feeds deliver A and B tuples, e.g. B(31, B2)] 44

  20. Pipelined Query Processing • Example (cont.): B(31, B2) arrives • Insert it into the B hash table • Find matching A tuples in bucket 31 – A3 is found • Assume that A3 matches B2 • Put AB(A3, B2) into the output feed [Figure: A hash table {17: A1, A2; 31: A3}, B hash table {29: B1; 31: B2}, output feed AB] 45
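A sketch of the double-pipelined (symmetric) variant under the same assumptions; both hash tables grow as tuples arrive, and matches are emitted immediately:

```python
# Double-pipelined hash join sketch: both inputs are hashed incrementally;
# every arriving tuple is inserted into its own table and probed against
# the other one.
from collections import defaultdict

def double_pipelined_hash_join(tagged_stream, key_a, key_b):
    """tagged_stream yields ('A', tuple) or ('B', tuple) in arrival order."""
    a_table, b_table = defaultdict(list), defaultdict(list)
    for side, t in tagged_stream:
        if side == 'A':
            a_table[t[key_a]].append(t)
            for b in b_table.get(t[key_a], []):
                yield {**t, **b}          # emit AB as soon as a match exists
        else:
            b_table[t[key_b]].append(t)
            for a in a_table.get(t[key_b], []):
                yield {**a, **t}
```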

  21. Pipelined Query Processing • In pipelines, tuples just “flow” through the operations – No problem with that within one processing unit… – But how do tuples flow to other nodes? • Sending each tuple individually may be very inefficient – Communication costs: • Setting up the transfer & opening a communication channel • Composing the message • Transmitting the message: header information & payload – Most protocols impose a minimum message size & larger headers – Tuple size ≪ minimal message size • Receiving & decoding the message • Closing the channel 46

  22. Pipelined Query Processing • Idea: minimize communication overhead by tuple blocking (see the sketch below) – Do not send single tuples, but larger blocks containing multiple tuples • “Burst transmission” • Pipeline iterators have to be able to cache packets • The block size should be at least the packet size of the underlying network protocol – Often, larger packets are more beneficial – … more cost factors for the model 47
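A minimal sketch of tuple blocking (function names are illustrative): tuples are buffered and shipped in blocks, with a final flush for the last, partially filled block:

```python
# Buffer outgoing tuples and ship them as blocks instead of sending one
# network message per tuple.

def send_in_blocks(tuples, send_block, block_size=1024):
    block = []
    for t in tuples:
        block.append(t)
        if len(block) == block_size:
            send_block(block)        # one message carries block_size tuples
            block = []
    if block:                        # flush the partially filled last block
        send_block(block)
```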

  23. Google File System • GFS (Google File System) is the distributed file system used by most Google services – The driver in its development was managing the Google Web search index – Applications may use GFS directly • The database Bigtable is an application that was especially designed to run on top of GFS – GFS itself runs on top of standard POSIX-compliant Linux file systems • Hadoop’s file system (HDFS) is an open-source implementation inspired by the GFS papers 48

  24. GFS • Design constraints and considerations – Run on potentially unreliable commodity hardware – Files are large (usually ranging from 100 MB to multiple GBs in size) • e.g. satellite imagery, or a Bigtable file – Billions of files need to be stored – Most write operations are appends • Random writes or updates are rare • Most files are write-once, read-many (WORM) • Appends are much more resilient in distributed environments than random updates • Most Google applications rely on Map and Reduce, which naturally results in file appends 49

  25. GFS – Two common types of read operations • Sequential streams of large data quantities – e.g. streaming video, transferring a web index chunk, etc. – Frequent streaming renders caching useless • Random reads of small data quantities – However, random reads usually move “always forward”, i.e. similar to a sequential read skipping large portions of the file – The focus of GFS is on high overall bandwidth, not latency • In contrast to systems like e.g. Amazon Dynamo – The file system API must be simple and expandable • A flat file name space suffices – The file path is treated as a string » No directory listing possible – Fully qualified file names consist of a namespace and a file name • No POSIX compatibility needed • Additional support for file append and snapshot operations 50

  26. GFS • A GFS cluster represents a single file system for a certain set of applications • Each cluster consists of – A single master server • The single master is one of the key features of GFS! – Multiple chunk servers per master • Accessed by multiple clients – Running on commodity Linux machines • Files are split into fixed-size chunks – Similar to file system blocks – Each labeled with a 64-bit unique global ID – Stored at chunk servers – Usually, each chunk is replicated three times across chunk servers 51

  27. GFS • Application requests are initially handled by the master server – All further chunk-related communication is performed directly between the application and the chunk servers 52

  28. GFS • Master server – Maintains all metadata • Name space, access control, file-to-chunk mappings, garbage collection, chunk migration – Queries for chunks are handled by the master server • Master returns only chunk locations • A client typically asks for multiple chunk locations in a single request • The master also optimistically provides chunk locations immediately following those requested • GFS clients – Consult master for metadata – Request data directly from chunk servers • No caching at clients and chunk servers due to the frequent streaming 53

  29. GFS • Files (cont.) – Each file consists of multiple chunks – For each file, there is a meta-data entry • File namespace • File-to-chunk mappings • Chunk location information – Including replicas! • Access control information • Chunk version numbers 54

  30. GFS • Chunks are rather large (usually 64 MB) – Advantages • Fewer chunk location requests • Less overhead when accessing large amounts of data • Less overhead for storing meta data • Easy caching of chunk metadata – Disadvantages • Increased risk of fragmentation within chunks • Certain chunks may become hot spots 55

  31. GFS • Meta-data is kept in the main memory of the master server – Fast, easy and efficient to periodically scan through the meta data • Re-replication in the presence of chunk server failures • Chunk migration for load balancing • Garbage collection – Usually, there are 64 bytes of metadata per 64 MB chunk (see the estimate below) • The maximum capacity of a GFS cluster is limited by the available main memory of the master – In practice, the query load on the master server is low enough that it never becomes a bottleneck 56
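A rough back-of-the-envelope check of these figures (ignoring namespace metadata and replication bookkeeping):

```python
# With ~64 bytes of chunk metadata per 64 MB chunk, 1 GB of master RAM
# indexes roughly 1 PB of file data.

metadata_per_chunk = 64            # bytes of metadata per chunk
chunk_size = 64 * 2**20            # 64 MB per chunk
master_ram = 1 * 2**30             # 1 GB of master RAM devoted to chunk metadata

chunks = master_ram // metadata_per_chunk
print(chunks * chunk_size / 2**50) # = 1.0, i.e. about one PB of addressable data
```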

  32. GFS • The master server relies on soft state – It regularly sends heartbeat messages to the chunk servers • Is the chunk server down? • Which chunks does the chunk server store? – Including replicas • Are there any disk failures at a chunk server? • Are any replicas corrupted? – Tested by comparing checksums – The master can send instructions to chunk servers • Delete existing chunks • Create new empty chunks 57

  33. GFS • All modifications to meta-data are logged into an operation log to safeguard against GFS master failures – Meta-data updates are not that frequent – The operation log contains a historical record of critical metadata changes, replicated on multiple remote machines – Checkpoints for fast recovery • Operation log can also serve to reconstruct a timeline of changes – Files and chunks, as well as their versions are all uniquely and eternally identified by the logical times at which they were created – In case of failure, the master recovers its file system state by replaying the operation log • Usually, a shadow master is on hot-standby to take over during recovery 58

  34. GFS • Guarantees of GFS – Namespace mutations are always atomic • Handled by the master with locks • e.g. creating new files or chunks • Operation is only treated as successful when operation is performed and all log replicas are flushed to disk 59

  35. GFS – Data mutations follow a relaxed consistency model • A chunk is consistent if all clients see the same data, independently of the queried replica • A chunk is defined if all its modifications are visible – i.e. writes have been atomic – GFS can recognize defined and undefined chunks • In most cases, all chunks should be consistent and defined – … but not always – Using only append operations for data mutations minimizes the probability of undefined or inconsistent chunks 60

  36. GFS • Mutation operations – To encourage consistency among replicas, the master grants a lease for each chunk to one chunk server • The server owning the lease is responsible for that chunk – i.e. it has the primary replica and is responsible for mutation operations • Leases are granted for a limited time (e.g. 1 minute) – Granting leases can be piggybacked onto heartbeat messages – A chunk server may request a lease extension if it is currently mutating the chunk – If a chunk server fails, a new lease can be handed out after the original one has expired » No inconsistencies in case of partitions 61

  37. GFS • Mutation operations have separated data flow and control flow – Idea: maximize bandwidth utilization and overall system throughput – The chunk server holding the primary replica is responsible for the control flow 62

  38. GFS • Mutation workflow overview [Figure: the client contacts the master (steps 1–2); the data is pushed along secondary replica A → primary replica → secondary replica B (step 3, data flow); the write request goes to the primary (4), is forwarded to the secondaries (5), acknowledged (6), and the primary answers the client (7, control flow)] 63

  39. GFS • The application originates the mutation request 1. The GFS client translates the request from (filename, data) to (filename, chunk index), and sends it to the master – The client “knows” which chunk to modify • It does not know where the chunk and its replicas are located 2. The master responds with the chunk handle and the (primary + secondary) replica locations 64

  40. GFS 3. The client pushes the write data to all replicas – The client selects the “best” replica chunk server and transfers all new data • e.g. the closest in the network, or the one with the highest known bandwidth • Not necessarily the server holding the lease • New data: the new data and the address range it is supposed to replace – Exception: appends – The data is stored in the chunk servers’ internal buffers • New data is stored as fragments in the buffer – The new data is pipelined forward to the next chunk server • … and then the next • Serially pipelined transfer of the data • Tries to optimize bandwidth usage 65

  41. GFS 4. After all replicas have received the data, the client sends a write request to the primary chunk server – The primary determines a serial order for the new data fragments stored in its buffer and writes the fragments in that order to the chunk • The write of the fragments is thus atomic – No additional write requests are served during the write operation • Possibly multiple fragments from one or multiple clients 66

  42. GFS 5. After the primary server has successfully finished writing the chunk, it orders the replicas to write – The same serial order is used! • Also, the same timestamps are used – The replicas are inconsistent for a short time 6. After the replicas have completed, the primary server is notified 67

  43. GFS 7. The primary notifies the client – Also, all errors are reported to the client • Usually, errors are resolved by retrying some parts of the workflow – Some replicas may contain the same datum multiple times due to retries – The only guarantee of GFS: data will be written at least once atomically • Failures may render chunks inconsistent – The full workflow is sketched below 68
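Putting the seven steps together, a heavily simplified sketch of the workflow (all object and method names are illustrative, not the real GFS API; retries and error handling are omitted):

```python
# Illustrative model of the GFS mutation workflow described in steps 1-7.

def gfs_write(master, filename, chunk_index, data):
    # Steps 1-2: ask the master for the chunk handle and replica locations.
    handle, primary, secondaries = master.lookup(filename, chunk_index)

    # Step 3: push the data into every replica's buffer (GFS actually
    # pipelines it along a chain of chunk servers to save bandwidth).
    for server in [primary, *secondaries]:
        server.buffer_data(handle, data)

    # Step 4: the write request goes to the primary, which assigns a serial
    # order to all buffered fragments and applies them to its chunk.
    order = primary.choose_serial_order(handle)
    primary.write_buffered(handle, order)

    # Steps 5-6: the primary forwards the same serial order to the
    # secondaries and collects their acknowledgements.
    acks = [s.write_buffered(handle, order) for s in secondaries]

    # Step 7: success or any replica error is reported back to the client.
    return all(acks)
```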

  44. GFS • Google aims at using append operations for most mutations – For random updates, clients need to provide the exact range for the new data within the file • Easy to have collisions with other clients – i.e. client A writes to range 1, client B overwrites range 1 because it assumed it to be empty – Usually, locks would solve this problem – Appends can easily be performed in parallel • Just transfer the new data to the chunk server – Clients can transfer new data in parallel – The chunk server buffers the data • The chunk server will find a correct position at the end of the chunk – Additional logic is necessary for creating a new chunk if the current chunk cannot hold the new data – Typical use case • Multiple producers append to the same file while multiple consumers simultaneously read from it – e.g. the web crawler and the feature extraction engine 69

  45. GFS • The master takes care of chunk creation and distribution – New empty chunk creation, re-replication, rebalancing • The master server notices if a chunk has too few replicas and can re-replicate it – The master decides on chunk locations. Heuristics: • Place new replicas on chunk servers with below-average disk space utilization. Over time this will equalize disk utilization across chunk servers • Limit the number of “recent” creations on each chunk server – Chunks should have different ages to spread chunk correlation • Spread the replicas of a chunk across racks 70

  46. GFS • After a file is deleted, GFS does not immediately reclaim the available physical storage – Only the meta-data entry is deleted from the master server – The file or its chunks become stale • Chunks or files may also become stale if a chunk server misses an update to a chunk – The updated chunk has a different ID than the old chunk – The master server holds only links to the new chunks • The master knows the current chunks of a file • Heartbeat messages reporting unknown (e.g. old) chunks are ignored • During regular garbage collection, stale chunks are physically deleted 71

  47. GFS • Experiences with GFS – Chunk server workload • Bimodal distribution of small and large files • Ratio of write to append operations: 4:1 to 8:1 • Virtually no overwrites – Master workload • Most requests are for chunk locations and open files – Reads achieve 75% of the network limit – Writes achieve 50% of the network limit 72

  48. GFS • Summary and notable features of GFS – GFS is a distributed file system • Optimized for file append operations • Optimized for large files – Files are split into rather large 64 MB chunks, which are distributed and replicated – Uses a single master server for file and chunk management • All meta-data is kept in the master server’s main memory – Uses a flat namespace 73

  49. Distributed Computing Web Data Management http://webdam.inria.fr/Jorge/ S. Abiteboul, I. Manolescu, P. Rigaux, M.-C. Rousset, P. Senellart July 19, 2011

  50. Outline MapReduce Introduction The MapReduce Computing Model MapReduce Optimization Application: PageRank MapReduce in Hadoop Toward Easier Programming Interfaces: Pig Conclusions Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 2 / 71

  51. Outline MapReduce Introduction The MapReduce Computing Model MapReduce Optimization Application: PageRank MapReduce in Hadoop Toward Easier Programming Interfaces: Pig Conclusions Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 3 / 71

  52. Data analysis at a large scale Very large data collections (TB to PB) stored on distributed filesystems: Query logs Search engine indexes Sensor data Need efficient ways of analyzing, reformatting, and processing them In particular, we want: Parallelization of computation (benefiting from the processing power of all nodes in a cluster) Resilience to failure Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 4 / 71

  53. Centralized computing with distributed data storage Run the program at the client node, get data from the distributed system. [Figure: the client node runs the program in its memory and on its disk; input data flows in from the distributed nodes’ disks, output data flows back out] Downsides: important data flows, no use of the cluster computing resources. Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 5 / 71

  54. Pushing the program near the data [Figure: the client node sends program() to a coordinator(), which runs program() on processes 1–3 next to their disks; each process returns its result] MapReduce: A programming model (inspired by standard functional programming operators) to facilitate the development and execution of distributed tasks. Published by Google Labs in 2004 at OSDI [Dean and Ghemawat, 2004]. Widely used since then, open-source implementation in Hadoop. Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 6 / 71

  55. MapReduce in Brief The programmer defines the program logic as two functions: Map transforms the input into key-value pairs to process Reduce aggregates the list of values for each key The MapReduce environment takes charge of the distribution aspects A complex program can be decomposed as a succession of Map and Reduce tasks Higher-level languages (Pig, Hive, etc.) help with writing distributed applications Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 7 / 71

  56. Outline MapReduce Introduction The MapReduce Computing Model MapReduce Optimization Application: PageRank MapReduce in Hadoop Toward Easier Programming Interfaces: Pig Conclusions Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 8 / 71

  57. Three operations on key-value pairs 1. User-defined: map: (K, V) → list(K′, V′) Example: function map(uri, document): foreach distinct term in document: output(term, count(term, document)) 2. Fixed behavior: shuffle: list(K′, V′) → list(K′, list(V′)) regroups all intermediate pairs on the key 3. User-defined: reduce: (K′, list(V′)) → list(K″, V″) Example: function reduce(term, counts): output(term, sum(counts)) Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 9 / 71
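A runnable Python sketch of the three operations for the term-count example, mirroring the pseudocode above (in a real deployment the framework performs the shuffle itself):

```python
# Term count expressed as map, shuffle and reduce over key-value pairs.
from collections import defaultdict

def map_fn(uri, document):
    words = document.split()
    for term in set(words):               # each distinct term in the document
        yield term, words.count(term)

def shuffle(pairs):
    grouped = defaultdict(list)           # regroup all values by key
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_fn(term, counts):
    yield term, sum(counts)

docs = {"u1": "the jaguar is a new world mammal of the felidae family"}
mapped = [kv for uri, d in docs.items() for kv in map_fn(uri, d)]
result = [kv for k, vs in shuffle(mapped) for kv in reduce_fn(k, vs)]
```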

  58. Job workflow in MapReduce Important: each pair, at each phase, is processed independently from the other pairs. [Figure: an input list of (key, value) pairs is fed pair by pair to the map operator map(k1, v1), producing intermediate pairs (k′, v′); the intermediate structure regroups all values for a key, e.g. (k′1, <v′1, v′p, v′q, …>), which is passed to the reduce operator reduce(k′1, <v′1, v′p, v′q, …>), producing the output values (v″)] Network and distribution are transparently managed by the MapReduce environment. Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 10 / 71

  59. Example: term count in MapReduce (input)
URL  Document
u1   the jaguar is a new world mammal of the felidae family.
u2   for jaguar, atari was keen to use a 68k family device.
u3   mac os x jaguar is available at a price of us $199 for apple’s new “family pack”.
u4   one such ruling family to incorporate the jaguar into their name is jaguar paw.
u5   it is a big cat.
Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 11 / 71

  60. Example: term count in MapReduce. Map output / shuffle input: (jaguar, 1), (mammal, 1), (family, 1), (jaguar, 1), (available, 1), (jaguar, 1), (family, 1), (family, 1), (jaguar, 2), … Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 12 / 71

  61. Example: term count in MapReduce. Map output / shuffle input as before; shuffle output / reduce input: (jaguar, <1, 1, 1, 2>), (mammal, <1>), (family, <1, 1, 1>), (available, <1>), … Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 12 / 71

  62. Example: term count in MapReduce. Shuffle output / reduce input as before; final output: (jaguar, 5), (mammal, 1), (family, 3), (available, 1), … Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 12 / 71

  63. Example: simplification of the map function function map(uri, document): foreach distinct term in document: output(term, count(term, document)) can actually be further simplified to: function map(uri, document): foreach term in document: output(term, 1) since all counts are aggregated anyway. Might be less efficient though (we may need a combiner, see further) Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 13 / 71

  64. A MapReduce cluster Nodes inside a MapReduce cluster are decomposed as follows: A jobtracker acts as a master node; MapReduce jobs are submitted to it Several tasktrackers run the computation itself, i.e., map and reduce tasks A given tasktracker may run several tasks in parallel Tasktrackers usually also act as data nodes of a distributed filesystem (e.g., GFS, HDFS) + a client node where the application is launched. Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 14 / 71

  65. Processing a MapReduce job A MapReduce job takes care of distribution, synchronization and failure handling. Specifically: the input is split into M groups; each group is assigned to a mapper (the assignment is based on the data locality principle) each mapper processes a group and stores the intermediate pairs locally grouped instances are assigned to reducers via a hash function (shuffle) intermediate pairs are sorted on their key by the reducer one obtains grouped instances, submitted to the reduce function Remark: data locality no longer holds for the reduce phase, since it reads from the mappers. Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 15 / 71

  66. Assignment to reducers and mappers Each mapper task processes a fixed amount of data (a split), usually set to the distributed filesystem block size (e.g., 64 MB) The number of mapper nodes is a function of the number of mapper tasks and the number of available nodes in the cluster: each mapper node can process (in parallel and sequentially) several mapper tasks The assignment to mappers tries to optimize data locality: the mapper node in charge of a split is, if possible, one that stores a replica of this split (or, if not possible, a node in the same rack) The number of reducer tasks is set by the user The assignment to reducers is done through a hashing of the key, usually uniformly at random; no data locality is possible (see the sketch below) Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 16 / 71
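A minimal sketch of such a key-to-reducer assignment (hashlib is used only to obtain a hash that is stable across processes; this is not Hadoop's actual partitioner):

```python
# Assign each intermediate key to one of the reducer tasks by hashing it.
import hashlib

def reducer_for(key: str, num_reducers: int) -> int:
    digest = hashlib.md5(key.encode()).digest()   # stable across nodes/runs
    return int.from_bytes(digest[:8], "big") % num_reducers

print(reducer_for("jaguar", 1000))                # some fixed value in [0, 999]
```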

  67. Distributed execution of a MapReduce job. Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 17 / 71

  68. Processing the term count example Let the input consist of documents, say, one million 100-term documents of approximately 1 KB each. The split operation distributes these documents in groups of 64 MB: each group consists of 64,000 documents. Therefore M = ⌈1,000,000/64,000⌉ ≈ 16,000 groups. If there are 1,000 mapper nodes, each node processes on average 16 splits. If there are 1,000 reducers, each reducer ri processes all key-value pairs for terms t such that hash(t) = i (1 ≤ i ≤ 1,000) Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 18 / 71

  69. Processing the term count example (2) Assume that hash(’call’) = hash(’mine’) = hash(’blog’) = i = 100. We focus on three mappers mp, mq and mr: 1. Gpi = <…, (’mine’, 1), …, (’call’, 1), …, (’mine’, 1), …, (’blog’, 1), …> 2. Gqi = <…, (’call’, 1), …, (’blog’, 1), …> 3. Gri = <…, (’blog’, 1), …, (’mine’, 1), …, (’blog’, 1), …> ri reads Gpi, Gqi and Gri from the three mappers, sorts their unioned content, and groups the pairs with a common key: …, (’blog’, <1, 1, 1, 1>), …, (’call’, <1, 1>), …, (’mine’, <1, 1, 1>) Our reduce function is then applied by ri to each element of this list. The output is (’blog’, 4), (’call’, 2) and (’mine’, 3) Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 19 / 71

  70. Failure management In case of failure, because the tasks are distributed over hundreds or thousands of machines, the chances that a problem occurs somewhere are much larger; starting the job from the beginning is not a valid option. The master periodically checks the availability and reachability of the tasktrackers (heartbeats) and whether map or reduce jobs make any progress 1. if a reducer fails, its task is reassigned to another tasktracker; this usually requires restarting mapper tasks as well (to re-produce the intermediate groups) 2. if a mapper fails, its task is reassigned to another tasktracker 3. if the jobtracker fails, the whole job should be re-initiated Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 20 / 71

  71. Joins in MapReduce Two datasets, A and B, that we need to join for a MapReduce task If one of the datasets is small, it can be sent over fully to each tasktracker and exploited inside the map (and possibly reduce) functions Otherwise, each dataset should be grouped according to the join key, and the result of the join can be computed in the reduce function (see the sketch below) Not very convenient to express in MapReduce. Much easier using Pig. Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 21 / 71
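A sketch of the second strategy, a reduce-side join, under assumed record shapes and a hypothetical join key "k": both inputs are tagged in map and combined per key in reduce:

```python
# Reduce-side join: map both datasets to (join_key, tagged_record) pairs,
# then perform the actual join inside the reduce function.

def map_a(record):                        # record from dataset A
    yield record["k"], ("A", record)

def map_b(record):                        # record from dataset B
    yield record["k"], ("B", record)

def reduce_join(key, tagged_values):
    a_side = [r for tag, r in tagged_values if tag == "A"]
    b_side = [r for tag, r in tagged_values if tag == "B"]
    for a in a_side:                      # emit the per-key cross product
        for b in b_side:
            yield key, {**a, **b}
```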

  72. Using MapReduce for solving a problem Prefer: Simple map and reduce functions Mapper tasks processing large data chunks (at least the size of distributed filesystem blocks) A given application may have: A chain of map functions (input processing, filtering, extraction…) A sequence of several map-reduce jobs No reduce task when everything can be expressed in the map (zero reducers, or the identity reducer function) Not the right tool for everything (see further) Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 22 / 71

  73. Outline MapReduce Introduction The MapReduce Computing Model MapReduce Optimization Application: PageRank MapReduce in Hadoop Toward Easier Programming Interfaces: Pig Conclusions Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 23 / 71

  74. Combiners A mapper task can produce a large number of pairs with the same key They need to be sent over the network to the reducer: costly It is often possible to combine these pairs into a single key-value pair Example: (jaguar, 1), (jaguar, 1), (jaguar, 1), (jaguar, 2) → (jaguar, 5) combiner: list(V′) → V′, a function executed (possibly several times) to combine the values for a given key, on a mapper node No guarantee that the combiner is called Easy case: the combiner is the same as the reduce function (sketched below). Possible when the aggregate function α computed by reduce is distributive: α(k1, α(k2, k3)) = α(k1, k2, k3) Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 24 / 71
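For the term-count job the combiner can simply reuse the reduce logic; a minimal sketch:

```python
# Combiner for term count: identical to the reduce function, but applied
# locally on the mapper node to shrink the map output before the shuffle.

def combine(term, local_counts):
    yield term, sum(local_counts)

# e.g. (jaguar,1), (jaguar,1), (jaguar,1), (jaguar,2) produced by one mapper
print(list(combine("jaguar", [1, 1, 1, 2])))   # [('jaguar', 5)]
```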

  75. Compression Data transfers over the network: From datanodes to mapper nodes (usually reduced using data locality) From mappers to reducers From reducers to datanodes to store the final output Each of these can benefit from data compression Tradeoff between volume of data transfer and (de)compression time Usually, compressing map outputs using a fast compressor increases efficiency Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart 25 / 71
