MapReduce and SQL Injections
CS 3200 Final Lecture

Reference: Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.

Introduction
- How to write software for a cluster? 1,000, 10,000, maybe more machines
  - Failure or crash is not an exception, but a common phenomenon
- Parallelize computation, distribute data, balance load
- Makes implementation of conceptually straightforward computations challenging:
  - Create inverted indices
  - Representations of the graph structure of Web documents
  - Number of pages crawled per host
  - Most frequent queries in a given day

MapReduce
- Abstraction to express computation while hiding messy details
- Inspired by the map and reduce primitives in Lisp
- Apply map to each input record to create a set of intermediate key-value pairs
- Apply reduce to all values that share the same key (like GROUP BY)
- Automatically parallelized
- Re-execution as primary mechanism for fault tolerance

Programming Model
- Transforms a set of input key-value pairs into a set of output key-value pairs
- Map, written by the user: (k1, v1) -> list(k2, v2)
- MapReduce library groups all intermediate pairs with the same key together
- Reduce, written by the user: (k2, list(v2)) -> list(v2)
  - Usually zero or one output value per group
  - Intermediate values supplied via an iterator (to handle lists that do not fit in memory)

Example
- Count the number of occurrences of each word in a document collection:

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

- This is almost all the coding needed (also need a MapReduce specification object with names of input and output files, and optional tuning parameters)
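The word-count pseudocode above can be sketched as a single-machine Python program. The driver and function names (map_fn, reduce_fn, run_mapreduce) are illustrative only, not the actual MapReduce library interface:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Emit an intermediate (word, 1) pair for each word in the document."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    return sum(counts)

def run_mapreduce(documents):
    # Shuffle phase: group intermediate pairs by key
    # (the MapReduce library does this between map and reduce).
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    # Reduce phase: one output value per group.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

docs = {"d1": "the cat sat", "d2": "the cat ran"}
print(run_mapreduce(docs))  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

The real library distributes the map and reduce calls across machines; only the two user functions stay the same.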
Implementation Overview
- Focuses on large clusters
- Relies on existence of a reliable and highly available distributed file system
- Map invocations
  - Automatically partition input data into M chunks (16-64 MB typically)
  - Chunks processed in parallel
- Reduce invocations
  - Partition intermediate key space into R pieces, e.g., using hash(key) mod R
- Master node controls program execution

Execution
- Mappers inform master about file locations
- Master notifies reducers about intermediate file locations
- Reducers (i) read all data from mappers, (ii) sort by intermediate key, (iii) perform computation for each group

Fault Tolerance
- Master monitors tasks on mappers and reducers: idle, in-progress, completed
- Worker failure (common)
  - Master pings workers periodically; no response => assumes worker failed
  - Resets worker's map tasks, completed or in-progress, to idle state (tasks now available for scheduling on other workers)
    - Completed map tasks live only on the worker's local disk, hence inaccessible
  - Same for reducer's in-progress tasks
    - Completed reduce tasks are stored in the global file system, hence still accessible
  - Reducers notified about change of mapper assignment
- Master failure (unlikely)
  - Checkpointing, or simply abort the computation

Practical Considerations
- Conserve network bandwidth ("locality optimization")
  - Distributed file system assigns data chunks to local disks
  - Schedule map task on a machine that already has a copy of the chunk, or one "nearby"
- Choose M and R much larger than the number of worker machines
  - Load balancing and faster recovery (many small tasks from a failed machine)
  - Limitation: O(M+R) scheduling decisions and O(M*R) in-memory state at the master
  - Common choice: M so that chunk size is 16-64 MB, R a small multiple of the number of workers
- Backup tasks to deal with machines that take unusually long for the last few tasks
  - For in-progress tasks when the MapReduce is near completion

Applicability of MapReduce
- Machine learning algorithms, clustering
- Data extraction for reports of popular queries
- Extraction of page properties, e.g., geographical location
- Graph computations
- Google indexing system
  - Sequence of 5-10 MapReduce operations
  - Smaller, simpler code (3800 LOC -> 700 LOC)
  - Easier to change code
  - Easier to operate, because the MapReduce library takes care of failures
  - Easy to improve performance by adding more machines

MapReduce vs. DBMS
- Map: assume table "InputFile" with schema (key1, val1) is the input; "mapFct" is a user-defined function that can output a set with schema (key2, val2)

    SELECT mapFct(key1, val1) AS (key2, val2)  -- not really correct SQL
    FROM InputFile

- Reduce: assume MapOutput has schema (key2, val2); "redFct" is a user-defined function

    SELECT redFct(val2)
    FROM MapOutput
    GROUP BY key2
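The hash-partitioned shuffle (hash(key) mod R) and the GROUP-BY-style reduce described above can be sketched in Python. All names here are illustrative; crc32 stands in for the partitioning hash because Python's built-in hash() is randomized per process:

```python
from itertools import groupby
from zlib import crc32

R = 3  # number of reduce partitions

def partition(key, r=R):
    # Deterministic stand-in for hash(key) mod R.
    return crc32(key.encode()) % r

# Intermediate key-value pairs emitted by the mappers.
pairs = [("cat", 1), ("dog", 1), ("cat", 1), ("emu", 1)]

# Each mapper writes a pair into the partition chosen by its key,
# so all pairs with the same key end up at the same reducer.
partitions = [[] for _ in range(R)]
for key, value in pairs:
    partitions[partition(key)].append((key, value))

# Each reducer sorts its partition by key, then reduces per group
# (the GROUP BY analogy from the slide).
result = {}
for part in partitions:
    part.sort()
    for key, group in groupby(part, key=lambda kv: kv[0]):
        result[key] = sum(v for _, v in group)

print(sorted(result.items()))  # [('cat', 2), ('dog', 1), ('emu', 1)]
```

Because the partition function depends only on the key, the two ("cat", 1) pairs are guaranteed to land in the same partition, which is what makes the per-reducer group-by correct.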
Parallel DBMS
- SQL specifies what to compute, not how to do it
  - Perfect for a parallel and distributed implementation
  - "Just" need an optimizer that can choose the best plan in a given parallel/distributed system
    - Cost estimate includes disk, CPU, and network cost
- Recent benchmarks show a parallel DBMS can significantly outperform MapReduce
  - But many programmers prefer writing Map and Reduce in a familiar PL (C++, Java)
- Recent trend: simplify coding in MapReduce by using DBMS ideas
  - (Variants of) relational operators, implemented on top of Hadoop

MapReduce Summary
- MapReduce = programming model that hides details of parallelization, fault tolerance, locality optimization, and load balancing
- Simple model, but fits many common problems
- Implementation on a cluster scales to 1000s of machines and more
- Open-source implementation, Hadoop, is available
- Parallel DBMS and SQL are more powerful than MapReduce and similarly allow automatic parallelization of "sequential code"
  - Never really achieved mainstream acceptance or broad open-source support like Hadoop
- Recent trend: high-level PL for writing MapReduce programs with DBMS-inspired operators

SQL Injection
- Exploits a security vulnerability in the database layer of a Web application when user input is not sufficiently checked and sanitized
  - Think DBMS access through Web forms
- Main idea: pass a carefully crafted string as a parameter value for an SQL query
- The string executes harmful code
  - Reveals data to an unauthorized user
  - Data modification by an unauthorized user
  - Deletes an entire table
- The following examples are from unixwiz.net

Getting Started
- Assume we know nothing about the Web application, except that it probably checks the user email with a query like this:

    SELECT attributeList
    FROM table
    WHERE attribute = '$email';

- Typical for a Web form allowing user login and sending the password to the user's email address
- $email is the email address submitted by the user through the Web form
- Try entering name@xyz.com' in the form:

    SELECT attributeList
    FROM table
    WHERE attribute = 'name@xyz.com'';
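The query construction above can be sketched in Python: the server naively splices the submitted email into the SQL string. The table and attribute names are the slide's placeholders, not a real schema:

```python
# Placeholder query template from the slide; the server fills in $email.
QUERY_TEMPLATE = "SELECT attributeList FROM table WHERE attribute = '{}';"

normal = QUERY_TEMPLATE.format("name@xyz.com")
probe = QUERY_TEMPLATE.format("name@xyz.com'")  # attacker appends a quote

print(normal)
# SELECT attributeList FROM table WHERE attribute = 'name@xyz.com';
print(probe)
# SELECT attributeList FROM table WHERE attribute = 'name@xyz.com'';
# The extra quote unbalances the string literal, so the DBMS raises a
# syntax error -- which tells the attacker the input is not sanitized.
```

The probe string ends with two adjacent quotes, which is exactly the malformed query shown on the slide.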
First Code Injection
- The query has incorrect SQL syntax
  - Getting a syntax error message indicates that input is sent to the server unsanitized
- Now try injecting additional "code":

    SELECT attributeList
    FROM table
    WHERE attribute = 'anything' OR 'x' = 'x';

- Legal query whose WHERE clause is always satisfied
- Might see a response from the system like "Your login info has been sent to somebody@somewhere.com"
- Enough information to start exploring the actual query structure

Guess Names of Attributes
- Try whether "email" is the right attribute name:

    SELECT attributeList
    FROM table
    WHERE attribute = 'x' AND email IS NULL; -- ';

- A server error would indicate that the attribute name "email" is probably wrong; if so, try others
- A valid response (e.g., "Address unknown") indicates that the attribute name was correctly guessed
- Can guess names of other attributes like "passwd", "login_id", "full_name", and so on
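The always-true WHERE clause above can be demonstrated end to end with SQLite. This is an illustrative sketch, not the slide's server code, and the users table is invented for the example; it also shows the standard defense, a parameterized query, where the driver passes the payload as a value rather than as SQL text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, passwd TEXT)")
conn.execute("INSERT INTO users VALUES ('alice@xyz.com', 'secret')")

payload = "anything' OR 'x' = 'x"

# Vulnerable: user input spliced directly into the SQL string.
vulnerable = "SELECT email FROM users WHERE email = '%s'" % payload
rows = conn.execute(vulnerable).fetchall()
print(rows)  # [('alice@xyz.com',)] -- the WHERE clause is always true

# Safe: a parameterized query treats the payload as a literal value.
safe_rows = conn.execute(
    "SELECT email FROM users WHERE email = ?", (payload,)
).fetchall()
print(safe_rows)  # [] -- no user has that literal email address
```

With the parameter placeholder, the quotes in the payload never reach the SQL parser, so none of the probing tricks on these slides apply.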