1 Execution Implementation Overview Master notifies reducers - PDF document

MapReduce and SQL Injections (CS 3200 Final Lecture). Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.


  1. MapReduce and SQL Injections
     CS 3200 Final Lecture

     MapReduce
     Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004

     Introduction
     - How to write software for a cluster?
     - 1000, 10,000, maybe more machines
       • Failure or crash is not exception, but common phenomenon
     - Parallelize computation
     - Distribute data
     - Balance load
     - Makes implementation of conceptually straightforward computations challenging
     - Create inverted indices
     - Representations of the graph structure of Web documents
     - Number of pages crawled per host
     - Most frequent queries in a given day

     MapReduce
     - Abstraction to express computation while hiding messy details
     - Inspired by map and reduce primitives in Lisp
     - Apply map to each input record to create set of intermediate key-value pairs
     - Apply reduce to all values that share the same key (like GROUP BY)
     - Automatically parallelized
     - Re-execution as primary mechanism for fault tolerance

     Programming Model
     - Transforms set of input key-value pairs to set of output key-value pairs
     - Map written by user
     - Map: (k1, v1) -> list (k2, v2)
     - MapReduce library groups all intermediate pairs with same key together
     - Reduce written by user
     - Reduce: (k2, list (v2)) -> list (v2)
     - Usually zero or one output value per group
     - Intermediate values supplied via iterator (to handle lists that do not fit in memory)

     Example
     Count number of occurrences of each word in a document collection:

       map( String key, String value ):
         // key: document name
         // value: document contents
         for each word w in value:
           EmitIntermediate( w, "1" );

       reduce( String key, Iterator values ):
         // key: a word
         // values: a list of counts
         int result = 0;
         for each v in values:
           result += ParseInt( v );
         Emit( AsString(result) );

     This is almost all the coding needed… (need also mapreduce specification object with names of input and output files, and optional tuning parameters)
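The word-count example above can be tried on a single machine. The following is a minimal sketch, not the MapReduce library itself: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the grouping step that the library would perform across a cluster is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Map: (document name, document contents) -> list of (word, 1)."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: (word, list of counts) -> (word, total count)."""
    return word, sum(counts)


def run_mapreduce(documents):
    """Single-process stand-in for the library (an assumption of this sketch):
    run map on every input pair, group intermediate values by key, then
    run reduce once per distinct key."""
    groups = defaultdict(list)
    for doc_name, contents in documents.items():
        for key, value in map_fn(doc_name, contents):
            groups[key].append(value)
    # Keys are visited in sorted order, mirroring the sort by intermediate key.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


docs = {"d1": "the cat sat", "d2": "the cat ran"}
counts = run_mapreduce(docs)
print(counts)  # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

Only `map_fn` and `reduce_fn` correspond to code the user would write; everything in `run_mapreduce` stands in for work the library does on the user's behalf.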

  2. Execution
     [Figure not preserved; surviving annotations:]
     - Mappers inform master about file locations
     - Master notifies reducers about intermediate file locations
     - Reducers (i) read all data from mappers, (ii) sort by intermediate key, (iii) perform computation for each group

     Implementation Overview
     - Focuses on large clusters
     - Relies on existence of reliable and highly available distributed file system
     - Map invocations
     - Automatically partition input data into M chunks (16-64 MB typically)
     - Chunks processed in parallel
     - Reduce invocations
     - Partition intermediate key space into R pieces, e.g., using hash(key) mod R
     - Master node controls program execution

     Fault Tolerance
     - Master monitors tasks on mappers and reducers: idle, in-progress, completed
     - Worker failure (common)
     - Master pings workers periodically
     - No response => assumes worker failed
       • Resets worker’s map tasks, completed or in progress, to idle state (tasks now available for scheduling on other workers)
       • Completed tasks only on local disk, hence inaccessible
       • Same for reducer’s in-progress tasks
       • Completed tasks stored in global file system, hence accessible
     - Reducers notified about change of mapper assignment
     - Master failure (unlikely)
     - Checkpointing or simply abort computation

     Practical Considerations
     - Conserve network bandwidth (“Locality optimization”)
     - Distributed file system assigns data chunks to local disks
     - Schedule map task on machine that already has a copy of the chunk, or one “nearby”
     - Choose M and R much larger than number of worker machines
     - Load balancing and faster recovery (many small tasks from failed machine)
     - Limitation: O(M+R) scheduling decisions and O(M*R) in-memory state at master
     - Common choice: M so that chunk size is 16-64 MB, R a small multiple of number of workers
     - Backup tasks to deal with machines that take unusually long for last few tasks
     - For in-progress tasks when MapReduce near completion

     Applicability of MapReduce
     - Machine learning algorithms, clustering
     - Data extraction for reports of popular queries
     - Extraction of page properties, e.g., geographical location
     - Graph computations
     - Google indexing system
     - Sequence of 5-10 MapReduce operations
     - Smaller simpler code (3800 LOC -> 700 LOC)
     - Easier to change code
     - Easier to operate, because MapReduce library takes care of failures
     - Easy to improve performance by adding more machines

     MapReduce vs. DBMS
     - Map: assume table “InputFile” with schema (key1, val1) is input; “mapFct” is a user-defined function that can output a set with schema (key2, val2)

       SELECT mapFct( key1, val1 ) AS (key2, val2)  // Not really correct SQL
       FROM InputFile

     - Reduce: assume MapOutput has schema (key2, val2); redFct is a user-defined function

       SELECT redFct( val2 )
       FROM MapOutput
       GROUP BY key2
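The partitioning rule from the Implementation Overview slide, hash(key) mod R, can be sketched as below. This is an illustration under stated assumptions rather than the library's actual partitioner: the function names are hypothetical, and `zlib.crc32` is swapped in as the hash because Python's built-in `hash()` for strings varies between processes.

```python
import zlib
from collections import defaultdict


def partition(key, R):
    """Return the index (0..R-1) of the reduce piece responsible for `key`.

    Any deterministic hash works; crc32 is an assumption of this sketch.
    """
    return zlib.crc32(key.encode("utf-8")) % R


def split_into_partitions(pairs, R):
    """Route each intermediate (key, value) pair to its reduce piece."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition(key, R)].append((key, value))
    return partitions


# Intermediate pairs as a few map tasks might emit them.
intermediate = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1), ("cat", 1)]
R = 3
parts = split_into_partitions(intermediate, R)
for index in sorted(parts):
    print(index, parts[index])
```

Because the partition depends only on the key, every pair with the same key reaches the same reducer, which is what lets each reducer sort and group its piece independently of the others.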

  3. Parallel DBMS
     - SQL specifies what to compute, not how to do it
     - Perfect for parallel and distributed implementation
     - “Just” need an optimizer that can choose best plan in given parallel/distributed system
       • Cost estimate includes disk, CPU, and network cost
     - Recent benchmarks show parallel DBMS can significantly outperform MapReduce
     - But many programmers prefer writing Map and Reduce in familiar PL (C++, Java)
     - Recent trend: High-level PL for writing MapReduce programs with DBMS-inspired operators

     MapReduce Summary
     - MapReduce = programming model that hides details of parallelization, fault tolerance, locality optimization, and load balancing
     - Simple model, but fits many common problems
     - Implementation on cluster scales to 1000s of machines and more
     - Open source implementation, Hadoop, is available
     - Parallel DBMS, SQL are more powerful than MapReduce and similarly allow automatic parallelization of “sequential code”
     - Never really achieved mainstream acceptance or broad open-source support like Hadoop
     - Recent trend: simplify coding in MapReduce by using DBMS ideas
     - (Variants of) relational operators, implemented on top of Hadoop

     SQL Injection
     - Exploits security vulnerability in database layer of a Web application when user input is not sufficiently checked and sanitized
     - Think DBMS access through Web forms
     - Main idea: pass carefully crafted string as parameter value for an SQL query
     - String executes harmful code
       • Reveals data to unauthorized user
       • Data modification by unauthorized user
       • Deletes entire table
     - The following examples are from unixwiz.net

     Getting Started
     - Assume we know nothing about Web application, except that it probably checks user email with query like this:

       SELECT attributeList
       FROM table
       WHERE attribute = '$email';

     - Typical for Web form allowing user login and send password to user’s email address
     - $email is email address submitted by user through Web form
     - Try entering name@xyz.com' in form:

       SELECT attributeList
       FROM table
       WHERE attribute = 'name@xyz.com'';

     First Code Injection
     - Query has incorrect SQL syntax
     - Getting syntax error message indicates that input is sent to server unsanitized
     - Now try injecting additional “code”:

       SELECT attributeList
       FROM table
       WHERE attribute = 'anything' OR 'x' = 'x';

     - Legal query whose WHERE clause is always satisfied
     - Might see response from system like “Your login info has been sent to somebody@somewhere.com”
     - Enough information to start exploring the actual query structure

     Guess Names of Attributes
     - Try if “email” is the right attribute name:

       SELECT attributeList
       FROM table
       WHERE attribute = 'x' AND email IS NULL; -- ';

     - Server error would indicate that attribute name “email” is probably wrong; if so, try others
     - Valid response (e.g., “Address unknown”) indicates that attribute name was correctly guessed
     - Can guess names of other attributes like “passwd”, “login_id”, “full_name” and so on
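The first two probes above (a stray quote, then an always-true OR clause) can be reproduced safely against an in-memory SQLite database. This is a minimal sketch under assumptions: the table `members`, its columns, and the helper names are hypothetical stand-ins for the unknown application. The parameterized variant is a standard mitigation added for contrast and is not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (email TEXT, passwd TEXT)")
conn.executemany(
    "INSERT INTO members VALUES (?, ?)",
    [("alice@example.com", "pw1"), ("bob@example.com", "pw2")],
)


def lookup_unsafe(email):
    """Build the query by string concatenation, as the vulnerable form would."""
    query = "SELECT email FROM members WHERE email = '" + email + "'"
    return conn.execute(query).fetchall()


def lookup_safe(email):
    """Pass the input as a bound parameter, so it is treated only as data."""
    return conn.execute(
        "SELECT email FROM members WHERE email = ?", (email,)
    ).fetchall()


# Probe 1: a trailing quote breaks the SQL syntax of the concatenated query.
try:
    lookup_unsafe("name@xyz.com'")
    syntax_error_seen = False
except sqlite3.Error:
    syntax_error_seen = True

# Probe 2: the WHERE clause becomes always true and matches every row.
crafted = "anything' OR 'x' = 'x"
unsafe_rows = lookup_unsafe(crafted)
safe_rows = lookup_safe(crafted)

print(syntax_error_seen, len(unsafe_rows), len(safe_rows))
```

With the bound parameter, the crafted string is compared literally against the `email` column, so it matches no row instead of rewriting the query.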
