Hive – A Petabyte Scale Data Warehouse Using Hadoop
  Authors: Facebook Data Infrastructure Team
  Conference: Data Engineering (ICDE), 2010 IEEE
  Course: CS 743, Fall 2014, University of Waterloo
  Presenter: Malek NAOUACH, Nets&Dist Sys
  Date: November 13th, 2014
Overview
  Diagram labels: Big Data Processing, Decisions Making, MapReduce, Hadoop, Hive, Massively Parallel, Fault Tolerant, Linearly Scalable, Familiarity
Hive Data Structure
  Primitive Data Types: INT | TINYINT | SMALLINT | BIGINT | BOOLEAN | FLOAT
  Complex Data Types: Associative arrays | Lists | Structs
  Complex Datatypes Composition: list<map<string, struct<p1:int, p2:int>>>
  Complex Schema Creation:
    CREATE TABLE t1(st string, fl float, li list<map<string, struct<p1:int, p2:int>>>)
  Hive Data Incorporation: SerDe Interface + ObjectInspector Interface + getObjectInspector method
  **Serialization: the process of translating data structures or object state into a format that can be stored and reconstructed later.
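As a rough illustration of the SerDe and ObjectInspector idea on this slide, the Python sketch below serializes a row matching the t1 schema and exposes a description of its columns. Hive's real interfaces are Java; the class name JsonLineSerDe, the JSON encoding, and the Python method names are assumptions made only for illustration.

```python
import json


# Hypothetical Python analogue of a Hive SerDe. Hive's actual SerDe and
# ObjectInspector interfaces are Java and differ in detail.
class JsonLineSerDe:
    def __init__(self, schema):
        # schema: ordered list of (column name, type string) pairs
        self.schema = schema

    def get_object_inspector(self):
        # Describe the row layout so a caller can look up column types
        # without knowing how rows are stored.
        return dict(self.schema)

    def serialize(self, row):
        # Translate an in-memory row into a storable string.
        return json.dumps({name: row[name] for name, _ in self.schema})

    def deserialize(self, blob):
        # Reconstruct the row from its stored form.
        data = json.loads(blob)
        return {name: data[name] for name, _ in self.schema}


schema = [
    ("st", "string"),
    ("fl", "float"),
    ("li", "list<map<string, struct<p1:int, p2:int>>>"),
]
serde = JsonLineSerDe(schema)
row = {"st": "a", "fl": 1.5, "li": [{"k": {"p1": 1, "p2": 2}}]}
restored = serde.deserialize(serde.serialize(row))
```

The point of the split is that the query engine only talks to the inspector and the serialize/deserialize pair, so a new storage format can be added without changing the engine.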
Hive Query Language
  HiveQL Semantics (SQL): SUBQUERIES | INNER, LEFT & RIGHT OUTER JOINS | CARTESIAN PROD | GROUP BY | AGGREGATION | UNION | CREATE TABLE
  NOT HiveQL Semantics: INSERT INTO | UPDATE | DELETE
  HiveQL Data Insertion: INSERT OVERWRITE
  HiveQL Supports Map-Red Programs:
    FROM (
      MAP stocks USING 'python ce_mapper.py' AS (company,value)
      FROM stocksStat
      CLUSTER BY value
    ) a
    REDUCE company,value USING 'python ce_reduce.py'
  **HQL: Hibernate Query Language (not HiveQL)
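The slide does not show what ce_mapper.py and ce_reduce.py contain. Hive streams rows to such scripts as tab-separated lines on standard input and reads tab-separated lines back from standard output, so a minimal sketch could look like the following. The function names and the per-company sum are hypothetical, and the reducer assumes rows for one company arrive next to each other, which would require clustering by company rather than by value.

```python
from itertools import groupby


def map_lines(lines):
    # Hypothetical mapper: pass through well-formed "company<TAB>value"
    # rows and silently skip malformed ones.
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            yield f"{parts[0]}\t{parts[1]}"


def reduce_lines(lines):
    # Hypothetical reducer: sum the values of consecutive rows that share
    # a company. Assumes the input is already grouped by company.
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for company, group in groupby(rows, key=lambda r: r[0]):
        total = sum(float(value) for _, value in group)
        yield f"{company}\t{total}"


mapped = list(map_lines(["acme\t1.5\n", "bad\n", "acme\t2.5\n", "zed\t4\n"]))
reduced = list(reduce_lines(mapped))
```

In an actual script each function would be fed sys.stdin and its output written line by line to sys.stdout.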
Data Storage
  Diagram labels: Hive, MetaStore, Library, HDFS, Schema, MetaData, Logical Partitioning, Prune/Bucket, Stocks, Buckets, Data, …
  Table definition:
    CREATE TABLE Stocks (Company STRING, val DOUBLE)
    PARTITIONED BY (day STRING, hr INT);
  HDFS layout:
    /hive/stocks/
    /hive/stocks/2014-11-13/
    /hive/stocks/2014-11-13/10
    /hive/stocks/2014-11-13/11
    /hive/stocks/2014-11-13/12
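To make the link between partition columns and directories concrete, here is a small Python sketch of how a partition's values could map to an HDFS path and how a filter on a partition column lets whole directories be skipped. The slide abbreviates directory names to the bare values; Hive itself names partition directories column=value, which is what the sketch uses. The helper names partition_path and prune are hypothetical.

```python
def partition_path(table_root, partition_values):
    # partition_values: ordered list of (column, value) pairs, in the
    # order given by PARTITIONED BY.
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([table_root.rstrip("/")] + parts)


def prune(paths, column, value):
    # Keep only the partition directories whose path contains
    # column=value; the others never need to be read.
    wanted = f"{column}={value}"
    return [p for p in paths if wanted in p.split("/")]


paths = [
    partition_path("/hive/stocks/", [("day", "2014-11-13"), ("hr", hr)])
    for hr in (10, 11, 12)
]
selected = prune(paths, "hr", 11)
```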
System Architecture (1/3)
  Hive
    Interfaces: CLI | Web | JDBC | ODBC
    Thrift Server
    MetaStore
    Driver (Compiler, Optimizer, Executor)
  HADOOP (MAP-REDUCE + HDFS)
    Name Node
    Job Tracker
    Data Node + Task Tracker
System Architecture (2/3)
  Components: Hive (CLI, Web UI, JDBC Interf., ODBC Interf., Thrift Interf., E.Client, Driver, Query Compiler, MetaStore, Execution Engine), HADOOP
  Query flow (step labels as numbered on the slide):
    1. exeHiveQuery
    2. getExePhysPlan
    3. getMetaData
    4. sendMetaData
    5. sendExePhysPlan
    5. exePhysPlan
    6.1. exeJob
    6.1. metaDataOps forDDLs
    6.2. jobDone
    7. fetchResults
    8. sendResults
  **Interoperability: the ability of a system to work with other systems without special effort on the customer side.
  **Logical/Physical Plan: Abstract Syntax Tree (AST) for the query, Query Block Tree, Involved Interfaces, Directed Acyclic Graph
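The numbered steps on this slide can be read as a chain of calls between the Driver, the Query Compiler, the MetaStore, and the Execution Engine. The Python sketch below replays that chain and records the step labels in order. The step labels are copied from the slide; the classes, method names, and the plan dictionary are hypothetical stand-ins, and no real job is run.

```python
# Hypothetical stand-ins for the components on the slide.
class MetaStore:
    def __init__(self, tables):
        self.tables = tables  # table name -> list of column names

    def get_metadata(self, table):
        return self.tables[table]


class Compiler:
    def __init__(self, metastore):
        self.metastore = metastore

    def get_plan(self, query, table, trace):
        trace.append("3. getMetaData")
        columns = self.metastore.get_metadata(table)
        trace.append("4. sendMetaData")
        trace.append("5. sendExePhysPlan")
        return {"query": query, "table": table, "columns": columns}


class ExecutionEngine:
    def execute(self, plan, trace):
        trace.append("6.1. exeJob")  # a real engine would submit jobs to Hadoop
        trace.append("6.2. jobDone")
        return []  # placeholder: this sketch produces no rows


class Driver:
    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def run(self, query, table):
        trace = ["1. exeHiveQuery", "2. getExePhysPlan"]
        plan = self.compiler.get_plan(query, table, trace)
        trace.append("5. exePhysPlan")
        self.engine.execute(plan, trace)
        trace.append("7. fetchResults")
        trace.append("8. sendResults")
        return plan, trace


driver = Driver(Compiler(MetaStore({"stocks": ["company", "val"]})), ExecutionEngine())
plan, trace = driver.run("SELECT company FROM stocks", "stocks")
```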
System Architecture (3/3)
  HADOOP
    MapReduce: Job Tracker (6.1. exeJob, 6.2. jobDone)
    MapReduce Tasks:
      Task Trackers (MAP): Map Op. Tree, SerDe
      Task Trackers (Reduce): Map Op. Tree, SerDe
    HDFS: Data Nodes
HiveQL to Phys. Plan Exp. (1/3)
  FROM (SELECT a.status, b.school, b.gender
        FROM status_updates a JOIN profiles b
        ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1
  INSERT OVERWRITE TABLE gender_summary PARTITION (ds='2009-03-20')
    SELECT subq1.gender, COUNT(1)
    GROUP BY subq1.gender
  INSERT OVERWRITE TABLE school_summary PARTITION (ds='2009-03-20')
    SELECT subq1.school, COUNT(1)
    GROUP BY subq1.school
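The query on this slide joins the two tables once and feeds the result to two separate aggregations. A plain Python analogue of that shape is sketched below; the sample rows, the school names, and the function name summarize are made up for illustration and say nothing about how Hive executes the plan.

```python
from collections import Counter

# Made-up sample rows: (userid, status, ds)
status_updates = [
    (1, "hi", "2009-03-20"),
    (2, "yo", "2009-03-20"),
    (2, "again", "2009-03-20"),
    (3, "old", "2009-03-19"),
]
# Made-up rows keyed by userid: (school, gender)
profiles = {
    1: ("school_a", "F"),
    2: ("school_b", "M"),
    3: ("school_b", "F"),
}


def summarize(status_updates, profiles, ds):
    # subq1: join on userid, restricted to one ds partition.
    subq1 = [
        profiles[uid]
        for uid, _status, day in status_updates
        if day == ds and uid in profiles
    ]
    # Two aggregations share the single join result.
    gender_summary = Counter(gender for _school, gender in subq1)
    school_summary = Counter(school for school, _gender in subq1)
    return gender_summary, school_summary


gender_summary, school_summary = summarize(status_updates, profiles, "2009-03-20")
```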
HiveQL to Phys. Plan Exp. (2/3)
  Input tables:
    status_updates (userid, status, ds)
    profiles (userid, school, gender)
HiveQL to Phys. Plan Exp. (3/3)
  Output branches:
    SELECT subq1.school, COUNT(1) GROUP BY subq1.school
    SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
Brief Recap
  ✔ Hive was created to simplify big data analysis (about 1 hour for new users to master).
  ✔ Hive is improving the performance of Hadoop (+20% efficiency).
  ✔ Hive enables data processing at a fraction of the cost of more traditional data warehouses.
  ✔ Hive is working toward subsuming SQL syntax.
  ✔ Hive is enhancing the query compiler and interoperability.
  http://hadoop.apache.org/
  http://hive.apache.org/
Thanks! Questions?