Big Telco, Bigger DW Demands: Moving Towards SQL-on-Hadoop
Keuntae Park
• IT Manager of SK Telecom, South Korea’s largest wireless communications provider
• Work on commercial products (~’12)
  – T-FS: Distributed File System
  – Windows compatible layer on TimOS
  – T-MR: on-demand MapReduce service like E-MR
• Open source activity (’13~)
  – Committer of Apache Tajo project
Overview
• Background
  – Telco requirements
• Before Tajo
  – Commercial product
  – Open source (Hadoop) outsourcing
• After Tajo
  – Issues & solutions
  – Performance
• Win-win between community and company
• Future work
Telco data characteristics
• Huge amount of data
  – 40 TB/day (compressed)
  – 15 PB (estimated, end of 2014)
• Report & OLAP ad-hoc query
  – Filtering
  – Summary
  – BI tools
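Requirements - different size, different speed
• Filtering & aggregation
  – Target: data accumulated for 5 minutes
  – Frequency: every 5 minutes
  – Amount of data: terabytes
  – Response time: within a minute
• Summary
  – Target: daily sum of filtered data
  – Frequency: daily or monthly
  – Amount of data: hundreds of terabytes
  – Response time: within an hour
• Data re-construction
  – Target: entire summary data
  – Frequency: non-regularly (rare)
  – Amount of data: petabytes
  – Response time: no strict deadline
• BI report
  – Target: mart data
  – Frequency: ad-hoc
  – Amount of data: tens of gigabytes
  – Response time: within two seconds
• Ad-hoc Query
  – Target: summary data
  – Frequency: ad-hoc
  – Amount of data: tens of terabytes
  – Response time: within an hour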
Previous approach - DBMS
• based on MPP DBMS
• Too Expensive
• Not Scalable
Previous approach - Hadoop (MapReduce, Hive) + DBMS
• Working (but…)
[Diagram: Hadoop and MPP DBMS]
Still has problems
• Hadoop outsourcing
  – quality of outcome is not good (actually bad)
  – communication overhead
  – hard to reflect requirements on open source
• Data Warehouse and Mart become bigger
Solution - Tajo!!
• It can replace both DBMS and Hadoop
  – High throughput for batch processing
  – Low latency for ad-hoc queries
  – ANSI SQL compatible
• Can do it by myself
  – very open community
    • easy to file issues about what I really need
  – fast growing
    • issues are solved very fast
About Tajo
• Tajo (since 2010)
  – Big Data Warehouse System on Hadoop
  – Apache top-level project (entered the ASF in March 2013)
• Features
  – SQL standard compliance
  – Fully distributed SQL query processing
  – HDFS as a primary storage
  – Relational model (will be extended to nested model in the future)
  – ETL as well as low-latency relational query processing (100 ms ~)
• News
  – 0.2-incubating: released November 2013
  – graduation to top-level: April 2014
Tajo logical optimizer
• Cost-based join ordering
• Projection/Filter push down & Duplicated expression removal
[Diagram: two query plans over Table A (ID, QTY, Date) and Table B (ID, Tax, Price). Before: GroupBy (aggr_sum1, aggr_sum2), Filter (sel_>, sel_<), Projection, Join. After: GroupBy (aggr_sum1, aggr_sum2), Join, with Filter (sel_>, sel_<) and Projection applied to the tables below the Join.]
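
The effect of filter push down can be illustrated with a small, self-contained sketch. This is not Tajo's optimizer code; the class `PushdownSketch`, its method names, and the two-column row layout are assumptions made only for this example. It runs the same query two ways (filter after the join, filter before the join) and reports how many rows the join has to produce in each case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: table A rows are {id, qty}, table B rows are {id, price}.
class PushdownSketch {

    // Keeps only the rows that satisfy the predicate.
    static List<int[]> filter(List<int[]> rows, Predicate<int[]> keep) {
        List<int[]> out = new ArrayList<>();
        for (int[] row : rows) {
            if (keep.test(row)) {
                out.add(row);
            }
        }
        return out;
    }

    // Nested-loop equi-join on the id column; emits {id, qty, price}.
    static List<int[]> join(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>();
        for (int[] ra : a) {
            for (int[] rb : b) {
                if (ra[0] == rb[0]) {
                    out.add(new int[] {ra[0], ra[1], rb[1]});
                }
            }
        }
        return out;
    }

    // Unoptimized plan: join first, then filter.
    // Returns {rows produced by the join, rows in the final result}.
    static int[] filterAfterJoin(List<int[]> a, List<int[]> b, int minQty, int maxPrice) {
        List<int[]> joined = join(a, b);
        List<int[]> result = filter(joined, r -> r[1] > minQty && r[2] < maxPrice);
        return new int[] {joined.size(), result.size()};
    }

    // Pushed-down plan: filter each table first, then join the survivors.
    // Returns {rows produced by the join, rows in the final result}.
    static int[] filterBeforeJoin(List<int[]> a, List<int[]> b, int minQty, int maxPrice) {
        List<int[]> fa = filter(a, r -> r[1] > minQty);
        List<int[]> fb = filter(b, r -> r[1] < maxPrice);
        List<int[]> joined = join(fa, fb);
        return new int[] {joined.size(), joined.size()};
    }
}
```

Both plans return the same final row count; only the size of the join output differs, which is the point of pushing filters toward the tables.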
Tajo progressive optimization
• dynamically adjust number of tasks
• estimate data size at planning time
• check size and adjust plan at execution time
• shuffle intermediate data over workers uniformly
[Diagram: input data, execution block, intermediate data (size unknown beforehand), shuffled data …, next execution block. Question on the diagram: how many tasks (and workers)?]
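
One simple way to turn a measured intermediate data size into a task count is ceiling division with a cap. The sketch below is only an illustration of that idea; `TaskCountSketch`, `bytesPerTask`, and `maxTasks` are hypothetical names, not Tajo classes or configuration keys, and Tajo's real policy may differ.

```java
// Illustrative only: derive the number of tasks for the next execution block
// from the intermediate data size observed at execution time.
class TaskCountSketch {

    // intermediateBytes: size measured after the previous execution block finished
    // bytesPerTask:      assumed target volume per task (hypothetical tuning knob)
    // maxTasks:          assumed upper bound, e.g. available worker slots
    static int numTasks(long intermediateBytes, long bytesPerTask, int maxTasks) {
        if (intermediateBytes < 0 || bytesPerTask <= 0 || maxTasks <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and limits positive");
        }
        // Ceiling division without going through floating point.
        long needed = intermediateBytes / bytesPerTask
                + (intermediateBytes % bytesPerTask == 0 ? 0 : 1);
        // Always run at least one task, never more than the cap.
        return (int) Math.max(1L, Math.min(needed, (long) maxTasks));
    }
}
```

The planner would call something like this once the real size is known, instead of trusting the estimate made at planning time.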
Tajo progressive optimization
• dynamically adjust join order or type
[Diagram: a plan with two Hash-Joins; at execution time one Hash-Join is replaced by a Broadcast-Join]
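
A common rule for switching join type at run time is to broadcast the smaller input when it turns out to be small enough, and otherwise keep the shuffle-based hash join. The sketch below assumes that rule; the class `JoinChoiceSketch` and the threshold parameter are hypothetical and are not taken from Tajo.

```java
// Illustrative only: pick a join strategy from the input sizes measured at execution time.
class JoinChoiceSketch {

    enum JoinType { BROADCAST_JOIN, HASH_JOIN }

    // leftBytes / rightBytes:   sizes of the two join inputs observed at run time
    // broadcastThresholdBytes:  assumed limit below which an input is cheap to copy to every worker
    static JoinType choose(long leftBytes, long rightBytes, long broadcastThresholdBytes) {
        long smaller = Math.min(leftBytes, rightBytes);
        // A small side can be shipped to all workers, avoiding a shuffle of the large side.
        return smaller <= broadcastThresholdBytes ? JoinType.BROADCAST_JOIN : JoinType.HASH_JOIN;
    }
}
```

For example, with a 64 MB threshold, joining a 10 GB input with a 5 MB input would select the broadcast join, while two multi-gigabyte inputs would stay on the hash join.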
Tajo - what has improved in the past 9 months?
• Resource Manager
• Scheduler & Storage Manager
• Data types & Functions
• SQL Interface
• Management
Tajo resource manager
• Fine resource allocation
  – TAJO-127: without YARN (Tajo Master, one Tajo Worker acting as a query master, the other Tajo Workers acting as workers)
  – TAJO-275: separating Query master (Tajo Master, a separate Query Master, Tajo Workers acting as workers)
  – TAJO-317: elaborate resource allocation (Tajo Master, Query Master, Tajo Workers marked as I/O-intensive or CPU/memory)
Scheduler & Storage manager
• disk-aware scheduling (volume info from HDFS-3672)
• TAJO-84: considering disk load balance
• TAJO-178: asynchronous scan
• TAJO-134: text compression (gzip, snappy, lz4, bzip2)
• TAJO-200: RCFile
• TAJO-30: Parquet
• TAJO-435: intermediate file
[Diagram: Tajo Workers, each running several threads, on top of the Storage Manager]
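
Disk-aware scheduling can be pictured as choosing, for each scan task, the least busy disk volume among those that hold a copy of the data. The sketch below is an assumption-laden illustration of that balancing idea only; `DiskAwareSketch` and its methods are hypothetical, and it does not call the HDFS volume API or reproduce Tajo's scheduler.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: balance scan tasks across disk volumes by pending task count.
class DiskAwareSketch {

    // Number of tasks currently assigned to each volume id.
    private final Map<Integer, Integer> pending = new HashMap<>();

    // candidateVolumes: ids of the volumes that hold the block to scan.
    // Returns the chosen volume id and records the assignment.
    int assign(int[] candidateVolumes) {
        if (candidateVolumes.length == 0) {
            throw new IllegalArgumentException("no candidate volumes");
        }
        int best = candidateVolumes[0];
        int bestLoad = pending.getOrDefault(best, 0);
        for (int volume : candidateVolumes) {
            int load = pending.getOrDefault(volume, 0);
            // Strictly smaller load wins, so ties go to the earlier candidate.
            if (load < bestLoad) {
                best = volume;
                bestLoad = load;
            }
        }
        pending.merge(best, 1, Integer::sum);
        return best;
    }

    // Call when a task finishes so the volume is seen as less busy.
    void complete(int volume) {
        pending.computeIfPresent(volume, (v, n) -> n > 1 ? n - 1 : null);
    }

    int load(int volume) {
        return pending.getOrDefault(volume, 0);
    }
}
```

A real scheduler would also weigh locality and worker capacity; this sketch only shows the per-volume load balancing step.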
Functions & data types
• supporting more functions and UDFs
  – before: function1, function2, function3 registered at Tajo Master startup (class name is coded in source)
  – after: automatic function registration, plus runtime registration of user defined functions, each carrying a function description
• TAJO-408: Improve function system
• TAJO-52: standard SQL data types
Function description example:
    @Description(
      functionName = "to_timestamp",
      description = "Convert UNIX epoch to time stamp",
      example = "> SELECT to_timestamp(1389071574);\n"
          + "2014-01-07 14:12:54",
      returnType = TajoDataTypes.Type.TIMESTAMP,
      paramTypes = {@ParamTypes(paramTypes = {TajoDataTypes.Type.INT4}),
          @ParamTypes(paramTypes = {TajoDataTypes.Type.INT8})}
    )
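
Annotation-driven registration can be sketched with plain Java reflection. The code below does not use Tajo's `@Description` annotation or its function classes; `FunctionInfo`, `ToUpper`, and `FunctionRegistrySketch` are simplified, hypothetical stand-ins that only show how a registry can read a function's name from an annotation instead of a hard-coded class list.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a function description annotation.
@Retention(RetentionPolicy.RUNTIME) // must be visible to reflection at run time
@Target(ElementType.TYPE)
@interface FunctionInfo {
    String functionName();
    String description();
}

// Hypothetical user defined function described by the annotation above.
@FunctionInfo(functionName = "to_upper", description = "Convert text to upper case")
class ToUpper {
    public String eval(String text) {
        return text.toUpperCase();
    }
}

// Illustrative registry: functions are added by inspecting their annotation.
class FunctionRegistrySketch {

    private final Map<String, Class<?>> functions = new HashMap<>();

    // Registers the class under the name in its annotation.
    // Returns false when the class carries no FunctionInfo annotation.
    boolean register(Class<?> cls) {
        FunctionInfo info = cls.getAnnotation(FunctionInfo.class);
        if (info == null) {
            return false;
        }
        functions.put(info.functionName(), cls);
        return true;
    }

    // Returns the implementing class, or null when the name is unknown.
    Class<?> lookup(String name) {
        return functions.get(name);
    }

    // Returns the description text from the annotation, or null when unknown.
    String describe(String name) {
        Class<?> cls = functions.get(name);
        return cls == null ? null : cls.getAnnotation(FunctionInfo.class).description();
    }
}
```

With this shape, adding a function means adding an annotated class; the registry code itself does not change.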
JDBC Driver, HCatalog
• TAJO-16, 433: Hive metastore (HCatalog)
• TAJO-176: JDBC Driver
• TAJO-101: HiveQL converter
[Diagram: JDBC and ANSI SQL go through the SQL parser, HiveQL goes through the HiveQL parser; both yield a Tajo Algebra expression for the Query Master]
Management
• TAJO-239: Improving Web UI
• TAJO-564: Execution block progress
• TAJO-589: Task progress
• TAJO-468: Task detail info
• TAJO-474: Task admin utility