Big Telco, Bigger DW Demands: Moving Towards SQL-on-Hadoop

  1. Big Telco, Bigger DW Demands: Moving Towards SQL-on-Hadoop

  2. Keuntae Park • IT Manager of SK Telecom, South Korea’s largest wireless communications provider • Work on commercial products (~’12) – T-FS: Distributed File System – Windows compatible layer on TimOS – T-MR: on-demand MapReduce service like E-MR • Open source activity (’13~) – Committer of Apache Tajo project

  3. Overview • Background – Telco requirements • Before Tajo – Commercial product – Open source (Hadoop) outsourcing • After Tajo – Issues & solutions – Performance • win-win between community and company • Future Works

  4. Telco data characteristics • Huge amount of data – 40 TB/day (compressed) – 15 PB (estimated, end of 2014) • Report & OLAP ad-hoc query – Filtering – Summary – BI tools

  5. Requirements - different size, different speed (table; cell values listed per row in extraction order, column alignment not recoverable) • Workloads: Filtering & aggregation, Data re-construction, Summary, BI report, Ad-hoc query • Target: accumulated for 5 minutes; daily sum of filtered data; entire summary data; mart data; summary data • Frequency: every 5 minutes; daily or monthly; non-regularly (rare); ad-hoc; ad-hoc • Amount of data: hundreds of terabytes; tens of gigabytes; tens of terabytes; terabytes; petabytes • Response time: within a minute; no strict deadline; within two seconds; within an hour; within an hour

  6. Previous approach - DBMS • based on MPP DBMS

  7. Previous approach - DBMS • based on MPP DBMS • Too expensive • Not scalable

  10. Previous approach - Hadoop (MapReduce, Hive) + DBMS • Diagram: Hadoop, MPP DBMS

  11. Previous approach - Hadoop (MapReduce, Hive) + DBMS • Working (but…) • Diagram: Hadoop, MPP DBMS

  12. Still has problems • Hadoop outsourcing – quality of outcome is not good (actually bad) – communication overhead – hard to reflect requirements on open source • Data warehouse and mart keep getting bigger

  13. Solution - Tajo!! • It can replace both DBMS and Hadoop – High throughput for batch processing – Low latency for ad-hoc queries – ANSI SQL compatible • Can do it myself – very open community • easy to raise issues about what I really need – fast growing • issues are resolved very fast

  14. About Tajo • Tajo (since 2010) – Big Data Warehouse System on Hadoop – Apache top-level project (entered the Apache Incubator in March 2013) • Features – SQL standard compliance – Fully distributed SQL query processing – HDFS as the primary storage – Relational model (will be extended to nested model in the future) – ETL as well as low-latency relational query processing (100 ms ~) • News – 0.2-incubating: released November 2013 – graduation to top-level: April 2014

  15. Tajo logical optimizer • Cost-based join ordering • Projection/Filter push down & Duplicated expression removal • Diagram (plan operators): GroupBy (aggr_sum1, aggr_sum2), Filter (sel_>, sel_<), Projection, Join over Table A (ID, QTY, Date) and Table B (ID, Tax, Price)

  16. Tajo logical optimizer • Cost-based join ordering • Projection/Filter push down & Duplicated expression removal • Diagram: two plans side by side, both built from GroupBy (aggr_sum1, aggr_sum2), Filter (sel_>, sel_<), Projection and Join over Table A (ID, QTY, Date) and Table B (ID, Tax, Price); the second plan places the operators in a different order
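
The filter push down named on slides 15 and 16 can be illustrated with a toy example. The sketch below is not Tajo's optimizer; the class `FilterPushDownSketch`, the two-column row layout and the `qty > minQty` predicate are all assumptions made for illustration. It only shows that filtering Table A before the join returns the same rows as filtering after it, while fewer rows enter the join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Toy illustration of filter push down; hypothetical, not Tajo's optimizer code. */
final class FilterPushDownSketch {
    private FilterPushDownSketch() {}

    /** Nested-loop equi-join on column 0: {id, qty} x {id, price} -> {id, qty, price}. */
    static List<int[]> join(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>();
        for (int[] ra : a) {
            for (int[] rb : b) {
                if (ra[0] == rb[0]) {
                    out.add(new int[] {ra[0], ra[1], rb[1]});
                }
            }
        }
        return out;
    }

    /** Keeps the rows that satisfy the predicate, preserving their order. */
    static List<int[]> filter(List<int[]> rows, Predicate<int[]> keep) {
        List<int[]> out = new ArrayList<>();
        for (int[] row : rows) {
            if (keep.test(row)) {
                out.add(row);
            }
        }
        return out;
    }

    /** Original plan: join everything, then apply qty > minQty (qty is column 1). */
    static List<int[]> filterAfterJoin(List<int[]> a, List<int[]> b, int minQty) {
        return filter(join(a, b), row -> row[1] > minQty);
    }

    /** Rewritten plan: apply qty > minQty to Table A first, then join the survivors. */
    static List<int[]> filterBeforeJoin(List<int[]> a, List<int[]> b, int minQty) {
        return join(filter(a, row -> row[1] > minQty), b);
    }

    /** Element-wise comparison, needed because int[] does not override equals. */
    static boolean sameRows(List<int[]> x, List<int[]> y) {
        if (x.size() != y.size()) {
            return false;
        }
        for (int i = 0; i < x.size(); i++) {
            if (!Arrays.equals(x.get(i), y.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<int[]> tableA = Arrays.asList(new int[] {1, 5}, new int[] {2, 20}, new int[] {3, 30});
        List<int[]> tableB = Arrays.asList(new int[] {1, 100}, new int[] {2, 200}, new int[] {3, 300});
        boolean same = sameRows(filterAfterJoin(tableA, tableB, 10), filterBeforeJoin(tableA, tableB, 10));
        System.out.println("same result: " + same);
        System.out.println(filter(tableA, row -> row[1] > 10).size() + " of " + tableA.size() + " rows reach the join");
    }
}
```

In a distributed engine the benefit is mostly in the data that no longer has to be shuffled to the join, which this single-process sketch does not model.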

  17. Tajo progressive optimization • dynamically adjust number of tasks – estimate data size at planning time – check size and adjust plan at execution time – shuffle intermediate data over workers uniformly • Diagram: input data, execution block, shuffled data, execution block; the size of the intermediate data is not known in advance, so how many tasks (and workers)?
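
The idea on slide 17 (estimate at planning time, re-check at execution time, spread the shuffled data uniformly) can be sketched as simple arithmetic. This is a minimal sketch under assumptions: the class `ProgressiveTaskPlanner`, the per-task byte budget and the round-robin spreading are hypothetical and are not taken from Tajo's source.

```java
/** Hypothetical sketch: size the next execution block from the observed shuffle volume. */
final class ProgressiveTaskPlanner {
    private ProgressiveTaskPlanner() {}

    /** Ceiling division so that no task handles more than bytesPerTask; never fewer than 1 task. */
    static int taskCount(long intermediateBytes, long bytesPerTask) {
        if (intermediateBytes < 0 || bytesPerTask <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and the budget positive");
        }
        long tasks = (intermediateBytes + bytesPerTask - 1) / bytesPerTask;
        return (int) Math.max(1L, Math.min(tasks, (long) Integer.MAX_VALUE));
    }

    /** Spreads tasks over workers round-robin, so per-worker counts differ by at most one. */
    static int[] tasksPerWorker(int tasks, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("need at least one worker");
        }
        int[] counts = new int[workers];
        for (int t = 0; t < tasks; t++) {
            counts[t % workers]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long perTask = 256L << 20;   // assumed budget: 256 MiB per task
        long estimated = 64L << 30;  // planning-time estimate: 64 GiB
        long observed = 10L << 30;   // size measured after the previous block ran: 10 GiB
        System.out.println("planned tasks:  " + taskCount(estimated, perTask));  // 256
        System.out.println("adjusted tasks: " + taskCount(observed, perTask));   // 40
    }
}
```

Here the plan built from the estimate would have launched 256 tasks, while the measured intermediate size only justifies 40.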

  18. Tajo progressive optimization • dynamically adjust join order or type • Diagram: Hash-Join, Hash-Join

  19. Tajo progressive optimization • dynamically adjust join order or type • Diagram: Hash-Join, Broadcast-Join, Hash-Join
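
Slides 18 and 19 show a Hash-Join being replaced by a Broadcast-Join at run time. One common rule for such a switch, sketched below, is to broadcast when the smaller input turns out to fit under a size threshold. The class `JoinTypeChooser` and the threshold parameter are assumptions for illustration; the slides do not state which criterion Tajo applies.

```java
/** Hypothetical sketch of picking a join type once the real input sizes are known. */
final class JoinTypeChooser {
    enum JoinType { HASH_JOIN, BROADCAST_JOIN }

    private JoinTypeChooser() {}

    /**
     * Chooses a broadcast join when the smaller side is at most the threshold,
     * so it can be copied to every worker instead of shuffling both sides.
     */
    static JoinType choose(long leftBytes, long rightBytes, long broadcastThresholdBytes) {
        long smaller = Math.min(leftBytes, rightBytes);
        return smaller <= broadcastThresholdBytes ? JoinType.BROADCAST_JOIN : JoinType.HASH_JOIN;
    }

    public static void main(String[] args) {
        long threshold = 64L << 20; // assumed threshold: 64 MiB
        // Planning-time estimate: both sides look large, so a hash join is planned.
        System.out.println(choose(50L << 30, 8L << 30, threshold));
        // Execution time: a filter left only 20 MiB on one side, so the plan switches.
        System.out.println(choose(50L << 30, 20L << 20, threshold));
    }
}
```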

  20. Tajo - what has improved in the past 9 months? • Resource Manager • Scheduler & Storage Manager • Data types & Functions • SQL Interface • Management

  21. Tajo resource manager • Fine resource allocation • TAJO-127 (without YARN): Tajo Master; Tajo Workers, one acting as a query master and the others as workers

  22. Tajo resource manager • Fine resource allocation • TAJO-127 (without YARN): Tajo Master; Tajo Workers, one acting as a query master and the others as workers • TAJO-275 (separating Query master): Tajo Master; Query Master; Tajo Workers acting as workers

  23. Tajo resource manager • Fine resource allocation • TAJO-127 (without YARN): Tajo Master; Tajo Workers, one acting as a query master and the others as workers • TAJO-275 (separating Query master): Tajo Master; Query Master; Tajo Workers acting as workers • TAJO-317 (elaborate resource allocation): Tajo Master; Query Master; Tajo Workers labeled I/O-intensive or CPU/memory

  24. Scheduler & Storage manager • disk-aware scheduling (volume info from HDFS-3672) • Diagram: Tajo Workers, each running several threads

  25. Scheduler & Storage manager • disk-aware scheduling (volume info from HDFS-3672) • Diagram: Tajo Workers, each running several threads, on top of the Storage Manager • TAJO-84: considering disk load balance • TAJO-178: asynchronous scan

  26. Scheduler & Storage manager • disk-aware scheduling (volume info from HDFS-3672) • Diagram: Tajo Workers, each running several threads, on top of the Storage Manager • TAJO-134: text compression (gzip, snappy, lz4, bzip2) • TAJO-200: RCFile • TAJO-30: Parquet • TAJO-84: considering disk load balance • TAJO-435: intermediate file • TAJO-178: asynchronous scan
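
Disk-aware scheduling with disk load balance (TAJO-84 on the slides) can be pictured as choosing, for every scan task, the least busy disk volume among those that hold a replica of the block. The sketch below is a simplified, single-node illustration; the class `DiskAwareScheduler`, the integer volume ids and the "fewest pending tasks" rule are assumptions, not Tajo's actual scheduler.

```java
/** Hypothetical sketch: balance scan tasks across the disk volumes of one worker. */
final class DiskAwareScheduler {
    private final int[] pendingPerVolume;

    DiskAwareScheduler(int volumeCount) {
        if (volumeCount <= 0) {
            throw new IllegalArgumentException("need at least one volume");
        }
        this.pendingPerVolume = new int[volumeCount];
    }

    /**
     * Assigns a task to the candidate volume with the fewest pending tasks.
     * On a tie the candidate listed earliest in the array is kept.
     */
    int assign(int[] candidateVolumes) {
        if (candidateVolumes.length == 0) {
            throw new IllegalArgumentException("no volume holds this block");
        }
        int best = candidateVolumes[0];
        for (int volume : candidateVolumes) {
            if (pendingPerVolume[volume] < pendingPerVolume[best]) {
                best = volume;
            }
        }
        pendingPerVolume[best]++;
        return best;
    }

    /** Marks one task on the given volume as finished. */
    void complete(int volume) {
        if (pendingPerVolume[volume] == 0) {
            throw new IllegalStateException("no pending task on volume " + volume);
        }
        pendingPerVolume[volume]--;
    }

    int pending(int volume) {
        return pendingPerVolume[volume];
    }

    public static void main(String[] args) {
        DiskAwareScheduler scheduler = new DiskAwareScheduler(3);
        System.out.println(scheduler.assign(new int[] {0, 1})); // both idle, first candidate wins
        System.out.println(scheduler.assign(new int[] {0, 1})); // volume 0 is busy now
        System.out.println(scheduler.assign(new int[] {0, 2}));
    }
}
```

The volume ids stand in for the per-block volume information that the slide attributes to HDFS-3672; how that information is fetched is outside this sketch.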

  27. Functions & data types • supporting more functions and UDFs • Diagram: function1, function2, function3 registered in the Tajo Master at startup (class name is coded in source)

  28. Functions & data types • supporting more functions and UDFs • Diagram: functions registered in the Tajo Master at startup (class name is coded in source) next to user defined functions described by an annotation • @Description(functionName = "to_timestamp", description = "Convert UNIX epoch to time stamp", example = "> SELECT to_timestamp(1389071574);\n" + "2014-01-07 14:12:54", returnType = TajoDataTypes.Type.TIMESTAMP, paramTypes = {@ParamTypes(paramTypes = {TajoDataTypes.Type.INT4}), @ParamTypes(paramTypes = {TajoDataTypes.Type.INT8})}) • TAJO-408: Improve function system

  29. Functions & data types • supporting more functions and UDFs • automatic function registration • runtime function registration for user defined functions • description taken from the same @Description annotation as on slide 28 • TAJO-408: Improve function system

  30. Functions & data types • supporting more functions and UDFs • automatic function registration • runtime function registration for user defined functions • description taken from the same @Description annotation as on slide 28 • TAJO-52: standard SQL data types • TAJO-408: Improve function system
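
The move from hard-coded class names to annotation-driven registration can be sketched with plain Java reflection. Everything named here is hypothetical: `FunctionInfo` is a simplified stand-in for the `@Description` annotation shown on the slides, and `ScalarFunction` and `FunctionRegistrySketch` are invented for illustration; none of it is Tajo's real function API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of registering functions at runtime from an annotation. */
final class FunctionRegistrySketch {
    /** Simplified stand-in for the slide's @Description; not Tajo's real annotation. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface FunctionInfo {
        String functionName();
        String description() default "";
    }

    /** Invented minimal function interface. */
    interface ScalarFunction {
        Object eval(Object... args);
    }

    @FunctionInfo(functionName = "to_timestamp", description = "Convert UNIX epoch to time stamp")
    static final class ToTimestamp implements ScalarFunction {
        @Override
        public Object eval(Object... args) {
            return Instant.ofEpochSecond(((Number) args[0]).longValue());
        }
    }

    private final Map<String, ScalarFunction> functions = new HashMap<>();

    /** Reads the annotation via reflection, so the name is not hard-coded in the registry. */
    void register(Class<? extends ScalarFunction> type) {
        FunctionInfo info = type.getAnnotation(FunctionInfo.class);
        if (info == null) {
            throw new IllegalArgumentException(type.getName() + " lacks @FunctionInfo");
        }
        try {
            functions.put(info.functionName(), type.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + type.getName(), e);
        }
    }

    Object call(String name, Object... args) {
        ScalarFunction function = functions.get(name);
        if (function == null) {
            throw new IllegalArgumentException("unknown function: " + name);
        }
        return function.eval(args);
    }

    public static void main(String[] args) {
        FunctionRegistrySketch registry = new FunctionRegistrySketch();
        registry.register(ToTimestamp.class);
        // Prints 2014-01-07T05:12:54Z (UTC); the slide's 14:12:54 is the same instant at UTC+9.
        System.out.println(registry.call("to_timestamp", 1389071574L));
    }
}
```

Because the registry only needs a class object, a user defined function can be added while the process is running, which is the behavior the "runtime function registration" label on the slides points at.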

  31. JDBC Driver, HCatalog • TAJO-16, 433: HCatalog (Hive metastore) • TAJO-176: JDBC Driver • TAJO-101: HiveQL converter • Diagram: JDBC and ANSI SQL go through the SQL parser, HiveQL goes through the HiveQL parser; both produce a Tajo Algebra expression for the Query Master

  32. Management TAJO-239 Improving Web UI

  33. Management TAJO-564 Execution block progress

  34. Management TAJO-589 Task progress

  35. Management TAJO-468 Task detail info

  36. Management TAJO-474 Task admin utility
