Introduction to Apache Tajo: Data Warehouse for Big Data
Jihoon Son / Gruter Inc.
About Me
● Jihoon Son (@jihoonson)
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
Outline
● About Tajo
● Features of the Recent Release
● Demo
● Roadmap
What is Tajo?
● Tajo /tάːzo/ (타조)
○ An ostrich in Korean
○ The world's fastest two-legged animal
What is Tajo?
● Apache top-level project
○ Big data warehouse system
■ ANSI-SQL compliant: mature SQL features, various types of joins, window functions (see the sketch below)
○ Rapid query execution with its own distributed DAG engine
■ Both low-latency and long-running batch queries within a single system
■ Fault tolerance
○ Beyond SQL-on-Hadoop
■ Supports various types of storage
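To illustrate the SQL coverage, a minimal sketch of a windowed query; the employee table and its columns are hypothetical, not from this deck:
> SELECT dept, name,
>        rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank
> FROM employee
> ORDER BY dept, dept_rank;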
Architecture Overview
[Diagram] Clients (JDBC client, TSQL, REST API, WebUI) submit a query to the Tajo Master, whose Catalog Server manages metadata (backed by a DBMS or HCatalog). The master allocates the query to a Query Master on one Tajo Worker, which sends tasks to and monitors the Query Executors on the other workers; every worker reaches the underlying storage through its Storage Service.
Who is Using Tajo?
● Use case: replacement of a commercial DW
○ The 1st telco in South Korea
■ Replacement of long-running ETL workloads on several-TB datasets
■ Lots of daily reports about user behavior
■ Ad-hoc analysis on TB datasets
○ Benefits
■ Simplified architecture for data analysis: a unified system for DW ETL, OLAP, and Hadoop ETL
■ Much lower cost and more data analysis within the same SLA, saving the license fee of the commercial DW
Who is Using Tajo?
● Use case: data discovery
○ A music streaming service (26 million users)
■ Analysis of purchase history for target marketing
○ Benefits
■ Interactive queries on large datasets
■ Data analysis with familiar BI tools
Recent Release: 0.11
● Feature highlights
○ Query federation
○ JDBC-based storage support
○ Self-describing data format support
○ Multi-query support
○ More stable and efficient join execution
○ Index support
○ Python UDF/UDAF support
Recent Release: 0.11
● Today's topics
○ Query federation
○ JDBC-based storage support
○ Self-describing data format support
○ Multi-query support
○ More stable and efficient join execution
○ Index support
○ Python UDF/UDAF support
Query Federation with Tajo
Your Data
● Your data might be spread across multiple heterogeneous sites
○ Cloud, DBMS, Hadoop, NoSQL, …
[Diagram] An application drawing on on-premise storage, a DBMS, NoSQL, and cloud storage.
Your Data
● Even within a single site, your data might be stored in different data formats
○ CSV, JSON, Parquet, ORC, logs, …
Your Data
● How do you analyze distributed data? Traditionally:
○ ETL transforms build a global view
○ Complex data flow
○ Long delivery times
○ Human-intensive
[Diagram] Data from the application, on-premise storage, DBMS, NoSQL, and cloud storage is ETL-transformed into a global view.
Your Data with Tajo
● Query federation
○ Global view
○ Simple data flow
○ Fast delivery
○ Easy maintenance
[Diagram] Tajo queries the application's data on on-premise storage, the DBMS, NoSQL, and cloud storage in place.
Storage and Data Format Support
[Diagram] Logos of the supported storage types and data formats.
Create Table
> CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH ('text.delimiter'='|') LOCATION 'hdfs://localhost:8020/archive1';
> CREATE EXTERNAL TABLE user (user_id BIGINT, ...) USING orc WITH ('orc.compression.kind'='snappy') LOCATION 's3://user';
> CREATE EXTERNAL TABLE table1 (key TEXT, ...) USING hbase LOCATION 'hbase:zk://localhost:2181/uptodate';
> ...
Each statement names its data format in the USING clause and its storage URI in the LOCATION clause.
Create Table
> CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH ('text.delimiter'='|','text.null'='\\N','compression.codec'='org.apache.hadoop.io.compress.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive1';
> CREATE EXTERNAL TABLE archive2 (id BIGINT, ...) USING text WITH ('text.delimiter'='|','text.null'='\\N','compression.codec'='org.apache.hadoop.io.compress.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive2';
> CREATE EXTERNAL TABLE archive3 (id BIGINT, ...) USING text WITH ('text.delimiter'='|','text.null'='\\N','compression.codec'='org.apache.hadoop.io.compress.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive3';
> ...
Too tedious!
Introduction to Tablespace
● Tablespace
○ A registered storage space, identified by a unique URI
○ Configurations and policies are shared by all tables in a tablespace
■ Storage type
■ Default data format and supported data formats
○ Lets users reuse registered storage configurations and policies
Tablespaces, Databases, and Tables
[Diagram] Databases are namespaces while tablespaces are physical spaces: one database's tables (Table1, Table2, Table3, …) may live in different tablespaces (Tablespace1, Tablespace2, Tablespace3, …), and several tablespaces may sit on the same underlying storage (Storage1, Storage2, …).
Tablespace Configuration { Tablespace name "spaces" : { Tablespace URI "warehouse" : { "uri" : "hdfs://localhost:8020/tajo/warehouse", "configs" : [ {'text.delimiter'='|'}, {'text.null'='\\N'}, {'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'}, {'timezone'='UTC+9'}, {'text.skip.headerlines'='2'} ] }, "hbase1" : { "uri" : "hbase:zk://localhost:2181/table1" } } } 21
Create Table
> CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse;
The data format is omitted, so the default (TEXT) is used, together with the configuration registered for the "warehouse" tablespace:
"warehouse": {
  "uri": "hdfs://localhost:8020/tajo/warehouse",
  "configs": [
    {"text.delimiter": "|"},
    {"text.null": "\\N"},
    {"compression.codec": "org.apache.hadoop.io.compress.SnappyCodec"},
    {"timezone": "UTC+9"},
    {"text.skip.headerlines": "2"}
  ]
}
Create Table
> CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE archive2 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE archive3 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE user (user_id BIGINT, ...) TABLESPACE aws USING orc WITH ('orc.compression.kind'='snappy');
> CREATE TABLE table1 (key TEXT, ...) TABLESPACE hbase1;
> ...
Querying on Different Data Silos
● How does a worker access different data sources?
○ The storage service returns a proper handler for the underlying storage
> SELECT ... FROM hdfs_table, hbase_table, ...
[Diagram] Each Tajo Worker pairs a query engine with a storage service; the storage service hands out an HDFS or HBase handler depending on where each table lives. A fuller example of such a query follows below.
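Fleshing out the elided query above, a federated join could look like this; the tables reuse the earlier CREATE TABLE examples, but the join columns are assumptions for illustration:
> SELECT a.id, count(*)
> FROM archive1 a JOIN table1 t ON CAST(a.id AS TEXT) = t.key
> GROUP BY a.id;
Both scans run inside one Tajo DAG, so no manual export from HBase to HDFS (or vice versa) is needed.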
JDBC-based Storage Support
JDBC-based Storage
● Storage providing the JDBC interface
○ PostgreSQL, MySQL, MariaDB, ...
● Databases of JDBC-based storage are mapped to Tajo databases
[Diagram] jdbc_db1 and jdbc_db2 on the JDBC-based storage map to tajo_db1 and tajo_db2 in Tajo, each carrying its tables (Table1, Table2, Table3, …).
Tablespace Configuration { Tablespace name "spaces": { PostgreSQL "pgsql_db1": { database name "uri": "jdbc:postgresql://hostname:port/db1" Tajo "configs": { database name "mapped_database": "tajo_db1" "connection_properties": { "user": "tajo", "password": "xxxx" } } } } } 27
Return to Query Federation
● How do you correlate data on JDBC-based storage with data elsewhere?
○ You need a global view of metadata across the different storage types
■ Tajo has its own metadata for its data
■ Each JDBC-based storage has its own metadata for its data
■ Each NoSQL storage has metadata for its data
■ …
Metadata Federation
● Federating the metadata of the underlying storage
○ The linked metadata manager serves tablespaces, databases, tables, and schema names through the catalog interface
○ One metadata provider per storage type: DBMS metadata provider, Tajo catalog metadata provider, NoSQL metadata provider, …
[Diagram] The catalog interface sits on the linked metadata manager, which pulls metadata from providers backed by a DBMS, HCatalog, and so on.
Querying on JDBC-based Storage
● A query plan is converted into a SQL string
● Query generation
○ Different types of storage support diverse SQL syntax
○ A different SQL builder for each storage type
[Diagram] The Tajo Master sends the query plan to a Tajo Worker, which generates a SELECT statement and ships it to the JDBC-based storage.
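For instance, given a Tajo query over the mapped database from the earlier configuration, the PostgreSQL-specific builder would emit plain PostgreSQL SQL. The generated text below is a sketch of the idea, not Tajo's verbatim output:
> SELECT name FROM tajo_db1.table1 WHERE id < 10;
shipped to PostgreSQL as something like:
SELECT name FROM table1 WHERE id < 10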
Operation Push Down
● Tajo can exploit the processing capability of the underlying storage
○ DBMSs, MongoDB, HBase, …
● Operations are pushed down into the underlying storage
○ Leveraging the advanced features that the storage provides
■ e.g., a DBMS's query optimizer, indexes, …
Example 1
SELECT count(*)
FROM account ac, archive ar
WHERE ac.key = ar.id AND ac.name = 'tajo'
[Diagram] In the plan (count(*) aggregation over a join on ac.key = ar.id), the scan of archive is a full scan on HDFS, while the scan of account with the predicate ac.name = 'tajo' is pushed down to the DBMS, which returns only the matching rows.
Example 2
SELECT ac.name, count(*)
FROM account ac
GROUP BY ac.name
[Diagram] The whole aggregation (scan of account plus the group-by count(*)) is pushed down to the DBMS, which returns only the result.
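A sketch of the storage-side query that Example 2 could push down to the DBMS (the actual generated SQL depends on that storage's SQL builder):
SELECT name, count(*) FROM account GROUP BY name;
Only the aggregated rows cross the wire, instead of the full account table.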
Self-describing Data Format Support