Apache Kylin Introduction
Dec 8, 2014 | @ApacheKylin
Luke Han, Sr. Product Manager | lukhan@ebay.com | @lukehq
Yang Li, Architect & Tech Leader | yangli9@ebay.com
http://kylin.io

Agenda
• What’s Apache Kylin?
• Tech Highlights
• Performance
• Open Source
• Q & A

What’s Kylin
kylin /ˈkiːˈlɪn/ 麒麟 n. (in Chinese art) a mythical animal of composite form
Extreme OLAP Engine for Big Data
Kylin is an open source Distributed Analytics Engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
• Open sourced on Oct 1st, 2014
• Accepted as an Apache Incubator project on Nov 25th, 2014

Big Data Era
• More and more data becoming available on Hadoop
• Limitations in existing Business Intelligence (BI) tools
  – Limited support for Hadoop
  – Data size growing exponentially
  – High latency of interactive queries
  – Scale-up architecture
• Challenges to adopt Hadoop as an interactive analysis system
  – Majority of analyst groups are SQL savvy
  – No mature SQL interface on Hadoop
  – OLAP capability on Hadoop ecosystem not ready yet

Business Needs for Big Data Analysis
• Sub-second query latency on billions of rows
• ANSI SQL for both analysts and engineers
• Full OLAP capability to offer advanced functionality
• Seamless integration with BI tools
• Support of high cardinality and high dimensions
• High concurrency – thousands of end users
• Distributed and scale-out architecture for large data volume
• Open source solution

Why not Build an engine from scratch?

Analytics Query Taxonomy
Kylin is designed to accelerate the performance of 80+% of analytics queries on Hadoop
Query levels, highest to lowest:
• High Level Aggregation: very high level, e.g. GMV by site by vertical by weeks
• Analysis Query: middle level, e.g. GMV by site by vertical, by category (level x), past 12 weeks
• Drill Down to Detail: detail level (summary table)
• Low Level Aggregation: first level aggregation
• Transaction Level: transaction data
Side labels in the diagram: Strategy, OLAP, Operation, OLTP

Technical Challenges
• Huge volume data
  – Table scan
• Big table joins
  – Data shuffling
• Analysis on different granularity
  – Runtime aggregation expensive
• MapReduce job
  – Batch processing

OLAP Cube – Balance between Space and Time
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
Cuboid lattice for the dimensions time, item, location, supplier:
  0-D (apex) cuboid
  1-D cuboids: time; item; location; supplier
  2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
  3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
  4-D (base) cuboid: (time, item, location, supplier)
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
  2. (9/15, milk, Urbana, *) - <time, item, location>
  3. (*, milk, Urbana, *) - <item, location>
  4. (*, milk, Chicago, *) - <item, location>
  5. (*, milk, *, *) - <item>
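The cuboid lattice on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not Kylin code: the function names `enumerate_cuboids` and `is_ancestor` are hypothetical, and the ancestor test assumes the usual data-cube convention that an ancestor cell agrees with its descendant wherever it is not `*`.

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Group every subset of the dimensions (every cuboid) by its size."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

def is_ancestor(a, b):
    """True if cell a is a strict ancestor of cell b.

    a must agree with b wherever a is not '*', and must differ from b
    in at least one position (assumed convention, see lead-in).
    """
    return a != b and all(x == "*" or x == y for x, y in zip(a, b))

dims = ("time", "item", "location", "supplier")
lattice = enumerate_cuboids(dims)
total = sum(len(cuboids) for cuboids in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D: {len(cuboids)} cuboid(s)")
print("total:", total)  # 2 ** 4 cuboids for 4 dimensions

base_cell = ("9/15", "milk", "Urbana", "Dairy_land")
print(is_ancestor(("*", "milk", "*", "*"), base_cell))        # cell 5 vs. cell 1
print(is_ancestor(("*", "milk", "Chicago", "*"), base_cell))  # cell 4 vs. cell 1
```

With n dimensions the full cube has 2^n cuboids, which is why the space side of the space/time balance grows so quickly as dimensions are added.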

From Relational to Key-Value

Kylin Architecture Overview
• Online Analysis Data Flow
• Offline Data Flow
• Clients/users interact with Kylin via SQL
• OLAP Cube is transparent to users
Diagram components:
• 3rd Party App (Web App, Mobile…), connecting via REST API
• SQL-Based Tool (BI Tools: Tableau…), connecting via JDBC/ODBC
• REST Server, Query Engine, Routing, Metadata
• Cube Build Engine (MapReduce…)
• Hadoop Hive: star schema data, mid latency (minutes)
• OLAP Cube (HBase): key-value data, low latency (seconds)

Features Highlights
• Extremely Fast OLAP Engine at Scale
  – Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data
• ANSI SQL Interface on Hadoop
  – Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
• Seamless Integration with BI Tools
  – Kylin currently offers integration capability with BI tools like Tableau
• Interactive Query Capability
  – Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset
• MOLAP Cube
  – Users can define a data model and pre-build it in Kylin with more than 10 billion raw data records

Features Highlights (cont.)
• Compression and Encoding Support
• Incremental Refresh of Cubes
• Approximate Query Capability for Distinct Count (HyperLogLog)
• Leverage HBase Coprocessor for query latency
• Job Management and Monitoring
• Easy Web interface to manage, build, monitor and query cubes
• Security capability to set ACL at Cube/Project Level
• Support LDAP Integration

How Does Kylin Utilize Hadoop Components?
• Hive
  – Input source
  – Pre-join star schema during cube building
• MapReduce
  – Pre-aggregate metrics during cube building
• HDFS
  – Store intermediate files during cube building
• HBase
  – Store data cube
  – Serve queries on data cube
  – Coprocessor is used for query processing

Why is Kylin Fast?
• Pre-built cube – query results are already calculated
• Leveraging distributed computing infrastructure
• No runtime Hive table scan or MapReduce job
• Compression and encoding
• Put “Computing” to “Data”
• Cached
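The first point, a pre-built cube, can be illustrated with a toy in-memory version. This is only a sketch of the idea under stated assumptions: the rows, the dimension names (`site`, `category`, `week`) and the helpers `build_cube` and `query_sum` are made up for illustration, and a plain dictionary stands in for the HBase-backed storage that Kylin actually uses.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("site", "category", "week")

# Hypothetical fact rows; "gmv" is the measure to be summed.
ROWS = [
    {"site": "US", "category": "Toys", "week": "W1", "gmv": 100.0},
    {"site": "US", "category": "Toys", "week": "W2", "gmv": 50.0},
    {"site": "US", "category": "Books", "week": "W1", "gmv": 30.0},
    {"site": "DE", "category": "Toys", "week": "W1", "gmv": 20.0},
]

def build_cube(rows, dims, measure):
    """Pre-aggregate SUM(measure) for every cuboid (done once, offline)."""
    cube = {}
    for k in range(len(dims) + 1):
        for cuboid in combinations(dims, k):
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[d] for d in cuboid)] += row[measure]
            cube[cuboid] = dict(totals)
    return cube

def query_sum(cube, dims, group_by):
    """Answer a GROUP BY query with a lookup; no scan of the raw rows."""
    cuboid = tuple(d for d in dims if d in group_by)
    return cube[cuboid]

cube = build_cube(ROWS, DIMS, "gmv")
by_site = query_sum(cube, DIMS, {"site"})
by_site_week = query_sum(cube, DIMS, {"week", "site"})
grand_total = query_sum(cube, DIMS, set())
print(by_site)
print(grand_total)
```

The expensive pass over the rows happens only in `build_cube`; each query afterwards touches just the already-aggregated cuboid it needs.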

Agenda
• What’s Kylin
• Tech Highlights
• Performance
• Open Source
• Q & A

How to Define Cube? Data Modeling
Roles: End User, Cube Modeler, Admin
• Source: Star Schema (a Fact table surrounded by Dim tables)
• Mapping: Cube Metadata
  – Cube: …
  – Fact Table: …
  – Dimensions: …
  – Measures: …
  – Storage(HBase): …
• Target: HBase Storage
  – Row Key (row A, row B, row C)
  – Column Family / Column (Val 1, Val 2, Val 3)

How to Define Cube? Cube Metadata
• Dimension
  – Normal
  – Mandatory
  – Hierarchy
  – Derived
• Measure
  – Sum
  – Count
  – Max
  – Min
  – Average
  – Distinct Count (based on HyperLogLog)
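For readers unfamiliar with HyperLogLog, the following is a compact, textbook-style sketch of the algorithm behind an approximate distinct count. It is not Kylin's implementation: the class name, the choice of SHA-1 as the hash, and the precision `p=12` are all assumptions made for illustration.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog with 2**p registers and a 64-bit hash."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog()
for user_id in range(50_000):
    hll.add(user_id)
    hll.add(user_id)  # duplicates do not change the registers
print(round(hll.count()))
```

The sketch keeps only 4096 small registers regardless of how many values are added, trading a typical error of about 1.04 / sqrt(m) (roughly 1.6% here) for a fixed memory footprint.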

How to Define Cube? Mandatory Dimension
• A dimension that must be present in every cuboid
• E.g. Date

Normal      A is mandatory
A B C       A B C
A B -       A B -
- B C       A - C
A - C       A - -
A - -
- B -
- - C
- - -
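The pruning effect of a mandatory dimension can be checked with a short sketch. The helper names `all_cuboids` and `with_mandatory` are hypothetical; the code only mirrors the A/B/C table on this slide and says nothing about how Kylin stores the rule.

```python
from itertools import combinations

def all_cuboids(dims):
    """Every combination of dimensions, from the base cuboid down to the apex."""
    return [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]

def with_mandatory(dims, mandatory):
    """Keep only the cuboids that contain every mandatory dimension."""
    required = set(mandatory)
    return [c for c in all_cuboids(dims) if required <= set(c)]

dims = ("A", "B", "C")
normal = all_cuboids(dims)
pruned = with_mandatory(dims, {"A"})
print(len(normal), "cuboids without the rule")
print(len(pruned), "cuboids when A is mandatory:", pruned)
```

Each mandatory dimension halves the number of cuboids to build, since every combination that omits it is dropped.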

How to Define Cube? Hierarchy Dimension
• Dimensions that form a “contains” relationship where the parent level is required for the child level to make sense
• E.g. Year -> Month -> Day; Country -> City

Normal      A -> B -> C is hierarchy
A B C       A B C
A B -       A B -
- B C       A - -
A - C       - - -
A - -
- B -
- - C
- - -
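The hierarchy rule from the table above (a child level may appear only together with its parent) can be sketched as a filter over the cuboid list. `respects_hierarchy` is a hypothetical helper written for this illustration, not a Kylin API.

```python
from itertools import combinations

def all_cuboids(dims):
    """Every combination of dimensions, from the base cuboid down to the apex."""
    return [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]

def respects_hierarchy(cuboid, hierarchy):
    """A child level is allowed only when its parent level is also present."""
    present = set(cuboid)
    for parent, child in zip(hierarchy, hierarchy[1:]):
        if child in present and parent not in present:
            return False
    return True

hierarchy = ("A", "B", "C")  # A -> B -> C, e.g. Year -> Month -> Day
kept = [c for c in all_cuboids(hierarchy) if respects_hierarchy(c, hierarchy)]
print(kept)
```

For a hierarchy of n levels this leaves n + 1 cuboids instead of 2^n, matching the four rows in the right-hand column of the table.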

How to Define Cube? Derived Dimension
• Dimensions on a lookup table that can be derived from its primary key (PK)
• E.g. User ID -> [Name, Age, Gender]

Normal      A, B, C is derived by ID
A B C       ID
A B -       -
- B C
A - C
A - -
- B -
- - C
- - -
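One way to picture a derived dimension: the cube is built on the lookup table's key only, and the derived attribute is resolved through the lookup table when the query runs, followed by one more aggregation step. The sketch below assumes exactly that flow; the lookup data, the measure values and the function `aggregate_by_derived` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical lookup table keyed by its primary key (User ID).
USER_LOOKUP = {
    1: {"name": "Ann", "age": 30, "gender": "F"},
    2: {"name": "Bob", "age": 41, "gender": "M"},
    3: {"name": "Cid", "age": 25, "gender": "F"},
}

# Hypothetical pre-built cuboid: SUM(measure) grouped by User ID only.
CUBOID_BY_USER_ID = {1: 10.0, 2: 5.0, 3: 7.0}

def aggregate_by_derived(cuboid, lookup, attribute):
    """Map each key to its derived attribute, then re-aggregate."""
    totals = defaultdict(float)
    for key, value in cuboid.items():
        totals[lookup[key][attribute]] += value
    return dict(totals)

by_gender = aggregate_by_derived(CUBOID_BY_USER_ID, USER_LOOKUP, "gender")
print(by_gender)
```

The cube needs a single dimension (ID) instead of three (Name, Age, Gender), at the cost of a small amount of extra work at query time.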

How to Build Cube? Cube Build Job Flow

How to Build Cube? Cube Build Result

How to Query Cube? Query Engine – Calcite
• Dynamic data management framework.
• Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others.
• http://optiq.incubator.apache.org

How to Query Cube? Calcite Plugins
• Metadata SPI
  – Provide table schema from Kylin metadata
• Optimize Rule
  – Translate the logical operator into a Kylin operator
• Relational Operator
  – Find the right cube
  – Translate SQL into storage engine API calls
  – Generate the physical execution plan by linq4j Java implementation
• Result Enumerator
  – Translate the storage engine result into a Java implementation result
• SQL Function
  – Add HyperLogLog for distinct count
  – Implement date/time related functions (e.g. Quarter)
