OLX Data Hub

Jakub Orłowski, Krzysztof Antończak, Facundo Guerrero
Presto Summit 2019, New York City

Meet OLX, the biggest Web company you’ve never heard of

Within classified ads, OLX Group is the largest global player
- Present in 30 markets
- >300m MAUs
- Leading position in 27

Source: Company Information; Leading position refers to top 3 position based on MAUs as per SimilarWeb, Oct 2019; MAUs refers to Monthly Active Users

… with a strong local presence
- +5,500 dedicated employees
- +30 offices globally

Anatomy of a typical “BI Stack”
Typical Data Stack: S3, Redshift, GitLab, Jenkins
- Tight coupling between compute nodes and storage
- Data is stored on the compute nodes
- Low usage of S3 (Spectrum adoption is slower than expected)
- Limited dependency management
- No scheduling standards (random low quality python scripts)

What are the problems we aim to solve?
- Complex cross-stack synchronisation mechanism
- “Reservoir” design discourages building on each other
- Use of multiple AWS regions makes sharing difficult and increases costs
- Separated ETL scheduling standards
Divergent Solutions? Shared Data Lake Solutions

...and what if?
- Use of Redshift will be an engineering choice, and its usage is expected to decrease
- Shared synchronisation system and code repository (and, hopefully, standards)
- Shared storage in a single AWS region and same account
- Shared support of multiple execution engines: Redshift, Athena, Presto, Spark
[Diagram labels: Divergent Solutions?; Shared Data Lake Solutions; Shared Data Hub / Data Lake (Odyn); Multiple Execution Engines]

OLX Data Hub (“Odyn”) high level architecture overview
[Diagram: Applications (App 1, App 2, App 3, App ...); Operator; Scheduler; ODYN Data Hub; Storage; Config]

Actual OLX Data Hub (“Odyn”) task configuration example

Migrating to Presto
Why we decided to move out of the Redshift comfort zone

Typical data workflow of a “BI stack”
LOAD EXTRACT TRANSFORM

“If you were entering Hadoop ecosystem 8-10 years ago, there was this mantra: bring compute to your storage, tie them together; shipping data is so expensive. That is no longer true. All modern architectures right now separate storage from compute. Grow your data without limit, scale your compute power whenever you need.” Kamil Bajda-Pawlikowski, Data Council NY, Nov 7-8, 2018

Introduced Athena for querying raw data
LOAD EXTRACT TRANSFORM

Athena adoption failed :-(
● Query exhausted resources
● The query timeout is 30 minutes
● Generic raw data not so friendly for queries
● CTAS usage increase

Looking for the best query execution engine for our needs

Introduced Presto for processing data
LOAD EXTRACT TRANSFORM

Presto in production at OLX
● 30+ nodes in AWS (r5.8xlarge)
● 20K+ queries daily
● 100+ users in 20 teams over 5 countries
● 1PB+ data on S3 (Parquet, ORC, JSON)

prestosql.io

OLX Data Platform

Presto Infrastructure
Where and how we run Presto

Where is Presto running?
● Kubernetes cluster
  ○ AWS EKS in Ireland
  ○ Staging and Production
  ○ Single Amazon availability zone
● We moved Presto from EMR to Kubernetes (EKS), using a mix of spot and on-demand instances
● Metrics are stored in Prometheus and shown in Grafana
Sizes:
● Production = 25 * r5.8xlarge
● Staging = 16 * r4.2xlarge

Challenge
A Presto cluster has a static size: even when there is nothing to do, the worker nodes need to stay up.

Presto “AutoScaling”
We developed our own “auto-scaling” solution for Presto workers, which lets us reduce the cost of the cluster when no queries are running on it.
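The slides do not describe how the auto-scaling solution works internally. As a rough illustration of the idea only, the sketch below picks a worker count from the current query load and falls back to a small floor when the cluster is idle. The names `ClusterLoad` and `desired_workers`, the thresholds, and the `queries_per_worker` ratio are all hypothetical; only the upper bound of 25 workers is taken from the production size mentioned earlier. Reading the load from the coordinator and applying the result to Kubernetes are left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLoad:
    """Snapshot of coordinator load (hypothetical structure, not from the talk)."""
    running_queries: int
    queued_queries: int


def desired_workers(load, min_workers=2, max_workers=25, queries_per_worker=4):
    """Return how many workers to run for the given load.

    Assumptions (illustrative only): one worker per `queries_per_worker`
    active queries, never fewer than `min_workers`, never more than
    `max_workers`.
    """
    active = load.running_queries + load.queued_queries
    if active == 0:
        # Idle cluster: shrink to the floor to save cost.
        return min_workers
    needed = math.ceil(active / queries_per_worker)
    # Clamp the result into the allowed range.
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    print(desired_workers(ClusterLoad(running_queries=0, queued_queries=0)))
    print(desired_workers(ClusterLoad(running_queries=10, queued_queries=3)))
```

A real controller would also need a cool-down period before scaling in, so that workers are not removed while they still hold running tasks.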

Next challenges
Presto is still not 100% integrated into our current ecosystem.
● Cluster for analysts, with login through our single sign-on system (Okta)
● Use different IAM roles depending on user / catalog / table (GDPR)
● Cost-Based Optimizer (using Hive Metastore)

joinolx.com
