
Altinity: Building Multi-Petabyte Data Warehouses with ClickHouse



  1. Altinity: Building Multi-Petabyte Data Warehouses with ClickHouse. Alexander Zaitsev, LifeStreet / Altinity. Percona Live Dublin, 2017

  2. Who am I • Graduated Moscow State University in 1999 • Software engineer since 1997 • Developed distributed systems since 2002 • Focused on high performance analytics since 2007 • Director of Engineering at LifeStreet • Co-founder of Altinity

  3. • Ad Tech company (ad exchange, ad server, RTB, DMP etc.) since 2006 • 10,000,000,000+ events/day • ~2 KB/event • 3 months retention (90-120 days) • 10B * 2 KB * [90-120 days] = [1.8-2.4] PB

  4. Tried/used/evaluated: • MySQL (TokuDB, ShardQuery) • InfiniDB • MonetDB • InfoBright EE • Paraccel (now RedShift) • Oracle • Greenplum • Snowflake DB • Vertica • ClickHouse

  5. Flashback: ClickHouse at 08/2016 • 1-2 months in Open Source • Internal Yandex product – no other installations • No support, roadmap, communicated plans • 3 official devs • A number of visible limitations (and many invisible) • Stories of other doomed open-sourced DBs

  6. Develop production system with “that”?

  7. ClickHouse is/was missing: • Transactions • Constraints • Consistency • UPDATE/DELETE • NULLs (added a few months ago) • Milliseconds • Implicit type conversions • Standard SQL support • Partitioning by any column (date only) • Enterprise operation tools

  8. SQL developers reaction:

  9. But we tried and succeeded

  10. Before you go: ✓ Confirm your use case ✓ Check benchmarks ✓ Run your own ✓ Consider limitations, not features ✓ Make a POC

  11. Migration problem: basic things do not work

  12. Main Challenges • Efficient schema – Use ClickHouse bests – Work around limitations • Reliable data ingestion • Sharding and replication • Client interfaces

  13. LifeStreet Use Case • Publisher/Advertiser performance • Campaign/Creative performance prediction • Realtime algorithmic bidding • DMP

  14. LifeStreet Requirements • Load 10B rows/day, 500 dimensions/row • Ad-hoc reports on 3 months of data • Low data and query latency • High Availability

  15. Multi-Dimensional Analysis. OLAP query = aggregation + filter + group by: a range filter over an N-dimensional cube, with the query result being a slice/projection onto M dimensions. Disclaimer: averages lie.

  16. Typical schema: “star” • Facts • Dimensions • Metrics • Projections

  17. Star Schema Approach • De-normalized: dimensions stored in the fact table – single table, simple – simple queries, no joins – data cannot be changed – sub-efficient storage – sub-efficient queries • Normalized: dimension keys in the fact table plus separate dimension tables – multiple tables – more complex queries with joins – data in dimension tables can be changed – efficient storage – more efficient queries

  18. Normalized schema: traditional approach - joins • Limited support in ClickHouse (1 level, cascade sub-selects for multiple) • Dimension tables are not updatable

  19. Dictionaries - ClickHouse dimensions approach • Lookup service: key -> value • Supports different external sources (files, databases etc.) • Refreshable

  20. Dictionaries. Example: SELECT country_name, sum(imps) FROM T ANY INNER JOIN dim_geo USING (geo_key) GROUP BY country_name; vs SELECT dictGetString('dim_geo', 'country_name', geo_key) country_name, sum(imps) FROM T GROUP BY country_name;

  21. Dictionaries. Configuration <dictionary> <name></name> <source> … </source> <lifetime> ... </lifetime> <layout> … </layout> <structure> <id> ... </id> <attribute> ... </attribute> <attribute> ... </attribute> ... </structure> </dictionary>

  22. Dictionaries. Sources • file • mysql table • clickhouse table • odbc data source • executable script • http service

  23. Dictionaries. Layouts • flat • hashed • cache • complex_key_hashed • range_hashed

  24. Dictionaries. range_hashed – 'Effective Dated' queries. Layout: <layout> <range_hashed /> </layout>. Structure: <structure> <id> <name>id</name> </id> <range_min> <name>start_date</name> </range_min> <range_max> <name>end_date</name> </range_max> ... </structure>. Query example: dictGetFloat32('srv_ad_serving_costs', 'ad_imps_cpm', toUInt64(0), event_day)

  25. Dictionaries. Update values • By timer (default) • Automatic for MySQL MyISAM • Using ‘invalidate_query’ • Manually touching config file • Warning: N dict * M nodes = N * M DB connections

  26. Dictionaries. Restrictions • ‘Normal’ keys are only UInt64 • No on demand update (added in Sep 2017 1.1.54289) • Every cluster node has its own copy • XML config (DDL would be better)

  27. Dictionaries vs. Tables + No JOINs + Updatable + Always in memory for flat/hash (faster) - Not a part of the schema - Somewhat inconvenient syntax

  28. Tables • Engines • Sharding/Distribution • Replication

  29. Engine = ? • In memory: Memory, Buffer, Join, Set • Interface: Distributed, Merge, Dictionary, View • Special purpose: Materialized View, Null • On disk: Log, TinyLog, MergeTree family

  30. MergeTree • What is a 'merge': Block 1 + Block 2 -> Merged block • PK sorting and index • Date partitioning • Query performance via the PK index • See details at: https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324
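
  A minimal sketch of the 2017-era MergeTree DDL implied by this slide; the table and column names are hypothetical. The three engine parameters are the date column (used for monthly partitioning), the primary key tuple, and the index granularity:

      CREATE TABLE dw.fact_event_local (
          access_day  Date,
          campaign_id UInt64,
          geo_key     UInt64,
          imps        UInt64,
          clicks      UInt64
      ) ENGINE = MergeTree(access_day, (access_day, campaign_id, geo_key), 8192);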

  31. MergeTree family: {Replacing | Collapsing | Summing | Aggregating | Graphite} + [Replicated] + MergeTree, e.g. ReplicatedSummingMergeTree

  32. Data Load • Multiple formats are supported, including CSV, TSV, JSONs, native binary • Error handling • Simple Transformations • Load locally (better) or distributed (possible) • Temp tables help • Replicated tables help with de-dup
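
  As a hedged illustration of the temp-table pattern (table names are hypothetical, reusing the fact table sketched above): load a raw batch into a local staging table first, then move it into the fact table with one INSERT ... SELECT, so a failed load never leaves partial data in the target.

      -- stage a raw batch locally; with clickhouse-client the data rows follow the FORMAT clause
      INSERT INTO dw.temp_event_local FORMAT CSV
      2017-09-25,1234,42,100,1
      2017-09-25,1234,43,250,3

      -- move the validated batch into the fact table in one pass
      INSERT INTO dw.fact_event_local SELECT * FROM dw.temp_event_local;

      -- drop (or re-create) the staging table before the next batch
      DROP TABLE dw.temp_event_local;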

  33. The power of Materialized Views • MV is a table, i.e. engine, replication etc. • Updated synchronously • Summing/AggregatingMergeTree – consistent aggregation • Alters are problematic
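
  A hedged sketch of such a view, with hypothetical names: a SummingMergeTree rollup that is populated synchronously on every insert into the fact table.

      CREATE MATERIALIZED VIEW dw.daily_campaign_mv
      ENGINE = SummingMergeTree(access_day, (access_day, campaign_id), 8192)
      AS SELECT
          access_day,
          campaign_id,
          sum(imps)   AS imps,
          sum(clicks) AS clicks
      FROM dw.fact_event_local
      GROUP BY access_day, campaign_id;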

  34. Data Load Diagram (per ClickHouse node): realtime producers, log files and MySQL INSERT into Buffer tables and local temp tables; the buffers flush into the sharded fact tables; materialized views populate sharded SummingMergeTree rollups; dictionaries are attached on the node.
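
  A hedged sketch of the buffer stage (names are hypothetical): the Buffer engine accumulates small inserts in memory and flushes them into the underlying table once its time/row/byte thresholds are crossed.

      -- Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
      CREATE TABLE dw.fact_event_buffer AS dw.fact_event_local
      ENGINE = Buffer(dw, fact_event_local, 16, 10, 100, 10000, 1000000, 10000000, 100000000);

      -- producers write small batches here; ClickHouse flushes to dw.fact_event_local in the background
      INSERT INTO dw.fact_event_buffer VALUES ('2017-09-25', 1234, 42, 100, 1);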

  35. Updates and deletes • Dictionaries are refreshable • Replacing and Collapsing merge trees – eventual updates – SELECT … FINAL • Partitions
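
  A hedged sketch of the "eventual update" pattern, with hypothetical names: ReplacingMergeTree keeps the newest row per key after a merge, and SELECT ... FINAL forces the collapsed view at query time (at a cost).

      CREATE TABLE dw.campaign_status (
          access_day  Date,
          campaign_id UInt64,
          status      String,
          updated_at  DateTime
      ) ENGINE = ReplacingMergeTree(access_day, (campaign_id), 8192, updated_at);

      -- re-inserting a row with the same key supersedes the old one after a merge
      INSERT INTO dw.campaign_status VALUES ('2017-09-25', 1234, 'paused', now());

      -- FINAL collapses duplicates that have not been merged yet
      SELECT * FROM dw.campaign_status FINAL WHERE campaign_id = 1234;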

  36. Sharding and Replication • Sharding and Distribution => Performance – Fact tables and MVs – distributed over multiple shards – Dimension tables and dicts – replicated at every node (local joins and filters) • Replication => Reliability – 2-3 replicas per shard – Cross DC

  37. Distributed Query: SELECT foo FROM distributed_table GROUP BY col1 is sent to server 1, 2 or 3, which fans out SELECT foo FROM local_table GROUP BY col1 to servers 1, 2 and 3 and merges the partial results.
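
  A hedged sketch of the distributed layer, with a hypothetical cluster name: the Distributed engine stores no data itself, it only routes queries (and optionally inserts, via the sharding key) to the per-shard local tables.

      -- Distributed(cluster_name, database, local_table[, sharding_key])
      CREATE TABLE dw.fact_event AS dw.fact_event_local
      ENGINE = Distributed(lifestreet_cluster, dw, fact_event_local, rand());

      -- queried like a normal table; ClickHouse fans the query out to every shard and merges the results
      SELECT campaign_id, sum(imps) FROM dw.fact_event GROUP BY campaign_id;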

  38. Replication • Per table topology configuration: – Dimension tables – replicate to any node – Fact tables – replicate to a mirror replica • ZooKeeper to communicate the state – State: what blocks/parts to replicate • Asynchronous => faster and reliable enough • Synchronous => slower • Isolate queries to a replica • Replication queues
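
  A hedged replicated variant of the earlier fact-table sketch; the ZooKeeper path and the {shard}/{replica} macros are hypothetical and would normally come from each server's macros configuration.

      CREATE TABLE dw.fact_event_local (
          access_day  Date,
          campaign_id UInt64,
          geo_key     UInt64,
          imps        UInt64,
          clicks      UInt64
      ) ENGINE = ReplicatedMergeTree(
          '/clickhouse/tables/{shard}/fact_event_local',  -- ZooKeeper path shared by all replicas of a shard
          '{replica}',                                    -- replica name, unique per server
          access_day, (access_day, campaign_id, geo_key), 8192);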

  39. SQL • Supports basic SQL syntax • Non-standard JOINs implementation: – 1 level only – ANY vs ALL – only USING • Aliasing everywhere • Array and nested data types, lambda-expressions, ARRAY JOIN • GLOBAL IN, GLOBAL JOIN • Approximate queries • Some analytical functions
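
  A hedged illustration of two of these features (the tags array and the active_campaigns table are hypothetical): ARRAY JOIN unrolls an array column into one row per element, and GLOBAL IN evaluates its subquery once on the initiator and ships the result to every shard.

      -- unroll the tags array into one row per tag
      SELECT campaign_id, tag, count() AS cnt
      FROM dw.fact_event
      ARRAY JOIN tags AS tag
      GROUP BY campaign_id, tag;

      -- GLOBAL IN: the subquery runs once and its result set is sent to all shards
      SELECT count()
      FROM dw.fact_event
      WHERE campaign_id GLOBAL IN (SELECT campaign_id FROM dw.active_campaigns);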

  40. Hardware and Deployment • Load is CPU intensive => more cores • Query is disk intensive => faster disks • 10-12 SATA disks in RAID10 – SAS/SSD => x2 performance for x2 price at x0.5 capacity • 10 TB/server seems optimal • ZooKeeper – keep it in one DC for fast quorum • Remote DCs work badly (e.g. East and West coast in the US)

  41. Main Challenges Revisited • Design an efficient schema – Use ClickHouse bests – Work around limitations • Design sharding and replication • Reliable data ingestion • Client interfaces

  42. Migration project timelines • August 2016: POC • October 2016: first test runs • December 2016: production scale data load – 10-50B events/day, 20TB data/day – 12 x 2 servers with 12x4TB RAID10 • March 2017: Client API ready, starting migration – 30+ client types, 20 req/s query load • May 2017: extension to 20 x 3 servers • June 2017: migration completed! – 2-2.5PB uncompressed data

  43. Few examples :) SELECT count(*) FROM dw.ad8_fact_event WHERE access_day = (today() - 1) ┌────count()─┐ │ 7585106796 │ └────────────┘ 1 rows in set. Elapsed: 0.503 sec. Processed 12.78 billion rows, 25.57 GB (25.41 billion rows/s., 50.82 GB/s.)
