How Cloudflare analyzes >1M DNS queries per second
Tom Arnfeld (and Marek Vavrusa)
3M DNS queries/second
100+ data centers globally
>10% of Internet requests every day
2.5B monthly unique visitors
6M+ websites, apps & APIs in 150 countries
5M+ HTTP requests/second
Anatomy of a DNS query: 30+ fields

$ dig www.cloudflare.com

; <<>> DiG 9.8.3-P1 <<>> www.cloudflare.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36582
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.cloudflare.com.        IN  A

;; ANSWER SECTION:
www.cloudflare.com.  5  IN  A  198.41.215.162
www.cloudflare.com.  5  IN  A  198.41.214.162

;; Query time: 34 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sat Sep 2 10:48:30 2017
;; MSG SIZE rcvd: 68
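Many of those 30+ fields live in the fixed 12-byte DNS header: the query id, the flags (qr, rd, ra), the response code, and the section counts. A minimal sketch of decoding that header with only the Python standard library; the sample bytes below are constructed for illustration (matching the id and flags in the dig output), not captured from the wire:

```python
import struct

def parse_dns_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte DNS header (RFC 1035, section 4.1.1)."""
    ident, flags, qd, an, ns, ar = struct.unpack("!6H", packet[:12])
    return {
        "id": ident,
        "qr": (flags >> 15) & 1,   # 0 = query, 1 = response
        "opcode": (flags >> 11) & 0xF,
        "rd": (flags >> 8) & 1,    # recursion desired
        "ra": (flags >> 7) & 1,    # recursion available
        "rcode": flags & 0xF,      # 0 = NOERROR
        "qdcount": qd, "ancount": an, "nscount": ns, "arcount": ar,
    }

# A hand-built header: id=36582, flags qr rd ra, 1 question, 2 answers
hdr = struct.pack("!6H", 36582, 0x8180, 1, 2, 0, 0)
print(parse_dns_header(hdr))
```

Per-query logs at the edge carry these fields (plus response time and size) as structured records rather than text.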
Anycast DNS: Cloudflare DNS Server + HTTP & Other Edge Services → Log Forwarder

- Logs from all edge services and all PoPs are shipped over TLS to be processed
- Logs are received and de-multiplexed
- Logs are written into various Kafka topics
- Log messages are serialized with Cap'n Proto
What did we want?

3M queries per second, 100+ edge Points of Presence, 20+ query dimensions, 5+ years of stored aggregations

- Multidimensional query analytics
- Complex ad-hoc queries
- Capable of current and expected future scale
- Gracefully handle late arriving log data
- Roll-ups/aggregations for long-term storage
- Highly available and replicated architecture
Kafka, Apache Spark and Parquet

Pipeline: logs are received and de-multiplexed → written into various Kafka topics → downloaded and filtered from Kafka using Apache Spark → converted into Parquet and written to HDFS

- Scanning the firehose is slow, and adding filters is time-consuming
- Offline analysis is difficult with large amounts of data
- Not a fast or friendly user experience
- Doesn't work for customers
Let's aggregate everything... with streams

Raw events:

Timestamp           | QName              | QType | RCODE
2017/01/01 01:00:00 | www.cloudflare.com | A     | NODATA
2017/01/01 01:00:01 | api.cloudflare.com | AAAA  | NOERROR

Aggregated:

Time Bucket      | QName              | QType | RCODE   | Count | p50 Response Time
2017/01/01 01:00 | www.cloudflare.com | A     | NODATA  | 5     | 0.4876ms
2017/01/01 01:00 | api.cloudflare.com | AAAA  | NOERROR | 10    | 0.5231ms
Let's aggregate everything... with streams

- Counters
  - Total number of queries
  - Query types
  - Response codes
- Top-n query names
- Top-n query sources
- Response time/size quantiles
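The per-minute aggregation in the table above can be sketched in a few lines of pure Python (this is an illustration of the idea, not the actual pipeline): bucket each log record by truncated timestamp plus the key dimensions, count occurrences, and keep response times so a p50 can be computed at flush time.

```python
from collections import defaultdict
from statistics import median

def bucket(ts: str) -> str:
    """Truncate '2017/01/01 01:00:23' to the minute: '2017/01/01 01:00'."""
    return ts[:16]

# (time_bucket, qname, qtype, rcode) -> list of response times (ms)
agg = defaultdict(list)

def ingest(ts, qname, qtype, rcode, resp_ms):
    agg[(bucket(ts), qname, qtype, rcode)].append(resp_ms)

ingest("2017/01/01 01:00:00", "www.cloudflare.com", "A", "NODATA", 0.48)
ingest("2017/01/01 01:00:30", "www.cloudflare.com", "A", "NODATA", 0.50)
ingest("2017/01/01 01:00:01", "api.cloudflare.com", "AAAA", "NOERROR", 0.52)

for key, times in agg.items():
    print(key, "count =", len(times), "p50 =", median(times))
```

Note that the state grows with the cardinality of (qname, qtype, rcode) — which is exactly the cost problem the later slides run into.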
Aggregating with Spark Streaming

Pipeline: logs are received and de-multiplexed → written into various Kafka topics → produce low-cardinality aggregates with Spark Streaming

- Spark experience in-house, though Java/Scala
- Batch-oriented, and still needs a DB to serve online queries
- Difficult to support ad-hoc analysis
- Low-resolution aggregates
- Scanning raw data is slow
- Late arriving data
Spark Streaming + CitusDB

Pipeline: produce low-cardinality aggregates with Spark Streaming → insert aggregate rows into a CitusDB cluster for reads

- Distributed time-series DB with a SQL API
- Existing deployments of CitusDB
- High-cardinality aggregations are tricky due to insert performance
- Late arriving data
Apache Flink + (CitusDB?)

Pipeline: produce low-cardinality aggregates with Flink → insert aggregate rows into a CitusDB cluster for reads

- Dataflow API and support for stream watermarks
- Checkpoint performance issues
- High-cardinality aggregations are tricky due to insert performance
- SQL API
Druid

Pipeline: insert into a cluster of Druid nodes

- Insertion rate couldn't keep up in our initial tests
- Estimated costs of a suitable cluster were very high
- Seemed performant for random reads, but not the best we'd seen
- Operational complexity seemed high
Let's aggregate everything... with streams

- Raw data isn't easily queried ad-hoc
- Backfilling new aggregates is impossible, or can be very difficult without custom tools
- A stream can't serve actual queries
- Can be costly for high-cardinality dimensions

*https://clickhouse.yandex/docs/en/introduction/what_is_clickhouse.html
ClickHouse

- Tabular, column-oriented data store
- Single binary, clustered architecture
- Familiar SQL query interface, with lots of very useful built-in aggregation functions
- Raw log data stored for 3 months (~7 trillion rows)
- Aggregated data stored indefinitely: 1m and 1h aggregations across 3 dimensions
Anycast DNS: Cloudflare DNS Server + HTTP & Other Edge Services → Log Forwarder

- Logs from all edge services and all PoPs are shipped over TLS to be processed
- Logs are received and de-multiplexed
- Logs are written into various Kafka topics
- Log messages are serialized with Cap'n Proto
- Go inserters write the data in parallel
- A multi-tenant ClickHouse cluster stores the data
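The inserters' job is essentially: drain messages from Kafka, accumulate a batch, and issue one bulk INSERT per batch (ClickHouse strongly favors large batched inserts over row-at-a-time writes). A rough Python sketch of that batching pattern — the queue, batch size, and row shape are made up for illustration; the real inserters are written in Go:

```python
import queue
import threading

BATCH_SIZE = 3
log_queue = queue.Queue()   # stand-in for a Kafka consumer
inserted_batches = []       # stand-in for "INSERT INTO dnslogs VALUES ..."

def inserter():
    """Drain the queue, flushing one bulk insert per batch; None = stop signal."""
    batch = []
    while True:
        row = log_queue.get()
        if row is None:
            break
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            inserted_batches.append(batch)   # one bulk INSERT per batch
            batch = []
    if batch:
        inserted_batches.append(batch)       # flush the remainder on shutdown

t = threading.Thread(target=inserter)
t.start()
for i in range(7):
    log_queue.put({"qname": f"host{i}.example.com", "qtype": "A"})
log_queue.put(None)
t.join()
print([len(b) for b in inserted_batches])  # → [3, 3, 1]
```

In practice many such workers run in parallel, each consuming its own Kafka partitions, which is what "write the data in parallel" refers to on the slide.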
ClickHouse Cluster: initial table design

- Raw logs are inserted into sharded TinyLog tables (dnslogs_2016_01_01_14_30_pN)
- Sidecar processes aggregate the data into day/month/year ReplicatedMergeTree tables (dnslogs_2016_01_01, dnslogs_2016_01, dnslogs_2016)
ClickHouse Cluster: first attempt in production

- Raw logs are inserted into one replicated, sharded ReplicatedMergeTree table (r{0,2}.dnslogs)
- Multiple r{0,2} databases to better pack the cluster with shards and replicas
Speeding up typical queries

- SUM() and COUNT() over a few low-cardinality dimensions
- Global overview (trends and monitoring)
- Storing intermediate state for non-additive functions
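"Intermediate state for non-additive functions" matters because a quantile like a p50 cannot be summed across rollups the way a COUNT() can: the median of two shard medians is not the median of the combined data. What can be combined is each shard's intermediate state — the idea behind ClickHouse's -State/-Merge aggregate combinators. A toy Python illustration, using the raw sample list as the "state" (ClickHouse uses compact sketches instead):

```python
from statistics import median

def partial_state(response_times):
    """Per-shard intermediate state; here, simply the sorted samples."""
    return sorted(response_times)

def merge_states(states):
    """Combine states from many shards/time buckets, then finalize the p50."""
    merged = sorted(t for s in states for t in s)
    return median(merged)

shard_a = partial_state([0.4, 0.5, 0.6])   # per-shard median would be 0.5
shard_b = partial_state([0.7, 0.8])        # per-shard median would be 0.75
print(merge_states([shard_a, shard_b]))    # → 0.6, the true combined median
```

Neither 0.5 nor 0.75 (nor any average of them) recovers 0.6, which is why the rollup tables store mergeable states rather than finalized values.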
ClickHouse Cluster: today

- Raw logs are inserted into one replicated, sharded ReplicatedMergeTree table (r{0,2}.dnslogs)
- Multiple r{0,2} databases to better pack the cluster with shards and replicas
- ReplicatedAggregatingMergeTree aggregate tables (dnslogs_rollup_X) for long-term storage
Timeline

- October 2016: Began evaluating technologies and architecture; 1 instance in Docker
- November 2016: Prototype ClickHouse cluster with 3 nodes, inserting a sample of data
- December 2016: Finalized schema, deployed a production ClickHouse cluster of 6 nodes
- Spring 2017: TopN, IP prefix matching, Go native ClickHouse driver, Analytics library, pkey in monotonic functions; visualisations with Superset and Grafana
- August 2017: Migrated to a new cluster with multi-tenancy; growing interest among other Cloudflare engineering teams, worked on standard tooling

Multi-tenant ClickHouse cluster

- 33 nodes
- 8M+ row insertions/second
- 4GB+ insertion throughput/second
- 2PB+ of RAID-0 spinning disks
ClickHouse Today… 12 trillion rows

SELECT table, sum(rows) AS total
FROM system.cluster_parts
WHERE database = 'r0'
GROUP BY table
ORDER BY total DESC

┌─table──────────────────┬─────────────total─┐
│ ██████████████████████ │ 9,051,633,001,267 │
│ ██████████████████████ │ 2,088,851,716,078 │
│ ██████████████████████ │   847,768,860,981 │
│ ██████████████████████ │   259,486,159,236 │
│ …                      │                 … │
└────────────────────────┴───────────────────┘