How Cloudflare analyzes >1M DNS queries per second
Tom Arnfeld (and Marek Vavrusa)
3M DNS queries/second
100+ data centers globally
>10% of Internet requests every day
2.5B monthly unique visitors
6M+ websites, apps & APIs in 150 countries
5M+ HTTP requests/second
Anatomy of a DNS query: 30+ fields

$ dig www.cloudflare.com

; <<>> DiG 9.8.3-P1 <<>> www.cloudflare.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36582
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.cloudflare.com.        IN  A

;; ANSWER SECTION:
www.cloudflare.com.  5  IN  A  198.41.215.162
www.cloudflare.com.  5  IN  A  198.41.214.162

;; Query time: 34 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sat Sep 2 10:48:30 2017
;; MSG SIZE rcvd: 68
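Many of those 30+ fields live in the fixed 12-byte DNS header: the query id, the flags (qr, rd, ra), the response code, and the section counts. A minimal sketch of decoding that header with only the Python standard library; the sample bytes below are constructed for illustration (matching the id and flags in the dig output), not captured from the wire:

```python
import struct

def parse_dns_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte DNS header (RFC 1035, section 4.1.1)."""
    ident, flags, qd, an, ns, ar = struct.unpack("!6H", packet[:12])
    return {
        "id": ident,
        "qr": (flags >> 15) & 1,   # 0 = query, 1 = response
        "opcode": (flags >> 11) & 0xF,
        "rd": (flags >> 8) & 1,    # recursion desired
        "ra": (flags >> 7) & 1,    # recursion available
        "rcode": flags & 0xF,      # 0 = NOERROR
        "qdcount": qd, "ancount": an, "nscount": ns, "arcount": ar,
    }

# A hand-built header: id=36582, flags qr rd ra, 1 question, 2 answers
hdr = struct.pack("!6H", 36582, 0x8180, 1, 2, 0, 0)
print(parse_dns_header(hdr))
```

Per-query logs at the edge carry these fields (plus response time and size) as structured records rather than text.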
Anycast DNS: Cloudflare DNS Server + HTTP & Other Edge Services → Log Forwarder

- Logs from all edge services and all PoPs are shipped over TLS to be processed
- Logs are received and de-multiplexed
- Logs are written into various Kafka topics
- Log messages are serialized with Cap'n Proto
What did we want?

3M queries per second, 100+ edge Points of Presence, 20+ query dimensions, 5+ years of stored aggregations

- Multidimensional query analytics
- Complex ad-hoc queries
- Capable of current and expected future scale
- Gracefully handle late arriving log data
- Roll-ups/aggregations for long-term storage
- Highly available and replicated architecture
Kafka, Apache Spark and Parquet

Pipeline: logs are received and de-multiplexed → written into various Kafka topics → downloaded and filtered from Kafka using Apache Spark → converted into Parquet and written to HDFS

- Scanning the firehose is slow, and adding filters is time-consuming
- Offline analysis is difficult with large amounts of data
- Not a fast or friendly user experience
- Doesn't work for customers
Let's aggregate everything... with streams

Raw events:

Timestamp           | QName              | QType | RCODE
2017/01/01 01:00:00 | www.cloudflare.com | A     | NODATA
2017/01/01 01:00:01 | api.cloudflare.com | AAAA  | NOERROR

Aggregated:

Time Bucket      | QName              | QType | RCODE   | Count | p50 Response Time
2017/01/01 01:00 | www.cloudflare.com | A     | NODATA  | 5     | 0.4876ms
2017/01/01 01:00 | api.cloudflare.com | AAAA  | NOERROR | 10    | 0.5231ms
Let's aggregate everything... with streams

- Counters
  - Total number of queries
  - Query types
  - Response codes
- Top-n query names
- Top-n query sources
- Response time/size quantiles
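The per-minute aggregation in the table above can be sketched in a few lines of pure Python (this is an illustration of the idea, not the actual pipeline): bucket each log record by truncated timestamp plus the key dimensions, count occurrences, and keep response times so a p50 can be computed at flush time.

```python
from collections import defaultdict
from statistics import median

def bucket(ts: str) -> str:
    """Truncate '2017/01/01 01:00:23' to the minute: '2017/01/01 01:00'."""
    return ts[:16]

# (time_bucket, qname, qtype, rcode) -> list of response times (ms)
agg = defaultdict(list)

def ingest(ts, qname, qtype, rcode, resp_ms):
    agg[(bucket(ts), qname, qtype, rcode)].append(resp_ms)

ingest("2017/01/01 01:00:00", "www.cloudflare.com", "A", "NODATA", 0.48)
ingest("2017/01/01 01:00:30", "www.cloudflare.com", "A", "NODATA", 0.50)
ingest("2017/01/01 01:00:01", "api.cloudflare.com", "AAAA", "NOERROR", 0.52)

for key, times in agg.items():
    print(key, "count =", len(times), "p50 =", median(times))
```

Note that the state grows with the cardinality of (qname, qtype, rcode) — which is exactly the cost problem the later slides run into.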
Aggregating with Spark Streaming

Pipeline: logs are received and de-multiplexed → written into various Kafka topics → produce low-cardinality aggregates with Spark Streaming

- Spark experience in-house, though Java/Scala
- Batch-oriented, and still needs a DB to serve online queries
- Difficult to support ad-hoc analysis
- Low-resolution aggregates
- Scanning raw data is slow
- Late arriving data
Spark Streaming + CitusDB

Pipeline: produce low-cardinality aggregates with Spark Streaming → insert aggregate rows into a CitusDB cluster for reads

- Distributed time-series DB with a SQL API
- Existing deployments of CitusDB
- High-cardinality aggregations are tricky due to insert performance
- Late arriving data
Apache Flink + (CitusDB?)

Pipeline: produce low-cardinality aggregates with Flink → insert aggregate rows into a CitusDB cluster for reads

- Dataflow API and support for stream watermarks
- Checkpoint performance issues
- High-cardinality aggregations are tricky due to insert performance
- SQL API
Druid

Pipeline: insert into a cluster of Druid nodes

- Insertion rate couldn't keep up in our initial tests
- Estimated costs of a suitable cluster were very high
- Seemed performant for random reads, but not the best we'd seen
- Operational complexity seemed high
Let's aggregate everything... with streams

- Raw data isn't easily queried ad-hoc
- Backfilling new aggregates is impossible, or can be very difficult without custom tools
- A stream can't serve actual queries
- Can be costly for high-cardinality dimensions

*https://clickhouse.yandex/docs/en/introduction/what_is_clickhouse.html
ClickHouse

- Tabular, column-oriented data store
- Single binary, clustered architecture
- Familiar SQL query interface, with lots of very useful built-in aggregation functions
- Raw log data stored for 3 months (~7 trillion rows)
- Aggregated data stored indefinitely: 1m and 1h aggregations across 3 dimensions
Anycast DNS: Cloudflare DNS Server + HTTP & Other Edge Services → Log Forwarder

- Logs from all edge services and all PoPs are shipped over TLS to be processed
- Logs are received and de-multiplexed
- Logs are written into various Kafka topics
- Log messages are serialized with Cap'n Proto
- Go inserters write the data in parallel
- A multi-tenant ClickHouse cluster stores the data
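The inserters' job is essentially: drain messages from Kafka, accumulate a batch, and issue one bulk INSERT per batch (ClickHouse strongly favors large batched inserts over row-at-a-time writes). A rough Python sketch of that batching pattern — the queue, batch size, and row shape are made up for illustration; the real inserters are written in Go:

```python
import queue
import threading

BATCH_SIZE = 3
log_queue = queue.Queue()   # stand-in for a Kafka consumer
inserted_batches = []       # stand-in for "INSERT INTO dnslogs VALUES ..."

def inserter():
    """Drain the queue, flushing one bulk insert per batch; None = stop signal."""
    batch = []
    while True:
        row = log_queue.get()
        if row is None:
            break
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            inserted_batches.append(batch)   # one bulk INSERT per batch
            batch = []
    if batch:
        inserted_batches.append(batch)       # flush the remainder on shutdown

t = threading.Thread(target=inserter)
t.start()
for i in range(7):
    log_queue.put({"qname": f"host{i}.example.com", "qtype": "A"})
log_queue.put(None)
t.join()
print([len(b) for b in inserted_batches])  # → [3, 3, 1]
```

In practice many such workers run in parallel, each consuming its own Kafka partitions, which is what "write the data in parallel" refers to on the slide.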
ClickHouse Cluster: initial table design

- Raw logs are inserted into sharded TinyLog tables (dnslogs_2016_01_01_14_30_pN)
- Sidecar processes aggregate the data into day/month/year ReplicatedMergeTree tables (dnslogs_2016_01_01, dnslogs_2016_01, dnslogs_2016)
ClickHouse Cluster: first attempt in production

- Raw logs are inserted into one replicated, sharded ReplicatedMergeTree table (r{0,2}.dnslogs)
- Multiple r{0,2} databases to better pack the cluster with shards and replicas
Speeding up typical queries

- SUM() and COUNT() over a few low-cardinality dimensions
- Global overview (trends and monitoring)
- Storing intermediate state for non-additive functions
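"Intermediate state for non-additive functions" matters because a quantile like a p50 cannot be summed across rollups the way a COUNT() can: the median of two shard medians is not the median of the combined data. What can be combined is each shard's intermediate state — the idea behind ClickHouse's -State/-Merge aggregate combinators. A toy Python illustration, using the raw sample list as the "state" (ClickHouse uses compact sketches instead):

```python
from statistics import median

def partial_state(response_times):
    """Per-shard intermediate state; here, simply the sorted samples."""
    return sorted(response_times)

def merge_states(states):
    """Combine states from many shards/time buckets, then finalize the p50."""
    merged = sorted(t for s in states for t in s)
    return median(merged)

shard_a = partial_state([0.4, 0.5, 0.6])   # per-shard median would be 0.5
shard_b = partial_state([0.7, 0.8])        # per-shard median would be 0.75
print(merge_states([shard_a, shard_b]))    # → 0.6, the true combined median
```

Neither 0.5 nor 0.75 (nor any average of them) recovers 0.6, which is why the rollup tables store mergeable states rather than finalized values.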
ClickHouse Cluster: today

- Raw logs are inserted into one replicated, sharded ReplicatedMergeTree table (r{0,2}.dnslogs)
- Multiple r{0,2} databases to better pack the cluster with shards and replicas
- ReplicatedAggregatingMergeTree aggregate tables (dnslogs_rollup_X) for long-term storage
Timeline

- October 2016: Began evaluating technologies and architecture; 1 instance in Docker
- November 2016: Prototype ClickHouse cluster with 3 nodes, inserting a sample of data
- December 2016: Finalized schema, deployed a production ClickHouse cluster of 6 nodes
- Spring 2017: TopN, IP prefix matching, Go native ClickHouse driver, Analytics library, pkey in monotonic functions; visualisations with Superset and Grafana
- August 2017: Migrated to a new cluster with multi-tenancy; growing interest among other Cloudflare engineering teams, worked on standard tooling

Multi-tenant ClickHouse cluster

- 33 nodes
- 8M+ row insertions/second
- 4GB+ insertion throughput/second
- 2PB+ of RAID-0 spinning disks
ClickHouse Today… 12 trillion rows

SELECT table, sum(rows) AS total
FROM system.cluster_parts
WHERE database = 'r0'
GROUP BY table
ORDER BY total DESC

┌─table──────────────────┬─────────────total─┐
│ ██████████████████████ │ 9,051,633,001,267 │
│ ██████████████████████ │ 2,088,851,716,078 │
│ ██████████████████████ │   847,768,860,981 │
│ ██████████████████████ │   259,486,159,236 │
│ …                      │                 … │
└────────────────────────┴───────────────────┘