Streaming Log Analytics with Kafka (Kresten Krab Thorup, Humio CTO)

Streaming Log Analytics with Kafka
Kresten Krab Thorup, Humio CTO
Log Everything, Answer Anything, In Real-Time.

Why this talk?
• Humio is a log analytics system.
• Designed to run “on-prem”.
• High volume, real-time responsiveness.
• We decided to delegate the ‘hard parts’ of distributed systems to Kafka. This is a talk about our experiences.

Data-Driven SecOps (diagram)
• Sources: 30k PCs, 6 ADs, 2k servers, Bro network data.
• Load: ~1M/sec, 20 TB/day.
• Humio: CEP for alerts/dashboards, plus a log store for incident response.

Humio Ingest Data Flow: Agent → Ingest (API) → Digest → Storage
• Agent: send data.
• Ingest: HTTP/TCP API, authenticate, field extraction.
• Digest: streaming queries, write segment files.
• Storage: replication.

Query example: /error/i | count()
• The query runs as a state machine (diagram): one instance over incoming data reports count: 473; one over the Event Store reports count: 243,565.
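A minimal sketch of how a query like `/error/i | count()` can be treated as a state machine that is fed events one at a time. The same machine can run over live data (as a materialized view) or over stored events (as a brute-force scan), and the two partial results can then be merged. The class and method names are hypothetical, not Humio's query engine. Merging by addition is only correct for `count()` and only when the live and historic event sets do not overlap.

```python
import re

# Hypothetical compiled form of the query `/error/i | count()`:
# a case-insensitive regex filter feeding a count aggregate.
ERROR_RE = re.compile(r"error", re.IGNORECASE)


class CountState:
    """Query state machine: consumes events one at a time, keeps a count."""

    def __init__(self) -> None:
        self.count = 0

    def feed(self, event: str) -> None:
        # Filter step: only matching events advance the aggregate.
        if ERROR_RE.search(event):
            self.count += 1

    def merge(self, other: "CountState") -> "CountState":
        # Partial counts from disjoint sources combine by addition.
        merged = CountState()
        merged.count = self.count + other.count
        return merged


# Live state: fed as events arrive (like a materialized view).
live = CountState()
for event in ["GET /index 200", "Error: disk full", "ERROR timeout"]:
    live.feed(event)

# Historic state: a brute-force scan over stored events.
historic = CountState()
for event in ["startup ok", "error: bad request"]:
    historic.feed(event)

total = live.merge(historic)
print(live.count, historic.count, total.count)  # 2 1 3
```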

Humio Query Flow: Browser → API → Digest / Storage
• Browser: start query, poll status.
• API: initiate query, schedule polls, merge results.
• Digest: provide results for live data (materialized view).
• Storage: provide results for historic data (ad-hoc query).

Real-time Processing
• “Materialized views” for dashboards/alerts.
• Processed when data is in memory anyway.
• Fast response times for “known” queries.

Brute-Force Search
• Shift CPU load to query time.
• Data compression.
• Allows ad-hoc queries.
• Requires “full stack” ownership.

Use Kafka for the ‘hard parts’
• Coordination
• Commit log / ingest buffer
• Transient data
• No KSQL

Kafka 101
• Kafka is a reliable distributed log/queue system.
• A Kafka topic consists of a number of partitions.
• Messages within a partition are sequenced.
• Partitions are replicated for durability.
• Use ‘partition consumers’ to parallelise work.

Kafka 101 (diagram): a producer writes to a topic, choosing partition=hash(key) among partitions #1, #2 and #3; consumers read the partitions.
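The `partition=hash(key)` rule can be sketched as below. `zlib.crc32` is a stand-in stable hash chosen for illustration. Kafka's default partitioner hashes the serialized key bytes with murmur2, so real partition assignments differ. Python's built-in `hash()` is avoided because it is salted per process for strings. What carries over is the property that the same key always maps to the same partition, which preserves per-key ordering.

```python
import zlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition as hash(key) mod num_partitions.

    zlib.crc32 is an illustrative stable hash; Kafka's default
    partitioner uses murmur2 on the key bytes instead.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys: List[str] = ["host-a", "host-b", "host-a"]
parts = [partition_for(k, 3) for k in keys]

# The same key always lands on the same partition.
print(parts)
```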

Coordination: ‘global data’
• ZooKeeper-like system, in-process.
• Hierarchical key/value store.
• Make decisions locally and fast, without crossing a network boundary.
• Allows in-memory indexes of metadata.

Coordination: ‘global data’
• Coordinated via a single-partition Kafka topic.
• Ops-based, CRDT-style event sourcing.
• Bootstrap from a snapshot from any node.
• Kafka config: tuned for low latency.
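A minimal sketch of event-sourced global state read from a single-partition log, assuming a hypothetical op format `(offset, op, path, value)` over a hierarchical key/value store. Because there is only one partition, every node applies ops in the same order. A new node copies a snapshot from any peer and replays only the ops past that snapshot's offset. The sketch shows ordered replay with offset-based de-duplication; it does not model the CRDT merge semantics the slide alludes to.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical op format: (offset, op, path, value), read in log order.
Op = Tuple[int, str, str, str]


@dataclass
class GlobalState:
    data: Dict[str, str] = field(default_factory=dict)
    applied_offset: int = -1  # last log offset applied

    def apply(self, offset: int, op: str, path: str, value: str) -> None:
        if offset <= self.applied_offset:
            return  # already applied (e.g. included in the snapshot)
        if op == "put":
            self.data[path] = value
        elif op == "delete":
            self.data.pop(path, None)
        self.applied_offset = offset

    def snapshot(self) -> "GlobalState":
        # Copy the map so later changes do not leak into the snapshot.
        return GlobalState(dict(self.data), self.applied_offset)


def bootstrap(snapshot: GlobalState, log: List[Op]) -> GlobalState:
    """Start from any node's snapshot, then replay the log past its offset."""
    state = snapshot.snapshot()  # leave the donor snapshot untouched
    for offset, op, path, value in log:
        state.apply(offset, op, path, value)
    return state


log: List[Op] = [
    (0, "put", "/repos/web/retention", "30d"),
    (1, "put", "/repos/app/retention", "7d"),
    (2, "delete", "/repos/app/retention", ""),
    (3, "put", "/repos/web/owner", "ops"),
]

node_a = GlobalState()
for entry in log[:2]:
    node_a.apply(*entry)
snap = node_a.snapshot()       # taken at offset 1

node_b = bootstrap(snap, log)  # effectively replays offsets 2 and 3
print(node_b.data)  # {'/repos/web/retention': '30d', '/repos/web/owner': 'ops'}
```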

Log Store Design
• Build a minimal index and compress data: store an order of magnitude more events.
• Fast “grep” for filtering events: filtering and time/metadata selection reduce the problem space.

Event Store (diagram): a sequence of 10 GB segments, each described only by (start-time, end-time, metadata).

Event Store (diagram): each segment is compressed to 1 GB and is still described only by (start-time, end-time, metadata).
• 1 month × 30 GB/day ingest: 90 GB data, <1 MB index.
• 1 month × 1 TB/day ingest: 4 TB data, <1 MB index.
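A quick arithmetic check of the slide's figures, assuming a 30-day month and decimal units (1 TB = 1000 GB). The compression ratios and the per-segment index bound are derived here and are not stated in the talk. The 10× in the first line matches the 10 GB → 1 GB segment compression in the diagram.

```latex
\begin{aligned}
30 \times 30\,\text{GB} &= 900\,\text{GB raw}, & \frac{900\,\text{GB}}{90\,\text{GB}} &= 10\times \\
30 \times 1\,\text{TB} &= 30\,\text{TB raw}, & \frac{30\,\text{TB}}{4\,\text{TB}} &= 7.5\times \\
\frac{4000\,\text{GB}}{1\,\text{GB/segment}} &= 4000\ \text{segments}, & \frac{10^{6}\,\text{B}}{4000} &= 250\,\text{B per index entry (upper bound)}
\end{aligned}
```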

Query (diagram): segments are laid out along a time axis, one row per datasource:
• #ds1, #web: five 1 GB segments.
• #ds1, #app: three 1 GB segments.
• #ds2, #web: two 1 GB segments.

Query (diagram, continued): the same layout, 10 GB in total across the three datasources.
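A sketch of how the minimal (start-time, end-time, metadata) index narrows a query before any brute-force "grep". Segments whose tags or time range do not match are skipped entirely, so only the remainder is scanned. The `Segment` type, the tag names and the time values are hypothetical; the layout mirrors the diagram's five, three and two 1 GB segments.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    # Only the minimal "index" from the slides: time range plus tags.
    start: int            # timestamp of first event (hypothetical units)
    end: int              # timestamp of last event
    tags: Dict[str, str]  # datasource tags
    size_gb: float


def select_segments(segments: List[Segment], tags: Dict[str, str],
                    start: int, end: int) -> List[Segment]:
    """Keep segments whose tags match and whose time range overlaps [start, end]."""
    return [
        s for s in segments
        if all(s.tags.get(k) == v for k, v in tags.items())
        and s.start <= end and s.end >= start
    ]


def row(ds: str, typ: str, n: int) -> List[Segment]:
    return [Segment(i * 100, i * 100 + 99, {"ds": ds, "type": typ}, 1.0)
            for i in range(n)]


segments = row("ds1", "web", 5) + row("ds1", "app", 3) + row("ds2", "web", 2)
total = sum(s.size_gb for s in segments)

# Only the selected segments would then be decompressed and scanned.
hits = select_segments(segments, {"ds": "ds1", "type": "web"}, 150, 320)
scanned = sum(s.size_gb for s in hits)
print(total, scanned)  # 10.0 3.0
```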

Humio Query Flow: Browser → API → Digest / Storage
• Browser: start query, poll status.
• API: schedule query, merge results.
• Digest: provide results for live data (materialized view).
• Storage: provide results for historic data (ad-hoc query).

Durability
• Don’t lose people’s data.
• Control and manage data life expectancy.
• Store, replicate, archive: multi-tier data storage.

Durability: Agent → Ingest → Kafka → Digest → Storage
• Agent: send data.
• Ingest: authenticate, field extraction.
• Digest: streaming queries, write segment files.
• Storage: replication, queries on ‘old data’.

Durability: Agent/API → Ingest → Kafka
• An HTTP 200 response means Kafka has ACK’ed the store.
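A sketch of the "HTTP 200 only after Kafka ACK" rule. It assumes a hypothetical `send(topic, body)` function that returns a `concurrent.futures.Future`, resolving once the broker acknowledges the write. With a real Kafka producer, the `acks=all` setting makes that acknowledgement wait for the in-sync replicas. If the acknowledgement fails or times out, the handler answers 503 so the agent retries rather than assuming the data is stored.

```python
from concurrent.futures import Future
from typing import Callable, Tuple

# Hypothetical producer interface: returns a Future that resolves when the
# broker has acknowledged the write.
SendFn = Callable[[str, bytes], Future]


def handle_ingest(send: SendFn, topic: str, body: bytes,
                  timeout_s: float = 5.0) -> Tuple[int, str]:
    """Return an HTTP status only after Kafka has ACKed the batch."""
    try:
        send(topic, body).result(timeout=timeout_s)
    except Exception as exc:  # timeout or broker error: the client must retry
        return 503, f"not stored: {exc}"
    return 200, "stored"


# Fake producers for illustration only.
def acking_send(topic: str, body: bytes) -> Future:
    fut: Future = Future()
    fut.set_result((topic, 0, 42))  # (topic, partition, offset) stand-in
    return fut


def failing_send(topic: str, body: bytes) -> Future:
    fut: Future = Future()
    fut.set_exception(RuntimeError("broker unavailable"))
    return fut


print(handle_ingest(acking_send, "ingest", b"log line"))   # (200, 'stored')
print(handle_ingest(failing_send, "ingest", b"log line"))  # (503, 'not stored: broker unavailable')
```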

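Durability (diagram): Kafka (buffer) → Digest → WIP segment / QE → segment file
• Each segment file records the last consumed sequence number (Kafka offset).
• After a crash, Digest reads that number from disk and resumes consuming from Kafka.
• Kafka retention must be long enough to cover recovery from a crash.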
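A sketch of crash recovery from the sequence number persisted with a segment file, and of why retention matters. `BufferedLog` is a hypothetical stand-in for one Kafka partition whose oldest records have been deleted by retention. If the offset we need has already expired, the data that was in the WIP segment at crash time is gone, so recovery fails loudly instead of silently skipping records.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BufferedLog:
    """Stand-in for one Kafka partition with retention applied."""
    first_offset: int   # oldest offset still retained
    records: List[str]  # records[i] has offset first_offset + i

    def read_from(self, offset: int) -> List[str]:
        if offset < self.first_offset:
            # Retention already deleted records we never persisted.
            raise RuntimeError(
                f"offset {offset} expired (log starts at {self.first_offset})")
        return self.records[offset - self.first_offset:]


def recover(log: BufferedLog, last_consumed_on_disk: int) -> List[str]:
    """Replay everything after the sequence number recorded in the segment file."""
    return log.read_from(last_consumed_on_disk + 1)


log = BufferedLog(first_offset=100, records=[f"e{i}" for i in range(100, 105)])
print(recover(log, 102))  # ['e103', 'e104']

try:
    recover(log, 97)      # the outage lasted longer than retention
except RuntimeError as err:
    print(err)            # offset 98 expired (log starts at 100)
```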

Durability (diagram): Ingest → Kafka (buffer) → Digest (WIP segment, QE), with ingest latency plotted at p50 and p99.

Hash? (diagram): producer → partition=hash(key) → partitions #1, #2, #3 → consumers, asking which key to hash.

Partitions falling behind…
• Reasons:
  • data volume;
  • processing time for real-time processing.
• Measure ingest latency.
• Increase parallelism when running 10 seconds behind.
• Scale parallelism on a log scale (1, 2, 4, …) by adding randomness to the key.
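A sketch of log-scale fan-out by key salting, under stated assumptions. When the measured ingest latency for a key exceeds a threshold (10 s here, following the slide), that key's fan-out doubles (1, 2, 4, …) up to a hypothetical cap. A random salt in `[0, fanout)` is appended to the key before the partition hash. `SaltedKeyer`, the `#` salt separator and `zlib.crc32` as the hash are all illustrative choices. Note the trade-off: events for a salted key are no longer strictly ordered across partitions, and shrinking the fan-out again is not shown.

```python
import random
import zlib
from typing import Dict, Optional


class SaltedKeyer:
    """Hypothetical sketch: spread a hot key over 1, 2, 4, ... salts."""

    def __init__(self, num_partitions: int, threshold_s: float = 10.0,
                 max_fanout: int = 64,
                 rng: Optional[random.Random] = None) -> None:
        self.num_partitions = num_partitions
        self.threshold_s = threshold_s
        self.max_fanout = max_fanout
        self.fanout: Dict[str, int] = {}  # key -> current fan-out (power of two)
        self.rng = rng or random.Random()

    def observe_latency(self, key: str, latency_s: float) -> None:
        # Double the fan-out whenever this key falls too far behind.
        if latency_s > self.threshold_s:
            current = self.fanout.get(key, 1)
            self.fanout[key] = min(current * 2, self.max_fanout)

    def partition(self, key: str) -> int:
        salt = self.rng.randrange(self.fanout.get(key, 1))
        salted = f"{key}#{salt}"
        return zlib.crc32(salted.encode("utf-8")) % self.num_partitions


keyer = SaltedKeyer(num_partitions=16, rng=random.Random(7))
before = {keyer.partition("#type=accesslog") for _ in range(100)}
keyer.observe_latency("#type=accesslog", 3.0)   # fast enough: no change
for _ in range(3):                              # slow: 1 -> 2 -> 4 -> 8
    keyer.observe_latency("#type=accesslog", 12.0)
after = {keyer.partition("#type=accesslog") for _ in range(100)}
print(len(before), keyer.fanout["#type=accesslog"], len(after))
```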

Data Sources: topic multiplexing (diagram): on the order of 100,000 data sources are multiplexed onto the partitions (#1, #2, #3, …) of a topic.

Data Model: a Repository has many Data Sources; a Data Source has many Events.
• Repository: storage limits, user admin.
• Data Source: a time series identified by its set of key-value ‘tags’, i.e. Hash(#type=accesslog,#host=ops01).
• Event: timestamp + Map[String,String].
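A sketch of identifying a data source by hashing its tag set. The tags are sorted first so that the same set in a different order names the same time series. SHA-256 and the 16-hex-digit truncation are illustrative assumptions, not Humio's actual scheme.

```python
import hashlib
from typing import Dict


def datasource_id(tags: Dict[str, str]) -> str:
    """Identify a data source by hashing its (sorted) tag set.

    SHA-256 and the truncation length are illustrative choices.
    """
    canonical = ",".join(f"#{k}={v}" for k, v in sorted(tags.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


a = datasource_id({"type": "accesslog", "host": "ops01"})
b = datasource_id({"host": "ops01", "type": "accesslog"})
c = datasource_id({"type": "accesslog", "host": "ops02"})
print(a == b, a == c)  # True False
```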

High-variability tags: ‘auto grouping’
• Tags (the hash key) may be chosen with a large value domain, e.g.:
  • user name;
  • IP address.
• This causes many datasources, which leads to growth in metadata and resource issues.

High-variability tags: ‘auto grouping’
• Tags (the hash key) may be chosen with a large value domain, e.g.:
  • user name;
  • IP address.
• Humio detects this and hashes the tag value into a smaller value domain before the Kafka partition hash.

High-variability tags: ‘auto grouping’
• For example, before Kafka ingest, #user=kresten becomes #user=13 via hash(“kresten”).
• The actual value ‘kresten’ is stored in the event.
• At query time, a search for #user=kresten is rewritten to search the data source #user=13 and then re-filter on the actual values.
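A sketch of auto grouping and the matching query rewrite. It assumes a hypothetical bucket count of 16 and uses `zlib.crc32` as the value hash, so bucket numbers will not match the slide's illustrative `#user=13`. The datasource tag carries only the bucket, the event keeps the real value, and a search goes to the bucket's datasource and then re-filters, because several users can share one bucket.

```python
import zlib
from typing import Dict, List

BUCKETS = 16  # hypothetical size of the reduced value domain
Event = Dict[str, str]
Store = Dict[str, List[Event]]


def group_value(value: str) -> str:
    """Map a high-variability tag value into a small bucket domain."""
    return str(zlib.crc32(value.encode("utf-8")) % BUCKETS)


def ingest(event: Event, store: Store) -> None:
    # The datasource tag gets the bucket; the event keeps the real value.
    ds = f"#user={group_value(event['user'])}"
    store.setdefault(ds, []).append(event)


def search_user(user: str, store: Store) -> List[Event]:
    # Rewrite #user=<value> to the bucketed datasource, then re-filter,
    # since other users may hash into the same bucket.
    ds = f"#user={group_value(user)}"
    return [e for e in store.get(ds, []) if e["user"] == user]


store: Store = {}
for name in ["kresten", "alice", "bob", "kresten"]:
    ingest({"user": name, "msg": f"login {name}"}, store)

print(len(store) <= BUCKETS, len(search_user("kresten", store)))  # True 2
```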

Multiplexing in Kafka
• Ideally, we would just have 100,000 dynamic topics that perform well and scale infinitely.
• In practice, you have to know your data and control the sharding. Default Kafka configs work for many workloads, but for maximum utilisation you have to go beyond the defaults.

Using Kafka in an On-Prem Product
• Leverage the stability and fault tolerance of Kafka.
• Large customers often have Kafka knowledge.
• We provide Kafka/ZooKeeper Docker images.
• The only real issue is the ZooKeeper dependency:
  • it often runs out of disk space in small setups.

Other Issues
• Observed GC pauses in the JVM.
• Kafka and HTTP libraries compress data:
  • JNI/GC interactions with byte[] can block global GC.
• We replaced both with custom compression:
  • JLibGzip (gzip in pure Java);
  • LZ4/JNI using DirectByteBuffer.

Resetting Kafka/ZooKeeper
• Kafka provides a ‘cluster id’ that we can use as an epoch.
• On such a reset, all Kafka sequence numbers (offsets) are reset.
• Recognise this situation and do not replay beyond such a reset.
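A sketch of using the Kafka cluster id as an epoch for stored offsets. The `Checkpoint` type and the example id values are hypothetical. Offsets are only compared when the saved cluster id matches the current one. After a reset the id differs, so the saved offset is ignored and consumption starts at the beginning of the new log instead of replaying across the reset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    cluster_id: str  # Kafka cluster id when the offset was recorded
    offset: int      # last consumed offset in that cluster


def replay_start(saved: Optional[Checkpoint], current_cluster_id: str,
                 log_start_offset: int) -> int:
    """Decide where to resume consuming.

    Offsets are comparable only within one cluster id (the epoch). If the
    cluster was reset, the saved offset is meaningless, so start at the
    beginning of the new log rather than replaying across the reset.
    """
    if saved is None or saved.cluster_id != current_cluster_id:
        return log_start_offset
    return saved.offset + 1


saved = Checkpoint(cluster_id="abc", offset=5000)
print(replay_start(saved, "abc", 0))  # 5001: same epoch, resume
print(replay_start(saved, "xyz", 0))  # 0: cluster reset, offsets restarted
```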

What about KSQL?
• Kafka now has KSQL, which is in many ways similar to the engine we built.
• Humio moves computation to the data; KSQL moves the data to the computation.
• We provide an interactive, end-user-friendly experience.

Final thoughts
• Many difficult problems go away by using Kafka.
• We’ve been happy with the decision to defer the ‘hard parts’ of distributed systems to Kafka.
• Some day we may build our own persistent commit log, but for now it is not worth the trouble.

Thanks for your time.
Kresten Krab Thorup, Humio CTO

Filter 1GB data
