Routing Trillions of Events Per Day @Twitter #ApacheBigData 2017 Lohit VijayaRenu & Gary Steelman @lohitvijayarenu @efsie
In this talk: 1. Event Logs at Twitter 2. Log Collection 3. Log Processing 4. Log Replication 5. The Future 6. Questions
Overview
Life of an Event
[Diagram: HTTP clients and clients with a local Client Daemon send events to an HTTP Endpoint; events are aggregated by Category and written to Storage (HDFS)]
● Clients log events specifying a Category name, e.g. ads_view, login_event ...
● Events are grouped together across all clients into the Category
● Events are stored on the Hadoop Distributed File System, bucketed every hour into separate directories (see the path sketch below)
  ○ /logs/ads_view/2017/05/01/23
  ○ /logs/login_event/2017/05/01/23
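As a rough illustration of the hourly bucketing above, the sketch below derives a category's hourly HDFS directory from its name and an event timestamp. The helper name and timestamp handling are assumptions made for illustration; only the /logs/&lt;category&gt;/yyyy/mm/dd/hh layout comes from the slide.

```java
// Illustrative sketch only: maps a category name and event time to the hourly
// HDFS directory layout shown on the slide (/logs/<category>/yyyy/mm/dd/hh).
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class CategoryPaths {
  private static final DateTimeFormatter HOURLY =
      DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

  // e.g. category "ads_view" at 2017-05-01T23:10Z -> /logs/ads_view/2017/05/01/23
  public static String hourlyDir(String category, long eventTimeMillis) {
    ZonedDateTime hour =
        ZonedDateTime.ofInstant(Instant.ofEpochMilli(eventTimeMillis), ZoneOffset.UTC);
    return "/logs/" + category + "/" + HOURLY.format(hour);
  }
}
```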
Event Log Stats
● >1 Trillion events a day, across millions of clients
● ~3 PB of data a day, incoming uncompressed
● >600 categories (event groups by category)
● <1500 nodes, collocated with HDFS DataNodes
Event Log Architecture
[Diagram: remote clients over HTTP and clients inside the datacenter (via a local log collection daemon) send events; aggregators group log events by Category; the Log Processor and Log Replicator move data into Storage (HDFS) and Storage (Streaming) destinations]
Event Log Architecture
[Diagram (cross-datacenter view): events flow into DC1 and DC2; each datacenter has RT Storage (HDFS), DW Storage (HDFS), and Prod Storage (HDFS), plus Cold Storage (HDFS)]
Collection
Event Log Architecture (recap of the diagram above)
Event Collection Overview
● Past: Scribe client daemon → Scribe aggregator daemons
● Present: Scribe client daemon → Flume aggregator daemon
● Future: Flume client daemon → Flume aggregator daemon
Event Collection (Past): Challenges with Scribe
● Too many open file handles to HDFS
  ○ 600 categories x 1500 aggregators x 6 files per hour ≈ 5.4M files per hour
● High IO wait on DataNodes at scale
● Max limit on throughput per aggregator
● Difficult to track message drops
● No longer under active open source development
Event Collection (Present): Apache Flume
[Diagram: Client → Flume Agent (Source → Channel → Sink) → HDFS]
● Well defined interfaces
● Open source
● Concept of transactions (illustrated below)
● Existing implementations of interfaces
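To make the transaction concept above concrete, here is a minimal sketch against the public Flume API of putting one event into a channel inside a transaction. It is illustrative only, not Twitter's production code.

```java
// Illustrative sketch: Flume's channel transaction concept. An event is put into
// the channel inside a transaction; on failure the transaction is rolled back so
// nothing is left half-written for the sink to read.
import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.event.EventBuilder;

public class ChannelTransactionExample {
  public static void putWithTransaction(Channel channel, byte[] body) {
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      Event event = EventBuilder.withBody(body);
      channel.put(event);
      tx.commit();   // event is now visible to the sink
    } catch (RuntimeException e) {
      tx.rollback(); // nothing partial remains in the channel
      throw e;
    } finally {
      tx.close();
    }
  }
}
```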
Event Collection (Present): Category Group
[Diagram: Categories 1, 2, 3 from Agents 1, 2, 3 combined into a Category Group]
● Combine multiple related categories into a category group
● Provide different properties per group
● Contains multiple events to generate fewer combined sequence files
Event Collection (Present): Aggregator Group
[Diagram: Category Groups 1 and 2 mapped to Aggregator Group 1 and Aggregator Group 2, each a set of agents (Agent 1 ... Agent 8)]
● A set of aggregators hosting the same set of category groups
● Easy to manage a group of aggregators hosting a subset of categories
Event Collection (Present): Flume features to support groups
● Extend Interceptor to multiplex events into groups (see the sketch after this list)
● Implement Memory Channel Group to have a separate memory channel per category group
● ZooKeeper registration per category group for service discovery
● Metrics for category groups
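The sketch below shows what an interceptor that multiplexes events into groups could look like against the public Flume Interceptor API. The "category" and "group" header names and the category-to-group mapping are hypothetical; this is not Twitter's actual implementation.

```java
// Hypothetical sketch: a Flume Interceptor that tags each event with a
// category-group header so a multiplexing channel selector can route it to the
// channel for its group.
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class CategoryGroupInterceptor implements Interceptor {
  private final Map<String, String> categoryToGroup; // e.g. ads_view -> ads_group (illustrative)

  private CategoryGroupInterceptor(Map<String, String> categoryToGroup) {
    this.categoryToGroup = categoryToGroup;
  }

  @Override public void initialize() {}

  @Override
  public Event intercept(Event event) {
    String category = event.getHeaders().get("category");      // assumed header name
    String group = categoryToGroup.getOrDefault(category, "default_group");
    event.getHeaders().put("group", group);                     // assumed header name
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event e : events) {
      intercept(e);
    }
    return events;
  }

  @Override public void close() {}

  public static class Builder implements Interceptor.Builder {
    private Map<String, String> mapping = Collections.emptyMap();

    @Override
    public void configure(Context context) {
      // A real implementation would load the category-to-group mapping from
      // configuration; kept empty here.
    }

    @Override
    public Interceptor build() {
      return new CategoryGroupInterceptor(mapping);
    }
  }
}
```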
Event Collection (Present): Flume performance improvements
● HDFSEventSink batching increased throughput (5x), reducing spikes on the memory channel
● Implement buffering in HDFSEventSink instead of using SpillableMemoryChannel
● Stream events close to network speed
Processing
Event Log Architecture (recap of the diagram above)
Log Processor Stats: Processing a Trillion Events per Day
● 8 wall clock hours to process one day of data
● >1 PB of data per day: output of cleaned, compressed, consolidated, and converted data
● 20-50% disk space saved by processing Flume sequence files
Log Processor Needs: Processing a Trillion Events per Day
● Make processing log data easier for analytics teams
● Disk space is at a premium on analytics clusters
● Still too many files, causing increased pressure on the NameNode
● Log data is read many times, and different teams all perform the same pre-processing steps on the same data sets
Log Processor Steps (Datacenter 1): Category Groups → Demux Jobs → Categories
● ads_group/yyyy/mm/dd/hh → ads_group_demuxer → ads_click/yyyy/mm/dd/hh, ads_view/yyyy/mm/dd/hh
● login_group/yyyy/mm/dd/hh → login_group_demuxer → login_event/yyyy/mm/dd/hh
Log Processor Steps
1. Decode: Base64 encoding from logged data
2. Demux: category groups into individual categories for easier consumption by analytics teams
3. Clean: corrupt, empty, or invalid records so data sets are more reliable
4. Compress: logged data at the highest level to save disk space, from LZO level 3 to LZO level 7
5. Consolidate: small files to reduce pressure on the NameNode
6. Convert: some categories into Parquet for fastest use in ad-hoc exploratory tools
Why Base64 Decoding? Legacy Choices
● Scribe's contract amounts to sending a binary blob to a port
● Scribe used newline characters to delimit records in a binary-blob batch of records
● Valid records may include newline characters
● Scribe Base64-encoded received binary blobs to avoid confusion with the record delimiter (see the sketch below)
● Base64 encoding is no longer necessary because we have moved to one serialized Thrift object per binary blob
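To make the legacy framing concrete, here is a minimal sketch of why Base64 was needed: each record is Base64-encoded so the newline used as a record delimiter can never appear inside a record, and decoding splits on newlines and reverses the encoding. This is illustrative only, not Scribe's or Twitter's actual code.

```java
// Minimal sketch of the legacy record framing described above (not Scribe's code):
// Base64-encode each record so '\n' is safe as a batch delimiter, then decode a
// batch by splitting on '\n' and Base64-decoding each line.
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class LegacyBatchFraming {
  // Encode a batch: Base64 each record, join with newlines.
  public static byte[] encodeBatch(List<byte[]> records) {
    StringBuilder sb = new StringBuilder();
    for (byte[] record : records) {
      sb.append(Base64.getEncoder().encodeToString(record)).append('\n');
    }
    return sb.toString().getBytes();
  }

  // Decode a batch: split on newlines, Base64-decode each line back to the raw record.
  public static List<byte[]> decodeBatch(byte[] blob) {
    List<byte[]> records = new ArrayList<>();
    for (String line : new String(blob).split("\n")) {
      if (!line.isEmpty()) {
        records.add(Base64.getDecoder().decode(line));
      }
    }
    return records;
  }
}
```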
Log Demux Visual
/raw/ads_group/yyyy/mm/dd/hh/ads_group_1.seq → DEMUX → /logs/ads_view/yyyy/mm/dd/hh/1.lzo, /logs/ads_click/yyyy/mm/dd/hh/1.lzo
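As a minimal sketch of the demux step pictured above (not the actual Tez job), the code below buckets raw records by their category and computes the per-category hourly output file they would be written to. The record representation and method names are hypothetical.

```java
// Hypothetical sketch of the demux mapping pictured above: records read from a
// category-group file are bucketed by category so each category can be written
// to its own hourly output file (e.g. /logs/ads_view/yyyy/mm/dd/hh/1.lzo).
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DemuxSketch {
  // A raw event as it might appear in a group sequence file: category plus payload.
  public static class RawEvent {
    public final String category;
    public final byte[] payload;
    public RawEvent(String category, byte[] payload) {
      this.category = category;
      this.payload = payload;
    }
  }

  // Group events by category; a real job would stream these to per-category writers.
  public static Map<String, List<byte[]>> demux(List<RawEvent> groupFileRecords) {
    Map<String, List<byte[]>> byCategory = new HashMap<>();
    for (RawEvent event : groupFileRecords) {
      byCategory.computeIfAbsent(event.category, c -> new ArrayList<>()).add(event.payload);
    }
    return byCategory;
  }

  // Per-category output file for a given hour and part number, mirroring the slide.
  public static String outputFile(String category, String yyyyMmDdHh, int part) {
    return "/logs/" + category + "/" + yyyyMmDdHh + "/" + part + ".lzo";
  }
}
```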
Log Processor Daemon
● One log processor daemon per RT Hadoop cluster, where Flume aggregates logs
● Primarily responsible for demuxing category groups out of the Flume sequence files
● The daemon schedules Tez jobs every hour for every category group in a thread pool
● The daemon atomically presents processed category instances so partial data can't be read (see the sketch after this list)
● Processing proceeds according to criticality of data, or "tiers"
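A hedged sketch of the scheduling and atomic-publish ideas above: an hourly scheduler submits one demux task per category group to a thread pool, and finished output is moved into place with an HDFS rename so readers never see partial data. The paths and the job-submission call are stand-ins; the real daemon launches Tez jobs.

```java
// Hedged sketch of the daemon's hourly scheduling and atomic publish described above.
// submitDemuxJob() is a stand-in for launching the real Tez job; the staging-dir
// plus rename pattern illustrates "atomically presents processed category instances".
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogProcessorDaemonSketch {
  private static final DateTimeFormatter HOURLY = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final ExecutorService workers = Executors.newFixedThreadPool(8); // one task per category group

  public void start(List<String> categoryGroups) {
    // Every hour, submit one demux task per category group to the thread pool.
    scheduler.scheduleAtFixedRate(() -> {
      String hour = HOURLY.format(ZonedDateTime.now(ZoneOffset.UTC).minusHours(1));
      for (String group : categoryGroups) {
        workers.submit(() -> runDemux(group, hour));
      }
    }, 0, 1, TimeUnit.HOURS);
  }

  private void runDemux(String group, String hour) {
    try {
      Path staging = new Path("/processing/" + group + "/" + hour);   // illustrative path
      submitDemuxJob(group, hour, staging);                            // stand-in for the hourly Tez job
      // The real daemon publishes each demuxed category instance; a single rename is
      // shown for brevity. HDFS rename is atomic, so readers see either no data or the
      // complete hour, never partial output.
      FileSystem fs = FileSystem.get(new Configuration());
      fs.rename(staging, new Path("/logs/" + group + "/" + hour));
    } catch (Exception e) {
      // A real daemon would retry and emit metrics here.
    }
  }

  private void submitDemuxJob(String group, String hour, Path stagingDir) {
    // Placeholder: the actual daemon builds and runs a Tez DAG for this category group.
  }
}
```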
Why Tez?
● Some categories are significantly larger than others (KBs vs. TBs)
● MapReduce demux? Each reducer handles a single category
● Streaming demux? Each spout or channel handles a single category
● Massive skew in partitioning by category causes long-running tasks, which slows down job completion
● Relatively well understood fault tolerance semantics, similar to MapReduce, Spark, etc.