Building a Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned
Yaroslav Tkachenko, Senior Data Engineer at Activision
Data lake size (AWS S3): 1+ PB
Number of topics in the biggest cluster (Apache Kafka): 500+
Messages per second (Apache Kafka): 10k - 100k+
Scaling the data pipeline even further
[Diagram: two dimensions of scaling. Volume can be handled with industry best practices and previous experience; the Games and Use-cases dimension (Complexity) is completely unpredictable.]
Kafka topics are partitioned and replicated
[Diagram: a Kafka topic split into Partition 1, Partition 2 and Partition 3, each an ordered log of numbered messages; a Producer appends to one of the partitions and a Consumer reads from them.]
Scaling the pipeline in terms of Volume
Producers Consumers
Scaling producers
• Asynchronous / non-blocking writes (default)
• Compression and batching
• Sampling
• Throttling
• Acks? 0 (fire-and-forget), 1 (leader only), -1 (all in-sync replicas)
• Standard Kafka producer tuning: batch.size, linger.ms, buffer.memory, etc. (see the sketch below)
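A minimal sketch of those tuning knobs, assuming the standard Kafka Java client; the broker address, topic name and concrete values are illustrative, not Activision's actual settings.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("acks", "1");               // 0 = no ack, 1 = leader ack, -1/all = all in-sync replicas
        props.put("compression.type", "lz4"); // compress batches on the wire
        props.put("batch.size", 65536);       // bytes to accumulate per partition batch
        props.put("linger.ms", 50);           // wait up to 50 ms to fill a batch
        props.put("buffer.memory", 67108864); // total memory for unsent records

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous / non-blocking by default; the callback fires on completion
            producer.send(
                new ProducerRecord<>("prod.glutton.1234.telemetry_match_event-v1", new byte[0]),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                });
        }
    }
}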
Proxy
Each approach has pros and cons
Direct connection to Kafka:
• Simple
• Flexible
• Low-latency connection
• But the number of TCP connections per broker starts to look scary
• And it's really hard to do maintenance on Kafka clusters
Proxy in front of Kafka:
• Possible to do basic enrichment
• Easier to manage Kafka clusters
Simple rule for high-performance producers? Just write to Kafka, nothing else¹.
¹ Not even auth?
Scaling Kafka clusters
• Just add more nodes!
• Disk I/O is extremely important
• Tuning num.io.threads and num.network.threads
• Retention
• For more: “Optimizing Your Apache Kafka Deployment” whitepaper from Confluent
It’s not always about tuning. Sometimes we need more than one cluster. Different workloads require different topologies.
Example cluster profiles:
● Ingestion (HTTP Proxy): medium retention
● Stream processing: short retention, more partitions
● Consumer-facing: lots of consumers, long retention, high SLA, ACLs
Scaling consumers is usually pretty trivial - just increase the number of partitions. Unless… you can’t. What then?
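For context, a minimal sketch of the usual scaling path, assuming the standard Kafka Java client (the topic name comes from the naming convention later in the deck; the group id is hypothetical): run more instances of the same consumer group and Kafka balances partitions across them, but a partition is never shared, so once instances outnumber partitions the extras sit idle.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ScalableConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "telemetry-archiver");      // all instances share one group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("telemetry.matches"));
            while (true) {
                // Partitions are balanced across the instances of the group,
                // so the partition count caps consumer parallelism.
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, byte[]> record : records) {
                    process(record.value());
                }
            }
        }
    }

    private static void process(byte[] payload) { /* handle one message */ }
}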
[Diagram: the alternative consumption architecture. An Archiver consumes from the Message Queue, writes microbatches to Block Storage and publishes metadata to a Work Queue; a Populator picks up the metadata from the Work Queue and loads the microbatches from Block Storage.]
Even if you can add more partitions
• You can still have bottlenecks within a single partition (large messages)
• In case of reprocessing, it's really hard to quickly add A LOT of new partitions AND remove them afterwards
• Also, the number of partitions is not infinite
You can’t be sure about any improvements without load testing. Not only for a cluster, but for producers and consumers too.
Scaling and extending the pipeline in terms of Games and Use-cases
We need to keep the number of topics and partitions low
• More topics means more operational burden
• The number of partitions in a fixed cluster is not infinite
• Autoscaling Kafka is impossible, and scaling it is hard
Topic naming convention
$env.$source.$title.$category-$version
Example: prod.glutton.1234.telemetry_match_event-v1
($source is the producer, e.g. glutton; $title is the unique game id, e.g. 1234 = “CoD WW2 on PSN”)
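A hypothetical helper (names are illustrative, not Activision code) just to show how the convention maps to a concrete topic name:

public final class TopicNames {
    // Builds $env.$source.$title.$category-$version
    public static String of(String env, String source, long titleId, String category, int version) {
        return String.format("%s.%s.%d.%s-v%d", env, source, titleId, category, version);
    }

    public static void main(String[] args) {
        // prints: prod.glutton.1234.telemetry_match_event-v1
        System.out.println(of("prod", "glutton", 1234, "telemetry_match_event", 1));
    }
}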
A proper solution was invented decades ago. Think about databases.
A messaging system IS a form of a database.
Data topic = Database + Table.
Data topic = Namespace + Data type.
Compare this:
prod.glutton.1234.telemetry_match_event-v1 → telemetry.matches
dev.user_login_records.4321.all-v1 → user.logins
prod.marketplace.5678.purchase_event-v1 → marketplace.purchases
Each approach has pros and cons
In favor of metadata in topic names:
• Topics that use metadata for their names are obviously easier to track and monitor (and even consume).
• As a consumer, I can consume exactly what I want, instead of consuming a single large topic and extracting the required values.
In favor of data-type topic names:
• These dynamic fields can and will change: producers (sources) and consumers will change.
• Very efficient utilization of topics and partitions.
• Finally, it's impossible to enforce any constraints with a topic name, and you can always end up with dev data in a prod topic and vice versa.
After removing the necessary metadata from the topic names, stream processing becomes mandatory.
Stream processing becomes mandatory: Measuring → Validating → Enriching → Filtering & routing
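A minimal Kafka Streams sketch of that flow; the deck does not name the stream processing framework Activision uses, and the application id, input topic and helper methods here are hypothetical.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class RoutingTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "operational-router"); // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass().getName());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass().getName());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<byte[], byte[]> ingested = builder.stream("ingestion.raw");   // hypothetical input topic

        ingested
            .filter((key, value) -> isValid(value))    // validate (and measure) incoming envelopes
            .mapValues(RoutingTopology::enrich)        // enrich, e.g. stamp ingestion time
            .to((key, value, ctx) -> routeFor(value)); // route to a data-type topic such as telemetry.matches

        new KafkaStreams(builder.build(), props).start();
    }

    private static boolean isValid(byte[] envelope) { return envelope != null && envelope.length > 0; }
    private static byte[] enrich(byte[] envelope) { return envelope; }
    private static String routeFor(byte[] envelope) { return "telemetry.matches"; }
}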
Having a single message schema for a topic is more than just a nice-to-have.
Number of supported message formats: 8
[Diagram: a stream processor receiving messages in many different formats, including JSON, Protobuf, Avro, custom and unknown ones.]
Custom deserialization

// Application.java
props.put("value.deserializer", "com.example.CustomDeserializer");

// CustomDeserializer.java
public class CustomDeserializer implements Deserializer<???> {
    @Override
    public ??? deserialize(String topic, byte[] data) {
        ???
    }
}
Message envelope anatomy: a Header / Metadata section (ID, env, timestamp, source, game, ...) plus a Body / Payload section carrying the event message.
Unified message envelope

syntax = "proto2";

message MessageEnvelope {
    optional bytes message_id = 1;
    optional uint64 created_at = 2;
    optional uint64 ingested_at = 3;
    optional string source = 4;
    optional uint64 title_id = 5;
    optional string env = 6;
    optional UserInfo resource_owner = 7;
    optional SchemaInfo schema_info = 8;
    optional string message_name = 9;
    optional bytes message = 100;
}
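The envelope is what fills in the ??? in the custom deserializer above. A sketch, assuming the protobuf-generated Java class is available as com.example.MessageEnvelope (the package and class layout depend on the .proto's Java options):

package com.example;

import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;
import com.google.protobuf.InvalidProtocolBufferException;

public class CustomDeserializer implements Deserializer<MessageEnvelope> {
    @Override
    public MessageEnvelope deserialize(String topic, byte[] data) {
        if (data == null) return null;
        try {
            // Every topic carries the same envelope; the payload format is described by
            // schema_info / message_name, and the raw event bytes live in the message field.
            return MessageEnvelope.parseFrom(data);
        } catch (InvalidProtocolBufferException e) {
            throw new SerializationException("Failed to parse MessageEnvelope from topic " + topic, e);
        }
    }
}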
Schema Registry
• API to manage message schemas
• Single source of truth for all producers and consumers
• It should be impossible to send a message to the pipeline without registering its schema in the Schema Registry!
• A good Schema Registry supports immutability, versioning and basic validation
• Activision uses a custom Schema Registry implemented with Python and Cassandra (a hypothetical client interface is sketched below)
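To make that contract concrete, a purely hypothetical client interface; the deck only says the registry is a custom Python/Cassandra service, so none of these type or method names are real.

public interface SchemaRegistryClient {
    /** Register a new, immutable schema version for a message type; returns the assigned version. */
    int register(String messageName, String schemaDefinition);

    /** Fetch a specific schema version so consumers can decode the envelope payload. */
    String fetch(String messageName, int version);

    /** Producers check this before sending: unregistered (messageName, version) pairs must be rejected. */
    boolean isRegistered(String messageName, int version);
}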
Summary
• Kafka tuning and best practices matter
• Invest in good SDKs for producing and consuming data
• A unified message envelope and topic names make adding a new game almost effortless
• “Operational” stream processing makes this possible. Make sure you can support ad-hoc filtering and routing of data
• Topic names should express data types, not producer or consumer metadata
• Schema Registry is a must-have
Thanks! @sap1ens