RPC Metrics at Google. JBD, Google (@rakyll)

Jan 23, 2024

SLIDE 1

RPC Metrics at Google

JBD, Google (@rakyll)

SLIDE 2

gRPC Metrics at Google

JBD, Google (@rakyll)

SLIDE 3

Request Metrics at Google

JBD, Google (@rakyll)

SLIDE 4

"100% is the wrong reliability target for basically everything."

- Benjamin Treynor Sloss, VP of Engineering, Google

SLIDE 5

"A service is available if users cannot tell that there was an outage."

SLIDE 6

Principled way of saying what level of downtime is acceptable.

Error rate
Latency expectations

SLOs
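As a rough illustration of how an SLO translates into an acceptable amount of failure, the sketch below computes an error budget from an availability target. The 99.9% target, the request counts, and the function names `errorBudget` and `meetsSLO` are assumptions for illustration, not taken from the talk.

```go
package main

import (
	"fmt"
	"math"
)

// errorBudget returns roughly how many requests may fail in a window
// while still meeting the availability target (for example 0.999).
func errorBudget(target float64, totalRequests int) int {
	return int(math.Round(float64(totalRequests) * (1 - target)))
}

// meetsSLO reports whether the observed failures fit within the budget.
func meetsSLO(target float64, totalRequests, failed int) bool {
	return failed <= errorBudget(target, totalRequests)
}

func main() {
	const target = 0.999 // assumed SLO: 99.9% of requests succeed
	total, failed := 1000000, 420
	fmt.Println("budget:", errorBudget(target, total))
	fmt.Println("meets SLO:", meetsSLO(target, total, failed))
}
```

A latency expectation could be handled the same way by counting requests slower than a threshold as failures.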

SLIDE 7

[Diagram: Analytics frontend server, Authentication, Reporting, Users, ..., Spanner, Blob Store]

SLIDE 8

Questions infra teams want to ask:

Are we meeting the SLO for the other team?
What’s the impact of a product on infra?
How much do we need to scale up if product grows 10%?

SLIDE 9

High-Cardinality

Breaking down the metrics data...

SLIDE 10

Query the collected data in various ways:

Latency distribution for RPCs originated at Google Analytics.
Requests that took more than 100ms for customer #123.
Compare the request latency initiated at web vs mobile frontend.

SLIDE 11

[Diagram: Analytics frontend server, Authentication, Reporting, Users, ..., Spanner, Blob Store]

originator=analytics;

...
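One way to picture the tag travelling with each request: the caller serializes its tags into a string that rides along with the RPC, and each downstream service parses it back before recording metrics. The sketch below assumes a simple `key=value;` text format modeled on the slide. The function names and the wire format are assumptions, not the encoding Google or gRPC actually uses, and escaping of `=` and `;` inside values is not handled.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// encodeTags serializes tags as "key=value;" pairs in sorted key order.
func encodeTags(tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + tags[k] + ";")
	}
	return b.String()
}

// decodeTags parses the format produced by encodeTags.
// Malformed or empty segments are skipped.
func decodeTags(s string) map[string]string {
	tags := make(map[string]string)
	for _, pair := range strings.Split(s, ";") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) == 2 {
			tags[kv[0]] = kv[1]
		}
	}
	return tags
}

func main() {
	// The frontend attaches its identity before calling downstream services.
	wire := encodeTags(map[string]string{"originator": "analytics"})
	fmt.Println(wire)
	// A downstream service (for example the blob store) recovers the tag.
	fmt.Println(decodeTags(wire)["originator"])
}
```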

SLIDE 12

Blob store read errors by originator

SLIDE 13

Dynamically choose aggregation

(split between recording and aggregation)
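A minimal sketch of what splitting recording from aggregation can look like: the instrumented code only records raw values, and how those values are summarized (a count, or a distribution with particular bucket bounds) is decided separately and can change without touching the recording side. The function names, bucket bounds, and sample values are assumptions for illustration.

```go
package main

import "fmt"

// count aggregates recorded values into the number of measurements.
func count(values []float64) int {
	return len(values)
}

// distribution aggregates recorded values into histogram buckets.
// Bucket i holds values below bounds[i]; the last bucket holds the rest.
func distribution(values, bounds []float64) []int {
	buckets := make([]int, len(bounds)+1)
	for _, v := range values {
		i := 0
		for i < len(bounds) && v >= bounds[i] {
			i++
		}
		buckets[i]++
	}
	return buckets
}

func main() {
	// Recording side: raw latencies in milliseconds, no aggregation chosen yet.
	recorded := []float64{5, 20, 150, 7}

	// Aggregation side: the same data summarized in different ways.
	fmt.Println(count(recorded))
	fmt.Println(distribution(recorded, []float64{10, 100}))
	fmt.Println(distribution(recorded, []float64{6}))
}
```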

SLIDE 14

Exemplars
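An exemplar is commonly understood as a concrete sample (often a trace ID) kept alongside an aggregated bucket, so that an interesting bucket can be traced back to a real request. The sketch below keeps the most recent exemplar per histogram bucket. The types, the keep-the-latest policy, and the trace IDs are assumptions, not a description of Google's implementation.

```go
package main

import "fmt"

// Exemplar is one concrete observation kept as an example for a bucket.
type Exemplar struct {
	Value   float64
	TraceID string
}

// Histogram counts observations per bucket and keeps one exemplar per bucket.
type Histogram struct {
	Bounds    []float64
	Counts    []int
	Exemplars []*Exemplar
}

// NewHistogram creates a histogram with len(bounds)+1 buckets.
func NewHistogram(bounds []float64) *Histogram {
	n := len(bounds) + 1
	return &Histogram{Bounds: bounds, Counts: make([]int, n), Exemplars: make([]*Exemplar, n)}
}

// Observe adds a value to its bucket and replaces that bucket's exemplar.
func (h *Histogram) Observe(v float64, traceID string) {
	i := 0
	for i < len(h.Bounds) && v >= h.Bounds[i] {
		i++
	}
	h.Counts[i]++
	h.Exemplars[i] = &Exemplar{Value: v, TraceID: traceID}
}

func main() {
	h := NewHistogram([]float64{10, 100}) // latency bounds in milliseconds
	h.Observe(4, "trace-a")
	h.Observe(250, "trace-b")
	h.Observe(300, "trace-c")
	slow := h.Exemplars[2]
	fmt.Println(h.Counts[2], slow.TraceID) // a trace to inspect for the slow bucket
}
```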

SLIDE 15

/rpcz and /statz

SLIDE 16

http://server:7777/debug/rpcz

SLIDE 17

Export?

Monarch, Prometheus, and more.

SLIDE 18

import "cloud.google.com/go/pubsub"

SLIDE 19

+

SLIDE 20

Thank you!

JBD, Google jbd@google.com @rakyll