Women in Big Data x Pinterest
Welcome!
Regina Karson, WiBD Chapter Director
Tian-Ying Chang, Engineering Manager
Goku: Pinterest’s In-House Time-Series Database
Tian-Ying Chang, Sr. Staff Engineer Manager, Pinterest
Pinterest
● Discover new ideas and find inspiration to do the things they love
  ○ 300M+ MAU, billions of pins
● Metrics for monitoring site health
  ○ Latency, QPS, CPU, memory
● Metrics about product quality
  ○ MAU, impressions, etc.
● Monitoring service needs to be fast, reliable and scalable
Confidential
Monitoring at Pinterest
● Graphite
  ○ Easy to set up at small scale
  ○ Downsampling supports long-range queries well
  ○ Hard to scale
  ○ Deprecated at Pinterest’s current scale
● OpenTSDB
  ○ Rich query, tagging support
  ○ Easy to scale horizontally with the underlying HBase cluster
  ○ Long latency for high-cardinality data
  ○ Long latency for queries over longer time ranges
    ■ No downsampling
  ○ Heavy GC, worsened by the combination of heavy write QPS and long-range scans
Why OpenTSDB is not a good fit
● HBase schema
  ○ Row key: <metric><timestamp>[<tagk1><tagv1><tagk2><tagv2>...] (metric, tag keys and tag values are each encoded in 3 bytes)
  ○ Column qualifier: <delta to row key timestamp (up to 4 bytes)>
● Unnecessary scan
  ○ Query: m1{rpc=delete} [t1 to t2]
  ○ <m1><t1><host=h1><rpc=delete>
  ○ <m1><t1><host=h1><rpc=get>
  ○ <m1><t1><host=h1><rpc=put>
  ○ <m1><t2><host=h2><rpc=delete>
● Data size
  ○ 20 bytes per data point
● Aggregation
  ○ Data is read onto one OpenTSDB instance and aggregated there
  ○ Ex. ostrich.gauges.singer.processor.stuck_processors{host=*}
● Serialization
  ○ JSON. Very slow when there are many data points to return
[Diagram: three HBase RS boxes and one OpenTSDB box]
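The "unnecessary scan" point can be illustrated with a minimal sketch. It assumes row keys are modeled as sorted Python tuples of (metric, timestamp, tags) with integer timestamps; the function name `scan_with_filter` is hypothetical and this is not OpenTSDB or HBase code. Because the tags come after the timestamp in the key, a query on m1{rpc=delete} has to read every m1 row in the time range and filter afterwards.

```python
from bisect import bisect_left

def scan_with_filter(row_keys, metric, t_start, t_end, wanted_tags):
    """Range-scan sorted row keys for one metric, then filter by tags.

    row_keys: sorted list of (metric, timestamp, tags), where tags is a
    tuple of (tag_key, tag_value) pairs. Returns (rows_scanned, matches).
    """
    # The key prefix is <metric><timestamp>, so the scan range can only be
    # narrowed by metric and time, never by tag.
    lo = bisect_left(row_keys, (metric, t_start))
    hi = bisect_left(row_keys, (metric, t_end + 1))
    scanned = row_keys[lo:hi]
    # Tag filtering happens only after the rows have been read.
    matches = [k for k in scanned if wanted_tags.items() <= dict(k[2]).items()]
    return len(scanned), matches

# Illustrative rows mirroring the slide (t1 = 1, t2 = 2).
rows = sorted([
    ("m1", 1, (("host", "h1"), ("rpc", "delete"))),
    ("m1", 1, (("host", "h1"), ("rpc", "get"))),
    ("m1", 1, (("host", "h1"), ("rpc", "put"))),
    ("m1", 2, (("host", "h2"), ("rpc", "delete"))),
])
scanned, matches = scan_with_filter(rows, "m1", 1, 2, {"rpc": "delete"})
```

In this toy data set four rows are scanned to return two matches; with high-cardinality tags the ratio gets much worse.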
Goku is here to save
OpenTSDB
● Read/write requests are sent to a randomly selected OpenTSDB box, and then routed to the corresponding RS based on row key range
● Reads: raw data is read from individual HBase RS, sent to the OpenTSDB box, aggregated at OpenTSDB, and then the result is sent to the client
[Diagram: Kafka, Ingestor (Write Client), Statsboard (Read Client), OpenTSDB, three HBase RS; write and read paths]
Goku cluster
● A Goku box is not only a storage engine, but also:
  ○ A proxy that routes requests
  ○ An aggregation engine
● Clients can send requests to any Goku box, which will route the requests
  ○ Scatter and gather
[Diagram: Kafka, Ingestor (Write Client), Statsboard (Read Client), four Goku boxes; write and read paths]
Two-level sharding
● Group# hashed from metric name
  ○ E.g. tc.metrics.rpc_latency
● Shard# hashed from metric + set of tagk and tagv
  ○ E.g. tc.metrics.rpc_latency{rpc=put,host=m1}
● Controls read fanout while making it easy to scale out an individual group
Request flow (from the diagram):
1. Requests are sent to a random Goku box
2. Compute sharding to G2: S1 and S2, then look up the shard config
3. Route requests
4. Retrieve data and aggregate locally
5. Another aggregation
6. Return response
[Diagram: Goku boxes hosting shards such as G1:S1, G1:S2, G1:S3, G2:S1, G2:S2, G3:S1, G3:S3, G4:S1, G4:S2]
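A minimal sketch of the two-level scheme, under stated assumptions: the group and shard counts, the MD5-based hash, and the function names (`group_of`, `shard_of`, `read_fanout`) are all hypothetical and are not taken from Goku. It only shows why hashing the group from the metric name bounds read fanout to one group.

```python
import hashlib

NUM_GROUPS = 4          # assumed value for illustration
SHARDS_PER_GROUP = 3    # assumed value for illustration

def _stable_hash(text):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

def group_of(metric):
    """Level 1: group number hashed from the metric name only."""
    return _stable_hash(metric) % NUM_GROUPS + 1

def shard_of(metric, tags):
    """Level 2: shard number hashed from metric + sorted tag key/values."""
    canonical = metric + "{" + ",".join(
        f"{k}={v}" for k, v in sorted(tags.items())) + "}"
    return _stable_hash(canonical) % SHARDS_PER_GROUP + 1

def locate(metric, tags):
    """Where a single time series is written: (group, shard)."""
    return group_of(metric), shard_of(metric, tags)

def read_fanout(metric):
    """A query on a metric only needs the shards of that metric's group."""
    g = group_of(metric)
    return [(g, s) for s in range(1, SHARDS_PER_GROUP + 1)]
```

Every series of tc.metrics.rpc_latency lands in the same group regardless of its tags, so a read touches at most SHARDS_PER_GROUP shards, and a hot group can be given more shards without reshuffling the others.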
Goku #1. Time Series Database based on Beringei
Beringei
● In-memory key-value store
  ○ Key: string
  ○ Value: list of <timestamp, value> pairs
● Gorilla compression
  ○ Delta-of-delta encoding on timestamps
  ○ Delta (XOR) encoding on values
● Stores the most recent 24 hours of data (configurable)
● One level of sharding to distribute data
● Datapoint size reduced
  ○ from 20 bytes to 1.37 bytes
[Diagram: Beringei write path (Gorilla Encode) and read path (Gorilla Decode) over a Shard of Buckets holding ts entries, backed by Disk]
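The delta-of-delta idea for timestamps can be sketched as plain integer arithmetic. This is only an illustration: real Gorilla compression additionally packs the results into variable-length bit fields, which is omitted here, and the function names are hypothetical.

```python
def dod_encode(timestamps):
    """Return [t0, first_delta, delta_of_delta, delta_of_delta, ...]."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        # Store how much the delta changed, not the delta itself.
        encoded.append(delta - prev_delta)
        prev_delta = delta
    return encoded

def dod_decode(encoded):
    """Invert dod_encode by accumulating deltas back into timestamps."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps

# Points reported every 60 seconds, with one arriving a second late.
example = [1000, 1060, 1120, 1180, 1241]
encoded = dod_encode(example)  # [1000, 60, 0, 0, 1]
```

Metrics reported at a fixed interval produce long runs of zeros, which is what lets a bit-level encoder store most timestamps in a single bit.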
Goku #2. Query Engine -- Inverted Index
Inverted Index
● A map from search term to its bitset
● Built along with processing incoming data points
● Fast lookup when serving queries
● Supports query filters
  ○ ExactMatch: metricname{host=h1,api=get} => intersect bitsets of metricname, host=h1, api=get
  ○ Or: metricname{host=h1|h2} => union bitsets of host=h1 and host=h2, and intersect with bitset of metricname
  ○ Nor: metricname{host=not_literal_or(h1|h2)} => remove bitsets of host=h1 and host=h2 from bitset of metricname
  ○ Wildcard: a. metricname{host=*} => intersect bitsets of metricname and host=*; b. metricname{host=h*} => convert to regex filter
  ○ Regex: metricname{host=h[1|2].*, api=get, az=us-east-1} => apply the other filters first, then build a regex pattern based on the filter values and iterate over the full metric names of all ids that remain after the other filters
[Diagram: Goku Phase #1. Write path (Gorilla Encode) and read path (Gorilla Decode) over a Shard with an Inverted Index and Buckets of ts entries, backed by Disk]
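The bitset operations behind the ExactMatch, Or, Nor and host=* filters can be sketched with Python integers used as bitsets. This is a toy model under assumptions: the class and method names are hypothetical, series ids are assigned in insertion order, and regex filters are left out.

```python
from collections import defaultdict

class InvertedIndex:
    """Maps each search term to a bitset of series ids (bit i = series i)."""

    def __init__(self):
        self.bitsets = defaultdict(int)
        self.num_series = 0

    def add_series(self, metric, tags):
        """Index a new time series; built as data points come in."""
        series_id = self.num_series
        self.num_series += 1
        bit = 1 << series_id
        self.bitsets[metric] |= bit
        for key, value in tags.items():
            self.bitsets[f"{key}={value}"] |= bit
            self.bitsets[f"{key}=*"] |= bit  # supports metricname{host=*}
        return series_id

    def _union(self, key, values):
        bits = 0
        for value in values:
            bits |= self.bitsets.get(f"{key}={value}", 0)
        return bits

    def exact_match(self, metric, tags):
        """Intersect the metric bitset with every tag term's bitset."""
        bits = self.bitsets.get(metric, 0)
        for key, value in tags.items():
            bits &= self.bitsets.get(f"{key}={value}", 0)
        return bits

    def or_filter(self, metric, key, values):
        """Union the value bitsets, then intersect with the metric."""
        return self.bitsets.get(metric, 0) & self._union(key, values)

    def nor_filter(self, metric, key, values):
        """Remove the value bitsets from the metric bitset."""
        return self.bitsets.get(metric, 0) & ~self._union(key, values)

def ids(bits):
    """List the series ids whose bit is set."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]
```

For example, after indexing three series of one metric with hosts h1, h2 and h3, `ids(index.or_filter(metric, "host", ["h1", "h2"]))` returns the first two ids and `nor_filter` with the same arguments returns the third.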
Goku #3. Query Engine -- Aggregation
Aggregation
● Post-processing after retrieving all relevant time series
● Mimics OpenTSDB’s aggregation layer
● Supports basic aggregators, including SUM, AVG, MAX, MIN, COUNT, DEV, and downsampling
● Versus OpenTSDB
  ○ OpenTSDB does aggregation on a single instance, since HBase RS don’t know how to aggregate
  ○ Goku does aggregation in two phases: first on each leaf Goku node, and second on the routing Goku node
  ○ This distributes the computation and saves data on the wire
[Diagram: Goku Phase #1. Write path (Gorilla Encode) and read path (Gorilla Decode) over a Shard with Aggregation, an Inverted Index and Buckets of ts entries, backed by Disk]
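A minimal sketch of two-phase aggregation, assuming each leaf returns per-timestamp partial (sum, count) pairs that the routing node merges. The function names and the partial format are hypothetical, not Goku's wire format; only SUM, COUNT and AVG are shown.

```python
def leaf_aggregate(series_list):
    """Phase 1, on a leaf node: fold local series into {ts: (sum, count)}.

    series_list: list of time series, each a list of (timestamp, value).
    """
    partial = {}
    for series in series_list:
        for ts, value in series:
            total, count = partial.get(ts, (0.0, 0))
            partial[ts] = (total + value, count + 1)
    return partial

def route_aggregate(partials, aggregator="avg"):
    """Phase 2, on the routing node: merge the leaf partials and finalize."""
    merged = {}
    for partial in partials:
        for ts, (total, count) in partial.items():
            m_total, m_count = merged.get(ts, (0.0, 0))
            merged[ts] = (m_total + total, m_count + count)
    if aggregator == "sum":
        return {ts: total for ts, (total, _) in sorted(merged.items())}
    if aggregator == "count":
        return {ts: count for ts, (_, count) in sorted(merged.items())}
    if aggregator == "avg":
        return {ts: total / count for ts, (total, count) in sorted(merged.items())}
    raise ValueError(f"unsupported aggregator: {aggregator}")

# Two leaves holding three series between them.
leaf_a = leaf_aggregate([[(0, 1.0), (60, 3.0)], [(0, 2.0)]])
leaf_b = leaf_aggregate([[(0, 6.0), (60, 5.0)]])
result = route_aggregate([leaf_a, leaf_b], "avg")  # {0: 3.0, 60: 4.0}
```

Each leaf ships one (sum, count) pair per timestamp instead of every raw data point, which is where the saving on the wire comes from. AVG needs both numbers because an average of averages would be wrong when leaves hold different numbers of series.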
AWS EFS
AWS EFS
● Stores log and data files for recovery
● POSIX compliant
● Data durability
● Operated asynchronously, so latency isn’t an issue
● Easy to move shards
● Easy to use on AWS
[Diagram: Goku Phase #1. Write path (Gorilla Encode) and read path (Gorilla Decode) over a Shard with Aggregation, an Inverted Index and Buckets of ts entries, backed by AWS EFS]
Phase #2 Disk based Goku
Goku Phase #2 -- Disk based
● A Hadoop job runs constantly to compact data onto disk with downsampling
● Data is stored in S3 for better availability and low cost
● RocksDB is used for online serving data
[Diagram: Goku Phase #2. The Phase #1 Shard (Aggregation, Inverted Index, Buckets of ts entries, Gorilla Encode/Decode) in a Group, backed by AWS EFS, plus a Hadoop job, S3, and a Distributed KV store (Rock Store)]
Next step for Goku
● Replication
  ○ Currently dual write to two clusters for fault tolerance
  ○ Replication to improve availability and consistency
● More query possibilities
  ○ TopK
  ○ Percentile
● Analytics use case
  ○ Another big consumer of time series data
Thanks!
Scheduling Asynchronous Tasks at Pinterest
Isabel Tallam, Data (Core Services) Team, Pinterest
Agenda
- Why asynchronous tasks?
- Asynchronous task processing service
- Design considerations
Why asynchronous tasks?
[Figure: illustration with "SPAM" labels]
Pinlater Asynchronous Task Processing Service
Pinlater features
- High throughput
- Easily create new tasks
- At-least-once guarantee
- Strict ack mechanism
- Metrics and debugging support
- Different task priorities
- Scheduling future tasks
- Python, Java support
Pinlater Asynchronous Task Processing Service
Pinlater components
[Diagram: Pinlater Clients insert tasks into Pinlater Servers; Pinlater Workers request tasks from the servers and ack them; the servers are backed by Storage Master/Slave pairs]
Pinlater Asynchronous Task Processing Service
Pinlater Stats
- ~1000 different tasks defined
- ~8 billion task instances processed per day
- ~3000 Pinlater hosts
Pinlater Asynchronous Task Processing Service
Storage Layer
[Diagram: three Pinlater Servers, a Cache, and three Storage Master/Slave pairs]
Pinlater Asynchronous Task Processing Service
Handling failures in the system
[Diagram: Pinlater Clients insert tasks into Pinlater Servers; Pinlater Workers request tasks and ack them; a timeout monitor; Storage Master/Slave]
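How an ack mechanism and a timeout monitor can combine into an at-least-once guarantee is sketched below with an in-memory queue. This is not Pinlater's implementation or API: the class, its methods, and the explicit `now` argument (used instead of a real clock to keep the example deterministic) are all assumptions for illustration.

```python
from collections import deque

class TaskQueue:
    """Toy task queue: a claimed task must be acked or it is retried."""

    def __init__(self, ack_timeout):
        self.ack_timeout = ack_timeout
        self.pending = deque()   # (task_id, payload) waiting for a worker
        self.running = {}        # task_id -> (payload, claimed_at)
        self.next_id = 0

    def insert(self, payload):
        """Client side: enqueue a new task."""
        task_id = self.next_id
        self.next_id += 1
        self.pending.append((task_id, payload))
        return task_id

    def request(self, now):
        """Worker side: claim the next task, or None if the queue is empty."""
        if not self.pending:
            return None
        task_id, payload = self.pending.popleft()
        self.running[task_id] = (payload, now)
        return task_id, payload

    def ack(self, task_id):
        """Worker side: confirm completion. Returns False if not running."""
        return self.running.pop(task_id, None) is not None

    def timeout_monitor(self, now):
        """Re-queue tasks claimed but not acked within ack_timeout."""
        expired = [task_id for task_id, (_, claimed_at) in self.running.items()
                   if now - claimed_at >= self.ack_timeout]
        for task_id in expired:
            payload, _ = self.running.pop(task_id)
            self.pending.append((task_id, payload))
        return expired
```

If a worker crashes after `request` but before `ack`, the monitor puts the task back and another worker picks it up. The cost of this design is that a slow worker's task may run twice, so task handlers need to tolerate duplicates.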
Pinlater Asynchronous Task Processing Service
Thank You!
Experimentation at Pinterest
Lu Yang, Data (Data Analytics - Core Product Data) Team, Pinterest
Outline
1 Background
2 Platform
3 Architecture
What is an A/B experiment?
It is a method to compare two (or more) variations of something to determine which one performs better against your target metrics.
[Figure: two variations separated by "OR"]
With Experiment Mindset
Idea → Feature Development → Release to small % of users → Measure impact → Release to 100% of users based on the impact of sample launch
- Existing code - CONTROL
- Changed code - ENABLED
A randomized, controlled trial with measurement
[Diagram: All Users split into the experiment groups and "Not in experiment"]
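One common way to release to a small, randomized percentage of users is to hash the user id into a bucket. The sketch below assumes that approach; it is not taken from Pinterest's experiment platform, and the function name, the SHA-256 hash, and the 10,000-bucket granularity are illustrative choices.

```python
import hashlib

def assign_group(user_id, experiment, control_pct, enabled_pct):
    """Deterministically place a user in control, enabled, or neither.

    control_pct and enabled_pct are percentages of all users (0 to 100).
    """
    # Hashing experiment name + user id gives each experiment its own
    # independent, repeatable split of the user base.
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10000
    control_cutoff = round(control_pct * 100)
    enabled_cutoff = control_cutoff + round(enabled_pct * 100)
    if bucket < control_cutoff:
        return "control"            # existing code
    if bucket < enabled_cutoff:
        return "enabled"            # changed code
    return "not_in_experiment"
```

Because the assignment depends only on the ids, a user sees the same variant on every request without any stored state, and ramping to 100% is a matter of raising `enabled_pct`.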
Number of Experiments Over Time