Women in Big Data x Pinterest Welcome! Regina Karson, WiBD Chapter - PowerPoint PPT Presentation

Women in Big Data x Pinterest

Welcome!
● Regina Karson, WiBD Chapter Director
● Tian-Ying Chang, Engineering Manager

Goku: Pinterest’s In-House Time-Series Database
Tian-Ying Chang, Sr. Staff Engineer Manager, Pinterest

Pinterest
● Discover new ideas and find inspiration to do the things they love
○ 300M+ MAU, billions of pins
● Metrics for monitoring site health
○ Latency, QPS, CPU, memory
● Metrics about product quality
○ MAU, impressions, etc.
● Monitoring service needs to be fast, reliable and scalable

Monitoring at Pinterest
● Graphite
○ Easy to set up at small scale
○ Downsampling supports long-range queries well
○ Hard to scale
○ Deprecated at Pinterest’s current scale
● OpenTSDB
○ Rich query and tagging support
○ Easy to scale horizontally with the underlying HBase cluster
○ Long latency for high-cardinality data
○ Long latency for queries over longer time ranges
■ No downsampling
○ Heavy GC, worsened by the combination of heavy write QPS and long-range scans

Why OpenTSDB is not a good fit
● HBase schema
○ Row key: <metric><timestamp>[<tagk1><tagv1><tagk2><tagv2>...] (metric, tag keys and tag values are each encoded in 3 bytes)
○ Column qualifier: <delta to row key timestamp (up to 4 bytes)>
● Unnecessary scans
○ Query: m1{rpc=delete} [t1 to t2]
○ <m1><t1><host=h1><rpc=delete>
○ <m1><t1><host=h1><rpc=get>
○ <m1><t1><host=h1><rpc=put>
○ <m1><t2><host=h2><rpc=delete>
● Data size
○ 20 bytes per data point
● Aggregation
○ Read data onto one OpenTSDB instance and aggregate
○ Ex. ostrich.gauges.singer.processor.stuck_processors{host=*}
● Serialization
○ JSON. Very slow when there are many data points to return
[Diagram: HBase RS instances feeding a single OpenTSDB instance]
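To illustrate the unnecessary scan, here is a minimal Python sketch. It is not OpenTSDB or HBase code: rows are modeled as a sorted list of tuples laid out like the row key (metric, timestamp, then tags), and the function name `scan` and the sample rows are assumptions made for illustration only.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows laid out like the row key: (metric, timestamp, tags).
# Tags come after the timestamp, so they cannot narrow the scan range.
rows = sorted([
    ("m1", 1, (("host", "h1"), ("rpc", "delete"))),
    ("m1", 1, (("host", "h1"), ("rpc", "get"))),
    ("m1", 1, (("host", "h1"), ("rpc", "put"))),
    ("m1", 2, (("host", "h2"), ("rpc", "delete"))),
    ("m2", 1, (("host", "h1"), ("rpc", "delete"))),
])


def scan(metric, t1, t2, tag_filter):
    """Range-scan by (metric, timestamp), then filter tags row by row."""
    keys = [(m, ts) for m, ts, _ in rows]
    lo = bisect_left(keys, (metric, t1))
    hi = bisect_right(keys, (metric, t2))
    scanned = rows[lo:hi]
    matched = [r for r in scanned if tag_filter.items() <= dict(r[2]).items()]
    return scanned, matched


scanned, matched = scan("m1", 1, 2, {"rpc": "delete"})
print(len(scanned), "rows scanned,", len(matched), "rows matched")
# prints "4 rows scanned, 2 rows matched"
```

Every row for m1 in the time range is read, even though only the rpc=delete rows are wanted.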

Goku is here to save

OpenTSDB
● Read/write requests are sent to a randomly selected OpenTSDB box, and then routed to the corresponding RS based on row key range
● Reads: raw data is read from individual HBase RSs, sent to the OpenTSDB box, aggregated at OpenTSDB, and then the result is sent to the client
[Diagram: write and read paths; components: Kafka, Ingestor (Write Client), Statsboard (Read Client), OpenTSDB, three HBase RS]

Goku cluster
● A Goku box is not only a storage engine, but also:
○ A proxy that routes requests
○ An aggregation engine
● Clients can send requests to any Goku box, which will route the requests
○ Scatter and gather
[Diagram: write and read paths; components: Kafka, Ingestor (Write Client), Statsboard (Read Client), four Goku boxes]

Two-level sharding
● Group# hashed from metric name
○ E.g. tc.metrics.rpc_latency
● Shard# hashed from metric + set of tagk and tagv
○ E.g. tc.metrics.rpc_latency{rpc=put,host=m1}
● Control read fanout while keeping it easy to scale out an individual group
Request flow:
1. Request is sent to a random Goku box
2. Compute sharding to G2: S1 and S2, then look up shard config
3. Route requests
4. Retrieve data and aggregate locally
5. Another aggregation
6. Return response
[Diagram: Goku boxes hosting shards labeled by group and shard, e.g. G1:S1, G2:S2, G3:S3, G4:S1]
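The two hashing levels can be sketched in Python as follows. This is an illustration under stated assumptions, not Goku's actual code: the group and shard counts, the MD5-based `stable_hash`, and the helper names `group_of`, `shard_of`, and `locate` are all hypothetical.

```python
import hashlib

NUM_GROUPS = 4          # assumed cluster layout, not from the slides
SHARDS_PER_GROUP = 3    # assumed number of shards in each group


def stable_hash(text):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


def group_of(metric):
    """Level 1: the group depends on the metric name only."""
    return stable_hash(metric) % NUM_GROUPS + 1


def shard_of(metric, tags):
    """Level 2: the shard depends on the metric plus its sorted tag pairs."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return stable_hash(f"{metric}{{{tag_part}}}") % SHARDS_PER_GROUP + 1


def locate(metric, tags):
    return f"G{group_of(metric)}:S{shard_of(metric, tags)}"


a = locate("tc.metrics.rpc_latency", {"rpc": "put", "host": "m1"})
b = locate("tc.metrics.rpc_latency", {"rpc": "get", "host": "m2"})
print(a, b)  # same group prefix, possibly different shards
```

Because the group depends only on the metric name, a query for one metric fans out to the shards of a single group rather than to the whole cluster.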

Goku #1. Time Series Database based on Beringei

Beringei
● In-memory key-value store
○ Key: string
○ Value: list of <timestamp, value> pairs
● Gorilla compression
○ Delta-of-delta encoding on timestamps
○ Delta (XOR) encoding on values
● Stores the most recent 24 hours of data (configurable)
● One level of sharding to distribute
● Datapoint size reduced
○ from 20 bytes to 1.37 bytes
[Diagram: Beringei write and read paths; components: Shard, Buckets of time series (ts), Gorilla Encode, Gorilla Decode, Disk]
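The delta-of-delta idea for timestamps can be sketched in Python. This is a simplified illustration, not Beringei's implementation: the real Gorilla format packs these numbers into variable-length bit fields, which is omitted here, and the function names are hypothetical.

```python
def encode_timestamps(timestamps):
    """Return (first timestamp, list of delta-of-deltas)."""
    if not timestamps:
        return None, []
    dods = []
    prev_ts, prev_delta = timestamps[0], 0
    for ts in timestamps[1:]:
        delta = ts - prev_ts
        dods.append(delta - prev_delta)
        prev_ts, prev_delta = ts, delta
    return timestamps[0], dods


def decode_timestamps(first, dods):
    """Rebuild the original timestamps from the encoded form."""
    if first is None:
        return []
    out = [first]
    delta = 0
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out


ts = [1000, 1060, 1120, 1180, 1241]
first, dods = encode_timestamps(ts)
print(first, dods)  # prints "1000 [60, 0, 0, 1]"
assert decode_timestamps(first, dods) == ts
```

Points reported at a regular interval produce mostly zeros, which is why a bit-packed encoding can store each timestamp in very few bits.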

Goku #2. Query Engine -- Inverted Index

Inverted Index
● A map from search term to its bitset
● Built while processing incoming data points
● Fast lookup when serving queries
● Supports query filters
○ ExactMatch: metricname{host=h1,api=get} => intersect bitsets of metricname, host=h1, api=get
○ Or: metricname{host=h1|h2} => union bitsets of host=h1 and host=h2, then intersect with the bitset of metricname
○ Nor: metricname{host=not_literal_or(h1|h2)} => remove bitsets of host=h1 and host=h2 from the bitset of metricname
○ Wildcard: a. metricname{host=*} => intersect bitsets of metricname and host=*; b. metricname{host=h*} => convert to regex filter
○ Regex: metricname{host=h[1|2].*, api=get, az=us-east-1} => apply other filters first. Then build a regex pattern based on the filter values and iterate over the corresponding full metric names of all ids that remain after applying the other filters.
[Diagram: Goku Phase #1 write and read paths; components: Shard, Inverted Index, Buckets of time series (ts), Gorilla Encode, Gorilla Decode, Disk]
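The bitset operations behind the ExactMatch, Or, and Nor filters can be sketched in Python, using plain integers as bitsets. This is an assumption-laden illustration rather than Goku's code: `index`, `add_series`, and `ids` are hypothetical names, and the regex filter is left out.

```python
from collections import defaultdict

index = defaultdict(int)   # search term -> bitset stored as a Python int
series_names = []          # series id -> full series name


def add_series(metric, tags):
    """Register a series and set its bit under every search term."""
    sid = len(series_names)
    tag_text = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    series_names.append(f"{metric}{{{tag_text}}}")
    index[metric] |= 1 << sid
    for k, v in tags.items():
        index[f"{k}={v}"] |= 1 << sid
        index[f"{k}=*"] |= 1 << sid
    return sid


def ids(bitset):
    """List the series ids whose bits are set."""
    return [i for i in range(bitset.bit_length()) if bitset >> i & 1]


add_series("m", {"host": "h1", "api": "get"})   # id 0
add_series("m", {"host": "h2", "api": "get"})   # id 1
add_series("m", {"host": "h3", "api": "put"})   # id 2

exact = index["m"] & index["host=h1"] & index["api=get"]       # ExactMatch
either = index["m"] & (index["host=h1"] | index["host=h2"])    # Or
neither = index["m"] & ~(index["host=h1"] | index["host=h2"])  # Nor
print(ids(exact), ids(either), ids(neither))  # prints "[0] [0, 1] [2]"
```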

Goku #3. Query Engine -- Aggregation

Aggregation
● Post-processing after retrieving all relevant time series
● Mimics OpenTSDB’s aggregation layer
● Supports basic aggregators, including SUM, AVG, MAX, MIN, COUNT, DEV, and downsampling
● Versus OpenTSDB
○ OpenTSDB does aggregation on a single instance since HBase RSs don’t know how to aggregate
○ Goku does aggregation in two phases: first on each leaf Goku node, and second on the routing Goku node
○ Distributes the computation and saves data on the wire
[Diagram: Goku Phase #1 write and read paths; components: Aggregation, Shard, Inverted Index, Buckets of time series (ts), Gorilla Encode, Gorilla Decode, Disk]
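A minimal Python sketch of two-phase aggregation follows. It assumes each leaf sends per-timestamp partial results as (sum, count) pairs, which is one common way to make AVG mergeable; the slides do not say what Goku actually sends, and the function names are hypothetical.

```python
from collections import defaultdict


def leaf_aggregate(series_list):
    """Phase 1 (leaf node): per-timestamp partial (sum, count)."""
    partial = defaultdict(lambda: [0.0, 0])
    for series in series_list:
        for ts, value in series:
            partial[ts][0] += value
            partial[ts][1] += 1
    return {ts: tuple(sc) for ts, sc in partial.items()}


def route_aggregate(partials, aggregator):
    """Phase 2 (routing node): merge partials, then finish SUM or AVG."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for ts, (s, c) in partial.items():
            merged[ts][0] += s
            merged[ts][1] += c
    if aggregator == "SUM":
        return {ts: s for ts, (s, c) in sorted(merged.items())}
    if aggregator == "AVG":
        return {ts: s / c for ts, (s, c) in sorted(merged.items())}
    raise ValueError(f"unsupported aggregator: {aggregator}")


leaf_a = leaf_aggregate([[(0, 1.0), (60, 2.0)], [(0, 3.0), (60, 4.0)]])
leaf_b = leaf_aggregate([[(0, 5.0), (60, 9.0)]])
print(route_aggregate([leaf_a, leaf_b], "AVG"))  # prints "{0: 3.0, 60: 5.0}"
```

Each leaf ships one small partial per timestamp instead of every raw series, which is where the savings on the wire come from. Carrying the count alongside the sum keeps AVG correct when leaves hold different numbers of series.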

AWS EFS

AWS EFS
● Stores log and data files for recovery
● POSIX compliant
● Data durability
● Operated asynchronously, so latency isn’t an issue
● Easy to move shards
● Easy to use on AWS
[Diagram: Goku Phase #1 write and read paths; components: Aggregation, Shard, Inverted Index, Buckets of time series (ts), Gorilla Encode, Gorilla Decode, AWS EFS]

Phase #2: Disk-based Goku

Goku Phase #2 -- Disk based
● A Hadoop job runs constantly to compact data onto disk with downsampling
● Data is stored in S3 for better availability and low cost
● RocksDB is used for serving data online
[Diagram: Goku Phase #2 write and read paths; components: Aggregation, Group, Shard, Inverted Index, Buckets of time series (ts), Gorilla Encode, Gorilla Decode, AWS EFS, Hadoop job, S3, Distributed KV store (Rock Store)]
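As a rough illustration of what downsampling means here, the Python sketch below averages raw points into fixed-width time buckets. The bucket width, the choice of averaging, and the function name `downsample_avg` are assumptions; the slides do not describe how the Hadoop job actually downsamples.

```python
from collections import defaultdict


def downsample_avg(points, interval):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % interval].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]


raw = [(0, 1.0), (10, 3.0), (60, 10.0), (90, 20.0), (130, 7.0)]
print(downsample_avg(raw, 60))  # prints "[(0, 2.0), (60, 15.0), (120, 7.0)]"
```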

Next steps for Goku
● Replication
○ Currently dual-write to two clusters for fault tolerance
○ Replication to improve availability and consistency
● More query possibilities
○ TopK
○ Percentile
● Analytics use case
○ Another big consumer of time-series data

Thanks!

Scheduling Asynchronous Tasks at Pinterest
Isabel Tallam, Data (Core Services) Team, Pinterest

● Why asynchronous tasks?
● Asynchronous task processing service
● Design considerations

Why asynchronous tasks?
[Illustration; recoverable label: SPAM]

Pinlater Asynchronous Task Processing Service
Pinlater features
- High throughput
- Easily create new tasks
- At-least-once guarantee
- Strict ack mechanism
- Metrics and debugging support
- Different task priorities
- Scheduling future tasks
- Python, Java support

Pinlater Asynchronous Task Processing Service
Pinlater components
[Diagram: Pinlater Clients, Pinlater Servers, Pinlater Workers, and Storage (Master/Slave); labeled interactions: insert, request/ack]

Pinlater Asynchronous Task Processing Service
Pinlater Stats
- ~1000 different tasks defined
- ~8 billion task instances processed per day
- ~3000 Pinlater hosts

Pinlater Asynchronous Task Processing Service
Storage Layer
[Diagram: Pinlater Servers, Cache, and Storage (Master/Slave)]

Pinlater Asynchronous Task Processing Service
Handling failures in the system
[Diagram: Pinlater Clients, Pinlater Servers, Pinlater Workers, and Storage (Master/Slave); labeled interactions: insert, request/ack, timeout monitor]
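The at-least-once guarantee, ack mechanism, and timeout monitor named in these slides can be illustrated with a small in-memory Python sketch. It is not Pinlater's API or storage design: the `TaskQueue` class, its method names, and the explicit `now` parameter (used instead of a real clock to keep the example deterministic) are all hypothetical.

```python
import collections


class TaskQueue:
    """In-memory sketch of an ack-based queue with a timeout monitor."""

    def __init__(self, ack_timeout):
        self.ack_timeout = ack_timeout
        self.pending = collections.deque()
        self.running = {}   # task id -> (payload, time it was claimed)
        self.next_id = 0

    def insert(self, payload):
        task_id = self.next_id
        self.next_id += 1
        self.pending.append((task_id, payload))
        return task_id

    def dequeue(self, now):
        """Hand a task to a worker; it stays tracked until acked."""
        if not self.pending:
            return None
        task_id, payload = self.pending.popleft()
        self.running[task_id] = (payload, now)
        return task_id, payload

    def ack(self, task_id):
        """Worker reports success; only now is the task forgotten."""
        self.running.pop(task_id, None)

    def timeout_monitor(self, now):
        """Re-queue tasks whose ack did not arrive in time."""
        for task_id, (payload, claimed) in list(self.running.items()):
            if now - claimed >= self.ack_timeout:
                del self.running[task_id]
                self.pending.append((task_id, payload))


q = TaskQueue(ack_timeout=30)
q.insert("send_email")
first = q.dequeue(now=0)      # worker claims the task, then crashes
q.timeout_monitor(now=31)     # no ack arrived, so the task is re-queued
second = q.dequeue(now=32)    # another worker gets the same task
q.ack(second[0])
print(first == second, len(q.running), len(q.pending))  # prints "True 0 0"
```

Since a task can be handed out more than once, task handlers in a scheme like this generally need to tolerate repeated execution.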

Pinlater Asynchronous Task Processing Service
Thank You!

Experimentation at Pinterest
Lu Yang, Data (Data Analytics - Core Product Data) Team, Pinterest

Outline
1. Background
2. Platform
3. Architecture

What is an A/B experiment?
It is a method to compare two (or more) variations of something to determine which one performs better against your target metrics.

With Experiment Mindset
Idea → Feature Development → Release to small % of users → Measure impact → Release to 100% of users based on the impact of sample launch
● Existing code: CONTROL
● Changed code: ENABLED
A randomized, controlled trial with measurement
[Diagram: All Users, split into users in the experiment and users not in the experiment]
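One common way to release a change to a small percentage of users is deterministic hashing of the user ID, sketched below in Python. This is a generic illustration, not Pinterest's experimentation platform: the `assign` function, the SHA-256 bucketing, the even/odd split, and the experiment name are all assumptions.

```python
import hashlib


def assign(user_id, experiment, exposure_pct):
    """Return 'control', 'enabled', or None (user not in the experiment)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000   # 0..9999
    cutoff = int(exposure_pct * 100)                      # e.g. 1% -> 100 buckets
    if bucket >= cutoff:
        return None
    return "control" if bucket % 2 == 0 else "enabled"


groups = [assign(uid, "new_button", exposure_pct=10) for uid in range(100000)]
in_experiment = [g for g in groups if g is not None]
print(len(in_experiment), in_experiment.count("control"), in_experiment.count("enabled"))
```

Hashing keeps a given user in the same group on every request without storing any assignment, and mixing the experiment name into the hash keeps assignments independent across experiments.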

Number of Experiments Over Time
