Predictive Precompute with Recurrent Neural Networks
Hanson Wang, Zehui Wang, Yuanyuan Ma
MLSys 2020
Defining Precompute
On client: prefetching
• Improve the latency of user interactions in the Facebook app by precomputing data queries before the interactions occur
On server: cache warmup
• Improve cache hit-rates in Facebook backend services by precomputing cache values hours in advance
Defining Precompute: Prefetching
[Diagram: without prefetching, the user opens the tab and then waits for the data to arrive.]
Defining Precompute: Prefetching
[Diagram: with prefetching, data is precomputed at startup time and is immediately available when the user opens the tab.]
Predictive Precompute
• Naïvely precomputing 100% of the time is too expensive: Facebook spends a non-trivial percentage of its compute on precomputation
• Idea: predict user behavior to avoid wasting resources
• Classification problem: estimate P(tab access) at the start of each session
• Apply a threshold on top of the probability to make precompute decisions (the threshold can be tuned to product constraints)
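A minimal sketch of the thresholding step; the function name and default threshold are illustrative, not from the talk:

```python
def should_precompute(p_access: float, threshold: float = 0.5) -> bool:
    """Turn a predicted access probability into a precompute decision.

    The threshold is a product-level knob: raising it saves compute
    (higher precision), lowering it prefetches more often (higher recall).
    """
    return p_access >= threshold
```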
Formulation as an ML problem
[Diagram: a timeline of user sessions. Session 1 (10 min) has context C_1 (hour of day = 9, # notifications = 1, user age = 25, …) and an access prediction P(A_1) with true label A_1 = 1 (access); Session 2 (10 min) has context C_2 (hour of day = 11, # notifications = 0, user age = 25, …) and prediction P(A_2) with label A_2 = 0 (no access).]
In general, we want to estimate: P(A_n | C_1, A_1, C_2, A_2, …, C_n)
Formulation as an ML problem: Features
Simple features can be taken from the current context (C_i):
• Time-based (hour of day, day of week)
• User-based (age, country)
• Session-based (notification count)
• How to incorporate previous contexts and accesses?
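As a concrete illustration of the simple context features listed above, a minimal sketch of assembling them into a flat vector; the field names and encodings here are assumptions, not from the talk:

```python
from datetime import datetime

def context_features(user_age: int, user_country_id: int,
                     notification_count: int, now: datetime) -> list[float]:
    """Assemble simple current-context features (C_i) into a flat vector."""
    return [
        float(now.hour),            # time-based: hour of day
        float(now.weekday()),       # time-based: day of week (0 = Monday)
        float(user_age),            # user-based
        float(user_country_id),     # user-based, assumed pre-encoded as an integer id
        float(notification_count),  # session-based
    ]
```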
Formulation as an ML problem: Historical Features
Historical usage features must be “engineered” for traditional models.
[Diagram: a timeline of three sessions. Session 1: hour of day = 9, # notifications = 1, A_1 = 1; Session 2: hour of day = 11, # notifications = 1, A_2 = 1; Session 3: hour of day = 13, # notifications = 0, A_3 = 0.]
Example engineered features from this history:
• Number of accesses in the past 7 days = 1; access rate in the past 7 days = 50%
• Number of accesses in the past 14 days with notifications = 2; access rate in the past 14 days with notifications = 100%
Historical features dominate feature importance…
[Chart: sample feature importance from a GBDT model, in decreasing order: user’s access rate with current notification count and referrer page (28 days); user’s access rate with current notification count (28 days); user’s access rate with current referrer page (28 days); notification count; user’s overall access rate (28 days); user’s overall access rate (1 day); referrer page.]
Model quality drops by more than 15% without the access-rate features.
Formulation as an ML problem: Features
“Recipe” for historical features:
• Select an aggregation type (count, access rate, time elapsed, …)
• Select a time range (1 day, 7 days, 28 days, …)
• (Optional) Filter on a subset of context attributes (with/without notifications, at the current hour of the day, …)
Two problems follow (one such feature is sketched below): a combinatorial explosion of features, and aggregation features that make inference expensive.
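A minimal sketch of one instance of this recipe (access rate over a trailing window, with an optional context filter); the data layout and names are assumptions, not from the talk:

```python
from datetime import datetime, timedelta

def access_rate(history: list[tuple[datetime, dict, bool]], now: datetime,
                days: int, context_filter: dict | None = None) -> float:
    """Access rate over the trailing `days` days.

    `history` holds (session time, context attributes, accessed?) tuples;
    `context_filter` optionally restricts the aggregation to sessions whose
    context matches, e.g. {"has_notifications": True}.
    """
    cutoff = now - timedelta(days=days)
    matched = [
        accessed
        for ts, ctx, accessed in history
        if ts >= cutoff and all(ctx.get(k) == v
                                for k, v in (context_filter or {}).items())
    ]
    return sum(matched) / len(matched) if matched else 0.0
```

Every choice of aggregation type, window, and filter is one more feature, which is where the combinatorial explosion comes from; each one also has to be recomputed (or incrementally maintained) at inference time.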
Formulation as an ML problem: Models
Traditional models:
• Simple baseline: output the lifetime access rate for each user — the most basic historical feature, and surprisingly effective
• Logistic Regression and Gradient-Boosted Decision Trees, consuming a concatenated vector of engineered features
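As a sketch, the baseline simply uses each user’s lifetime access rate as the predicted probability (same hypothetical history layout as in the earlier sketch):

```python
from datetime import datetime

def baseline_score(history: list[tuple[datetime, dict, bool]]) -> float:
    """Baseline model: the user's lifetime access rate, used directly as P(access)."""
    if not history:
        return 0.0
    return sum(accessed for _, _, accessed in history) / len(history)
```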
Alt-text: The pile gets soaked with data and starts to get mushy over time, so it's technically recurrent. — xkcd #1838
Neural networks to the rescue
Recurrent neural networks address the problems with historical features:
• Complex, non-linear interactions between features can be captured through a hidden-state “memory” for each user.
• Hidden-state updates are incremental in nature.
• Storage consumption is bounded by the number of hidden dimensions.
• Model each user’s session history as a sequential prediction task.
Recurrent Network Architecture
[Diagram: a timeline of sessions at times t_1, t_2, t_3. Online predictions P(A_1), P(A_2), P(A_3) are made by an MLP at session start; hidden states h_1, h_2, h_3 are produced asynchronously by a GRU at times t_1 + δ, t_2 + δ, t_3 + δ, once each session’s true label is known. A prediction may therefore use a stale hidden state (e.g. P(A_3) computed from h_1).]
Prediction layer (online, produces P(A_i)):
• Inputs: the last known hidden state (e.g. h_1), the feature vector f_i, and T(t_i − t_1), an encoding of the time elapsed since that hidden state
• 1-layer fully-connected network (256 neurons)
• A latent cross [1] is helpful: h ∘ (1 + Linear(f_i))
Hidden layer (asynchronous, produces h_i):
• Inputs: the previous hidden state h_{i−1}, the feature vector f_i, the true label A_i, and T(Δt_i), an encoding of the time since the previous update
• GRU with 128 hidden dimensions
Hidden-state updates are decoupled from predictions, and session and update delays (δ) are modeled explicitly through the time encodings.
[1] Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V., and Chi, E. H. (2018). Latent Cross: Making Use of Context in Recurrent Recommender Systems.
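A rough PyTorch sketch of this architecture, with dimensions taken from the slides (GRU with 128 hidden dims, one 256-unit fully-connected layer, latent cross in the prediction path); the exact time encoding T(·), the input layout, and where the latent cross is applied are assumptions on my part:

```python
import torch
import torch.nn as nn

class PredictivePrecomputeRNN(nn.Module):
    """Sketch of the decoupled GRU + MLP architecture described in the talk."""

    def __init__(self, feature_dim: int, hidden_dim: int = 128, mlp_dim: int = 256):
        super().__init__()
        # Hidden-state update: consumes [features f_i, true label A_i, encoded elapsed time].
        self.gru_cell = nn.GRUCell(feature_dim + 1 + 1, hidden_dim)
        # Latent cross: elementwise h * (1 + Linear(f)).
        self.cross = nn.Linear(feature_dim, hidden_dim)
        # Prediction head: consumes [crossed hidden state, features, encoded time since h].
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + feature_dim + 1, mlp_dim),
            nn.ReLU(),
            nn.Linear(mlp_dim, 1),
        )

    def encode_time(self, dt_seconds: torch.Tensor) -> torch.Tensor:
        # Placeholder time encoding T(dt); the talk's exact encoding may differ.
        return torch.log1p(dt_seconds).unsqueeze(-1)

    def update_hidden(self, h: torch.Tensor, f: torch.Tensor,
                      accessed: torch.Tensor, dt_seconds: torch.Tensor) -> torch.Tensor:
        """Asynchronous hidden-state update, run once the session's label is known.

        `accessed` is a float 0/1 tensor of shape (batch,)."""
        x = torch.cat([f, accessed.unsqueeze(-1), self.encode_time(dt_seconds)], dim=-1)
        return self.gru_cell(x, h)

    def predict(self, h_last: torch.Tensor, f: torch.Tensor,
                dt_since_h: torch.Tensor) -> torch.Tensor:
        """Online prediction of P(access) from the last known (possibly stale) hidden state."""
        crossed = h_last * (1.0 + self.cross(f))       # latent cross
        x = torch.cat([crossed, f, self.encode_time(dt_since_h)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)
```

In serving, `predict` would run synchronously at session start using whatever hidden state is currently stored for the user, while `update_hidden` runs later in an asynchronous pipeline once the access label has been observed.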
Training Details
• 1M user histories over a 30-day period
• ~60 sessions per user on average, ~10% positive rate
• Loss is only computed on the last 21 days
• All evaluation metrics use the last 7 days
• Training takes about 8 hours on a GPU (PyTorch); possibly faster with BPPSA?
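One way the “loss only on the last 21 days” detail might be implemented, sketched below; the per-session tensor layout and the burn-in handling are assumptions:

```python
import torch
import torch.nn.functional as F

def masked_bce_loss(predictions: torch.Tensor, labels: torch.Tensor,
                    session_day: torch.Tensor, burn_in_days: int = 9) -> torch.Tensor:
    """Binary cross-entropy over one user's sessions, skipping the first
    `burn_in_days` of the 30-day window so the loss covers only the last
    21 days. Earlier sessions still drive hidden-state updates; they just
    do not contribute to the loss."""
    mask = (session_day >= burn_in_days).float()
    per_session = F.binary_cross_entropy(predictions, labels, reduction="none")
    return (per_session * mask).sum() / mask.sum().clamp(min=1.0)
```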
Results
Precision and Recall for Precompute
Precision: (true positives) / (predicted positives)
• What percentage of precomputed results are accessed?
• Inversely correlated with additional compute cost.
Recall: (true positives) / (total positives)
• What percentage of accesses used precomputed results?
• Directly correlated with product latency improvements.
Precision-Recall Curves: FB Mobile Tab
[Plot: precision (0–100%) vs. recall (0–100%) curves for the Baseline, Logistic Regression, GBDT, and RNN models.]
In practice, we typically try to hit a precision target and compare models by their recall at that target, e.g. recall at precision = 50%.
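A minimal sketch of how a recall-at-precision-target operating point can be read off a precision-recall curve (using scikit-learn here; the talk does not specify tooling):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true: np.ndarray, y_score: np.ndarray,
                        precision_target: float = 0.5) -> float:
    """Largest recall achievable while keeping precision at or above the target."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    feasible = recall[precision >= precision_target]
    return float(feasible.max()) if feasible.size else 0.0
```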
Numerical comparison: FB Mobile Tab

Model Type                  PR-AUC   R@50%
Baseline                     0.470   0.413
Logistic Regression          0.546   0.596
GBDT                         0.578   0.616
Recurrent Neural Network     0.596   0.642
Improvement over GBDT       +3.11%  +4.22%

~3.4% increase in successful prefetches