Predictive Precompute with Recurrent Neural Networks
Hanson Wang, Zehui Wang, Yuanyuan Ma
MLSys 2020
Defining Precompute
On client: prefetching
• Improve the latency of user interactions in the Facebook app by precomputing data queries before the interactions occur
On server: cache warmup
• Improve cache hit-rates in Facebook backend services by precomputing cache values hours in advance
Defining Precompute: Prefetching
[Diagram: without prefetching, the user opens the tab and then waits for the data to arrive.]
Defining Precompute: Prefetching
[Diagram: with prefetching, data is precomputed at startup time and is immediately available when the user opens the tab.]
Predictive Precompute
• Naïvely precomputing 100% of the time is too expensive: Facebook spends a non-trivial percentage of its compute on precomputation
• Idea: predict user behavior to avoid wasting resources
• Classification problem: estimate P(tab access) at the start of each session
• Apply a threshold on top of the probability to make precompute decisions (the threshold can be tuned to product constraints)
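A minimal sketch of the thresholding step; the function name and default threshold are illustrative, not from the talk:

```python
def should_precompute(p_access: float, threshold: float = 0.5) -> bool:
    """Turn a predicted access probability into a precompute decision.

    The threshold is a product-level knob: raising it saves compute
    (higher precision), lowering it prefetches more often (higher recall).
    """
    return p_access >= threshold
```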
Formulation as an ML problem
[Diagram: a timeline of user sessions. Session 1 (10 min) has context C_1 (hour of day = 9, # notifications = 1, user age = 25, …) and an access prediction P(A_1) with true label A_1 = 1 (access); Session 2 (10 min) has context C_2 (hour of day = 11, # notifications = 0, user age = 25, …) and prediction P(A_2) with label A_2 = 0 (no access).]
In general, we want to estimate: P(A_n | C_1, A_1, C_2, A_2, …, C_n)
Formulation as an ML problem: Features
Simple features can be taken from the current context (C_i):
• Time-based (hour of day, day of week)
• User-based (age, country)
• Session-based (notification count)
• How to incorporate previous contexts and accesses?
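As a concrete illustration of the simple context features listed above, a minimal sketch of assembling them into a flat vector; the field names and encodings here are assumptions, not from the talk:

```python
from datetime import datetime

def context_features(user_age: int, user_country_id: int,
                     notification_count: int, now: datetime) -> list[float]:
    """Assemble simple current-context features (C_i) into a flat vector."""
    return [
        float(now.hour),            # time-based: hour of day
        float(now.weekday()),       # time-based: day of week (0 = Monday)
        float(user_age),            # user-based
        float(user_country_id),     # user-based, assumed pre-encoded as an integer id
        float(notification_count),  # session-based
    ]
```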
Formulation as an ML problem: Historical Features
Historical usage features must be “engineered” for traditional models.
[Diagram: a timeline of three sessions. Session 1: hour of day = 9, # notifications = 1, A_1 = 1; Session 2: hour of day = 11, # notifications = 1, A_2 = 1; Session 3: hour of day = 13, # notifications = 0, A_3 = 0.]
Example engineered features from this history:
• Number of accesses in the past 7 days = 1; access rate in the past 7 days = 50%
• Number of accesses in the past 14 days with notifications = 2; access rate in the past 14 days with notifications = 100%
Historical features dominate feature importance…
[Chart: sample feature importance from a GBDT model, in decreasing order: user’s access rate with current notification count and referrer page (28 days); user’s access rate with current notification count (28 days); user’s access rate with current referrer page (28 days); notification count; user’s overall access rate (28 days); user’s overall access rate (1 day); referrer page.]
Model quality drops by more than 15% without the access-rate features.
Formulation as an ML problem: Features
“Recipe” for historical features:
• Select an aggregation type (count, access rate, time elapsed, …)
• Select a time range (1 day, 7 days, 28 days, …)
• (Optional) Filter on a subset of context attributes (with/without notifications, at the current hour of the day, …)
Two problems follow (one such feature is sketched below): a combinatorial explosion of features, and aggregation features that make inference expensive.
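A minimal sketch of one instance of this recipe (access rate over a trailing window, with an optional context filter); the data layout and names are assumptions, not from the talk:

```python
from datetime import datetime, timedelta

def access_rate(history: list[tuple[datetime, dict, bool]], now: datetime,
                days: int, context_filter: dict | None = None) -> float:
    """Access rate over the trailing `days` days.

    `history` holds (session time, context attributes, accessed?) tuples;
    `context_filter` optionally restricts the aggregation to sessions whose
    context matches, e.g. {"has_notifications": True}.
    """
    cutoff = now - timedelta(days=days)
    matched = [
        accessed
        for ts, ctx, accessed in history
        if ts >= cutoff and all(ctx.get(k) == v
                                for k, v in (context_filter or {}).items())
    ]
    return sum(matched) / len(matched) if matched else 0.0
```

Every choice of aggregation type, window, and filter is one more feature, which is where the combinatorial explosion comes from; each one also has to be recomputed (or incrementally maintained) at inference time.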
Formulation as an ML problem: Models
Traditional models:
• Simple baseline: output the lifetime access rate for each user — the most basic historical feature, and surprisingly effective
• Logistic Regression and Gradient-Boosted Decision Trees, consuming a concatenated vector of engineered features
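As a sketch, the baseline simply uses each user’s lifetime access rate as the predicted probability (same hypothetical history layout as in the earlier sketch):

```python
from datetime import datetime

def baseline_score(history: list[tuple[datetime, dict, bool]]) -> float:
    """Baseline model: the user's lifetime access rate, used directly as P(access)."""
    if not history:
        return 0.0
    return sum(accessed for _, _, accessed in history) / len(history)
```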
Alt-text: The pile gets soaked with data and starts to get mushy over time, so it's technically recurrent. — xkcd #1838
Neural networks to the rescue
Recurrent neural networks address the problems with historical features:
• Complex, non-linear interactions between features can be captured through a hidden-state “memory” for each user.
• Hidden-state updates are incremental in nature.
• Storage consumption is bounded by the number of hidden dimensions.
• Model each user’s session history as a sequential prediction task.
Recurrent Network Architecture
[Diagram: a timeline of sessions at times t_1, t_2, t_3. Online predictions P(A_1), P(A_2), P(A_3) are made by an MLP at session start; hidden states h_1, h_2, h_3 are produced asynchronously by a GRU at times t_1 + δ, t_2 + δ, t_3 + δ, once each session’s true label is known. A prediction may therefore use a stale hidden state (e.g. P(A_3) computed from h_1).]
Prediction layer (online, produces P(A_i)):
• Inputs: the last known hidden state (e.g. h_1), the feature vector f_i, and T(t_i − t_1), an encoding of the time elapsed since that hidden state
• 1-layer fully-connected network (256 neurons)
• A latent cross [1] is helpful: h ∘ (1 + Linear(f_i))
Hidden layer (asynchronous, produces h_i):
• Inputs: the previous hidden state h_{i−1}, the feature vector f_i, the true label A_i, and T(Δt_i), an encoding of the time since the previous update
• GRU with 128 hidden dimensions
Hidden-state updates are decoupled from predictions, and session and update delays (δ) are modeled explicitly through the time encodings.
[1] Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V., and Chi, E. H. (2018). Latent Cross: Making Use of Context in Recurrent Recommender Systems.
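A rough PyTorch sketch of this architecture, with dimensions taken from the slides (GRU with 128 hidden dims, one 256-unit fully-connected layer, latent cross in the prediction path); the exact time encoding T(·), the input layout, and where the latent cross is applied are assumptions on my part:

```python
import torch
import torch.nn as nn

class PredictivePrecomputeRNN(nn.Module):
    """Sketch of the decoupled GRU + MLP architecture described in the talk."""

    def __init__(self, feature_dim: int, hidden_dim: int = 128, mlp_dim: int = 256):
        super().__init__()
        # Hidden-state update: consumes [features f_i, true label A_i, encoded elapsed time].
        self.gru_cell = nn.GRUCell(feature_dim + 1 + 1, hidden_dim)
        # Latent cross: elementwise h * (1 + Linear(f)).
        self.cross = nn.Linear(feature_dim, hidden_dim)
        # Prediction head: consumes [crossed hidden state, features, encoded time since h].
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + feature_dim + 1, mlp_dim),
            nn.ReLU(),
            nn.Linear(mlp_dim, 1),
        )

    def encode_time(self, dt_seconds: torch.Tensor) -> torch.Tensor:
        # Placeholder time encoding T(dt); the talk's exact encoding may differ.
        return torch.log1p(dt_seconds).unsqueeze(-1)

    def update_hidden(self, h: torch.Tensor, f: torch.Tensor,
                      accessed: torch.Tensor, dt_seconds: torch.Tensor) -> torch.Tensor:
        """Asynchronous hidden-state update, run once the session's label is known.

        `accessed` is a float 0/1 tensor of shape (batch,)."""
        x = torch.cat([f, accessed.unsqueeze(-1), self.encode_time(dt_seconds)], dim=-1)
        return self.gru_cell(x, h)

    def predict(self, h_last: torch.Tensor, f: torch.Tensor,
                dt_since_h: torch.Tensor) -> torch.Tensor:
        """Online prediction of P(access) from the last known (possibly stale) hidden state."""
        crossed = h_last * (1.0 + self.cross(f))       # latent cross
        x = torch.cat([crossed, f, self.encode_time(dt_since_h)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)
```

In serving, `predict` would run synchronously at session start using whatever hidden state is currently stored for the user, while `update_hidden` runs later in an asynchronous pipeline once the access label has been observed.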
Training Details
• 1M user histories over a 30-day period
• ~60 sessions per user on average, ~10% positive rate
• Loss is only computed on the last 21 days
• All evaluation metrics use the last 7 days
• Training takes about 8 hours on a GPU (PyTorch); possibly faster with BPPSA?
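One way the “loss only on the last 21 days” detail might be implemented, sketched below; the per-session tensor layout and the burn-in handling are assumptions:

```python
import torch
import torch.nn.functional as F

def masked_bce_loss(predictions: torch.Tensor, labels: torch.Tensor,
                    session_day: torch.Tensor, burn_in_days: int = 9) -> torch.Tensor:
    """Binary cross-entropy over one user's sessions, skipping the first
    `burn_in_days` of the 30-day window so the loss covers only the last
    21 days. Earlier sessions still drive hidden-state updates; they just
    do not contribute to the loss."""
    mask = (session_day >= burn_in_days).float()
    per_session = F.binary_cross_entropy(predictions, labels, reduction="none")
    return (per_session * mask).sum() / mask.sum().clamp(min=1.0)
```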
Results
Precision and Recall for Precompute
Precision: (true positives) / (predicted positives)
• What percentage of precomputed results are accessed?
• Inversely correlated with additional compute cost.
Recall: (true positives) / (total positives)
• What percentage of accesses used precomputed results?
• Directly correlated with product latency improvements.
Precision-Recall Curves: FB Mobile Tab
[Plot: precision (0–100%) vs. recall (0–100%) curves for the Baseline, Logistic Regression, GBDT, and RNN models.]
In practice, we typically try to hit a precision target and compare models by their recall at that target, e.g. recall at precision = 50%.
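A minimal sketch of how a recall-at-precision-target operating point can be read off a precision-recall curve (using scikit-learn here; the talk does not specify tooling):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true: np.ndarray, y_score: np.ndarray,
                        precision_target: float = 0.5) -> float:
    """Largest recall achievable while keeping precision at or above the target."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    feasible = recall[precision >= precision_target]
    return float(feasible.max()) if feasible.size else 0.0
```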
Numerical comparison: FB Mobile Tab

Model Type                  PR-AUC   R@50%
Baseline                     0.470   0.413
Logistic Regression          0.546   0.596
GBDT                         0.578   0.616
Recurrent Neural Network     0.596   0.642
Improvement over GBDT       +3.11%  +4.22%

~3.4% increase in successful prefetches