Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks
Jun Xiao¹, Hao Ye¹, Xiangnan He², Hanwang Zhang², Fei Wu¹, Tat-Seng Chua²
¹ College of Computer Science, Zhejiang University
² School of Computing, National University of Singapore
Example: Predicting Customers' Income
• Inputs:
a) Occupation = { banker, engineer, … }
b) Level = { junior, senior }
c) Gender = { male, female }
• Observation: junior bankers have a lower income than junior engineers, but the reverse holds for senior bankers.
• Feature vector X (one-hot encoding) and target y:

# | Occupation | Level  | Gender | B E … J S M F | Target y
1 | Banker     | Junior | Male   | 1 0 … 1 0 1 0 | 0.4
2 | Engineer   | Junior | Male   | 0 1 … 1 0 1 0 | 0.6
3 | Banker     | Junior | Female | 1 0 … 1 0 0 1 | 0.4
4 | Engineer   | Junior | Female | 0 1 … 1 0 0 1 | 0.6
5 | Banker     | Senior | Male   | 1 0 … 0 1 1 0 | 0.9
6 | Engineer   | Senior | Male   | 0 1 … 0 1 1 0 | 0.7
7 | Banker     | Senior | Female | 1 0 … 0 1 0 1 | 0.9
8 | Engineer   | Senior | Female | 0 1 … 0 1 0 1 | 0.7
… | …          | …      | …      | …             | …
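A minimal sketch of the one-hot encoding shown in the table (the field names and the example row come from the table above; the helper function and its name are our own illustration, not code from the paper):

```python
import numpy as np

# Vocabulary of each categorical field; the order fixes the one-hot positions.
FIELDS = {
    "Occupation": ["Banker", "Engineer"],
    "Level": ["Junior", "Senior"],
    "Gender": ["Male", "Female"],
}

def one_hot_encode(sample):
    """Concatenate per-field one-hot vectors into a single sparse feature vector x."""
    parts = []
    for field, vocab in FIELDS.items():
        vec = np.zeros(len(vocab))
        vec[vocab.index(sample[field])] = 1.0
        parts.append(vec)
    return np.concatenate(parts)

# Row 1 of the table: a junior male banker with target income 0.4.
x = one_hot_encode({"Occupation": "Banker", "Level": "Junior", "Gender": "Male"})
print(x)  # [1. 0. 1. 0. 1. 0.]
```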
Linear Regression (LR)
• Model Equation: ŷ(x) = w_0 + Σ_{i=1}^{n} w_i x_i
• Example (Occupation = Banker, Level = Junior, Gender = Male):
ŷ(x) = w_Banker + w_Junior + w_Male
• Drawback: cannot learn cross-feature effects like "junior bankers have a lower income than junior engineers, while senior bankers have a higher income than senior engineers."
Factorization Machines (FM)
• Model Equation: ŷ(x) = w_0 + Σ_{i=1}^{n} w_i x_i + Σ_{i=1}^{n} Σ_{j=i+1}^{n} ŵ_ij x_i x_j, where ŵ_ij = v_i^T v_j
a) v_i ∈ ℝ^k : the embedding vector of feature i
b) k : the size of the embedding vector
• Example (Occupation = Banker, Level = Junior, Gender = Male):
ŷ(x) = w_Banker + w_Junior + w_Male + ⟨v_Banker, v_Junior⟩ + ⟨v_Banker, v_Male⟩ + ⟨v_Junior, v_Male⟩
• Drawback: all factorized interactions are modelled with the same weight.
• For example, the Gender variable above is less important than the other variables for estimating the target.
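A minimal numpy sketch of the FM prediction above, considering only the non-zero features of one sample (the variable names and random initialization are our own, not taken from the paper's code):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
k = 4                                    # embedding size
features = ["Banker", "Junior", "Male"]  # non-zero features of one sample

w0 = 0.0                                           # global bias
w = {f: rng.normal() for f in features}            # first-order weights w_i
v = {f: rng.normal(size=k) for f in features}      # embedding vectors v_i

def fm_predict(active):
    """FM prediction: bias + linear terms + pairwise inner products <v_i, v_j>."""
    linear = sum(w[f] for f in active)
    pairwise = sum(float(np.dot(v[i], v[j])) for i, j in combinations(active, 2))
    return w0 + linear + pairwise

print(fm_predict(features))
```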
Attentional Factorization Machines (AFM)
• Our main contributions:
a) Pair-wise Interaction Layer
b) Attention-based Pooling Layer
• The remaining components of the architecture are the same as in FM.
Contribution #1: Pair-wise Interaction Layer
• Layer Equation: f_PI(ℰ) = { (v_i ⊙ v_j) x_i x_j } for (i, j) ∈ R_x, where R_x = { (i, j) : i, j ∈ 𝒳, j > i }
Where:
a) ⊙ : element-wise product of two vectors
b) 𝒳 : the set of non-zero features in the feature vector x
c) ℰ = { v_i x_i } for i ∈ 𝒳 : the output of the embedding layer
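A sketch of this pair-wise interaction layer in numpy (same illustrative setup as the FM snippet above; for one-hot features the x_i x_j factor is simply 1, and the names are ours):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
k = 4
features = ["Banker", "Junior", "Male"]
v = {f: rng.normal(size=k) for f in features}   # embedding layer output (each x_i = 1)

def pairwise_interactions(active):
    """Return the set {(v_i * v_j)} of element-wise products over all feature pairs in R_x."""
    return {(i, j): v[i] * v[j] for i, j in combinations(active, 2)}

interactions = pairwise_interactions(features)
print(len(interactions))                          # 3 pairs for 3 non-zero features
print(interactions[("Banker", "Junior")].shape)   # (4,) -- each interaction is k-dimensional
```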
Express FM as a Neural Network
• Sum pooling over the pair-wise interaction layer, followed by a prediction layer:
ŷ(x) = w_0 + Σ_{i=1}^{n} w_i x_i + p^T Σ_{(i,j) ∈ R_x} (v_i ⊙ v_j) x_i x_j + b
Where:
a) p ∈ ℝ^k : weights of the prediction layer
b) b : bias of the prediction layer
• By fixing p to 1 (a vector of ones) and b to 0, we exactly recover the FM model.
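A self-contained numpy sketch of this view (our own names and random values, purely for illustration): fixing p to a vector of ones and b to 0 reproduces the plain FM prediction.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
k = 4
features = ["Banker", "Junior", "Male"]          # non-zero features of one sample
w0 = 0.0
w = {f: rng.normal() for f in features}          # first-order weights
v = {f: rng.normal(size=k) for f in features}    # embeddings

def fm_as_network(active, p=None, b=0.0):
    """FM as sum pooling over pairwise element-wise products, then a prediction layer (p, b)."""
    pooled = sum(v[i] * v[j] for i, j in combinations(active, 2))   # k-dimensional vector
    linear = w0 + sum(w[f] for f in active)
    if p is None:
        p = np.ones(k)                           # p = 1 and b = 0 recover plain FM
    return linear + float(np.dot(p, pooled)) + b

# Plain FM prediction for comparison: sum of pairwise inner products <v_i, v_j>.
fm_plain = w0 + sum(w[f] for f in features) + sum(
    float(np.dot(v[i], v[j])) for i, j in combinations(features, 2))
print(np.isclose(fm_as_network(features), fm_plain))   # True
```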
Contribution #2: Attention-based Pooling Layer
• The idea of attention is to allow different parts to contribute differently when compressing them into a single representation.
• Motivated by this drawback of FM, we propose to employ the attention mechanism on feature interactions by performing a weighted sum over the interacted vectors:
f_Att(f_PI(ℰ)) = Σ_{(i,j) ∈ R_x} a_ij (v_i ⊙ v_j) x_i x_j
where a_ij is the attention score for feature interaction (i, j).
Attention-based Pooling Layer
• Definition of the attention network:
a'_ij = h^T ReLU( W (v_i ⊙ v_j) x_i x_j + b ),   a_ij = exp(a'_ij) / Σ_{(i,j) ∈ R_x} exp(a'_ij)
Where:
a) W ∈ ℝ^{t×k}, b ∈ ℝ^t, h ∈ ℝ^t : parameters of the attention network
b) t : the attention factor, denoting the hidden layer size of the attention network
• The output of the attention-based pooling layer is a k-dimensional vector.
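A minimal sketch of such an attention network over the interaction vectors (parameter shapes follow the slide; the random initialization, variable names, and printing are our own assumptions):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
k, t = 4, 8                                      # embedding size, attention factor
features = ["Banker", "Junior", "Male"]
v = {f: rng.normal(size=k) for f in features}    # feature embeddings

W = rng.normal(size=(t, k))                      # attention network parameters
b = np.zeros(t)
h = rng.normal(size=t)

def attention_scores(active):
    """Score each pairwise interaction with a one-layer MLP + ReLU, then softmax-normalize."""
    pairs = list(combinations(active, 2))
    logits = np.array([h @ np.maximum(W @ (v[i] * v[j]) + b, 0.0) for i, j in pairs])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                     # softmax over all interactions of this sample
    return dict(zip(pairs, weights))

print(attention_scores(features))                # one attention weight per feature pair, summing to 1
```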
Summary of AFM
• The overall formulation of AFM:
ŷ_AFM(x) = w_0 + Σ_{i=1}^{n} w_i x_i + p^T Σ_{i=1}^{n} Σ_{j=i+1}^{n} a_ij (v_i ⊙ v_j) x_i x_j
• For comparison, the overall formulation of FM as a neural network is:
ŷ_FM(x) = w_0 + Σ_{i=1}^{n} w_i x_i + p^T Σ_{i=1}^{n} Σ_{j=i+1}^{n} (v_i ⊙ v_j) x_i x_j + b
• The attention scores give AFM stronger representation ability than FM.
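Putting the pieces together, a self-contained sketch of one AFM forward pass (parameters are random here; in practice w_0, w, v, p, W, b, h are all learned, and the naming below is ours, not the paper's code):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
k, t = 4, 8
features = ["Banker", "Junior", "Male"]            # non-zero features of one sample

# Learnable parameters (random here for illustration).
w0 = 0.0
w = {f: rng.normal() for f in features}            # first-order weights
v = {f: rng.normal(size=k) for f in features}      # feature embeddings
p = rng.normal(size=k)                             # prediction-layer weights
W, b_att, h = rng.normal(size=(t, k)), np.zeros(t), rng.normal(size=t)

def afm_predict(active):
    """AFM: linear part + p^T (attention-weighted sum of pairwise interaction vectors)."""
    pairs = list(combinations(active, 2))
    inter = [v[i] * v[j] for i, j in pairs]                         # pair-wise interaction layer
    logits = np.array([h @ np.maximum(W @ e + b_att, 0.0) for e in inter])
    a = np.exp(logits - logits.max()); a /= a.sum()                 # attention scores a_ij
    pooled = sum(a_ij * e for a_ij, e in zip(a, inter))             # attention-based pooling
    return w0 + sum(w[f] for f in active) + float(p @ pooled)

print(afm_predict(features))
```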
Experiments
• Task #1: Context-aware App Usage Prediction
a) Frappe data: userID, appID, and 8 context variables (sparsity: 99.81%)
• Task #2: Personalized Tag Recommendation
a) MovieLens data: userID, movieID, and tag (sparsity: 99.99%)
• Random split: 70% training, 20% validation, 10% testing
• Prediction error evaluated by RMSE (lower score, better performance).
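For reference, RMSE on a held-out split is just the square root of the mean squared prediction error; a generic sketch (not tied to the paper's evaluation code) follows:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: lower means better prediction."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([0.4, 0.6, 0.9], [0.45, 0.55, 0.8]))   # ~0.07
```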
Baselines
• 1. LibFM: the official implementation of second-order FM.
• 2. HOFM: a third-party implementation of higher-order FM; we experimented with order size 3.
• 3. Wide&Deep: same architecture as in the paper, a 3-layer MLP: 1024 -> 512 -> 256.
• 4. DeepCross: same structure as in the paper, 10 layers (5 residual units): 512 -> 512 -> 256 -> 128 -> 64.
• k (the embedding size) is set to 256 for all baselines and for our AFM model.
I. Performance Comparison
• For Wide&Deep, DeepCross and AFM, pre-training their feature embeddings with FM leads to a lower RMSE than end-to-end training with random initialization.
• Findings:
1. The linear way of modelling high-order interactions (HOFM) brings only minor benefits.
2. Wide&Deep slightly outperforms LibFM, while DeepCross suffers from overfitting.
3. AFM significantly outperforms LibFM with the fewest additional parameters (M denotes million in the parameter counts).
II. Hyper-parameter Investigation
• Dropout ratio (on the embedding layer) = *Best
• λ (L2 regularization on the attention network) = ? (varied in this experiment)
• Attention factor = 256 = k (the embedding size)
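For context, the training objective behind these hyper-parameters, as we read the setup, is the squared loss with an L2 penalty applied only to the attention network's weight matrix W, while dropout regularizes the layer noted above; λ is the coefficient varied in this experiment:

L = Σ_{x ∈ 𝒯} ( ŷ_AFM(x) − y(x) )² + λ ||W||²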
II. Hyper-parameter Investigation
• Dropout ratio = *Best
• λ (L2 regularization on the attention network) = *Best
• Attention factor = ? (varied in this experiment)
II. Hyper-parameter Investigation
• Dropout ratio = *Best
• λ (L2 regularization on the attention network) = *Best
• Attention factor = *Best
III. Micro-level Analysis
• FM: fix a_ij to a uniform number 1/|R_x|.
• FM+A: fix the feature embeddings pre-trained by FM and train the attention network only.
• AFM is more explainable, since it learns the weight of each feature interaction.
• Performance improves by about 3% in this case.
Conclusion
• Our proposed AFM enhances FM by learning the importance of feature interactions with an attention network, achieving an 8.6% relative improvement.
− It improves the representation ability of an FM model.
− It improves the interpretability of an FM model.
• This work is orthogonal to our recent work on Neural FM [He and Chua, SIGIR 2017],
− in which we develop deep variants of FM for modelling high-order feature interactions.
Future Work
• Explore a deep version of AFM by stacking multiple non-linear layers above the attention-based pooling layer.
• Improve learning efficiency by using learning to hash and data sampling.
• Develop FM variants for semi-supervised and multi-view learning.
• Explore AFM for modelling other types of data in different applications, such as:
a) texts for question answering,
b) more semantic-rich multimedia content.
Thanks!
Code: https://github.com/hexiangnan/attentional_factorization_machine