How Does Selective Mechanism Improve Self-Attention Networks?
Xinwei Geng 1, Longyue Wang 2, Xing Wang 2, Bing Qin 1, Ting Liu 1, Zhaopeng Tu 2
1 Research Center for Social Computing and Information Retrieval, HIT
2 NLP Center, Tencent AI Lab
Conventional Self-Attention Networks (SANs)
• Calculate the attentive output by glimpsing the entire sequence
• In most cases, only a subset of the input elements is important
Selective Self-Attention Networks (SSANs)
• A universal and flexible implementation of the selective mechanism
• Select a subset of input words, on top of which self-attention is conducted
Selector
• Parameterize the selection action a ∈ {SELECT, DISCARD} with an auxiliary policy network
  – SELECT (1) indicates that the element is selected
  – DISCARD (0) indicates that the element is abandoned
• Reinforcement learning is used to train the policy network
  – employ Gumbel-Sigmoid to approximate the sampling
  – G' and G'' are Gumbel noises
  – τ is the temperature parameter
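The slide names the Gumbel noises G', G'' and the temperature τ, which suggests the standard Gumbel-Sigmoid relaxation GS(x) = σ((x + G' − G'') / τ). Below is a minimal NumPy sketch assuming that form; the function name, temperature value, and example logits are illustrative, not taken from the paper's code.

```python
import numpy as np

def gumbel_sigmoid(logits, tau=0.1, rng=np.random.default_rng(0)):
    """Relaxed sampling of the binary SELECT/DISCARD action.

    G' and G'' are independent Gumbel(0, 1) noises and tau is the
    temperature; as tau -> 0 the output approaches hard 0/1 decisions.
    """
    g1 = rng.gumbel(size=np.shape(logits))  # G'
    g2 = rng.gumbel(size=np.shape(logits))  # G''
    return 1.0 / (1.0 + np.exp(-(np.asarray(logits) + g1 - g2) / tau))

# Selection scores for a 5-token sequence, as produced by a policy network
logits = np.array([2.0, -1.5, 0.3, 4.0, -3.0])
print(gumbel_sigmoid(logits))  # values pushed towards {0, 1}
```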
Experiments
Evaluation of Word Order Encoding
• Employ the bigram order shift detection and word reordering detection tasks to investigate the ability to capture local and global word order
• Bigram order shift detection (Conneau et al., 2018)
  – two random adjacent words are inverted
  – e.g. what are you doing out there? => what you are doing out there?
• Word reordering detection (Yang et al., 2019)
  – a random word is popped and inserted into another position
  – e.g. Bush held a talk with Sharon. => Bush a talk held with Sharon.
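To make the two probes concrete, here is a small sketch of how such corrupted examples can be constructed from a token sequence; the function names are illustrative and this is not the official data-generation code of either benchmark.

```python
import random

def bigram_shift(tokens, rng=random.Random(0)):
    """Local probe: invert two random adjacent words."""
    tokens = list(tokens)
    i = rng.randrange(len(tokens) - 1)
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

def word_reordering(tokens, rng=random.Random(0)):
    """Global probe: pop a random word and insert it at another position."""
    tokens = list(tokens)
    word = tokens.pop(rng.randrange(len(tokens)))
    tokens.insert(rng.randrange(len(tokens) + 1), word)
    return tokens

print(" ".join(bigram_shift("what are you doing out there ?".split())))
print(" ".join(word_reordering("Bush held a talk with Sharon .".split())))
```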
Detection of Local Word Reordering
Detection of Global Word Reordering
Evaluation of Structural Modeling
• Leverage the tree depth and top constituent tasks to assess the syntactic information embedded in the encoder representations
• Tree depth (Conneau et al., 2018)
  – check whether the examined model can group sentences by the depth of the longest path from the root to any leaf
• Top constituent (Conneau et al., 2018)
  – classify sentences in terms of the sequence of top constituents immediately below the root node
Structures Embedded in Representations
• SSANs are more robust to the depth of the sentences
• SSANs significantly improve the prediction F1 score as the complexity of the sentences increases
Structures Modeled by Attention
• Constructing constituency trees from the attention distributions
  – attention within phrases is stronger than attention across phrases (Mareček and Rosa, 2018)
  – when splitting a phrase with span (i, j), the goal is to find a position k that maximizes the scores of the two resulting phrases (see the sketch below)
  – the Stanford CoreNLP toolkit is used to annotate English sentences as gold constituency trees
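A minimal sketch of the splitting procedure described above, assuming the score of a span is the average attention weight inside it; the scoring function, `span_score`, and the toy attention matrix are illustrative assumptions, and the exact formulation in Mareček and Rosa (2018) may differ.

```python
import numpy as np

def span_score(attn, i, j):
    """Score of span [i, j]: average attention weight inside the span
    (an assumed proxy for within-phrase attention strength)."""
    return float(attn[i:j + 1, i:j + 1].mean())

def build_tree(attn, i, j):
    """Recursively split [i, j] at the position k that maximizes the sum of
    the scores of the two resulting phrases [i, k] and [k + 1, j]."""
    if i == j:
        return i  # single-word span is a leaf
    k = max(range(i, j),
            key=lambda s: span_score(attn, i, s) + span_score(attn, s + 1, j))
    return (build_tree(attn, i, k), build_tree(attn, k + 1, j))

# Toy 4-token attention matrix (rows sum to 1); the first two and last two
# tokens attend mostly to each other, so they form the recovered phrases.
attn = np.array([[0.4, 0.4, 0.1, 0.1],
                 [0.4, 0.4, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4],
                 [0.1, 0.1, 0.4, 0.4]])
print(build_tree(attn, 0, 3))  # ((0, 1), (2, 3))
```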
Conclusion
• We adopt a universal and flexible implementation of the selective mechanism, demonstrating its effectiveness across three NLP tasks
• SSANs can identify improper word orders in both local and global ranges by learning to attend to the expected words
• SSANs produce representations that better capture syntactic structure through selective attention
Thanks & QA