OpenTag: Open Attribute Value Extraction From Product Profiles
Guineng Zheng*, Subhabrata Mukherjee Δ, Xin Luna Dong Δ, Feifei Li*
Δ Amazon.com, * University of Utah
KDD 2018
Motivation
"Alexa, what are the flavors of Nescafe?"
"Nescafe coffee flavors include caramel, mocha, vanilla, coconut, cappuccino, original/regular, decaf, espresso, and cafe au lait decaf."
Attribute Value Extraction From Product Profiles
[Figure: a product profile annotated with Flavor and Brand values]
Characteristics of Attribute Extraction

Open World Assumption
• No predefined attribute values
• New attribute value discovery, e.g.:
  1. beef flavor
  2. lamb flavor
  3. venison flavor

Limited semantics, irregular syntax
• Most titles have 10-15 words; most bullets have 5-6 words
• Phrases, not sentences: lack of regular grammatical structure in titles and bullets
• Attribute stacking, e.g.:
  1. Rachael Ray Nutrish Just 6 Natural Dry Dog Food, Lamb Meal & Brown Rice Recipe
  2. Lamb Meal is the #1 Ingredient
Prior Work and Our Contributions
• Open World Assumption: Ghani et al. 2003, Putthividhya et al. 2011, Ling et al. 2012, Petrovski et al. 2017
• No Lexicon, No Hand-crafted Features: Huang et al. 2015, Kozareva et al. 2016, Lample et al. 2016, Ma et al. 2016
• Active Learning: Kozareva et al. 2016
• OpenTag (this work) combines all three: open world assumption, no lexicon or hand-crafted features, and active learning
Outline
• Problem Definition
• Models
• Experiments
• Active Learning
• Experiments
Recap: Problem Statement
Given product profiles (e.g., titles, descriptions, bullets) and a set of attributes, extract values of those attributes from the profile text.

Input: Product Profile
• Title: CESAR Canine Cuisine Variety Pack Filet Mignon & Porterhouse Steak Dog Food (Two 12-Count Cases)
• Description: A Delectable Meaty Meal for a Small Canine Looking for the right food … This delicious dog treat contains tender slices of meat in gravy and is formulated to meet the nutritional needs of small dogs. …
• Bullets: Filet Mignon Flavor; Porterhouse Steak Flavor; CESAR Canine Cuisine provides complete and balanced nutrition

Output: Extractions
• Flavor: 1. filet mignon, 2. porterhouse steak
• Brand: cesar canine cuisine
Attribute Extraction as Sequence Tagging
• Input sequence x = {w1, w2, …, wn}; tagging decisions y = {t1, t2, …, tn}
• Tags: B = Beginning of attribute value, I = Inside, O = Outside, E = End of attribute value

Example (Flavor):
x: beef (w1)  meal (w2)  & (w3)  ranch (w4)  raised (w5)  lamb (w6)  recipe (w7)
y: B          E          O       B           I            E          O
Flavor extractions: {beef meal}, {ranch raised lamb}
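As an illustration (not from the slides), here is a minimal Python sketch of decoding extractions from a BIOE tag sequence; the tokens and tags mirror the example above, and the function name decode_bioe is ours.

```python
def decode_bioe(tokens, tags):
    """Collect attribute values from a BIOE tag sequence.

    B = beginning, I = inside, E = end of a value; O = outside.
    A value is a maximal span of the form B (I)* E.
    """
    values, span = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            span = [token]          # start a new span
        elif tag == "I" and span:
            span.append(token)      # continue the current span
        elif tag == "E" and span:
            span.append(token)      # close the span and emit it
            values.append(" ".join(span))
            span = []
        else:                       # "O" or a malformed sequence
            span = []
    return values

tokens = ["beef", "meal", "&", "ranch", "raised", "lamb", "recipe"]
tags   = ["B",    "E",    "O", "B",     "I",      "E",    "O"]
print(decode_bioe(tokens, tags))   # ['beef meal', 'ranch raised lamb']
```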
Outline
• Introduction
• Models
  • BiLSTM
  • BiLSTM + CRF
  • Attention Mechanism
  • OpenTag Architecture
• Active Learning
OpenTag Architecture
OpenTag Architecture (1/4): Word Embedding
Map 'beef', 'chicken', 'pork' to nearby points in the embedding space for the Flavor attribute
OpenTag Architecture (2/4): Bidirectional LSTM
Capture long- and short-range dependencies in the input sequence via forward and backward hidden states
OpenTag Architecture (3/4): CRF
• Bi-LSTM captures dependencies between tokens in the input sequence, but not between output tags
• A Conditional Random Field (CRF) enforces tagging consistency
OpenTag Architecture (4/4): Attention
• Focus on important hidden concepts and downweight the rest: attention
• An attention matrix A attends to important BiLSTM hidden states h_t
• α_{t,t'} ∈ A captures the importance of h_t w.r.t. h_{t'}
• The attention-focused representation l_t of token x_t is given by: l_t = Σ_{t'=1..n} α_{t,t'} · h_{t'}
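For concreteness, a numpy sketch of the combination step l_t = Σ_{t'} α_{t,t'} · h_{t'}; the dot-product scoring used to fill A is an assumption on our part, since the slide defines A and l_t but not how the scores are computed.

```python
import numpy as np

def attention(H):
    """Self-attention over BiLSTM hidden states.

    H: (n, d) matrix, one d-dim hidden state h_t per token.
    Returns L: (n, d), where row t is the attention-focused
    representation l_t = sum over t' of alpha[t, t'] * h_t'.
    """
    scores = H @ H.T                              # (n, n) similarity of h_t and h_t' (assumed scoring)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability for softmax
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # each row sums to 1: alpha[t, t']
    return A @ H                                  # l_t = sum_t' alpha[t, t'] * h_t'

H = np.random.randn(7, 200)   # 7 tokens, 200-dim BiLSTM states
L = attention(H)
print(L.shape)                # (7, 200)
```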
OpenTag Architecture
Experimental Discussions: Datasets
Results
Overall, OpenTag obtains a high F-score of 82.8%
Results
• Highest improvement in F-score (5.3%) over BiLSTM-CRF on product descriptions
• However, descriptions are less accurate than titles
OpenTag discovers new attribute values not seen during training with 82.4% F-score
• No overlap in attribute values between the train and test splits
Interpretability via Attention
OpenTag achieves better concept clustering
[Figures: distribution of word vectors before attention vs. after attention]
Semantically related words come closer in the embedding space
Outline
• Introduction
• Models
  • BiLSTM
  • BiLSTM + CRF
  • Attention Mechanism
  • OpenTag Architecture
• Active Learning
Active Learning: Motivation
• Annotating training data is expensive and time-consuming
• Manual annotation does not scale to thousands of verticals, with hundreds of attributes and thousands of values in each domain
Active Learning (Settles, 2009)
• A query selection strategy such as uncertainty sampling selects the sample with the highest uncertainty for annotation
• But it ignores the difficulty of estimating individual tags
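As a reference point, a minimal sketch of least-confidence uncertainty sampling, the classic strategy described above; the per-token probability input format is an assumption for illustration.

```python
import numpy as np

def least_confidence(probs):
    """Pick the unlabeled sample whose most likely tag sequence
    has the lowest model confidence (classic uncertainty sampling).

    probs: list of (seq_len, n_tags) arrays of per-token tag
    probabilities, one per unlabeled sample. A sample's confidence
    is approximated by the product of its per-token maxima, which
    ignores how hard individual tags are to estimate -- the
    weakness the tag-flip strategy on the next slide addresses.
    """
    confidences = [np.prod(p.max(axis=1)) for p in probs]
    return int(np.argmin(confidences))

# 100 unlabeled samples, 7 tokens each, 4 tags (B/I/O/E)
probs = [np.random.dirichlet(np.ones(4), size=7) for _ in range(100)]
print(least_confidence(probs))
```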
Tag Flip as Query Strategy
• Simulate a committee of OpenTag learners over multiple epochs, using the dropout mechanism to vary the committee members
• The most informative sample is the one with major disagreement among committee members on the tags of its tokens across epochs
• That is, the most informative sample has the highest number of tag flips across all the epochs (see the sketch below)

Example: duck , fillet mignon and ranch raised lamb flavor
Epoch i tags: B O B E O B I E O
Epoch j tags: B O B O O O O B O
Tag flips = 4
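A small sketch of counting tag flips across epochs, reproducing the flip count of 4 from the example above; the list-of-tag-sequences interface is ours, not the paper's.

```python
def tag_flips(epoch_tags):
    """Count tag flips for one sample across training epochs.

    epoch_tags: list of tag sequences for the same sample, one per
    epoch (the "committee" simulated via dropout). A flip is a
    position whose tag changes between consecutive epochs; the
    sample with the most flips is queried for annotation.
    """
    flips = 0
    for prev, curr in zip(epoch_tags, epoch_tags[1:]):
        flips += sum(p != c for p, c in zip(prev, curr))
    return flips

# Two epochs' taggings of "duck , fillet mignon and ranch raised lamb flavor"
epoch1 = ["B", "O", "B", "E", "O", "B", "I", "E", "O"]
epoch2 = ["B", "O", "B", "O", "O", "O", "O", "B", "O"]
print(tag_flips([epoch1, epoch2]))   # 4
```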
Tag Flip (red) outperforms Uncertainty Sampling (blue)
[Plots: tag flip (TF) vs. least confidence (LC) on the detergent data and on multi-attribute extraction]
OpenTag reduces the burden of human annotation by 3.3x
[Plots: learning from scratch on the detergent data and on multi-attribute extraction]
• OpenTag requires only 500 training samples to obtain > 90% precision and recall
• Active learning brings this down to 150 training samples with similar performance
Production Impact

              Previous Coverage of            Coverage of    Increase in
              Existing Production System (%)  OpenTag (%)    Coverage (%)
Attribute_1   23                              78             53
Attribute_2   21                              72             45
Attribute_3   < 1                             56             50
Attribute_4   < 1                             49             48
Summary
• OpenTag models the open world assumption (OWA), multi-word values, and multi-attribute value extraction with sequence tagging
• Word embeddings + Bi-LSTM + CRF + attention
• OpenTag + active learning reduces the burden of human annotation (by 3.3x)
• Tag flip as a query strategy
• Interpretability: better concept clustering, attention heatmaps, etc.
Thank you for your attention!
Backup Slides
Word Embedding
• Map words co-occurring in similar contexts to nearby points in the embedding space
• Pre-trained embeddings learn a single representation for each word
• But 'duck' as a Flavor should have a different embedding than 'duck' as a Brand
• OpenTag learns word embeddings conditioned on attribute tags
Bi-directional LSTM
• LSTMs (Hochreiter, 1997) capture long- and short-range dependencies between tokens, making them suitable for modeling token sequences
• Bidirectional LSTMs improve over LSTMs by capturing both forward (f_t) and backward (b_t) states at each timestep t
• The hidden state h_t at each timestep combines the two directional states: h_t = σ([b_t, f_t])
Bi-directional LSTM

Architecture, bottom to top (example input: ranch raised beef flavor):
• Word Index: w1 w2 w3 w4
• Word Embedding (GloVe, 50 dimensions): e1 e2 e3 e4
• Forward LSTM (100 units): f1 f2 f3 f4
• Backward LSTM (100 units): b1 b2 b3 b4
• Hidden Vector (100 + 100 = 200 units): h1 h2 h3 h4
• Per-token B/I/O/E tag predictions, trained with cross-entropy loss
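A PyTorch sketch of this stack with the dimensions shown on the slide (50-dim embeddings, 100 units per direction, B/I/O/E outputs); the vocabulary size and class name are placeholders, and the CRF layer from the next slide is omitted.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """50-dim word embeddings -> BiLSTM with 100 units per direction
    (200-dim hidden vectors) -> per-token scores for the B/I/O/E tags,
    trained with cross-entropy. Vocabulary size is a placeholder.
    """
    def __init__(self, vocab_size=10000, emb_dim=50, hidden=100, n_tags=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)  # 100 + 100 = 200 units in

    def forward(self, word_ids):                  # (batch, seq_len)
        emb = self.embedding(word_ids)            # (batch, seq_len, 50)
        h, _ = self.bilstm(emb)                   # (batch, seq_len, 200)
        return self.out(h)                        # (batch, seq_len, 4)

model = BiLSTMTagger()
scores = model(torch.randint(0, 10000, (1, 4)))  # e.g. "ranch raised beef flavor"
print(scores.shape)                              # torch.Size([1, 4, 4])
```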
Conditional Random Fields (CRF)
• Bi-LSTM captures dependencies between tokens in the input sequence, but not between output tags
• The likelihood of a token's tag being 'E' (end) or 'I' (inside) increases if the previous token's tag was 'I' (inside)
• Given an input sequence x = {x1, x2, …, xn} with tags y = {y1, y2, …, yn}, a linear-chain CRF models p(y | x) = (1/Z(x)) ∏_{t=1..n} ψ_t(y_{t-1}, y_t, x), where Z(x) sums the potentials over all possible tag sequences
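To make the formula above concrete, a brute-force numpy sketch of the linear-chain CRF probability, with log-potentials split into per-token emission scores and tag-to-tag transition scores; real implementations compute Z(x) with the forward algorithm, and the random scores here are only illustrative.

```python
import numpy as np
from itertools import product

def crf_log_prob(emissions, transitions, tags):
    """Log-probability of a tag sequence under a linear-chain CRF,
    normalized by brute-force enumeration (fine for a sketch).

    emissions:   (n, n_tags) per-token tag scores (e.g. from the BiLSTM)
    transitions: (n_tags, n_tags) score of tag b following tag a
    tags:        the candidate sequence y = (y_1, ..., y_n)
    """
    def score(seq):
        s = sum(emissions[t, y] for t, y in enumerate(seq))
        s += sum(transitions[a, b] for a, b in zip(seq, seq[1:]))
        return s

    n, n_tags = emissions.shape
    # log Z(x): log-sum-exp of scores over all n_tags**n sequences
    log_z = np.logaddexp.reduce([score(seq) for seq
                                 in product(range(n_tags), repeat=n)])
    return score(tags) - log_z

emissions = np.random.randn(4, 4)    # 4 tokens, tags B/I/O/E
transitions = np.random.randn(4, 4)  # e.g. a low score for O -> E
print(crf_log_prob(emissions, transitions, (0, 1, 3, 2)))  # B I E O
```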