  1. Self-tuning ongoing terminology extraction retrained on terminology validation decisions Alfredo Maldonado and David Lewis ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin TKE 2016 Copenhagen The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

  2. Agenda www.adaptcentre.ie
  • Motivation: Why do we need to do terminology extraction on an ongoing basis?
  • Methodology: Ongoing terminology extraction with and without learning
  • Experimental Setup and Results: Description of simulation experiments and results
  • Conclusions and next steps: The feedback loop in machine learning-based ongoing terminology extraction can help in identifying the majority of terms in a batch of new content

  3. www.adaptcentre.ie MOTIVATION

  4. A frequent assumption in terminology extraction www.adaptcentre.ie
  • Surely if I do terminology extraction at some point towards the beginning of a content creation project, I will capture the majority of the terms of interest that are ever likely to appear, right?
  • I’m basically taking a representative sample of the terms in the project

  5. Let’s test that assumption www.adaptcentre.ie
  • Here’s an actual example using the term-annotated ACL RD-TEC corpus (QasemiZadeh and Handschuh, 2014)
  • ACL RD-TEC: a corpus of ACL academic papers written between 1965 and 2006 in which domain-specific terms have been manually annotated

  6. Motivation – new content introduces new terms www.adaptcentre.ie
  • The proportion of new terms in a subsequent year never reaches 0
  • Between 12% and 20% of all valid terms in any given year will be new
  • If you don’t do term extraction periodically (e.g. annually), you will start missing A LOT OF new terms within a few years

  7. The reality is … www.adaptcentre.ie
  • As content gets updated, new, previously unseen terms will start appearing
  • These terms will not have been captured during our initial term extraction and will have to be researched by our users or our terminologists downstream, causing bottlenecks in translation / usage of terminology and perhaps incurring additional costs
  Clipart from https://openclipart.org

  8. www.adaptcentre.ie THE SOLUTION? (METHODOLOGY)

  9. Ongoing terminology extraction www.adaptcentre.ie
  • First proposed by Warburton (2013) – automatically filtering previously identified terms and non-terms in subsequent extraction exercises
  [Pipeline diagram: Content Batches 1, 2, 3, … each go through Extraction and Ranking, Automatic filtering against previously selected and rejected terms, and Validation; each batch’s selected terms feed the terminology and, together with the rejected terms, the filter lists applied to later batches]
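Conceptually, this kind of filtering is just set subtraction against the lists of terms already selected or rejected in earlier batches. A minimal Python sketch, with invented example term lists:

```python
# Minimal sketch of exclusion-list filtering between batches.
# The term lists below are invented examples, not data from the paper.

def filter_candidates(candidates, selected_terms, rejected_terms):
    """Drop candidates already validated (accepted or rejected) in earlier batches."""
    seen = selected_terms | rejected_terms
    return [c for c in candidates if c not in seen]

selected = {"machine translation", "language model"}
rejected = {"previous work", "experimental results"}
batch2 = ["machine translation", "neural network", "previous work", "parse tree"]

print(filter_candidates(batch2, selected, rejected))
# ['neural network', 'parse tree']
```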

  10. Proposed Solution: Machine Learning ongoing Terminology Extraction (MLTE) www.adaptcentre.ie
  • Instead of compiling term lists for filtering, we introduce a machine learning classification model that learns from terminologists’ validation decisions
  [Pipeline diagram: Content Batch 1 goes through Extraction and Validation and its selected and rejected terms are used to train the initial model; Content Batches 2, 3, … additionally go through Candidate Classification with a model retrained on earlier validation decisions; each batch’s selected terms feed the terminology]

  11. Proposed System Architecture www.adaptcentre.ie
  [Architecture diagram: for the CURRENT BATCH, a model trained on the text and validation decisions from the previous k batches classifies each term candidate as Valid or Not Valid; the validation decisions for the current batch are then added to the training data used for the next batch’s model]
  Parameter:
  • History size k (number of past batches to use as training data)

  12. www.adaptcentre.ie EXPERIMENTAL SETUP AND RESULTS

  13. Dataset www.adaptcentre.ie
  • We use the ACL RD-TEC corpus
  • Has a terminology gold standard
  • Has term index info (which terms appear in which docs)
  • Documents are time-stamped (date of conference), e.g. C04-1001_cln.txt, J05-1003_cln.txt
  • Sample: ACL RD-TEC papers from 2004 to 2006
  • 2,781 articles; 9,114,767 words; 3,300 words per article on average
  • Sample divided into chronological batches of approx. 40 articles each (69 batches)
  • Simulation of ongoing term extraction AND validation using an annotated, time-stamped corpus
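As a rough illustration of the batching step, the sketch below orders files by the year encoded in their ACL Anthology identifiers and chunks them into fixed-size batches. The year-parsing rule and the filenames are assumptions based on the examples above, not the authors’ actual preprocessing code:

```python
# Hypothetical sketch: split time-stamped ACL RD-TEC files into chronological batches.
# Assumes the ACL Anthology convention that characters 2-3 of the ID encode the year.

def year_of(filename):
    return 2000 + int(filename[1:3])   # "C04-1001_cln.txt" -> 2004

def make_batches(filenames, batch_size=40):
    ordered = sorted(filenames, key=year_of)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

files = ["J05-1003_cln.txt", "C04-1001_cln.txt", "P06-1002_cln.txt", "N04-2005_cln.txt"]
print(make_batches(files, batch_size=2))
# [['C04-1001_cln.txt', 'N04-2005_cln.txt'], ['J05-1003_cln.txt', 'P06-1002_cln.txt']]
```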

  14. Simulation www.adaptcentre.ie
  Given the current batch b_t:
  1. Extract term candidate n-grams from the articles in the batch (n = 1 .. 7)
  2. Automatically remove any term candidates that appeared in any previous batch – like Warburton (2013)
  3. Automatically remove any term candidates with POS patterns not associated with any valid terms in previous batches
  • This reduces the number of non-valid term candidates in the training data, counteracting the skew towards non-valid candidates
  • Notice there is no need to supply manual POS pattern filters!
  4. Using the previously trained model (if available), predict whether each term candidate is a valid term or not
  5. Evaluate the predictions by comparing them with the gold standard in the ACL RD-TEC annotation – this simulates the manual validation step
  6. Create new training data by concatenating these gold-standard data points with those of the previous k-1 batches (history of size k). In our experiments, best results with k = 16.
  7. Train a new model using the newly created training data
  8. Go to the next batch b_t+1 and start again from step 1 until all batches are completed
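A schematic Python sketch of this loop follows. The helper callables (extract_candidates, featurize, pos_pattern, gold_label) are hypothetical placeholders for the steps the slide describes, and scikit-learn’s LinearSVC stands in for the SVM; this is a sketch of the procedure under those assumptions, not the authors’ implementation:

```python
# Schematic sketch of the simulation loop (steps 1-8 above).
from collections import deque
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def simulate(batches, extract_candidates, featurize, pos_pattern, gold_label, k=16):
    history = deque(maxlen=k)   # gold-labelled data from the last k batches
    seen = set()                # candidates already seen in earlier batches (step 2)
    valid_pos = set()           # POS patterns of previously valid terms (step 3)
    vectorizer, model = DictVectorizer(), None
    predictions = []            # per-batch {candidate: predicted label}

    for batch in batches:
        cands = [c for c in extract_candidates(batch) if c not in seen]
        if valid_pos:
            cands = [c for c in cands if pos_pattern(c) in valid_pos]

        # Step 4: predict with the model trained on the previous k batches, if any
        batch_pred = {}
        if model is not None and cands:
            X = vectorizer.transform([featurize(c) for c in cands])
            batch_pred = dict(zip(cands, model.predict(X)))
        predictions.append(batch_pred)

        # Steps 5-6: simulate manual validation via the gold standard, extend history
        labels = [gold_label(c) for c in cands]
        history.append(list(zip(cands, labels)))
        seen.update(cands)
        valid_pos.update(pos_pattern(c) for c, y in zip(cands, labels) if y)

        # Step 7: retrain on the concatenated history (needs both classes present)
        train = [pair for past in history for pair in past]
        ys = [y for _, y in train]
        if any(ys) and not all(ys):
            X_train = vectorizer.fit_transform([featurize(c) for c, _ in train])
            model = LinearSVC().fit(X_train, ys)

    return predictions
```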

  15. Model and Features www.adaptcentre.ie
  • Model
  • Support Vector Machine (SVM) classifier
  • Linear kernel
  • Features
  • Term candidate’s POS pattern
  • Term candidate’s character 3-grams
  • Two domain contrastive features:
  • Domain Relevance (DR) (Navigli and Velardi, 2002)
  • Term Cohesion (TC) (Park et al., 2002)
  • Contrastive corpus 1 – a 500-way clustering of 2009 Wikipedia documents (Baroni et al., 2009)
  • Contrastive corpus 2 – a dynamic clustering of the batch history (each cluster has roughly 40 articles)
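The sketch below shows one plausible way to encode these features and train a linear SVM with scikit-learn. The POS patterns and the DR/TC values are made-up inputs; computing DR and TC properly requires the contrastive corpora and the formulas from the cited papers, which are not reproduced here:

```python
# Hypothetical feature encoding: POS pattern + character 3-grams + DR/TC scores.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def char_trigrams(text):
    padded = f"#{text}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def featurize(candidate, pos_pattern, dr, tc):
    feats = {"pos=" + pos_pattern: 1.0, "DR": dr, "TC": tc}
    for tri in char_trigrams(candidate):
        feats["tri=" + tri] = feats.get("tri=" + tri, 0.0) + 1.0
    return feats

# Toy examples: (candidate, POS pattern, Domain Relevance, Term Cohesion, is_valid_term)
train = [
    ("machine translation", "NN NN", 0.9, 0.8, True),
    ("statistical parser",  "JJ NN", 0.8, 0.7, True),
    ("previous work",       "JJ NN", 0.2, 0.1, False),
    ("next section",        "JJ NN", 0.1, 0.1, False),
]

vec = DictVectorizer()
X = vec.fit_transform([featurize(c, p, dr, tc) for c, p, dr, tc, _ in train])
y = [label for *_, label in train]
clf = LinearSVC().fit(X, y)

new = featurize("machine learning", "NN NN", 0.85, 0.75)
print(clf.predict(vec.transform([new])))   # e.g. [ True]
```

A sparse dictionary-style encoding is a natural fit here because the POS pattern and character 3-grams are symbolic features while DR and TC are real-valued scores.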

  16. Experiments www.adaptcentre.ie
  • Our simulated approach, as described
  • Two baselines:
  • Baseline 1: An approximation to Warburton’s (2013) method using standard, off-the-shelf filter-rankers provided by JATE (Zhang et al., 2008)
  • Automatic filtering across batches takes place
  • No learning model is trained
  • Baseline 2: Train an SVM classifier using our features on the first batch and use that classifier to predict terms in all subsequent batches
  • Same as our approach, but no retraining at each batch takes place

  17. Evaluation www.adaptcentre.ie
  • Recall (coverage): % of the valid terms in a batch that were predicted as valid
  • Low recall indicates we’re missing many valid terms
  • Precision (true positives): % of the term candidates predicted as valid that are actually valid terms
  • Low precision indicates we’re producing many false positives
  • Usually, we want to identify as many true valid terms as possible, potentially at the risk of returning a relatively high number of false positives
  • We’re interested in achieving high recall (coverage) at the expense of moderate precision
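For concreteness, here is how the two per-batch scores can be computed, by hand and with scikit-learn; the gold and predicted labels are made-up examples:

```python
# Per-batch recall and precision over a set of term candidates.
from sklearn.metrics import precision_score, recall_score

gold      = [True, True, True, False, False]   # terminologist's validation decisions
predicted = [True, True, False, True, False]   # classifier's decisions for the same candidates

tp = sum(g and p for g, p in zip(gold, predicted))
recall = tp / sum(gold)                         # share of valid terms that were found
precision = tp / sum(predicted)                 # share of predicted terms that were actually valid

print(recall, recall_score(gold, predicted))        # 0.666... both ways
print(precision, precision_score(gold, predicted))  # 0.666... both ways
```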

  18. Results www.adaptcentre.ie

  19. www.adaptcentre.ie CONCLUSIONS AND NEXT STEPS

  20. Conclusions www.adaptcentre.ie
  • Obtained good recall (coverage) scores using our method (ONGOING), much better than the two baselines
  • Average recall of 74.16% across all batches
  • Precision scores are quite disappointing, meaning that we can expect many false positives in each batch
  • Ongoing retraining does help in keeping recall high
  • Manual terminology validation already takes place in virtually all terminology extraction tasks. Let’s just use those validation decisions to train an ongoing machine-learning classifier automatically!
  • The lack of a feedback loop mechanism in the statistical filter-rankers does hinder their performance when used on an ongoing basis with automatic exclusion lists

  21. Future work www.adaptcentre.ie
  • Conduct human-based benchmarks
  • Address low precision scores
  • Post-processing strategies like re-ranking predicted candidates (e.g. by using statistical rankers)
  • Exploring new features based on topic models
  • Exploring reinforcement learning techniques
  • Experiment on other datasets from several other domains
  • Further investigate the role of the contrastive corpus
  • E.g. not all specialised terms will feature in Wikipedia
  • Fall-back strategies like relying on sub-terms
  • Distributional vector composition techniques in order to estimate feature values of terms missing from the contrastive corpus

  22. www.adaptcentre.ie QUESTIONS?
  Alfredo Maldonado
  Research Fellow
  ADAPT Centre at Trinity College Dublin
  alfredo.maldonado@adaptcentre.ie
  maldonaa@tcd.ie
  @alfredomg on Twitter
  The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
  Clipart from https://openclipart.org

  23. www.adaptcentre.ie APPENDIX
