Data Selection with Fewer Words

Amittai Axelrod, University of Maryland & Johns Hopkins
Philip Resnik, University of Maryland
Xiaodong He, Microsoft Research
Mari Ostendorf, University of Washington
Domain* Adaptation

• * Defined by construction.
• Ideally based on some notion of textual similarity:
  • Lexical choice
  • Grammar
  • Topic
  • Style
  • Genre
  • Register
  • Intent
• Domain = a particular contextual setting. Here we use “domain” to mean “corpus”.
Domain Adaptation

• Training data doesn’t always match the desired task.
• Have bilingual:
  • Parliament proceedings
  • Newspaper articles
  • Web scrapings
• Want to translate:
  • Travel scenarios
  • Facebook updates
  • Real-time conversations
• Sometimes we want a specific kind of language, not just breadth!
Data Selection

• "Filter Big Data down to Relevant Data."
• Use your regular pipeline, but improve the input!
• Not all sentences are equally valuable...
Data Selection

• For a particular translation task:
  • Identify the most relevant training data.
  • Build a model on only this subset.
• Goal:
  • Better task-specific performance
  • Cheaper (computation, size, time)
Data Selection Algorithm

• Quantify the domain.
• Compute the similarity of each sentence in the pool to the in-domain corpus.
• Sort pool sentences by score.
• Select the top n%.
• Use the top n% to build a task-specific MT system (see the sketch below).
• Combine with a system trained on in-domain data (optional).
• Apply the task-specific system to the task.
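The steps above map directly onto a short script. Below is a minimal sketch, assuming a per-sentence relevance scorer `score_sentence()` is already defined elsewhere (lower = more relevant, as with the cross-entropy difference introduced later); the 5% cutoff is illustrative, not a value from the talk.

```python
def select_top_n_percent(pool_sentences, score_sentence, n_percent=5.0):
    """Rank pool sentences by relevance score (lower = more in-domain-like)
    and return the best n_percent of them, preserving the original order."""
    scored = sorted(enumerate(pool_sentences),
                    key=lambda pair: score_sentence(pair[1]))
    keep = int(len(pool_sentences) * n_percent / 100)
    kept_indices = sorted(idx for idx, _ in scored[:keep])
    return [pool_sentences[i] for i in kept_indices]

# Usage: the selected subset then feeds the regular MT training pipeline.
# selected = select_top_n_percent(pool, score_sentence, n_percent=5.0)
```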
Perplexity-Based Filtering

• A language model LM_Q measures the likelihood of a text s = w_1 ... w_N by its perplexity:

  ppl_{LM_Q}(s) = 2^{ -\frac{1}{N} \sum_{i=1}^{N} \log LM_Q(w_i \mid h_i) } = 2^{ H_{LM_Q}(s) }

• Intuition: the average branching factor of the LM.
• Cross-entropy H (of a text w.r.t. an LM) is log(ppl).
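In code, the definition above is just an average of per-word log probabilities. A minimal sketch, assuming a hypothetical `logprob(word, history)` function that returns log2 P(w | h) under some language model:

```python
def cross_entropy(sentence_words, logprob):
    """H_LM(s): negative average log2-probability per word."""
    total = 0.0
    for i, w in enumerate(sentence_words):
        history = sentence_words[:i]
        total += logprob(w, history)   # log2 P(w_i | h_i)
    return -total / len(sentence_words)

def perplexity(sentence_words, logprob):
    """ppl_LM(s) = 2 ** H_LM(s): the LM's average branching factor on s."""
    return 2 ** cross_entropy(sentence_words, logprob)
```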
Cross-Entropy Difference

• Perplexity-based filtering:
  • Score and sort sentences in the pool by perplexity with the in-domain LM.
  • Then rank, select, etc.
• However! By construction, the data pool does not match the target task.
Cross-Entropy Difference

• Score and rank by cross-entropy difference (also called "XEDiff" or "Moore-Lewis"):

  \arg\min_{s \in POOL} \; H_{LM_{IN}}(s) - H_{LM_{POOL}}(s)

• Prefer sentences that both:
  • Are like the target task
  • Are unlike the pool average.
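A minimal sketch of this score using KenLM's Python bindings; the ARPA file names are placeholders, and any toolkit that reports a per-sentence log probability would work equally well. Since only the difference matters for ranking, the log base is irrelevant.

```python
import kenlm  # KenLM Python bindings (https://github.com/kpu/kenlm)

lm_in = kenlm.Model("in_domain.arpa")    # LM trained on the in-domain corpus
lm_pool = kenlm.Model("pool.arpa")       # LM trained on the general data pool

def cross_entropy(model, sentence):
    """Per-word negative log10 probability (the +1 accounts for </s>)."""
    return -model.score(sentence, bos=True, eos=True) / (len(sentence.split()) + 1)

def moore_lewis(sentence):
    """H_IN(s) - H_POOL(s): lower is more task-like and less pool-like."""
    return cross_entropy(lm_in, sentence) - cross_entropy(lm_pool, sentence)

# Rank the pool so the most relevant sentences come first:
# ranked = sorted(pool_sentences, key=moore_lewis)
```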
Bilingual Cross-Entropy Diff.

• Extend the Moore-Lewis similarity score for use with bilingual data, and apply to SMT:

  [ H_{L1}(s_1, LM_{IN}) - H_{L1}(s_1, LM_{POOL}) ] + [ H_{L2}(s_2, LM_{IN}) - H_{L2}(s_2, LM_{POOL}) ]

• Training on only the most relevant subset of the training data (1%-20%) yields translation systems that are smaller, cheaper, faster, and (often) better.
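The bilingual version simply sums the monolingual differences computed on each side of a sentence pair. A minimal sketch, where `cross_entropy(model, sentence)` is a helper like the one in the previous sketch and the four language models (source/target × in-domain/pool) are assumed to be loaded already:

```python
def bilingual_moore_lewis(src, tgt,
                          lm_in_src, lm_pool_src,
                          lm_in_tgt, lm_pool_tgt,
                          cross_entropy):
    """Sum of source-side and target-side cross-entropy differences;
    lower scores mark sentence pairs that look in-domain in both languages."""
    src_diff = cross_entropy(lm_in_src, src) - cross_entropy(lm_pool_src, src)
    tgt_diff = cross_entropy(lm_in_tgt, tgt) - cross_entropy(lm_pool_tgt, tgt)
    return src_diff + tgt_diff
```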
Using Fewer Words

• How much can we trust rare words?
• If a word is seen 2 times in the general corpus and 3 times in the in-domain one, is it really 50% more likely?
• Low-frequency words are often ignored anyway (Good-Turing smoothing, singleton pruning...).
Hybrid word/POS Corpora

• In stylometry, syntactic structure is a proxy for style.
• POS-tag n-grams are used as features to determine authorship, genre, etc.
• Incorporate this idea as a pre-processing step to data selection: replace rare words with POS tags.
Hybrid word/POS Corpora

• Replace rare words with POS tags:
  • an earthquake in Port-au-Prince
  • an earthquake in NNP
Hybrid word/POS Corpora

• Replace rare words with POS tags:
  • an earthquake in Port-au-Prince
  • an NN in NNP
Hybrid word/POS Corpora

• Replace rare(?) words with POS tags:
  • an earthquake in Port-au-Prince
  • DT NN IN NNP
Hybrid word/POS Corpora

• Replace rare words with POS tags:
  • an earthquake in Port-au-Prince
  • an earthquake in NNP
  • an earthquake in Kodari
Hybrid word/POS Corpora

• Replace rare words with POS tags:
  • an earthquake in Port-au-Prince
  • an earthquake in NNP
  • an earthquake in Kodari
• Threshold: replace a word if its count is < 10 in either corpus (a code sketch follows below).
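A minimal sketch of this pre-processing step, using NLTK's off-the-shelf English tagger as a stand-in (the talk does not prescribe a particular tagger); the count-< 10 rule is applied as on the slide, replacing a word if it is rare in either corpus. NLTK's tagger model must be downloaded first.

```python
from collections import Counter
import nltk  # requires: nltk.download("averaged_perceptron_tagger")

def count_words(corpus_sentences):
    """Word frequencies over a list of whitespace-tokenized sentences."""
    counts = Counter()
    for sent in corpus_sentences:
        counts.update(sent.split())
    return counts

def to_hybrid(sentence, in_counts, pool_counts, threshold=10):
    """Replace words rare in EITHER corpus (count < threshold) with their POS tag."""
    tokens = sentence.split()
    tagged = nltk.pos_tag(tokens)   # list of (word, tag) pairs
    return " ".join(
        tag if min(in_counts[w], pool_counts[w]) < threshold else w
        for w, tag in tagged
    )

# "an earthquake in Port-au-Prince" -> "an earthquake in NNP"
# (assuming Port-au-Prince is rare in at least one of the two corpora)
```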
Using Fewer Words

• Use the hybrid word/POS texts instead of the original corpora.
• Train LMs on the hybrid corpora, compute sentence scores, and re-rank the original general corpus.
• Standard Moore-Lewis / cross-entropy difference, but with a different corpus representation.
TED Zh-En Translation

• Task: translate TED talks, Chinese-to-English, using LDC data (6M sentence pairs).
• Vocabulary reduction from TED+LDC: eliminate 97% of the vocabulary.

  Lang   Vocab     Kept     %
  En     470,154   10,036   2.1%
  Zh     729,283   11,440   1.5%

• What happens to SMT performance?
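The "Kept / %" columns are just frequency statistics over the two corpora. A minimal sketch of how such a table could be computed, with placeholder file names and survival defined as count >= 10 in both corpora (the complement of the replacement rule above):

```python
from collections import Counter

def vocab_reduction(in_domain_file, pool_file, threshold=10):
    """Report how many word types survive the hybrid word/POS replacement."""
    def counts(path):
        with open(path, encoding="utf-8") as f:
            return Counter(word for line in f for word in line.split())

    c_in, c_pool = counts(in_domain_file), counts(pool_file)
    vocab = set(c_in) | set(c_pool)
    kept = [w for w in vocab if c_in[w] >= threshold and c_pool[w] >= threshold]
    print(f"vocab={len(vocab):,}  kept={len(kept):,}  "
          f"({100 * len(kept) / len(vocab):.1f}%)")

# vocab_reduction("ted.en", "ldc.en")   # file names are placeholders
```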
TED Zh-En Translation

• Slightly better scores, despite the (much) smaller selection vocabulary!
In-Domain Lexical Coverage

• Up to 10% more in-domain coverage.
General-Domain Coverage

• Hybrid-selected data covers 10-15% more of the general lexicon.
Hybrid Word/POS Selection

• Must be re-computed for every task/pool pair, but vocabulary statistics are easy.
• Aggregating the statistics of rare terms allows generalizing to other unseen words.
• Perhaps it preserves sentence structure, picking up words that fill similar roles/patterns in the sentence?
Hybrid Word/POS Selection

• Replace all rare words with POS tags, then run regular data selection.
• Reduces the active lexicon by 97%, to ~10k words with robust statistics.
• Potentially helpful for algorithms bound by vocabulary size "V".
• The selection LM is 25% smaller.
Questions?