
CoLesIR at CLEF 2007: from English to French via Character N-Grams

Jesús Vilares, Computer Science Dept., University of A Coruña, jvilares@udc.es
Michael P. Oakes, School of Computing and Technology, University of Sunderland, Michael.Oakes@sunderland.ac.uk
Manuel Vilares, Computer Science Dept., University of Vigo, vilares@uvigo.es

J. Vilares, M. P. Oakes and M. Vilares. From English to French via Character N-Grams.

Index
- Introduction
- Previous approaches
- Our proposal
- Evaluation
- Conclusions and future work

Translation in CLIR
- Techniques of Machine Translation (MT)
- Softened restrictions:
  - Not limited to just one translation
  - Not limited by syntax
- Conventional MT tools (e.g., SYSTRAN):
  - Single well-formed translation
  - Forgoes the advantages of MT techniques in CLIR

Translation in CLIR (cont.)
- Bilingual dictionaries
  - Problems with out-of-vocabulary words (misspellings, unknown words)
  - Normalization
  - Word-Sense Disambiguation (WSD)
- Parallel corpora
  - Automatic generation of dictionaries:
    - Collocations
    - Association measures
    - Probabilistic translation measure
  - No normalization

Character N-Grams
- Example: tomatoes, n = 5 → { -tomat-, -omato-, -matoe-, -atoes- }
- Applications:
  - Language recognition
  - Misspelling processing
  - Information Retrieval
    - Reduction of vocabulary size (dictionary)
    - Asian languages (no delimiters)
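The "tomatoes" example above can be reproduced with a few lines of Python. This is a minimal sketch rather than the authors' code: the function name `char_ngrams` is hypothetical, the hyphens that the slide uses as visual delimiters are omitted, and returning a short word whole is an assumption the slides do not state.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a single word.

    Assumption (not stated in the slides): a word with at most n
    characters is returned whole, as its own single n-gram.
    """
    if len(word) <= n:
        return [word]
    # Slide a window of width n over the word, one character at a time.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("tomatoes", 5))
```

For "tomatoes" and n = 5 the sliding window yields the four 5-grams listed on the slide.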

McNamee and Mayfield, 2004
- No word normalization
- Language-independent:
  - No language-specific processing
  - Applicable to very different languages
- Knowledge-light approach:
  - Minimal linguistic information and resources
- Robustness:
  - Out-of-vocabulary words

McNamee and Mayfield, 2004 (cont.)

N-Gram Alignment Algorithm
- Input: parallel corpus aligned at paragraph level
  - Text split into n-grams
- Process: for each n-gram of the source language:
  1. Locate the source-language paragraphs containing it
  2. Identify the parallel paragraphs in the target language
  3. Calculate a translation score for each n-gram in the target paragraphs (ad-hoc association measure)
  4. Potential translation: the target n-gram with the highest score
- Output: n-gram-level alignment
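The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not define the ad-hoc association measure, so a Dice coefficient over paragraph sets is swapped in as a stand-in, and the names `paragraph_ngrams` and `align_ngrams` are hypothetical.

```python
from collections import defaultdict


def paragraph_ngrams(text, n):
    """Set of character n-grams of all words in a paragraph."""
    grams = set()
    for word in text.lower().split():
        if len(word) <= n:
            grams.add(word)  # assumption: short words are kept whole
        else:
            grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def align_ngrams(parallel_paragraphs, n=4):
    """Map each source n-gram to its best-scoring target n-gram.

    parallel_paragraphs: list of (source_paragraph, target_paragraph).
    The score is a Dice coefficient over paragraph ids, used here as a
    STAND-IN for the ad-hoc measure, which the slides do not specify.
    """
    src_index = defaultdict(set)  # source n-gram -> ids of paragraphs containing it
    tgt_index = defaultdict(set)  # target n-gram -> ids of paragraphs containing it
    tgt_grams = []                # paragraph id -> target n-grams
    for pid, (src, tgt) in enumerate(parallel_paragraphs):
        for g in paragraph_ngrams(src, n):
            src_index[g].add(pid)
        grams = paragraph_ngrams(tgt, n)
        tgt_grams.append(grams)
        for g in grams:
            tgt_index[g].add(pid)

    alignment = {}
    for s, s_pars in src_index.items():               # step 1: source paragraphs
        candidates = set().union(*(tgt_grams[p] for p in s_pars))  # step 2
        best, best_score = None, -1.0
        for t in sorted(candidates):                  # step 3: score candidates
            t_pars = tgt_index[t]
            score = 2 * len(s_pars & t_pars) / (len(s_pars) + len(t_pars))
            if score > best_score:
                best, best_score = t, score
        if best is not None:                          # step 4: keep the best one
            alignment[s] = (best, best_score)
    return alignment
```

Because every source n-gram is compared against every n-gram of its parallel paragraphs, the cost grows quickly with corpus size, which is consistent with the slow running time reported for this approach.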

N-Gram Alignment Algorithm (cont.)
- Drawbacks:
  - Very slow (several days): not suitable for testing
  - Single translation

Goals
- Testing tool
  - To speed up the training process
  - Multiple translations
- Freely available resources
  - More transparency
  - Reduce effort

Differences
- Freely available resources:
  - Parallel corpus: EUROPARL (Koehn, 2005)
  - Statistical aligner: GIZA++ (Och and Ney, 2003)
  - Retrieval engine: TERRIER (http://ir.dcs.gla.ac.uk/terrier/)
- Standard association measures:
  - Dice coefficient
  - Mutual Information
  - Log-likelihood
- Alignment in two phases:
  1. Word-level alignment
  2. N-gram-level alignment
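The slides name the three association measures but do not give their formulas. The sketch below uses common textbook definitions over a 2x2 contingency table (Dice coefficient, pointwise mutual information, and the G² log-likelihood ratio); whether the authors used exactly these variants is an assumption, and the function name `association_scores` is hypothetical.

```python
import math


def association_scores(o11, o12, o21, o22):
    """Association scores for a source/target n-gram pair.

    o11: (possibly weighted) count of cooccurrences of both n-grams
    o12: source n-gram without the target n-gram
    o21: target n-gram without the source n-gram
    o22: neither of the two
    Textbook definitions; the exact variants used by the authors are
    an assumption.
    """
    n = o11 + o12 + o21 + o22
    if n <= 0:
        raise ValueError("empty contingency table")
    r1, r2 = o11 + o12, o21 + o22  # row totals
    c1, c2 = o11 + o21, o12 + o22  # column totals

    dice = 2 * o11 / (r1 + c1) if (r1 + c1) > 0 else 0.0
    # Pointwise mutual information, undefined (-inf) without cooccurrence.
    mi = math.log2(n * o11 / (r1 * c1)) if o11 > 0 else float("-inf")

    # G^2 = 2 * sum(observed * ln(observed / expected)); empty cells add 0.
    g2 = 0.0
    cells = ((o11, r1, c1), (o12, r1, c2), (o21, r2, c1), (o22, r2, c2))
    for observed, row, col in cells:
        if observed > 0:
            g2 += observed * math.log(observed / (row * col / n))
    return {"dice": dice, "mi": mi, "logl": 2 * g2}
```

The counts may be fractional, so the same function also accepts cooccurrence counts that have been weighted by word-alignment probabilities.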

N-Gram Alignment Algorithm
- Input: parallel corpus aligned at paragraph level
- Process: two phases
  1. Word-level alignment using GIZA++ (slowest): filtering
  2. N-gram-level alignment:
     - Aligned words as a weighted word-level parallel corpus
     - Association measures between cooccurring n-grams
     - Likelihood of cooccurrences weighted according to their alignment probabilities (from the word-level alignment)
- Output: n-gram-level alignment

N-Gram Alignment Algorithm (cont.)
- Optimizations:
  - Input word-translation probability threshold W (W = 0.15)
    - Input word pairs / output n-gram pairs: ∼95% reduction
  - Bidirectional word alignment (EN2FR ∩ FR2EN)
    - Input word pairs / output n-gram pairs: ∼50% reduction
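Both optimizations amount to filtering the word-level alignment before n-gram alignment starts. The following sketch assumes the alignments are available as plain dictionaries keyed by word pairs; that layout, the function name `filter_word_alignments`, and applying the threshold only to the forward probability are all assumptions rather than details given in the slides.

```python
def filter_word_alignments(en2fr, fr2en, threshold=0.15):
    """Keep only reliable EN -> FR word pairs.

    en2fr: {(en_word, fr_word): probability} from the EN2FR alignment
    fr2en: {(fr_word, en_word): probability} from the FR2EN alignment
    A pair survives if its EN2FR probability reaches the threshold W
    and the same pair also appears in the FR2EN alignment.
    Assumption: only the forward probability is thresholded.
    """
    kept = {}
    for (en, fr), prob in en2fr.items():
        if prob < threshold:
            continue  # optimization 1: probability threshold W
        if (fr, en) not in fr2en:
            continue  # optimization 2: bidirectional intersection
        kept[(en, fr)] = prob
    return kept


if __name__ == "__main__":
    en2fr = {("tomato", "tomate"): 0.80, ("tomato", "rouge"): 0.10,
             ("soup", "potage"): 0.40}
    fr2en = {("tomate", "tomato"): 0.70, ("rouge", "tomato"): 0.20}
    print(filter_word_alignments(en2fr, fr2en))
```

In this toy input, ("tomato", "rouge") falls below W and ("soup", "potage") has no counterpart in the reverse direction, so only ("tomato", "tomate") is kept.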

Evaluation
- English-to-French run (EN2FR)
- 4-grams (McNamee and Mayfield, 2004)
- TERRIER retrieval engine: DFR paradigm, InL2 weight
- Corpus: CLEF 2007 robust track (Cross-Language Evaluation Forum)

  collection (FR)        size     #docs.   #topics (EN)
  LeMonde 94 + SDA 94    243 MB   87,191   100 (training), 100 (test)

Querying
- title + description topic fields
- Querying process:
  - Split the source-language query into n-grams
  - Replace them by their N highest-scored aligned target n-grams
    - N tuned using English-to-Spanish experiments (EN2ES):
      - Dice coefficient: N = 1
      - Mutual Information: N = 10
      - Log-likelihood: N = 1
  - Submit the translated query
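The querying process can be sketched as a lookup in the n-gram alignment table. The table layout (a dictionary from each source n-gram to its scored target n-grams), the function name `translate_query`, and the decision to drop source n-grams that have no alignment are assumptions made for illustration only.

```python
def translate_query(query, alignments, n=4, top_n=1):
    """Translate a query n-gram by n-gram.

    alignments: {source_ngram: {target_ngram: score}}
    Each source n-gram is replaced by its top_n highest-scored target
    n-grams. Assumption: source n-grams without any alignment are
    dropped from the translated query.
    """
    translated = []
    for word in query.lower().split():
        if len(word) <= n:
            grams = [word]  # assumption: short words are kept whole
        else:
            grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        for gram in grams:
            ranked = sorted(alignments.get(gram, {}).items(),
                            key=lambda item: item[1], reverse=True)
            translated.extend(target for target, _ in ranked[:top_n])
    return translated


if __name__ == "__main__":
    table = {"toma": {"toma": 0.9, "mate": 0.5},
             "omat": {"omat": 0.8},
             "mato": {"mate": 0.7}}
    print(translate_query("Tomato", table, n=4, top_n=1))
```

The resulting list of target n-grams would then be submitted to the retrieval engine as the translated query; `top_n` plays the role of N on the slide.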

Precision vs. Recall
[Figure: precision (P) against recall (Re), one panel for the TRAINING set and one for the TEST set.]

  Run           MAP (TRAINING)   MAP (TEST)
  EN            0.2567           0.1437
  FR            0.4270           0.3168
  EN2FR Dice    0.3219           0.2205
  EN2FR MI      0.2627           0.1550
  EN2FR logl    0.3293           0.2287

Precision at top D documents
[Figure: precision (P) against documents retrieved (D = 5, 10, 15, 20, 30, 100, 200, 500, 1000), one panel for the TRAINING set and one for the TEST set. The legend lists the same five runs and MAP values as the Precision vs. Recall slide.]

Conclusions
- CLIR using n-grams as indexing and translation units
- N-gram alignment in two phases: speeds up the process
  1. Word-level alignment (concentrates complexity)
  2. N-gram-level alignment
- Optimizations during word-level alignment:
  - Word-translation probability threshold
  - Bidirectional alignment
- Dice and log-likelihood perform better

Future work
- New languages
- Remove diacritics
- Remove stopwords and/or stopngrams (obtained automatically)
- Simplify word-level alignment (bottleneck)
- Direct evaluation of n-gram alignments

The End
www.grupocole.org

N-Gram Contingency Table

N-Gram Contingency Table (cont.)
The likelihood of a cooccurrence is inherited from the probability of the word alignment that contains it:

$$P(\mathit{ngram}_{iu} \rightarrow \mathit{ngram}_{jv}) = P(\mathit{word}_u \rightarrow \mathit{word}_v)$$

Example: word alignment tomate → tomato with probability 0.80

  tomate → { tomat-, -omate }
  tomato → { tomat-, -omato }

  0.80   tomat-   tomat-
  0.80   tomat-   -omato
  0.80   -omate   tomat-
  0.80   -omate   -omato
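The inheritance rule can be illustrated with the tomate/tomato example. In this sketch every pair of n-grams drawn from an aligned word pair receives the probability of that word pair; summing the weights when the same n-gram pair arises from several word pairs is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def char_ngrams(word, n):
    """Overlapping character n-grams of a word (short words kept whole)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_cooccurrence_weights(word_alignments, n=5):
    """Weighted cooccurrence counts for n-gram pairs.

    word_alignments: {(source_word, target_word): probability}
    Every (source n-gram, target n-gram) pair inherits the probability
    of the word alignment it comes from. Assumption: weights coming
    from different word pairs are summed.
    """
    weights = defaultdict(float)
    for (src_word, tgt_word), prob in word_alignments.items():
        src_grams = char_ngrams(src_word, n)
        tgt_grams = char_ngrams(tgt_word, n)
        for s, t in product(src_grams, tgt_grams):
            weights[(s, t)] += prob
    return dict(weights)


if __name__ == "__main__":
    pairs = ngram_cooccurrence_weights({("tomate", "tomato"): 0.80}, n=5)
    for (s, t), w in pairs.items():
        print(f"{w:.2f}  {s}  {t}")
```

With the single alignment tomate → tomato (0.80) this produces the four n-gram pairs of the example, each weighted 0.80. Such weights could then serve as the cooccurrence counts from which an association measure is computed.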
