Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR
Lixin Shi and Jian-Yun Nie
Dept. d'Informatique et de Recherche Opérationnelle, Université de Montréal

Outline
1. Motivation
2. Related Work
3. Using Different Indexing Units
4. Using Different Translation Units
5. Conclusion and Future Work

1. Motivation

The difference between East-Asian and most European languages
- A common problem in East-Asian languages (Chinese, Japanese, and Korean to some extent) is the lack of natural word boundaries.
- For information retrieval, we first have to determine the index units:
  - using word segmentation, or
  - cutting the sentence into n-grams.

Word segmentation
- Based on rules, dictionaries and/or statistics.
- Problems for information retrieval:
  - Segmentation ambiguity: the same string can be segmented into different words, e.g. "发展中国家" (developing country):
    发展中 (developing) / 国家 (country)
    发展 (development) / 中 (middle) / 国家 (country)
    发展 (development) / 中国 (China) / 家 (family)
  - If a document and a query are segmented into different words, there may be a mismatch.
  - Two different words may have the same or related meanings, especially when they share some common characters:
    办公室 (office) ↔ 办公楼 (office building)
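The segmentation ambiguity above can be reproduced with a toy greedy longest-matching segmenter. This is only an illustration: the function name `forward_max_match` and the two small lexicons are assumptions, not part of the paper's method.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match word segmentation against a word list.

    Illustrative sketch only: real segmenters also use rules and
    statistics; the lexicons passed in below are toy assumptions.
    """
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in lexicon:
                words.append(cand)
                i += length
                break
    return words

# Two different lexicons yield two different segmentations of the same string:
lex1 = {"发展中", "国家", "发展", "中国"}
lex2 = {"发展", "中国", "家"}
print(forward_max_match("发展中国家", lex1))  # ['发展中', '国家']
print(forward_max_match("发展中国家", lex2))  # ['发展', '中国', '家']
```

If a document was indexed with one lexicon and the query segmented with the other, their index terms would not match, which is exactly the mismatch problem noted above.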
Cutting the sentence into n-grams
- Does not need any linguistic resource.
- The use of unigrams and bigrams has been investigated in several previous studies:
  - as effective as using word segmentation.
- Limitations of previous studies:
  - n-grams were used only in monolingual IR;
  - n-grams and words were integrated in retrieval models other than language modeling (LM), such as the vector space model and the probabilistic model.

We focus on
- Using words and n-grams as indexing units for monolingual IR under the LM framework.
- Using words and n-grams as translation units in CLIR.
  - We only tested English-Chinese CLIR.

2. Related Work

Monolingual IR
- Chinese text input.
- Segmentation into words or n-grams (indexing units):
  - various approaches to word segmentation (e.g. longest matching);
  - overlapping n-grams.
- E.g. 前年收入有所下降:
  Unigram: 前 / 年 / 收 / 入 / 有 / 所 / 下 / 降
  Word: 前年 / 收入 / 有所 / 下降, or 前 / 年收入 / 有所 / 下降
  Bigram: 前年 / 年收 / 收入 / 入有 / 有所 / 所下 / 下降
- The score function in language modeling is similar to that for other languages.
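Overlapping n-gram cutting, as in the 前年收入有所下降 example, amounts to sliding a window of n characters over the string. A minimal sketch (the helper name `char_ngrams` is an assumption for illustration):

```python
def char_ngrams(text, n):
    """Cut a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sent = "前年收入有所下降"
print(char_ngrams(sent, 1))  # ['前', '年', '收', '入', '有', '所', '下', '降']
print(char_ngrams(sent, 2))  # ['前年', '年收', '收入', '入有', '有所', '所下', '下降']
```

Note that no dictionary is consulted: every adjacent character pair becomes a bigram, including non-words such as 年收, which is why n-gram indexing needs no linguistic resource.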
LM approach to IR
- Query-likelihood retrieval model (Ponte & Croft '98, Croft '03):
  (1) build a LM for each document;
  (2) rank by the probability of the document model generating the query Q:
      P(Q|D) = \prod_{q_i \in Q} P(q_i|D)
- KL-divergence model (Lafferty & Zhai '01, '02):
  (1) build LMs for the document and the query;
  (2) determine the divergence between them:
      Score(D,Q) = -KL(\theta_Q || \theta_D) = -\sum_{w \in V} P(w|\theta_Q) \log \frac{P(w|\theta_Q)}{P(w|\theta_D)}
  with smoothing for the document model and maximum-likelihood estimation for the query model:
      P(w|\theta_D) = \lambda P(w|d) + (1 - \lambda) P(w|C)   (smoothing)
      P(w|\theta_Q) = c(w,q) / |q|   (maximum-likelihood estimation)

Cross-Language IR
- Translation between the query and document languages.
- Basic approach: translate the query, using
  - an MT system,
  - a bilingual dictionary, or
  - a parallel corpus.
- Train a probabilistic translation model from a parallel corpus, then use the TM for CLIR (Nie et al. '99, Gao et al. '01, '02, Jin & Chai '05).

LM approach to CLIR
- For the KL-divergence model (Kraaij et al. '03):
      P(t_i|\theta_Q) = \sum_{s_j} P(s_j, t_i|\theta_Q)
                      = \sum_{s_j} P(t_i|s_j, \theta_Q) P(s_j|\theta_Q)
                      \approx \sum_{s_j} t(t_i|s_j) P(s_j|\theta_Q)
  where t_i is a term in the document (target) language, s_j a term in the query (source) language, and t(t_i|s_j) is the translation model.

3. Using Different Indexing Units
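The KL-divergence score above can be sketched directly from the formulas: since the query-entropy term is constant per query, ranking by -KL(\theta_Q || \theta_D) is equivalent to ranking by \sum_w P(w|\theta_Q) \log P(w|\theta_D). The function name and the setting \lambda = 0.7 below are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def score_kl(query_tokens, doc_tokens, coll_tokens, lam=0.7):
    """Rank-equivalent KL-divergence score with Jelinek-Mercer smoothing.

    theta_D: P(w|theta_D) = lam * P(w|d) + (1 - lam) * P(w|C)
    theta_Q: maximum likelihood, c(w,q) / |q|
    Returns sum_w P(w|theta_Q) * log P(w|theta_D), a query-only constant
    away from -KL(theta_Q || theta_D). lam=0.7 is an arbitrary example.
    """
    doc, coll = Counter(doc_tokens), Counter(coll_tokens)
    dlen, clen, qlen = len(doc_tokens), len(coll_tokens), len(query_tokens)
    score = 0.0
    for w, c in Counter(query_tokens).items():
        p_d = lam * doc[w] / dlen + (1 - lam) * coll[w] / clen
        if p_d > 0:  # terms unseen even in the collection are skipped in this sketch
            score += (c / qlen) * math.log(p_d)
    return score

# Toy example with bigram tokens: the document containing the query bigram wins.
q = ["国企"]
d1, d2 = ["国企", "研发"], ["投资", "研发"]
coll = d1 + d2
assert score_kl(q, d1, coll) > score_kl(q, d2, coll)
```

The tokens can equally be unigrams, bigrams, or words; only the tokenization step changes, which is what makes the LM framework convenient for comparing indexing units.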
Different indexing units
- Single index, e.g. "国企研发投资":
  - Unigram (single character), U: 国 / 企 / 研 / 发 / 投 / 资
  - Bigram, B: 国企 / 企研 / 研发 / 发投 / 投资
  - Word, W: 国企 / 研发 / 投资
- Problems with a single index:
  - words can be segmented in different ways;
  - closely related words cannot match.

Combining different indexes
- Combine words with characters, or bigrams with characters.
- Merging indexes:
  - WU (Word & Unigram): 国企 / 研发 / 投资 / 国 / 企 / 研 / 发 / 投 / 资
  - BU (Bigram & Unigram): 国企 / 企研 / 研发 / 发投 / 投资 / 国 / 企 / 研 / 发 / 投 / 资
- Multiple indexes, e.g. B+U (interpolate the Bigram and Unigram indexes):
      Score(D,Q) = \sum_i \alpha_i Score_i(D,Q)
  where Score_U(D,Q) = -KL(\theta_{Q_U} || \theta_{D_U}) and Score_B(D,Q) = -KL(\theta_{Q_B} || \theta_{D_B}).

Experiment setting
- Collections:
       NTCIR3/4                       #doc (K)   NTCIR5/6                           #doc (K)
  Cn   CIRB011, CIRB020               381        CIRB040r                           901
  Jp   Mainichi98/99, Yomiuri98+99    594        Mainichi00/01r, Yomiuri00+01       858
  Kr   Chosunilbo98/99, Hankookilbo   254        Chosunilbo00/01, Hankookilbo00/01  220
- Number of topics: NTCIR3: 50, NTCIR4: 60, NTCIR5: 50, NTCIR6: 50.

Using different index units for C/J/K monolingual IR on NTCIR4/5
- [Table: Mean Average Precision (MAP) for the runs U, B, W, BU, WU, and 0.3B+0.7U; the values are not legible in this copy.]
- Surprisingly, U is better than B and W for Chinese.
- Interpolating unigram and bigram (B+U) has the best performance for Chinese and Japanese; however, BU and B are the best for Korean.
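Per document, the multiple-index interpolation Score(D,Q) = \sum_i \alpha_i Score_i(D,Q) is just a weighted sum of the scores from each index. A minimal sketch of the B+U combination; it assumes both indexes scored the same set of documents, and the weight \alpha_B = 0.3 mirrors the 0.3B+0.7U run label above:

```python
def combine_bu(score_u, score_b, alpha_b=0.3):
    """Interpolate per-document scores from a unigram and a bigram index:

        Score(D,Q) = alpha_b * Score_B(D,Q) + (1 - alpha_b) * Score_U(D,Q)

    Sketch only: assumes both score dicts cover the same documents.
    """
    return {d: alpha_b * score_b[d] + (1 - alpha_b) * score_u[d]
            for d in score_u}

# KL-based scores are (negative) log-probability sums, so more positive is better.
fused = combine_bu({"d1": -1.0, "d2": -3.0}, {"d1": -2.0, "d2": -1.0})
print(fused)  # {'d1': -1.3, 'd2': -2.4}  -> d1 ranks first
```

Because the two component scores live on comparable log-probability scales, a fixed linear weight is a reasonable fusion; the weight itself would normally be tuned on held-out topics.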