Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR
Lixin Shi and Jian-Yun Nie
Dept. d'Informatique et de Recherche Opérationnelle, Université de Montréal

Outline
1. Motivation
2. Related Work
3. Using Different Indexing Units
4. Using Different Translation Units
5. Conclusion and Future Work

1. Motivation

The difference between East-Asian and most European languages
- A common problem in East-Asian languages (Chinese, Japanese, and Korean to some extent) is the lack of natural word boundaries.
- For information retrieval, we first have to determine the index units:
  - using word segmentation, or
  - cutting the sentence into n-grams.

Word segmentation
- Based on rules, dictionaries and/or statistics.
- Problems for information retrieval:
  - Segmentation ambiguity: the same string can be segmented into different words, e.g. "发展中国家" (developing country):
    发展中 (developing) / 国家 (country)
    发展 (development) / 中 (middle) / 国家 (country)
    发展 (development) / 中国 (China) / 家 (family)
  - If a document and a query are segmented into different words, there may be a mismatch.
  - Two different words may have the same or related meanings, especially when they share some common characters:
    办公室 (office) ↔ 办公楼 (office building)
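The segmentation ambiguity above can be reproduced with a toy greedy longest-matching segmenter. This is only an illustration: the function name `forward_max_match` and the two small lexicons are assumptions, not part of the paper's method.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match word segmentation against a word list.

    Illustrative sketch only: real segmenters also use rules and
    statistics; the lexicons passed in below are toy assumptions.
    """
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in lexicon:
                words.append(cand)
                i += length
                break
    return words

# Two different lexicons yield two different segmentations of the same string:
lex1 = {"发展中", "国家", "发展", "中国"}
lex2 = {"发展", "中国", "家"}
print(forward_max_match("发展中国家", lex1))  # ['发展中', '国家']
print(forward_max_match("发展中国家", lex2))  # ['发展', '中国', '家']
```

If a document was indexed with one lexicon and the query segmented with the other, their index terms would not match, which is exactly the mismatch problem noted above.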
Cutting the sentence into n-grams
- Does not need any linguistic resource.
- The use of unigrams and bigrams has been investigated in several previous studies:
  - as effective as using word segmentation.
- Limitations of previous studies:
  - n-grams were used only in monolingual IR;
  - n-grams and words were integrated in retrieval models other than language modeling (LM), such as the vector space model and the probabilistic model.

We focus on
- Using words and n-grams as indexing units for monolingual IR under the LM framework.
- Using words and n-grams as translation units in CLIR.
  - We only tested English-Chinese CLIR.

2. Related Work

Monolingual IR
- Chinese text input.
- Segmentation into words or n-grams (indexing units):
  - various approaches to word segmentation (e.g. longest matching);
  - overlapping n-grams.
- E.g. 前年收入有所下降:
  Unigram: 前 / 年 / 收 / 入 / 有 / 所 / 下 / 降
  Word: 前年 / 收入 / 有所 / 下降, or 前 / 年收入 / 有所 / 下降
  Bigram: 前年 / 年收 / 收入 / 入有 / 有所 / 所下 / 下降
- The score function in language modeling is similar to that for other languages.
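Overlapping n-gram cutting, as in the 前年收入有所下降 example, amounts to sliding a window of n characters over the string. A minimal sketch (the helper name `char_ngrams` is an assumption for illustration):

```python
def char_ngrams(text, n):
    """Cut a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sent = "前年收入有所下降"
print(char_ngrams(sent, 1))  # ['前', '年', '收', '入', '有', '所', '下', '降']
print(char_ngrams(sent, 2))  # ['前年', '年收', '收入', '入有', '有所', '所下', '下降']
```

Note that no dictionary is consulted: every adjacent character pair becomes a bigram, including non-words such as 年收, which is why n-gram indexing needs no linguistic resource.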
LM approach to IR
- Query-likelihood retrieval model (Ponte & Croft '98, Croft '03):
  (1) build a LM for each document;
  (2) rank by the probability of the document model generating the query Q:
      P(Q|D) = \prod_{q_i \in Q} P(q_i|D)
- KL-divergence model (Lafferty & Zhai '01, '02):
  (1) build LMs for the document and the query;
  (2) determine the divergence between them:
      Score(D,Q) = -KL(\theta_Q || \theta_D) = -\sum_{w \in V} P(w|\theta_Q) \log \frac{P(w|\theta_Q)}{P(w|\theta_D)}
  with smoothing for the document model and maximum-likelihood estimation for the query model:
      P(w|\theta_D) = \lambda P(w|d) + (1 - \lambda) P(w|C)   (smoothing)
      P(w|\theta_Q) = c(w,q) / |q|   (maximum-likelihood estimation)

Cross-Language IR
- Translation between the query and document languages.
- Basic approach: translate the query, using
  - an MT system,
  - a bilingual dictionary, or
  - a parallel corpus.
- Train a probabilistic translation model from a parallel corpus, then use the TM for CLIR (Nie et al. '99, Gao et al. '01, '02, Jin & Chai '05).

LM approach to CLIR
- For the KL-divergence model (Kraaij et al. '03):
      P(t_i|\theta_Q) = \sum_{s_j} P(s_j, t_i|\theta_Q)
                      = \sum_{s_j} P(t_i|s_j, \theta_Q) P(s_j|\theta_Q)
                      \approx \sum_{s_j} t(t_i|s_j) P(s_j|\theta_Q)
  where t_i is a term in the document (target) language, s_j a term in the query (source) language, and t(t_i|s_j) is the translation model.

3. Using Different Indexing Units
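The KL-divergence score above can be sketched directly from the formulas: since the query-entropy term is constant per query, ranking by -KL(\theta_Q || \theta_D) is equivalent to ranking by \sum_w P(w|\theta_Q) \log P(w|\theta_D). The function name and the setting \lambda = 0.7 below are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def score_kl(query_tokens, doc_tokens, coll_tokens, lam=0.7):
    """Rank-equivalent KL-divergence score with Jelinek-Mercer smoothing.

    theta_D: P(w|theta_D) = lam * P(w|d) + (1 - lam) * P(w|C)
    theta_Q: maximum likelihood, c(w,q) / |q|
    Returns sum_w P(w|theta_Q) * log P(w|theta_D), a query-only constant
    away from -KL(theta_Q || theta_D). lam=0.7 is an arbitrary example.
    """
    doc, coll = Counter(doc_tokens), Counter(coll_tokens)
    dlen, clen, qlen = len(doc_tokens), len(coll_tokens), len(query_tokens)
    score = 0.0
    for w, c in Counter(query_tokens).items():
        p_d = lam * doc[w] / dlen + (1 - lam) * coll[w] / clen
        if p_d > 0:  # terms unseen even in the collection are skipped in this sketch
            score += (c / qlen) * math.log(p_d)
    return score

# Toy example with bigram tokens: the document containing the query bigram wins.
q = ["国企"]
d1, d2 = ["国企", "研发"], ["投资", "研发"]
coll = d1 + d2
assert score_kl(q, d1, coll) > score_kl(q, d2, coll)
```

The tokens can equally be unigrams, bigrams, or words; only the tokenization step changes, which is what makes the LM framework convenient for comparing indexing units.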
Different indexing units
- Single index, e.g. "国企研发投资":
  - Unigram (single character), U: 国 / 企 / 研 / 发 / 投 / 资
  - Bigram, B: 国企 / 企研 / 研发 / 发投 / 投资
  - Word, W: 国企 / 研发 / 投资
- Problems with a single index:
  - words can be segmented in different ways;
  - closely related words cannot match.

Combining different indexes
- Combine words with characters, or bigrams with characters.
- Merging indexes:
  - WU (Word & Unigram): 国企 / 研发 / 投资 / 国 / 企 / 研 / 发 / 投 / 资
  - BU (Bigram & Unigram): 国企 / 企研 / 研发 / 发投 / 投资 / 国 / 企 / 研 / 发 / 投 / 资
- Multiple indexes, e.g. B+U (interpolate the Bigram and Unigram indexes):
      Score(D,Q) = \sum_i \alpha_i Score_i(D,Q)
  where Score_U(D,Q) = -KL(\theta_{Q_U} || \theta_{D_U}) and Score_B(D,Q) = -KL(\theta_{Q_B} || \theta_{D_B}).

Experiment setting
- Collections:
       NTCIR3/4                       #doc (K)   NTCIR5/6                           #doc (K)
  Cn   CIRB011, CIRB020               381        CIRB040r                           901
  Jp   Mainichi98/99, Yomiuri98+99    594        Mainichi00/01r, Yomiuri00+01       858
  Kr   Chosunilbo98/99, Hankookilbo   254        Chosunilbo00/01, Hankookilbo00/01  220
- Number of topics: NTCIR3: 50, NTCIR4: 60, NTCIR5: 50, NTCIR6: 50.

Using different index units for C/J/K monolingual IR on NTCIR4/5
- [Table: Mean Average Precision (MAP) for the runs U, B, W, BU, WU, and 0.3B+0.7U; the values are not legible in this copy.]
- Surprisingly, U is better than B and W for Chinese.
- Interpolating unigram and bigram (B+U) has the best performance for Chinese and Japanese; however, BU and B are the best for Korean.
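Per document, the multiple-index interpolation Score(D,Q) = \sum_i \alpha_i Score_i(D,Q) is just a weighted sum of the scores from each index. A minimal sketch of the B+U combination; it assumes both indexes scored the same set of documents, and the weight \alpha_B = 0.3 mirrors the 0.3B+0.7U run label above:

```python
def combine_bu(score_u, score_b, alpha_b=0.3):
    """Interpolate per-document scores from a unigram and a bigram index:

        Score(D,Q) = alpha_b * Score_B(D,Q) + (1 - alpha_b) * Score_U(D,Q)

    Sketch only: assumes both score dicts cover the same documents.
    """
    return {d: alpha_b * score_b[d] + (1 - alpha_b) * score_u[d]
            for d in score_u}

# KL-based scores are (negative) log-probability sums, so more positive is better.
fused = combine_bu({"d1": -1.0, "d2": -3.0}, {"d1": -2.0, "d2": -1.0})
print(fused)  # {'d1': -1.3, 'd2': -2.4}  -> d1 ranks first
```

Because the two component scores live on comparable log-probability scales, a fixed linear weight is a reasonable fusion; the weight itself would normally be tuned on held-out topics.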