Data-driven Methods for SMS-based FAQ Retrieval • FIRE 2011 • Sanmitra Bhattacharya, Hung Tran, Padmini Srinivasan • Computer Science
INTRODUCTION • Why SMS-based FAQ Retrieval? • Exponential growth in telecom market • India among the top contributors • Widespread use of text messages • Personal communication • Advertisement • Enquiry
FIRE 2011 SMS-BASED FAQ RETRIEVAL • Corpus: FAQs in agriculture, career, general knowledge, etc. • Queries: SMS text messages • Task: Find FAQ entries that answer/match SMS queries
FIRE 2011 SMS-BASED FAQ RETRIEVAL • SMS-based FAQ retrieval: • Corpus: FAQs in agriculture, career, general knowledge, etc. • Queries: SMS text messages • Task: Find FAQ entries that answer/match SMS queries
TREC QA track (1999-2007), for comparison: • Corpus: Newswire, AQUAINT, and Blogs • Question Series − FACT & LIST • Question Topic: “House of Chanel” • FACT: In what year was the company founded? • LIST: What museums have displayed Chanel clothing? • Task: Define a target by answering questions
CHALLENGES • Sample SMS query: “wht is career counclng” • Non-standard abbreviations (what -> wht, wt, vat, etc.) • Misspellings • Omission of words • Inappropriate transliterations • Grammatical errors • Goal: match this SMS query to “What is career counseling?”
SUB-TASKS • Mono-lingual FAQ Retrieval: English query, English FAQs • Cross-lingual FAQ Retrieval: English query, Hindi FAQs • Multi-lingual FAQ Retrieval: English query; English, Hindi, and Malayalam FAQs
DATA • FAQs: <FAQ> <FAQID>ENG_CAREER_1</FAQID> <DOMAIN>CAREER</DOMAIN> <QUESTION>What is career counseling?</QUESTION> <ANSWER> Career counseling is a process ... </ANSWER> </FAQ> • SMS queries: <SMS> <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID> <SMS_TEXT>wht is career counclng</SMS_TEXT> <MATCHES> <ENGLISH>ENG_CAREER_1</ENGLISH> <MALAYALAM>NONE</MALAYALAM> <HINDI>NONE</HINDI> </MATCHES> </SMS>
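The two record layouts above can be read with a standard XML parser. The following is a minimal sketch, assuming each record is available as its own well-formed XML string (how records are grouped into files is not shown on the slide); the function names `parse_faq` and `parse_sms` are illustrative and not part of the FIRE data release.

```python
import xml.etree.ElementTree as ET

# Sample records copied from the slide above.
FAQ_XML = """<FAQ>
  <FAQID>ENG_CAREER_1</FAQID>
  <DOMAIN>CAREER</DOMAIN>
  <QUESTION>What is career counseling?</QUESTION>
  <ANSWER>Career counseling is a process ...</ANSWER>
</FAQ>"""

SMS_XML = """<SMS>
  <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID>
  <SMS_TEXT>wht is career counclng</SMS_TEXT>
  <MATCHES>
    <ENGLISH>ENG_CAREER_1</ENGLISH>
    <MALAYALAM>NONE</MALAYALAM>
    <HINDI>NONE</HINDI>
  </MATCHES>
</SMS>"""


def parse_faq(xml_text):
    """Return one FAQ record as a dict (hypothetical helper)."""
    node = ET.fromstring(xml_text)
    return {
        "id": node.findtext("FAQID").strip(),
        "domain": node.findtext("DOMAIN").strip(),
        "question": node.findtext("QUESTION").strip(),
        "answer": node.findtext("ANSWER").strip(),
    }


def parse_sms(xml_text):
    """Return one SMS query as a dict; 'NONE' matches become None."""
    node = ET.fromstring(xml_text)
    matches = {}
    for child in node.find("MATCHES"):
        value = (child.text or "").strip()
        matches[child.tag.lower()] = None if value == "NONE" else value
    return {
        "id": node.findtext("SMS_QUERY_ID").strip(),
        "text": node.findtext("SMS_TEXT").strip(),
        "matches": matches,
    }


faq = parse_faq(FAQ_XML)
sms = parse_sms(SMS_XML)
print(faq["id"], "|", sms["text"], "|", sms["matches"])
```

Mapping `NONE` to `None` is a design choice of this sketch: it makes "no matching FAQ in this language" easy to test for when scoring.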
DATASET • FAQ Corpus • ENGLISH: 7251 • HINDI: 1994 • MALAYALAM: 681 • SMS Queries (number of queries per sub-task):
Sub-task | Training English | Training Hindi | Training Malayalam | Testing English | Testing Hindi | Testing Malayalam
Mono | 1071 | 230 | 140 | 3405 | 324 | 50
Cross | 472 | - | - | 3405 | - | -
Multi | 460 | 230 | 80 | 3405 | 324 | 50
FLOWCHART OF METHODS
BASIC STEPS • Indexing • INDRI IR system • 2 types − UTF-8 and Translated • Translation mechanism for Hindi • Google Translate • Microsoft Bing Translator • Sample Output • Hindi FAQ: धान भण्डारण करते समय क्या-क्या सावधानियां बरतनी हैं? • Google Translate Output: When grain storage - what are the precautions? • Microsoft Bing Translator Output: What-if, when Paddy cold storage savdhaniyan be?
BASIC STEPS • Translation mechanism for Malayalam • No standard API • Crowdsourcing − oDesk • 681 FAQs + 50 SMS queries • # of translators: 2 • Time: 2 days • Cost: 40 USD • Example (the original Malayalam question is not recoverable from the slide text): • English Translation 1: Which is the longest river in the world? • English Translation 2: Which is world’s longest river?
BASIC STEPS • Straight Borda Count • Used for merging several results • Consensus-based voting of retrieval results • All retrieval methods use Indri’s belief operator #combine
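To illustrate how consensus voting merges several ranked result lists, here is a minimal sketch of a Borda count. It assumes one common scoring variant (the top-ranked item in a list of depth n gets n points, the next n-1, and unranked items get 0); the slides do not say which variant or tie-breaking rule was used, and the name `borda_merge` and the FAQ ids are hypothetical.

```python
from collections import defaultdict


def borda_merge(rankings, depth=None):
    """Merge ranked lists of FAQ ids by summing Borda points.

    Assumes each list contains no duplicate ids. Ties are broken by the
    order in which ids were first seen (an assumption of this sketch).
    """
    if depth is None:
        depth = max((len(r) for r in rankings), default=0)
    scores = defaultdict(int)
    first_seen = {}
    for ranking in rankings:
        for position, faq_id in enumerate(ranking[:depth]):
            # Position 0 earns `depth` points, position 1 earns depth - 1, ...
            scores[faq_id] += depth - position
            first_seen.setdefault(faq_id, len(first_seen))
    return sorted(scores, key=lambda f: (-scores[f], first_seen[f]))


# Three hypothetical retrieval runs for the same query.
run_a = ["F1", "F2", "F3"]
run_b = ["F2", "F1", "F4"]
run_c = ["F2", "F3", "F1"]
print(borda_merge([run_a, run_b, run_c]))  # ['F2', 'F1', 'F3', 'F4']
```

Here F2 collects 2 + 3 + 3 = 8 points and F1 collects 3 + 2 + 1 = 6, so F2 is ranked first even though it is not at the top of every list.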
MONO-LINGUAL RETRIEVAL (ENGLISH)
MONO-LINGUAL RETRIEVAL (ENGLISH) • ENGLISH • Google Spelling Suggestions • Input: <SMS> <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID> <SMS_TEXT>wht is career counclng</SMS_TEXT> ... </SMS> • Output: <SMS> <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID> <SMS_TEXT>what is career counselling</SMS_TEXT> ... </SMS> • No standard API
MONO-LINGUAL RETRIEVAL (ENGLISH) • ENGLISH (cont.) • Term Expansion • 1-4 character words • Commonly used abbreviations: ‘c’ for ‘see’ • Manually created lookup table • 766 abbreviations and expansions • Aspell spell-checker • Problem with common acronyms and proper nouns (Ghaziabad -> Gasbag) • Term Frequency • ≤ 6 least frequent terms/SMS query
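The term expansion and term frequency steps can be sketched as below. This is a hedged illustration, not the authors' code: the lookup table holds only the few abbreviations mentioned in these slides (the real table has 766 entries), the frequency counts are invented for the example, and treating "frequency" as collection frequency with unseen terms counted as 0 is an assumption. The final helper wraps the selected terms in Indri's `#combine` operator, which the slides say all retrieval methods use.

```python
# Hypothetical excerpt of the manual lookup table; only abbreviations
# mentioned in the slides are included.
ABBREVIATIONS = {"c": "see", "wht": "what", "wt": "what", "vat": "what"}


def expand_terms(sms_text, table=ABBREVIATIONS, max_len=4):
    """Expand short tokens (1 to max_len characters) found in the table."""
    expanded = []
    for token in sms_text.lower().split():
        if len(token) <= max_len and token in table:
            expanded.extend(table[token].split())
        else:
            expanded.append(token)
    return expanded


def least_frequent_terms(terms, collection_freq, limit=6):
    """Keep at most `limit` rarest distinct terms, in their original order.

    Terms missing from `collection_freq` count as frequency 0 (assumption).
    """
    unique = list(dict.fromkeys(terms))
    ranked = sorted(unique, key=lambda t: collection_freq.get(t, 0))
    keep = set(ranked[:limit])
    return [t for t in unique if t in keep]


def indri_combine(terms):
    """Build an Indri query string using the #combine belief operator."""
    return "#combine( " + " ".join(terms) + " )"


# Invented frequencies, for illustration only.
freq = {"what": 900, "is": 1500, "career": 40}
terms = expand_terms("wht is career counclng")
print(terms)
print(indri_combine(least_frequent_terms(terms, freq, limit=2)))
```

With `limit=2` the two rarest terms survive, so the query becomes `#combine( career counclng )`; with the slide's limit of 6, this short query would keep all four terms.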
MONO-LINGUAL RETRIEVAL (HINDI) • HINDI • UTF-8 retrieval • English-translated retrieval − Similar to English • Straight Borda Count
MONO-LINGUAL RETRIEVAL (MALAYALAM) • MALAYALAM • UTF-8 retrieval • English-translated retrieval − oDesk • Straight Borda Count
CROSS-LINGUAL RETRIEVAL
CROSS-LINGUAL RETRIEVAL • English query, Hindi FAQs • Same methods as in English mono-lingual retrieval • Only the index is different • Hindi FAQs translated into English
MULTI-LINGUAL RETRIEVAL
MULTI-LINGUAL RETRIEVAL • English SMS searched against: English FAQ, Hindi TR (translated) FAQ, Malayalam TR FAQ • ENGLISH SMS • Run 1: Google Spelling Suggestions + Term Expansion • Run 2: Google Spelling Suggestions + Term Expansion + Spell check • Run 3: Google Spelling Suggestions + Term Expansion + Term Frequency
MULTI-LINGUAL RETRIEVAL • Hindi SMS (TR + native) searched against: English FAQ, Hindi FAQ (TR + UTF-8), Malayalam TR FAQ • HINDI SMS • Run 1: Translated SMS queries + Google Spelling Suggestions (all English indexes) • Run 2: • Hindi: UTF-8 query on UTF-8 index • English & Malayalam (translated): English query on English index
MULTI-LINGUAL RETRIEVAL • Malayalam SMS (TR + native) searched against: English FAQ, Hindi TR FAQ, Malayalam FAQ (TR + UTF-8) • MALAYALAM SMS • Run 1: oDesk Translated SMS queries (all English indexes) • Run 2: • Malayalam: UTF-8 query on UTF-8 index • English & Hindi (translated): English query on English index
RESULTS • Mono-lingual FAQ Retrieval • Mean Reciprocal Rank (MRR):
Run | English | Hindi | Malayalam
Run 1 | 0.736 | 0.746 | 0.838
Run 2 | 0.687 | 0.860 | 0.893
Run 3 | 0.711 | 0.819 | 0.881
• Best English run: Run 1 (Google Spelling Suggestions + Term Expansion) • English: Aspell spell-checker doesn’t work well
RESULTS • Mono-lingual FAQ Retrieval • Best Hindi run: Run 2 (Translated + Google Spelling Suggestions), MRR 0.860 • Hindi: Translated queries and corpus work well
RESULTS • Mono-lingual FAQ Retrieval • Best Malayalam run: Run 2 (oDesk Translated), MRR 0.893 • Malayalam: Translated queries and corpus work well
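For readers unfamiliar with the metric reported above, here is a minimal sketch of Mean Reciprocal Rank: each query scores 1/rank of its first relevant FAQ (0 if none is retrieved), and the scores are averaged over queries. The query and FAQ ids are hypothetical, and how the official FIRE evaluation treats queries with no matching FAQ is not described in these slides, so this sketch simply averages over the queries it is given.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, faq_id in enumerate(ranked_ids, start=1):
        if faq_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, judgments):
    """Average reciprocal rank over all judged queries.

    runs:      {query_id: [faq_id, ...]} ranked retrieval output
    judgments: {query_id: {faq_id, ...}} relevant FAQ ids per query
    """
    if not judgments:
        return 0.0
    total = sum(
        reciprocal_rank(runs.get(query_id, []), relevant)
        for query_id, relevant in judgments.items()
    )
    return total / len(judgments)


# Hypothetical example: hits at rank 1, rank 2, and no hit.
runs = {"q1": ["A", "B"], "q2": ["C", "D"], "q3": ["E", "F"]}
judgments = {"q1": {"A"}, "q2": {"D"}, "q3": {"Z"}}
print(mean_reciprocal_rank(runs, judgments))  # (1 + 0.5 + 0) / 3 = 0.5
```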
RESULTS (CONT.) • Cross-lingual FAQ Retrieval • MRR: Run 1 0.108 • Run 2 0.135 • Run 3 0.104 • Probable errors in relevance judgments • SMS Query (ID: ENG_SMS_QUERY_I31): WHAT IS WIRELESS ISP? • Relevance Judgment: HINDI_TELECOMMUNICATION_88 -> Q. दुनिया का स्थान दूरसंचार के क्षेत्र में क्या है? (English-translated: Q. What is the world's place in the telecom sector?) • Run 1 Retrieval: HINDI_TELECOMMUNICATION_87 -> Q. वायरलेस आईएसपी क्या है? (English-translated: Q. What is a Wireless ISP?)
RESULTS (CONT.) • Multi-lingual FAQ Retrieval (MRR):
Run | English | Hindi | Malayalam
Run 1 | 0.711 | 0.727 | 0.889
Run 2 | 0.683 | 0.839 | 0.829
Run 3 | 0.661 | - | -
• English: Aspell spell-checker doesn’t work well • Hindi: Involving the native UTF-8 retrieval gives a better score • Malayalam: Translated queries and corpus work well
RESULTS (CONT.) • Comparison of best results (MRR) from mono- and multi-lingual tasks:
Task | English | Hindi | Malayalam
Mono | 0.736 | 0.860 | 0.893
Multi | 0.711 | 0.839 | 0.889
RESULTS (CONT.) • Identical errors in relevance judgment as in cross-lingual retrieval • But minimal effect
# of SMS queries: Total 3405 | Relevant 704 (20.6%) | Non-relevant 2701 (79.4%)
% in Relevant: English 704 (100%) | Hindi 37 (5.2%) | Malayalam 84 (11.9%)
CONCLUSIONS • Google spelling suggestions and term expansion improve retrieval performance • For Hindi and Malayalam, translation to English helps • Use of crowdsourcing for Malayalam-English translation is effective • Multi-lingual translation is more challenging than mono-lingual • Less noise (abbreviations, misspellings, etc.) in Hindi and Malayalam SMS • Future work: Explore other techniques for handling non-standard abbreviations
ACKNOWLEDGMENT • Text Mining and Retrieval Group, Computer Science, University of Iowa • FIRE 2011 Organizers
Thank You!