Model learning
• State value function: $V_\theta(s_t) = \sum_{j=1}^{t} r_\theta(s_{j-1}, a_j)$
• An end-to-end trainable, question-specific neural network model
• Weakly supervised learning setting
  • Question-answer pairs are available
  • The correct parse for each question is not available
• Issue of delayed (sparse) reward
  • Reward is only available after obtaining a (complete) parse and its answer
• Approximate (dense) reward
  • Check the overlap of the answers $A(s)$ of a partial parse with the gold answers $A^*$:
  • $R(s) = \dfrac{|A(s) \cap A^*|}{|A(s) \cup A^*|}$ (see the sketch below)
[Iyyer+18; Andreas+16; Yih+15]
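A minimal sketch of this Jaccard-style approximate reward; the function and variable names are illustrative, not from the paper.

```python
def approx_reward(partial_answers, gold_answers):
    """Dense reward R(s): Jaccard overlap between the answer set A(s)
    denoted by a (partial) parse and the gold answer set A*."""
    a, a_star = set(partial_answers), set(gold_answers)
    if not a and not a_star:
        return 0.0
    return len(a & a_star) / len(a | a_star)

# A partial parse that already selects {Dragonwing, Harmonia} against the
# gold answer {Dragonwing} gets reward 0.5 instead of a zero delayed reward.
print(approx_reward({"Dragonwing", "Harmonia"}, {"Dragonwing"}))  # 0.5
```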
Parameter updates
• Make the state value function $V_\theta$ behave similarly to the reward $R$
• For every state $s$ and its (approximated) reference state $s^*$, define the loss
  • $\mathcal{L}(s) = \big(V_\theta(s) - R(s)\big) - \big(V_\theta(s^*) - R(s^*)\big)$
• Improve learning efficiency by finding the most violated state $\hat{s}$
(Slide figure: training algorithm — for each labeled QA pair, find the best approximated reference state and the most violated state, then update $\theta$.)
[Iyyer+18; Taskar+04]
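To make the update concrete, here is a hedged sketch of the value-vs-reward violation and the most-violated-state search. The exact loss in [Iyyer+18] may differ in details (e.g., whether it is clamped at zero); all names, and the assumption that `v_theta` returns a scalar tensor, are illustrative.

```python
import torch

def dynsp_loss(v_theta, s_hat, s_ref, reward):
    """Hinge-style loss encouraging V_theta to track the approximate reward R:
    L = [V(s_hat) - R(s_hat)] - [V(s_ref) - R(s_ref)], clamped at zero.
    s_hat is the most violated state found by search; s_ref the reference state."""
    violation = (v_theta(s_hat) - reward(s_hat)) - (v_theta(s_ref) - reward(s_ref))
    return torch.clamp(violation, min=0.0)

def most_violated_state(candidate_states, v_theta, reward):
    """Pick the candidate whose value most overshoots its (approximate) reward."""
    return max(candidate_states, key=lambda s: v_theta(s).item() - reward(s))
```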
DynSP for SQA: example parse
• “Which superheroes came from Earth and first appeared after 2009?”
  • (a1) Select-column Character
  • (a2) Cond-column Home World
  • (a3) Op-Equal “Earth”
  • (a4) Cond-column First Appeared
  • (a5) Op-GT “2009”
• “Which of them breathes fire?”
  • (a12) S-Cond-column Powers
  • (a13) S-Op-Equal “Fire breath”
(Slide figure: possible action transitions based on their types; shaded circles are end states.)
[Iyyer+18; Andreas+16; Yih+15]
DynSP for sequential QA (SQA)
• Given a question (history) and a table
  • Q1: Which superheroes came from Earth and first appeared after 2009?
  • Q2: Which of them breathes fire?
• Add a subsequent statement (answer column) for sequential QA
  • Select Character Where {Home World = “Earth”} & {First Appeared > “2009”}
    • A1: {Dragonwing, Harmonia}
  • Subsequent Where {Powers = “Fire breath”}
    • A2: {Dragonwing}
[Iyyer+18]
Query rewriting approaches to SQA Q1: When was California founded? A1: September 9, 1850 Q2: Who is its governor? → Who is California governor? A2: Jerry Brown Q3: Where is Stanford? A3: Palo Alto, California Q4: Who founded it? → Who founded Stanford? A4: Leland and Jane Stanford Q5: Tuition costs → Tuition cost Stanford A5: $47,940 USD [Ren+18; Zhou+20]
Dialog Manager – dialog memory for state tracking
• The dialog memory (of the state tracker) records, from previous turns (partial/complete states):
  • Entities, with their sources: {United States, “q”}, {New York City, “a”}, {University of Pennsylvania, “a”}, …
  • Predicates: {isPresidentOf}, {placeGraduateFrom}, {yearEstablished}, …
  • Action subsequences, i.e., partial or complete parses such as Set → A4 A15
[Guo+18]
Dialog Manager – policy for next action selection
• A case study of Movie-on-demand
• The system decides to either return an answer or ask a clarification question
• What (clarification) question to ask? E.g., movie title, director, genre, actor, release year, etc.
[Dhingra+17]
What clarification question to ask
• Baseline: ask all questions in a randomly sampled order
• Ask questions that users can answer
  • learned from query logs
• Ask questions that help reduce the search space
  • entropy minimization (an illustrative sketch follows below)
• Ask questions that help complete the task successfully
  • reinforcement learning via agent-user interactions
(Slide figure: task success rate vs. number of dialogue turns, results on simulated users.)
[Wu+15; Dhingra+17; Wen+17; Gao+19]
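To make the "entropy minimization" criterion concrete, here is an illustrative sketch: given a posterior over candidate movies, ask about the slot whose answer is expected to shrink the entropy of that posterior the most. All names (slots, candidates) are hypothetical, not from [Wu+15] or [Dhingra+17].

```python
import math
from collections import defaultdict

def expected_entropy_after_asking(slot, candidates, probs):
    """Expected entropy of the movie posterior after the user answers `slot`.
    candidates: list of dicts mapping slot -> value; probs: posterior over candidates."""
    answer_mass = defaultdict(float)
    for cand, p in zip(candidates, probs):
        answer_mass[cand[slot]] += p
    expected = 0.0
    for value, p_value in answer_mass.items():
        # posterior restricted to candidates consistent with this answer
        cond = [p / p_value for cand, p in zip(candidates, probs) if cand[slot] == value]
        expected += p_value * -sum(p * math.log(p) for p in cond if p > 0)
    return expected

def next_question(slots, candidates, probs):
    """Ask about the slot with the lowest expected remaining entropy."""
    return min(slots, key=lambda s: expected_entropy_after_asking(s, candidates, probs))
```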
Response Generation
• Convert a “dialog act” into a “natural language response”
• Formulated as a seq2seq task in a few-shot learning setting
  • $p_\theta(\boldsymbol{y} \mid a) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, a)$
  • Very limited training samples for each task
• Approach
  • Semantically conditioned neural language model
  • Pre-training + fine-tuning, e.g., semantically conditioned GPT (SC-GPT); a minimal fine-tuning sketch follows below
[Peng+20; Yu+19; Wen+15; Chen+19]
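The factorization above treats the response as generated token by token conditioned on the dialog act. A minimal fine-tuning sketch with Hugging Face GPT-2; the dialog-act serialization, the "&" separator, and computing the LM loss over the whole sequence (SC-GPT masks the dialog-act tokens) are simplifying assumptions, not the exact SC-GPT recipe.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# dialog act a serialized as text, followed by the reference response y
dialog_act = "inform ( title = The Matrix ; genre = sci-fi )"
response = "The Matrix is a sci-fi movie. Would you like to watch it?"
ids = tokenizer.encode(dialog_act + " & " + response, return_tensors="pt")

# maximizing sum_t log p_theta(y_t | y_<t, a): GPT-2's LM loss gives this directly
loss = model(ids, labels=ids).loss
loss.backward()  # one fine-tuning step (optimizer omitted)
```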
SC-GPT
(Slide figure: performance of different response generation models in the few-shot setting, 50 samples per task.)
[Peng+20; Raffel+19]
C-KBQA approaches w/o semantic parser
• Building semantic parsers is challenging
  • Limited amounts of training data, or
  • Weak supervision
• C-KBQA with no logical form
• Symbolic approach: “look before you hop”
  • Answer an initial question using any standard KBQA method
  • Form a context subgraph using the entities of the initial QA pair
  • Answer follow-up questions by expanding the context subgraph to find candidate answers
• Neural approach
  • Encode the KB as a graph using a GNN
  • Select answers from the encoded graph using a pointer network
[Christmann+19; Muller+19]
Open Benchmarks
• SQA (sequential question answering)
  • https://www.microsoft.com/en-us/download/details.aspx?id=54253
• CSQA (complex sequential question answering)
  • https://amritasaha1812.github.io/CSQA/
• ConvQuestions (conversational question answering over knowledge graphs)
  • https://convex.mpi-inf.mpg.de/
• CoSQL (conversational text-to-SQL)
  • https://yale-lily.github.io/cosql
• CLAQUA (asking clarification questions in knowledge-based question answering)
  • https://github.com/msra-nlc/MSParS_V2.0
Conversational QA over Texts • Tasks and datasets • C-TextQA system architecture • Conversational machine reading comprehension models • Remarks on pre-trained language models for conversational QA
QA over text – extractive vs. abstractive QA [Rajpurkar+16; Nguyen+16; Gao+19]
Conversational QA over text: CoQA & QuAC [Choi+18; Reddy+18]
Dialog behaviors in conversational QA • Topic shift: a question about something previously discussed • Drill down: a request for more info about the topic being discussed • Topic return: asking about a topic again after it has been shifted away from • Clarification: reformulating a question • Definition: asking what is meant by a term [Yatskar 19]
C-TextQA system architecture
• (Conversational) MRC module
  • Find the answer to a question given the text and previous QA pairs
  • Extractive (span) vs. abstractive answers
• Dialog manager
  • Dialog state tracker: maintain/update the state of the dialog history (e.g., previous QA pairs)
  • Dialog policy: select the next system action (e.g., ask a clarification question, answer)
• Response generator
  • Convert the system action into a natural language response
(Slide figure: architecture diagram with an example conversation — Q1: What is the story about? A1: a young girl and her dog. Q2: What were they doing? A2: set out on a trip. Q3: Where? A3: the woods.)
[Huang+19]
Neural MRC models for extractive TextQA
• QA as classification given a (question, text) pair
  • Classify each word in the passage as the start/end/outside of the answer span
• Encoding: represent each passage word using an integrated context vector that encodes info from
  • Lexicon/word embedding (context-free)
  • Passage context
  • Question context
  • Conversation context (previous question-answer pairs)
• Prediction: from each word’s integrated context vector, predict the start and end positions of the answer span (see the sketch below)
[Rajpurkar+16; Huang+10; Gao+19]
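The classification view above amounts to two token-level scorers over the integrated context vectors. A minimal PyTorch sketch; shapes, names, and the toy gold positions are illustrative.

```python
import torch
import torch.nn as nn

class SpanPredictor(nn.Module):
    """Scores every passage position as the start or end of the answer span,
    given integrated context vectors H of shape (batch, passage_len, hidden)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.start_scorer = nn.Linear(hidden_size, 1)
        self.end_scorer = nn.Linear(hidden_size, 1)

    def forward(self, H):
        start_logits = self.start_scorer(H).squeeze(-1)  # (batch, passage_len)
        end_logits = self.end_scorer(H).squeeze(-1)
        return start_logits, end_logits

# training uses cross-entropy against the gold start/end positions
predictor = SpanPredictor(hidden_size=768)
H = torch.randn(2, 120, 768)                  # 2 passages, 120 tokens each
start_logits, end_logits = predictor(H)
loss = nn.functional.cross_entropy(start_logits, torch.tensor([17, 3])) \
     + nn.functional.cross_entropy(end_logits, torch.tensor([21, 5]))
```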
Three encoding components
• Lexicon embedding, e.g., GloVe
  • represent each word as a low-dimensional continuous vector
• Passage contextual embedding, e.g., Bi-LSTM/RNN, ELMo, self-attention/BERT
  • capture context info for each word within the passage
• Question contextual embedding, e.g., attention, BERT
  • fuse question info into each passage word vector
(Slide figure: encoding diagram over the question and the passage.)
[Pennington+14; Melamud+16; Peters+18; Devlin+19]
Neural MRC model: BiDAF
(Slide figure: BiDAF architecture — lexicon embedding, passage contextual embedding, question contextual embedding, integrated context vectors, answer prediction.)
[Seo+16]
Transformer-based MRC model: BERT
(Slide figure: BERT architecture over the question-passage pair — lexicon embedding, passage contextual embedding (self-attention), question contextual embedding (inter-attention), integrated context vectors, answer prediction.)
[Devlin+19]
Conversational MRC models
• QA as classification given a (question, text) pair
  • Classify each word in the passage as the start/end/outside of the answer span
• Encoding: represent each passage word using an integrated context vector that encodes info about
  • Lexicon/word embedding
  • Passage context
  • Question context
  • Conversation context (previous question-answer pairs)
• Prediction: from each word’s integrated context vector, predict the start and end positions of the answer span
A recent review of conversational MRC is [Gupta&Rawat 20]
Conversational MRC models
• Prepending conversation history to the current question or passage
  • Converts conversational QA into single-turn QA
• BiDAF++ (BiDAF for C-QA)
  • Append a feature vector encoding the dialog turn number to the question embedding
  • Append a feature vector encoding the locations of the previous N answers to the passage embedding
• BERT (or RoBERTa)
  • Prepend the dialog history to the current question (a sketch of the input construction follows below)
  • Use BERT for
    • passage context embedding (self-attention)
    • question/conversation context embedding (inter-attention)
[Choi+18; Zhu+19; Ju+19; Devlin+19]
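For the BERT/RoBERTa variant, "prepending dialog history" is just a matter of how the input sequence is packed before encoding. A hedged sketch of the input construction; the separator and truncation choices vary across papers and are assumptions here.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

history = [("What is the story about?", "young girl and her dog"),
           ("What were they doing?", "set out a trip")]
current_question = "Where?"
passage = "Once upon a time there was a young girl who lived near the woods ..."

# flatten previous QA pairs in front of the current question
question_with_history = " ".join(q + " " + a for q, a in history) + " " + current_question

inputs = tokenizer(question_with_history, passage,
                   truncation="only_second", max_length=384, return_tensors="pt")
# inputs["input_ids"] now encodes [CLS] history + question [SEP] passage [SEP]
```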
FlowQA: explicitly encoding dialog history
• Integration Flow (IF) layer
  • Given:
    • the current question $Q_T$ and the previous questions $Q_t$, $t < T$
    • for each question $Q_t$, the integrated context vector $c^t_i$ of each passage word $i$
  • Output:
    • a conversation-history-aware integrated context vector for each passage word:
    • $\hat{c}^T_i = \mathrm{LSTM}(c^1_i, \ldots, c^t_i, \ldots, c^T_i)$
  • So the integrated context vectors used to answer previous questions can be reused to answer the current question (see the sketch below)
• Extensions of IF
  • FlowDelta explicitly models the information gain through the conversation
  • GraphFlow captures the conversation flow using a graph neural network
  • IF can be implemented with a Transformer using proper attention masks
[Huang+19; Yeh&Chen 19; Chen+19]
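The IF layer runs a recurrence not along the passage but along the dialog turns, once per passage position. A minimal sketch assuming the per-turn integrated context vectors are already computed; tensor layout and sizes are illustrative.

```python
import torch
import torch.nn as nn

class IntegrationFlow(nn.Module):
    """For each passage position i, run an LSTM over its per-turn context
    vectors c_i^1 ... c_i^T so the current turn can reuse earlier reasoning."""
    def __init__(self, hidden_size):
        super().__init__()
        self.turn_lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)

    def forward(self, C):
        # C: (num_turns, passage_len, hidden) -> treat passage positions as the batch
        per_position = C.permute(1, 0, 2)          # (passage_len, num_turns, hidden)
        flowed, _ = self.turn_lstm(per_position)   # LSTM over the turn dimension
        return flowed.permute(1, 0, 2)             # back to (num_turns, passage_len, hidden)

flow = IntegrationFlow(hidden_size=128)
C = torch.randn(4, 200, 128)    # 4 turns, 200 passage tokens
history_aware = flow(C)
```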
Remarks on BERT/RoBERTa
• BERT-based models achieve SOTA results on conversational QA/MRC leaderboards
• What BERT learns
  • BERT rediscovers the classical NLP pipeline in an interpretable way
  • BERT exploits spurious statistical patterns in datasets instead of learning meaning in the generalizable way that humans do, so it is
  • vulnerable to adversarial attacks (adversarial input perturbations)
    • Text-QA: Adversarial SQuAD [Jia&Liang 17]
    • Classification: TextFooler [Jin+20]
    • Natural language inference: Adversarial NLI [Nie+19]
• Towards a robust QA model
[Tenney+19; Nie+19; Jin+20; Liu+20]
BERT rediscovers the classical NLP pipeline in an interpretable way • Quantify where linguistic info is captured within the network • Lower layers encode more local syntax • higher layers encode more global complex semantics • A higher center-of-gravity value means that the information needed for that task is captured by higher layers [Tenney+19]
Adversarial examples (BERT-BASE results)
                Text-QA    Sentiment classification
                SQuAD      MR       IMDB     Yelp
Original        88.5       86.0     90.9     97.0
Adversarial     54.0       11.5     13.6     6.6
[Jia&Liang 17; Jin+20; Liu+20]
Build robust AI models via adversarial training
• Standard training objective
• Adversarial training in computer vision: apply small perturbations to input images that maximize the adversarial loss
• Adversarial training for neural language modeling (ALUM):
  • Perturb word embeddings instead of words
  • Adopt virtual adversarial training to regularize the standard objective
(a sketch of the embedding-perturbation step follows below)
[Goodfellow+16; Madry+17; Miyato+18; Liu+20]
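A hedged sketch of the embedding-space perturbation: a single-step approximation that perturbs word embeddings in the loss-increasing direction and adds the adversarial loss to the standard one. ALUM itself uses virtual adversarial training with a KL term and more careful projection; the assumption here is a Hugging-Face-style classifier that accepts `inputs_embeds`.

```python
import torch
import torch.nn.functional as F

def adversarial_step(model, embeddings, labels, epsilon=1e-3, alpha=1.0):
    """Perturb word embeddings in the direction that increases the loss,
    then add the loss at the perturbed point to the standard objective."""
    embeddings = embeddings.detach().requires_grad_(True)
    clean_loss = F.cross_entropy(model(inputs_embeds=embeddings).logits, labels)

    # gradient of the loss w.r.t. the embeddings gives the ascent direction
    grad, = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)
    delta = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)

    adv_loss = F.cross_entropy(model(inputs_embeds=embeddings + delta).logits, labels)
    return clean_loss + alpha * adv_loss   # backpropagate this combined objective
```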
Generalization and robustness
• Generalization: perform well on unseen data
  • via pre-training
• Robustness: withstand adversarial attacks
  • via adversarial training
• Can we achieve both?
  • Past work finds that adversarial training can enhance robustness but hurts generalization [Raghunathan+19; Min+20]
  • Adversarial pre-training (ALUM) improves both [Liu+20]
[Raghunathan+19; Min+20; Liu+20]
Outline • Part 1: Introduction • Part 2: Conversational QA methods • Part 3: Conversational search methods • Part 4: Case study of commercial systems
Conversational Search: Outline
• What is conversational search?
  • A view from the TREC Conversational Assistance Track (TREC CAsT) [1]
• Unique challenges in conversational search
  • Conversational query understanding [2]
• How to make search more conversational?
  • From passive retrieval to active conversation with conversation recommendation [3]
[1] CAsT 2019: The Conversational Assistance Track Overview
[2] Few-Shot Generative Conversational Query Rewriting
[3] Leading Conversational Search by Suggesting Useful Questions
Why Conversational Search
• Ad hoc search: keyword-ese queries → conversational search: natural queries
• Necessity: speech/mobile interfaces
• Opportunity: more natural and explicit expression of information needs
• Challenge: query understanding and sparse retrieval
Why Conversational Search
• Ad hoc search: ten blue links → conversational search: natural responses
• Necessity: speech/mobile interfaces
• Opportunity: direct and easier access to information
• Challenge: document understanding; combining and synthesizing information
Why Conversational Search
• Ad hoc search: single-shot queries → conversational search: multi-turn dialog
• Necessity: N.A.
• Opportunity: serving complex information needs and tasks
• Challenge: contextual understanding and memorization
Why Conversational Search
• Ad hoc search: passive serving → conversational search: active engaging (e.g., “Did you mean the comparison between seed investment and crowdfunding?”)
• Necessity: N.A.
• Opportunity: collaborative information seeking and better task assistance
• Challenge: dialog management; a less lenient user experience
A View of Current Conversational Search
(Slide figure: round 1 — the conversational query “How does seed investment work?” goes to search, the retrieved documents go to response synthesis, which produces the system response.)
A View of Current Conversational Search
(Slide figure: round 2 — the follow-up query “Tell me more about the difference” plus the round-1 context goes through contextual understanding, producing the context-resolved query “Tell me more about the difference between seed and early stage funding”; search retrieves documents and response synthesis produces the system response.)
A View of Current Conversational Search
(Slide figure: in addition to contextual understanding, search, and response synthesis, the system generates conversation recommendations — learning to ask clarifications such as “Did you mean the difference between seed and early stage?” and suggestions such as “Are you also interested in learning the different series of investments?” — which are folded into the system response.)
A Simpler View from TREC CAsT 2019
• “Conversational passage retrieval/QA”
• Input: manually written conversational queries
  • ~20 topics, ~8 turns per topic
  • Contextually dependent on previous queries
• Corpus: MS MARCO + CAR answer passages
• Task: passage retrieval for conversational queries (contextual understanding produces a context-resolved query for search)
http://treccast.ai/
TREC CAsT 2019
• An example conversational search session (title: head and neck cancer; description: a person is trying to compare and contrast types of cancer in the throat, esophagus, and lungs)
  1 What is throat cancer?
  2 Is it treatable?
  3 Tell me about lung cancer.
  4 What are its symptoms?
  5 Can it spread to the throat?
  6 What causes throat cancer?
  7 What is the first sign of it?
  8 Is it the same as esophageal cancer?
  9 What's the difference in their symptoms?
• Input: manually written conversational queries (~20 topics, ~8 turns per topic; contextually dependent on previous queries)
• Corpus: MS MARCO + CAR answer passages
• Task: passage retrieval for conversational queries
http://treccast.ai/
TREC CAsT 2019
• Challenge: contextual dependency on previous conversation queries — in the example session above, “it”, “its”, “them”, and “their” must be resolved against earlier turns
http://treccast.ai/
TREC CAsT 2019
• Learn to resolve the contextual dependency (title: head and neck cancer) — raw queries vs. the manual rewrites provided by CAsT Y1:
  1 What is throat cancer? → What is throat cancer?
  2 Is it treatable? → Is throat cancer treatable?
  3 Tell me about lung cancer. → Tell me about lung cancer.
  4 What are its symptoms? → What are lung cancer’s symptoms?
  5 Can it spread to the throat? → Can lung cancer spread to the throat?
  6 What causes throat cancer? → What causes throat cancer?
  7 What is the first sign of it? → What is the first sign of throat cancer?
  8 Is it the same as esophageal cancer? → Is throat cancer the same as esophageal cancer?
  9 What's the difference in their symptoms? → What's the difference in throat cancer and esophageal cancer's symptoms?
http://treccast.ai/
TREC CAsT 2019: Query Understanding Challenge
• Statistics of the Y1 testing queries:
  Type (# turns)        Utterance                                 Mention
  Pronominal (128)      How do they celebrate Three Kings Day?    they -> Spanish people
  Zero (111)            What cakes are traditional?               Null -> Spanish, Three Kings Day
  Groups (4)            Which team came first?                    which team -> Avengers, Justice League
  Abbreviations (15)    What are the main types of VMs?           VMs -> Virtual Machines
CAsT 2019: The Conversational Assistance Track Overview
TREC CAsT 2019: Result Statistics
• Challenge from contextual query understanding
(Slide figure: notable gaps between automatic and manual runs.)
CAsT 2019: The Conversational Assistance Track Overview
TREC CAsT 2019: Techniques
• Techniques used in query understanding
(Slide figure: usage fraction and relative NDCG gains of techniques — coreference resolution, NLP toolkits, rules, entity linking, MS MARCO Conv data, Y1 training data / manual testing data, external or unsupervised resources, deep learning, or none.)
CAsT 2019: The Conversational Assistance Track Overview
TREC CAsT 2019: Notable Solutions
• Automatic run results
(Slide figure: NDCG@3 of the automatic runs; notable solutions include GPT-2 generative query rewriting [1] and BERT-based query expansion [2].)
[1] Vakulenko et al. 2020. Question Rewriting for Conversational Question Answering
[2] Lin et al. 2020. Query Reformulation using Query History for Passage Retrieval in Conversational Search
Conversational Query Understanding via Rewriting
• Learn to rewrite a full-grown, context-resolved query
  • Input: $q_1, q_2, \ldots, q_i$ (e.g., “What is throat cancer?” … “What is the first sign of it?”)
  • Output: $q_i^*$ (e.g., “What is the first sign of throat cancer?”)
Vakulenko et al. 2020. Question Rewriting for Conversational Question Answering
Conversational Query Understanding via Rewriting
• Learn to rewrite a full-grown, context-resolved query
  • Input: $q_1, q_2, \ldots, q_i$; Output: $q_i^*$
• Leverage a pretrained NLG model (GPT-2) [1]
  • Feed “$q_1 \ q_2 \ldots q_i$ [GO]” to GPT-2 and generate $q_i^*$ (a sketch follows below)
Vakulenko et al. 2020. Question Rewriting for Conversational Question Answering
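A sketch of the rewriting setup above: the query history is concatenated, a go symbol marks where the rewrite starts, and GPT-2 generates the self-contained query. The "[SEP]"/"[GO]" strings are treated here as plain text; in practice they would be registered as special tokens, and an off-the-shelf GPT-2 only produces useful rewrites after fine-tuning on (history, rewrite) pairs.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")   # fine-tuned on rewrite pairs in practice

history = ["What is throat cancer?", "Is it treatable?",
           "Tell me about lung cancer.", "What are its symptoms?",
           "Can it spread to the throat?", "What causes throat cancer?",
           "What is the first sign of it?"]
prompt = " [SEP] ".join(history) + " [GO] "

input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=20,
                        pad_token_id=tokenizer.eos_token_id)
rewrite = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
# target rewrite: "What is the first sign of throat cancer?"
```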
Conversational Query Understanding via Rewriting
• Learn to rewrite a full-grown, context-resolved query with GPT-2
• Concern: limited training data
  • CAsT Y1 data: manually written conversational queries; 50 topics, 10 turns per topic; 20 topics with TREC relevance labels
  • A 100x mismatch: millions of GPT-2 parameters vs. ~500 manual rewrite labels
Vakulenko et al. 2020. Question Rewriting for Conversational Question Answering
Few-Shot Conversational Query Rewriting
• Train a conversational query rewriter with the help of ad hoc search data
  • Ad hoc search: billions of existing search sessions; lots of high-quality public benchmarks
  • Conversational search: production scenarios still being explored; a relatively new topic with little available data
Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020
Few-Shot Conversational Query Rewriting
• Leverage ad hoc search sessions for conversational query understanding: convert ad hoc search sessions into conversational query rounds
• Challenges?
  • Full sessions are available only in commercial search engines; approximate sessions are available in MS MARCO
  • Ad hoc queries are keyword-ese; filter to question-like queries by question words
  • Ad hoc sessions have no explicit context dependency between queries
Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020
Few-Shot Conversational Query Rewriting: Self-Training
• Learn to convert ad hoc sessions into conversational query rounds
  • “Contextualizer” (GPT-2 converter): makes ad hoc sessions more conversation-alike; learns to omit information and add contextual dependency, mapping self-contained queries $q_1^*, q_2^*, \ldots, q_i^*$ to a “conversation-alike” query $q_i'$
• Training:
  • X (self-contained q): manual rewrites of the CAsT Y1 conversational sessions
  • Y (conversation-alike q): raw queries in the CAsT Y1 sessions
• Inference:
  • X (self-contained q): ad hoc questions from MS MARCO sessions
  • Y (conversation-alike q): auto-converted conversational sessions
• Model: any pretrained NLG model — GPT-2 Small in this case
Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020
Few-Shot Conversational Query Rewriting: Self-Training
• Leverage the auto-converted conversational–ad hoc session pairs
  • “Rewriter” (a second GPT-2): recovers the full self-contained queries from conversation rounds; learns from the training data generated by the converter
• Training:
  • X (conversation-alike q): auto-converted sessions from the contextualizer
  • Y (self-contained q): raw queries from the ad hoc MARCO sessions
• Inference:
  • X (conversation-alike q): CAsT Y1 raw conversational queries
  • Y (self-contained q): auto-rewritten queries that are more self-contained
• Model: any pretrained NLG model — another GPT-2 Small in this case
Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020
Few-Shot Conversational Query Rewriting: Self-Training
• The full “self-learning” loop (learning to omit information is easier than learning to recover it)
  • GPT-2 converter: converts ad hoc sessions into conversation-alike sessions; learns from a few conversational queries with manual rewrites
  • GPT-2 rewriter: rewrites conversational queries into self-contained ad hoc queries; learns from the large amount of auto-converted “ad hoc” ↔ “conversation-alike” sessions
  • The contextualizer thus provides much more training signal for the rewriter (an illustrative outline follows below)
Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020
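An illustrative outline of the loop described above. All callables (`finetune`, `generate`) and data structures are hypothetical stand-ins for the GPT-2 fine-tuning and generation steps, not the authors' code.

```python
def self_training_loop(finetune, generate, cast_sessions, marco_sessions):
    """Sketch of the converter/rewriter self-training loop.

    cast_sessions: list of (raw_conversational_queries, manual_rewrites) pairs
    marco_sessions: list of self-contained ad hoc question sessions
    """
    # Step 1: converter learns self-contained -> conversation-alike
    converter = finetune([(rewrites, raw) for raw, rewrites in cast_sessions])

    # Step 2: fabricate conversation-alike sessions from ad hoc questions,
    # keeping the original queries as the self-contained targets
    synthetic = [(generate(converter, session), session) for session in marco_sessions]

    # Step 3: rewriter learns conversation-alike -> self-contained from synthetic pairs
    rewriter = finetune(synthetic)

    # Step 4: apply the rewriter to raw conversational queries at test time
    return [generate(rewriter, raw) for raw, _ in cast_sessions]
```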
Few-Shot Conversational Query Rewriting: Results
(Slide figure: on TREC CAsT Y1, the self-learned GPT-2 rewriter improves generation quality by about 7% BLEU-2 and ranking quality by about 12% NDCG@3 over coreference resolution, approaching the oracle rewrites and exceeding the Y1 best automatic run.)
Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020
How Few-Shot Can Pretrained NLG Models Be?
• Five sessions are all they need?
(Slide figure: BLEU-2 and NDCG@3 on CAsT Y1 vs. the number of training sessions, comparing cross-validation and self-learning.)
Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020
What is learned?
• More about learning the task format than the semantics
  • The semantics come mostly from the pretrained weights
(Slide figure: fraction of rewriting terms copied and fraction of rewrites starting with question words vs. number of training steps, for cross-validation, self-learning, and the oracle.)
Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020
Auto-rewritten Examples: Win
• Surprisingly good at long-term dependency and group reference
Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020
Auto-rewritten Examples: Loss
• Mostly “fail to rewrite” cases
Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020
CAsT Y2: More Realistic Conversational Dependencies
• More interactions between queries and system responses
  • Queries now depend not only on previous queries but also on previous results (the returned answer passages)
• Developed by interacting with a BERT-based search engine: http://boston.lti.cs.cmu.edu/boston-2-25/
CAsT Y2: More Realistic Conversational Dependencies
• Example of a dependency on a previous result:
  • Q1: How did snowboarding begin?
  • R1: …The development of snowboarding was inspired by skateboarding, surfing and skiing. The first snowboard, the Snurfer, was invented by Sherman Poppen in 1965. Snowboarding became a Winter Olympic sport in 1998.
  • Q2: Interesting. That's later than I expected. Who were the winners?
  • Manual rewrite: Who were the winners of snowboarding events in the 1998 Winter Olympics?
  • Auto rewrite without considering the response: Who were the winners of the snowboarding contest?
• Developed by interacting with a BERT-based search engine: http://boston.lti.cs.cmu.edu/boston-2-25/
From Passive Information Supplier to Active Assistant
(Slide figure: the pipeline so far — conversational queries are resolved into a context-resolved query, documents are retrieved, and a passive system response is returned.)
From Passive Information Supplier to Active Assistant
(Slide figure: adding an active assistant — conversation recommendations are generated alongside passive retrieval and folded into the system response.)
Rosset et al. Leading Conversational Search by Suggesting Useful Questions
Making Search Engines More Conversational
• Search is moving from "ten blue links" to conversational experiences
https://sparktoro.com/blog/less-than-half-of-google-searches-now-result-in-a-click/
Making Search Engines More Conversational
• Search is moving from "ten blue links" to conversational experiences
• Yet most queries are not “conversational” — a “chicken and egg” problem:
  1. Users are trained to use keywords
  2. Fewer conversational queries
  3. Less learning signal
  4. Less conversational experience
https://sparktoro.com/blog/less-than-half-of-google-searches-now-result-in-a-click/
Conversation Recommendation: “People Also Ask”
• Promote more conversational experiences in search engines
• E.g., for the keyword query "Nissan GTR", provide follow-up questions:
  • What is Nissan GTR?
  • How to buy used Nissan GTR in Pittsburgh?
  • Does Nissan make sports car?
  • Is Nissan Leaf a good car?
Conversation Recommendation: Challenge
• Relevant != conversation-leading / task assistance
• Users are less lenient toward active recommendations
  • What is Nissan GTR? [Duplicate]
  • How to buy used Nissan GTR in Pittsburgh? [Too specific]
  • Does Nissan make sports car? [Prequel]
  • Is Nissan Leaf a good car? [Misses intent]
Conversation Recommendation: Beyond Relevance
• Recommend useful conversations that
  • help users complete their information needs
  • assist users with their tasks
  • provide meaningful exploration
(Slide figure: the four "Nissan GTR" suggestions split into relevant vs. relevant & useful.)
Usefulness Metric & Benchmark
• Manual annotations on (Bing query, conversation recommendation) pairs
  • Types of non-useful suggestions are defined — crucial for annotation consistency
  • A higher bar for being useful
https://github.com/microsoft/LeadingConversationalSearchbySuggestingUsefulQuestions
Conversation Recommendation Model: Multi-Task BERT
• BERT in the standard multi-task setting: score "[CLS] Query [SEP] PAA Question" (X) against several labels (Y)
  • User click — but clicks can be click-bait, not conversation-leading
  • Relevance — but relevance alone may mean "just related"
  • High/low CTR — again prone to click-bait
(a sketch of the shared-encoder, multi-head model follows below)
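A hedged sketch of the multi-task scorer: one shared BERT encoder over the query-suggestion pair with a binary head per weak label. The class name, the head set, and the use of the pooled [CLS] representation are assumptions for illustration.

```python
import torch.nn as nn
from transformers import BertModel

class MultiTaskSuggestionScorer(nn.Module):
    """Shared BERT encoder over '[CLS] query [SEP] suggested question [SEP]'
    with one binary head per weak label (click, relevance, high/low CTR)."""
    def __init__(self, tasks=("click", "relevance", "ctr")):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.heads = nn.ModuleDict({t: nn.Linear(self.encoder.config.hidden_size, 1)
                                    for t in tasks})

    def forward(self, input_ids, attention_mask, task):
        # pooled [CLS] representation of the (query, suggestion) pair
        cls = self.encoder(input_ids, attention_mask=attention_mask).pooler_output
        return self.heads[task](cls).squeeze(-1)   # logit for the requested task
```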
Conversation Recommendation: Session Trajectory
• Problem: the previous three signals are prone to learning click-bait
• We need more information about how users seek new information
• Solution: imitate how users issue queries in sessions
  • Task: classify whether a potential next query was actually issued by the user, given the session context ("[CLS] Session [SEP] Potential Next Query")
  • Millions of sessions are available for this imitation learning
  • Example session: “Federal Tax Return”, “Flu Shot Codes 2018”, “Facebook”, “Flu Shot Billing Codes 2018”, “How Much is Flu Shot?” — predict the last query from the session context
Conversation Recommendation: Weak Supervision
• Learn to lead the conversation from the queries users search in the next turn
  • PAA tasks (BERT over "[CLS] Query [SEP] PAA Question"): user click, relevance, high/low CTR
  • Weak supervision from sessions (BERT over "[CLS] Query [SEP] Potential Next Query"): user behavior
    • User-provided contents; more exploratory; less constrained by Bing
Conversation Recommendation: Session Trajectory
• What kinds of sessions to learn from?
  • Randomly chosen sessions are noisy and unfocused: people often multi-task in search sessions
  • Example: “Federal Tax Return”, “Flu Shot Codes 2018”, “Facebook”, “Flu Shot Billing Codes 2018”, “How Much is Flu Shot?” — the tax and Facebook queries don't belong
Multi-task Learning: Session Trajectory Imitation
• What kinds of sessions to learn from?
  • "Conversational" sessions: the subset of queries that all have some coherent relationship to each other, measured by GEN-Encoder similarity
(Slide figure: session query graph over the flu-shot example with GEN-Encoder similarity edge weights, e.g., 0.23, 0.61, 0.73, 0.89.)
Zhang et al. Generic Intent Representation in Web Search. SIGIR 2019
Multi-task Learning: Session Trajectory Imitation
• What kinds of sessions to learn from?
  • "Conversational" sessions: the subset of queries that all have some coherent relationship to each other
  1. Treat each session as a graph
  2. Edge weights are "GEN-Encoder similarity" (cosine similarity of query intent vector encodings)
  3. Remove edges with weight < 0.4
  4. Keep only the largest "connected component" of queries
  (a sketch of this filtering follows below)
Zhang et al. Generic Intent Representation in Web Search. SIGIR 2019
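A sketch of the four-step filtering above. The `encode` callable stands in for a query intent encoder such as the GEN-Encoder and is hypothetical here; the 0.4 threshold follows the slide.

```python
import itertools
import numpy as np
import networkx as nx

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def conversational_subsession(queries, encode, threshold=0.4):
    """Keep the largest subset of a session's queries that are mutually coherent."""
    g = nx.Graph()
    g.add_nodes_from(range(len(queries)))          # step 1: session as a graph
    vectors = [encode(q) for q in queries]
    for i, j in itertools.combinations(range(len(queries)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:   # steps 2-3: weight and prune edges
            g.add_edge(i, j)
    largest = max(nx.connected_components(g), key=len)    # step 4: largest connected component
    return [queries[i] for i in sorted(largest)]
```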
Method: Inductive Weak Supervision
• Learn to lead the conversation from the queries users search in the next turn
  • PAA tasks (BERT over "[CLS] Query [SEP] PAA Question"): user click, relevance, high/low CTR
  • Weak supervision from sessions (BERT over "[CLS] Query [SEP] Next-Turn Conversation Query"): user next-turn interaction