[Confidential] YJTI at the NTCIR-13 STC Japanese Subtask
Dec. 7, 2017
Toru Shimizu
Overview
Retrieval or Generation
• Retrieval-based system
  – Effective if you have a good matching model and enough candidate responses
  – Pros
    • Human-written, fluent sentences for responses
    • The conversation can sometimes actually be interesting.
    • Hence more practical
  – Cons
    • Lack of flexibility
      – This can be mitigated with a large number of candidates and variety among them.
      – 1.2M unique sentences in the training data
Architecture
• DSSM (Deep Structured Semantic Model)
  – Huang et al., 2013
  – A method for IR: query-document matching
  – [Diagram: a query encoder maps the query to a vector z_Q; a document encoder maps the document to a vector z_D]
• LSTM-DSSM
  – Palangi et al., 2014
  – LSTM-RNN for generating query and document representations
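A minimal sketch of the DSSM matching idea: a query and a document are encoded into vectors, and relevance is scored by cosine similarity. The `encode_query`/`encode_document` functions below are toy stand-ins for the trained encoders (the real models are LSTM-RNNs producing 1024-dimensional vectors); their outputs here are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    # DSSM scores a (query, document) pair by the cosine of their encoded vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the trained encoders (4-dim vectors for brevity).
def encode_query(text):
    return [1.0, 0.5, 0.0, 0.2]

def encode_document(text):
    return [0.9, 0.4, 0.1, 0.3]

z_q = encode_query("ただいま")
z_d = encode_document("おかえり")
score = cosine_similarity(z_q, z_d)
```

In the retrieval setting of this system, "query" is the input comment and "document" is a candidate reply; the pair with the highest cosine score wins.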
The Overall Process: Three Stages
1. Model Training
   ・ Train two models: a comment encoder and a reply encoder
2. Reply Text Preparation and Indexing
   ・ Preprocess the training data to obtain candidate replies
   ・ Generate vector representations of the replies
   ・ Build the reply index
3. Runtime
   ・ Produce actual reply lists using the runtime system
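The three stages above can be sketched as a single driver function. Everything here is an illustrative placeholder (the length-based "encoders" and the function name `run_pipeline` are invented for this sketch), not the authors' actual code; it only shows how the stages hand data to one another.

```python
def run_pipeline(training_pairs, incoming_comments):
    # Stage 1: model training (stubbed out; real training is described later).
    comment_encoder = lambda text: [float(len(text))]  # placeholder encoder
    reply_encoder = lambda text: [float(len(text))]    # placeholder encoder

    # Stage 2: reply text preparation and indexing.
    candidate_replies = sorted({reply for _, reply in training_pairs})
    index = [(reply_encoder(r), r) for r in candidate_replies]

    # Stage 3: runtime -- encode each comment, rank candidates by distance.
    results = []
    for comment in incoming_comments:
        z = comment_encoder(comment)
        ranked = sorted(index, key=lambda entry: abs(entry[0][0] - z[0]))
        results.append([reply for _, reply in ranked[:10]])
    return results
```

The real system replaces the stubs with trained LSTM encoders and an NGT index, but the data flow is the same.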
Submissions
• Two runs:
  – YJTI-J-R1
    • Trained on Twitter conversation data
  – YJTI-J-R2
    • Trained mainly on Yahoo! Chiebukuro QA data
• The runtime system is the same; only the models differ.
Runtime System
【公開】 Runtime System Overview Model training Runtime stage stage query ・ data comment (comment) encoder model ・ component reply comment encoder model encoder comment Reply text vector preparation and top-200 indexing stage replies retriever candidate ranker reply replies vectors top-10 reply ranked replies encoder 8
Runtime System Overview: Software Components
[Same diagram, software components highlighted]

Runtime System Overview: Data
[Same diagram, data artifacts highlighted]

Runtime System Overview: The 1st Stage
[Same diagram, model training highlighted]

Runtime System Overview: The 2nd Stage
[Same diagram, reply text preparation and indexing highlighted]

Runtime System Overview: The 3rd Stage
[Same diagram, runtime highlighted]
Indexer and Retriever
• Generate 1024-element representations of reply candidates with the reply encoder model
• NGT
  – Open-source software for graph-based approximate similarity search over dense vectors
  – Developed by M. Iwasaki
  – https://research-lab.yahoo.co.jp/software/ngt/
• Retrieve the 200 nearest reply vectors for a given comment vector
  – L2 distance, cosine similarity
• Return the list of their texts and metadata
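The production system uses NGT for approximate graph-based search; the sketch below shows the same retrieval step as an exact brute-force scan over a tiny in-memory index, which is functionally equivalent at small scale. The `retrieve` function and the index layout are illustrative, not NGT's API.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(comment_vector, reply_index, k=200):
    """Return the k most similar (score, text, metadata) entries.

    reply_index is a list of (vector, text, metadata) tuples; NGT replaces
    this linear scan with graph-based approximate search in production.
    """
    scored = ((cosine(comment_vector, vec), text, meta)
              for vec, text, meta in reply_index)
    return heapq.nlargest(k, scored, key=lambda entry: entry[0])

# Toy 2-dim index; the real vectors are 1024-dim.
index = [
    ([1.0, 0.0], "おかえり", {"theme": "greeting"}),
    ([0.0, 1.0], "そうですね", {"theme": "agreement"}),
    ([0.9, 0.1], "おかえりなさい", {"theme": "greeting"}),
]
top = retrieve([1.0, 0.0], index, k=2)
```

The metadata returned alongside each reply text is what the ranker's tier logic consumes next.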
Ranker
• Three tiers for dealing with metadata matching: THEME, GENRE, and OTHER
  – THEME: the theme matches between the comment and a reply (at most 3 replies)
  – GENRE: the genre matches between the comment and a reply (at most 3 replies)
  – OTHER: no metadata match (no limit on the number)
• The final top-10 replies are filled in this tier order.
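The tier logic above can be sketched as follows. The dictionary field names (`theme`, `genre`) are assumptions for illustration; within each tier the retriever's original order is preserved.

```python
def rank(comment_meta, candidates, theme_cap=3, genre_cap=3, total=10):
    """Fill the final reply list tier by tier: THEME matches (at most 3),
    then GENRE matches (at most 3), then everything else, keeping each
    tier's internal retrieval order."""
    theme, genre, other = [], [], []
    for reply in candidates:
        if reply["theme"] == comment_meta["theme"] and len(theme) < theme_cap:
            theme.append(reply)
        elif reply["genre"] == comment_meta["genre"] and len(genre) < genre_cap:
            genre.append(reply)
        else:
            other.append(reply)
    return (theme + genre + other)[:total]
```

Fed the top-200 candidates from the retriever, this yields the final top-10 list with theme matches first.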
Model, Data, and Training
Comment/Reply Encoder Model
• 3-layer LSTM-RNN
  – Formulation: Graves, 2013
  – LSTM hidden layer size: 1024 (for all three layers)
  – Embedding layer size: 256
  – Representation size: 1024
[Diagram: embedding layer → LSTM-RNN 1 → LSTM-RNN 2 → LSTM-RNN 3 → output layer, producing the representation z from the input tokens <s> た だ い ま </s>]
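A simplified sketch of the LSTM recurrence used by the encoder, with scalar states for brevity (the actual model stacks three layers with hidden size 1024) and with the standard gates only, omitting the peephole connections of Graves's formulation. All weight names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step for scalar input/hidden state (dimension 1 for brevity).
    W maps each gate to (input weight, recurrent weight, bias)."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])   # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])   # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])   # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2]) # candidate cell
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

def encode(token_values, W):
    # The hidden state after consuming the whole token sequence serves as
    # the sentence representation z (scalar here; 1024-dim in the system).
    h, c = 0.0, 0.0
    for x in token_values:
        h, c = lstm_step(float(x), h, c, W)
    return h

W = {gate: (0.5, 0.5, 0.0) for gate in ("i", "f", "o", "g")}
z = encode([1, 2, 3], W)
```

The same recurrence, with separate weights, is used for both the comment encoder and the reply encoder.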
Comment/Reply Encoder Model
• Training
  – The comment encoder encodes Q (e.g. ただいま, "I'm home") into z_Q; the reply encoder encodes D (e.g. おかえり, "welcome back") into z_D.
  – Consider this as a classification problem and maximize the probability of the right choice over a given dataset.
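The classification view above can be written as a softmax over similarity scores: the true reply competes against sampled wrong replies, and training maximizes the probability assigned to the true one. The scaling factor `gamma` is an assumed hyperparameter (a smoothing temperature, as in the original DSSM paper); the function name and vectors are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def correct_reply_prob(z_q, z_pos, z_negs, gamma=10.0):
    """Softmax probability of the true reply against sampled negatives.
    Training maximizes this probability, i.e. minimizes -log of it."""
    scores = [cosine(z_q, z_pos)] + [cosine(z_q, z) for z in z_negs]
    exps = [math.exp(gamma * s) for s in scores]
    return exps[0] / sum(exps)

z_q = [1.0, 0.0]                      # encoding of the comment
z_pos = [0.9, 0.1]                    # encoding of the true reply
z_negs = [[0.0, 1.0], [-1.0, 0.0]]    # encodings of sampled wrong replies
p = correct_reply_prob(z_q, z_pos, z_negs)
loss = -math.log(p)
```

Gradients of this loss flow back through both encoders, which is what makes the comment and reply representations comparable in one vector space.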
Comment/Reply Encoder Model
• Training (cont'd)

  run        | model type | data name            | records consumed
  YJTI-J-R1  | DSSM       | Twitter conversation | 135.0M
  YJTI-J-R2  | LM         | Y! Chiebukuro LM     | 171.5M
             | DSSM       | Twitter conversation | 85.8M
             | DSSM       | Y! Chiebukuro QA     | 42.9M
Data for Model Training

  name                 | type  | no. of records
  Twitter LM           | posts | 100.0M
  Twitter conversation | pairs | 65.1M
  Y! Chiebukuro LM     | posts | 202.0M
  Y! Chiebukuro QA     | pairs | 66.3M
Results
Analysis and Results
• Performance measured on the validation data
Analysis and Results
• The official results under Rule-2
Conclusions
• Effectiveness of the overall approach:
  – Retrieval-based system
  – DSSM-like matching powered by LSTM-RNNs trained on a large amount of linguistic resources
• Social QA data was surprisingly useful for modeling the topic-oriented conversations seen in this Yahoo! News comments data.