Learning to Automatically Generate Fill-In-The-Blank Quizzes
NLPTEA 2018
Edison Marrese-Taylor, Ai Nakajima, Yutaka Matsuo, Yuichi Ono
Graduate School of Engineering, The University of Tokyo
Table of contents
1. Introduction
2. Proposed Approach
3. Empirical Study
4. Conclusions
Introduction
Background
Background: Web-Based Automatic Quiz Generation
• Fill-in-the-blank questions (CQ)
Automatic Quiz Generation
Multiple-choice questions (MCQs)
• Commonly used for evaluating knowledge and reading comprehension skills
Fill-in-the-blank questions as cloze questions (CQs)
• Commonly used for evaluating the proficiency of language learners
Fill-in-the-blank questions (FIB)
• Commonly used for evaluating listening skills
• Number of blanks and words selected as blanks
• Easy to automate
Related Work
• Sumita et al. (2005) proposed a cloze question generation system which focuses on distractor generation using search engines to automatically measure English proficiency.
• Lee and Seneff (2007), Lin et al. (2007), Pino et al. (2008) and Goto et al. (2009): machine learning techniques for multiple-choice cloze question generation.
• Sakaguchi et al. (2013): a discriminative approach based on SVM classifiers for distractor generation and selection using a large-scale language learners' corpus.
• Narendra and Agarwal (2013) present a system which adopts a semi-structured approach to generate CQs by making use of a knowledge base extracted from a Cricket portal.
• Lin et al. (2015): a generic semi-automatic system for quiz generation using linked data and textual descriptions of RDF resources.
• Kumar et al. (2015): an approach for automatic CQ generation for student self-assessment.
Our Work: Contributions
• We formalize the problem of automatic fill-in-the-blank question generation.
• We present an empirical study using deep learning models for fill-in-the-blank question generation in the context of foreign language learning.
Proposed Approach
Approach: Formalizing the AQG Problem
• We consider a training corpus of $N$ pairs $(S_n, C_n)$, $n = 1 \dots N$, where $S_n = s_1, \dots, s_{L(S_n)}$ is a sequence of $L(S_n)$ tokens and $C_n \in [1, L(S_n)]$ is an index that indicates the position that should be blanked inside $S_n$.
• This setting allows us to train from examples of single blank-annotated sentences. In this way, in order to obtain a sentence with several blanks, multiple passes over the model are required.
• This approach works in a way analogous to humans, where blanks are provided one at a time.
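A minimal sketch (not the authors' code; all function names are illustrative assumptions) of how one blank-annotated pair $(S_n, C_n)$ can be represented for the two learning schemes presented on the following slides:

```python
def make_labeling_example(tokens, blank_index):
    """Sequence labeling view: one binary label per token,
    positive only at the blanked position C_n."""
    return tokens, [1 if i == blank_index else 0 for i in range(len(tokens))]

def make_classification_example(tokens, blank_index):
    """Sequence classification view: the target is the position itself,
    so the output 'dictionary' has size L(S_n)."""
    return tokens, blank_index

tokens = ["The", "dog", "is", "barking"]
print(make_labeling_example(tokens, 1))        # (['The', 'dog', 'is', 'barking'], [0, 1, 0, 0])
print(make_classification_example(tokens, 1))  # (['The', 'dog', 'is', 'barking'], 1)
```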
AQG as Sequence Labeling (1)
• Embedded input sequence $S_n = s_1, \dots, s_{L(n)}$.
• The label sequence is created by simply creating a one-hot vector of size $L(S_n)$ for the given class $C_n$, i.e. a sequence of binary classes $Y_n = y_1, \dots, y_{L(n)}$, where only one item (the one in position $C_n$) belongs to the positive class.
• We model the conditional probability of an output label using the classic sequence labeling scheme, as follows:

$\hat{p}(Y_n \mid S_n) \propto \prod_{i=1}^{L(n)} \hat{y}_i$  (1)

$\hat{y}_i = H(y_{i-1}, y_i, s_i)$  (2)
AQG as Sequence Labeling (2)
• We model the function $H$ using a bidirectional LSTM (Hochreiter and Schmidhuber, 1997):

$\overrightarrow{h}_i = \mathrm{LSTM}_{fw}(\overrightarrow{h}_{i-1}, x_i)$  (3)

$\overleftarrow{h}_i = \mathrm{LSTM}_{bw}(\overleftarrow{h}_{i+1}, x_i)$  (4)

$\hat{y}_i = \mathrm{softmax}([\overrightarrow{h}_i ; \overleftarrow{h}_i])$  (5)

• The loss function is the average cross entropy for the mini-batch, between the label distribution $\hat{y}_t$ and the real labels $y_t$.

Figure 1: Our sequence labeling model, based on an LSTM, for AQG.
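A hedged PyTorch sketch of this sequence labeling model (Eqs. 3-5). The layer sizes follow the values reported later in the deck (2 layers, 300-dimensional embeddings and hidden states, dropout 0.2, vocabulary of 66,431 tokens); the framework, class name, and dummy batch are assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=300, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.dropout = nn.Dropout(0.2)              # dropout before and after the LSTM
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_labels)  # softmax over [h_fw ; h_bw], Eq. (5)

    def forward(self, token_ids):                   # (batch, seq_len) word indices
        x = self.dropout(self.embed(token_ids))     # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)                         # (batch, seq_len, 2 * hidden_dim)
        return self.out(self.dropout(h))            # per-token logits; softmax applied in the loss

model = BiLSTMTagger(vocab_size=66431)
logits = model(torch.randint(0, 66431, (8, 20)))    # dummy batch of 8 sentences, 20 tokens each
labels = torch.zeros(8, 20, dtype=torch.long)       # 0 = O, 1 = BLANK
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 2), labels.reshape(-1))
```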
AQG as Sequence Classification (1)
The variable-class-size problem
• The output of the model is a position in the input sequence $S_n$, so the size of the output dictionary for $C_n$ is variable and depends on $S_n$.
• Regular sequence classification models use a softmax distribution over a fixed output dictionary to compute $p(C_n \mid S_n)$, and therefore are not suitable for our case.
Proposed solution
• We propose to use an attention-based approach that allows us to have a variable-size dictionary for the output softmax, in a way akin to Pointer Networks (Vinyals et al., 2015).
AQG as Sequence Classification (2)
• Embedded input vector sequence $S_n = s_1, \dots, s_{L(n)}$:

$\overrightarrow{h}_i = \mathrm{LSTM}_{fw}(\overrightarrow{h}_{i-1}, x_i)$  (6)

$\overleftarrow{h}_i = \mathrm{LSTM}_{bw}(\overleftarrow{h}_{i+1}, x_i)$  (7)

$h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$  (8)

$u = v^{\top} W [h_i ; \bar{h}]$  (9)

$p(C_n \mid S_n) = \mathrm{softmax}(u)$  (10)

• $W$ and $v$ are learnable parameters, $\bar{h}$ is obtained using pooling techniques such as max or mean, and the softmax normalizes the vector $u$ to be an output distribution over a dictionary of size $L(S_n)$.

Figure 2: Our sequence classification model, based on an LSTM, for AQG.
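A hedged sketch of the pointer-style scoring step (Eqs. 9-10), assuming mean pooling for $\bar{h}$ and PyTorch as the framework; the class name, dimensions, and dummy inputs are assumptions rather than the authors' released code:

```python
import torch
import torch.nn as nn

class PointerHead(nn.Module):
    """Scores each BiLSTM state h_i against a pooled sentence summary h_bar and
    applies a softmax over the scores, giving a distribution over L(S_n) positions."""
    def __init__(self, state_dim, att_dim=300):
        super().__init__()
        self.W = nn.Linear(2 * state_dim, att_dim, bias=False)   # W in Eq. (9)
        self.v = nn.Linear(att_dim, 1, bias=False)                # v in Eq. (9)

    def forward(self, h):                                 # h: (batch, seq_len, state_dim)
        h_bar = h.mean(dim=1, keepdim=True).expand_as(h)  # pooled summary (mean pooling)
        u = self.v(self.W(torch.cat([h, h_bar], dim=-1))) # u_i = v^T W [h_i ; h_bar]
        return torch.softmax(u.squeeze(-1), dim=-1)       # p(C_n | S_n) over positions, Eq. (10)

head = PointerHead(state_dim=600)                  # 600 = 2 x 300 for a BiLSTM with hidden size 300
probs = head(torch.randn(4, 15, 600))              # 4 sentences, 15 tokens each
predicted_blank_positions = probs.argmax(dim=-1)
```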
Empirical Study
Data and Pre-processing (1)
YouTutors
• We use our on-line language learning platform, YouTutors (Nakajima and Tomimatsu, 2013; Ono and Nakajima; Ono et al., 2017), to get data.
• YouTutors currently uses a rule-based system for AQG, but we would like to develop a more flexible approach.
• With this empirical study, we would like to test to what extent our proposed AQG models are able to encode the behavior of the rule-based system.

Figure 3: Quiz interface in YouTutors.
Data and Pre-processing (2)
Data
• We extracted anonymized user interaction data in the form of real quizzes, obtaining a corpus of approximately 300,000 sentences.
• We tokenize using CoreNLP (Manning et al., 2014) to obtain 1.5 million single-quiz-question training examples.
• We split this dataset using the regular 70/10/20 partition.
• We build the vocabulary using the train partition, with a minimum frequency of 1. We do not keep case information and obtain an unknown-token vocabulary of size 2,029, and a total vocabulary size of 66,431 tokens.
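A minimal sketch of the vocabulary step described above (lowercasing, a minimum-frequency threshold, rare words mapped to an unknown token); the helper names and the special symbols are illustrative assumptions, and tokenization with CoreNLP is assumed to have happened beforehand:

```python
from collections import Counter

def build_vocab(train_sentences, min_freq=1, specials=("<pad>", "<unk>")):
    """Build the vocabulary from the train partition only, lowercasing tokens
    and keeping words that appear at least min_freq times."""
    counts = Counter(tok.lower() for sent in train_sentences for tok in sent)
    words = [w for w, c in counts.items() if c >= min_freq]
    return {w: i for i, w in enumerate(list(specials) + words)}

def encode(sentence, stoi):
    """Map tokens to indices; out-of-vocabulary words become <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok.lower(), unk) for tok in sentence]

vocab = build_vocab([["The", "dog", "is", "barking"], ["The", "cat", "sleeps"]])
print(encode(["The", "dog", "sleeps", "loudly"], vocab))  # 'loudly' maps to <unk>
```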
Results on Sequence Labeling
• We use a 2-layer bidirectional LSTM, which we train using Adam (Kingma and Ba, 2014) with a learning rate of 0.001, clipping the gradient of our parameters to a maximum norm of 5. We use a word embedding size and hidden state size of 300 and add dropout (Srivastava et al., 2014) before and after the LSTM, using a drop probability of 0.2.
• For evaluation, since the label distribution is extremely unbalanced given the nature of the blanking scheme (there is only one positive-class example in each sentence), accuracy would be uninformative, so we use Precision, Recall and F1-Score over the positive class for development and evaluation.

Table 1: Results of the sequence labeling approach.

Set   | Loss   | Precision | Recall | F1-Score
Valid | 0.0037 | 88.35     | 88.81  | 88.58
Test  | 0.0037 | 88.56     | 88.34  | 88.80
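A hedged sketch of the optimization settings reported above (average cross entropy, gradient clipping to a maximum norm of 5, Adam with learning rate 0.001); the training-loop structure, function name, and toy stand-in model are assumptions for illustration only:

```python
import torch
import torch.nn as nn

def train_step(model, batch_inputs, batch_labels, optimizer, max_norm=5.0):
    """One optimization step: average cross entropy over the mini-batch,
    then gradient clipping before the Adam update."""
    optimizer.zero_grad()
    logits = model(batch_inputs)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)),
                                 batch_labels.reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item()

# Toy stand-in model just to make the sketch executable end to end.
toy_model = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 2))
optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.001)
train_step(toy_model, torch.randint(0, 100, (4, 10)),
           torch.zeros(4, 10, dtype=torch.long), optimizer)
```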
Results on Sequence Classification
• We use a 2-layer bidirectional LSTM, which we train using Adam with a learning rate of 0.001, also clipping the gradient of our parameters to a maximum norm of 5. We use a word embedding and hidden state size of 300, and add dropout with a drop probability of 0.2 before and after the LSTM.
• Our results for different pooling strategies showed no noticeable performance difference in preliminary experiments, so we report results using the last hidden state.
• For evaluation we use accuracy over the validation and test sets.

Table 2: Results of the sequence classification approach.

Set   | Loss   | Accuracy
Valid | 101.80 | 89.17
Test  | 102.30 | 89.31
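A small sketch of the accuracy metric used for this scheme: the predicted blank position (argmax of Eq. 10) is compared against the gold index $C_n$. The function name and the dummy inputs are assumptions:

```python
import torch

def position_accuracy(position_probs, gold_positions):
    """position_probs: (batch, seq_len) distribution over positions (Eq. 10);
    gold_positions: (batch,) gold blank indices C_n."""
    predictions = position_probs.argmax(dim=-1)
    return (predictions == gold_positions).float().mean().item()

probs = torch.softmax(torch.randn(4, 15), dim=-1)   # dummy model outputs
print(position_accuracy(probs, torch.tensor([2, 7, 0, 14])))
```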
Conclusions
Conclusions
• We have formalized the problem of automatic fill-in-the-blank quiz generation using two well-defined learning schemes: sequence classification and sequence labeling.
• We have proposed concrete architectures based on LSTMs to tackle the problem in both cases.
• We have presented an empirical study, showing that both proposed training schemes seem to offer fairly good results, with an Accuracy/F1-Score of nearly 90%.
Future Work
Model Improvements
• Use pre-trained word embeddings and other features to further improve our results.
• Test the power of the models in capturing different quiz styles from real questions created by professors.
• Train different models for specific quiz difficulty levels.
Platform Improvements
• It seems possible to transition from a heavily hand-crafted approach for AQG to a learning-based approach, trained on examples derived from the platform's unlabeled data.
• We are eager to deploy our trained models on our platform and receive feedback.
Questions?