- The paper presents NewsQA, a machine comprehension dataset of 119,633 natural language questions obtained from 12,744 CNN articles.
- Link to the paper
- Existing datasets are either too small (e.g. MCTest) or use synthetically generated questions (e.g. the BookTest dataset).
- SQuAD is the most similar existing dataset but is not as challenging or diverse as NewsQA.
- Desired characteristics for the new dataset:
- Answers of arbitrary length instead of candidate answers to choose from.
- Some questions should have no correct answer in the document (null answers).
- Lexical and syntactic divergence between questions and answers.
- Questions that require reasoning beyond simple word and context matching.
-
Article Curation
- Retrieve and sample articles from CNN.
- Partition the articles into a training set (90%), a development set (5%), and a test set (5%); a sketch of such an article-level split is given below.
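A minimal sketch of how such an article-level 90/5/5 split could be produced; the function name and the fixed seed are illustrative, not from the paper's pipeline:

```python
import random

def split_articles(article_ids, seed=0):
    """Shuffle article ids and partition them 90/5/5 into train/dev/test."""
    ids = list(article_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.90 * len(ids))
    n_dev = int(0.05 * len(ids))
    # Splitting at the article level keeps all questions about an article
    # in the same partition.
    return (ids[:n_train],
            ids[n_train:n_train + n_dev],
            ids[n_train + n_dev:])
```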
-
Question Sourcing
- Questioners see only a news article's headline and its summary points (not the full article) and use these to formulate questions about the article.
-
Answer Sourcing
- Answerers receive the questions along with the full article.
- They either mark the answer span in the article, reject the question as nonsensical, or select the null answer if the article contains insufficient information.
-
Validation
- Another set of crowd workers sees the full article, a question and the set of unique answers to that question.
- They either choose the best answer among the candidates or reject all the answers.
-
Answer Types
- Linguistically diverse answer set with the following distribution:
- common noun phrases (22.2%), clause phrases (18.3%), persons (14.8%), numeric (9.8%), and other (11.2%) types.
-
Reasoning Types
- Types of reasoning required, in ascending order of difficulty, along with the approximate percentage of questions:
- Word Matching (32.7%)
- Paraphrasing (27%)
- Inference (13.2%)
- Synthesis (20.7%)
- Ambiguous/Insufficient (6.4%)
-
match-LSTM
- An LSTM network encodes the article and the question as sequences of hidden states.
- The match-LSTM (mLSTM) network attends over the question encodings for each article token and compares them with the article encodings.
- A Pointer Network uses the hidden states of the mLSTM to select the boundaries of the answer span; a rough sketch of the comparison step is given below.
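A minimal PyTorch sketch of the mLSTM comparison step. The boundary-pointing decoder is omitted and, unlike the original model, the attention here does not condition on the previous match state; all layer sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchLSTMSketch(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=150):
        super().__init__()
        # Preprocessing LSTMs encode the article (passage) and the question.
        self.passage_enc = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.question_enc = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # Attention parameters for comparing a passage state with the question states.
        self.attn_p = nn.Linear(hidden_dim, hidden_dim)
        self.attn_q = nn.Linear(hidden_dim, hidden_dim)
        self.attn_v = nn.Linear(hidden_dim, 1)
        # The match-LSTM consumes [passage state; attended question summary].
        self.match_lstm = nn.LSTMCell(2 * hidden_dim, hidden_dim)

    def forward(self, passage_emb, question_emb):
        H_p, _ = self.passage_enc(passage_emb)    # (B, P, H)
        H_q, _ = self.question_enc(question_emb)  # (B, Q, H)
        B, P, H = H_p.shape
        h = H_p.new_zeros(B, H)
        c = H_p.new_zeros(B, H)
        match_states = []
        for t in range(P):
            # Attention of the t-th passage token over all question tokens.
            scores = self.attn_v(torch.tanh(
                self.attn_q(H_q) + self.attn_p(H_p[:, t]).unsqueeze(1))).squeeze(-1)  # (B, Q)
            alpha = F.softmax(scores, dim=-1)
            q_summary = torch.bmm(alpha.unsqueeze(1), H_q).squeeze(1)  # (B, H)
            # Compare the passage encoding with its question summary via the match-LSTM.
            h, c = self.match_lstm(torch.cat([H_p[:, t], q_summary], dim=-1), (h, c))
            match_states.append(h)
        # A pointer network would consume these states to select the answer span boundaries.
        return torch.stack(match_states, dim=1)   # (B, P, H)
```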
-
Bilinear Annotation Re-encoding Boundary (BARB) Model
- Encode all the words in the article and the question with GloVe embeddings, and then into contextual states using a GRU.
- Compare the document and question encodings using C bilinear transformations to obtain a tensor of annotation scores.
- Take the maximum over the question-token dimension to obtain an annotation vector for each document word.
- For each document word, feed the document encoding, the annotation vector, and a binary feature (indicating whether the document word appears in the question) into a re-encoding RNN to obtain encodings for the boundary-pointing stage.
- Use convolutional networks to determine the boundaries of the answer span (similar to edge detection).
- For further details, refer to the paper; a rough sketch of these stages is given below.
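A minimal PyTorch sketch of the BARB stages listed above: GRU encoding, C bilinear annotation scores, a max over the question-token dimension, re-encoding, and a convolutional boundary scorer. Layer sizes, names, and the use of bidirectional GRUs here are illustrative assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class BARBSketch(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=128, n_channels=8):
        super().__init__()
        self.doc_enc = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.q_enc = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        enc = 2 * hidden_dim
        # C bilinear transformations comparing every (document word, question word) pair.
        self.W = nn.Parameter(torch.randn(n_channels, enc, enc) * 0.01)
        # Re-encoding GRU over [document encoding; annotation vector; in-question flag].
        self.re_enc = nn.GRU(enc + n_channels + 1, hidden_dim,
                             batch_first=True, bidirectional=True)
        # Convolutional boundary scorers (akin to edge detection over the sequence).
        self.start_conv = nn.Conv1d(enc, 1, kernel_size=3, padding=1)
        self.end_conv = nn.Conv1d(enc, 1, kernel_size=3, padding=1)

    def forward(self, doc_emb, q_emb, in_question):
        # doc_emb: (B, N, emb), q_emb: (B, M, emb), in_question: (B, N, 1) binary feature.
        D, _ = self.doc_enc(doc_emb)   # (B, N, enc)
        Q, _ = self.q_enc(q_emb)       # (B, M, enc)
        # Tensor of annotation scores: one score per (doc word, question word, channel).
        scores = torch.einsum('bnd,cde,bme->bnmc', D, self.W, Q)   # (B, N, M, C)
        # Max over the question-token dimension -> per-document-word annotation vector.
        annotation = scores.max(dim=2).values                      # (B, N, C)
        # Re-encode each document word together with its annotation and binary feature.
        R, _ = self.re_enc(torch.cat([D, annotation, in_question], dim=-1))  # (B, N, enc)
        # Boundary pointing: score each position as a start / end of the answer span.
        start_logits = self.start_conv(R.transpose(1, 2)).squeeze(1)  # (B, N)
        end_logits = self.end_conv(R.transpose(1, 2)).squeeze(1)      # (B, N)
        return start_logits, end_logits
```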
-
Observations
- The gap between human and machine performance on NewsQA is much larger than on SQuAD, probably because NewsQA articles are much longer.
- This suggests that NewsQA is a far more challenging dataset than SQuAD and leaves considerable room for improvement on machine comprehension tasks.
- Questions requiring inference and synthesis are more challenging for the model than other kinds of questions.
- Interestingly, BARB outperforms human annotators on SQuAD in terms of answering ambiguous questions or those with incomplete information.