- The paper presents NewsQA, a machine comprehension dataset of 119,633 natural language questions obtained from 12,744 CNN articles.
- Link to the paper
- Existing datasets are either too small (e.g. MCTest) or use synthetically generated questions (e.g. the BookTest dataset).
- SQuAD is the most similar existing dataset but is not as challenging or diverse as NewsQA.
- Desired characteristics for the new dataset:
- Answers of arbitrary length instead of candidate answers to choose from.
- Some questions should have no correct answer in the document (null answers).
- Lexical and syntactic divergence between questions and answers.
- Questions that require reasoning beyond simple word and context matching.
-
Article Curation
- Retrieve and sample articles from CNN.
- Partition the articles into a training set (90%), a development set (5%), and a test set (5%); a sketch of such an article-level split is given below.
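A minimal sketch of how such an article-level 90/5/5 split could be produced; the function name and the fixed seed are illustrative, not from the paper's pipeline:

```python
import random

def split_articles(article_ids, seed=0):
    """Shuffle article ids and partition them 90/5/5 into train/dev/test."""
    ids = list(article_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.90 * len(ids))
    n_dev = int(0.05 * len(ids))
    # Splitting at the article level keeps all questions about an article
    # in the same partition.
    return (ids[:n_train],
            ids[n_train:n_train + n_dev],
            ids[n_train + n_dev:])
```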
-
Question Sourcing
- Questioners see only a news article's headline and its summary points (not the full article) and use these to formulate questions about the article.
-
Answer Sourcing
- Answerers receive the questions along with the full article.
- They either mark the answer span in the article, reject the question as nonsensical, or select the null answer if the article contains insufficient information.
-
Validation
- Another set of crowd workers sees the full article, a question and the set of unique answers to that question.
- They either choose the best answer among the candidates or reject all the answers.
-
Answer Types
- Linguistically diverse answer set with the following distribution:
- common noun phrases (22.2%), clause phrases (18.3%), persons (14.8%), numeric (9.8%), and other (11.2%) types.
-
Reasoning Types
- Types of reasoning required, in ascending order of difficulty, along with the approximate percentage of questions:
- Word Matching (32.7%)
- Paraphrasing (27%)
- Inference (13.2%)
- Synthesis (20.7%)
- Ambiguous/Insufficient (6.4%)
-
match-LSTM
- An LSTM network encodes the article and the question as sequences of hidden states.
- The match-LSTM (mLSTM) network attends over the question encodings for each article token and compares them with the article encodings.
- A Pointer Network uses the hidden states of the mLSTM to select the boundaries of the answer span; a rough sketch of the comparison step is given below.
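A minimal PyTorch sketch of the mLSTM comparison step. The boundary-pointing decoder is omitted and, unlike the original model, the attention here does not condition on the previous match state; all layer sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchLSTMSketch(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=150):
        super().__init__()
        # Preprocessing LSTMs encode the article (passage) and the question.
        self.passage_enc = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.question_enc = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # Attention parameters for comparing a passage state with the question states.
        self.attn_p = nn.Linear(hidden_dim, hidden_dim)
        self.attn_q = nn.Linear(hidden_dim, hidden_dim)
        self.attn_v = nn.Linear(hidden_dim, 1)
        # The match-LSTM consumes [passage state; attended question summary].
        self.match_lstm = nn.LSTMCell(2 * hidden_dim, hidden_dim)

    def forward(self, passage_emb, question_emb):
        H_p, _ = self.passage_enc(passage_emb)    # (B, P, H)
        H_q, _ = self.question_enc(question_emb)  # (B, Q, H)
        B, P, H = H_p.shape
        h = H_p.new_zeros(B, H)
        c = H_p.new_zeros(B, H)
        match_states = []
        for t in range(P):
            # Attention of the t-th passage token over all question tokens.
            scores = self.attn_v(torch.tanh(
                self.attn_q(H_q) + self.attn_p(H_p[:, t]).unsqueeze(1))).squeeze(-1)  # (B, Q)
            alpha = F.softmax(scores, dim=-1)
            q_summary = torch.bmm(alpha.unsqueeze(1), H_q).squeeze(1)  # (B, H)
            # Compare the passage encoding with its question summary via the match-LSTM.
            h, c = self.match_lstm(torch.cat([H_p[:, t], q_summary], dim=-1), (h, c))
            match_states.append(h)
        # A pointer network would consume these states to select the answer span boundaries.
        return torch.stack(match_states, dim=1)   # (B, P, H)
```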
-
Bilinear Annotation Re-encoding Boundary (BARB) Model
- Encode all the words in the article and the question with GloVe embeddings, and then into contextual states using a GRU.
- Compare the document and question encodings using C bilinear transformations to obtain a tensor of annotation scores.
- Take the maximum over the question-token dimension to obtain an annotation vector for each document word.
- For each document word, feed the document encoding, the annotation vector, and a binary feature (indicating whether the document word appears in the question) into a re-encoding RNN to obtain encodings for the boundary-pointing stage.
- Use convolutional networks to determine the boundaries of the answer span (similar to edge detection).
- For further details, refer to the paper; a rough sketch of these stages is given below.
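A minimal PyTorch sketch of the BARB stages listed above: GRU encoding, C bilinear annotation scores, a max over the question-token dimension, re-encoding, and a convolutional boundary scorer. Layer sizes, names, and the use of bidirectional GRUs here are illustrative assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class BARBSketch(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=128, n_channels=8):
        super().__init__()
        self.doc_enc = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.q_enc = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        enc = 2 * hidden_dim
        # C bilinear transformations comparing every (document word, question word) pair.
        self.W = nn.Parameter(torch.randn(n_channels, enc, enc) * 0.01)
        # Re-encoding GRU over [document encoding; annotation vector; in-question flag].
        self.re_enc = nn.GRU(enc + n_channels + 1, hidden_dim,
                             batch_first=True, bidirectional=True)
        # Convolutional boundary scorers (akin to edge detection over the sequence).
        self.start_conv = nn.Conv1d(enc, 1, kernel_size=3, padding=1)
        self.end_conv = nn.Conv1d(enc, 1, kernel_size=3, padding=1)

    def forward(self, doc_emb, q_emb, in_question):
        # doc_emb: (B, N, emb), q_emb: (B, M, emb), in_question: (B, N, 1) binary feature.
        D, _ = self.doc_enc(doc_emb)   # (B, N, enc)
        Q, _ = self.q_enc(q_emb)       # (B, M, enc)
        # Tensor of annotation scores: one score per (doc word, question word, channel).
        scores = torch.einsum('bnd,cde,bme->bnmc', D, self.W, Q)   # (B, N, M, C)
        # Max over the question-token dimension -> per-document-word annotation vector.
        annotation = scores.max(dim=2).values                      # (B, N, C)
        # Re-encode each document word together with its annotation and binary feature.
        R, _ = self.re_enc(torch.cat([D, annotation, in_question], dim=-1))  # (B, N, enc)
        # Boundary pointing: score each position as a start / end of the answer span.
        start_logits = self.start_conv(R.transpose(1, 2)).squeeze(1)  # (B, N)
        end_logits = self.end_conv(R.transpose(1, 2)).squeeze(1)      # (B, N)
        return start_logits, end_logits
```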
-
Observations
- The gap between human and machine performance on NewsQA is much larger than on SQuAD, probably because NewsQA articles are much longer.
- This suggests that NewsQA is a far more challenging dataset than SQuAD and leaves considerable room for improvement on machine comprehension tasks.
- Questions requiring inference and synthesis are more challenging for the model than other kinds of questions.
- Interestingly, BARB outperforms human annotators on SQuAD in terms of answering ambiguous questions or those with incomplete information.