This repository has been archived by the owner on Nov 22, 2020. It is now read-only.

Update README for how to use Neg-Ex datasets. #2

Merged 2 commits on Aug 1, 2017
upd readme
hatianzhang committed Jul 31, 2017
commit b9f567d51386f5a7dfa0e9c7e35842ba7c8d1f82
14 changes: 11 additions & 3 deletions README.md
@@ -4,9 +4,17 @@ This repo contains the implementation of extracting high quality training exampl
+ Haotian Zhang, Jinfeng Rao, Jimmy Lin and Mark Smucker. Automatically Extracting High-Quality Negative Examples for Answer Selection in Question Answering. SIGIR 2017.


## Negative Example DataSet
We provide the top k = {1, 3, 5, 7} negative examples for each (question, answer) pair in the TrecQA train-all set. The example sets are located in the `NegExSets/` folder and are named `splitDocNegTopk.tgz`.
# Negative Example DataSet
* We provide the top k = {1, 3, 5, 7} negative examples for each (question, answer) pair in the TrecQA train-all set. The example sets are located in the `NegExSets/` folder and are named `splitDocNegTopk.tgz`.

* After uncompressing a NegExSet with ```tar zxvf splitDocNegTopk.tgz```, you will see the negative examples for each (question, answer) pair. Each negative example is a single sentence extracted from the document that contains the answer. Each example is named as follows:

** `ID of answer` + `relevance` + `ID of doc` + `ID of sentence`

* `ID of answer` is the ID of the answer. The TrecQA train-all set contains 56082 (question, answer) pairs in total, so answer IDs range from 1 to 56082.
* `relevance` is the relevance label of the extracted example. If the relevance is 1, the example is the answer itself; otherwise, it is one of the top k negative examples for that answer.
* `ID of doc` is the ID of the document that contains the answer. All negative example sentences come from this document.
* `ID of sentence` is the ID of the extracted sentence. IDs start at 0 (the first sentence of the document), and their range is determined by the number of sentences in the document.
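
The naming scheme above can be read back programmatically. The following is a minimal sketch, not part of the repository: the README does not state which character joins the four fields, so the `sep="_"` default, the `NegExName` type, and both function names are assumptions introduced here for illustration. Adjust `sep` to match the names you actually see after extracting an archive.

```python
from typing import NamedTuple


class NegExName(NamedTuple):
    """Hypothetical container for the four fields of an example name."""
    answer_id: int    # 1..56082 for the TrecQA train-all set
    relevance: int    # 1 = the answer itself, otherwise a negative example
    doc_id: str       # document that contains the answer
    sentence_id: int  # 0-based index of the sentence within the document


def parse_example_name(name: str, sep: str = "_") -> NegExName:
    """Split an example name into its four fields.

    ASSUMPTION: `sep` is a guess; the README only lists the field order.
    """
    # Split from both ends so a doc ID that itself contains `sep` stays intact.
    answer_id, relevance, rest = name.split(sep, 2)
    doc_id, sentence_id = rest.rsplit(sep, 1)
    return NegExName(int(answer_id), int(relevance), doc_id, int(sentence_id))


def is_original_answer(example: NegExName) -> bool:
    # Relevance 1 marks the answer itself; any other value is a negative example.
    return example.relevance == 1
```

For instance, under the assumed separator, `parse_example_name("12_0_DOC42_3")` would yield answer 12, relevance 0, document `DOC42` (a made-up ID), and sentence 3, which `is_original_answer` would classify as a negative example.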

## Prepare TrecQA DataSet
Please download the TrecQA Dataset and refer to: https://github.com/castorini/data/tree/master/TrecQA
Expand Down Expand Up @@ -68,4 +76,4 @@ $ python selectLowestShingleDist.py --input=shingledist.qaans.list
4.Select the sentences in the document with the lowest shingle matching scores matching the question/answer.
```
$ python splitSentence.py shingledist.ans.doc.pair.top1.list splitDoc
```
```