Implements IDF baselines for QA datasets.
Git clone castorini/data to get the TrecQA and WikiQA datasets.
Follow the instructions in TrecQA/README.txt and WikiQA/README.txt to process the data into a standard format.
After running the respective scripts, you should have the following directory structure in castorini/data/TrecQA:
├── raw-dev
├── raw-test
├── train
└── train-all
and the following directories in castorini/data/WikiQA:
.
├── dev
├── test
└── train
Each directory will have the following files:
├── a.toks   : question[i]
├── b.toks   : answer[i]
├── id.txt   : question_id[i]
└── sim.txt  : label[i]
where 1 <= i <= (number of QA pairs in respective splits of the data)
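The four files are line-aligned, so line i of each file describes the same QA pair. A minimal sketch of reading one split follows. It assumes the token files are whitespace-separated and that sim.txt holds one integer label per line; the names QAPair and load_split are hypothetical helpers and are not part of the repository.

```python
from pathlib import Path
from typing import List, NamedTuple


class QAPair(NamedTuple):
    question_id: str
    question: List[str]
    answer: List[str]
    label: int


def load_split(split_dir: str) -> List[QAPair]:
    """Read the four line-aligned files of one split into QAPair records."""
    root = Path(split_dir)

    def read_lines(name: str) -> List[str]:
        return (root / name).read_text(encoding="utf-8").splitlines()

    ids = read_lines("id.txt")
    questions = read_lines("a.toks")
    answers = read_lines("b.toks")
    labels = read_lines("sim.txt")
    # The format relies on all four files having the same number of lines.
    if not (len(ids) == len(questions) == len(answers) == len(labels)):
        raise ValueError(f"files in {split_dir} are not line-aligned")
    # Assumption: tokens are whitespace-separated and labels are integers.
    return [
        QAPair(qid.strip(), q.split(), a.split(), int(lab))
        for qid, q, a, lab in zip(ids, questions, answers, labels)
    ]
```

For example, load_split("castorini/data/WikiQA/test") would return one record per line of id.txt.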
We need to index the source corpus from which the question-answer pairs are derived in order to get the IDF weights of the terms.
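For intuition, the IDF weight of a term can be sketched in a few lines. This is only an illustration using the textbook formula log(N / df); the exact formula and smoothing used by the Lucene-based scorer may differ, and compute_idf is a hypothetical name.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def compute_idf(documents: Iterable[List[str]]) -> Dict[str, float]:
    """Return log(N / df) per term, where df is the number of documents containing it."""
    doc_freq: Counter = Counter()
    num_docs = 0
    for tokens in documents:
        num_docs += 1
        doc_freq.update(set(tokens))  # count each term once per document
    # Assumption: unsmoothed IDF; a term present in every document gets weight 0.
    return {term: math.log(num_docs / df) for term, df in doc_freq.items()}
```

Rare terms receive large weights and very common terms receive weights near zero, which is why the statistics are taken from the full source corpus rather than from a handful of sentences.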
1. Clone and compile Anserini
git clone https://github.com/castorini/Anserini.git
cd Anserini
mvn clean package appassembler:assemble
First, download the Wikipedia dump by running the following command:
mkdir WikiQACollection
for line in $(cat idf_baseline/src/main/resources/WikiQA/wikidump-list.txt); do wget $line -P WikiQACollection; done
To index the collection:
cd Anserini
nohup sh target/appassembler/bin/IndexCollection -collection WikipediaCollection -input ../WikiQACollection \
-generator JsoupGenerator -index lucene.index.wikipedia.pos.docvectors -threads 32 -storePositions \
-storeDocvectors -optimize > log.wikipedia.pos.docvectors &
Create a new directory called TrecQACollection:
mkdir TrecQACollection
Copy the contents of disk1, disk2, disk3, disk4, and AQUAINT to TrecQACollection
To index the collection:
cd Anserini
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection -input [path of TrecQACollection] \
-generator JsoupGenerator -index lucene.index.trecQA.pos.docvectors -threads 32 -storePositions \
-storeDocvectors -optimize > log.trecQA.pos.docvectors &
Build the IDF scorer
cd castorini/Castor/idf_baseline
mvn clean package appassembler:assemble
Run the following command to score each answer with an IDF value:
sh target/appassembler/bin/GetIDFSumSimilarity -index ~/large-local-work/indices/index.wikipedia.pos.docvectors -config ../../data/WikiQA/test -output WikiQA.test.idfsim
The above command will create a run file in the trec_eval format and a qrel file at the location specified by -output.
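The scoring and the output format can be sketched as follows. This is not the scorer's actual implementation: it assumes the score is the sum of IDF weights over distinct question terms that also occur in the answer, and that terms missing from the IDF table contribute 0. The function names and the run tag "idfbaseline" are hypothetical. The line layouts are the standard trec_eval ones: "qid Q0 docid rank score tag" for runs and "qid 0 docid rel" for qrels.

```python
from typing import Dict, List, Tuple


def idf_sum_similarity(question: List[str], answer: List[str],
                       idf: Dict[str, float]) -> float:
    """Sum the IDF weights of distinct question terms that also occur in the answer."""
    answer_terms = set(answer)
    # Assumption: terms without an IDF entry contribute nothing to the score.
    return sum(idf.get(term, 0.0) for term in set(question) if term in answer_terms)


def format_run(qid: str, scored: List[Tuple[str, float]],
               tag: str = "idfbaseline") -> List[str]:
    """Rank candidates by descending score and emit trec_eval run lines."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [f"{qid} Q0 {doc_id} {rank} {score:.4f} {tag}"
            for rank, (doc_id, score) in enumerate(ranked, start=1)]


def format_qrels(qid: str, labels: List[Tuple[str, int]]) -> List[str]:
    """Emit trec_eval qrel lines, one per judged candidate answer."""
    return [f"{qid} 0 {doc_id} {label}" for doc_id, label in labels]
```

Each candidate answer plays the role of a document, so the run file ranks the answers of one question against each other.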
Possible parameters are:
-index (required): path of the index
-config (required): configuration of this experiment, i.e., dev, train, train-all, test, etc.
-output (required): path of the run file to be created
-analyze: if specified, the scorer uses EnglishAnalyzer for removing stopwords and performing stemming. In addition to the default list, the analyzer uses NLTK's stopword list.
To calculate MAP/MRR for the above run file:
- Download and install trec_eval, then run:
eval/trec_eval.9.0/trec_eval -m map -m recip_rank <qrel-file> <run-file>
For the WikiQA dataset
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../data/WikiQA/WikiQACorpus/WikiQA-$set.ref WikiQA.$set.idfsim
For the TrecQA dataset
../../Anserini/eval/trec_eval.9.0/trec_eval -m map ../../data/TrecQA/$set.qrel TrecQA.$set.idfsim
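To make the two metrics concrete, here is a small sketch of MAP and MRR computed from ranked relevance labels. It is not a replacement for trec_eval. It assumes every relevant answer of a question appears in its ranked list and that questions without any relevant answer are left out of the averages; the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(ranked_labels: List[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant answer."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked_labels: List[int]) -> float:
    """1 / rank of the first relevant answer, or 0 if there is none."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label > 0:
            return 1.0 / rank
    return 0.0


def map_and_mrr(rankings: Dict[str, List[int]]) -> Tuple[float, float]:
    """Average both metrics over questions that have at least one relevant answer."""
    judged = [labels for labels in rankings.values() if any(l > 0 for l in labels)]
    if not judged:
        return 0.0, 0.0
    mean_ap = sum(average_precision(l) for l in judged) / len(judged)
    mrr = sum(reciprocal_rank(l) for l in judged) / len(judged)
    return mean_ap, mrr
```

Here rankings maps each question id to the labels of its candidate answers, ordered by descending score.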
python qa-data-only-idf.py ../../data/TrecQA TrecQA
python qa-data-only-idf.py ../../data/WikiQA WikiQA
Evaluate these using step 2.
The same script can also be used to compute the IDF sum similarity based on corpus IDF statistics:
python qa-data-only-idf.py ../../data/TrecQA TrecQA --index-for-corpusIDF ../../data/indices/index.qadata.pos.docvectors.keepstopwords/
Baseline results are saved in Castor/baseline_results.tsv