This is a folder that holds scripts for preprocessing data used for TREC Fair 2022.
In 2022, the corpus had 6475537 articles.
Script naming format: snake case
File naming format: {trecfair2021,trecfair2022}.{train,eval}.{file type (ex. qrel)}.{additional format (ex. bm25)}.{file extension}
Download the corpus
mkdir collections/raw
wget https://data.boisestate.edu/library/Ekstrand/TRECFairRanking/corpus/trec_corpus_20220301_html.json.gz -P collections/raw/trecfair2022.html.json.gz
wget https://data.boisestate.edu/library/Ekstrand/TRECFairRanking/corpus/trec_corpus_20220301_plain.json.gz -P collections/raw/trecfair2022.plain.json.gz
wget https://data.boisestate.edu/library/Ekstrand/TRECFairRanking/corpus/trec_corpus_20220301_text.json.gz -P collections/raw/trecfair2022.text.json.gz
gzip -d collections/raw/*.gz
Next, we need to convert the data to indexing format:
python convert_trec_fair_2022_data_to_jsonl.py \
--input collections/raw/trecfair2022.plain.json \
--output collections/plain/trecfair2022.plain.jsonl
Because there are duplicate doc_ids for documents with the same subject but with slightly different articles, titles, we may want to convert a collection without duplicates
python convert_trec_fair_2022_data_to_jsonl.py \
--input collections/raw/trecfair2022.plain.json \
--output collections/plain/trecfair2022.plain_ignore_duplicates.jsonl \
--ignore-duplicates
Alternatively, we can also merge duplicated documents together
python convert_trec_fair_2022_data_to_jsonl.py \
--input collections/raw/trecfair2022.plain.json \
--output collections/plain/trecfair2022.plain_merge_duplicates.jsonl \
--merge-duplicates
Download the topics
wget https://data.boisestate.edu/library/Ekstrand/TRECFairRanking/2022/train_topics_meta.jsonl -O topics-and-qrels/raw/trecfair2022.train.topics_meta.jsonl
wget https://data.boisestate.edu/library/Ekstrand/TRECFairRanking/2022/trec_2022_articles_discrete.json.gz -O topics-and-qrels/raw/trecfair2022.articles_discrete.json.gz
wget https://data.boisestate.edu/library/Ekstrand/TRECFairRanking/2022/trec_2022_train_reldocs.jsonl -O topics-and-qrels/raw/trecfair2022.train_reldocs.jsonl
python -m pyserini.index.lucene \
--collection JsonCollection \
--input collections/plain-ignore-duplicates \
--index indexes/plain-ignore-duplicates \
--generator DefaultLuceneDocumentGenerator \
--threads 9 \
--storePositions --storeDocvectors --storeRaw
python -m pyserini.index.lucene \
--collection JsonCollection \
--input collections/plain-merge-duplicates \
--index indexes/plain-merge-duplicates \
--generator DefaultLuceneDocumentGenerator \
--threads 9 \
--storePositions --storeDocvectors --storeRaw
Convert raw train topics to queries of the form query_id topic+keywords
python convert_trec_fair_2022_queries_to_tsv.py \
--input topics-and-qrels/raw/trecfair2022.train.topics_meta.jsonl \
--output topics-and-qrels/trecfair2022.train.queries.tsv
Convert raw train topics to queries of the form query_id topic+keywords
python convert_trec_fair_2022_queries_to_tsv.py \
--input topics-and-qrels/raw/trecfair2022.eval.topics.jsonl \
--output topics-and-qrels/trecfair2022.eval.queries.tsv \
--eval
Normal run only needs 500 hits
python -m pyserini.search.lucene \
--index indexes/trec-fair-2022-plain-ignore-duplicates\
--topics topics-and-qrels/trecfair2022.train.queries.tsv \
--output runs/trecfair2022.train.run500.plain_corpus_ignore_duplicates.bm25.txt \
--bm25 \
--hits 500
To perform negative sampling using runs, we need 100,000 hits
python -m pyserini.search.lucene \
--index indexes/trec-fair-2022-plain-ignore-duplicates\
--topics topics-and-qrels/trecfair2022.train.queries.tsv \
--output runs/trecfair2022.train.run100000.plain_corpus_ignore_duplicates.bm25.txt \
--bm25 \
--hits 100000
Normal run only needs 500 hits
python -m pyserini.search.lucene \
--index indexes/trec-fair-2022-plain\
--topics topics-and-qrels/trecfair2022.eval.queries.tsv \
--output runs/trecfair2022.eval.run500.plain_corpus.bm25.txt \
--bm25 \
--hits 500
To perform negative sampling, we need to get the docIDs
python get_trec_fair_2022_doc_ids.py \
--input collections/plain/trecfair2022.plain.jsonl \
--output topics-and-qrels/trecfair2022.docids.txt
Convert train reldocs to qrels of the form query_id 0 doc_id relevant
python convert_trec_fair_2022_reldocs_to_qrels.py \
--input topics-and-qrels/raw/trecfair2022.train.topics_meta.jsonl \
--output topics-and-qrels/trecfair2022.train.qrels.txt
Generate qrels with random negative sampling
python convert_trec_fair_2022_reldocs_to_qrels.py \
--input topics-and-qrels/raw/trecfair2022.train.topics_meta.jsonl \
--output topics-and-qrels/trecfair2022.train.qrels_w_random_negative_samples.txt \
--random-negative-samples \
--docIDs topics-and-qrels/trecfair2022.docids.txt
Generate qrels with negative sampling from runs
python convert_trec_fair_2022_reldocs_to_qrels.py \
--input topics-and-qrels/raw/trecfair2022.train.topics_meta.jsonl \
--output topics-and-qrels/trecfair2022.train.qrels_w_run_negative_samples.txt \
--run-negative-samples \
--docIDs topics-and-qrels/trecfair2022.docids.txt \
--run runs/trecfair2022.train.run100000.plain_corpus_ignore_duplicates.bm25.txt
Quick check with trec_eval tool
Setup a symbolic link to tools/eval/trec_eval.9.0.4/trec_eval
in pyseirni
./trec_eval -c -mrecall.500 -mP.10 -mndcg -mndcg_cut.10 topics-and-qrels/trecfair2022.train.qrels.txt runs/trecfair2022.train.run500.plain_corpus.bm25.txt
The output should be
P_10 all 0.6739
recall_500 all 0.0138
ndcg all 0.0241
ndcg_cut_10 all 0.6827
qrels for TREC Fair 2022 eval has not come out yet.
Creating T5 input from qrels with run negative samples
python create_trec_fair_2022_monot5_input.py \
--corpus collections/plain-ignore-duplicates/trecfair2022.plain_ignore_duplicates.jsonl \
--topics topics-and-qrels/trecfair2022.train.queries.tsv \
--qrel topics-and-qrels/trecfair2022.train.qrels_w_run_negative_samples.txt \
--output_t5_texts t5_inputs/trecfair2022.train.t5input.plain_corpus_ignore_duplicates.qrels_w_run_negative_samples.txt \
--output_t5_ids t5_inputs/trecfair2022.train.t5input.plain_corpus_ignore_duplicates.qrels_w_run_negative_samples.ids.txt \
--stride 4 \
--max_length 8
Create T5 input with only the first segment and run negative samples
python create_trec_fair_2022_monot5_input.py \
--corpus collections/plain-ignore-duplicates/trecfair2022.plain_ignore_duplicates.jsonl \
--topics topics-and-qrels/trecfair2022.train.queries.tsv \
--qrel topics-and-qrels/trecfair2022.train.qrels_w_run_negative_samples.txt \
--output_t5_texts t5_inputs/trecfair2022.train.t5input.plain_corpus_ignore_duplicates.qrels_w_run_negative_samples.first_segment.txt \
--output_t5_ids t5_inputs/trecfair2022.train.t5input.plain_corpus_ignore_duplicates.qrels_w_run_negative_samples.first_segment.ids.txt \
--stride 4 \
--max_length 8 \
--only-first-segment
Creating T5 input from qrels with random negative samples
python create_trec_fair_2022_monot5_input.py \
--corpus collections/plain-ignore-duplicates/trecfair2022.plain_ignore_duplicates.jsonl \
--topics topics-and-qrels/trecfair2022.train.queries.tsv \
--qrel topics-and-qrels/trecfair2022.train.qrels_w_random_negative_samples.txt \
--output_t5_texts t5_inputs/trecfair2022.train.t5input.plain_corpus_ignore_duplicates.qrels_w_random_negative_samples.txt \
--output_t5_ids t5_inputs/trecfair2022.train.t5input.plain_corpus_ignore_duplicates.qrels_w_random_negative_samples.ids.txt \
--stride 4 \
--max_length 8
Create T5 input from BM25 run
python create_trec_fair_2022_monot5_input.py \
--corpus collections/plain/trecfair2022.plain.json \
--topics topics-and-qrels/trecfair2022.eval.queries.tsv \
--run runs/trecfair2022.eval.run500.text_corpus.bm25.txt \
--output_t5_texts t5_inputs/trecfair2022.eval.t5input.text_corpus.bm25.txt \
--output_t5_ids t5_inputs/trecfair2022.eval.t5input.text_corpus.bm25.ids.txt \
--stride 4 \
--max_length 8