This folder holds scripts for preprocessing data for the TREC Fair Ranking 2021 track.
In 2021, the corpus contained 6,280,328 articles.
The fairness attribute is geographical location, with the following values:
- Africa
- Antarctica
- Asia
- Europe
- Latin America and the Caribbean
- Northern America
- Oceania
Task 1 expects, per query, 1 ranking with 1000 articles.
The output format is {query_id}\t{doc_id}.
Task 2 expects, per query, 100 rankings with 50 articles each.
The output format is {query_id}\t{repeat_number}\t{doc_id}.
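For illustration, with hypothetical IDs, a Task 1 output line would look like 1\t1000001, and a Task 2 output line would look like 1\t1\t1000001.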
Script naming format: snake_case
File naming format: {trecfair2021,trecfair2022}.{train,eval}.{file type (e.g., qrels)}.{additional format (e.g., bm25)}.{file extension}
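For example, later steps produce topics-and-qrels/trecfair2021.train.qrels.txt and runs/trecfair2021.eval.run1000.text_corpus.bm25.txt.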
Download and decompress the raw train/eval topics and eval reldocs
mkdir -p topics-and-qrels/raw
wget https://data.boisestate.edu/library/Ekstrand-2021/TRECFairRanking2021/trec_topics.json.gz -O topics-and-qrels/raw/trecfair2021.train.topics.json.gz
wget https://data.boisestate.edu/library/Ekstrand-2021/TRECFairRanking2021/eval-topics.json.gz -O topics-and-qrels/raw/trecfair2021.eval.topics.json.gz
wget https://trec.nist.gov/data/fair/2021-eval-topics-with-qrels.json.gz -O topics-and-qrels/raw/trecfair2021.eval.reldocs.json.gz
gzip -d topics-and-qrels/raw/trecfair2021*.json.gz
Convert the raw train topics to queries of the form {query_id}\t{topic + keywords}
python convert_trec_fair_2021_queries_to_tsv.py \
--input topics-and-qrels/raw/trecfair2021.train.topics.json \
--output topics-and-qrels/trecfair2021.train.queries.tsv
Convert the raw eval topics to queries of the form {query_id}\t{topic + keywords}
python convert_trec_fair_2021_queries_to_tsv.py \
--input topics-and-qrels/raw/trecfair2021.eval.topics.json \
--output topics-and-qrels/trecfair2021.eval.queries.tsv
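For illustration, with a hypothetical topic, a line in the resulting queries TSV would look like 1\tAgriculture farming crops irrigation.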
A normal run only needs 1000 hits
python -m pyserini.search.lucene \
--index indexes/trec-fair-2021-text \
--topics topics-and-qrels/trecfair2021.train.queries.tsv \
--output runs/trecfair2021.train.run1000.text_corpus.bm25.txt \
--bm25 \
--hits 1000
To perform negative sampling using runs, we need 100,000 hits
python -m pyserini.search.lucene \
--index indexes/trec-fair-2021-text \
--topics topics-and-qrels/trecfair2021.train.queries.tsv \
--output runs/trecfair2021.train.run100000.text_corpus.bm25.txt \
--bm25 \
--hits 100000
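We also retrieve 1000 hits per query for the eval topics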
python -m pyserini.search.lucene \
--index indexes/trec-fair-2021-text \
--topics topics-and-qrels/trecfair2021.eval.queries.tsv \
--output runs/trecfair2021.eval.run1000.text_corpus.bm25.txt \
--bm25 \
--hits 1000
To perform negative sampling, we first need the list of document IDs
python get_trec_fair_2021_doc_ids.py \
--input collections/text/trecfair2021.text.jsonl \
--output topics-and-qrels/trecfair2021.docids.txt
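A minimal sketch of what this step amounts to, assuming each corpus JSONL line carries an "id" field (the actual logic lives in get_trec_fair_2021_doc_ids.py):
import json
# Hypothetical sketch: write one document ID per line from the corpus JSONL
with open("collections/text/trecfair2021.text.jsonl") as corpus, \
     open("topics-and-qrels/trecfair2021.docids.txt", "w") as out:
    for line in corpus:
        doc = json.loads(line)  # assumes an "id" field per document
        out.write(str(doc["id"]) + "\n")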
Convert train reldocs to qrels of the form {query_id} 0 {doc_id} {relevance}
python convert_trec_fair_2021_reldocs_to_qrels.py \
--input topics-and-qrels/raw/trecfair2021.train.topics.json \
--output topics-and-qrels/trecfair2021.train.qrels.txt
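For illustration, with hypothetical IDs, a qrels line would look like 1 0 1000001 1.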
Generate qrels with random negative sampling
python convert_trec_fair_2021_reldocs_to_qrels.py \
--input topics-and-qrels/raw/trecfair2021.train.topics.json \
--output topics-and-qrels/trecfair2021.train.qrels_w_random_negative_samples.txt \
--random-negative-samples \
--docIDs topics-and-qrels/trecfair2021.docids.txt
Generate qrels with negative sampling from runs
python convert_trec_fair_2021_reldocs_to_qrels.py \
--input topics-and-qrels/raw/trecfair2021.train.topics.json \
--output topics-and-qrels/trecfair2021.train.qrels_w_run_negative_samples.txt \
--run-negative-samples \
--docIDs topics-and-qrels/trecfair2021.docids.txt \
--run runs/trecfair2021.train.run100000.text_corpus.bm25.txt
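Presumably, the sampled negatives are written with relevance 0 (e.g., a hypothetical line 1 0 2000002 0) while the original reldocs keep relevance 1; check convert_trec_fair_2021_reldocs_to_qrels.py for the exact labeling.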
Quick check with the trec_eval tool.
Set up a symbolic link to tools/eval/trec_eval.9.0.4/trec_eval in the Pyserini repo.
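For example, assuming a Pyserini clone at ../../pyserini (adjust the path to your checkout)
ln -s ../../pyserini/tools/eval/trec_eval.9.0.4/trec_eval trec_eval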
./trec_eval -c -mrecall.1000 -mP.10 -mndcg -mndcg_cut.10 topics-and-qrels/trecfair2021.train.qrels.txt runs/trecfair2021.train.run1000.text_corpus.bm25.txt
The output should be
P_10 all 0.6316
recall_1000 all 0.0538
ndcg all 0.0740
ndcg_cut_10 all 0.6245
Convert eval reldocs to qrels of the form {query_id} 0 {doc_id} 1
python convert_trec_fair_2021_reldocs_to_qrels.py \
--input topics-and-qrels/raw/trecfair2021.eval.reldocs.json \
--output topics-and-qrels/trecfair2021.eval.qrels.txt
Quick check with trec_eval
./trec_eval -c -mrecall.1000 -mP.10 -mndcg -mndcg_cut.10 topics-and-qrels/trecfair2021.eval.qrels.txt runs/trecfair2021.eval.run1000.text_corpus.bm25.txt
The output should be
P_10 all 0.7714
recall_1000 all 0.5920
ndcg all 0.5545
ndcg_cut_10 all 0.7932
Create T5 input from qrels with run negative samples
python create_trec_fair_2021_monot5_input.py \
--corpus collections/text/trecfair2021.text.jsonl \
--topics topics-and-qrels/trecfair2021.train.queries.tsv \
--qrel topics-and-qrels/trecfair2021.train.qrels_w_run_negative_samples.txt \
--output_t5_texts t5_inputs/trecfair2021.train.t5input.text_corpus.bm25.qrels_w_run_negative_samples.txt \
--output_t5_ids t5_inputs/trecfair2021.train.t5input.text_corpus.bm25.qrels_w_run_negative_samples.ids.txt \
--stride 4 \
--max_length 8
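The generated text inputs presumably follow the usual monoT5 template (an assumption here; see create_trec_fair_2021_monot5_input.py for the exact format), e.g., with hypothetical values
Query: Agriculture farming crops Document: Agriculture is the practice of cultivating plants and livestock ... Relevant: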
Create T5 input with only the first segment
python create_trec_fair_2021_monot5_input.py \
--corpus collections/text/trecfair2021.text.jsonl \
--topics topics-and-qrels/trecfair2021.train.queries.tsv \
--qrel topics-and-qrels/trecfair2021.train.qrels_w_run_negative_samples.txt \
--output_t5_texts t5_inputs/trecfair2021.train.t5input.text_corpus.bm25.qrels_w_run_negative_samples.first_segment.txt \
--output_t5_ids t5_inputs/trecfair2021.train.t5input.text_corpus.bm25.qrels_w_run_negative_samples.first_segment.ids.txt \
--stride 4 \
--max_length 8 \
--only-first-segment
Create T5 input from qrels with random negative samples
python create_trec_fair_2021_monot5_input.py \
--corpus collections/text/trecfair2021.text.jsonl \
--topics topics-and-qrels/trecfair2021.train.queries.tsv \
--qrel topics-and-qrels/trecfair2021.train.qrels_w_random_negative_samples.txt \
--output_t5_texts t5_inputs/trecfair2021.train.t5input.text_corpus.bm25.qrels_w_random_negative_samples.txt \
--output_t5_ids t5_inputs/trecfair2021.train.t5input.text_corpus.bm25.qrels_w_random_negative_samples.ids.txt \
--stride 4 \
--max_length 8
Create T5 input from BM25 run
python create_trec_fair_2021_monot5_input.py \
--corpus collections/text/trecfair2021.text.jsonl \
--topics topics-and-qrels/trecfair2021.eval.queries.tsv \
--run runs/trecfair2021.eval.run1000.text_corpus.bm25.txt \
--output_t5_texts t5_inputs/trecfair2021.eval.t5input.text_corpus.bm25.txt \
--output_t5_ids t5_inputs/trecfair2021.eval.t5input.text_corpus.bm25.ids.txt \
--stride 4 \
--max_length 8
We will be using the official TREC Fair Ranking 2021 evaluator to judge our runs.
It expects runs of the form {query_id}\t{doc_id}.
We can convert Anserini runs of the form {query_id} Q0 {doc_id} {rank} {score} Anserini to {query_id}\t{doc_id}.
python convert_anserini_runs_for_official_eval.py \
--input runs/trecfair2021.eval.run1000.text_corpus.bm25.txt \
--output runs/ \
--task 1
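For illustration, with hypothetical IDs, an Anserini run line such as 1 Q0 1000001 1 21.3500 Anserini becomes 1\t1000001 in the converted run.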
Then, we can get the official evaluation stored in results/
conda activate wptrec
bash trec_fair_2021_run_eval.sh ../../trec2021-fair-public/ runs/trecfair2021.eval.run1000.text_corpus.bm25.eval_format.txt
To analyze the results
python analyze_results.py \
--files results/trecfair2021.eval.run1000.text_corpus.bm25.eval_format.txt.tsv \
--output-file results/task1_summary.tsv \
--task 1
The output should be
Filename: trecfair2021
Mean nDCG: 0.1897162787844332
Mean AWRF: 0.6394697380199352
Mean Score: 0.12017384821260774
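Next, rerank the documents in the converted run with rerank_trec_fair_2021_docs.py, once for each reranking option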
python rerank_trec_fair_2021_docs.py \
--corpus collections/text/trecfair2021.text.jsonl \
--run runs/trecfair2021.eval.run1000.text_corpus.bm25.eval_format.txt \
--output runs/trecfair2021.eval.run1000.text_corpus.bm25.reranked_opt1.eval_format.txt \
--option 1
python rerank_trec_fair_2021_docs.py \
--corpus collections/text/trecfair2021.text.jsonl \
--run runs/trecfair2021.eval.run1000.text_corpus.bm25.eval_format.txt \
--output runs/trecfair2021.eval.run1000.text_corpus.bm25.reranked_opt2.eval_format.txt \
--option 2
python rerank_trec_fair_2021_docs.py \
--corpus collections/text/trecfair2021.text.jsonl \
--run runs/trecfair2021.eval.run1000.text_corpus.bm25.eval_format.txt \
--output runs/trecfair2021.eval.run1000.text_corpus.bm25.reranked_opt3.eval_format.txt \
--option 3
Then, we can evaluate these reranked runs.
conda activate wptrec
bash trec_fair_2021_run_eval.sh ../../trec2021-fair-public/ runs/trecfair2021.eval.run1000.text_corpus.bm25.reranked_opt1.eval_format.txt
python analyze_results.py \
--files results/trecfair2021.eval.run1000.text_corpus.bm25.reranked_opt1.eval_format.txt.tsv \
--output-file results/task1_summary.tsv \
--task 1
The mean AWRF score should be around 0.7 +/- 0.03.
conda activate wptrec
bash trec_fair_2021_run_eval.sh ../../trec2021-fair-public/ runs/trecfair2021.eval.run1000.text_corpus.bm25.reranked_opt2.eval_format.txt
python analyze_results.py \
--files results/trecfair2021.eval.run1000.text_corpus.bm25.reranked_opt2.eval_format.txt.tsv \
--output-file results/task1_summary.tsv \
--task 1
The mean AWRF score should be around 0.7 +/- 0.03.
conda activate wptrec
bash trec_fair_2021_run_eval.sh ../../trec2021-fair-public/ runs/trecfair2021.eval.run1000.text_corpus.bm25.reranked_opt3.eval_format.txt
python analyze_results.py \
--files results/trecfair2021.eval.run1000.text_corpus.bm25.reranked_opt3.eval_format.txt.tsv \
--output-file results/task1_summary.tsv \
--task 1
The mean AWRF score should be around 0.7 +/- 0.03, and the mean score should be around 0.12 +/- 0.02.