forked from castorini/anserini
Support Chinese indexing and search (castorini#804)
+ Add CJKAnalyzer in the Indexing class.
+ Add a language argument to the indexing arguments.
+ Add a setLanguage method and CJKAnalyzer in the SimpleSearch class.
+ Add experiments on the NTCIR-8 ZH dataset.
1 parent 70350fa · commit b771bb9
Showing 20 changed files with 110,742 additions and 10 deletions.
# Cross-lingual Information Retrieval Experiments

This page contains instructions for running BM25 baselines on the NTCIR-8 *IR4QA* task.

## Data Prep

First, we need to convert the corpus into the JSON Lines format:

```
python src/main/python/clir/convert_collection_to_jsonl.py \
 --language zh \
 --corpus_directory /directory/to/ntcir-collection/ \
 --output_path /path/to/dump
```
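Each line of the resulting dump is a standalone JSON object with `id` and `contents` keys (the schema described in the converter's docstring). A minimal sketch of reading such a dump back, using a hypothetical two-document sample:

```python
import json

# Hypothetical two-line dump in Anserini's JSON Lines schema:
# {"id": ..., "contents": ...}
sample_dump = (
    '{"id": "doc1", "contents": "string1"}\n'
    '{"id": "doc2", "contents": "string2"}\n'
)

# Parse one JSON object per non-empty line.
docs = [json.loads(line) for line in sample_dump.splitlines() if line.strip()]
ids = [d["id"] for d in docs]
```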
## Document Ranking with BM25

Run the following command to index the documents:

```
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \
 -generator LuceneDocumentGenerator -threads 1 \
 -input /directory/to/dump \
 -index /directory/to/index/lucene-index.clir_zh.pos+docvectors+rawdocs -storePositions -storeDocvectors \
 -storeRawDocs -language zh >& log.clir_zh.pos+docvectors+rawdocs &
```
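The `-language zh` flag switches the analyzer to Lucene's CJKAnalyzer, which tokenizes runs of Chinese characters into overlapping character bigrams rather than whitespace-delimited words. A rough Python illustration of that bigramming behavior (a sketch of the idea, not Lucene's actual code):

```python
def cjk_bigrams(text):
    """Emit overlapping character bigrams from a run of CJK characters,
    mimicking CJKAnalyzer's tokenization (illustrative only)."""
    text = text.strip()
    if len(text) < 2:
        # A lone character (or empty string) cannot form a bigram.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

tokens = cjk_bigrams("信息检索")  # four characters -> three bigrams
```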
## Retrieval

To perform document retrieval, run:

```
nohup target/appassembler/bin/SearchCollection -topicreader TsvStringKey \
 -index lucene-index.clir_zh.pos+docvectors+rawdocs/ \
 -topics src/main/resources/topics-and-qrels/topics.ntcir8zh.eval.txt \
 -output run.clir-zh.bm25-default.zh.topics.txt -bm25 -language zh &
```
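The output run file is in the standard six-column TREC run format (`qid Q0 docid rank score tag`). A small sketch for loading it, using two made-up run lines (the query and document IDs below are hypothetical):

```python
from collections import defaultdict

# Hypothetical TREC run lines: qid Q0 docid rank score tag
run_text = """\
T-001 Q0 XIN_CMN_20020315.0001 1 11.52 Anserini
T-001 Q0 XIN_CMN_20020316.0042 2 10.98 Anserini
"""

# Group ranked results by query ID.
rankings = defaultdict(list)
for line in run_text.strip().splitlines():
    qid, _q0, docid, rank, score, _tag = line.split()
    rankings[qid].append((int(rank), docid, float(score)))
```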
## Evaluation

To evaluate, run:

```
eval/trec_eval.9.0.4/trec_eval -m map \
 src/main/resources/topics-and-qrels/qrels.ntcir8.eval.txt \
 run.clir-zh.bm25-default.zh.topics.txt
```

| Collection | MAP    |
|:----------:|:------:|
| NTCIR-8 ZH | 0.3568 |
src/main/java/io/anserini/search/topicreader/TsvStringTopicReader.java (61 additions, 0 deletions)
```java
/**
 * Anserini: A Lucene toolkit for replicable information retrieval research
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package io.anserini.search.topicreader;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Topic reader for queries in TSV format, such as the MS MARCO queries.
 *
 * <pre>
 * 174249 does xpress bet charge to deposit money in your account
 * 320792 how much is a cost to run disneyland
 * 1090270 botulinum definition
 * 1101279 do physicians pay for insurance from their salaries?
 * 201376 here there be dragons comic
 * 54544 blood diseases that are sexually transmitted
 * ...
 * </pre>
 */
public class TsvStringTopicReader extends TopicReader<String> {
  public TsvStringTopicReader(Path topicFile) {
    super(topicFile);
  }

  @Override
  public SortedMap<String, Map<String, String>> read(BufferedReader reader) throws IOException {
    SortedMap<String, Map<String, String>> map = new TreeMap<>();

    String line;
    while ((line = reader.readLine()) != null) {
      line = line.trim();
      String[] arr = line.split("\\t");

      Map<String, String> fields = new HashMap<>();
      fields.put("title", arr[1].trim());
      map.put(arr[0], fields);
    }

    return map;
  }
}
```
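The reader's parsing rule — split each line on a tab, take the first field as the topic ID and the second as the title — can be sketched in a few lines of Python (an illustration of the rule, not part of the commit):

```python
def read_tsv_topics(lines):
    """Mirror TsvStringTopicReader: each line is id<TAB>query,
    stored as {id: {"title": query}} (illustrative sketch)."""
    topics = {}
    for line in lines:
        parts = line.strip().split("\t")
        topics[parts[0]] = {"title": parts[1].strip()}
    return topics

topics = read_tsv_topics(
    ["174249\tdoes xpress bet charge to deposit money in your account"]
)
```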
95 additions & 0 deletions (new Python conversion script)
```python
# -*- coding: utf-8 -*-
"""
Anserini: A Lucene toolkit for replicable information retrieval research
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

"""
This script converts a cross-lingual IR corpus into the JSON Lines
format, which can be easily indexed by Anserini.
Anserini's JSON Lines format is as follows:
{"id": "doc1", "contents": "string1"}
Data we currently have:
- ZH: gigaword-xin.2002-06.zh-cleaned.xml
"""

import argparse
import json
import os

ZH_CORPUS_NAME = "gigaword-xin.2002-06.zh-cleaned.xml"


def zh2json(file_path, output_path):
    """
    Processing rules:
    1. If two lines are consecutive, concatenate them without a space.
    2. If two lines are separated by blank lines, separate them with the
       Chinese period 。.
    These rules do not matter for passage-level indexing, but they will
    affect performance when we do sentence-level indexing.
    :param file_path: directory containing the corpus file
    :param output_path: path of the JSON Lines dump to write
    """
    fout = open(output_path, 'w')
    counter = 0
    with open(os.path.join(file_path, ZH_CORPUS_NAME)) as fin:
        while True:
            line = fin.readline()
            if line.startswith("<DOC>"):
                # We assume the line after the "<DOC>" label line is
                # the "<DOCNO>" line.
                example = {}
                line = fin.readline()
                if line.startswith("<DOCNO>"):
                    line = line.replace("<DOCNO>", "").replace("</DOCNO>", "").strip()
                    example["id"] = line
                else:
                    print("The line is {}, but we assume it is the <DOCNO> line".format(line))
                    exit()
                # Read contents
                example["contents"] = []
                line = fin.readline()
                while not line.startswith("</DOC>"):
                    line = line.strip()
                    if len(line) == 0:
                        example["contents"].append("。")
                    else:
                        example["contents"].append(line)
                    line = fin.readline()
                example["contents"] = "".join(example["contents"])
                fout.write(json.dumps(example) + "\n")
                counter += 1
                if counter % 10000 == 0:
                    print("Dumped {} examples".format(counter))
            elif not line:
                break
    fout.close()
    print("Done")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--language", type=str, choices=["zh"])
    parser.add_argument("--corpus_directory", type=str)
    parser.add_argument("--output_path", type=str)
    args = parser.parse_args()

    output_dir = os.path.dirname(args.output_path)
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    if args.language == "zh":
        zh2json(args.corpus_directory, args.output_path)
```
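The two concatenation rules in `zh2json` — join consecutive lines directly and turn blank lines into the Chinese period 。 — can be demonstrated in isolation (a standalone sketch of the rules, not a call into the script itself):

```python
def join_lines(lines):
    """Apply the converter's rules: consecutive lines are concatenated
    without a space; each blank line becomes the Chinese period 。."""
    parts = []
    for line in lines:
        line = line.strip()
        parts.append("。" if len(line) == 0 else line)
    return "".join(parts)

# Two consecutive lines, then a blank line, then a third line.
contents = join_lines(["第一行", "第二行", "", "第三行"])
```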
# Anserini: Regressions for [NTCIR-8 Simplified Chinese](http://research.nii.ac.jp/ntcir/ntcir-ws8/ws-en.html)

This page documents regression experiments for the NTCIR-8 Information Retrieval for Question Answering (IR4QA) task, which are integrated into Anserini's regression testing framework.
For more complete instructions on how to run end-to-end experiments, refer to [this page](experiments-ntcir8-zh.md).

## Indexing

Typical indexing command:

```
${index_cmds}
```

The directory `/path/to/ntcir-8/` should contain the official document collection (a single file) in JSON format.
[This page](experiments-ntcir8-zh.md) explains how to perform this conversion.

For additional details, see the explanation of [common indexing options](common-indexing-options.md).

## Retrieval

Topics and qrels are stored in `src/main/resources/topics-and-qrels/`.
The regression experiments here evaluate on the 73 questions.

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to replicate the following results:

${effectiveness}

The setting "default" refers to the default BM25 parameters of `k1=0.9` and `b=0.4`.
See [this page](experiments-ntcir8-zh.md) for more details.
Note that here we are using `trec_eval` to evaluate the top 1000 hits for each query.
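For reference, the per-term BM25 score with the default `k1=0.9` and `b=0.4` follows the usual formula (a textbook sketch; Lucene's BM25Similarity has the same shape but differs in details such as IDF smoothing and norm encoding):

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=0.9, b=0.4):
    """Textbook BM25 contribution of a single term to a document's score
    (illustrative only; not Lucene's exact implementation).
    tf: term frequency in the document; df: document frequency of the term;
    num_docs: collection size; doc_len / avg_doc_len: length normalization."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

Larger `b` penalizes long documents more heavily; larger `k1` lets term frequency saturate more slowly.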