Hi,
I'm trying to create dense representations of my corpus and search paragraphs/phrases by keyword or question. I don't have labeled questions and answers, and for now I don't need to extract answers, only to retrieve documents that possibly contain the answer.
I built a JSON file from my corpus (pt-br) like this:
{
  "data": [
    {
      "title": "Radicais livres: o que são, efeitos no corpo e como se proteger",
      "paragraphs": [
        {
          "context": "Os radicais livres ..."
        },
        {
          "context": "Desta forma, quanto menos radicais livres, ..."
        }, ...
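For reference, here is a minimal sketch of how a file with this layout can be generated. The `build_corpus` helper, the placeholder documents, and the output file name are illustrative assumptions, and the closing of the truncated structure above is assumed to follow the usual nested pattern.

```python
import json


def build_corpus(docs):
    """Convert (title, paragraphs) pairs into the nested layout shown above."""
    return {
        "data": [
            {
                "title": title,
                "paragraphs": [{"context": text} for text in paragraphs],
            }
            for title, paragraphs in docs
        ]
    }


# Hypothetical placeholder documents; the real corpus text would go here.
docs = [
    ("Example title", ["First paragraph.", "Second paragraph."]),
]

corpus = build_corpus(docs)

# ensure_ascii=False keeps accented pt-br characters readable in the file.
with open("all_data.json", "w", encoding="utf-8") as f:
    json.dump(corpus, f, ensure_ascii=False, indent=2)
```

Each paragraph becomes its own `context` entry, so one document with many paragraphs stays grouped under a single title.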
Then I ran the following commands:
python generate_phrase_vecs.py \
    --pretrained_name_or_path SpanBERT/spanbert-base-cased \
    --data_dir ./data \
    --cache_dir ./cache \
    --test_file ../tua-saude/all_data.json \
    --do_dump \
    --max_seq_length 512 \
    --fp16 \
    --filter_threshold -2.0 \
    --append_title \
    --output_dir ./data/densephrases-multi_sample \
    --load_dir princeton-nlp/densephrases-multi

python build_phrase_index.py \
    --dump_dir ./data/densephrases-multi_sample/dump \
    --stage all \
    --replace \
    --num_clusters 128 \
    --fine_quant OPQ96 \
    --doc_sample_ratio 0.3 \
    --vec_sample_ratio 0.3 \
    --cuda

python scripts/preprocess/compress_metadata.py \
    --input_dump_dir ./data/densephrases-multi_sample/dump/phrase \
    --output_dir ./data/densephrases-multi_sample/dump
Those commands appear to work fine. Here are the contents of output_dir.
Now, when I try to use the model:
from densephrases import DensePhrases

model = DensePhrases(
    load_dir='princeton-nlp/densephrases-multi',
    dump_dir='./data/densephrases-multi_sample/dump/',
    index_name='start/128_flat_OPQ96'
)
This error is raised:
>>>
This could take up to 15 mins depending on the file reading speed of HDD/SSD
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/projetos/u4vn/DensePhrases/densephrases/model.py", line 52, in __init__
    self.truecase = TrueCaser(os.path.join(os.environ['DATA_DIR'], self.args.truecase_path))
  File "/projetos/u4vn/DensePhrases/densephrases/utils/data_utils.py", line 366, in __init__
    with open(dist_file_path, "rb") as distributions_file:
FileNotFoundError: [Errno 2] No such file or directory: './data/truecase/english_with_questions.dist'
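Judging only by the traceback, the missing path seems to be the `DATA_DIR` environment variable joined with a truecase path argument. The check below is a rough sketch under that assumption: the relative path is copied from the error message, `truecase_file` is a hypothetical helper, and the `./data` fallback is a guess that matches my setup.

```python
import os

# Assumption: both values are inferred from the traceback above,
# not taken from the DensePhrases documentation.
data_dir = os.environ.get("DATA_DIR", "./data")
truecase_rel = "truecase/english_with_questions.dist"


def truecase_file(data_dir, rel_path=truecase_rel):
    """Return the joined path and whether a regular file exists there."""
    path = os.path.join(data_dir, rel_path)
    return path, os.path.isfile(path)


path, found = truecase_file(data_dir)
print(f"{path}: {'found' if found else 'missing'}")
```

In my case this reports the file as missing, which matches the error: nothing in the steps I ran created a `truecase` folder under `./data`.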
What am I missing? What file is this?