
Add msmarco v2 document segmentation script #706

Merged · 32 commits · Jul 16, 2021

Commits:
5a66347
add tasb msmarco dev subset reproduce
May 26, 2021
066c3d5
resolve version comment
May 27, 2021
397280e
manually resolve conflict
Jun 5, 2021
8b74b0a
initialize tct-colbert-v2 doc
Jun 8, 2021
0ef7d58
fix alpha for doct5query fusion
Jun 8, 2021
044eabb
add baseline
justram Jun 8, 2021
437ad41
Merge branch 'master' of github.com:jacklin64/pyserini
justram Jun 8, 2021
9f56a36
add tct_colbert-v2 integration test
justram Jun 8, 2021
2457f6d
add distilbert_tasb integration
Jun 8, 2021
5d13d11
fix typo
justram Jun 8, 2021
0be15ea
add baseline exp
justram Jun 8, 2021
4bd4231
Merge branch 'master' of github.com:jacklin64/pyserini
justram Jun 8, 2021
9f3729f
rearrange
justram Jun 8, 2021
2936de6
rearrange tct-v2 exp order
justram Jun 8, 2021
555c87b
resolve conflict
justram Jun 8, 2021
f6180a0
fix function name
Jun 8, 2021
853cba9
Delete test_distilbert_tasb.py
jacklin64 Jun 8, 2021
cbd1c53
Delete test_tct_colbert-v2.py
jacklin64 Jun 8, 2021
2b24168
clarify the results in the table
Jun 8, 2021
719de11
add tasb and tct-v2 integration
Jun 10, 2021
7737c26
Merge branch 'castorini:master' into master
jacklin64 Jun 10, 2021
557dd2b
Merge branch 'castorini:master' into master
justram Jun 24, 2021
975e4b6
add tct doc encoding
Jun 26, 2021
e75aa36
resolve conflict
justram Jul 2, 2021
b14f182
Merge branch 'castorini-master'
justram Jul 2, 2021
53ee49c
sync to master
Jul 16, 2021
df2659b
Merge pull request #3 from castorini/master
jacklin64 Jul 16, 2021
e3de279
add msmarco v2 document segmentation
Jul 16, 2021
196ee31
Merge branch 'master' of github.com:jacklin64/pyserini
Jul 16, 2021
df0ac90
rename and add boilerplate header.
Jul 16, 2021
5b2cccb
delete redundant file
Jul 16, 2021
ba5a40b
fix description and help
Jul 16, 2021
fix description and help
Lin Jack committed Jul 16, 2021
commit ba5a40bf10204883d2ea059b5faa284accdd1fc7
13 changes: 7 additions & 6 deletions scripts/msmarco_v2/segment_docs.py
@@ -13,6 +13,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+# Starting point for writing this script
+# https://github.com/castorini/docTTTTTquery/blob/master/convert_msmarco_passages_doc_to_anserini.py
 import argparse
 import os
 import sys
@@ -66,11 +68,11 @@ def split_document(f_ins, f_out):
 
 if __name__ == '__main__':
     parser = argparse.ArgumentParser(
-        description='Concatenate MS MARCO original docs with predicted queries')
+        description='Segment MS MARCO V2 original docs into passages')
     parser.add_argument('--input', required=True, help='MS MARCO V2 corpus path.')
-    parser.add_argument('--output', required=True, help='Output file path with json format.')
-    parser.add_argument('--max_length', default=10)
-    parser.add_argument('--stride', default=5)
+    parser.add_argument('--output', required=True, help='output file path in json format.')
+    parser.add_argument('--max_length', default=10, help='maximum number of sentences per passage')
+    parser.add_argument('--stride', default=5, help='distance between the starting sentences of consecutive passages in a document')
     parser.add_argument('--num_workers', default=1, type=int)
     args = parser.parse_args()
@@ -84,7 +86,6 @@ def split_document(f_ins, f_out):
     nlp.add_pipe(nlp.create_pipe("sentencizer"))
 
     files = glob.glob(os.path.join(args.original_docs_path, '*.gz'))
-    # split_document(files, os.path.join(args.output_docs_path, 'doc' + str(0) + '.json'))
     num_files = len(files)
     pool = Pool(args.num_workers)
     num_files_per_worker = num_files // args.num_workers
@@ -97,7 +98,7 @@ def split_document(f_ins, f_out):
 
     pool.apply_async(split_document, (file_list, f_out))
 
-    pool.close()  # close the process pool and no longer accept new processes
+    pool.close()
     pool.join()
 
     print('Done!')
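The diff above only touches the argparse help text, but that text documents how the script segments documents: a sliding window of at most `--max_length` sentences that advances by `--stride` sentences, so consecutive passages overlap. A minimal sketch of that windowing, assuming sentences have already been split out (the function name and sample document below are hypothetical, not from the PR):

```python
def sliding_window_passages(sentences, max_length=10, stride=5):
    """Group sentences into passages of up to `max_length` sentences,
    starting a new passage every `stride` sentences."""
    passages = []
    for start in range(0, len(sentences), stride):
        passages.append(' '.join(sentences[start:start + max_length]))
        if start + max_length >= len(sentences):
            break  # this window already reaches the end of the document
    return passages

# A 12-sentence document with the script's defaults yields two overlapping
# passages: sentences 0-9 and sentences 5-11.
doc = [f'Sentence {i}.' for i in range(12)]
passages = sliding_window_passages(doc)
print(len(passages))  # 2
```

With the defaults (`max_length=10`, `stride=5`) each sentence in the interior of a document appears in two passages, which is a common trade-off in passage retrieval: more index entries in exchange for not splitting relevant context across a hard passage boundary. In the actual script the sentence boundaries come from spaCy's sentencizer pipe, shown as context in the last two hunks.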