Data Format #9
Comments
How about the index of the "DOCSTART"? 0?
"DOCSTART" in my data sets is placed in a separate sentence, like
I get it, thanks for your reply!
Thanks for your explanation of the data format, but I am still confused about the word embedding format or standard you used. Can you give me some details on this?
The detailed information about the word embeddings is given in Ma's paper (Ma X, Hovy E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. 2016). The paper reports that Stanford's GloVe 100-dimensional embeddings achieve the best results.
Thanks a lot, I had checked that out a few minutes after sending the comment. Sorry for the bother, and thanks for your reply again!
I am still not clear about the format. Is the index per word within each sentence, or does it increment across all words?
Also, I am getting an error when running $ bash ./examples/run_ner_crf.sh. Is it possible for you to share your data files for the NER task?
@nrasiwas Sorry for the late response. The index is per word within each sentence, e.g.:
1 Peter NNP I-NP I-PER
@XuezheMax, here's a script for adding the starting indexes. Do you think it's ok?
The following is the code I used (I just added the line index to Xuezhe's code) for converting the original CoNLL-2003 files to the format used by
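The linked conversion script itself was not captured in this export. As a rough sketch of what such a script might look like (the function name and exact boundary handling are my own assumptions, based on the format described in this thread):

```python
# Hypothetical sketch: prepend a 1-based, per-sentence word index to
# CoNLL-2003 style lines ("EU NNP I-NP I-ORG" -> "1 EU NNP I-NP I-ORG").
# Blank lines separate sentences; -DOCSTART- markers are replaced by
# blank separators and also reset the counter.
def add_indexes(lines):
    out, idx = [], 1
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("-DOCSTART-"):
            idx = 1           # reset the counter at sentence/document boundaries
            out.append("")
            continue
        out.append(f"{idx} {line}")
        idx += 1
    return out
```

Note that the counter restarts at every blank line, matching the per-sentence indexing the maintainer describes above.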
Could you give a more detailed explanation of the data format for dependency parsing? Also, does the format you used include annotation lines? For example, the CoNLL-U format typically has two lines starting with #, indicating the sentence id and the raw text. Thanks in advance!
The second column is reserved for the lemma, the same as in CoNLL-U, but our model does not use lemma information, so the second column can be filled with anything. Our format does not include the lines starting with #.
@XuezheMax Could you share the data used for POS tagging? Thanks in advance!
Hi, the data is under the PTB license. If that is not an issue, I am happy to send you the data. Can you give me your email?
@XuezheMax I've sent you an email. Thank you very much!
Hi, thanks for your code and data format. But I am still confused about the data format, so I am not sure that I used it correctly. Could you give information about the whole schema of your CoNLL-X format and NER data format? Or could you share your data with me? Thanks in advance. My guess for the schema of the CoNLL format is: Is that the right schema?
For the CoNLL-X format, the schema is:
For NER data, the schema is:
Thank you for your reply!
Hi
Hi, thanks for your code and data format. But I am still confused about the datasets.
For the POS tagging dataset, you need to get it from the Penn Treebank.
Hi, I'm very interested in your nice work, and I'd love to build my new model upon yours. |
Hey @YuxianMeng For the dependency parsing data, please provide your email so that I can send it to you.
@XuezheMax Hi, the license is not an issue for me. Actually, we have downloaded and processed the PTB now; I just want to double-check our data :). My email is yuxian_meng@shannonai.com and thanks again~
@XuezheMax Hello, I am very interested in your work on dependency parsing, which has greatly inspired me. I am currently trying to reproduce your results. Would it be possible for you to share the CoNLL-X-style dependency parsing data you used, so that I can replicate your experiments? Also, could you provide the source or link for downloading the sskip.eng.100.gz file? Thank you very much for your help! My email is chen_yn@zju.edu.cn and thanks again!
For the data used for POS tagging and dependency parsing, our data format follows the CoNLL-X format. The following is an example:
1 No _ RB RB _ 7 discourse _ _
2 , _ , , _ 7 punct _ _
3 it _ PR PRP _ 7 nsubj _ _
4 was _ VB VBD _ 7 cop _ _
5 n't _ RB RB _ 7 neg _ _
6 Black _ NN NNP _ 7 nn _ _
7 Monday _ NN NNP _ 0 root _ _
8 . _ . . _ 7 punct _ _
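To make the ten-column schema above concrete, here is a minimal reader sketch (field names follow the standard CoNLL-X convention: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL; the function and constant names are my own):

```python
# Sketch of a per-line CoNLL-X parser; ten whitespace-separated columns,
# underscore "_" marks an unused field (e.g. the lemma column here).
CONLLX_FIELDS = ["id", "form", "lemma", "cpos", "pos",
                 "feats", "head", "deprel", "phead", "pdeprel"]

def parse_conllx_line(line):
    cols = line.split()
    assert len(cols) == 10, "CoNLL-X lines have 10 columns"
    tok = dict(zip(CONLLX_FIELDS, cols))
    tok["id"] = int(tok["id"])
    tok["head"] = int(tok["head"])   # HEAD 0 means the token is the root
    return tok
```

For instance, parsing the "Monday" line above yields head 0 and relation "root", matching the tree rooted at token 7.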
For the data used for NER, our data format is similar to that used in the CoNLL 2003 shared task, with a small difference. An example follows:
1 EU NNP I-NP I-ORG
2 rejects VBZ I-VP O
3 German JJ I-NP I-MISC
4 call NN I-NP O
5 to TO I-VP O
6 boycott VB I-VP O
7 British JJ I-NP I-MISC
8 lamb NN I-NP O
9 . . O O
1 Peter NNP I-NP I-PER
2 Blackburn NNP I-NP I-PER
3 BRUSSELS NNP I-NP I-LOC
4 1996-08-22 CD I-NP O
...
where we add a column at the beginning to store the index of each word within its sentence.
The original CoNLL-03 data can be downloaded here:
https://github.com/glample/tagger/tree/master/dataset
Make sure to convert the original tagging scheme to the standard BIO (or the more advanced BIOES) scheme.
Here is the code I used to convert it to BIO:
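The conversion code referenced above was not captured in this export. As a hedged stand-in (function name is my own), the standard IOB1-to-BIO conversion promotes an I-X tag to B-X whenever it starts a new chunk, i.e. when the previous tag is O or has a different type:

```python
# Sketch of IOB1 -> BIO (IOB2) conversion for one sentence's tag sequence.
# In the original CoNLL-03 IOB1 scheme, a chunk usually starts with I-X,
# and B-X appears only between two adjacent same-type chunks; in BIO,
# every chunk starts with B-X.
def iob1_to_bio(tags):
    bio, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and (prev == "O" or prev[2:] != tag[2:]):
            bio.append("B-" + tag[2:])   # chunk start: promote I- to B-
        else:
            bio.append(tag)              # O, continuation, or existing B-
        prev = tag
    return bio
```

On the example above, "EU" (I-ORG after a sentence start) becomes B-ORG, while the second token of a multi-word entity like "Peter Blackburn" keeps its I- tag.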