Data Format #9

XuezheMax · 2018-02-27T20:53:34Z

For POS tagging and dependency parsing, our data follows the CoNLL-X format. Here is an example:
1 No _ RB RB _ 7 discourse _ _
2 , _ , , _ 7 punct _ _
3 it _ PR PRP _ 7 nsubj _ _
4 was _ VB VBD _ 7 cop _ _
5 n't _ RB RB _ 7 neg _ _
6 Black _ NN NNP _ 7 nn _ _
7 Monday _ NN NNP _ 0 root _ _
8 . _ . . _ 7 punct _ _

For NER, our data format is similar to the one used in the CoNLL 2003 shared task, with one small difference. Here is an example:
1 EU NNP I-NP I-ORG
2 rejects VBZ I-VP O
3 German JJ I-NP I-MISC
4 call NN I-NP O
5 to TO I-VP O
6 boycott VB I-VP O
7 British JJ I-NP I-MISC
8 lamb NN I-NP O
9 . . O O

1 Peter NNP I-NP I-PER
2 Blackburn NNP I-NP I-PER
3 BRUSSELS NNP I-NP I-LOC
4 1996-08-22 CD I-NP O
...
where we add a column at the beginning to store the index of each word.

The original CoNLL-03 data can be downloaded here:
https://github.com/glample/tagger/tree/master/dataset

Make sure to convert the original tagging scheme to standard BIO (or the more advanced BIOES).
Here is the code I used to convert it to BIO:

def transform(ifile, ofile):
	"""Convert the NER tag (last column) of a CoNLL-03 file to BIO."""
	with open(ifile, 'r') as reader, open(ofile, 'w') as writer:
		prev = 'O'  # original tag of the previous token
		for line in reader:
			line = line.strip()
			if len(line) == 0:
				# A blank line ends the sentence, so reset the state.
				prev = 'O'
				writer.write('\n')
				continue

			tokens = line.split()
			label = tokens[-1]
			# An entity starts when the previous tag is 'O' or has a different type.
			if label != 'O' and label != prev:
				if prev == 'O' or label[2:] != prev[2:]:
					label = 'B-' + label[2:]
			writer.write(" ".join(tokens[:-1]) + " " + label)
			writer.write('\n')
			prev = tokens[-1]
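
For the BIOES option mentioned above, here is a minimal sketch of one way to go from BIO to BIOES. It is not the code used to build the original data; the function name `bio_to_bioes` is hypothetical, and it assumes the input is the list of BIO labels for a single sentence.

```python
def bio_to_bioes(labels):
    """Convert the BIO labels of one sentence to BIOES (illustrative sketch)."""
    converted = []
    for i, label in enumerate(labels):
        if label == 'O':
            converted.append(label)
            continue
        prefix, entity = label[0], label[2:]
        next_label = labels[i + 1] if i + 1 < len(labels) else 'O'
        # The entity continues only if the next label is I- with the same type.
        continues = next_label == 'I-' + entity
        if prefix == 'B':
            # A one-token entity becomes S-, otherwise it stays B-.
            converted.append(('B-' if continues else 'S-') + entity)
        else:
            # The last token of a multi-token entity becomes E-.
            converted.append(('I-' if continues else 'E-') + entity)
    return converted


# BIO labels for "EU rejects German call"
print(bio_to_bioes(['B-ORG', 'O', 'B-MISC', 'O']))
# → ['S-ORG', 'O', 'S-MISC', 'O']
```

Applying it per sentence (between blank lines) keeps an entity from being linked across a sentence boundary.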

HAWLYQ · 2018-03-06T13:02:21Z

What about the index of "DOCSTART"? Is it 0?

XuezheMax · 2018-03-06T21:01:40Z

"DOCSTART" in my data sets is placed in a separate sentence, like
1 -DOCSTART- -X- O O
But since it provides no useful information, you can remove it from your data.

HAWLYQ · 2018-03-07T01:45:05Z

I get it, thanks for your reply!

ichn-hu · 2018-03-22T06:14:54Z

Thanks for your explanation of the data format, but I am still confused about the word embedding format or standard you used. Can you give me some details on this?

HAWLYQ · 2018-03-22T08:57:55Z

The word embeddings are described in Ma's paper (Ma X, Hovy E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. 2016.). The paper reports that Stanford's GloVe 100-dimensional embeddings achieve the best results.
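
On the file format itself: the released GloVe text files store one token per line, followed by its vector components separated by spaces. The sketch below reads a gzipped file of that shape into a dictionary. It is only an illustration, not the loader used in this repository; the function name `load_embeddings` is hypothetical, and it assumes the file has no header line.

```python
import gzip


def load_embeddings(path):
    """Read a gzipped text embedding file into a dict: word -> list of floats."""
    embeddings = {}
    with gzip.open(path, 'rt', encoding='utf-8') as reader:
        for line in reader:
            fields = line.rstrip().split(' ')
            if len(fields) < 2:
                continue  # skip empty or malformed lines
            embeddings[fields[0]] = [float(x) for x in fields[1:]]
    return embeddings


# Example call (the path is only a placeholder):
# vectors = load_embeddings('glove.6B.100d.gz')
```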

ichn-hu · 2018-03-22T09:16:24Z

Thanks a lot. I checked that out a few minutes after sending the comment. Sorry for the bother, and thanks again for your reply!

nrasiwas · 2018-05-17T10:07:29Z

I am still not clear about the format. Does the index restart for each sentence, or is it incremented across all words?

nrasiwas · 2018-05-17T11:28:35Z

Also, I am getting an error:

$ bash ./examples/run_ner_crf.sh
loading embedding: glove from data/glove/glove.6B/glove.6B.100d.gz
2018-05-17 16:56:01,917 - NERCRF - INFO - Creating Alphabets
2018-05-17 16:56:01,922 - Create Alphabets - INFO - Word Alphabet Size (Singleton): 48 (0)
2018-05-17 16:56:01,922 - Create Alphabets - INFO - Character Alphabet Size: 35
2018-05-17 16:56:01,922 - Create Alphabets - INFO - POS Alphabet Size: 19
2018-05-17 16:56:01,922 - Create Alphabets - INFO - Chunk Alphabet Size: 9
2018-05-17 16:56:01,922 - Create Alphabets - INFO - NER Alphabet Size: 125
2018-05-17 16:56:01,923 - NERCRF - INFO - Word Alphabet Size: 48
2018-05-17 16:56:01,923 - NERCRF - INFO - Character Alphabet Size: 35
2018-05-17 16:56:01,923 - NERCRF - INFO - POS Alphabet Size: 19
2018-05-17 16:56:01,923 - NERCRF - INFO - Chunk Alphabet Size: 9
2018-05-17 16:56:01,923 - NERCRF - INFO - NER Alphabet Size: 125
2018-05-17 16:56:01,923 - NERCRF - INFO - Reading Data
Reading data from data/conll2003/english/eng.train.bioes.conll
Traceback (most recent call last):
File "examples/NERCRF.py", line 248, in <module>
main()
File "examples/NERCRF.py", line 110, in main
data_train = conll03_data.read_data_to_variable(train_path, word_alphabet, char_alphabet, pos_alphabet, chunk_alphabet, ner_alphabet, use_gpu=use_gpu)
File "./neuronlp2/io/conll03_data.py", line 313, in read_data_to_variable
max_size=max_size, normalize_digits=normalize_digits)
File "./neuronlp2/io/conll03_data.py", line 157, in read_data
inst = reader.getNext(normalize_digits)
File "./neuronlp2/io/reader.py", line 165, in getNext
pos_ids.append(self.__pos_alphabet.get_index(pos))
File "./neuronlp2/io/alphabet.py", line 64, in get_index
raise KeyError("instance not found: %s" % instance)
KeyError: u'instance not found: NNP'

Is it possible for you to share your data files for the NER task?

XuezheMax · 2018-06-06T17:25:07Z

@nrasiwas sorry for the late response.
Here is a clearer example of the data format.
The following is the correct format for your examples:
1 EU NNP I-NP I-ORG
2 rejects VBZ I-VP O
3 German JJ I-NP I-MISC
4 call NN I-NP O
5 to TO I-VP O
6 boycott VB I-VP O
7 British JJ I-NP I-MISC
8 lamb NN I-NP O
9 . . O O

1 Peter NNP I-NP I-PER
2 Blackburn NNP I-NP I-PER
3 BRUSSELS NNP I-NP I-LOC
4 1996-08-22 CD I-NP O

The index counts words within each sentence, so it restarts at 1 for every sentence.
Also, make sure to remove the alphabet folder in 'data/' when you use a different data set or a different version of a data set. Otherwise, the program will load the old vocabulary from disk.
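
A quick way to check a file against this format before training is sketched below. This is only an illustration and not part of the repository; the function name `check_ner_lines` is hypothetical. It assumes five whitespace-separated columns per token line and a blank line between sentences.

```python
def check_ner_lines(lines, num_columns=5):
    """Return (line_number, message) pairs for lines that break the NER format."""
    problems = []
    expected = 1
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            expected = 1  # a blank line ends the sentence, so the index restarts
            continue
        fields = line.split()
        if len(fields) != num_columns:
            problems.append(
                (number, 'expected %d columns, found %d' % (num_columns, len(fields))))
        if fields[0] != str(expected):
            problems.append(
                (number, 'expected index %d, found %s' % (expected, fields[0])))
        expected += 1
    return problems


sample = ['1 EU NNP I-NP I-ORG', '2 rejects VBZ I-VP O', '', '1 Peter NNP I-NP I-PER']
print(check_ner_lines(sample))
# → []
```

To check a file on disk, pass the open file object, for example `check_ner_lines(open(path))`.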

pvcastro · 2018-06-07T11:08:37Z

@XuezheMax, here's a script for adding the starting indexes. Do you think it's ok?

def add_starting_index(ifile, ofile):
    with open(ifile, 'r') as reader, open(ofile, 'w') as writer:
        prev = None
        skip_next = False
        for line in reader:
            if skip_next:
                skip_next = False
                continue
            line = line.strip()
            docstart = line.startswith('-DOCSTART-')
            if docstart:
                skip_next = True
            if len(line) == 0 or docstart:
                prev = None
                if not docstart:
                    writer.write('\n')
                continue

            tokens = line.split()

            if prev is None:
                prev = 1
            else:
                prev += 1

            indexed_tokens = [str(prev)] + tokens

            # print tokens
            writer.write(" ".join(indexed_tokens))
            writer.write('\n')

ducalpha · 2018-06-07T20:39:46Z

The following is the code I used (I just added the line index to Xuezhe's code) to convert the original CoNLL2003 files to the format used by run_ner_crf.sh. It yielded an F1 score of 91.36% in the best case (consistent with the paper).

def transform(ifile, ofile):
    """
    Transform original CoNLL2003 format to BIO format for the named entity column (last column) only
    :param ifile: input file name (an original CoNLL2003 data file)
    :param ofile: output file name
    """
    with open(ifile, 'r') as reader, open(ofile, 'w') as writer:
        prev = 'O'
        line_idx = 1
        for line in reader:
            line = line.strip()
            if len(line) == 0:
                line_idx = 1
                prev = 'O'
                writer.write('\n')
                continue

            tokens = line.split()
            # print tokens
            label = tokens[-1]
            if label != 'O' and label != prev:
                if prev == 'O':
                    label = 'B-' + label[2:]
                elif label[2:] != prev[2:]:
                    label = 'B-' + label[2:]
                else:
                    label = label
            tokens.insert(0, str(line_idx))
            writer.write(" ".join(tokens[:-1]) + " " + label)
            writer.write('\n')
            prev = tokens[-1]
            line_idx += 1


transform("eng.train", "eng.train.bio.conll")
transform("eng.testa", "eng.dev.bio.conll")
transform("eng.testb", "eng.test.bio.conll")

hwijeen · 2018-07-16T07:32:17Z

Could you give a more detailed explanation of the data format for dependency parsing?
You have already provided an example, but I am still not clear what each column means.
(The second column is _ for everything: what does it mean? Shouldn't it be something related to the lemma, as is the case in conllu?)

Plus, does the format you used include a line with annotation? For example, conllu format typically has two lines starting with #, to indicate sentence id and raw text.

Thanks in advance!

XuezheMax · 2018-07-16T22:12:06Z

That column (the third one, after ID and FORM) is reserved for the lemma, the same as in conllu. Our model does not use lemma information, so it can be filled with anything.

Our format does not include the lines starting with #.

steambread666 · 2018-08-25T12:36:16Z

@XuezheMax Could you share the data used for POS tagging? Thanks in advance!

XuezheMax · 2018-08-26T23:35:50Z

Hi, the data is under the PTB license. If that is not an issue for you, I am happy to send you the data. Can you give me your email?

steambread666 · 2018-08-27T03:45:14Z

@XuezheMax I've sent you an email. Thank you very much!

KyrieEleison10 · 2019-03-13T09:31:08Z

Hi, thanks for your code and data format. I am still confused about the data format, so I am not sure that I used it correctly. Could you describe the full schema of your CoNLL-X format and NER data format? Or could you share your data with me? Thanks in advance.

I guess the schema of the CoNLL format is:
( ID, FORM, LEMMA, POSTAG1, POSTAG2, CPOSTAG, HEAD, DEPREL, PHEAD, PDEPREL )
and the NER data format is:
( ID, FORM, POSTAG, CHUNK, NERTAG )

Is this the right schema?

XuezheMax · 2019-03-13T16:33:17Z

For the CoNLL-X format, the schema is:
ID, FORM, LEMMA, CPOSTAG, POSTAG, MORPH-FEATURES, HEAD, DEPREL, PHEAD, PDEPREL

For NER data, the schema is:
ID, FORM, POSTAG, CHUNK, NERTAG
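
To make the two schemas concrete, here is a minimal sketch that reads lines of either format into named fields. It is an illustration only, not the reader used in this repository; the names `CoNLLXToken`, `NERToken`, and `parse_sentences`, as well as the lowercase field names, are hypothetical. It assumes whitespace-separated columns and blank lines between sentences.

```python
from collections import namedtuple

# Field names mirror the two schemas above (MORPH-FEATURES is shortened to feats).
CoNLLXToken = namedtuple(
    'CoNLLXToken',
    ['id', 'form', 'lemma', 'cpostag', 'postag', 'feats',
     'head', 'deprel', 'phead', 'pdeprel'])
NERToken = namedtuple('NERToken', ['id', 'form', 'postag', 'chunk', 'nertag'])


def parse_sentences(lines, token_type):
    """Group token lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) != len(token_type._fields):
            raise ValueError('unexpected number of columns: %r' % line)
        current.append(token_type(*fields))
    if current:
        sentences.append(current)
    return sentences


dep = parse_sentences(['7 Monday _ NN NNP _ 0 root _ _'], CoNLLXToken)
print(dep[0][0].head, dep[0][0].deprel)
# → 0 root
```

All fields are kept as strings; convert `id` and `head` to integers if you need them as numbers.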

KyrieEleison10 · 2019-03-14T01:43:24Z

Thank you for your reply!

subbayya · 2019-12-05T01:19:05Z

Hi
How do I get the Penn Treebank datasets?
POS-penn/wsj
Thanks,
Sankar

hyenee · 2020-03-30T11:05:56Z

Hi, thanks for your code and data format. I am still confused about the datasets.
I want to know how to get 'data/POS-penn/wsj/'.
Thanks in advance!

XuezheMax · 2020-03-30T12:10:43Z

For the POS tagging dataset, you need to get it from the Penn Treebank.

YuxianMeng · 2020-12-07T08:29:13Z

Hi, I'm very interested in your nice work, and I'd love to build my new model upon yours.
However, I cannot find appropriate data to reproduce your work. Could you please share the conllx-style dependency parsing data you used so I can reproduce your results?
Looking forward to your reply @XuezheMax ~

XuezheMax · 2020-12-11T02:13:40Z

Hey @YuxianMeng

For the dependency parsing data, please provide your email so that I can send it to you.
Since the data are from the PTB corpus, please make sure that the license is not an issue for you.

YuxianMeng · 2020-12-12T09:10:12Z

@XuezheMax Hi, license is not an issue for me. Actually we have downloaded and processed PTB now. Just want to double-check our data :). My email is yuxian_meng@shannonai.com and thanks again~

ArthurWish · 2024-11-02T08:10:53Z

@XuezheMax Hello, I am very interested in your work on dependency parsing, which has greatly inspired me. I am currently trying to reproduce your research results. Would it be possible for you to share the conllx-style dependency parsing data you used, so that I can replicate your experiments? Additionally, could you provide the source or link for downloading the sskip.eng.100.gz file? Thank you very much for your help! My email is chen_yn@zju.edu.cn and thanks again!

ayrtondenner mentioned this issue Apr 24, 2018

Using the NeuroNLP2 in a different data format #11

Closed

pvcastro mentioned this issue Jun 7, 2018

Trying to achieve same results as "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" paper #13

Closed

XuezheMax mentioned this issue Sep 4, 2018

No such file or directory: 'data/sskip/sskip.ger.64.gz' && data/sskip/sskip.eng.100.gz && data/conll2003/english/eng.train.bioes.conll #29

Closed
