LREC

Seed Annotated Corpus for Entity Resolution in Email Conversations

A corpus of 46 Enron email threads manually annotated for the entity coreference resolution task. The actual emails can be downloaded from here.

More details are available in our paper (which should be cited if you use or discuss this corpus in your work).

@inproceedings{dakle-etal-2020-study,
    title = "A Study on Entity Resolution for Email Conversations",
    author = "Dakle, Parag Pravin  and
      Desai, Takshak  and
      Moldovan, Dan",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.8",
    pages = "65--73",
    abstract = "This paper investigates the problem of entity resolution for email conversations and presents a seed annotated corpus of email threads labeled with entity coreference chains. Characteristics of email threads concerning reference resolution are first discussed, and then the creation of the corpus and annotation steps are explained. Finally, performance of the current state-of-the-art deep learning models on the seed corpus is evaluated and qualitative error analysis on the predictions obtained is presented.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Corpus Description

The seed corpus contains 46 email threads comprising 245 email messages. The threads are divided into 36 training threads and 10 test threads.

An email thread annotation is saved in the CoNLL format with the following naming convention:

username_email_no.conll

where:

username - Name of the user directory in the Enron Email Corpus.

email_no - Filename in the inbox folder of the specific user.
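
As a minimal sketch of how the naming convention above could be taken apart in Python: the function name `parse_annotation_filename`, the `ThreadId` type, and the example path are hypothetical, and the code assumes that email_no contains no underscore, so the name is split at its last underscore.

```python
from pathlib import Path
from typing import NamedTuple


class ThreadId(NamedTuple):
    username: str  # user directory in the Enron Email Corpus
    email_no: str  # filename in that user's inbox folder


def parse_annotation_filename(path: str) -> ThreadId:
    """Split 'username_email_no.conll' into its two parts."""
    p = Path(path)
    if p.suffix != ".conll":
        raise ValueError(f"not a .conll file: {path}")
    # Assumption: email_no has no underscore, so split at the LAST underscore.
    username, sep, email_no = p.stem.rpartition("_")
    if not sep or not username or not email_no:
        raise ValueError(f"unexpected file name: {p.name}")
    return ThreadId(username, email_no)


# Hypothetical file name, used only to illustrate the convention.
print(parse_annotation_filename("train/some-user_42.conll"))
```

Splitting at the last underscore keeps the sketch usable even if a user directory name were to contain an underscore itself.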

Each annotation file is a four-column, tab-separated file containing the token, speaker, entity type (P: PER, O: ORG, L: LOC, D: DIG), and coreference annotations. The columns, in the order they appear, are:

Column	Type	Description
1	Token	The actual token as found in the email thread
2	Speaker	The speaker of the token
3	Entity Type	The type of the entity this token represents. This column also carries two additional annotations: coreference chain information for the entity type, encoded in a parenthesis structure, and whether the token is part of the antecedent, marked by "*". E.g., in "(P0", "(" implies the token starts a mention span, "P" implies the token is of the PER entity type, and "0" implies the token belongs to the coreference chain with id 0 for the PER entity type; a "*" implies the token is part of the antecedent of the coreference chain.
4	Coreference	Coreference chain information encoded in a parenthesis structure.
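
The column layout above could be read along the following lines. This is a rough sketch, not the authors' loader: `read_rows`, `decode_entity`, `Row`, and `Mention` are hypothetical names, blank lines are assumed to be separators, and the "*" antecedent marker is accepted anywhere in the entity annotation because its exact position is not specified here.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

ENTITY_TYPES = {"P": "PER", "O": "ORG", "L": "LOC", "D": "DIG"}


class Row(NamedTuple):
    token: str
    speaker: str
    entity: str  # column 3, e.g. "(P0"
    coref: str   # column 4


class Mention(NamedTuple):
    starts_span: bool
    entity_type: str
    chain_id: int
    antecedent: bool


def read_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Yield one Row per non-blank, four-column, tab-separated line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines only separate units
        parts = line.split("\t")
        if len(parts) != 4:
            raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
        yield Row(*parts)


def decode_entity(annotation: str) -> Optional[Mention]:
    """Decode an annotation such as "(P0"; return None if nothing matches."""
    antecedent = "*" in annotation  # assumption: "*" may appear anywhere
    match = re.search(r"(\()?([POLD])(\d+)", annotation.replace("*", ""))
    if match is None:
        return None
    opener, letter, chain = match.groups()
    return Mention(opener is not None, ENTITY_TYPES[letter], int(chain), antecedent)


print(decode_entity("(P0"))
```

The sketch decodes only the first mention it finds in a cell; nested or multiple mentions on one token would need a loop over `re.finditer` instead.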

Experiments

The code used to generate the results can be found here. Evaluation scripts for all metrics can be found here.
