A corpus of 46 Enron email threads, manually annotated for the entity coreference resolution task. The actual emails can be downloaded from here.
More details are available in our paper (which should be cited if you use or discuss this corpus in your work).
    @inproceedings{dakle-etal-2020-study,
        title = "A Study on Entity Resolution for Email Conversations",
        author = "Dakle, Parag Pravin and Desai, Takshak and Moldovan, Dan",
        booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
        month = may,
        year = "2020",
        address = "Marseille, France",
        publisher = "European Language Resources Association",
        url = "https://www.aclweb.org/anthology/2020.lrec-1.8",
        pages = "65--73",
        abstract = "This paper investigates the problem of entity resolution for email conversations and presents a seed annotated corpus of email threads labeled with entity coreference chains. Characteristics of email threads concerning reference resolution are first discussed, and then the creation of the corpus and annotation steps are explained. Finally, performance of the current state-of-the-art deep learning models on the seed corpus is evaluated and qualitative error analysis on the predictions obtained is presented.",
        language = "English",
        ISBN = "979-10-95546-34-4",
    }
The seed corpus contains 46 email threads comprising 245 email messages. The threads are divided into 36 training threads and 10 test threads.
An email thread annotation is saved in the CoNLL format with the following naming convention:
username_email_no.conll
where:
username - Name of the user directory in the Enron Email Corpus.
email_no - Filename in the inbox folder of the specific user.
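As a minimal sketch of how the naming convention above could be taken apart in code, the helper below splits a file name on its last underscore. The function name `split_annotation_name` and the example file name are hypothetical, and the sketch assumes that `email_no` itself contains no underscore (the user directory name may).

```python
from pathlib import Path


def split_annotation_name(path):
    """Split 'username_email_no.conll' into (username, email_no).

    Assumption: email_no contains no underscore, so the last
    underscore in the name separates the two parts.
    """
    name = Path(path).name
    suffix = ".conll"
    if not name.endswith(suffix):
        raise ValueError(f"not a .conll file: {name}")
    stem = name[: -len(suffix)]
    username, sep, email_no = stem.rpartition("_")
    if not sep or not username or not email_no:
        raise ValueError(f"unexpected file name: {name}")
    return username, email_no


# Hypothetical file name, used only for illustration.
print(split_annotation_name("allen-p_12.conll"))
```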
Each annotation file is a four-column, tab-separated file that gives, for every token, its speaker, entity type (P: PER, O: ORG, L: LOC, D: DIG), and coreference annotations. The columns, in the order they appear, are:
Column | Type | Description |
---|---|---|
1 | Token | The actual token as found in the email thread |
2 | Speaker | The speaker of the token |
3 | Entity Type | The type of the entity this token represents. This column also carries two additional annotations: coreference chain information for the entity type, encoded in a parenthesis structure, and an antecedent marker, `*`. E.g., in `(P0*`, `(` means the token starts a mention span, `P` means the token is of the PER entity type, `0` means the token belongs to the coreference chain with id 0 for the PER entity type, and `*` means it is part of the antecedent of the coreference chain. |
4 | Coreference | Coreference chain information encoded in a parenthesis structure. |
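
The column layout above can be read with a few lines of Python. This is only a sketch under stated assumptions, not the corpus authors' loader: the names `Row`, `read_rows`, and `decode_entity_tag` are hypothetical, blank lines are assumed to be separators and are skipped, and the decoder handles only a single tag of the shape shown in the `(P0*` example. Anything that follows the chain id and the optional `*` (such as a closing parenthesis) is ignored, since the exact closing syntax is not described here.

```python
import re
from typing import NamedTuple, Optional

ENTITY_TYPES = {"P": "PER", "O": "ORG", "L": "LOC", "D": "DIG"}

# Assumed tag shape, based on the "(P0*" example: optional "(",
# a type letter, a chain id, and an optional "*" antecedent marker.
TAG = re.compile(r"(?P<open>\()?(?P<type>[POLD])(?P<chain>\d+)(?P<ante>\*)?")


class Row(NamedTuple):
    token: str
    speaker: str
    entity: str
    coref: str


def read_rows(lines):
    """Parse tab-separated lines into Row tuples, skipping blank lines."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 4:
            raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
        rows.append(Row(*parts))
    return rows


def decode_entity_tag(tag) -> Optional[dict]:
    """Decode a tag such as '(P0*'; return None if it does not fit the assumed shape."""
    m = TAG.match(tag)
    if m is None:
        return None
    return {
        "starts_mention": m.group("open") is not None,
        "entity_type": ENTITY_TYPES[m.group("type")],
        "chain_id": int(m.group("chain")),
        "antecedent": m.group("ante") is not None,
    }


# Made-up sample line; the speaker and coreference values are placeholders.
sample = ["John\tspeaker1\t(P0*\t(0\n"]
for row in read_rows(sample):
    print(row.token, decode_entity_tag(row.entity))
```

To load a real file, the same function can be given an open file handle, for example `read_rows(open(path, encoding="utf-8"))`.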
The code used to generate the results can be found here. Evaluation scripts for all metrics can be found here.