This repository contains UA-GEC data and an accompanying Python library.
All corpus data and metadata are stored under the `./data` directory. It has two subfolders, one for the train split and one for the test split. Each split has further subfolders for different data representations:
- `./data/{train,test}/annotated` stores documents in the annotated format.
- `./data/{train,test}/source` and `./data/{train,test}/target` store the original and the corrected versions of documents, respectively. Text files in these directories are plain text with no annotation markup. These files were produced from the annotated data and are therefore redundant; we keep them because the plain-text format is convenient in some use cases.
`./data/metadata.csv` stores per-document metadata. It is a CSV file with the following fields:

- `id` (str): document identifier.
- `author_id` (str): document author identifier.
- `is_native` (int): 1 if the author is a native speaker, 0 otherwise.
- `region` (str): the author's region of birth. A special value "Інше" ("Other") is used both for authors who were born outside Ukraine and for authors who preferred not to specify their region.
- `gender` (str): "Жіноча" (female), "Чоловіча" (male), or "Інша" (other).
- `occupation` (str): one of "Технічна" (technical), "Гуманітарна" (humanities), "Природнича" (natural sciences), or "Інша" (other).
- `submission_type` (str): one of "essay", "translation", or "text_donation".
- `source_language` (str): for submissions of the "translation" type, this field indicates the source language of the translated text. Possible values are "de", "en", "fr", "ru", and "pl".
- `annotator_id` (int): ID of the annotator who corrected the document.
- `partition` (str): "train" or "test".
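Since the metadata is plain CSV, it is easy to inspect with Python's standard library alone. A minimal sketch, assuming the repository root as the working directory:

```python
import csv

# Read per-document metadata; field names are those listed above.
with open("./data/metadata.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Example: count train-split documents written by native speakers.
# csv.DictReader yields strings, hence the comparison against "1".
native_train = [r for r in rows
                if r["partition"] == "train" and r["is_native"] == "1"]
print(len(native_train))
```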
Annotated files are text files that use the following in-text annotation format: `{error=>edit:::error_type=Tag}`, where `error` and `edit` stand for the text before and after correction, respectively, and `Tag` denotes an error category (`Grammar`, `Spelling`, `Punctuation`, or `Fluency`).

Example of an annotated sentence:

```
I {likes=>like:::error_type=Grammar} turtles.
```
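For illustration, the format is simple enough to process with a regular expression. The following is a small sketch based only on the pattern shown above, not the package's own parser (the `ua_gec` package described below provides proper tooling for this):

```python
import re

# Matches the {error=>edit:::error_type=Tag} spans described above.
ANNOTATION = re.compile(
    r"\{(?P<error>.*?)=>(?P<edit>.*?):::error_type=(?P<tag>[^}]*)\}"
)

def to_source(annotated: str) -> str:
    """Drop the edits, recovering the original (erroneous) text."""
    return ANNOTATION.sub(lambda m: m.group("error"), annotated)

def to_target(annotated: str) -> str:
    """Apply the edits, producing the corrected text."""
    return ANNOTATION.sub(lambda m: m.group("edit"), annotated)

sentence = "I {likes=>like:::error_type=Grammar} turtles."
print(to_source(sentence))  # I likes turtles.
print(to_target(sentence))  # I like turtles.
```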
An accompanying Python package, `ua_gec`, provides tools for working with annotated texts. See its documentation for details.
We expect users of the corpus to train and tune their models on the train split only. Feel free to split it further into train and dev portions (or use cross-validation).

Please use the test split only for reporting the scores of your final model. In particular, never optimize on the test set: do not tune hyperparameters on it, and do not use it for model selection in any way.
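One reasonable way to carve a dev portion out of the train split is to hold out whole authors, so that no author's writing appears on both sides. The following sketch works off the metadata file; the 10% ratio and the fixed seed are illustrative assumptions, not a recommendation of the corpus authors:

```python
import csv
import random

# Hold out ~10% of train authors as a dev set (the ratio is an assumption).
with open("./data/metadata.csv", newline="", encoding="utf-8") as f:
    train_rows = [r for r in csv.DictReader(f) if r["partition"] == "train"]

authors = sorted({r["author_id"] for r in train_rows})
random.Random(0).shuffle(authors)  # fixed seed for reproducibility
dev_authors = set(authors[: max(1, len(authors) // 10)])

dev_ids = {r["id"] for r in train_rows if r["author_id"] in dev_authors}
train_ids = {r["id"] for r in train_rows if r["author_id"] not in dev_authors}
print(len(train_ids), len(dev_ids))
```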
The next section lists the per-split statistics.
UA-GEC contains:
| Split | Documents | Sentences | Tokens  | Authors |
|-------|-----------|-----------|---------|---------|
| train | 851       | 18,225    | 285,247 | 416     |
| test  | 160       | 2,490     | 43,432  | 76      |
| TOTAL | 1,011     | 20,715    | 328,779 | 492     |
The following command computes the corpus statistics (note that the `ua-gec` package must be installed first):

```
$ python ./python/ua_gec/stats.py
```
As an alternative to operating on the data files directly, you may use the `ua_gec` Python package. This package includes the data and provides classes to iterate over documents, read metadata, work with annotations, and so on.

The package can be installed with pip:

```
$ pip install ua_gec==1.0
```

Alternatively, you can install it from the source code:

```
$ cd python
$ python setup.py develop
```
Once installed, you can access annotated documents from Python code:

```python
>>> from ua_gec import Corpus
>>> corpus = Corpus(partition="train")
>>> for doc in corpus:
...     print(doc.source)
...     print(doc.target)
...     print(doc.annotated)
...     print(doc.meta.region)
```
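For example, the attributes shown above are enough to dump the train split into parallel plain-text files for a typical GEC training pipeline. A sketch in which the output file names are arbitrary choices, not part of the package:

```python
from ua_gec import Corpus

# Write the source and target sides of the train split into two
# parallel files; the output paths are arbitrary.
corpus = Corpus(partition="train")
with open("train.src", "w", encoding="utf-8") as src, \
     open("train.tgt", "w", encoding="utf-8") as tgt:
    for doc in corpus:
        src.write(doc.source.strip() + "\n")
        tgt.write(doc.target.strip() + "\n")
```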
[The docs are under construction]
- The data collection is an ongoing activity. You can always contribute your Ukrainian writings or complete one of the writing tasks at https://ua-gec-dataset.grammarly.ai/
- Code and documentation improvements are welcome. Please submit a pull request.