The PDF version of the paper is available at the following link: FlaCGEC: A Chinese Grammatical Error Correction Dataset with Fine-grained Linguistic Annotation
If you find our work helpful for your research, please cite our paper:
@inproceedings{Hanyue_Du_CIKM23,
author = {Du, Hanyue and Zhao, Yike and Tian, Qingyuan and Wang, Jiani and Wang, Lei and Lan, Yunshi and Lu, Xuesong},
title = {FlaCGEC: A Chinese Grammatical Error Correction Dataset with Fine-Grained Linguistic Annotation},
year = {2023},
isbn = {9798400701245},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3583780.3615119},
doi = {10.1145/3583780.3615119},
booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management},
pages = {5321–5325},
numpages = {5},
keywords = {Chinese grammatical error correction, deep learning, fine-grained linguistic annotation},
location = {Birmingham, United Kingdom},
series = {CIKM '23}
}
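Chinese Grammatical Error Correction (CGEC) aims to detect and correct all grammatical errors in a sentence, and it has attracted growing attention from researchers. Although several CGEC datasets have been developed to support this research, they still cannot provide a deep linguistic topology of grammatical errors. To address this limitation, this repository provides a new CGEC dataset, FlaCGEC, which carries fine-grained linguistic annotation covering 78 instantiated grammar points and 3 edit types. The overall statistics of the data are shown in the table below.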
Properties | Train | Dev | Test |
---|---|---|---|
Sentences | 10804 | 1334 | 1325 |
Average source sentence length | 35.09 | 34.76 | 35.83 |
Average target sentence length | 35.59 | 35.29 | 36.34 |
Edits per sentence | 1.72 | 1.69 | 1.71 |
Grammar points | 77 | 69 | 72 |
The dataset can be downloaded from the data folder of this repository: https://github.com/hyDududu/FlaCGEC/tree/main/data
The FlaCGEC dataset is stored as JSON files.
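Because the files are plain JSON, one way to get oriented is to print the key and type layout of a split before writing any processing code. The sketch below is only an illustration: `summarize` and `summarize_file` are hypothetical helper names, the toy input is a placeholder and not the FlaCGEC schema, and the code assumes each file holds a single JSON document rather than JSON Lines.

```python
import json
from typing import Any


def summarize(obj: Any, max_depth: int = 3) -> Any:
    """Return a nested summary of keys and value types, without the values."""
    if max_depth == 0:
        return type(obj).__name__
    if isinstance(obj, dict):
        return {key: summarize(value, max_depth - 1) for key, value in obj.items()}
    if isinstance(obj, list):
        # Only the first element is described; this assumes list items share one shape.
        return [summarize(obj[0], max_depth - 1)] if obj else []
    return type(obj).__name__


def summarize_file(path: str) -> Any:
    """Load one JSON file (path supplied by the caller) and summarize its layout."""
    with open(path, encoding="utf-8") as handle:
        return summarize(json.load(handle))


# Placeholder input, NOT the dataset's real schema.
toy = json.loads('[{"a": 1, "b": ["x", "y"]}]')
print(summarize(toy))
```

Calling `summarize_file` on a file downloaded from the data folder would show the actual field names, which can then be used in place of any guesses.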
The examples below are drawn from the FlaCGEC dataset. A sentence may contain multiple errors, and the errors can involve different components of the sentence. Each example gives the source sentence [S], the target sentence [T], and the annotation [A].

- [S] 节日期间,每饭店纷纷推出特色餐饮特惠措施,吸引市民走进饭店.
  - Translation: During the festival, per hotel introduces special cuisines promotion activities, attracting citizens to walk in.
  - [T] 节日期间,各饭店纷纷推出特色餐饮和特惠措施,吸引市民走进饭店。
  - Translation: During the festival, every hotel introduces special cuisines and promotion activities, attracting citizens to walk in.
  - [A] 5 5|||S-Demonstrative pronouns指示代词|||各;16 16|||M-Prepositions for objects介词引出对象|||和
- [S] 睡觉时,身体感觉到,人就容易梦到什么内容。
  - Translation: During sleeping, people easily dream the bodies feel.
  - [T] 睡觉时,身体感觉到什么,人就容易梦到什么内容。
  - Translation: During sleeping, people easily dream what the bodies feel.
  - [A] 9 9|||M-Non-interrogative use of interrogative pronouns疑问词的非疑问用法|||什么
- [S] 他听很不服气地说:“我尽力而为了已经!”
  - Translation: He listens and said disgruntledly: “I already have tried!”
  - [T] 他听了很不服气地说:“我已经尽力而为了!”
  - Translation: He listened and said disgruntledly: “I have already tried!”
  - [A] 2 2|||M-Aspect particle动态助词|||了;16 17|||W-Adverbs of time时间副词|||None
- [S] 但有没受到老板的责备,而且他心里很失落。
  - Translation: But did he receive the blame from his boss, and he is upset.
  - [T] 虽然没有受到老板的责备,但是他心里很失落。
  - Translation: Even though he did not receive the blame from his boss, he is upset.
  - [A] 0 0|||S-Conjunctions for connecting clauses介词连接分句|||虽然;2 2|||W-Negative adverb否定副词|||没;11 12|||W-Conjunctions for connecting clauses介词连接分句|||但是
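The [A] strings in these examples follow a visible pattern: edits are separated by `;`, and each edit has three `|||`-separated fields (a character span, a typed grammar-point label, and a correction). The sketch below parses that pattern and replays the simplest edits. Everything beyond the pattern itself is an assumption inferred from the examples, not a documented specification: the names `Edit`, `parse_annotation`, and `apply_simple_edits` are hypothetical; `M` is read as "insert the correction before the start index"; `S` is read as "replace the inclusive span"; `W` edits are parsed but not applied, because the examples do not make their replay rule clear.

```python
import re
from typing import List, NamedTuple, Optional


class Edit(NamedTuple):
    start: int
    end: int
    edit_type: str             # "S", "M" or "W", as seen in the examples
    grammar_point: str
    correction: Optional[str]  # None when the annotation literally says "None"


def parse_annotation(annotation: str) -> List[Edit]:
    """Split an [A] string into edits (accepts full-width or ASCII semicolons)."""
    edits = []
    for chunk in re.split(r"[;;]", annotation):
        chunk = chunk.strip()
        if not chunk:
            continue
        span, label, correction = (part.strip() for part in chunk.split("|||"))
        start, end = (int(number) for number in span.split())
        edit_type, _, grammar_point = label.partition("-")
        edits.append(Edit(start, end, edit_type, grammar_point,
                          None if correction == "None" else correction))
    return edits


def apply_simple_edits(source: str, edits: List[Edit]) -> str:
    """Apply only M and S edits, right to left so earlier indices stay valid."""
    text = source
    for edit in sorted(edits, key=lambda e: e.start, reverse=True):
        if edit.correction is None:
            continue
        if edit.edit_type == "M":    # assumed: insert before index `start`
            text = text[:edit.start] + edit.correction + text[edit.start:]
        elif edit.edit_type == "S":  # assumed: replace inclusive span
            text = text[:edit.start] + edit.correction + text[edit.end + 1:]
    return text


source = "睡觉时,身体感觉到,人就容易梦到什么内容。"
edits = parse_annotation(
    "9 9|||M-Non-interrogative use of interrogative pronouns疑问词的非疑问用法|||什么"
)
print(apply_simple_edits(source, edits))
# → 睡觉时,身体感觉到什么,人就容易梦到什么内容。
```

Applying edits from the highest start index downward avoids recomputing offsets after each insertion, which matters as soon as a sentence carries more than one edit.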
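The table below lists some of the instantiated grammar points with corresponding examples.
Grammar Points | Instantiations | Examples |
---|---|---|
Adverbs of degree[程度副词] | 很 | 有的人很从容。 |
Adverbs of degree[程度副词] | 有点儿 | 左边这瓶有点儿酸。 |
Conjunctions for connecting clauses[介词连接分句] | 如果 | 如果没有标记,散落的片断将… |
Conjunctions for connecting clauses[介词连接分句] | 因此 | 因此,人们以乌龟指长寿。 |
Conjunctions for connecting clauses[介词连接分句] | 总之 | 总之,电视带给我们知识和娱乐。 |
Modal verbs[能愿动词] | 需要 | 这项工程至少需要10年时间才能完工。 |
Modal verbs[能愿动词] | 得 | 妈妈生病了,我得马上回国去看她。 |
Passive sentences[被动句] | 被 | 自行车被当做一种交通工具。 |
Passive sentences[被动句] | 被…所 | 快乐的人不会被痛苦所左右。 |
Successive complex sentences[承接复句] | 于是 | 唐太宗很生气,于是召集群臣,当面训斥魏征。 |
Successive complex sentences[承接复句] | 便 | 司马光受父亲影响,自幼便聪明好学。 |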