Chinese General-Purpose Information Extraction Large Model (Chinchilla)

This is the repository for the Chinchilla project, which aims to build a large general-purpose information extraction model for Chinese.

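We welcome any information extraction datasets (or pointers to their sources) that we have not yet collected. We will convert them to a common format and merge them, together with the instructions we construct, into a single unified dataset. We will then train our model on that dataset, run extensive empirical studies, and open-source the model checkpoints. We hope the project makes a modest contribution to open-source information extraction models and lowers the barrier to information extraction tasks.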

Data Collection

Languages:

EN: English
CN: Chinese
ML: Multiple languages

Tasks:

NER: Named Entity Recognition
RE: Relation Extraction
EE: Event Extraction

Dataset	Domain	Size	Language	Task	Source
DuIE2.0	Humanities	210K	CN	RE	https://www.luge.ai/#/luge/dataDetail?id=5
DuEE1.0	News	17K	CN	EE	https://www.luge.ai/#/luge/dataDetail?id=6
DuEE-fin	Finance	11.7K	CN	EE	https://www.luge.ai/#/luge/dataDetail?id=7
IREE	Finance	50K	CN	EE	https://www.luge.ai/#/luge/dataDetail?id=72
SanWen	Chinese literature	21K	CN	RE	https://github.com/thunlp/Chinese_NRE/tree/master/data/SanWen
BosonNER	General	120K	CN	NER	https://github.com/HuHsinpang/BosonNER-Pretreatment/tree/master/boson/data
MSRANER	General	50K	CN	NER	https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA
FinRE	Finance	18K	CN	RE	https://github.com/thunlp/Chinese_NRE/tree/master/data/FinRE
SemEval-2010 Task 8	General	10K	EN	RE	https://github.com/thunlp/OpenNRE/blob/master/benchmark/download_semeval.sh
TACRED	General	106K	EN	NER, RE	https://github.com/yuhaozhang/tacred-relation/tree/master/dataset/tacred
NYT10	General	694K	EN	RE	https://github.com/thunlp/OpenNRE/blob/master/benchmark/download_nyt10.sh
DocRED	General	UNK	EN	RE	https://drive.google.com/drive/folders/1c5-0YwnoJx8NS6CV2f-NoTHR__BdkNqw
CLUENER2020	General	11K	CN	NER	https://www.cluebenchmarks.com/introduce.html
Title2Event	News	42K	CN	EE	https://open-event-hub.github.io/title2event/
BioRED	Biomedical	UNK	EN	RE	https://github.com/ncbi/BioRED
Entertainment NER-Youku (文娱NER-Youku)	Entertainment	10K	CN	NER	https://github.com/allanj/ner_incomplete_annotation/tree/master/data/youku
CoNLL2003	News	284K	EN	NER	https://github.com/allanj/ner_incomplete_annotation/tree/master/data/conll2003
E-commerce NER-Taobao (电商NER-Taobao)	E-commerce	8K	CN	NER	https://github.com/allanj/ner_incomplete_annotation/tree/master/data/ecommerce
Finance NER-Sina Finance (财经NER-新浪财经)	Finance	5K	CN	NER	https://github.com/jiesutd/LatticeLSTM/tree/master/data
People's Daily-NER (人民日报-NER)	News	26K+	CN	NER	https://github.com/zjy-ucas/ChineseNER/tree/master/data
Smart Education Open Knowledge Dataset (智慧教育开放知识数据集)	Education	185K	CN	NER, RE	https://blog.csdn.net/qq_36426650/article/details/87719204
Military Equipment Test and Evaluation-NER (军事装备试验鉴定-NER)	Military	0.8K	CN	NER	https://github.com/hy-struggle/ccks_ner/tree/master/militray/PreModel_Encoder_CRF/data
CMeEE	Medical	23K	CN	NER	https://tianchi.aliyun.com/dataset/95414
CMeIE	Medical	22K	CN	RE	https://tianchi.aliyun.com/dataset/95414
Bank Lending 2021-NER (银行借贷2021-NER)	Finance	10K	CN	NER	https://www.heywhale.com/mw/dataset/617969ec768f3b0017862990/file
SKE 2019	General	210K	CN	NER, RE	https://toscode.gitee.com/yiweilu/Entity-Relation-Extraction/tree/master/raw_data
Task Dialogue 2018-NER (任务对话2018-NER)	General	21K	CN	NER	http://tcci.ccf.org.cn/conference/2018/taskdata.php#
CoNLL04	News	9K	EN	RE	http://lavis.cs.hs-rm.de/storage/spert/public/datasets/conll04/
OntoNotes 4.0	News	50K	CN	NER	https://www.datafountain.cn/competitions/510/datasets
firefly-train-1.1M	General	UNK	CN	NER	https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M
IE INSTRUCTIONS	General	UNK	EN	NER, RE, EE	https://drive.google.com/file/d/1T-5IbocGka35I7X3CE6yKe5N_Xg2lVKT/view
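
Because the language and task columns use fixed codes, the catalog is easy to filter programmatically. The following is a minimal sketch, assuming the table is kept as tab-separated text; the function names `load_catalog` and `select` are hypothetical, the domain labels are given in English, and only a few rows are reproduced (without the source column) to keep the example short.

```python
import csv
import io

# A few rows from the catalog, tab-separated in the same column order
# (source URLs omitted for brevity).
CATALOG_TSV = """\
Dataset\tDomain\tSize\tLanguage\tTask
DuIE2.0\tHumanities\t210K\tCN\tRE
DuEE1.0\tNews\t17K\tCN\tEE
TACRED\tGeneral\t106K\tEN\tNER, RE
CLUENER2020\tGeneral\t11K\tCN\tNER
SKE 2019\tGeneral\t210K\tCN\tNER, RE
"""


def load_catalog(tsv_text):
    """Parse the tab-separated catalog into dicts, splitting multi-task cells."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # A cell such as "NER, RE" becomes ["NER", "RE"].
        row["Task"] = [t.strip() for t in row["Task"].split(",")]
        rows.append(row)
    return rows


def select(rows, language=None, task=None):
    """Return dataset names matching the given language and/or task code."""
    return [
        r["Dataset"]
        for r in rows
        if (language is None or r["Language"] == language)
        and (task is None or task in r["Task"])
    ]


catalog = load_catalog(CATALOG_TSV)
print(select(catalog, language="CN", task="RE"))  # ['DuIE2.0', 'SKE 2019']
```

Splitting the task cell into a list lets datasets that cover several tasks, such as TACRED, match any one of them.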

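Data Format

References

To do

Data collection stage

Collect and organize as many existing information extraction datasets as possible, in both Chinese and English.
Translate the English datasets into Chinese with a machine translation model.
Build a model for automated data cleaning and quality control.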
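
The cleaning step above calls for a model, which this repository does not yet describe. As a rough illustration of what automated quality control could check, here is a rule-based stand-in (not a learned model, and not the project's method). The record fields `text` and `mentions` and the function `clean` are assumptions made for this sketch. The mention check may be especially relevant after machine translation, where a translated entity may no longer appear verbatim in the translated sentence.

```python
def clean(records):
    """Split records into (kept, rejected); rejected items carry a reason string."""
    seen = set()
    kept, rejected = [], []
    for rec in records:
        text = rec.get("text", "").strip()
        if not text:
            rejected.append((rec, "empty text"))
        elif text in seen:
            rejected.append((rec, "duplicate text"))
        elif any(m not in text for m in rec.get("mentions", [])):
            # An annotated mention that does not occur in the text is unusable.
            rejected.append((rec, "mention not found in text"))
        else:
            seen.add(text)
            kept.append(rec)
    return kept, rejected


# Hypothetical records used only to exercise the checks.
records = [
    {"text": "Apple was founded in 1976.", "mentions": ["Apple", "1976"]},
    {"text": "Apple was founded in 1976.", "mentions": ["Apple"]},
    {"text": "", "mentions": []},
    {"text": "Tesla builds cars.", "mentions": ["SpaceX"]},
]

kept, rejected = clean(records)
print(len(kept), [reason for _, reason in rejected])
```

Keeping the rejected records with a reason, rather than silently dropping them, makes it possible to audit how much data each rule removes.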

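Data construction stage

Construct dedicated instructions for each information extraction task.
Convert the data to a unified format and add the instructions, producing a large-scale Chinese instruction-tuning dataset for information extraction.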
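
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the project's data format is not yet documented, so the `instruction`/`input`/`output` record layout, the template wording, and the function `to_instruction_sample` are all hypothetical.

```python
import json

# Hypothetical instruction templates, one per task; the wording is an assumption.
TEMPLATES = {
    "NER": "Extract all named entities from the text and give the type of each.",
    "RE": "Extract all (head, relation, tail) triples expressed in the text.",
    "EE": "Extract all events in the text with their triggers and arguments.",
}


def to_instruction_sample(task, text, answer):
    """Wrap one annotated example in a unified instruction/input/output record."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    return {
        "instruction": TEMPLATES[task],
        "input": text,
        # Serialize the structured answer as JSON so every task shares one
        # output type; ensure_ascii=False keeps Chinese text human-readable.
        "output": json.dumps(answer, ensure_ascii=False, sort_keys=True),
    }


sample = to_instruction_sample(
    "RE",
    "Steve Jobs founded Apple in 1976.",
    [{"head": "Steve Jobs", "relation": "founder_of", "tail": "Apple"}],
)
print(json.dumps(sample, ensure_ascii=False, indent=2))
```

Using one record shape for NER, RE, and EE means samples from every source dataset can be shuffled into a single training file; only the instruction text tells the model which task is being asked.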
