Add dh_msra dataset
jinhuakst committed Apr 24, 2018
1 parent 68228a3 commit f372bcf
Showing 2 changed files with 164 additions and 0 deletions.
Binary file added datasets/dh_msra/dh_msra.zip
Binary file not shown.
164 changes: 164 additions & 0 deletions datasets/dh_msra/intro.ipynb
@@ -0,0 +1,164 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# About dh_msra\n",
"0. **Download:** [Github](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/dh_msra/dh_msra.zip)\n",
"1. **Overview:** more than 50,000 annotated sequences for Chinese named entity recognition ([IOB2](https://dl.acm.org/citation.cfm?id=977059) format, following the [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/) and [CRF++](https://taku910.github.io/crfpp/#format) conventions)\n",
"2. **Suggested task:** Chinese named entity recognition\n",
"3. **Data source:** unknown\n",
"4. **Original dataset:** [zh-NER-TF](https://github.com/Determined22/zh-NER-TF), collected from the web; the exact author and origin are unknown, and the corpus possibly comes from MSRA\n",
"5. **Processing:**\n",
"    1. Merged the original 2 files (train and test) into 1 file"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import codecs\n",
"import random\n",
"\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = 'path/to/dh_msra/folder/'  # placeholder: directory containing dh_msra.txt, with a trailing slash"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. dh_msra.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def load_iob2(file_path):\n",
"    '''Load data in IOB2 format.'''\n",
"    token_seqs = []\n",
"    label_seqs = []\n",
"    tokens = []\n",
"    labels = []\n",
"    # Assumes the file is UTF-8 encoded\n",
"    with codecs.open(file_path, encoding='utf-8') as f:\n",
"        for index, line in enumerate(f):\n",
"            items = line.strip().split()\n",
"            if len(items) == 2:\n",
"                token, label = items\n",
"                tokens.append(token)\n",
"                labels.append(label)\n",
"            elif len(items) == 0:\n",
"                # A blank line ends the current sequence\n",
"                if tokens:\n",
"                    token_seqs.append(tokens)\n",
"                    label_seqs.append(labels)\n",
"                    tokens = []\n",
"                    labels = []\n",
"            else:\n",
"                print('Format error. Line: {} Content: {}'.format(index, line))\n",
"                continue\n",
"\n",
"    if tokens:  # If the file does not end with a blank line, append the last sequence manually\n",
"        token_seqs.append(tokens)\n",
"        label_seqs.append(labels)\n",
"\n",
"    # dtype=object keeps the variable-length sequences as a 1-D array of lists\n",
"    return np.array(token_seqs, dtype=object), np.array(label_seqs, dtype=object)\n",
"\n",
"\n",
"def show_iob2(token_seqs, label_seqs, num=5, shuffle=True):\n",
"    '''Display data in IOB2 format.'''\n",
"    if shuffle:\n",
"        # Sample num random sequences (with replacement)\n",
"        length = len(token_seqs)\n",
"        indexes = [random.randrange(0, length) for i in range(num)]\n",
"        zip_seqs = zip(token_seqs[indexes], label_seqs[indexes])\n",
"    else:\n",
"        zip_seqs = zip(token_seqs[0:num], label_seqs[0:num])\n",
"\n",
"    for tokens, labels in zip_seqs:\n",
"        for token, label in zip(tokens, labels):\n",
"            print('{}/{} '.format(token, label), end='')\n",
"        print('\\n')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"token_seqs, label_seqs = load_iob2(path+'dh_msra.txt')\n",
"\n",
"print(len(token_seqs), len(label_seqs))\n",
"print() \n",
"show_iob2(token_seqs, label_seqs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Label descriptions\n",
"\n",
"| Label | Description |\n",
"| ---- | ---- |\n",
"| LOC | Location |\n",
"| ORG | Organization |\n",
"| PER | Person |"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set([label for labels in label_seqs for label in labels])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
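
The notebook in this commit loads each sentence as parallel token and label sequences in IOB2 format. As a rough sketch of how such labels can be turned back into entities, the helper below groups consecutive `B-`/`I-` tags into spans. The function name `extract_entities` and the sample sentence are hypothetical and not part of the commit; the tag spelling (`B-LOC`, `I-LOC`, `O`) is assumed from the IOB2 convention rather than checked against the data file.

```python
def extract_entities(tokens, labels):
    """Decode one IOB2-tagged sequence into (entity_text, entity_type) pairs."""
    entities = []
    current = []         # tokens of the entity currently being built
    current_type = None  # type of that entity, e.g. 'LOC'

    for token, label in zip(tokens, labels):
        # 'B-LOC' -> ('B', '-', 'LOC'); 'O' -> ('O', '', '')
        prefix, _, entity_type = label.partition('-')
        if prefix == 'B':
            # A B- tag always starts a new entity; close any open one first
            if current:
                entities.append((''.join(current), current_type))
            current, current_type = [token], entity_type
        elif prefix == 'I' and current and entity_type == current_type:
            # An I- tag continues the open entity of the same type
            current.append(token)
        else:
            # 'O', or an I- tag that does not continue the open entity
            if current:
                entities.append((''.join(current), current_type))
            current, current_type = [], None

    if current:  # close an entity that runs to the end of the sequence
        entities.append((''.join(current), current_type))
    return entities


# Hypothetical sample sentence (character-level tokens, as is common for Chinese NER)
tokens = ['我', '在', '北', '京', '见', '到', '了', '李', '明']
labels = ['O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'B-PER', 'I-PER']
print(extract_entities(tokens, labels))  # [('北京', 'LOC'), ('李明', 'PER')]
```

With sequences loaded by `load_iob2`, the same helper could be applied per sentence, for example `extract_entities(token_seqs[0], label_seqs[0])`. Tokens are joined without spaces here, which suits character-level Chinese data but would need adjusting for space-separated languages.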
