Showing 2 changed files with 164 additions and 0 deletions.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,164 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# dh_msra Overview\n", | ||
"0. **Download:** [GitHub](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/dh_msra/dh_msra.zip)\n", | ||
"1. **Data overview:** more than 50,000 annotated examples for Chinese named entity recognition ([IOB2](https://dl.acm.org/citation.cfm?id=977059) format, following the [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/) and [CRF++](https://taku910.github.io/crfpp/#format) conventions)\n", | ||
"2. **Recommended experiment:** Chinese named entity recognition\n", | ||
"3. **Data source:** unknown\n", | ||
"4. **Original dataset:** [zh-NER-TF](https://github.com/Determined22/zh-NER-TF), collected from the web; the specific author and source are unknown, and it may come from the MSRA corpus\n", | ||
"5. **Processing:**\n", | ||
"    1. Merged the original 2 files (train and test) into 1 file" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import codecs\n", | ||
"import random\n", | ||
"\n", | ||
"import numpy as np" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"path = 'path_to_the_dh_msra_folder/'  # placeholder; keep the trailing separator, since the file name is appended directly" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# 1. dh_msra.txt" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Load the data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"def load_iob2(file_path):\n", | ||
"    '''Load data in IOB2 format.'''\n", | ||
"    token_seqs = []\n", | ||
"    label_seqs = []\n", | ||
"    tokens = []\n", | ||
"    labels = []\n", | ||
"    # Assumes the file is UTF-8 encoded; without an explicit encoding the platform default is used\n", | ||
"    with codecs.open(file_path, encoding='utf-8') as f:\n", | ||
"        for index, line in enumerate(f, start=1):\n", | ||
"            items = line.strip().split()\n", | ||
"            if len(items) == 2:\n", | ||
"                token, label = items\n", | ||
"                tokens.append(token)\n", | ||
"                labels.append(label)\n", | ||
"            elif len(items) == 0:\n", | ||
"                if tokens:\n", | ||
"                    token_seqs.append(tokens)\n", | ||
"                    label_seqs.append(labels)\n", | ||
"                    tokens = []\n", | ||
"                    labels = []\n", | ||
"            else:\n", | ||
"                print('Format error. Line: {} Content: {}'.format(index, line))\n", | ||
"                continue\n", | ||
"\n", | ||
"    if tokens:  # If the file does not end with a blank line, append the last sequence manually\n", | ||
"        token_seqs.append(tokens)\n", | ||
"        label_seqs.append(labels)\n", | ||
"\n", | ||
"    # dtype=object lets NumPy store variable-length sequences (newer NumPy versions reject ragged input otherwise)\n", | ||
"    return np.array(token_seqs, dtype=object), np.array(label_seqs, dtype=object)\n", | ||
"\n", | ||
"\n", | ||
"def show_iob2(token_seqs, label_seqs, num=5, shuffle=True):\n", | ||
"    '''Print sequences in token/label form.'''\n", | ||
"    if shuffle:\n", | ||
"        length = len(token_seqs)\n", | ||
"        indexes = [random.randrange(0, length) for _ in range(num)]\n", | ||
"        zip_seqs = zip(token_seqs[indexes], label_seqs[indexes])\n", | ||
"    else:\n", | ||
"        zip_seqs = zip(token_seqs[0:num], label_seqs[0:num])\n", | ||
"\n", | ||
"    for tokens, labels in zip_seqs:\n", | ||
"        for token, label in zip(tokens, labels):\n", | ||
"            print('{}/{} '.format(token, label), end='')\n", | ||
"        print('\\n')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"token_seqs, label_seqs = load_iob2(path+'dh_msra.txt')\n", | ||
"\n", | ||
"print(len(token_seqs), len(label_seqs))\n", | ||
"print() \n", | ||
"show_iob2(token_seqs, label_seqs)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Label descriptions\n", | ||
"\n", | ||
"| Label | Description |\n", | ||
"| ---- | ---- |\n", | ||
"| LOC | Location |\n", | ||
"| ORG | Organization |\n", | ||
"| PER | Person |" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"set([label for labels in label_seqs for label in labels])" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.5" | ||
}, | ||
"widgets": { | ||
"state": {}, | ||
"version": "1.1.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
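
The notebook above loads IOB2-labelled sequences but does not show how the LOC, ORG, and PER labels map back to entities. The following is a minimal sketch of that step, not part of the committed notebook. The function name `iob2_to_spans` is hypothetical, and the label strings (`B-ORG`, `I-ORG`, `O`, and so on) are assumed from the IOB2 convention rather than taken from the data file itself.

```python
def iob2_to_spans(tokens, labels):
    """Collect (entity_type, start, end, text) spans from one IOB2-labelled sequence.

    Hypothetical helper: assumes labels look like 'B-LOC', 'I-LOC' or 'O'.
    `end` is exclusive, so tokens[start:end] are the entity's tokens.
    """
    spans = []
    start, ent_type = None, None
    # A trailing sentinel 'O' flushes an entity that runs to the end of the sequence.
    for i, label in enumerate(list(labels) + ['O']):
        continues = label.startswith('I-') and ent_type == label[2:]
        if start is not None and not continues:
            # The open entity ends just before position i.
            spans.append((ent_type, start, i, ''.join(tokens[start:i])))
            start, ent_type = None, None
        if label.startswith('B-'):
            # IOB2 marks the first token of every entity with 'B-'.
            start, ent_type = i, label[2:]
        # An 'I-' label with no open entity of the same type is ignored.
    return spans


tokens = list('我在北京大学读书')
labels = ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O']
print(iob2_to_spans(tokens, labels))  # [('ORG', 2, 6, '北京大学')]
```

Applied to the output of `load_iob2`, this could be called once per `(tokens, labels)` pair, for example to count entities per type. Tokens are joined without a separator because the corpus is character-level Chinese text; a space-separated language would need a different join.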