chore: update data-sft cookbook
danielhjz committed Nov 9, 2023
1 parent c6974d0 commit 3de5d01
Showing 2 changed files with 1,014 additions and 28 deletions.
242 changes: 214 additions & 28 deletions cookbook/console-finetune/console-finetune.ipynb
@@ -7,7 +7,7 @@
"source": [
"### 前言\n",
"\n",
"本篇主要介绍end-to-end的LLMops流程中的SFT微调->发布->推理流程,使用的SDK版本为0.1.0。建议提前熟悉预测服务相关SDK功能作为前置知识。"
"本篇主要介绍end-to-end的LLMops流程中的数据->SFT微调->发布->推理流程,使用的SDK版本为0.1.3。建议提前熟悉预测服务相关SDK功能作为前置知识。"
]
},
{
@@ -16,64 +16,250 @@
"metadata": {},
"outputs": [],
"source": [
"# 通过环境变量传递(作用于全局,优先级最低)\n",
"import os\n",
"os.environ[\"QIANFAN_ACCESS_KEY\"] = \"your_iam_ak\"\n",
"os.environ[\"QIANFAN_SECRET_KEY\"] = \"your_iam_sk\"\n",
"# 初始化百度智能云的IAM ak, sk用于bos和千帆平台的鉴权\n",
"bce_ak = \"your_iam_ak\"\n",
"bce_sk = \"your_iam_sk\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数据上传\n",
"\n",
"# 通过内置函数传递(作用于全局,优先级大于环境变量)\n",
"# import qianfan\n",
"# qianfan.AccessKey(\"...\")\n",
"# qianfan.SecretKey(\"...\")\n",
"在进行SFT微调训练前,我们需要准备我们的训练数据;不同的训练任务需要准备不同类型的数据集,具体来说,对于LLM SFT训练任务,需要准备的是`已标注的、非排序的对话数据集`\n",
"推荐使用的数据格式为`jsonl`,即每一行文本都包含了一个json字符串,此json需要包含prompt,response两个字段,以下是一个示例,[下载](https://console.bce.baidu.com/api/qianfan/canghai/entity/static/sample-text-dialog-unsort-annotated.jsonl):\n",
"```\n",
"[{\"prompt\" : \"你好\", \"response\": [[\"你需要什么帮助\"]]}]\n",
"```\n",
"每一行表示一组数据,每组数据中的prompt和response加起来之和字符数不超过8000Token(包括中英文、数字、符号等),超出部分将被截断。\n",
"\n",
"# 调用相关接口时传递(仅作用于该请求,优先级最高)\n",
"# import qianfan\n",
"# task = qianfan.FineTune.create_task(ak=\"...\", sk=\"...\")"
"### Bos\n",
"\n",
"Bos是百度智能云提供的对象存储云服务,可以高效的存取数据。本篇教程基于Bos,实现本地的数据集到千帆平台数据集的导入:"
]
},
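Before uploading, it can help to check the `jsonl` file against the format described in the cell above. The following is a minimal sketch, not part of the Qianfan SDK: `validate_sft_line` is a hypothetical helper, and accepting a bare object in addition to the list-wrapped form shown in the sample is an assumption.

```python
import json


def validate_sft_line(line):
    """Return a list of problems found in one jsonl line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: {}".format(exc.msg)]
    # The sample wraps each line in a list; a bare object is also accepted
    # here (assumption, not confirmed by the source).
    turns = record if isinstance(record, list) else [record]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append("turn {}: expected an object".format(i))
            continue
        # The two fields the notebook says every sample must contain.
        for field in ("prompt", "response"):
            if field not in turn:
                problems.append("turn {}: missing field '{}'".format(i, field))
    return problems
```

Running this over every line of the local file before calling `put_object_from_file` surfaces malformed samples early; it does not check the 8000-token limit, which depends on the platform's tokenizer.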
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 首先我们需要安装bce-python-sdk\n",
"!pip install bce-python-sdk"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{metadata:{date:u'Thu, 09 Nov 2023 10:50:57 GMT',content_length:u'0',connection:u'keep-alive',content_md5:u'kbo1u82WYdCFGVLAbeqXbQ==',etag:u'91ba35bbcd9661d0851952c06dea976d',server:u'BceBos',bce_content_crc_32:u'86170999',bce_debug_id:u'JUrX2nUmpvcbaRPRMsY+uS3KUFDB1YjYIbZ9aaJtEgw16FpXFpCwVQG7+iVDt2rD4dVWAh+SmNZzCEUXGOXHiQ==',bce_flow_control_type:u'-1',bce_is_transition:u'false',bce_request_id:u'b65583f2-c7fb-4fa6-ad52-c07569270120'}}"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from baidubce.bce_client_configuration import BceClientConfiguration\n",
"from baidubce.auth.bce_credentials import BceCredentials\n",
"from baidubce.services.bos.bos_client import BosClient\n",
"\n",
"# 初始化bos配置\n",
"BosEndpoint = \"bj.bcebos.com\"\n",
"bucket_name = \"your_bucketname\"\n",
"\n",
"bos_config = BceClientConfiguration(credentials=BceCredentials(bce_ak, bce_sk), endpoint=BosEndpoint)\n",
"\n",
"file_name = \"./data/sample-text-dialog-unsort-annotated.jsonl\"\n",
"key = \"/dataset/dialog01/sample-text-dialog-unsort-annotated.jsonl\"\n",
"prefix = \"/dataset/dialog01/\"\n",
"\n",
"bos_client = BosClient(bos_config)\n",
"bos_client.put_object_from_file(bucket_name, key, file_name)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 大模型调优\n",
"千帆平台支持SFT/RLHF两种方法进行模型优化,当前SDK已支持对SFT训练微调任务的创建和管理。\n",
"SFT 相关操作使用“安全认证/Access Key”中的 Access Key ID 和 Secret Access Key 进行鉴权,无法使用获取Access Token的方式鉴权,相关 key 可以在百度智能云控制台中安全认证获取,详细流程可以参见文档。\n",
"鉴权方式除`命名`外,使用方法与预测功能使用的AK 与 SK 方式相同,提供如下三种方式:\n",
"## 大模型平台鉴权介绍:\n",
"\n",
"- 通过`环境变量`传递(作用于全局,优先级最低)\n",
"- 通过`内置函数`传递(作用于全局,优先级大于环境变量)\n",
"- 通过`调用接口`时传递(仅作用于该请求,优先级最高)"
"大模型平台和Bos同处于百度智能云下,所以可以使用同一个AK,SK来通过权限校验:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"QIANFAN_ACCESS_KEY\"] = bce_ak\n",
"os.environ[\"QIANFAN_SECRET_KEY\"] = bce_sk"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数据导入\n",
"\n",
"在完成了以上从本地到bos的上传过程后,我们就开始着手创建数据集并导入之前上传到bos的数据"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0.1.0'"
"QfResponse(code=200, headers={'Content-Length': '1110', 'Content-Type': 'application/json; charset=utf-8', 'Date': 'Thu, 09 Nov 2023 08:41:06 GMT', 'X-Bce-Gateway-Region': 'BJ', 'X-Bce-Request-Id': '8aef6c3b-8630-49db-823d-55a0115203d5'}, body={'log_id': 'qnxrdigwje6aiyyf', 'result': {'id': 32518, 'groupId': 26707, 'groupName': 'hi_sft_ds', 'displayName': '', 'createFrom': 0, 'bmlDatasetId': 'ds-nu54erbqtvfpgpr9', 'isBmlLocking': 0, 'easyDLProId': 0, 'versionId': 1, 'userId': 1493592, 'projectId': '', 'organizationId': '', 'visibility': 'Project', 'productId': 3, 'dataType': 4, 'projectType': 20, 'templateType': 2001, 'scene': 0, 'remark': '', 'storageType': 'usrBos', 'storageInfo': {'storageId': 'qianfanhj', 'storagePath': '/qianfanhj/dataset/dialog01/_system_/dataset/ds-nu54erbqtvfpgpr9/texts', 'storageName': 'qianfanhj', 'rawStoragePath': '/dataset/dialog01/', 'region': 'bj'}, 'importStatus': -1, 'importProgress': 0, 'importScheduledJobId': 0, 'importJobId': 0, 'exportStatus': -1, 'releaseStatus': 0, 'publishPublicStatus': '', 'publishPublicErrCode': 0, 'statsJobId': 0, 'statisticStatus': 0, 'statisticProgress': 0, 'ShouldHide': 0, 'status': 0, 'isUnique': 0, 'isConfirm': 0, 'publishStatus': 0, 'errCode': None, 'hasTitle': 0, 'displayFeatures': '', 'latestDeltaIndex': 0, 'adversarialStatus': 0, 'createTime': '2023-11-09T16:41:06.600928653+08:00', 'modifyTime': '2023-11-09T16:41:06.600940755+08:00'}, 'status': 200, 'success': True})"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
"output_type": "display_data"
}
],
"source": [
"import qianfan\n",
"from qianfan import Data\n",
"from qianfan.resources.console.consts import DataSetType, DataProjectType, DataTemplateType, DataStorageType\n",
"\n",
"# 创建数据集\n",
"ds = Data.create_bare_dataset(name=\"hi_sft_ds\", \n",
" data_set_type=DataSetType.TextOnly,\n",
" project_type=DataProjectType.Conversation,\n",
" template_type=DataTemplateType.AnnotatedConversation,\n",
" storage_type=DataStorageType.PrivateBos,\n",
" storage_id=bucket_name,\n",
" storage_path=prefix)\n",
"ds\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 使用bos进行数据导入\n",
"from qianfan.resources.console.consts import DataSourceType\n",
"\n",
"ds_id=ds[\"result\"][\"id\"]\n",
"import_resp = Data.create_data_import_task(dataset_id=ds_id,\n",
" is_annotated=True,\n",
" import_source=DataSourceType.PrivateBos,\n",
" file_url=\"bos:/{}/{}\".format(bucket_name, key))"
]
},
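Note that the cell above formats `"bos:/{}/{}"` with a `key` that already begins with `/`, so the resulting URL contains a doubled slash after the bucket name. The source does not say whether the platform normalizes this. If it turns out not to, a small helper along these lines (hypothetical, not part of the SDK) builds the URL with single slashes:

```python
def bos_file_url(bucket_name, key):
    """Join a bucket name and object key into a bos:/bucket/key URL."""
    # Strip slashes at the join point so the separator appears exactly once.
    return "bos:/{}/{}".format(bucket_name.strip("/"), key.lstrip("/"))
```

With the values used in this notebook, `bos_file_url(bucket_name, key)` could be passed as `file_url` in place of the inline `format` call.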
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 获取数据集详情\n",
"ds_info = Data.get_dataset_info(ds_id)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 监听导入状态\n",
"\n",
"由于数据集导入是一个耗时任务,所以我们需要等待其完成才能进行下一步的动作,这里我们通过轮询的方式简单的监听任务状态直到数据完成导入成功。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"ImportSuccess = 2\n",
"\n",
"# 模型调优,模型管理与发布能力仅在qianfan>=0.1.0支持\n",
"qianfan.__version__"
"while True:\n",
" # 获取数据集详情\n",
" ds_info = Data.get_dataset_info(ds_id)\n",
" import_status = ds_info[\"result\"][\"versionInfo\"][\"importStatus\"]\n",
" if import_status == ImportSuccess:\n",
" print(\"dataset import finish, ready to release\")\n",
" break\n",
" print(\"current_import_status\", import_status)\n",
" time.sleep(10)\n",
"\n"
]
},
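The polling loop above never exits if the import fails or stalls. One way to bound the wait is a generic helper with a timeout. This is a sketch only: `wait_for_status` and its parameters are hypothetical names, and the 600-second default is an arbitrary assumption rather than a platform recommendation.

```python
import time


def wait_for_status(fetch_status, done_status, interval=10.0, timeout=600.0,
                    sleep=time.sleep):
    """Poll fetch_status() until it returns done_status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status == done_status:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "status still {!r} after {} seconds".format(status, timeout)
            )
        # The sleep function is injectable so the helper can be tested quickly.
        sleep(interval)
```

In this notebook it could be called as `wait_for_status(lambda: Data.get_dataset_info(ds_id)["result"]["versionInfo"]["importStatus"], ImportSuccess)`, reusing the same fields the loop above reads.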
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 发布数据集\n",
"\n",
"恭喜你到达了进行SFT训练的最后一步,我们已经完成了数据集的准备,现在需要发布数据集。\n",
"> Note:\n",
"> 发布数据集后后无法再进行数据集的处理,导入或者修改!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"current_release_status 1\n",
"current_release_status 1\n",
"current_release_status 1\n",
"current_release_status 1\n",
"dataset release finish, ready to train\n"
]
}
],
"source": [
"# 发布 并监听数据集发布状态\n",
"ReleasedSuccess = 2\n",
"resp = Data.release_dataset(ds_id)\n",
"while True:\n",
" # 获取数据集详情\n",
" ds_info = Data.get_dataset_info(ds_id)\n",
" release_status = ds_info[\"result\"][\"versionInfo\"][\"releaseStatus\"]\n",
" if release_status == ReleasedSuccess:\n",
" print(\"dataset release finish, ready to train\")\n",
" break\n",
" print(\"current_release_status\", release_status)\n",
" time.sleep(10)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"至此,数据部分的准备已经完成!我们话不多说赶紧开始LLM的Finetune:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finetune\n",
"\n",
"目前千帆平台支持如下 SFT 相关操作:\n",
"* 创建训练任务\n",
"* 创建任务运行\n",
Expand Down Expand Up @@ -129,7 +315,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"req QfRequest(method='POST', url='/wenxinworkshop/finetune/createJob', query={}, headers={}, json_body={'taskId': 12765, 'baseTrainType': 'ERNIE-Bot-turbo', 'trainType': 'ERNIE-Bot-turbo-0725', 'trainMode': 'SFT', 'peftType': 'ALL', 'trainConfig': {'epoch': 1, 'learningRate': 2e-05, 'maxSeqLen': 4096}, 'trainset': [{'type': 1, 'id': 12563}], 'trainsetRate': 20}, retry_config=RetryConfig(retry_count=1, timeout=10, backoff_factor=0))\n"
"req QfRequest(method='POST', url='/wenxinworkshop/finetune/createJob', query={}, headers={}, json_body={'taskId': 12765, 'baseTrainType': 'ERNIE-Bot-turbo', 'trainType': 'ERNIE-Bot-turbo-0725', 'trainMode': 'SFT', 'peftType': 'ALL', 'trainConfig': {'epoch': 1, 'learningRate': 2e-05, 'maxSeqLen': 4096}, 'trainset': [{'type': 1, 'id': 32518}], 'trainsetRate': 20}, retry_config=RetryConfig(retry_count=1, timeout=10, backoff_factor=0))\n"
]
},
{
@@ -160,7 +346,7 @@
" \"trainset\": [\n",
" {\n",
" \"type\": 1,\n",
" \"id\": 12563\n",
" \"id\": ds_id\n",
" }\n",
" ],\n",
" \"trainsetRate\": 20\n",