Merge pull request THUDM#245 from zRzRzRzRzRzRzR/main

Release public dataset
limeng101192 · Dec 26, 2023 · 54f28da
2 parents d7e88e1 + b3e8682
commit 54f28da

Showing 4 changed files with 152 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -62,9 +62,8 @@ dialogue with images, GUI Agent, Grounding**, and more.
     - [Citation \& Acknowledgements](#citation--acknowledgements)
 
 ## Release
-
-- 🔥 **News**: ```2023/12/18```: **New Web UI Launched!** We have launched a new web UI based on Streamlit, so users can easily chat with CogVLM and CogAgent in our UI for a better user experience.
-
+- 🔥🔥🔥 **News**: ```2023/12/26```: We have released the [CogVLM-SFT-311K](dataset.md) dataset, which contains over 150,000 samples that we used for training. You are welcome to follow and use it.
+- 🔥🔥 **News**: ```2023/12/18```: **New Web UI Launched!** We have launched a new web UI based on Streamlit, so users can easily chat with CogVLM and CogAgent in our UI for a better user experience.
 - 🔥 **News**: ```2023/12/15```: **CogAgent Officially Launched!** CogAgent is an image understanding model developed
   based on CogVLM. It features **visual-based GUI Agent capabilities** and has further enhancements in image
   understanding. It supports image input with a resolution of 1120*1120, and possesses multiple abilities including

diff --git a/README_zh.md b/README_zh.md
@@ -60,10 +60,8 @@ Agent, Grounding, and more.
     - [Citation \& Acknowledgements](#citation--acknowledgements)
 
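 ## Release
-
-- 🔥 **News**: ```2023/12/18```: **New Streamlit User Interface**
-  launched! We have launched a new web UI based on Streamlit, so users can easily chat with CogVLM and CogAgent in our UI for a better user experience.
-
+- 🔥🔥🔥 **News**: ```2023/12/26```: We have released the [CogVLM-SFT-311K](dataset_zh.md) dataset, which contains over 150,000 samples that we used for training. You are welcome to follow and use it.
+- 🔥🔥 **News**: ```2023/12/18```: **New Streamlit User Interface** launched! We have launched a new web UI based on Streamlit, so users can easily chat with CogVLM and CogAgent in our UI for a better user experience.
 - 🔥 **News**: ```2023/12/15```: **CogAgent Officially Launched!** CogAgent is an image understanding model developed based on CogVLM. It features visual-based GUI
   Agent capabilities and further enhancements in image understanding. It supports image input at a resolution of 1120*1120 and offers multiple abilities, including multi-turn dialogue with images, GUI
   Agent, Grounding, and more.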
@@ -560,7 +558,7 @@ On 9 classic cross-modal benchmarks, CogAgent-18B achieves state-of-the-art generalist
 |       cogvlm-base-224       |   base    |
 |       cogvlm-base-490       |   base    |
 
-### FAQ
+### Frequently Asked Questions
 
 * If you have trouble accessing huggingface.co, you can add `--local_tokenizer /path/to/vicuna-7b-v1.5` to load the tokenizer.
 * If you have trouble automatically downloading the model with 🔨 [SAT](https://github.com/THUDM/SwissArmyTransformer), try downloading it manually from 🤖[modelscope](https://www.modelscope.cn/models/ZhipuAI/CogVLM/summary) or
@@ -595,5 +593,4 @@ On 9 classic cross-modal benchmarks, CogAgent-18B achieves state-of-the-art generalist
 
 ```
 
-In the instruction fine-tuning phase of CogVLM, we used some English image-text data from the [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLAVA](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR)
-and [Shikra](https://github.com/shikras/shikra) projects, as well as many classic cross-modal datasets. We sincerely thank them for their contributions.
+In the instruction fine-tuning phase of CogVLM, we used some English image-text data from the [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLAVA](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR) and [Shikra](https://github.com/shikras/shikra) projects, as well as many classic cross-modal datasets. We sincerely thank them for their contributions.
diff --git a/dataset.md b/dataset.md
@@ -0,0 +1,75 @@
+# CogVLM-SFT-311K: Bilingual Visual Instruction Data in CogVLM SFT
+
+CogVLM-SFT-311K is the primary alignment corpus used to train the initial version of CogVLM v1.0. The dataset was constructed as follows:
+1. Approximately 3,500 high-quality samples were selected from the open-source [MiniGPT-4](https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align) dataset; this subset is referred to as minigpt4-3500.
+2. minigpt4-3500 was combined with [Llava-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K), and the combined data was translated into Chinese by a language model.
+3. We found significant noise in the detailed-description portions of minigpt4-3500 and Llava-instruct, so we corrected the Chinese text for these portions and retranslated the corrected text into English.
+
+## License
+
++ Due to non-commercial agreements, we did not use this data in the bilingual version of CogVLM or in any other model involving commercialization.
++ The dataset license is: <br> Attribution-NonCommercial 4.0 International. Use of the data should also abide by the OpenAI policy: https://openai.com/policies/terms-of-use
+You may not use this data for any **commercial activities**.
+
+## Dataset Address
+
++ [CogVLM-SFT-311K](https://huggingface.co/datasets/THUDM/CogVLM-SFT-311K)
+
+## Dataset Information
+
+The dataset contains three folders, holding the mixed minigpt4-3500 and llava detailed-description data, the llava single-turn conversations, and the llava multi-turn conversations. Their layout is as follows:
+```
+.CogVLM-SFT-311K
+├── llava_details-minigpt4_3500_formate
+├── llava_instruction_multi_conversations_formate
+└── llava_instruction_single_conversation_formate
+```
+Within the released data, each folder is organized as follows:
+```
+.llava_details-minigpt4_3500_formate
+├── images
+│   └── 00000001.jpg
+└── labels
+    └── 00000001.json
+```
+Images are stored in the `images` folder, and the label holding each image's description or dialogue is stored in the `labels` folder.
+
+## Dataset Quantity
+
++ llava_details-minigpt4_3500_formate: 22,464 images and descriptions
++ llava_instruction_multi_conversations_formate: 56,673 images and multi-turn conversations
++ llava_instruction_single_conversation_formate: 76,634 images and single-turn conversations
+
+## Dataset Format
+
+Caption format for image description
+```
+{
+  "captions": [
+    {
+      "role": "caption",
+      "content": "The photograph features a beach scene with a group of people surfing in the ocean. There are ten individuals spaced out in the water, sitting or lying on various sizes of body boards. The surfers line up in a row, close to the sandy shoreline. Some of the body boards are visible floating on the surface of the water, with the surfers riding on them or nearby."
+    }
+  ]
+}
+```
+Conversation format for image dialogue
+```
+{
+  "conversations": [
+    {
+      "role": "user",
+      "content": "What can be inferred about the zebras' behavior and surroundings?"
+    },
+    {
+      "role": "assistant",
+      "content": "Based on the image, we can infer that the two zebras are likely seeking relief from the sun's heat, as they are standing side by side under the branches of a thorny tree. This shade-providing tree offers some respite from the sun, possibly during the hottest part of the day. The zebras are in a green field with grass, providing them with an ideal environment to graze and eat while staying near their source of shelter. This shows that the zebras' behavior is influenced by the conditions and available resources in their surroundings. It also highlights that these animals adopt strategies to adapt to the fluctuating conditions of their environment, such as cooperation and seeking shelter, to survive and thrive in their natural habitat."
+    }
+  ]
+}
+```
+
+## References
+This project utilizes data and concepts based on the following research papers:
+- Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.
+- Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. arXiv preprint arXiv:2304.08485.
diff --git a/dataset_zh.md b/dataset_zh.md
@@ -0,0 +1,71 @@
+# CogVLM-SFT-311K: Bilingual Visual Instruction Dataset in CogVLM SFT
+
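+CogVLM-SFT-311K is the primary alignment corpus we used to train the initial version of **CogVLM v1.0**. The dataset was constructed as follows:
+1. Approximately 3,500 high-quality samples were selected from the open-source [MiniGPT-4](https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align) dataset; this subset is referred to as minigpt4-3500.
+2. minigpt4-3500 was combined with [Llava-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K), and the Chinese portion was obtained by translating with a language model.
+3. We found a lot of noise in the detailed-description portions of minigpt4-3500 and Llava-instruct, so we corrected the Chinese text of these two portions and retranslated the corrected text into English.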
+
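+## License
++ Due to non-commercial agreement restrictions, we did not use this data in the bilingual version of CogVLM or in any other model involving commercialization.
++ The dataset license is: <br> Attribution-NonCommercial 4.0 International. Use of the data should also abide by the OpenAI policy: https://openai.com/policies/terms-of-use
+You may not use this data for any **commercial activities**.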
+
+## Dataset Address
+
++ [CogVLM-SFT-311K](https://huggingface.co/datasets/THUDM/CogVLM-SFT-311K)
+
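+## Dataset Information
+The dataset contains three folders, holding the mixed minigpt4-3500 and llava data, the llava single-turn conversations, and the llava multi-turn conversations. Their layout is as follows: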
+```
+.CogVLM-SFT-311K
+├── llava_details-minigpt4_3500_formate
+├── llava_instruction_multi_conversations_formate
+└── llava_instruction_single_conversation_formate
+```
+Within our open-source data, each folder is organized as follows:
+```
+.llava_details-minigpt4_3500_formate
+├── images
+│   └── 00000001.jpg
+└── labels
+    └──  00000001.json
+```
+Here, `images` stores the image data, and `labels` stores the description or dialogue corresponding to each image.
+
+## Dataset Quantity
++ llava_details-minigpt4_3500_formate: 22,464 images and descriptions
++ llava_instruction_multi_conversations_formate: 56,673 images and multi-turn conversations
++ llava_instruction_single_conversation_formate: 76,634 images and single-turn conversations
+## Dataset Format
+
+Caption format for image description
+```
+{
+  "captions": [
+    {
+      "role": "caption",
+      "content": "这张照片展示了一男一女，他们都穿着红嘿色衣服，坐在桌子旁，在参加活动时面带微笑。这对情侣正在一起摆姿势，背景中还有其他人。在整个场景中可以看到另外几个人，一些人坐在附近的桌子旁，另一些人坐得离这对夫妇更近。各种餐桌设置，如酒杯和杯子，被放置在桌子上。排在最前面的男子打着领带。"
+    }
+  ]
+}
+```
+Conversation format for image dialogue (JSON)
+```
+{
+    "conversations": [
+      {
+        "role": "user",
+        "content": "在尝试保持这个浴室有序时可能会遇到哪些挑战？"
+      },
+      {
+        "role": "assistant",
+        "content": "在尝试保持这空间和各种物品的存在，例如杯子和手机占用了可用的空间。在图像中，浴室柜台上只有一个水槽可用，这导致多个用户可能必须共享空间放置自己的物品。这可能导致杂乱和混乱外的储物解决方案，如架子、橱柜或墙壁挂架，以保持浴室有序并最小化柜台上的杂乱。"
+      }
+    ]
+}
+```
+
+## References
+This project utilizes data and concepts based on the following research papers:
+- Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.
+- Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. arXiv preprint arXiv:2304.08485.
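
The folder layout and the two label formats documented in the new dataset files can be read with a short script. The following is a minimal sketch, not part of this PR: the helper names `load_label` and `iter_samples` are hypothetical, and it assumes that every label file is strict JSON, that each image uses the `.jpg` extension, and that an image and its label share the same file stem, as in the `00000001.jpg` / `00000001.json` example.

```python
import json
from pathlib import Path


def load_label(label_path):
    """Parse one label file (hypothetical helper, not from the repository).

    Returns (kind, turns), where kind is "captions" or "conversations"
    and turns is a list of (role, content) tuples.
    """
    with open(label_path, encoding="utf-8") as f:
        record = json.load(f)
    for kind in ("captions", "conversations"):
        if kind in record:
            return kind, [(turn["role"], turn["content"]) for turn in record[kind]]
    raise ValueError(f"unrecognized label format: {label_path}")


def iter_samples(subset_dir):
    """Yield (image_path, kind, turns) for each label that has a matching image.

    Assumes the documented layout: <subset_dir>/images/*.jpg and
    <subset_dir>/labels/*.json, paired by file stem.
    """
    subset_dir = Path(subset_dir)
    for label_path in sorted((subset_dir / "labels").glob("*.json")):
        image_path = subset_dir / "images" / (label_path.stem + ".jpg")
        if image_path.exists():
            kind, turns = load_label(label_path)
            yield image_path, kind, turns
```

For example, `next(iter_samples("CogVLM-SFT-311K/llava_details-minigpt4_3500_formate"))` would return the first image path together with its caption turns, provided the dataset has been downloaded to that (assumed) local path. Labels without a matching image are skipped rather than raising, which is a design choice of this sketch.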