
Commit

代码整合,支持自定义提示词 (Code consolidation; support for custom prompts)

zRzRzRzRzRzRzR committed Jul 3, 2024
1 parent 8ebc302 commit c27d0c3
Showing 4 changed files with 794 additions and 215 deletions.
67 changes: 54 additions & 13 deletions README.md
See more in [test/test.py](test/test.py)



## API

### parse_pdf

**Function**: `parse_pdf(pdf_path, output_dir='./', api_key=None, base_url=None, model='gpt-4o', verbose=False, gpt_worker=1)`

Parses a PDF file into a Markdown file and returns the Markdown content along with all image paths.

**Parameters**:

- **pdf_path**: *str*
Path to the PDF file

- **output_dir**: *str*, default: './'
Output directory to store all images and the Markdown file

- **api_key**: *Optional[str]*, optional
OpenAI API key. If not provided, the `OPENAI_API_KEY` environment variable will be used.

- **base_url**: *Optional[str]*, optional
OpenAI base URL. If not provided, the `OPENAI_BASE_URL` environment variable will be used. This can be modified to call other large model services with OpenAI API interfaces, such as `GLM-4V`.

- **model**: *str*, default: 'gpt-4o'
OpenAI API formatted multimodal large model. If you need to use other models, such as:
- [qwen-vl-max](https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start) (untested)
- [GLM-4V](https://open.bigmodel.cn/dev/api#glm-4v) (tested)
  - Azure OpenAI (tested): set `base_url` to `https://xxxx.openai.azure.com/`, use your Azure API key as `api_key`, and set the model to `azure_xxxx`, where `xxxx` is the deployed model name (not the OpenAI model name).

- **verbose**: *bool*, default: False
Verbose mode. When enabled, the content parsed by the large model will be displayed in the command line.

- **gpt_worker**: *int*, default: 1
Number of GPT parsing worker threads. If your machine has better performance, you can increase this value to speed up the parsing.

- **prompt**: *dict*, optional
If the model you are using does not match the default prompt provided in this repository and cannot achieve the best results, we support adding custom prompts. The prompts in the repository are divided into three parts:
- `prompt`: Mainly used to guide the model on how to process and convert text content in images.
- `rect_prompt`: Used to handle cases where specific areas (such as tables or images) are marked in the image.
- `role_prompt`: Defines the role of the model to ensure the model understands it is performing a PDF document parsing task.

You can pass custom prompts in the form of a dictionary to replace any of the prompts. Here is an example:

```python
from gptpdf import parse_pdf

pdf_path = 'attention_is_all_you_need.pdf'  # example path; replace with your PDF

prompt = {
    "prompt": "Custom prompt text",
    "rect_prompt": "Custom rect prompt",
    "role_prompt": "Custom role prompt"
}

content, image_paths = parse_pdf(
    pdf_path=pdf_path,
    output_dir='./output',
    model="gpt-4o",
    prompt=prompt,
    verbose=False,
)
```
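You do not need to supply all three keys: any prompt you omit keeps its default. The override behaviour can be pictured as a plain dictionary update (a sketch with placeholder default text, not the library's actual prompts or internals):

```python
# Sketch of how a partial custom prompt dict can override defaults.
# DEFAULT_PROMPT values here are placeholders, not the library's real text.
DEFAULT_PROMPT = {
    "prompt": "default content prompt",
    "rect_prompt": "default rect prompt",
    "role_prompt": "default role prompt",
}

def merge_prompt(custom=None):
    """Return the default prompts with any user-supplied entries overriding them."""
    merged = dict(DEFAULT_PROMPT)
    if custom:
        merged.update(custom)
    return merged

# Only `role_prompt` is overridden; the other two keep their defaults.
merged = merge_prompt({"role_prompt": "You are an expert PDF parser."})
print(merged["role_prompt"])  # -> You are an expert PDF parser.
```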

## Join Us 👏🏻

Scan the QR code below with WeChat to join our group chat or contribute.

<p align="center">
<img src="./docs/wechat.jpg" alt="wechat" width=400/>
</p>
93 changes: 58 additions & 35 deletions README_CN.md

[pdfgpt-ui](https://github.com/daodao97/gptpdf-ui) is a visualization tool based on gptpdf.



## Processing Flow

1. Use the PyMuPDF library to parse the PDF, extract all non-text regions, and mark them, for example:

2. Use a vision large model (such as GPT-4o) to parse the marked pages and produce a Markdown file.



## Examples

For a sample PDF, see [examples/attention_is_all_you_need/output.md](examples/attention_is_all_you_need/output.md) and [examples/attention_is_all_you_need.pdf](examples/attention_is_all_you_need.pdf).

## Installation

```bash
pip install gptpdf
```



## Usage

```python
from gptpdf import parse_pdf

api_key = 'Your OpenAI API Key'
pdf_path = 'attention_is_all_you_need.pdf'  # example path; replace with your PDF

content, image_paths = parse_pdf(pdf_path, api_key=api_key)
print(content)
```

See more in [test/test.py](test/test.py)
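When `api_key` is omitted, the library falls back to the `OPENAI_API_KEY` environment variable. That fallback pattern can be sketched as follows (`resolve_api_key` is illustrative; the real library handles this internally):

```python
import os

def resolve_api_key(api_key=None):
    """Illustrative fallback: an explicit argument wins, otherwise the environment variable is used."""
    return api_key or os.environ.get("OPENAI_API_KEY")

os.environ["OPENAI_API_KEY"] = "sk-demo"  # for illustration only
print(resolve_api_key())                  # -> sk-demo (from the environment)
print(resolve_api_key("sk-explicit"))     # -> sk-explicit (argument takes precedence)
```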



## API



### parse_pdf

**Function**: `parse_pdf(pdf_path, output_dir='./', api_key=None, base_url=None, model='gpt-4o', verbose=False, gpt_worker=1)`

Parses a PDF file into a Markdown file and returns the Markdown content along with a list of all image paths.

**Parameters**:

- **pdf_path**: *str*
  Path to the PDF file

- **output_dir**: *str*, default: './'
  Output directory to store all images and the Markdown file

- **api_key**: *Optional[str]*, optional
  OpenAI API key. If not provided, the `OPENAI_API_KEY` environment variable will be used.

- **base_url**: *Optional[str]*, optional
  OpenAI base URL. If not provided, the `OPENAI_BASE_URL` environment variable will be used. It can be changed to call other large model services that expose OpenAI-style APIs, such as `GLM-4V`.

- **model**: *str*, default: 'gpt-4o'
  OpenAI API formatted multimodal large model. If you need to use other models, such as:
  - [qwen-vl-max](https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start) (untested)
  - [GLM-4V](https://open.bigmodel.cn/dev/api#glm-4v) (tested)
  - Azure OpenAI (tested): set `base_url` to `https://xxxx.openai.azure.com/`, use your Azure API key as `api_key`, and set the model to `azure_xxxx`, where `xxxx` is the deployed model name (not the OpenAI model name).

- **verbose**: *bool*, default: False
  Verbose mode. When enabled, the content parsed by the large model is printed to the command line.

- **gpt_worker**: *int*, default: 1
  Number of GPT parsing worker threads. If your machine performs well, you can increase this value to speed up parsing.

- **prompt**: *dict*, optional
  If the model you are using does not match the default prompts in this repository and cannot achieve the best results, custom prompts are supported. The prompts in the repository are divided into three parts:
  - `prompt`: mainly used to guide the model on how to process and convert text content in images.
  - `rect_prompt`: used to handle cases where specific regions (such as tables or images) are marked in the image.
  - `role_prompt`: defines the role of the model to ensure it understands that it is performing a PDF document parsing task.

  You can pass custom prompts as a dictionary to replace any of them. Here is an example:

```python
from gptpdf import parse_pdf

pdf_path = 'attention_is_all_you_need.pdf'  # example path; replace with your PDF

prompt = {
    "prompt": "Custom prompt text",
    "rect_prompt": "Custom rect prompt",
    "role_prompt": "Custom role prompt"
}

content, image_paths = parse_pdf(
    pdf_path=pdf_path,
    output_dir='./output',
    model="gpt-4o",
    prompt=prompt,
    verbose=False,
)
```
You do not need to replace all of the prompts; if you do not pass custom prompts, the repository automatically uses the defaults. The default prompts are in Chinese, so if your PDF document is in English, or your model does not support Chinese, it is recommended to customize the prompts.
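The effect of `gpt_worker` can be sketched with a thread pool: pages are independent, so their model calls can run concurrently. The names below are illustrative, not the library's internals:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_page(page_index):
    # Stand-in for one model call on a single rendered page image.
    return f"markdown for page {page_index}"

def parse_all_pages(num_pages, gpt_worker=1):
    # map() preserves page order even when calls finish out of order.
    with ThreadPoolExecutor(max_workers=gpt_worker) as pool:
        return list(pool.map(parse_page, range(num_pages)))

pages = parse_all_pages(4, gpt_worker=2)
print(len(pages))  # 4
```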

## Join Us 👏🏻
