Update readme
CosmosShadow committed Jun 28, 2024 (1 parent 22919ba, commit 91ffc53)
Showing 3 changed files with 114 additions and 48 deletions.

**README.md** (43 additions, 48 deletions)

# gptpdf

<p align="center">
<a href="README_CN.md"><img src="https://img.shields.io/badge/文档-中文版-blue.svg" alt="CN doc"></a>
<a href="README.md"><img src="https://img.shields.io/badge/document-English-blue.svg" alt="EN doc"></a>
</p>

Uses a vision large language model (such as GPT-4o) to parse PDFs into Markdown.

It can parse typesetting, mathematical formulas, tables, pictures, charts, and more almost perfectly.

Average cost per page: $0.013.

This package uses the [GeneralAgent](https://github.com/CosmosShadow/GeneralAgent) library to interact with the OpenAI API.


## Process steps

1. Use the PyMuPDF library to parse the PDF and extract all non-text rectangular areas (tables, pictures, icons, etc.).
2. Convert each non-text rectangular area into an image and number it.
3. On each page, draw a red rectangle and its number around every such area, then save the page as an image, similar to the following:

![](docs/demo.jpg)

4. Feed the image from step 3 to a large vision model (such as GPT-4o) and parse it into Markdown content (including pictures, tables, formulas, etc.). A rough sketch of steps 1-3 follows this list.
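
A minimal sketch of steps 1-3 with PyMuPDF is shown below. It only illustrates the idea and is not this package's actual implementation: the function name `annotate_pages`, the choice of image blocks plus vector drawings as the "non-text areas", and the 150 dpi rendering are all assumptions.

```python
import os
import fitz  # PyMuPDF


def annotate_pages(pdf_path: str, output_dir: str = "./") -> None:
    """Draw numbered red boxes around non-text regions and export images (sketch)."""
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_index, page in enumerate(doc):
        # Step 1: collect candidate non-text rectangles
        # (image blocks from the text extractor plus vector drawings).
        rects = []
        for block in page.get_text("blocks"):
            if block[6] == 1:  # block type 1 = image, 0 = text
                rects.append(fitz.Rect(block[:4]))
        for drawing in page.get_drawings():
            rects.append(drawing["rect"])

        # Step 2: save each non-text rectangle as its own numbered image.
        for i, rect in enumerate(rects):
            if rect.is_empty:
                continue
            pix = page.get_pixmap(clip=rect, dpi=150)
            pix.save(f"{output_dir}/page_{page_index}_rect_{i}.png")

        # Step 3: mark the rectangles and their numbers in red, then render
        # the whole annotated page as the input image for the vision model.
        for i, rect in enumerate(rects):
            page.draw_rect(rect, color=(1, 0, 0), width=1)
            page.insert_text(rect.tl, str(i), color=(1, 0, 0))
        page.get_pixmap(dpi=150).save(f"{output_dir}/page_{page_index}.png")
```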



## DEMO

See [examples/attention_is_all_you_need/output.md](examples/attention_is_all_you_need/output.md) for the parsed output of [examples/attention_is_all_you_need.pdf](examples/attention_is_all_you_need.pdf).



## Installation

```bash
pip install gptpdf
```



## Usage

```python
import os
from gptpdf import parse_pdf

pdf_path = '../examples/attention_is_all_you_need.pdf'
output_dir = '../examples/attention_is_all_you_need/'

# Pass OPENAI_API_KEY and OPENAI_API_BASE to parse_pdf explicitly
api_key = os.getenv('OPENAI_API_KEY')
base_url = os.getenv('OPENAI_API_BASE')

content, image_paths = parse_pdf(pdf_path, output_dir=output_dir, api_key=api_key, base_url=base_url, model='gpt-4o')
print(content)
print(image_paths)
# output_dir/output.md is also generated
```

```python
from gptpdf import parse_pdf
pdf_path = '../examples/attention_is_all_you_need.pdf'
output_dir = '../examples/attention_is_all_you_need/'
# Use OPENAI_API_KEY and OPENAI_API_BASE from environment variables
content, image_paths = parse_pdf(pdf_path, output_dir=output_dir, model='gpt-4o', verbose=True)
print(content)
print(image_paths)
# output_dir/output.md is also generated
```
See [test/test.py](test/test.py) for more examples.



## API

**parse_pdf**(pdf_path, output_dir='./', api_key=None, base_url=None, model='gpt-4o', verbose=False)

Parses a PDF file into a Markdown file and returns the Markdown content together with all extracted image paths.

- **pdf_path**: path to the PDF file
- **output_dir**: output directory; all images and the Markdown file are stored here (default `'./'`)
- **api_key**: OpenAI API key (optional). If not provided, the `OPENAI_API_KEY` environment variable is used.
- **base_url**: OpenAI base URL (optional). If not provided, the `OPENAI_BASE_URL` environment variable is used.
- **model**: OpenAI vision LLM model, default `'gpt-4o'`; you can also use `qwen-vl-max` (see the sketch below).
- **verbose**: verbose mode, default `False`

Returns `(content, image_paths)`: the Markdown content, with `![](path/to/image.png)` references, and the paths of all extracted rectangle images (images, tables, charts, etc.).
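
As an illustration of the `model` and `base_url` parameters, here is a hypothetical call that targets an OpenAI-compatible endpoint serving `qwen-vl-max`; the endpoint URL is a placeholder, not a real address, and `pdf_path`, `output_dir`, and `api_key` are the variables from the usage example above.

```python
# Hypothetical sketch: replace the placeholder base_url with your provider's
# actual OpenAI-compatible endpoint and use the API key it issued to you.
content, image_paths = parse_pdf(
    pdf_path,
    output_dir=output_dir,
    api_key=api_key,
    base_url='https://your-openai-compatible-endpoint/v1',  # placeholder, not real
    model='qwen-vl-max',  # alternative vision model mentioned above
)
```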

**README_CN.md** (new file, 71 additions)

# gptpdf

<p align="center">
<a href="README_CN.md"><img src="https://img.shields.io/badge/文档-中文版-blue.svg" alt="CN doc"></a>
<a href="README.md"><img src="https://img.shields.io/badge/document-English-blue.svg" alt="EN doc"></a>
</p>

Uses a vision large language model (such as GPT-4o) to parse PDFs into Markdown.

Our method can parse typesetting, mathematical formulas, tables, pictures, charts, and more almost perfectly.

Average cost per page: $0.013.

This library uses [GeneralAgent](https://github.com/CosmosShadow/GeneralAgent) to interact with the OpenAI API.



## Process steps

1. Use the PyMuPDF library to parse the PDF and extract all non-text rectangular areas (tables, pictures, icons, etc.).
2. Convert each non-text rectangular area into an image and number it.
3. On each page, draw a red rectangle and its number around every such area, then save the page as an image, similar to the following:

![](docs/demo.jpg)

4. Feed the image from step 3 to a large vision model (such as GPT-4o) and parse it into Markdown content (including pictures, tables, formulas, etc.).



## Demo

See [examples/attention_is_all_you_need/output.md](examples/attention_is_all_you_need/output.md) for the parsed output of [examples/attention_is_all_you_need.pdf](examples/attention_is_all_you_need.pdf).



## Installation

```bash
pip install gptpdf
```



## Usage

```python
from gptpdf import parse_pdf

pdf_path = '../examples/attention_is_all_you_need.pdf'  # path of the PDF to parse
api_key = 'Your OpenAI API Key'
content, image_paths = parse_pdf(pdf_path, api_key=api_key)
print(content)
```

See [test/test.py](test/test.py) for more examples.



## API

**parse_pdf**(pdf_path, output_dir='./', api_key=None, base_url=None, model='gpt-4o', verbose=False)

- **pdf_path**: path to the PDF file
- **output_dir**: output directory; all images and the Markdown file are stored here
- **api_key**: OpenAI API key (optional). If not provided, the `OPENAI_API_KEY` environment variable is used.
- **base_url**: OpenAI base URL (optional). If not provided, the `OPENAI_BASE_URL` environment variable is used.
- **model**: OpenAI vision LLM model, default `'gpt-4o'`; you can also use `qwen-vl-max`.
- **verbose**: verbose mode

**docs/demo.jpg** (binary file added)