update README
CosmosShadow committed Jun 28, 2024
1 parent d7cb8f5 commit 0d05d81
Showing 2 changed files with 6 additions and 14 deletions.
10 changes: 3 additions & 7 deletions README.md
@@ -7,7 +7,7 @@

Use a vision LLM (VLLM) such as GPT-4o to parse PDFs into markdown.

Our method can almost perfectly parse typesetting, mathematical formulas, tables, pictures, charts, etc.
Our approach is very simple (only 293 lines of code), but can almost perfectly parse typography, math formulas, tables, pictures, charts, etc.

Average price per page: $0.013

@@ -17,15 +17,11 @@ This package use [GeneralAgent](https://github.com/CosmosShadow/GeneralAgent) li

## Process steps

1. Use the PyMuPDF library to parse the PDF and extract all non-text areas.

2. Convert all non-text areas on the PDF into images and number them

3. Mark the non-text areas and numbers on each page of the PDF and save them as images, similar to the following:
1. Use the PyMuPDF library to parse the PDF, extract all non-text areas, and mark them, for example:

![](docs/demo.jpg)

4. Based on the image in step 3, use a large visual model (such as GPT-4o) to parse and obtain the markdown content.
2. Use a large vision model (such as GPT-4o) to parse the marked pages and obtain the markdown file.



10 changes: 3 additions & 7 deletions README_CN.md
@@ -7,7 +7,7 @@

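Use a vision large language model (such as GPT-4o) to parse PDFs into markdown.

Our method can almost perfectly parse typesetting, mathematical formulas, tables, pictures, charts, etc.
Our method is very simple (only 293 lines of code), but can almost perfectly parse typesetting, mathematical formulas, tables, pictures, charts, etc.

Average price per page: $0.013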

@@ -17,15 +17,11 @@

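## Process steps

1. Use the PyMuPDF library to parse the PDF and extract all non-text areas (including tables, images, icons, etc.)

2. Convert all non-text areas of the PDF into images and number them

3. Mark the non-text areas and their numbers on each PDF page and save the pages as images, similar to the following:
1. Use the PyMuPDF library to parse the PDF, extract all non-text areas, and mark them, for example:

![](docs/demo.jpg)

4. Based on the images from step 3, use a large vision model (such as GPT-4o) to parse them and obtain the markdown content (including images, tables, formulas, etc.)
2. Use a large vision model (such as GPT-4o) to parse the marked pages and obtain the markdown file.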
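
For readers of this diff, the two process steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the package's actual code: the helper names (`number_regions`, `mark_page`, `build_prompt`), the `N.png` naming of cropped regions, and the prompt wording are all hypothetical. Only the PyMuPDF calls (`fitz.open`, `page.get_text("dict")`, `page.get_drawings`, `page.draw_rect`, `page.insert_text`, `page.get_pixmap`) are real library API. Sending the marked image and prompt to a vision model is left out.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def number_regions(rects: List[Rect]) -> List[Tuple[int, Rect]]:
    """Order regions top-to-bottom, then left-to-right, and number them from 0."""
    ordered = sorted(rects, key=lambda r: (r[1], r[0]))
    return list(enumerate(ordered))


def mark_page(pdf_path: str, page_index: int, out_path: str) -> List[Tuple[int, Rect]]:
    """Step 1 (sketch): find non-text regions, draw numbered boxes, save a page image."""
    import fitz  # PyMuPDF; imported here so the pure helpers work without it

    doc = fitz.open(pdf_path)
    page = doc[page_index]
    # Image blocks have type 1 in the "dict" text extraction output.
    rects = [tuple(b["bbox"]) for b in page.get_text("dict")["blocks"] if b["type"] == 1]
    # Vector drawings (lines, shapes) each carry a bounding "rect".
    rects += [tuple(d["rect"]) for d in page.get_drawings()]
    numbered = number_regions(rects)
    for idx, rect in numbered:
        page.draw_rect(fitz.Rect(rect), color=(1, 0, 0), width=1)
        page.insert_text((rect[0] + 2, rect[1] + 10), str(idx), fontsize=10, color=(1, 0, 0))
    page.get_pixmap(dpi=150).save(out_path)
    doc.close()
    return numbered


def build_prompt(numbered: List[Tuple[int, Rect]]) -> str:
    """Step 2 (sketch): text prompt to send with the marked page image.

    The wording and the "N.png" file naming are assumptions for illustration.
    """
    lines = [
        "Convert this page image to markdown.",
        "Red boxes mark non-text regions; insert each one by its number:",
    ]
    lines += [f"- region {idx}: ![]({idx}.png)" for idx, _ in numbered]
    return "\n".join(lines)
```

In this sketch, `mark_page("paper.pdf", 0, "page0.png")` would write the annotated page image, and the returned list can be passed to `build_prompt` to produce the text that accompanies the image in the model request.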


