Uses a visual large language model (VLLM) such as GPT-4o to parse PDFs into Markdown.
Our approach is very simple (only 293 lines of code), yet it parses typography, math formulas, tables, pictures, charts, etc. almost perfectly.
Average cost per page: $0.013
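As a rough budgeting aid, the per-page average above can be turned into a quick estimate. This is only a sketch: `estimate_cost` is a hypothetical helper, not part of gptpdf, and actual cost depends on page content and the model used.

```python
COST_PER_PAGE_USD = 0.013  # average cost per page quoted above


def estimate_cost(num_pages: int, cost_per_page: float = COST_PER_PAGE_USD) -> float:
    """Return an approximate parsing cost in USD, rounded to cents."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return round(num_pages * cost_per_page, 2)


if __name__ == "__main__":
    # A hypothetical 100-page document
    print(f"${estimate_cost(100):.2f}")  # prints $1.30
```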
This package uses the GeneralAgent library to interact with the OpenAI API.
- Use the PyMuPDF library to parse the PDF, find all non-text areas, and mark them.
- Use a large visual model (such as GPT-4o) to parse the marked pages and produce a Markdown file.
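The first step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not gptpdf's actual implementation: the helper names (`merge_rects`, `mark_non_text_areas`), the merge margin, the red outline, and the 150 dpi rendering are all hypothetical choices. It collects the bounding boxes of drawings and images with PyMuPDF, merges overlapping boxes, outlines them on the page, and renders the page to a PNG that could then be sent to the vision model.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlaps(a: Rect, b: Rect, margin: float = 0.0) -> bool:
    """True if the two rectangles touch or overlap, allowing a margin."""
    return not (a[2] + margin < b[0] or b[2] + margin < a[0]
                or a[3] + margin < b[1] or b[3] + margin < a[1])


def merge_rects(rects: List[Rect], margin: float = 5.0) -> List[Rect]:
    """Greedily merge rectangles that overlap into their bounding boxes."""
    merged: List[Rect] = []
    for r in rects:
        r = tuple(r)
        changed = True
        while changed:
            changed = False
            for m in merged:
                if overlaps(r, m, margin):
                    merged.remove(m)  # safe: we break right after removing
                    r = (min(r[0], m[0]), min(r[1], m[1]),
                         max(r[2], m[2]), max(r[3], m[3]))
                    changed = True
                    break
        merged.append(r)
    return merged


def mark_non_text_areas(pdf_path: str, page_number: int = 0,
                        out_png: str = "page.png") -> List[Rect]:
    """Outline drawings and images on one page and save it as a PNG."""
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    rects = [tuple(d["rect"]) for d in page.get_drawings()]
    rects += [tuple(info["bbox"]) for info in page.get_image_info()]
    regions = merge_rects(rects)
    for r in regions:
        page.draw_rect(fitz.Rect(*r), color=(1, 0, 0), width=1)
    page.get_pixmap(dpi=150).save(out_png)
    doc.close()
    return regions
```

Merging nearby boxes keeps a chart made of many small vector paths from being marked as dozens of separate regions.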
See examples/attention_is_all_you_need/output.md for the result of parsing examples/attention_is_all_you_need.pdf.
pip install gptpdf
from gptpdf import parse_pdf
api_key = 'Your OpenAI API Key'
pdf_path = 'examples/attention_is_all_you_need.pdf'  # path to the PDF to parse
content, image_paths = parse_pdf(pdf_path, api_key=api_key)
print(content)
See more in test/test.py
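A slightly more defensive variant of the usage above might look like the sketch below. `parse_to_dir` and the `./output` default are hypothetical, not part of gptpdf; the sketch relies only on the documented `parse_pdf` parameters and on the documented fallback to the `OPENAI_API_KEY` environment variable.

```python
import os
from typing import List, Tuple


def parse_to_dir(pdf_path: str, output_dir: str = "./output") -> Tuple[str, List[str]]:
    """Parse a PDF with gptpdf, writing the Markdown and images to output_dir."""
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(pdf_path)
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("Set OPENAI_API_KEY or pass api_key to parse_pdf")
    os.makedirs(output_dir, exist_ok=True)

    # Imported lazily so the input checks above run even without gptpdf installed
    from gptpdf import parse_pdf

    content, image_paths = parse_pdf(pdf_path, output_dir=output_dir)
    return content, image_paths
```

Checking the inputs before calling the API avoids spending on a request that would fail anyway.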
parse_pdf(pdf_path, output_dir='./', api_key=None, base_url=None, model='gpt-4o', verbose=False, gpt_worker=1)
Parses a PDF file into a Markdown file and returns the Markdown content and all image paths.
- pdf_path: path to the PDF file.
- output_dir: output directory, where all images and the Markdown file are stored.
- api_key: OpenAI API key (optional). If not provided, the OPENAI_API_KEY environment variable is used.
- base_url: OpenAI base URL (optional). If not provided, the OPENAI_BASE_URL environment variable is used.
- model: OpenAI vision model; the default is 'gpt-4o'. You can also use qwen-vl-max (not tested yet) or GLM-4V by changing the OPENAI_BASE_URL environment variable or specifying base_url. You can also use Azure OpenAI by setting base_url to https://xxxx.openai.azure.com/, in which case api_key is the Azure API key and model is 'azure_xxxx', where xxxx is the deployed model name (not the OpenAI model name).
- verbose: verbose mode.
- gpt_worker: number of GPT parsing workers; the default is 1. If your machine performs well, you can increase it to speed up parsing.
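To illustrate the Azure OpenAI convention described above, the sketch below builds the keyword arguments for `parse_pdf`. `azure_kwargs` is a hypothetical helper, not part of gptpdf, and the endpoint, deployment name, and key in the usage example are placeholders.

```python
from typing import Dict


def azure_kwargs(endpoint: str, deployment: str, api_key: str,
                 gpt_worker: int = 1) -> Dict[str, object]:
    """Build parse_pdf keyword arguments for an Azure OpenAI deployment."""
    if not endpoint.startswith("https://"):
        raise ValueError("endpoint should look like https://xxxx.openai.azure.com/")
    if gpt_worker < 1:
        raise ValueError("gpt_worker must be at least 1")
    return {
        "api_key": api_key,               # Azure API key
        "base_url": endpoint,             # Azure endpoint URL
        "model": f"azure_{deployment}",   # 'azure_' + deployed model name
        "gpt_worker": gpt_worker,
    }


if __name__ == "__main__":
    kwargs = azure_kwargs("https://xxxx.openai.azure.com/", "xxxx", "Your Azure API Key")
    print(kwargs["model"])  # prints azure_xxxx
    # from gptpdf import parse_pdf
    # content, image_paths = parse_pdf(pdf_path, **kwargs)
```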