The Streamlit PDF-to-Markdown project takes one or more PDFs and converts them into a corresponding set of Markdown files.
It helps users prepare files for ingestion into Retrieval-Augmented Generation (RAG) workflows.
- This is currently an OCR-free approach.
- Tables are handled reasonably well, though improvements are planned. Check the output manually to confirm that no errors were introduced.
- Images are excluded from the final Markdown output and are not saved locally.
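As an aid to the manual table check mentioned above, here is a minimal sketch that flags table rows whose cell count differs from the first row of the same table. It is not part of the project: the function names are hypothetical, it assumes the converted tables use pipe-delimited rows that start with `|`, and it does not handle escaped pipes inside cells. Treat its output as a list of places to look, not as proof that a table is correct.

```python
from typing import List, Tuple


def split_row(line: str) -> List[str]:
    """Split one pipe-delimited Markdown table row into its cells."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def find_ragged_tables(markdown: str) -> List[Tuple[int, int, int]]:
    """Return (line_number, expected_cells, found_cells) for each table row
    whose cell count differs from the first row of its table."""
    problems = []
    expected = None  # cell count of the current table's first row
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.strip().startswith("|"):
            cells = len(split_row(line))
            if expected is None:
                expected = cells
            elif cells != expected:
                problems.append((number, expected, cells))
        else:
            expected = None  # a non-table line ends the current table
    return problems


# Hypothetical sample: the last row has one cell too many.
sample = "\n".join([
    "| Name | Qty |",
    "| --- | --- |",
    "| Bolt | 4 |",
    "| Nut | 6 | extra |",
])
print(find_ragged_tables(sample))
```

To check a converted file, read its text and pass it to `find_ragged_tables`, then review each reported line by hand.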
This project will likely form part of a pipeline-based approach to handling different types of PDFs in the future.
This README provides step-by-step instructions for setting up and using the project on your local machine.
The main branch of the project is designed to run entirely on your local machine. This version doesn't rely on external API calls and gives you greater control over your data. If you're looking for a self-contained solution, the main branch is the way to go.
-TDB-
To get started with the pdfmd project, you'll need to follow these installation steps:
- Create a virtual environment and activate it on your local machine to isolate the project's dependencies.
Mac:
python -m venv pdfmd-env
source pdfmd-env/bin/activate
Windows (Git Bash):
python -m venv pdfmd-env
source pdfmd-env/Scripts/activate
- Navigate to the directory where you want to keep the project, clone the repository, and move into the cloned folder.
git clone https://github.com/headstrongpete/pdfmd.git
cd pdfmd
- Install the required Python packages using pip.
pip install -r requirements.txt
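Before launching the app, it can help to confirm that the virtual environment is active and that the install succeeded. The following is a small sketch using only the Python standard library, not something shipped with the project. It assumes that Streamlit is among the installed requirements, which the `streamlit run` command implies, and the function name is hypothetical.

```python
import importlib.util
import sys


def in_virtual_env() -> bool:
    """True when the interpreter runs inside a venv-created environment."""
    return sys.prefix != sys.base_prefix


if in_virtual_env():
    print(f"Virtual environment active: {sys.prefix}")
else:
    print("No virtual environment detected; activate pdfmd-env first.")

# Assumption: streamlit is installed by requirements.txt.
streamlit_found = importlib.util.find_spec("streamlit") is not None
print("streamlit importable:", streamlit_found)
```

Run it with the same `python` you used to create the environment. If either check fails, repeat the activation or install step.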
Open your terminal and run the following command to start the project application:
streamlit run app.py
- You can now select one or more files to convert, as well as the folder where the converted files should be saved.
- Select Convert to initiate the transformation process.
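To make the file-to-file correspondence concrete, the sketch below pairs each selected PDF with a Markdown path in the chosen output folder. This illustrates the idea only and is not the project's actual code: `plan_outputs` is a hypothetical helper, and reusing the PDF's base name with a `.md` extension is an assumed naming scheme.

```python
from pathlib import Path
from typing import Dict, Iterable


def plan_outputs(pdf_paths: Iterable[str], output_dir: str) -> Dict[Path, Path]:
    """Map each input PDF to a .md path with the same stem in output_dir."""
    target = Path(output_dir)
    plan = {}
    for raw in pdf_paths:
        source = Path(raw)
        if source.suffix.lower() != ".pdf":
            continue  # skip anything that is not a PDF
        plan[source] = target / (source.stem + ".md")
    return plan


# Hypothetical inputs: two PDFs and one file that should be ignored.
plan = plan_outputs(["reports/q1.pdf", "reports/Q2.PDF", "notes.txt"], "converted")
for source, destination in plan.items():
    print(f"{source} -> {destination}")
```

Note that two PDFs with the same base name in different folders would map to the same output path under this scheme, so a real implementation would need to handle that collision.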
If you encounter any issues, have suggestions, or want to report a bug, please visit the Issues section of the project repository and create a new issue. Provide detailed information about the problem you're facing, and I'll do my best to assist you.
This project is licensed under the Apache License 2.0. For details, see the LICENSE file.