A test project for processing PDF data via an LLM.

DINGPDF

Details

Hey everyone! This is my learning project: a functional tool for asking questions about an input PDF via an LLM running locally, rather than relying on ChatGPT or other cloud-based solutions.

The UI is built with Streamlit, and the LLM pipeline is driven by LangChain, with FAISS providing vector storage. The model is currently hardcoded to Mistral-7B-OpenOrca from Hugging Face.
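Before a PDF can be queried, its text has to be split into chunks that get embedded and stored in FAISS. The app itself uses LangChain's text splitters for this; the stdlib-only sketch below just illustrates the chunk-with-overlap idea, and the function name and parameters are hypothetical stand-ins rather than the project's actual code.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks so context isn't lost at chunk boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        # Each chunk shares `overlap` characters with the previous one.
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Overlapping chunks mean a sentence that straddles a boundary still appears whole in at least one chunk, which helps retrieval quality.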

Included is my CV, which is the PDF I have been using to test the model, because if you're looking for someone excited to learn: here I am!

Usage

To begin, create a virtual environment and install the dependencies from requirements.txt. Then run the following command to start the application:

streamlit run app.py

This will open a browser tab that, after loading (and, if needed, downloading) the model, displays the following:

[screenshot: display of the app's webpage]

Using the sidebar on the left, import a PDF (you can import multiple, but your mileage may vary on processing success), then hit the "submit PDF" button.

[screenshot: the sidebar showing an uploaded PDF]

After the spinner finishes, you can begin interacting with the model by typing into the main chat window.

[screenshot: the chat window]

Debug information is printed to the console window in which the Streamlit app is running.

TBD

  • Storing processed vectors and loading them in for later reuse
  • Optimisation work
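For the planned vector-reuse feature, LangChain's FAISS wrapper offers `save_local` and `load_local` helpers. As an illustration of the same save/load round trip, here is a stdlib-only sketch; the function names and the JSON-on-disk format are my own hypothetical stand-ins, not the project's planned implementation.

```python
import json
from pathlib import Path

def save_vectors(vectors: dict[str, list[float]], path: str) -> None:
    """Persist a mapping of chunk id -> embedding vector to disk."""
    Path(path).write_text(json.dumps(vectors))

def load_vectors(path: str) -> dict[str, list[float]]:
    """Reload previously processed vectors so the PDF need not be re-embedded."""
    return json.loads(Path(path).read_text())
```

Persisting embeddings this way would let a previously processed PDF skip the (slow) embedding step on subsequent runs.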

Limitations

The program is currently not heavily optimised. There is an initial loading period that can be long while it downloads the LLM, and any input PDF needs to be fairly clean. It also runs the LLM locally, so while you won't need an API key, its ability to function is limited by your local machine.

It also primarily uses the CPU rather than the GPU. I believe this can be solved by switching to a CUDA-enabled build of PyTorch, but I have yet to implement or test that.
