Cannot index 5MB PDF with default settings using bedrock  #94

Open
@dirkpetersen

Description

I am trying to upload this file (5 MB, 2,384,000 characters) to LibreChat with the Bedrock API activated:
https://pve.proxmox.com/pve-docs/pve-admin-guide.pdf

I tried both the dev and dev-lite containers, but I get an upload error ("An Error occurred while uploading a file") in the LibreChat GUI. Strangely, the logs show no real error, even with DEBUG_RAG_API=true.

If I set CHUNK_SIZE=5000, however, it works. These are my RAG settings:

DEBUG_RAG_API=true
RAG_USE_FULL_CONTEXT=true
PDF_EXTRACT_IMAGES=false # false is default
CHUNK_SIZE=5000 # 1500 is default

AWS_DEFAULT_REGION=us-west-2
AWS_ACCESS_KEY_ID=cc
AWS_SECRET_ACCESS_KEY=cc

EMBEDDINGS_PROVIDER=bedrock
EMBEDDINGS_MODEL=amazon.titan-embed-text-v1

RAG_API_URL=http://host-gateway:8000
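
For context on why CHUNK_SIZE might matter here, the sketch below estimates how many chunks a 2,384,000-character document yields at each setting. It assumes a plain fixed-size character splitter with no overlap, which is a simplification: a real splitter breaks on separators and may use an overlap, so actual counts will differ. `estimate_chunks` is a hypothetical helper, not part of rag_api.

```python
import math


def estimate_chunks(total_chars: int, chunk_size: int, chunk_overlap: int = 0) -> int:
    """Rough chunk count for a fixed-size character splitter (assumption)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_chars <= chunk_size:
        return 1
    # After the first chunk, each further chunk advances by (size - overlap).
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((total_chars - chunk_size) / step)


TOTAL_CHARS = 2_384_000  # character count reported in this issue

for size in (1500, 5000):
    print(f"CHUNK_SIZE={size}: ~{estimate_chunks(TOTAL_CHARS, size)} chunks")
# CHUNK_SIZE=1500: ~1590 chunks
# CHUNK_SIZE=5000: ~477 chunks
```

If each chunk has to be embedded separately, the default setting means roughly three times as many embedding inputs as CHUNK_SIZE=5000, which would lengthen the upload request accordingly.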

Activity

dirkpetersen commented on Oct 27, 2024

@dirkpetersen
Author

Further testing shows that CHUNK_SIZE=5000 does not fully fix the issue; more testing is needed. ChatGPT accepts this document, but Claude says it is too big.

FinnConnor commented on Oct 28, 2024

@FinnConnor
Collaborator

I tested with CHUNK_SIZE=1500, EMBEDDINGS_PROVIDER=bedrock, EMBEDDINGS_MODEL=amazon.titan-embed-text-v1, and PDF_EXTRACT_IMAGES=False.

I was unable to reproduce any issue with indexing and querying this PDF (5 MB), either with the full Docker setup or with only the rag_api.

If you are getting a file upload error, I would run just the rag_api (and database) and see whether you can upload the 5 MB PDF through the /embed endpoint. This will help confirm whether the issue is with embedding the file or with something else.
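
A minimal sketch of such a standalone test, using only the Python standard library. The form field names `file_id` and `file`, the optional bearer token, and the base URL `http://localhost:8000` are assumptions and should be checked against the rag_api version in use. `build_embed_request` and `send` are hypothetical helper names.

```python
import sys
import urllib.request
import uuid
from typing import Optional


def build_embed_request(base_url: str, filename: str, data: bytes,
                        file_id: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a multipart POST to <base_url>/embed (field names are assumptions)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="file_id"\r\n\r\n'
         f"{file_id}\r\n").encode(),
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/pdf\r\n\r\n").encode(),
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if token:  # assumption: only needed if the API is run with auth enabled
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{base_url}/embed", data=body,
                                  headers=headers, method="POST")


def send(request: urllib.request.Request, timeout_s: float = 600.0):
    """Send the request with a generous client-side timeout."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return resp.status, resp.read().decode("utf-8", "replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]  # e.g. pve-admin-guide.pdf
    with open(path, "rb") as fh:
        req = build_embed_request("http://localhost:8000", path, fh.read(), "test-5mb-pdf")
    print(send(req))
```

Calling the rag_api directly takes the LibreChat API and NGINX out of the path, so a success here would point away from the embedding step itself.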

If you're not having an issue with that, the cause may be RAG_USE_FULL_CONTEXT=true. That setting sends the entire context (all of the text in the 5 MB PDF) to the chat, which very likely exceeds the maximum number of input tokens.
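
A back-of-the-envelope check of that claim, as a sketch. Both constants are assumptions rather than measured values: roughly 4 characters per token is a common heuristic for English text, and 200,000 tokens stands in for a large chat-model context window. The real tokenizer and model limit will differ.

```python
import math

TOTAL_CHARS = 2_384_000          # character count reported in this issue
CHARS_PER_TOKEN = 4.0            # assumption: rough heuristic for English text
CONTEXT_WINDOW_TOKENS = 200_000  # assumption: illustrative model limit


def estimate_tokens(chars: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Very rough token estimate from a character count."""
    return math.ceil(chars / chars_per_token)


tokens = estimate_tokens(TOTAL_CHARS)
print(f"~{tokens:,} tokens, {tokens / CONTEXT_WINDOW_TOKENS:.1f}x the assumed window")
# ~596,000 tokens, 3.0x the assumed window
```

Even if the heuristic is off by a factor of two, the full document text would still not fit in the assumed window.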

Thanks for bringing this up @dirkpetersen

dirkpetersen commented on Oct 28, 2024

@dirkpetersen
Author

Thanks @ScarFX. I set RAG_USE_FULL_CONTEXT=false, but the problem persists.

LibreChat-NGINX   | 97.113.82.140 - - [28/Oct/2024:23:02:31 +0000] "POST /api/convos/gen_title HTTP/2.0" 200 54 "https://ochat1028b.aws.internetchen.de/c/77d89899-5729-42cf-84e8-d8a8a228ce78" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36" "-"
LibreChat-NGINX   | 2024/10/28 23:02:38 [warn] 30#30: *1 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000001, client: 97.113.82.140, server: _, request: "POST /api/files HTTP/2.0", host: "ochat1028b.aws.internetchen.de", referrer: "https://ochat1028b.aws.internetchen.de/c/77d89899-5729-42cf-84e8-d8a8a228ce78"
rag_api-1         | /usr/local/lib/python3.10/site-packages/pypdf/_crypt_providers/_cryptography.py:32: CryptographyDeprecationWarning: ARC4 has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.ARC4 and will be removed from this module in 48.0.0.
rag_api-1         |   from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4
chat-mongodb      | {"t":{"$date":"2024-10-28T23:03:04.400+00:00"},"s":"I",  "c":"WTCHKPT",  "id":22430,   "ctx":"Checkpointer","msg":"WiredTiger message","attr":{"message":{"ts_sec":1730156584,"ts_usec":400378,"thread":"1:0xffff8cf8e6c0","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_CHECKPOINT_PROGRESS","category_id":7,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"saving checkpoint snapshot min: 21, snapshot max: 21 snapshot count: 0, oldest timestamp: (0, 0) , meta checkpoint timestamp: (0, 0) base write gen: 545"}}}
LibreChat-NGINX   | 2024/10/28 23:03:38 [error] 30#30: *1 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 97.113.82.140, server: _, request: "POST /api/files HTTP/2.0", upstream: "http://172.20.0.6:3080/api/files", host: "ochat1028b.aws.internetchen.de", referrer: "https://ochat1028b.aws.internetchen.de/c/77d89899-5729-42cf-84e8-d8a8a228ce78"
LibreChat-NGINX   | 97.113.82.140 - - [28/Oct/2024:23:03:38 +0000] "POST /api/files HTTP/2.0" 504 569 "https://ochat1028b.aws.internetchen.de/c/77d89899-5729-42cf-84e8-d8a8a228ce78" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36" "-"
rag_api-1         | 2024-10-28 23:03:38,768 - root - INFO - Request POST http://rag_api:8000/embed - 200

It seems there is a timeout. Next, I will try the RAG API standalone.
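
The timestamps in the log above can be compared directly. The sketch below parses three of them and computes the gaps. The only external fact used is that NGINX's `proxy_read_timeout` defaults to 60 seconds; whether this deployment overrides it is not shown in the logs, so treat the match as suggestive rather than confirmed.

```python
from datetime import datetime

NGINX_FMT = "%Y/%m/%d %H:%M:%S"    # NGINX error-log timestamps
RAG_FMT = "%Y-%m-%d %H:%M:%S,%f"   # rag_api Python logging timestamps

# Values copied from the log lines above.
request_buffered = datetime.strptime("2024/10/28 23:02:38", NGINX_FMT)
upstream_timeout = datetime.strptime("2024/10/28 23:03:38", NGINX_FMT)
embed_finished = datetime.strptime("2024-10-28 23:03:38,768", RAG_FMT)

proxy_wait_s = (upstream_timeout - request_buffered).total_seconds()
embed_done_s = (embed_finished - request_buffered).total_seconds()

print(f"NGINX gave up after {proxy_wait_s:.0f} s")
print(f"rag_api logged /embed 200 about {embed_done_s:.3f} s after the upload arrived")
# NGINX gave up after 60 s
# rag_api logged /embed 200 about 60.768 s after the upload arrived
```

The NGINX timestamps only have one-second resolution, but the pattern is consistent with the proxy timing out at about 60 seconds while the embedding call completed successfully just afterwards.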

FinnConnor commented on Nov 18, 2024

@FinnConnor
Collaborator

@dirkpetersen, were you able to get the RAG API to work?

dvejsada commented on Jan 14, 2025

@dvejsada

We have been experiencing this issue as well. File uploads fail from around 3 MB upward; smaller files work fine. I am attaching one of the failed files for reproduction purposes (a website article saved as PDF): Clanek_SeznamZpravy.pdf.
@danny-avila, could you please have a look at what may be causing this? We use Azure OpenAI embeddings with the text-embedding-3-large model.

      Cannot index 5MB PDF with default settings using bedrock · Issue #94 · danny-avila/rag_api