[improvement] .render() isn't that robust - wrong ordered results #1586
Bug description
The default OCR model works very well, but the render() algorithm, which converts coordinates to text positions, is very buggy.
This causes lines originally placed at the top to be positioned between other lines at the bottom, making the overall result unusable for LLM inference.
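As a possible stopgap, here is a minimal sketch of re-ordering lines by their coordinates instead of relying on render(). It assumes each page in result.export()["pages"] is a dict with "blocks" containing "lines", that each line carries a straight-box "geometry" of the form ((xmin, ymin), (xmax, ymax)) in relative coordinates, and that each word has a "value". The helper names and the row_tol threshold are hypothetical, and this is not how doctr or Tesseract order text internally.

```python
from typing import Any, Dict, List


def order_lines(page: Dict[str, Any], row_tol: float = 0.01) -> List[str]:
    """Return the text of every line on an exported page, top-to-bottom then left-to-right.

    Assumes straight boxes: line["geometry"] == ((xmin, ymin), (xmax, ymax)).
    row_tol is a hypothetical threshold (relative page height) for treating
    two lines as belonging to the same visual row.
    """
    items = []
    for block in page.get("blocks", []):
        for line in block.get("lines", []):
            (xmin, ymin), _ = line["geometry"]
            text = " ".join(word["value"] for word in line["words"])
            items.append((ymin, xmin, text))

    # Sort by the top edge first so rows can be built in a single pass.
    items.sort()

    rows = []  # each entry: (ymin of the first line in the row, [(xmin, text), ...])
    for ymin, xmin, text in items:
        if rows and ymin - rows[-1][0] <= row_tol:
            rows[-1][1].append((xmin, text))
        else:
            rows.append((ymin, [(xmin, text)]))

    ordered = []
    for _, row in rows:
        # Within a row, read left to right.
        ordered.extend(text for _, text in sorted(row))
    return ordered


def render_page(page: Dict[str, Any], row_tol: float = 0.01) -> str:
    """Join the ordered lines of one page with newlines."""
    return "\n".join(order_lines(page, row_tol))
```

With a real result, something like "\n\n".join(render_page(p) for p in result.export()["pages"]) could stand in for result.render(), provided the exported structure matches the assumptions above. Multi-column layouts would still need extra handling, since this sketch only sorts by vertical position and then by horizontal position.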
I wonder if you have considered reusing the algorithm implemented in Tesseract. They probably solved the same problem many years ago.
And I also wonder why the Tesseract team is not integrating the doctr engine into Tesseract :-)
Good job! You are leading the OCR leaderboard.
I attached a sample .PDF file and a snippet to reproduce the problem.
I checked other similar inactive issues, so I'm afraid rendering to text is currently not a hot topic :-(
...but how are we supposed to feed our hungry LLMs?
Code snippet to reproduce the bug
import argparse
import os
import json

from doctr.io import DocumentFile
from doctr.models import ocr_predictor


def convert_pdf_to_txt(input_pdf, output_txt):
    """
    Converts a PDF file to a text file using DocTR OCR.

    Args:
        input_pdf (str): Path to the input PDF file.
        output_txt (str): Path to the output text file.
    """
    print("Load pre-trained OCR model")
    model = ocr_predictor(pretrained=True)

    # Ensure input PDF exists
    if not os.path.exists(input_pdf):
        raise ValueError(f"Input PDF file '{input_pdf}' does not exist.")

    # Load the PDF document
    try:
        doc = DocumentFile.from_pdf(input_pdf)
    except Exception as e:
        raise ValueError(f"Error loading PDF '{input_pdf}': {e}")

    # Perform OCR and extract text
    try:
        result = model(doc)
        #exp = result.export()
        #text = json.dumps(exp)
        text = result.render()
    except Exception as e:
        raise ValueError(f"Error performing OCR on '{input_pdf}': {e}")

    # Write extracted text to output file
    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write(text)
    print(f"PDF '{input_pdf}' converted to text file '{output_txt}'.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert PDF to text using DocTR OCR")
    parser.add_argument("input_pdf", help="Path to the input PDF file")
    parser.add_argument("output_txt", help="Path to the output text file")
    args = parser.parse_args()
    convert_pdf_to_txt(args.input_pdf, args.output_txt)
Error traceback
No error
Environment
Linux, conda, python 3.9
Deep Learning backend
Default model.
test-ocr.pdf