[improvement] .render() isn't that robust - wrong ordered results #1586

Open
@kripper

Description

Bug description

The default OCR model works very well, but the render() algorithm, which converts the detected coordinates into text positions, is unreliable.
Lines originally placed at the top of the page end up between lines from the bottom, which makes the overall result unusable for LLM inference.

I wonder if you have considered reusing the algorithm implemented in Tesseract. They probably solved the same problem many years ago.
And I also wonder why the Tesseract team is not integrating the doctr engine into Tesseract :-)

Good job! You are leading the OCR leaderboard.

I attached a sample .PDF file and a snippet to reproduce the problem.
I checked other similar inactive issues, so I'm afraid rendering to text is currently not a hot topic :-(
...but how are we supposed to feed our hungry LLMs?
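
A possible workaround, as a minimal sketch rather than a fix for render() itself: take the dictionary returned by result.export() and rebuild the reading order from the line coordinates. The helper name export_to_text and the row_tol parameter are hypothetical. The sketch assumes the export layout pages -> blocks -> lines -> words, with each line carrying a "geometry" list of relative (x, y) points and each word a "value" string. It also assumes a single-column page, so multi-column layouts would still be interleaved.

```python
from typing import Any


def export_to_text(export: dict[str, Any], row_tol: float = 0.5) -> str:
    """Rebuild text from an export()-style dict, ordering lines top to bottom.

    Hypothetical helper: assumes pages -> blocks -> lines -> words, where
    each line has a "geometry" sequence of relative (x, y) points.
    """
    pages_out = []
    for page in export["pages"]:
        lines = []
        for block in page["blocks"]:
            for line in block["lines"]:
                # Works for 2-point boxes and 4-point polygons alike.
                xs = [p[0] for p in line["geometry"]]
                ys = [p[1] for p in line["geometry"]]
                text = " ".join(w["value"] for w in line["words"])
                lines.append((min(ys), max(ys), min(xs), text))

        # Sort by vertical centre first, then by left edge.
        lines.sort(key=lambda l: ((l[0] + l[1]) / 2, l[2]))

        # Merge lines whose centres are close enough to share a visual row.
        rows = []
        for ymin, ymax, xmin, text in lines:
            yc = (ymin + ymax) / 2
            if rows and abs(yc - rows[-1]["yc"]) <= row_tol * rows[-1]["h"]:
                rows[-1]["items"].append((xmin, text))
            else:
                rows.append({"yc": yc, "h": ymax - ymin, "items": [(xmin, text)]})

        # Within a row, read left to right.
        pages_out.append(
            "\n".join(" ".join(t for _, t in sorted(r["items"])) for r in rows)
        )
    return "\n\n".join(pages_out)


# Hand-made sample in the assumed layout: the footer block comes first,
# the way a wrongly ordered result might list it.
sample = {"pages": [{"blocks": [
    {"lines": [{"geometry": ((0.1, 0.80), (0.5, 0.84)),
                "words": [{"value": "Footer"}]}]},
    {"lines": [{"geometry": ((0.1, 0.10), (0.4, 0.14)),
                "words": [{"value": "Title"}]},
               {"geometry": ((0.6, 0.105), (0.9, 0.145)),
                "words": [{"value": "Date"}]}]},
    {"lines": [{"geometry": ((0.1, 0.40), (0.9, 0.44)),
                "words": [{"value": "Body"}, {"value": "text"}]}]},
]}]}

print(export_to_text(sample))
```

With a real document this would be called as export_to_text(result.export()) in place of result.render(). The row tolerance is a guess and would probably need tuning per document type.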

Code snippet to reproduce the bug

import argparse
import os
import json

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

def convert_pdf_to_txt(input_pdf, output_txt):
  """
  Converts a PDF file to a text file using DocTR OCR.

  Args:
      input_pdf (str): Path to the input PDF file.
      output_txt (str): Path to the output text file.
  """

  print("Load pre-trained OCR model")
  model = ocr_predictor(pretrained=True)

  # Ensure input PDF exists
  if not os.path.exists(input_pdf):
    raise ValueError(f"Input PDF file '{input_pdf}' does not exist.")

  # Load the PDF document
  try:
    doc = DocumentFile.from_pdf(input_pdf)
  except Exception as e:
    raise ValueError(f"Error loading PDF '{input_pdf}': {e}")

  # Perform OCR and extract text
  try:
    result = model(doc)
    #exp = result.export()
    #text = json.dumps(exp)
    text = result.render()
  except Exception as e:
    raise ValueError(f"Error performing OCR on '{input_pdf}': {e}")

  # Write extracted text to output file
  with open(output_txt, 'w', encoding='utf-8') as f:
    f.write(text)

  print(f"PDF '{input_pdf}' converted to text file '{output_txt}'.")

if __name__ == "__main__":
  parser = argparse.ArgumentParser(description="Convert PDF to text using DocTR OCR")
  parser.add_argument("input_pdf", help="Path to the input PDF file")
  parser.add_argument("output_txt", help="Path to the output text file")
  args = parser.parse_args()

  convert_pdf_to_txt(args.input_pdf, args.output_txt)

Error traceback

No error

Environment

Linux, conda, Python 3.9

Deep Learning backend

Default model.
test-ocr.pdf
