VikParuchuri/pdftext

PDFText

Extracts text from PDFs in a similar way to PyMuPDF, but without the AGPL license. Built on pypdfium2.

Installation

You'll need Python 3.9+.

Install with:

pip install pdftext

Usage

  • Inspect the settings in pdftext/settings.py. You can override any settings with environment variables.

Plain text

This command will write out a text file with the extracted plain text.

pdftext PDF_PATH --out_path output.txt
  • PDF_PATH must be a single pdf file.
  • --out_path path to the output txt file. If not specified, will write to stdout.
  • --sort will attempt to sort in reading order if specified.

JSON

pdftext PDF_PATH --out_path output.txt --output_type json
  • PDF_PATH must be a single pdf file.
  • --out_path path to the output file. If not specified, will write to stdout.
  • --output_type specifies whether to write out plain text (default) or json
  • --sort will attempt to sort in reading order if specified.
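When calling the tool from a script, the flags above can be assembled in Python and handed to `subprocess.run`. This is only a minimal sketch, not part of pdftext: the helper names `build_pdftext_command` and `run_pdftext` are hypothetical, the code uses only the flags documented above, and it assumes the `pdftext` executable is on your PATH.

```python
import subprocess
from typing import List, Optional


def build_pdftext_command(
    pdf_path: str,
    out_path: Optional[str] = None,
    output_type: str = "plain",
    sort: bool = False,
) -> List[str]:
    """Assemble a pdftext command line (hypothetical helper, documented flags only)."""
    cmd = ["pdftext", pdf_path]
    if out_path is not None:
        cmd += ["--out_path", out_path]
    if output_type == "json":
        # Plain text is the default, so the flag is only added for JSON.
        cmd += ["--output_type", "json"]
    if sort:
        cmd.append("--sort")
    return cmd


def run_pdftext(pdf_path: str, **kwargs) -> str:
    """Run pdftext and return whatever it wrote to stdout."""
    result = subprocess.run(
        build_pdftext_command(pdf_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.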

The output will be a json list, with each item in the list corresponding to a single page in the input pdf (in order). Each page will include the following keys:

  • bbox - the page bbox, in [x1, y1, x2, y2] format

  • rotation - how much the page is rotated, in degrees (0, 90, 180, or 270)

  • page_idx - the index of the page

  • blocks - the blocks that make up the text on the page. Each block is approximately equal to a paragraph.

    • bbox - the block bbox, in [x1, y1, x2, y2] format
    • lines - the lines inside the block
      • bbox - the line bbox, in [x1, y1, x2, y2] format
      • chars - the individual characters in the line
        • char - the actual character, encoded in utf-8
        • rotation - how much the character is rotated, in degrees
        • bbox - the character bbox, in [x1, y1, x2, y2] format
        • origin - the original pdf coordinate origin
        • char_idx - the index of the character on the page (from 0 to number of characters, in original pdf order)
        • font - font info taken straight from the pdf (see the pdfium code)
          • size - the size of the font used for the character
          • weight - font weight
          • name - font name, may be None
          • flags - font flags, in the format of the PDF spec 1.7 Section 5.7.1 Font Descriptor Flags
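To make the nesting above concrete, here is a small sketch that walks one page of the JSON structure, rebuilds its text, and decodes the font flags. The sample page is invented for illustration and is not real pdftext output (in particular, the shape of `origin` is an assumption), and the helper names `line_text`, `page_text`, `bbox_size`, and `decode_font_flags` are hypothetical. The flag names and bit positions follow the Font Descriptor Flags table of the PDF 1.7 spec.

```python
import json

# Font descriptor flag bits from the PDF 1.7 spec; bit N has value 2 ** (N - 1).
FONT_FLAGS = {
    1: "FixedPitch", 2: "Serif", 3: "Symbolic", 4: "Script", 6: "Nonsymbolic",
    7: "Italic", 17: "AllCap", 18: "SmallCap", 19: "ForceBold",
}


def decode_font_flags(flags):
    """Return the names of the flag bits set in a font `flags` integer."""
    return [name for bit, name in FONT_FLAGS.items() if flags & (1 << (bit - 1))]


def line_text(line):
    """Join the characters of one line into a string."""
    return "".join(char["char"] for char in line["chars"])


def page_text(page):
    """Join lines with newlines and blocks with blank lines."""
    blocks = ["\n".join(line_text(line) for line in block["lines"])
              for block in page["blocks"]]
    return "\n\n".join(blocks)


def bbox_size(bbox):
    """Width and height of an [x1, y1, x2, y2] bbox."""
    x1, y1, x2, y2 = bbox
    return abs(x2 - x1), abs(y2 - y1)


# Invented sample data that mirrors the documented keys.
font = {"size": 12, "weight": 400, "name": "Helvetica", "flags": 32}
sample_page = {
    "bbox": [0, 0, 612, 792],
    "rotation": 0,
    "page_idx": 0,
    "blocks": [{
        "bbox": [72, 72, 85, 84],
        "lines": [{
            "bbox": [72, 72, 85, 84],
            "chars": [
                # The [x, y] shape of "origin" is assumed, not documented above.
                {"char": "H", "rotation": 0, "bbox": [72, 72, 81, 84],
                 "origin": [72, 82], "char_idx": 0, "font": font},
                {"char": "i", "rotation": 0, "bbox": [81, 72, 85, 84],
                 "origin": [81, 82], "char_idx": 1, "font": font},
            ],
        }],
    }],
}

# The top level of the output is a list with one entry per page.
pages = json.loads(json.dumps([sample_page]))
for page in pages:
    print(page["page_idx"], repr(page_text(page)), bbox_size(page["bbox"]))
```

For real output, the `pages` list would instead come from `json.load` on the file written with `--output_type json`.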