Extracts text from PDFs in a similar way to PyMuPDF, but without the AGPL license. Built on pypdfium2.
You'll need Python 3.9+.
Install with:
pip install pdftext
- Inspect the settings in pdftext/settings.py. You can override any setting with environment variables.
This command will write out a text file with the extracted plain text.
pdftext PDF_PATH --out_path output.txt
- PDF_PATH must be a single PDF file.
- --out_path path to the output txt file. If not specified, will write to stdout.
- --sort will attempt to sort in reading order if specified.
pdftext PDF_PATH --out_path output.txt --output_type json
- PDF_PATH must be a single PDF file.
- --out_path path to the output file. If not specified, will write to stdout.
- --output_type specifies whether to write out plain text (default) or json.
- --sort will attempt to sort in reading order if specified.
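For scripted use, the flags above can be assembled programmatically. The following is a minimal sketch rather than part of pdftext itself: build_pdftext_command and run_pdftext are hypothetical helpers that use only the flags documented above, and run_pdftext assumes the pdftext command is installed and on your PATH.

```python
import subprocess
from typing import List, Optional


def build_pdftext_command(
    pdf_path: str,
    out_path: Optional[str] = None,
    json_output: bool = False,
    sort: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble a pdftext CLI call from the documented flags."""
    cmd = ["pdftext", pdf_path]
    if out_path is not None:
        # Without --out_path the CLI writes to stdout.
        cmd += ["--out_path", out_path]
    if json_output:
        # Plain text is the default output type.
        cmd += ["--output_type", "json"]
    if sort:
        cmd.append("--sort")
    return cmd


def run_pdftext(pdf_path: str, **options) -> str:
    """Run the CLI (assumed to be on PATH) and return whatever it wrote to stdout."""
    result = subprocess.run(
        build_pdftext_command(pdf_path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.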
The output will be a JSON list, with each item in the list corresponding to a single page in the input PDF (in order). Each page will include the following keys:
- bbox - the page bbox, in [x1, y1, x2, y2] format
- rotation - how much the page is rotated, in degrees (0, 90, 180, or 270)
- page_idx - the index of the page
- blocks - the blocks that make up the text in the pdf. Approximately equal to a paragraph.
  - bbox - the block bbox, in [x1, y1, x2, y2] format
  - lines - the lines inside the block
    - bbox - the line bbox, in [x1, y1, x2, y2] format
    - chars - the individual characters in the line
      - char - the actual character, encoded in utf-8
      - rotation - how much the character is rotated, in degrees
      - bbox - the character bbox, in [x1, y1, x2, y2] format
      - origin - the original pdf coordinate origin
      - char_idx - the index of the character on the page (from 0 to number of characters, in original pdf order)
      - font - this is font info straight from the pdf, see this pdfium code
        - size - the size of the font used for the character
        - weight - font weight
        - name - font name, may be None
        - flags - font flags, in the format of the PDF spec 1.7 Section 5.7.1 Font Descriptor Flags
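The nesting above (pages, then blocks, then lines, then chars) can be walked with a few loops. This is a minimal sketch under stated assumptions: page_text is a hypothetical helper, and sample_pages is a hand-written stand-in trimmed to a few of the documented keys, not real pdftext output. Real output may already contain newline characters inside chars, in which case the line joiner should be adjusted.

```python
from typing import Any, Dict, List


def page_text(page: Dict[str, Any]) -> str:
    """Hypothetical helper: rebuild a page's text from its blocks, lines and chars."""
    lines: List[str] = []
    for block in page["blocks"]:
        for line in block["lines"]:
            # Each entry in "chars" carries one character under the "char" key.
            lines.append("".join(ch["char"] for ch in line["chars"]))
    return "\n".join(lines)


# Made-up sample in the documented shape (illustrative values only).
sample_pages = [
    {
        "bbox": [0, 0, 612, 792],
        "rotation": 0,
        "page_idx": 0,
        "blocks": [
            {
                "bbox": [72, 72, 120, 100],
                "lines": [
                    {
                        "bbox": [72, 72, 90, 84],
                        "chars": [
                            {"char": "H", "char_idx": 0},
                            {"char": "i", "char_idx": 1},
                        ],
                    },
                    {
                        "bbox": [72, 86, 120, 100],
                        "chars": [
                            {"char": "P", "char_idx": 2},
                            {"char": "D", "char_idx": 3},
                            {"char": "F", "char_idx": 4},
                        ],
                    },
                ],
            }
        ],
    }
]

for page in sample_pages:
    print(page["page_idx"], repr(page_text(page)))
```

With real output, the same function would be applied to each item of the list loaded with json.load from the file written by --out_path.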