PDFText

Extracts text from PDFs in a similar way to PyMuPDF, but without the AGPL license. Built on pypdfium2.

Installation

You'll need Python 3.9+.

Install with:

pip install pdftext

Usage

Inspect the settings in pdftext/settings.py. You can override any setting with an environment variable.
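If you call the command from a script, one way to apply an override to a single run is to pass a modified environment to the child process. This is a minimal sketch: the helper names are hypothetical, `EXAMPLE_SETTING` is a placeholder rather than a real pdftext setting, and it assumes each environment variable carries the same name as the setting in pdftext/settings.py.

```python
import os
import subprocess


def env_with_overrides(overrides, base=None):
    """Return a copy of the environment with the given variables overridden.

    Hypothetical helper, not part of pdftext. Values are converted to strings
    because environment variables are always strings.
    """
    env = dict(os.environ if base is None else base)
    env.update({name: str(value) for name, value in overrides.items()})
    return env


def run_with_settings(cmd, overrides):
    """Run cmd (an argv list) so the overrides are visible only to that process."""
    return subprocess.run(cmd, env=env_with_overrides(overrides), check=True)


# Example call (EXAMPLE_SETTING is a placeholder, check settings.py for real names):
# run_with_settings(["pdftext", "PDF_PATH"], {"EXAMPLE_SETTING": 1})
```

Passing the environment per call leaves the parent process untouched, which keeps overrides from leaking between runs.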

Plain text

This command will write out a text file with the extracted plain text.

pdftext PDF_PATH --out_path output.txt

PDF_PATH must be a single pdf file.
--out_path is the path to the output txt file. If not specified, output is written to stdout.
--sort will attempt to sort the text in reading order if specified.
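To drive the command from Python, one option is to build the argument list and hand it to `subprocess`. The sketch below uses only the flags listed above; `build_pdftext_command` and `run_pdftext` are hypothetical helpers, not part of pdftext.

```python
import subprocess


def build_pdftext_command(pdf_path, out_path=None, sort=False):
    """Build the argv list for the pdftext CLI (hypothetical helper)."""
    cmd = ["pdftext", str(pdf_path)]
    if out_path is not None:
        cmd += ["--out_path", str(out_path)]
    if sort:
        cmd.append("--sort")
    return cmd


def run_pdftext(pdf_path, out_path=None, sort=False):
    """Run the CLI and return its stdout (the extracted text when out_path is omitted)."""
    result = subprocess.run(
        build_pdftext_command(pdf_path, out_path=out_path, sort=sort),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout
```

Building the arguments as a list avoids shell quoting problems with paths that contain spaces.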

JSON

pdftext PDF_PATH --out_path output.txt --output_type json

PDF_PATH must be a single pdf file.
--out_path is the path to the output file. If not specified, output is written to stdout.
--output_type specifies whether to write out plain text (default) or json.
--sort will attempt to sort the text in reading order if specified.

The output will be a json list, with each item in the list corresponding to a single page in the input pdf (in order). Each page will include the following keys:

bbox - the page bbox, in [x1, y1, x2, y2] format
rotation - how much the page is rotated, in degrees (0, 90, 180, or 270)
page_idx - the index of the page
blocks - the blocks that make up the text in the pdf. Each block is approximately equal to a paragraph.
- bbox - the block bbox, in [x1, y1, x2, y2] format
- lines - the lines inside the block
  - bbox - the line bbox, in [x1, y1, x2, y2] format
  - chars - the individual characters in the line
    - char - the actual character, encoded in utf-8
    - rotation - how much the character is rotated, in degrees
    - bbox - the character bbox, in [x1, y1, x2, y2] format
    - origin - the original pdf coordinate origin
    - char_idx - the index of the character on the page (from 0 to number of characters, in original pdf order)
    - font - font info straight from the pdf (see the pdfium code)
      - size - the size of the font used for the character
      - weight - font weight
      - name - font name, may be None
      - flags - font flags, in the format of the PDF spec 1.7 Section 5.7.1 Font Descriptor Flags
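The nesting above (page, blocks, lines, chars) can be walked with plain dictionary access. The sketch below rebuilds page text from the characters and decodes the font flags; the helper names and the sample page are hand-written for illustration and are not real pdftext output. Joining lines with a newline and blocks with a blank line is an assumption, since the characters in real output may already include line breaks. The flag bit values follow the font descriptor flags table in the PDF 1.7 spec.

```python
# Font descriptor flag bits from the PDF 1.7 spec (bit 1 is the lowest bit).
FONT_FLAGS = {
    "FixedPitch": 1 << 0,
    "Serif": 1 << 1,
    "Symbolic": 1 << 2,
    "Script": 1 << 3,
    "Nonsymbolic": 1 << 5,
    "Italic": 1 << 6,
    "AllCap": 1 << 16,
    "SmallCap": 1 << 17,
    "ForceBold": 1 << 18,
}


def page_text(page):
    """Join chars into lines, lines into blocks, and blocks into one string."""
    blocks = []
    for block in page["blocks"]:
        lines = ["".join(ch["char"] for ch in line["chars"]) for line in block["lines"]]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def decode_font_flags(flags):
    """Return the names of the flag bits that are set in an integer flags value."""
    return [name for name, bit in FONT_FLAGS.items() if flags & bit]


def _sample_char(char, idx):
    """Build a char entry with the documented keys (values are made up)."""
    return {
        "char": char,
        "rotation": 0,
        "bbox": [0, 0, 1, 1],
        "origin": [0, 0],
        "char_idx": idx,
        "font": {"size": 12, "weight": 400, "name": None, "flags": 34},
    }


# Hand-written sample page that follows the documented schema.
sample_page = {
    "bbox": [0, 0, 612, 792],
    "rotation": 0,
    "page_idx": 0,
    "blocks": [
        {
            "bbox": [0, 0, 10, 10],
            "lines": [
                {"bbox": [0, 0, 10, 5], "chars": [_sample_char("H", 0), _sample_char("i", 1)]},
                {"bbox": [0, 5, 10, 10], "chars": [_sample_char("o", 2), _sample_char("k", 3)]},
            ],
        }
    ],
}

print(page_text(sample_page))
first_font = sample_page["blocks"][0]["lines"][0]["chars"][0]["font"]
print(decode_font_flags(first_font["flags"]))
```

With real output, load the file with `json.load` and call `page_text` on each item of the resulting list.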
