Releases: VikParuchuri/pdftext

Pin pypdfium2

30 Dec 20:44
ea2e9b5

pypdfium2 4.30.1 has a bug that affects text extraction, so this release pins pypdfium2 to the previous version.

Improved Segmentation with Heuristic-Based Approach

12 Dec 16:12
cd9d41d

We’ve removed pdftext's reliance on the decision tree for segmenting spans, lines, and blocks. It now uses simpler heuristics, which make segmentation more efficient and more accurate.
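
To give a feel for what a heuristic-based approach can look like, here is a minimal sketch that groups characters into lines by vertical overlap. It is an illustration only, not pdftext's actual implementation: the `Char` class, the `group_into_lines` function, and the `min_overlap` threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """Hypothetical character box; y grows downward (y0 = top, y1 = bottom)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def vertical_overlap(a: Char, b: Char) -> float:
    """Return the shared vertical extent as a fraction of the shorter box."""
    overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    shortest = min(a.y1 - a.y0, b.y1 - b.y0)
    if shortest <= 0:
        return 0.0
    return max(overlap, 0.0) / shortest


def group_into_lines(chars: List[Char], min_overlap: float = 0.5) -> List[str]:
    """Group chars (assumed to be in reading order) into lines of text.

    A char joins the current line if it overlaps the previous char
    vertically and does not jump back to the left; otherwise it starts
    a new line. The 0.5 threshold is an assumed value, not a tuned one.
    """
    lines: List[List[Char]] = []
    for ch in chars:
        if lines:
            prev = lines[-1][-1]
            same_row = vertical_overlap(prev, ch) >= min_overlap
            moves_right = ch.x0 >= prev.x0
            if same_row and moves_right:
                lines[-1].append(ch)
                continue
        lines.append([ch])
    return ["".join(c.text for c in line) for line in lines]
```

A rule like this needs no trained model and runs in a single pass over the characters, which is one plausible reason heuristics can be cheaper than a decision tree for this task.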

Fix loose charbox for quotes

03 Dec 20:39
f26428a

Special chars don't work well with the loose charbox. We'll remove loose entirely soon, but this is an intermediate fix for an annoying issue with misplaced quotes.

Fix memory leak warnings

19 Nov 18:32
c065ac0

Close PDF documents properly to avoid warnings and memory leaks.
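
As a rough sketch of the general pattern (not the code from this release), a document handle can be wrapped in `contextlib.closing` so that `close()` runs even if extraction raises. `FakeDocument` and `count_pages` below are hypothetical stand-ins used only to keep the example self-contained; a real PDF handle with a `close()` method should fit the same shape.

```python
from contextlib import closing


class FakeDocument:
    """Hypothetical stand-in for a PDF document handle (not a real library class)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.closed = False

    def page_count(self) -> int:
        if self.closed:
            raise ValueError("document is closed")
        return 1  # placeholder value for the sketch

    def close(self) -> None:
        self.closed = True


def count_pages(path: str):
    """Open a document, read from it, and guarantee it is closed afterwards."""
    doc = FakeDocument(path)
    # closing() calls doc.close() on exit, including when an exception is raised.
    with closing(doc):
        pages = doc.page_count()
    return pages, doc.closed
```

Closing the handle explicitly, rather than relying on garbage collection, keeps native resources from lingering and is the usual way to silence unclosed-resource warnings.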

Fix PDF flattening

25 Oct 17:47
10d979b

Ensure PDFs are flattened when multiprocessing is used.

Better device coordinate extraction

18 Oct 15:41
c88e23c

There were some cases where visual and text coordinates didn't align. This fixes that issue.

Revert extraction changes

17 Oct 19:57
c6a85c6
Merge pull request #14 from VikParuchuri/dev

Revert extraction

Python 3.13 compatibility

17 Oct 18:58
a7cd4fb
Merge pull request #13 from VikParuchuri/dev

Python 3.13 support

Ignore special chars, break lines more aggressively

17 Oct 18:51
7460bf4
Merge pull request #12 from VikParuchuri/dev

Improve line breaks, ignore special chars

Fix flattening bug

08 Oct 16:07
5915750
Merge pull request #11 from VikParuchuri/dev

Fix bug with flattening