Releases: VikParuchuri/pdftext
Pin pypdfium2
There's a bug with pypdfium2 4.30.1 and text extraction, so this release pins pypdfium2 to the previous version.
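If you manage dependencies yourself, a startup guard can flag the affected release. This is a minimal sketch, not part of pdftext: the helper names `is_affected_pypdfium2` and `installed_pypdfium2_version` are hypothetical, and the only fact taken from the note above is the version number 4.30.1.

```python
from importlib.metadata import PackageNotFoundError, version

AFFECTED_VERSION = "4.30.1"  # the release named in the note above


def is_affected_pypdfium2(installed: str) -> bool:
    """Return True if the given version string is the affected release."""
    return installed.strip() == AFFECTED_VERSION


def installed_pypdfium2_version():
    """Return the installed pypdfium2 version, or None if it is not installed."""
    try:
        return version("pypdfium2")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pypdfium2_version()
    if found is not None and is_affected_pypdfium2(found):
        print(f"pypdfium2 {found} is the release with the text extraction bug")
```

The check compares exact version strings on purpose, since the note names a single affected release rather than a range.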
Improved Segmentation with Heuristic-Based Approach
We’ve removed pdftext's reliance on the decision tree for segmenting spans, lines, and blocks, and now use simpler heuristics for more efficient and accurate segmentation.
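To illustrate what a heuristic of this kind can look like, here is a minimal sketch of line segmentation based on vertical position. It is not pdftext's actual rule set: the character shape (`{"char", "bbox"}`), the function name `group_chars_into_lines`, and the `y_tolerance` threshold are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical character shape: {"char": str, "bbox": [x0, y0, x1, y1]}
Char = Dict[str, object]


def group_chars_into_lines(chars: List[Char], y_tolerance: float = 0.5) -> List[str]:
    """Group characters (already in reading order) into lines of text.

    A new line starts when a character's vertical center moves more than
    y_tolerance * (previous character height) away from the previous center.
    """
    lines: List[List[str]] = []
    prev_center = None
    prev_height = None
    for ch in chars:
        x0, y0, x1, y1 = ch["bbox"]
        center = (y0 + y1) / 2
        height = abs(y1 - y0)
        starts_new_line = (
            prev_center is None
            or abs(center - prev_center) > y_tolerance * max(prev_height, 1e-6)
        )
        if starts_new_line:
            lines.append([])
        lines[-1].append(ch["char"])
        prev_center, prev_height = center, height
    return ["".join(line) for line in lines]
```

A rule like this needs no trained model and runs in a single pass over the characters, which is the kind of trade-off the note describes.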
Fix loose charbox for quotes
Special chars don't work well with the loose charbox. We'll remove loose entirely soon, but this is an intermediate fix for an annoying issue with misplaced quotes.
Fix memory leak warnings
Close the PDF documents properly to avoid warnings and memory leaks.
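The general pattern is to guarantee that `close()` runs even when extraction raises. The sketch below assumes only that the document object exposes a `close()` method, as pypdfium2's `PdfDocument` does. The helper name `extract_with_cleanup` and its parameters are hypothetical and not taken from pdftext.

```python
from contextlib import closing


def extract_with_cleanup(open_document, path, extract):
    """Open a document, run extract(doc), and always close the document.

    open_document: callable returning an object with a close() method
    extract: callable that takes the open document and returns a result
    """
    with closing(open_document(path)) as doc:
        return extract(doc)
```

With pypdfium2 installed, `open_document` could be `pypdfium2.PdfDocument`, and `extract` any function that reads from the open document.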
Fix PDF flattening
Ensure the PDF is flattened when using multiprocessing.
Better device coordinate extraction
There were some cases where visual and text coordinates didn't align. This fixes that issue.
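One common source of this kind of mismatch is the difference between PDF page space (origin at the bottom left, y increasing upward) and device space (origin at the top left, y increasing downward). The sketch below shows only that basic flip. It is not the fix shipped in this release, it ignores page rotation and crop boxes, and the name `page_to_device_bbox` is hypothetical.

```python
def page_to_device_bbox(bbox, page_height):
    """Convert [x0, y0, x1, y1] from bottom-left-origin page space
    to top-left-origin device space, normalizing the corner order."""
    x0, y0, x1, y1 = bbox
    top = page_height - max(y0, y1)      # higher page y is closer to the device top
    bottom = page_height - min(y0, y1)
    return [min(x0, x1), top, max(x0, x1), bottom]
```

For example, on a page 792 points tall, a box spanning y = 700 to 720 in page space sits 72 to 92 points from the top in device space.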
Revert extraction changes
Merge pull request #14 from VikParuchuri/dev: Revert extraction
Python 3.13 compatibility
Merge pull request #13 from VikParuchuri/dev: Python 3.13 support
Ignore special chars, break lines more aggressively
Merge pull request #12 from VikParuchuri/dev: Improve line breaks, ignore special chars
Fix flattening bug
Merge pull request #11 from VikParuchuri/dev: Fix bug with flattening