Text extraction code for columns. #366
Merged
Changes from 1 commit
Commits
54 commits
6fe0d20
Fixed filename:page in logging
peterwilliams97 22680be
Got CMap working for multi-rune entries
peterwilliams97 a9910e7
Treat CMap entries as strings instead of runes to handle multi-byte e…
peterwilliams97 0c54cec
Added a test for multibyte encoding.
peterwilliams97 6103fb8
Merge branch 'development' of https://github.com/unidoc/unipdf into cmap
peterwilliams97 e9c46fa
Merge branch 'cmap' into columns
peterwilliams97 6b13a99
First version of text extraction that recognizes columns
peterwilliams97 a5c538f
Added an expanation of the text columns code to README.md.
peterwilliams97 8303318
fixed typos
peterwilliams97 c515472
Abstracted textWord depth calculation. This required change textMark …
peterwilliams97 603b5ff
Added function comments.
peterwilliams97 fad1552
Fixed text state save/restore.
peterwilliams97 6b4314f
Adjusted inter-word search distance to make paragrah division work fo…
peterwilliams97 d21e2f8
Got text_test.go passing.
peterwilliams97 418f859
Reinstated hyphen suppression
peterwilliams97 2260e24
Handle more cases of fonts not being set in text extraction code.
peterwilliams97 a14d8e7
Fixed typo
peterwilliams97 49bbef0
More verbose logging
peterwilliams97 40806d7
Adding tables to text extractor.
peterwilliams97 29f2d9b
Merge branch 'development' of https://github.com/unidoc/unipdf into c…
peterwilliams97 af9508c
Added tests for columns extraction.
peterwilliams97 16b3c1c
Removed commented code
peterwilliams97 30fc953
Check for textParas that are on the same line when writing out extrac…
peterwilliams97 b4d90b6
Absorb text to the left of paras into paras e.g. Footnote numbers
peterwilliams97 975e038
Removed funny character from text_test.go
peterwilliams97 e6be021
Merge branch 'development' of https://github.com/unidoc/unipdf into c…
peterwilliams97 a7779a3
Merge branch 'development' of https://github.com/unidoc/unipdf into c…
peterwilliams97 5d7e4aa
Commented out a creator_test.go test that was broken by my text extra…
peterwilliams97 acb5caa
Big changes to columns text extraction code for PR.
peterwilliams97 80b54ef
Updated extractor/README
peterwilliams97 91479a7
Cleaned up some comments and removed a panic
peterwilliams97 72155a0
Increased threshold for truncating extracted text when there is no li…
peterwilliams97 09ebbcf
Improved an error message.
peterwilliams97 1c54e01
Removed irrelevant spaces
peterwilliams97 17bee4d
Commented code and removed unused functions.
peterwilliams97 e65fb04
Reverted PdfRectangle changes
peterwilliams97 5933a3d
Added duplicate text detection.
peterwilliams97 933021c
Combine diacritic textMarks in text extraction
peterwilliams97 f3770ee
Reinstated a diacritic recombination test.
peterwilliams97 e8abebd
Small code reorganisation
peterwilliams97 3f1df97
Reinstated handling of rotated text
peterwilliams97 3cca581
Addressed issues in PR review
peterwilliams97 b39f205
Merge branch 'development' of https://github.com/unidoc/unipdf into c…
peterwilliams97 d5c344d
Added color fields to TextMark
peterwilliams97 fe6afef
Updated README
peterwilliams97 8be2607
Reinstated the disabled tests I missed before.
peterwilliams97 a5e21a7
Tightened definition for tables to prevent detection of tables where …
peterwilliams97 8f64966
Compute line splitting search range based on fontsize of first word i…
peterwilliams97 25414d4
Use errors.Is(err, core.ErrNotSupported) to distinguish unsupported f…
peterwilliams97 cf91ad6
Fixed some naming and added some comments.
peterwilliams97 9caa40e
Merge branch 'development' of https://github.com/unidoc/unipdf into c…
peterwilliams97 b7f91fd
errors.Is -> xerrors.Is and %w -> %v for go 1.12 compatibility
peterwilliams97 d3deac8
Removed code that doesn't ever get called.
peterwilliams97 fe35826
Removed unused test
peterwilliams97
Reinstated handling of rotated text
commit 3f1df971e5108ed5cc5617b24466de1f8a4bebd4
Are these todos current?
I update them every day. If there haven't been any commits in the last 24 hours, they should be up to date.
They are up to date now.