Text extraction code for columns. #366

Merged
merged 54 commits into from
Jun 30, 2020
6fe0d20
Fixed filename:page in logging
peterwilliams97 May 19, 2020
22680be
Got CMap working for multi-rune entries
peterwilliams97 May 19, 2020
a9910e7
Treat CMap entries as strings instead of runes to handle multi-byte e…
peterwilliams97 May 20, 2020
0c54cec
Added a test for multibyte encoding.
peterwilliams97 May 20, 2020
6103fb8
Merge branch 'development' of https://github.com/unidoc/unipdf into cmap
peterwilliams97 May 20, 2020
e9c46fa
Merge branch 'cmap' into columns
peterwilliams97 May 24, 2020
6b13a99
First version of text extraction that recognizes columns
peterwilliams97 May 24, 2020
a5c538f
Added an expanation of the text columns code to README.md.
peterwilliams97 May 24, 2020
8303318
fixed typos
peterwilliams97 May 24, 2020
c515472
Abstracted textWord depth calculation. This required change textMark …
peterwilliams97 May 24, 2020
603b5ff
Added function comments.
peterwilliams97 May 25, 2020
fad1552
Fixed text state save/restore.
peterwilliams97 May 26, 2020
6b4314f
Adjusted inter-word search distance to make paragrah division work fo…
peterwilliams97 May 26, 2020
d21e2f8
Got text_test.go passing.
peterwilliams97 May 27, 2020
418f859
Reinstated hyphen suppression
peterwilliams97 May 27, 2020
2260e24
Handle more cases of fonts not being set in text extraction code.
peterwilliams97 May 28, 2020
a14d8e7
Fixed typo
peterwilliams97 May 28, 2020
49bbef0
More verbose logging
peterwilliams97 May 28, 2020
40806d7
Adding tables to text extractor.
peterwilliams97 Jun 1, 2020
29f2d9b
Merge branch 'development' of https://github.com/unidoc/unipdf into c…
peterwilliams97 Jun 5, 2020
af9508c
Added tests for columns extraction.
peterwilliams97 Jun 5, 2020
16b3c1c
Removed commented code
peterwilliams97 Jun 5, 2020
30fc953
Check for textParas that are on the same line when writing out extrac…
peterwilliams97 Jun 5, 2020
b4d90b6
Absorb text to the left of paras into paras e.g. Footnote numbers
peterwilliams97 Jun 5, 2020
975e038
Removed funny character from text_test.go
peterwilliams97 Jun 15, 2020
e6be021
Merge branch 'development' of https://github.com/unidoc/unipdf into c…
peterwilliams97 Jun 15, 2020
a7779a3
Merge branch 'development' of https://github.com/unidoc/unipdf into c…
peterwilliams97 Jun 22, 2020
5d7e4aa
Commented out a creator_test.go test that was broken by my text extra…
peterwilliams97 Jun 22, 2020
acb5caa
Big changes to columns text extraction code for PR.
peterwilliams97 Jun 22, 2020
80b54ef
Updated extractor/README
peterwilliams97 Jun 22, 2020
91479a7
Cleaned up some comments and removed a panic
peterwilliams97 Jun 22, 2020
72155a0
Increased threshold for truncating extracted text when there is no li…
peterwilliams97 Jun 22, 2020
09ebbcf
Improved an error message.
peterwilliams97 Jun 22, 2020
1c54e01
Removed irrelevant spaces
peterwilliams97 Jun 22, 2020
17bee4d
Commented code and removed unused functions.
peterwilliams97 Jun 23, 2020
e65fb04
Reverted PdfRectangle changes
peterwilliams97 Jun 23, 2020
5933a3d
Added duplicate text detection.
peterwilliams97 Jun 23, 2020
933021c
Combine diacritic textMarks in text extraction
peterwilliams97 Jun 24, 2020
f3770ee
Reinstated a diacritic recombination test.
peterwilliams97 Jun 24, 2020
e8abebd
Small code reorganisation
peterwilliams97 Jun 24, 2020
3f1df97
Reinstated handling of rotated text
peterwilliams97 Jun 25, 2020
3cca581
Addressed issues in PR review
peterwilliams97 Jun 25, 2020
b39f205
Merge branch 'development' of https://github.com/unidoc/unipdf into c…
peterwilliams97 Jun 25, 2020
d5c344d
Added color fields to TextMark
peterwilliams97 Jun 25, 2020
fe6afef
Updated README
peterwilliams97 Jun 25, 2020
8be2607
Reinstated the disabled tests I missed before.
peterwilliams97 Jun 25, 2020
a5e21a7
Tightened definition for tables to prevent detection of tables where …
peterwilliams97 Jun 25, 2020
8f64966
Compute line splitting search range based on fontsize of first word i…
peterwilliams97 Jun 26, 2020
25414d4
Use errors.Is(err, core.ErrNotSupported) to distinguish unsupported f…
peterwilliams97 Jun 27, 2020
cf91ad6
Fixed some naming and added some comments.
peterwilliams97 Jun 27, 2020
9caa40e
Merge branch 'development' of https://github.com/unidoc/unipdf into c…
peterwilliams97 Jun 27, 2020
b7f91fd
errors.Is -> xerrors.Is and %w -> %v for go 1.12 compatibility
peterwilliams97 Jun 29, 2020
d3deac8
Removed code that doesn't ever get called.
peterwilliams97 Jun 29, 2020
fe35826
Removed unused test
peterwilliams97 Jun 29, 2020
Reinstated handling of rotated text
peterwilliams97 committed Jun 25, 2020
commit 3f1df971e5108ed5cc5617b24466de1f8a4bebd4
4 changes: 2 additions & 2 deletions extractor/README.md
@@ -59,9 +59,9 @@ TODO
-----

* Remove serial code?
Contributor

Are these todos current?

Contributor Author

I update them every day. If there haven't been any commits in the last 24 hours, they should be up to date.

Contributor Author

They are up to date now.

* Remove verbose* logging?
* Reinstate rotated text handling.
* Remove `verbose*` logging?
* Come up with a better name for *reading* direction. Scanning direction? [Word order](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2694615/)?
peterwilliams97 marked this conversation as resolved.
* Handle diagonal text.
* Get R to L text extraction working.
* Get top to bottom text extraction working.
* Remove TM from ligature map.
24 changes: 20 additions & 4 deletions extractor/text.go
@@ -838,8 +838,7 @@ func (to *textObject) renderText(data []byte) error {
} else {
// TODO: This lookup seems confusing. Went from bytes <-> charcodes already.
// NOTE: This is needed to register runes by the font encoder - for subsetting (optimization).
original, ok := font.Encoder().CharcodeToRune(code)
if ok {
if original, ok := font.Encoder().CharcodeToRune(code); ok {
mark.original = string(original)
}
}
@@ -923,8 +922,25 @@ func (pt PageText) Tables() []TextTable {
// The comments above the TextMark definition describe how to use the []TextMark to
// map substrings of the page text to locations on the PDF page.
func (pt *PageText) computeViews() {
common.Log.Trace("ToTextLocation: %d elements", len(pt.marks))
paras := makeTextPage(pt.marks, pt.pageSize, 0)
// Extract text paragraphs one orientation at a time.
// If there are texts with several orientations on a page then all the text of the same
// orientation gets extracted together.
var paras paraList
n := len(pt.marks)
for orient := 0; orient < 360 && n > 0; orient += 90 {
marks := make([]*textMark, 0, len(pt.marks)-n)
for _, tm := range pt.marks {
if tm.orient == orient {
marks = append(marks, tm)
}
}
if len(marks) > 0 {
parasOrient := makeTextPage(marks, pt.pageSize)
paras = append(paras, parasOrient...)
n -= len(marks)
}
}
// Build the public viewable fields from the paraList.
b := new(bytes.Buffer)
paras.writeText(b)
pt.viewText = b.String()
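The loop in `computeViews` can be read in isolation as a grouping pass over the marks. The following is a minimal sketch of that idea, not the extractor's code: the `mark` type and `groupByOrientation` function are hypothetical stand-ins for `textMark` and the inline loop, and the sketch assumes every mark's orientation is already one of 0, 90, 180 or 270.

```go
package main

import "fmt"

// mark is a hypothetical stand-in for the extractor's textMark.
type mark struct {
	text   string
	orient int // Orientation in degrees: 0, 90, 180 or 270.
}

// groupByOrientation returns the marks grouped by orientation, visiting
// orientations in the order 0, 90, 180, 270. Marks inside a group keep their
// original relative order. The loop stops early once every mark is placed.
func groupByOrientation(marks []mark) [][]mark {
	var groups [][]mark
	remaining := len(marks)
	for orient := 0; orient < 360 && remaining > 0; orient += 90 {
		var group []mark
		for _, m := range marks {
			if m.orient == orient {
				group = append(group, m)
			}
		}
		if len(group) > 0 {
			groups = append(groups, group)
			remaining -= len(group)
		}
	}
	return groups
}

func main() {
	marks := []mark{{"side", 90}, {"Hello", 0}, {"note", 90}, {"world", 0}}
	for _, g := range groupByOrientation(marks) {
		fmt.Printf("orient=%d marks=%d\n", g[0].orient, len(g))
	}
}
```

In the diff each group is passed to `makeTextPage` and the resulting paragraphs are appended, so text of one orientation is written out before text of the next.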
2 changes: 2 additions & 0 deletions extractor/text_const.go
@@ -26,6 +26,8 @@ const (

// The following constants are the tuning parameters for text extraction
const (
// Change in angle of text in degrees that we treat as a different orientation.
orientationGranularity = 10
// Size of depth bins in points
depthBinPoints = 6

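`orientationGranularity` is the step that text angles are snapped to before the orientation is compared. The body of `nearestMultiple` is not part of this diff, so the sketch below is an assumption about its behaviour: `snapToMultiple` is a hypothetical helper that rounds an angle to the nearest multiple of the granularity.

```go
package main

import (
	"fmt"
	"math"
)

// granularity mirrors orientationGranularity from the diff.
const granularity = 10

// snapToMultiple rounds theta to the nearest integer multiple of m.
// It is an illustrative assumption about what nearestMultiple does.
func snapToMultiple(theta float64, m int) int {
	if m == 0 {
		m = 1 // Avoid division by zero.
	}
	fac := float64(m)
	return int(math.Round(theta/fac) * fac)
}

func main() {
	for _, theta := range []float64{0.3, 93.2, 97, 271.4} {
		fmt.Printf("%.1f -> %d\n", theta, snapToMultiple(theta, granularity))
	}
}
```

Under this assumption an angle of 93.2° snaps to 90 and is handled as rotated text, while 97° snaps to 100, which is not a multiple of 90 and would fall through to the `default` branch that the diff treats as diagonal text.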
69 changes: 54 additions & 15 deletions extractor/text_mark.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,15 +17,17 @@ import (
// textMark represents text drawn on a page and its position in device coordinates.
// All dimensions are in device coordinates.
type textMark struct {
serial int // Sequence number for debugging.
model.PdfRectangle // Bounding box.
text string // The text (decoded via ToUnicode).
original string // Original text (decoded).
font *model.PdfFont // The font the mark was drawn with.
fontsize float64 // The font size the mark was drawn with.
charspacing float64 // TODO(peterwilliams97): Should this be exposed in TextMark?
trm transform.Matrix // The current text rendering matrix (TRM above).
end transform.Point // The end of character device coordinates.
serial int // Sequence number for debugging.
model.PdfRectangle // Bounding box oriented so character base is at bottom
orient int // Orientation
text string // The text (decoded via ToUnicode).
original string // Original text (decoded).
font *model.PdfFont // The font the mark was drawn with.
fontsize float64 // The font size the mark was drawn with.
charspacing float64 // TODO(peterwilliams97): Should this be exposed in TextMark?
trm transform.Matrix // The current text rendering matrix (TRM above).
end transform.Point // The end of character device coordinates.
originaBBox model.PdfRectangle // Bounding box without orientation correction.
}

// newTextMark returns a textMark for text `text` rendered with text rendering matrix (TRM) `trm`
@@ -34,7 +36,7 @@ type textMark struct {
func (to *textObject) newTextMark(text string, trm transform.Matrix, end transform.Point,
spaceWidth float64, font *model.PdfFont, charspacing float64) (textMark, bool) {
theta := trm.Angle()
orient := nearestMultiple(theta, 10)
orient := nearestMultiple(theta, orientationGranularity)
var height float64
if orient%180 != 90 {
height = trm.ScalingFactorY()
@@ -51,7 +53,12 @@ func (to *textObject) newTextMark(text string, trm transform.Matrix, end transfo
bbox.Ury -= height
case 270:
bbox.Urx += height
case 0:
bbox.Ury += height
default:
// This is a hack to capture diagonal text.
// TODO(peterwilliams97): Extract diagonal text.
orient = 0
bbox.Ury += height
}
if bbox.Llx > bbox.Urx {
@@ -68,20 +75,52 @@ func (to *textObject) newTextMark(text string, trm transform.Matrix, end transfo
}
bbox = clipped

// The orientedBBox is bbox rotated and translated so the base of the character is at Lly.
orientedBBox := bbox
orientedMBox := to.e.mediaBox

switch orient % 360 {
case 90:
orientedMBox.Urx, orientedMBox.Ury = orientedMBox.Ury, orientedMBox.Urx
orientedBBox = model.PdfRectangle{
Llx: orientedMBox.Urx - bbox.Ury,
Urx: orientedMBox.Urx - bbox.Lly,
Lly: bbox.Llx,
Ury: bbox.Urx}
case 180:
orientedBBox = model.PdfRectangle{
Llx: bbox.Llx,
Urx: bbox.Urx,
Lly: orientedMBox.Ury - bbox.Lly,
Ury: orientedMBox.Ury - bbox.Ury}
case 270:
orientedMBox.Urx, orientedMBox.Ury = orientedMBox.Ury, orientedMBox.Urx
orientedBBox = model.PdfRectangle{
Llx: bbox.Ury,
Urx: bbox.Lly,
Lly: orientedMBox.Ury - bbox.Llx,
Ury: orientedMBox.Ury - bbox.Urx}
}
if orientedBBox.Llx > orientedBBox.Urx {
orientedBBox.Llx, orientedBBox.Urx = orientedBBox.Urx, orientedBBox.Llx
}
if orientedBBox.Lly > orientedBBox.Ury {
orientedBBox.Lly, orientedBBox.Ury = orientedBBox.Ury, orientedBBox.Lly
}

tm := textMark{
text: text,
PdfRectangle: bbox,
PdfRectangle: orientedBBox,
originaBBox: bbox,
font: font,
fontsize: height,
charspacing: charspacing,
trm: trm,
end: end,
orient: orient,
serial: serial.mark,
}
serial.mark++
if !isTextSpace(tm.text) && tm.Width() == 0.0 {
common.Log.Debug("ERROR: Zero width text. tm=%s", tm.String())
}
if verboseGeom {
common.Log.Info("newTextMark: start=%.2f end=%.2f %s", start, end, tm.String())
}
@@ -106,7 +145,7 @@ func (tm *textMark) ToTextMark() TextMark {
count: int64(tm.serial),
Text: tm.text,
Original: tm.original,
BBox: tm.PdfRectangle,
BBox: tm.originaBBox,
Font: tm.font,
FontSize: tm.fontsize,
}
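The `orientedBBox` switch in `newTextMark` maps a mark's bounding box into a frame where the character base is at the bottom. The sketch below repeats that arithmetic on a hypothetical `rect` type standing in for `model.PdfRectangle`, so the mapping can be tried on concrete numbers; `orientBBox` is an illustrative name and not a function in the package.

```go
package main

import "fmt"

// rect is a hypothetical stand-in for model.PdfRectangle.
type rect struct {
	Llx, Lly, Urx, Ury float64
}

// orientBBox maps bbox into the frame used for text with orientation orient
// (0, 90, 180 or 270 degrees), following the switch in the diff.
// mediaBox is the page media box in the unrotated frame.
func orientBBox(bbox, mediaBox rect, orient int) rect {
	o := bbox
	m := mediaBox
	switch orient % 360 {
	case 90:
		// Page width and height swap roles for sideways text.
		m.Urx, m.Ury = m.Ury, m.Urx
		o = rect{Llx: m.Urx - bbox.Ury, Urx: m.Urx - bbox.Lly, Lly: bbox.Llx, Ury: bbox.Urx}
	case 180:
		o = rect{Llx: bbox.Llx, Urx: bbox.Urx, Lly: m.Ury - bbox.Lly, Ury: m.Ury - bbox.Ury}
	case 270:
		m.Urx, m.Ury = m.Ury, m.Urx
		o = rect{Llx: bbox.Ury, Urx: bbox.Lly, Lly: m.Ury - bbox.Llx, Ury: m.Ury - bbox.Urx}
	}
	// Normalize so that Llx <= Urx and Lly <= Ury.
	if o.Llx > o.Urx {
		o.Llx, o.Urx = o.Urx, o.Llx
	}
	if o.Lly > o.Ury {
		o.Lly, o.Ury = o.Ury, o.Lly
	}
	return o
}

func main() {
	media := rect{Llx: 0, Lly: 0, Urx: 600, Ury: 800}
	bbox := rect{Llx: 100, Lly: 200, Urx: 150, Ury: 210}
	for _, orient := range []int{0, 90, 180, 270} {
		fmt.Printf("%3d: %+v\n", orient, orientBBox(bbox, media, orient))
	}
}
```

The diff keeps the unrotated box in `originaBBox` and returns it from `ToTextMark`, so the oriented box is only used internally for layout while callers still see page coordinates.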
2 changes: 1 addition & 1 deletion extractor/text_page.go
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@ import (
// 3) Detect textParas arranged as cells in a table and convert each one to a textPara containing a
// textTable.
// 4) Sort the textParas in reading order.
func makeTextPage(marks []*textMark, pageSize model.PdfRectangle, rot int) paraList {
func makeTextPage(marks []*textMark, pageSize model.PdfRectangle) paraList {
common.Log.Trace("makeTextPage: %d elements pageSize=%.2f", len(marks), pageSize)
if len(marks) == 0 {
return nil
37 changes: 16 additions & 21 deletions extractor/text_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -214,21 +214,21 @@ var fileExtractionTests = []struct {
},
},
// TODO(peterwilliams97): Reinstate rotation handling and this text.
// {filename: "000026.pdf",
// pageTerms: map[int][]string{
// 1: []string{"Fresh Flower",
// "Care & Handling",
// },
// },
// },
{filename: "000026.pdf",
pageTerms: map[int][]string{
1: {"Fresh Flower",
"Care & Handling",
},
},
},
{filename: "search_sim_key.pdf",
pageTerms: map[int][]string{
2: {"A cryptographic scheme which enables searching",
"Untrusted server should not be able to search for a word without authorization",
},
},
},
{filename: "Theil_inequality.pdf",
{filename: "Theil_inequality.pdf", // 270° rotated file.
pageTerms: map[int][]string{
1: {"London School of Economics and Political Science"},
4: {"The purpose of this paper is to set Theil’s approach"},
@@ -273,10 +273,6 @@ var fileExtractionTests = []struct {
1: {"entropy of a system of n identical resonators in a stationary radiation field"},
},
},
// Case where combineDiacritics was combining ' and " with preceding letters.
// NOTE(peterwilliams97): Part of the reason this test fails is that we don't currently read
// Type0:CIDFontType0 font metrics and assume zero displacement so that we place the ' and " too
// close to the preceding letters.
{filename: "/rfc6962.txt.pdf",
pageTerms: map[int][]string{
4: {"timestamps for certificates they then don’t log",
@@ -288,15 +284,14 @@ var fileExtractionTests = []struct {
10: {"الله"},
},
},
// TODO(peterwilliams97): Reinstate these 2 tests when diacritic combination is fixed.
// {filename: "Ito_Formula.pdf",
// pageTerms: map[int][]string{
// 1: {"In the Itô stochastic calculus",
// "In standard, non-stochastic calculus, one computes a derivative"},
// 2: {"Financial Economics Itô’s Formula"},
// },
// },
{filename: "thanh.pdf",
{filename: "Ito_Formula.pdf", // 90° rotated with diacritics in different textMarks to base.
pageTerms: map[int][]string{
1: {"In the Itô stochastic calculus",
"In standard, non-stochastic calculus, one computes a derivative"},
2: {"Financial Economics Itô’s Formula"},
},
},
{filename: "thanh.pdf", // Diacritics in different textMarks to base.
pageTerms: map[int][]string{
1: {"Hàn Thế Thành"},
6: {"Petr Olšák"},
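Each `fileExtractionTests` entry lists, per page number, phrases that must appear in the extracted text. The code that consumes these entries is not shown in this diff, so the following is only a sketch of one way such a check could work: `missingTerms` is a hypothetical helper and the page text is a made-up string rather than real extractor output.

```go
package main

import (
	"fmt"
	"strings"
)

// missingTerms returns the expected terms that do not occur in pageText.
// An empty result means the page passes.
func missingTerms(pageText string, terms []string) []string {
	var missing []string
	for _, term := range terms {
		if !strings.Contains(pageText, term) {
			missing = append(missing, term)
		}
	}
	return missing
}

func main() {
	// Same shape as the pageTerms field in the test table.
	pageTerms := map[int][]string{
		1: {"Fresh Flower", "Care & Handling"},
	}
	// Stand-in for text extracted from page 1.
	text := "Fresh Flower\nCare & Handling guide"
	fmt.Println(missingTerms(text, pageTerms[1]))
}
```

A substring check like this is sensitive to extraction order, which is why reinstating rotated text handling allows the `000026.pdf` and `Ito_Formula.pdf` entries to be re-enabled.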