Improved an error message.

unidoc · gunnsth · Jun 30, 2020 · May 19, 2020 · May 19, 2020 · May 20, 2020
commit 09ebbcf5771794a5e4e8d45dc785c22fd395ad32
diff --git a/extractor/README.md b/extractor/README.md
@@ -14,54 +14,56 @@ In English text,
 HOW TEXT IS EXTRACTED
 ---------------------
 
-`text_page.go` **makeTextPage** is the top level function that builds the `textPara`s.
+`text_page.go` **makeTextPage()** is the top level text extraction function. It returns an ordered
+list of `textPara`s which are described below.
 
-* A page's `textMark`s are obtained from its contentstream. They are in the order they occur in the contentstrem.
+* A page's `textMark`s are obtained from its content stream. They are in the order they occur in the content stream.
 * The `textMark`s are grouped into word fragments called`textWord`s by scanning through the textMarks
- and spltting on space characters and the gaps between marks.
-* The `textWords`s are grouped into `textParas`s based on their bounding boxes' proximities to other
- textWords.
+ and splitting on space characters and the gaps between marks.
+* The `textWords`s are grouped into rectangular regions  based on their bounding boxes' proximities
+  to other `textWords`. These rectangular regions are called `textParas`s. (In the current implementation
+  there is an intermediate step where the `textWords` are divided into containers called `wordBags`.)
 * The `textWord`s in each `textPara` are arranged into `textLine`s (`textWord`s of similar depth).
 * Within each `textLine`, `textWord`s are sorted in reading order and each one that starts a whole
-word is marked.
-See `textLine.text()`.
-* `textPara.writeCellText()` shows how to extract the paragraph text from this arrangment.
+word is marked by setting its `newWord` flag to true. (See `textLine.text()`.)
 * All the `textPara`s on a page are checked to see if they are arranged as cells within a table and,
 if they are, they are combined into `textTable`s and a `textPara` containing the `textTable` replaces
 the `textPara`s containing the cells.
 * The `textPara`s, some of which may be tables, are sorted into reading order (the order in which they
-are reading, not in the reading directions).
+are read, not in the *reading* direction).
 
 
-The entire order of extracted text from a page is expressed in `paraList.writeText()` which
+The entire order of extracted text from a page is expressed in `paraList.writeText()`.
 
-* Iterates through the `textParas1, which are sorted in reading.
-* For each `textPara` with a table, iterates through through the table cell `textPara`s.
-* For each (top level or table cell) `textPara` iterates through the `textLine`s.
-* For each `textLine` iterates through the `textWord`s inserting a space before each one that has
+* This function iterates through the `textPara`s, which are sorted in reading order.
+* For each `textPara` with a table, it iterates through the table cell `textPara`s. (See
+ `textPara.writeCellText()`.)
+* For each (top level or table cell) `textPara`, it iterates through the `textLine`s.
+* For each `textLine`, it iterates through the `textWord`s inserting a space before each one that has
  the `newWord` flag set.
 
 
 ### `textWord` creation
 
-* `makeTextWords()` combines `textMark`s into `textWord`s, word fragments
-* textWord`s are the atoms of the text extraction code.
+* `makeTextWords()` combines `textMark`s into `textWord`s, word fragments.
+* `textWord`s are the atoms of the text extraction code.
 
 ### `textPara` creation
 
-* `dividePage()` combines `textWord`s, that are close to each other into groups in rectangular
+* `dividePage()` combines `textWord`s that are close to each other into groups in rectangular
  regions called `wordBags`.
-* wordBag.arrangeText() arranges the textWords in the rectangle into `textLine`s, groups textWords
-of about the same depth sorted left to right.
-* textLine.markWordBoundaries() marks the textWords in each textLine that start whole words.
+* `wordBag.arrangeText()` arranges the `textWord`s in the rectangular regions into `textLine`s,
+  groups textWords of about the same depth sorted left to right.
+* `textLine.markWordBoundaries()` marks the `textWord`s in each `textLine` that start whole words.
 
 TODO
 -----
 
-* Remove serial code????
-* Remove verbose* logginng?
+* Remove serial code?
+* Remove verbose* logging?
 * Reinstate rotated text handling.
 * Reinstate  diacritic composition.
 * Reinstate duplicate text removal.
-* Reinstate creater_test.go extraction test.
-* Come up with a better name for _reading_ direction,
+* Come up with a better name for *reading* direction.
+* Get R to L text extraction working.
+* Get top to bottom text extraction working.
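
Review note: the traversal order that the updated README attributes to `paraList.writeText()` can be sketched as below. This is an illustration only, not the unipdf code: the types `word`, `line` and `para` and the helpers `writeLine`, `writePara` and `writeText` are hypothetical stand-ins for `textWord`, `textLine`, `textPara` and the real methods. The sketch also assumes that no space is emitted before the first word of a line and that each line ends with a newline, neither of which the README states.

```go
package main

import (
	"fmt"
	"strings"
)

// word is a hypothetical stand-in for textWord.
type word struct {
	text    string
	newWord bool // true if this fragment starts a whole word
}

// line is a hypothetical stand-in for textLine: words already in reading order.
type line []word

// para is a hypothetical stand-in for textPara.
type para struct {
	lines []line
	cells []para // non-empty only when the para holds a table
}

// writeLine emits the words of one line, inserting a space before each
// word that has newWord set (assumption: except at the start of the line).
func writeLine(b *strings.Builder, l line) {
	for i, w := range l {
		if w.newWord && i > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w.text)
	}
}

// writePara emits table cells first (if any), then the para's own lines.
func writePara(b *strings.Builder, p para) {
	for _, c := range p.cells {
		writePara(b, c)
	}
	for _, l := range p.lines {
		writeLine(b, l)
		b.WriteByte('\n')
	}
}

// writeText walks the paras, which are assumed to be sorted in reading order.
func writeText(paras []para) string {
	var b strings.Builder
	for _, p := range paras {
		writePara(&b, p)
	}
	return b.String()
}

func main() {
	paras := []para{
		{lines: []line{{{"ex", true}, {"tract", false}, {"text", true}}}},
		{cells: []para{
			{lines: []line{{{"cell", true}, {"A", true}}}},
			{lines: []line{{{"cell", true}, {"B", true}}}},
		}},
	}
	fmt.Print(writeText(paras))
}
```

The word fragments `ex` and `tract` are joined without a space because only the first carries the `newWord` flag, which is the role the README gives that flag.
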
diff --git a/extractor/extractor.go b/extractor/extractor.go
@@ -6,6 +6,8 @@
 package extractor
 
 import (
+	"fmt"
+
 	"github.com/unidoc/unipdf/v3/model"
 )
 
@@ -46,7 +48,7 @@ func New(page *model.PdfPage) (*Extractor, error) {
 
 	mediaBox, err := page.GetMediaBox()
 	if err != nil {
-		return nil, err
+		return nil, fmt.Errorf("extractor requires mediaBox. %w", err)
 	}
 	e := &Extractor{
 		contents:    contents,

diff --git a/extractor/text_const.go b/extractor/text_const.go
@@ -11,7 +11,7 @@ const (
 	verboseGeom     = false
 	verbosePage     = false
 	verbosePara     = false
-	verboseParaLine = verbosePara && true
+	verboseParaLine = verbosePara && false
 	verboseParaWord = verboseParaLine && false
 	verboseTable    = false
 )
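
Review note: the `verbose*` flags are chained boolean constants, so each finer-grained flag is gated by its parent as well as its own switch. The sketch below reuses three of the names from `text_const.go` but sets the values differently for demonstration: `verbosePara` is turned on here, whereas it is `false` in the commit.

```go
package main

import "fmt"

const (
	verbosePara     = true                     // turned on for this demonstration only
	verboseParaLine = verbosePara && false     // own switch off, as in the commit
	verboseParaWord = verboseParaLine && true  // own switch on, but parent is off
)

func main() {
	// verboseParaWord stays false because its parent verboseParaLine is false.
	fmt.Println(verbosePara, verboseParaLine, verboseParaWord)
}
```

Because these are constants, the compiler evaluates the expressions at build time, so logging guarded by a disabled flag costs nothing at run time.
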