Improve model quality
VikParuchuri committed Apr 25, 2024
1 parent 1501377 commit cc6a6e4
Showing 5 changed files with 21 additions and 14 deletions.
16 changes: 7 additions & 9 deletions README.md
@@ -1,6 +1,6 @@
# PDFText

Text extraction like [PyMuPDF]((https://github.com/pymupdf/PyMuPDF), but without the AGPL license. PDFText extracts plain text or structured blocks and lines. It's built on [pypdfium2](https://github.com/pypdfium2-team/pypdfium2), so it's [fast, accurate](#benchmarks), and Apache licensed.
Text extraction like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), but without the AGPL license. PDFText extracts plain text or structured blocks and lines. It's built on [pypdfium2](https://github.com/pypdfium2-team/pypdfium2), so it's [fast, accurate](#benchmarks), and Apache licensed.

# Installation

@@ -81,13 +81,11 @@ I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthe

Here are the scores:

+------------+-------------------+-----------------------------------------+
| Library | Time (s per page) | Alignment Score (% accuracy vs pymupdf) |
+------------+-------------------+-----------------------------------------+
| pymupdf | 0.31 | -- |
| pdftext | 1.45 | 95.64 |
| pdfplumber | 2.97 | 89.88 |
+------------+-------------------+-----------------------------------------+
| Library | Time (s per page) | Alignment Score (% accuracy vs pymupdf) |
|------------|-------------------|-----------------------------------------|
| pymupdf | 0.32 | -- |
| pdftext | 1.79 | 96.22 |
| pdfplumber | 3.0 | 89.88 |

pdftext is approximately 2x slower than using pypdfium2 alone (if you were to extract all the same information).
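
As a rough reading aid, the per-page timings in the updated table can be converted into slowdown factors relative to pymupdf. This is only arithmetic on the three numbers reported above; the helper name `slowdown_vs` is hypothetical and not part of the repository.

```python
# Per-page timings (seconds) copied from the updated benchmark table.
seconds_per_page = {"pymupdf": 0.32, "pdftext": 1.79, "pdfplumber": 3.0}


def slowdown_vs(baseline: str, timings: dict) -> dict:
    """Return each library's time as a multiple of the baseline's time."""
    base = timings[baseline]
    return {name: t / base for name, t in timings.items()}


ratios = slowdown_vs("pymupdf", seconds_per_page)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

On these numbers pdftext comes out at roughly 5.6x the per-page time of pymupdf, and pdfplumber at roughly 9.4x.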

@@ -127,6 +125,6 @@ This is built on some amazing open source work, including:

- [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
- [scikit-learn](https://scikit-learn.org/stable/index.html)
- [pypdf2](https://github.com/py-pdf/benchmarks) for very thorough and fair benchmarks
- [pypdf](https://github.com/py-pdf/benchmarks) for very thorough and fair benchmarks

Thank you to the [pymupdf](https://github.com/pymupdf/PyMuPDF) devs for creating such a great library - I just wish it had a simpler license!
2 changes: 1 addition & 1 deletion benchmark.py
@@ -96,7 +96,7 @@ def main():
table_alignments.insert(0, "--")

table = [(tool, time, alignment) for tool, time, alignment in zip(times_tools, table_times, table_alignments)]
table = tabulate.tabulate(table, tablefmt="pretty", headers=headers)
table = tabulate.tabulate(table, tablefmt="github", headers=headers)
print(table)

results = {
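
Switching `tablefmt` from `"pretty"` to `"github"` makes the benchmark script print a pipe-delimited table that can be pasted straight into the README, which matches the table change in the README diff. The sketch below is a dependency-free stand-in that produces the same general layout; it is not tabulate's implementation, and `github_table` is a hypothetical name used only for illustration.

```python
def github_table(headers, rows):
    """Render rows as a pipe-delimited Markdown table (hand-rolled stand-in)."""
    cells = [[str(c) for c in headers]] + [[str(c) for c in row] for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(r[i]) for r in cells) for i in range(len(headers))]

    def fmt(row):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    # Separator row: dashes spanning each column, including the padding spaces.
    separator = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(cells[0]), separator] + [fmt(r) for r in cells[1:]])


headers = ["Library", "Time (s per page)"]
rows = [("pymupdf", 0.32), ("pdftext", 1.79)]
rendered = github_table(headers, rows)
print(rendered)
```

In the real script the single call `tabulate.tabulate(table, tablefmt="github", headers=headers)` does this work.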
Binary file modified models/dt.joblib
Binary file not shown.
15 changes: 12 additions & 3 deletions pdftext/inference.py
@@ -19,7 +19,7 @@ def update_current(current, new_char):
return current


def create_training_row(char_info, prev_char, currblock):
def create_training_row(char_info, prev_char, currblock, currline):
char = char_info["char"]
char_center_x = (char_info["bbox"][2] + char_info["bbox"][0]) / 2
char_center_y = (char_info["bbox"][3] + char_info["bbox"][1]) / 2
@@ -42,10 +42,18 @@ def create_training_row(char_info, prev_char, currblock):
"font_match": font_match,
"x_outer_gap": char_info["bbox"][2] - prev_char["bbox"][0],
"y_outer_gap": char_info["bbox"][3] - prev_char["bbox"][1],
"line_x_center_gap": char_center_x - currline["center_x"],
"line_y_center_gap": char_center_y - currline["center_y"],
"line_x_gap": char_info["bbox"][0] - currline["bbox"][2],
"line_y_gap": char_info["bbox"][1] - currline["bbox"][3],
"line_x_start_gap": char_info["bbox"][0] - currline["bbox"][0],
"line_y_start_gap": char_info["bbox"][1] - currline["bbox"][1],
"block_x_center_gap": char_center_x - currblock["center_x"],
"block_y_center_gap": char_center_y - currblock["center_y"],
"block_x_gap": char_info["bbox"][0] - currblock["bbox"][2],
"block_y_gap": char_info["bbox"][1] - currblock["bbox"][3]
"block_y_gap": char_info["bbox"][1] - currblock["bbox"][3],
"block_x_start_gap": char_info["bbox"][0] - currblock["bbox"][0],
"block_y_start_gap": char_info["bbox"][1] - currblock["bbox"][1]
}

return training_row
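
The new `line_*` features and the added `block_*_start_gap` features in `create_training_row` follow one pattern: the character's offset from the region's center, from its far edge, and from its near edge. A minimal sketch of that pattern, assuming bounding boxes are ordered `[x0, y0, x1, y1]` (consistent with how the centers are computed in the diff) and using a hypothetical helper `region_gap_features` that does not exist in the repository:

```python
from typing import Dict, List

BBox = List[float]  # assumed order: [x0, y0, x1, y1]


def region_gap_features(prefix: str, char_bbox: BBox, region: Dict) -> Dict[str, float]:
    """Offsets of a character from a running region (a line or a block)."""
    char_center_x = (char_bbox[0] + char_bbox[2]) / 2
    char_center_y = (char_bbox[1] + char_bbox[3]) / 2
    return {
        # Distance between the character center and the region center.
        f"{prefix}_x_center_gap": char_center_x - region["center_x"],
        f"{prefix}_y_center_gap": char_center_y - region["center_y"],
        # Distance from the region's far (right/bottom) edge to the character start.
        f"{prefix}_x_gap": char_bbox[0] - region["bbox"][2],
        f"{prefix}_y_gap": char_bbox[1] - region["bbox"][3],
        # Distance from the region's near (left/top) edge to the character start.
        f"{prefix}_x_start_gap": char_bbox[0] - region["bbox"][0],
        f"{prefix}_y_start_gap": char_bbox[1] - region["bbox"][1],
    }


# Illustrative values only: a line so far, and the next character just to its right.
line = {"bbox": [10.0, 100.0, 60.0, 112.0], "center_x": 35.0, "center_y": 106.0}
next_char = [62.0, 100.0, 68.0, 112.0]
features = region_gap_features("line", next_char, line)
```

Here a small positive `line_x_gap` with a zero `line_y_start_gap` describes a character that continues the current line, which is the kind of signal the line-level features appear intended to give the model.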
@@ -80,7 +88,7 @@ def infer_single_page(text_chars):
span = {"chars": []}
for i, char_info in enumerate(text_chars["chars"]):
if prev_char:
training_row = create_training_row(char_info, prev_char, block)
training_row = create_training_row(char_info, prev_char, block, line)
training_row = [v for _, v in sorted(training_row.items())]

prediction = yield training_row
@@ -97,6 +105,7 @@ def infer_single_page(text_chars):
block = update_block(blocks, block)

span["chars"].append(char_info)
line = update_current(line, char_info)
block = update_current(block, char_info)

prev_char = char_info
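
Because `infer_single_page` flattens each row with `sorted(training_row.items())`, a feature's position in the vector depends on the alphabetical order of all the keys. Adding keys therefore changes both the vector length and the positions of some existing features, which is presumably why `models/dt.joblib` is updated in the same commit. A small sketch with made-up values (`to_feature_vector` is a hypothetical name for the inline list comprehension):

```python
def to_feature_vector(row: dict) -> list:
    """Flatten a feature dict into a list ordered by feature name."""
    return [value for _, value in sorted(row.items())]


# Illustrative subsets of the feature dict, not the full set from the diff.
old_row = {"x_outer_gap": 1.0, "block_x_gap": 2.0, "font_match": 1}
new_row = {**old_row, "line_x_gap": 3.0}

old_names = sorted(old_row)
new_names = sorted(new_row)

old_vector = to_feature_vector(old_row)
new_vector = to_feature_vector(new_row)

# "x_outer_gap" shifts one position to the right once "line_x_gap" is added.
old_index = old_names.index("x_outer_gap")
new_index = new_names.index("x_outer_gap")
```

A model fitted on the old ordering would read the wrong column for any feature that sorts after a newly added key, so the model and the feature code need to be kept in step.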
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "pdftext"
version = "0.1.0"
version = "0.1.1"
description = "Extract structured text from pdfs quickly"
authors = ["Vik Paruchuri <vik.paruchuri@gmail.com>"]
license = "Apache-2.0"
