Add github action, speed up inference

VikParuchuri · Apr 25, 2024 · 1501377
1 parent 7359e5e
commit 1501377
Showing 13 changed files with 173 additions and 88 deletions.
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -0,0 +1,27 @@
+name: Python package
+on:
+  push:
+    tags:
+      - "v*.*.*"
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python 3.11
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.11"
+      - name: Install python dependencies
+        run: |
+          pip install poetry
+          poetry install
+      - name: Build package
+        run: |
+          poetry build
+      - name: Publish package
+        env:
+          PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
+        run: |
+          poetry config pypi-token.pypi "$PYPI_TOKEN"
+          poetry publish
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -0,0 +1,25 @@
+name: Integration test
+
+on: [push]
+
+env:
+  TORCH_DEVICE: "cpu"
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python 3.11
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.11"
+      - name: Install python dependencies
+        run: |
+          pip install poetry
+          poetry install
+      - name: Run detection benchmark test
+        run: |
+          poetry run python benchmark.py --max 5 --result_path results
+          poetry run python scripts/verify_benchmark_scores.py results/results.json 
+
diff --git a/README.md b/README.md
@@ -24,6 +24,8 @@ pdftext PDF_PATH --out_path output.txt
 
 ## JSON
 
+This command outputs structured blocks and lines, with bounding boxes and font information for each character.
+
 ```shell
 pdftext PDF_PATH --out_path output.txt --output_type json
 ```
@@ -35,18 +37,18 @@ pdftext PDF_PATH --out_path output.txt --output_type json
 
 The output will be a json list, with each item in the list corresponding to a single page in the input pdf (in order).  Each page will include the following keys:
 
-- `bbox` - the page bbox, in [x1, y1, x2, y2] format
-- `rotation` - how much the page is rotated, in degrees (0, 90, 180, or 270)
+- `bbox` - the page bbox, in `[x1, y1, x2, y2]` format
+- `rotation` - how much the page is rotated, in degrees (`0`, `90`, `180`, or `270`)
 - `page_idx` - the index of the page
 - `blocks` - the blocks that make up the text in the pdf.  Approximately equal to a paragraph.
-  - `bbox` - the block bbox, in [x1, y1, x2, y2] format
+  - `bbox` - the block bbox, in `[x1, y1, x2, y2]` format
   - `lines` - the lines inside the block
-    - `bbox` - the line bbox, in [x1, y1, x2, y2] format
+    - `bbox` - the line bbox, in `[x1, y1, x2, y2]` format
     - `chars` - the individual characters in the line
       - `char` - the actual character, encoded in utf-8
       - `rotation` - how much the character is rotated, in degrees
-      - `bbox` - the character bbox, in [x1, y1, x2, y2] format
-      - `char_idx` - the index of the character on the page (from 0 to number of characters, in original pdf order)
+      - `bbox` - the character bbox, in `[x1, y1, x2, y2]` format
+      - `char_idx` - the index of the character on the page (from `0` to number of characters, in original pdf order)
       - `font` - font info straight from the pdf, see [this pdfium code](https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_text.h)
         - `size` - the size of the font used for the character
         - `weight` - font weight
@@ -75,16 +77,16 @@ If you want more customization, check out the `pdftext.extraction._get_pages` fu
 
 # Benchmarks
 
-I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthedocs.io/en/latest/), [pdfplumber](https://github.com/jsvine/pdfplumber), and pdftext.
+I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthedocs.io/en/latest/), [pdfplumber](https://github.com/jsvine/pdfplumber), and pdftext.  I chose pymupdf because it extracts blocks and lines, and pdfplumber because it extracts words and bboxes.  I did not benchmark pypdf, even though it is a great library, because it doesn't provide individual words/lines and bbox information.
 
 Here are the scores:
 
 +------------+-------------------+-----------------------------------------+
 |  Library   | Time (s per page) | Alignment Score (% accuracy vs pymupdf) |
 +------------+-------------------+-----------------------------------------+
 |  pymupdf   |       0.31        |                   --                    |
-|  pdftext   |       1.55        |                  95.73                  |
-| pdfplumber |       3.39        |                  89.55                  |
+|  pdftext   |       1.45        |                  95.64                  |
+| pdfplumber |       2.97        |                  89.88                  |
 +------------+-------------------+-----------------------------------------+
 
 pdftext is approximately 2x slower than using pypdfium2 alone (if you were to extract all the same information).
@@ -95,9 +97,25 @@ There are additional benchmarks for pypdfium2 and other tools [here](https://git
 
 I used a benchmark set of 200 pdfs extracted from [common crawl](https://huggingface.co/datasets/pixparse/pdfa-eng-wds), then processed by a team at HuggingFace.
 
-For each library, I used a detailed extraction method, to pull out font information, as well as just the words.  This ensured we were comparing similar elements.
+For each library, I used a detailed extraction method that pulls out font information as well as the words.  This ensured each library was doing similar work, so the performance numbers are comparable.
+
+For the alignment score, I extracted the text, then used the rapidfuzz library to find the alignment percentage.  I used the text extracted by pymupdf as the pseudo-ground truth.
+
+## Running benchmarks
+
+You can run the benchmarks yourself.  To do so, first install pdftext manually.  The install assumes you have poetry and Python 3.9+ installed.
+
+```shell
+git clone https://github.com/VikParuchuri/pdftext.git
+cd pdftext
+poetry install
+poetry run python benchmark.py # Will download the benchmark pdfs automatically
+```
+
+The benchmark script has a few options:
 
-For the alignment score, I extracted the text, flattened it by removing all non-newline whitespace, then used the rapidfuzz library to find the alignment percentage.  I used the text extracted by pymupdf as the pseudo-ground truth.
+- `--max` - the maximum number of pdfs to benchmark
+- `--result_path` - a folder to save the results in.  A file called `results.json` will be created in the folder.
 
 # How it works
 

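The JSON layout described in the README hunk above nests pages, blocks, lines, and chars. As a rough illustration of how a consumer might walk that structure, here is a minimal sketch. The `sample` data and the `iter_lines` helper are hypothetical and not part of pdftext; only the key names (`page_idx`, `blocks`, `lines`, `chars`, `char`, `font`, `size`) come from the README.

```python
import json

# Hypothetical one-page sample shaped like the keys listed in the README.
sample = [
    {
        "bbox": [0, 0, 612, 792],
        "rotation": 0,
        "page_idx": 0,
        "blocks": [
            {
                "bbox": [72, 72, 90, 90],
                "lines": [
                    {
                        "bbox": [72, 72, 90, 90],
                        "chars": [
                            {"char": "H", "rotation": 0, "bbox": [72, 72, 80, 90],
                             "char_idx": 0, "font": {"size": 12.0, "weight": 400}},
                            {"char": "i", "rotation": 0, "bbox": [80, 72, 90, 90],
                             "char_idx": 1, "font": {"size": 12.0, "weight": 400}},
                        ],
                    }
                ],
            }
        ],
    }
]


def iter_lines(pages):
    """Yield (page_idx, line_text, font_sizes) for every line in the output."""
    for page in pages:
        for block in page["blocks"]:
            for line in block["lines"]:
                text = "".join(c["char"] for c in line["chars"])
                sizes = {c["font"]["size"] for c in line["chars"]}
                yield page["page_idx"], text, sizes


# Round-trip through JSON to mimic reading the file written with --out_path.
pages = json.loads(json.dumps(sample))
for page_idx, text, sizes in iter_lines(pages):
    print(page_idx, repr(text), sorted(sizes))
```

With real output, `pages` would instead come from `json.load` on the file passed to `--out_path`.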
diff --git a/benchmark.py b/benchmark.py
@@ -1,18 +1,21 @@
 import argparse
 import tempfile
 import time
+from collections import defaultdict
+from functools import partial
 from statistics import mean
 import os
 import json
-import re
 
 import fitz as pymupdf
 import datasets
 import pdfplumber
 from rapidfuzz import fuzz
 import tabulate
+from tqdm import tqdm
 
 from pdftext.extraction import paginated_plain_text_output
+from pdftext.model import get_model
 from pdftext.settings import settings
 
 
@@ -38,18 +41,14 @@ def pdfplumber_inference(pdf_path):
         pages = []
         for i in range(len(pdf.pages)):
             page = pdf.pages[i]
-            text = page.extract_text()
+            words = page.extract_words(use_text_flow=True)
+            text = "".join([word["text"] for word in words])
             pages.append(text)
     return pages
 
 
-def flatten_text(page: str):
-    # Replace all text, except newlines, so we can compare block parsing effectively.
-    return re.sub(r'[ \t\r\f\v]+', '', page)
-
-
 def compare_docs(doc1: str, doc2: str):
-    return fuzz.ratio(flatten_text(doc1), flatten_text(doc2))
+    return fuzz.ratio(doc1, doc2)
 
 
 def main():
@@ -63,58 +62,46 @@ def main():
         split = f"train[:{args.max}]"
     dataset = datasets.load_dataset(settings.BENCH_DATASET_NAME, split=split)
 
-    mu_times = []
-    pdftext_times = []
-    pdfplumber_times = []
-    pdftext_alignment = []
-    pdfplumber_alignment = []
-    for i in range(len(dataset)):
+    times = defaultdict(list)
+    alignments = defaultdict(list)
+    times_tools = ["pymupdf", "pdftext", "pdfplumber"]
+    alignment_tools = ["pdftext", "pdfplumber"]
+    model = get_model()
+    for i in tqdm(range(len(dataset)), desc="Benchmarking"):
         row = dataset[i]
         pdf = row["pdf"]
+        tool_pages = {}
         with tempfile.NamedTemporaryFile(suffix=".pdf") as f:
             f.write(pdf)
             f.seek(0)
             pdf_path = f.name
 
-            start = time.time()
-            mu_pages = pymupdf_inference(pdf_path)
-            mu_times.append(time.time() - start)
-
-
-            start = time.time()
-            pdftext_pages = paginated_plain_text_output(pdf_path)
-            pdftext_times.append(time.time() - start)
+            pdftext_inference = partial(paginated_plain_text_output, model=model)
+            inference_funcs = [pymupdf_inference, pdftext_inference, pdfplumber_inference]
+            for tool, inference_func in zip(times_tools, inference_funcs):
+                start = time.time()
+                pages = inference_func(pdf_path)
+                times[tool].append(time.time() - start)
+                tool_pages[tool] = pages
 
-            start = time.time()
-            pdfplumber_pages = pdfplumber_inference(pdf_path)
-            pdfplumber_times.append(time.time() - start)
-
-            alignments = [compare_docs(mu_page, pdftext_page) for mu_page, pdftext_page in zip(mu_pages, pdftext_pages)]
-            pdftext_alignment.append(mean(alignments))
-
-            alignments = [compare_docs(mu_page, pdfplumber_page) for mu_page, pdfplumber_page in zip(mu_pages, pdfplumber_pages)]
-            pdfplumber_alignment.append(mean(alignments))
+            for tool in alignment_tools:
+                alignments[tool].append(
+                    mean([compare_docs(ref_page, tool_page) for ref_page, tool_page in zip(tool_pages["pymupdf"], tool_pages[tool])])
+                )
 
     print("Benchmark Scores")
     headers = ["Library", "Time (s per page)", "Alignment Score (% accuracy vs pymupdf)"]
-    table = [
-        ["pymupdf", round(mean(mu_times), 2), "--"],
-        ["pdftext", round(mean(pdftext_times), 2), round(mean(pdftext_alignment), 2)],
-        ["pdfplumber", round(mean(pdfplumber_times), 2), round(mean(pdfplumber_alignment), 2)]
-    ]
+    table_times = [round(mean(times[tool]), 2) for tool in times_tools]
+    table_alignments = [round(mean(alignments[tool]), 2) for tool in alignment_tools]
+    table_alignments.insert(0, "--")
+
+    table = list(zip(times_tools, table_times, table_alignments))
     table = tabulate.tabulate(table, tablefmt="pretty", headers=headers)
     print(table)
 
     results = {
-        "times": {
-            "pymupdf": mean(mu_times),
-            "pdftext": mean(pdftext_times),
-            "pdfplumber": mean(pdfplumber_times)
-        },
-        "alignments": {
-            "pdftext": pdftext_alignment,
-            "pdfplumber": pdfplumber_alignment
-        }
+        "times": times,
+        "alignments": alignments
     }
 
     result_path = args.result_path

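The refactored benchmark loop above binds the model once with `partial`, times each tool into a `defaultdict`, and scores every non-reference tool against the reference pages. The pattern can be sketched in isolation as below. This is an illustration only: the two tool functions are hypothetical stand-ins for `pymupdf_inference` and `paginated_plain_text_output`, and `difflib.SequenceMatcher` is swapped in for rapidfuzz's `fuzz.ratio` so the sketch needs only the standard library.

```python
import time
from collections import defaultdict
from difflib import SequenceMatcher
from functools import partial
from statistics import mean


def reference_tool(doc):
    # Hypothetical stand-in for pymupdf_inference: one string per page.
    return list(doc)


def candidate_tool(doc, model=None):
    # Hypothetical stand-in for paginated_plain_text_output; the model is
    # passed in by the caller instead of being reloaded on every call.
    return [page.replace("  ", " ") for page in doc]


def similarity(a, b):
    # difflib stand-in for rapidfuzz's fuzz.ratio, scaled to 0-100.
    return 100 * SequenceMatcher(None, a, b).ratio()


def run_benchmark(docs, model="fake-model"):
    tools = {
        "reference": reference_tool,
        "candidate": partial(candidate_tool, model=model),  # bind the model once
    }
    times = defaultdict(list)
    alignments = defaultdict(list)
    for doc in docs:
        pages = {}
        for name, func in tools.items():
            start = time.perf_counter()
            pages[name] = func(doc)
            times[name].append(time.perf_counter() - start)
        for name in tools:
            if name == "reference":
                continue
            # Compare page by page against the reference output.
            scores = [similarity(ref, out)
                      for ref, out in zip(pages["reference"], pages[name])]
            alignments[name].append(mean(scores))
    return times, alignments


docs = [["hello world", "second page"], ["a  b"]]
times, alignments = run_benchmark(docs)
print({name: round(mean(vals), 2) for name, vals in alignments.items()})
```

Keying the results by tool name means adding another library to the comparison is one more entry in `tools`, rather than another set of parallel lists.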
diff --git a/models/dt.joblib b/models/dt.joblib
diff --git a/pdftext/extraction.py b/pdftext/extraction.py
@@ -7,28 +7,29 @@
 from pdftext.postprocessing import merge_text, sort_blocks, postprocess_text
 
 
-def _get_pages(pdf_path):
-    model = get_model()
+def _get_pages(pdf_path, model=None):
+    if model is None:
+        model = get_model()
     text_chars = get_pdfium_chars(pdf_path)
     pages = inference(text_chars, model)
     return pages
 
 
-def plain_text_output(pdf_path, sort=False) -> str:
-    text = paginated_plain_text_output(pdf_path, sort=sort)
+def plain_text_output(pdf_path, sort=False, model=None) -> str:
+    text = paginated_plain_text_output(pdf_path, sort=sort, model=model)
     return "\n".join(text)
 
 
-def paginated_plain_text_output(pdf_path, sort=False) -> List[str]:
-    pages = _get_pages(pdf_path)
+def paginated_plain_text_output(pdf_path, sort=False, model=None) -> List[str]:
+    pages = _get_pages(pdf_path, model)
     text = []
     for page in pages:
         text.append(merge_text(page, sort=sort).strip())
     return text
 
 
-def dictionary_output(pdf_path, sort=False):
-    pages = _get_pages(pdf_path)
+def dictionary_output(pdf_path, sort=False, model=None):
+    pages = _get_pages(pdf_path, model)
     for page in pages:
         for block in page["blocks"]:
             bad_keys = [key for key in block.keys() if key not in ["lines", "bbox"]]

diff --git a/pdftext/inference.py b/pdftext/inference.py
@@ -19,7 +19,7 @@ def update_current(current, new_char):
     return current
 
 
-def create_training_row(char_info, prev_char, currblock, avg_x_gap, avg_y_gap):
+def create_training_row(char_info, prev_char, currblock):
     char = char_info["char"]
     char_center_x = (char_info["bbox"][2] + char_info["bbox"][0]) / 2
     char_center_y = (char_info["bbox"][3] + char_info["bbox"][1]) / 2
@@ -42,8 +42,6 @@ def create_training_row(char_info, prev_char, currblock, avg_x_gap, avg_y_gap):
         "font_match": font_match,
         "x_outer_gap": char_info["bbox"][2] - prev_char["bbox"][0],
         "y_outer_gap": char_info["bbox"][3] - prev_char["bbox"][1],
-        "x_gap_ratio": x_gap / avg_x_gap if avg_x_gap > 0 else 0,
-        "y_gap_ratio": y_gap / avg_y_gap if avg_y_gap > 0 else 0,
         "block_x_center_gap": char_center_x - currblock["center_x"],
         "block_y_center_gap": char_center_y - currblock["center_y"],
         "block_x_gap": char_info["bbox"][0] - currblock["bbox"][2],
@@ -82,7 +80,7 @@ def infer_single_page(text_chars):
     span = {"chars": []}
     for i, char_info in enumerate(text_chars["chars"]):
         if prev_char:
-            training_row = create_training_row(char_info, prev_char, block, text_chars["avg_x_gap"], text_chars["avg_y_gap"])
+            training_row = create_training_row(char_info, prev_char, block)
             training_row = [v for _, v in sorted(training_row.items())]
 
             prediction = yield training_row

diff --git a/pdftext/pdf/chars.py b/pdftext/pdf/chars.py
@@ -15,6 +15,10 @@ def update_previous_fonts(text_chars: Dict, i: int, fontname: str, fontflags: in
     for j in range(min_update, i): # Goes from min_update to i - 1
         if regather_font_info:
             fontname, fontflags = get_fontname(text_page, j)
+
+        # If we hit the region with the previous fontname, we can bail out
+        if fontname == prev_fontname:
+            break
         text_chars["chars"][j]["font"]["name"] = fontname
         text_chars["chars"][j]["font"]["flags"] = fontflags
 
@@ -38,17 +42,14 @@ def get_pdfium_chars(pdf_path, fontname_sample_freq=settings.FONTNAME_SAMPLE_FRE
             "bbox": pdfium_page_bbox_to_device_bbox(page, bbox, page_width, page_height)
         }
 
-        prev_bbox = None
         fontname = None
         fontflags = None
-        x_gaps = decimal.Decimal(0)
-        y_gaps = decimal.Decimal(0)
         total_chars = text_page.count_chars()
         for i in range(total_chars):
             char = pdfium_c.FPDFText_GetUnicode(text_page, i)
             char = chr(char)
-            fontsize = pdfium_c.FPDFText_GetFontSize(text_page, i)
-            fontweight = pdfium_c.FPDFText_GetFontWeight(text_page, i)
+            fontsize = round(pdfium_c.FPDFText_GetFontSize(text_page, i), 1)
+            fontweight = round(pdfium_c.FPDFText_GetFontWeight(text_page, i), 1)
             if fontname is None or i % fontname_sample_freq == 0:
                 prev_fontname = fontname
                 fontname, fontflags = get_fontname(text_page, i)
@@ -73,13 +74,6 @@ def get_pdfium_chars(pdf_path, fontname_sample_freq=settings.FONTNAME_SAMPLE_FRE
             }
             text_chars["chars"].append(char_info)
 
-            if prev_bbox:
-                x_gaps += decimal.Decimal(device_coords[0] - prev_bbox[2])
-                y_gaps += decimal.Decimal(device_coords[1] - prev_bbox[3])
-            prev_bbox = device_coords
-
-        text_chars["avg_x_gap"] = float(x_gaps / total_chars) if total_chars > 0 else 0
-        text_chars["avg_y_gap"] = float(y_gaps / total_chars) if total_chars > 0 else 0
         text_chars["total_chars"] = total_chars
         blocks.append(text_chars)
     return blocks