Add github action, speed up inference
VikParuchuri committed Apr 25, 2024
1 parent 7359e5e commit 1501377
Showing 13 changed files with 173 additions and 88 deletions.
27 changes: 27 additions & 0 deletions .github/workflows/publish.yml
@@ -0,0 +1,27 @@
+name: Python package
+on:
+  push:
+    tags:
+      - "v*.*.*"
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python 3.11
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.11
+      - name: Install python dependencies
+        run: |
+          pip install poetry
+          poetry install
+      - name: Build package
+        run: |
+          poetry build
+      - name: Publish package
+        env:
+          PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
+        run: |
+          poetry config pypi-token.pypi "$PYPI_TOKEN"
+          poetry publish
25 changes: 25 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,25 @@
+name: Integration test
+
+on: [push]
+
+env:
+  TORCH_DEVICE: "cpu"
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python 3.11
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.11
+      - name: Install python dependencies
+        run: |
+          pip install poetry
+          poetry install
+      - name: Run detection benchmark test
+        run: |
+          poetry run python benchmark.py --max 5 --result_path results
+          poetry run python scripts/verify_benchmark_scores.py results/results.json
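
Note on the last step: `scripts/verify_benchmark_scores.py` is not included in this view, so its contents are unknown here. As a rough sketch only, a check of this kind could read the results and fail the job when the mean alignment is too low. The function name `verify_scores` and the threshold of 90 are assumptions made for illustration, not values from the repository; the layout of the results (an `alignments` key mapping each tool to a list of per-document scores) follows the `results` dict that `benchmark.py` writes in this commit.

```python
import json
from statistics import mean


def verify_scores(results: dict, min_alignment: float = 90.0) -> float:
    """Return the mean pdftext alignment, raising if it is below the threshold.

    `min_alignment` is a made-up threshold used only for this sketch.
    """
    scores = results["alignments"]["pdftext"]
    avg = mean(scores)
    if avg < min_alignment:
        raise ValueError(f"pdftext alignment {avg:.2f} is below {min_alignment}")
    return avg


# Stand-in for the parsed contents of results/results.json
sample = json.loads(
    '{"times": {"pdftext": [1.4, 1.5]}, "alignments": {"pdftext": [95.0, 96.0]}}'
)
average = verify_scores(sample)
```

Raising an exception gives the process a non-zero exit code, which is what makes a CI step fail.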
40 changes: 29 additions & 11 deletions README.md
@@ -24,6 +24,8 @@ pdftext PDF_PATH --out_path output.txt

 ## JSON

+This command outputs structured blocks and lines with font and other information.
+
 ```shell
 pdftext PDF_PATH --out_path output.txt --output_type json
 ```
@@ -35,18 +37,18 @@ pdftext PDF_PATH --out_path output.txt --output_type json

 The output will be a json list, with each item in the list corresponding to a single page in the input pdf (in order). Each page will include the following keys:

-- `bbox` - the page bbox, in [x1, y1, x2, y2] format
-- `rotation` - how much the page is rotated, in degrees (0, 90, 180, or 270)
+- `bbox` - the page bbox, in `[x1, y1, x2, y2]` format
+- `rotation` - how much the page is rotated, in degrees (`0`, `90`, `180`, or `270`)
 - `page_idx` - the index of the page
 - `blocks` - the blocks that make up the text in the pdf. Approximately equal to a paragraph.
-  - `bbox` - the block bbox, in [x1, y1, x2, y2] format
+  - `bbox` - the block bbox, in `[x1, y1, x2, y2]` format
   - `lines` - the lines inside the block
-    - `bbox` - the line bbox, in [x1, y1, x2, y2] format
+    - `bbox` - the line bbox, in `[x1, y1, x2, y2]` format
     - `chars` - the individual characters in the line
       - `char` - the actual character, encoded in utf-8
       - `rotation` - how much the character is rotated, in degrees
-      - `bbox` - the character bbox, in [x1, y1, x2, y2] format
-      - `char_idx` - the index of the character on the page (from 0 to number of characters, in original pdf order)
+      - `bbox` - the character bbox, in `[x1, y1, x2, y2]` format
+      - `char_idx` - the index of the character on the page (from `0` to number of characters, in original pdf order)
       - `font` this is font info straight from the pdf, see [this pdfium code](https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_text.h)
         - `size` - the size of the font used for the character
         - `weight` - font weight
@@ -75,16 +77,16 @@ If you want more customization, check out the `pdftext.extraction._get_pages` fu

 # Benchmarks

-I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthedocs.io/en/latest/), [pdfplumber](https://github.com/jsvine/pdfplumber), and pdftext.
+I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthedocs.io/en/latest/), [pdfplumber](https://github.com/jsvine/pdfplumber), and pdftext. I chose pymupdf because it extracts blocks and lines. Pdfplumber extracts words and bboxes. I did not benchmark pypdf, even though it is a great library, because it doesn't provide individual words/lines and bbox information.

 Here are the scores:

 +------------+-------------------+-----------------------------------------+
 | Library    | Time (s per page) | Alignment Score (% accuracy vs pymupdf) |
 +------------+-------------------+-----------------------------------------+
 | pymupdf    | 0.31              | --                                      |
-| pdftext    | 1.55              | 95.73                                   |
-| pdfplumber | 3.39              | 89.55                                   |
+| pdftext    | 1.45              | 95.64                                   |
+| pdfplumber | 2.97              | 89.88                                   |
 +------------+-------------------+-----------------------------------------+

 pdftext is approximately 2x slower than using pypdfium2 alone (if you were to extract all the same information).
@@ -95,9 +97,25 @@ There are additional benchmarks for pypdfium2 and other tools [here](https://git

 I used a benchmark set of 200 pdfs extracted from [common crawl](https://huggingface.co/datasets/pixparse/pdfa-eng-wds), then processed by a team at HuggingFace.

-For each library, I used a detailed extraction method, to pull out font information, as well as just the words. This ensured we were comparing similar elements.
+For each library, I used a detailed extraction method, to pull out font information, as well as just the words. This ensured we were comparing similar performance numbers.
+
+For the alignment score, I extracted the text, then used the rapidfuzz library to find the alignment percentage. I used the text extracted by pymupdf as the pseudo-ground truth.
+
+## Running benchmarks
+
+You can run the benchmarks yourself. To do so, you have to first install pdftext manually. The install assumes you have poetry and Python 3.9+ installed.
+
+```shell
+git clone https://github.com/VikParuchuri/pdftext.git
+cd pdftext
+poetry install
+python benchmark.py # Will download the benchmark pdfs automatically
+```
+
+The benchmark script has a few options:

-For the alignment score, I extracted the text, flattened it by removing all non-newline whitespace, then used the rapidfuzz library to find the alignment percentage. I used the text extracted by pymupdf as the pseudo-ground truth.
+- `--max` this controls the maximum number of pdfs to benchmark
+- `--result_path` a folder to save the results. A file called `results.json` will be created in the folder.

 # How it works

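
Note on the alignment score described in the README hunks above: the score compares each tool's page text against the pymupdf text and averages over pages. The sketch below is a pure-Python stand-in, not rapidfuzz itself. It assumes that `fuzz.ratio` behaves like a normalized insertion/deletion similarity scaled to 0-100, which can be written as `100 * 2 * LCS / (len(a) + len(b))`. The helper names `lcs_length`, `indel_ratio`, and `doc_alignment` are hypothetical, and treating two empty strings as a perfect match is this sketch's own convention.

```python
from statistics import mean


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100]: 100 * 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # convention chosen for this sketch
    return 100.0 * 2 * lcs_length(a, b) / total


def doc_alignment(reference_pages, candidate_pages) -> float:
    """Mean per-page similarity against the pseudo-ground-truth pages."""
    return mean(indel_ratio(r, c) for r, c in zip(reference_pages, candidate_pages))


score = doc_alignment(["abc", "abcd"], ["abc", "abxd"])  # pages score 100.0 and 75.0
```

Because `compare_docs` in this commit passes the raw strings to `fuzz.ratio`, whitespace differences between tools now count against the score; under the formula above, `"a b"` against `"ab"` gives 80.0 rather than 100.0.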
77 changes: 32 additions & 45 deletions benchmark.py
@@ -1,18 +1,21 @@
 import argparse
 import tempfile
 import time
+from collections import defaultdict
+from functools import partial
 from statistics import mean
 import os
 import json
-import re

 import fitz as pymupdf
 import datasets
 import pdfplumber
 from rapidfuzz import fuzz
 import tabulate
+from tqdm import tqdm

 from pdftext.extraction import paginated_plain_text_output
+from pdftext.model import get_model
 from pdftext.settings import settings


@@ -38,18 +41,14 @@ def pdfplumber_inference(pdf_path):
         pages = []
         for i in range(len(pdf.pages)):
             page = pdf.pages[i]
-            text = page.extract_text()
+            words = page.extract_words(use_text_flow=True)
+            text = "".join([word["text"] for word in words])
             pages.append(text)
     return pages


-def flatten_text(page: str):
-    # Replace all text, except newlines, so we can compare block parsing effectively.
-    return re.sub(r'[ \t\r\f\v]+', '', page)
-
-
 def compare_docs(doc1: str, doc2: str):
-    return fuzz.ratio(flatten_text(doc1), flatten_text(doc2))
+    return fuzz.ratio(doc1, doc2)


 def main():
@@ -63,58 +62,46 @@ def main():
     split = f"train[:{args.max}]"
     dataset = datasets.load_dataset(settings.BENCH_DATASET_NAME, split=split)

-    mu_times = []
-    pdftext_times = []
-    pdfplumber_times = []
-    pdftext_alignment = []
-    pdfplumber_alignment = []
-    for i in range(len(dataset)):
+    times = defaultdict(list)
+    alignments = defaultdict(list)
+    times_tools = ["pymupdf", "pdftext", "pdfplumber"]
+    alignment_tools = ["pdftext", "pdfplumber"]
+    model = get_model()
+    for i in tqdm(range(len(dataset)), desc="Benchmarking"):
         row = dataset[i]
         pdf = row["pdf"]
+        tool_pages = {}
         with tempfile.NamedTemporaryFile(suffix=".pdf") as f:
             f.write(pdf)
             f.seek(0)
             pdf_path = f.name

-            start = time.time()
-            mu_pages = pymupdf_inference(pdf_path)
-            mu_times.append(time.time() - start)
-
-
-            start = time.time()
-            pdftext_pages = paginated_plain_text_output(pdf_path)
-            pdftext_times.append(time.time() - start)
+            pdftext_inference = partial(paginated_plain_text_output, model=model)
+            inference_funcs = [pymupdf_inference, pdftext_inference, pdfplumber_inference]
+            for tool, inference_func in zip(times_tools, inference_funcs):
+                start = time.time()
+                pages = inference_func(pdf_path)
+                times[tool].append(time.time() - start)
+                tool_pages[tool] = pages

-            start = time.time()
-            pdfplumber_pages = pdfplumber_inference(pdf_path)
-            pdfplumber_times.append(time.time() - start)
-
-            alignments = [compare_docs(mu_page, pdftext_page) for mu_page, pdftext_page in zip(mu_pages, pdftext_pages)]
-            pdftext_alignment.append(mean(alignments))
-
-            alignments = [compare_docs(mu_page, pdfplumber_page) for mu_page, pdfplumber_page in zip(mu_pages, pdfplumber_pages)]
-            pdfplumber_alignment.append(mean(alignments))
+        for tool in alignment_tools:
+            alignments[tool].append(
+                mean([compare_docs(tool_pages["pymupdf"][i], tool_pages[tool][i]) for i in range(len(tool_pages["pymupdf"]))])
+            )

     print("Benchmark Scores")
     headers = ["Library", "Time (s per page)", "Alignment Score (% accuracy vs pymupdf)"]
-    table = [
-        ["pymupdf", round(mean(mu_times), 2), "--"],
-        ["pdftext", round(mean(pdftext_times), 2), round(mean(pdftext_alignment), 2)],
-        ["pdfplumber", round(mean(pdfplumber_times), 2), round(mean(pdfplumber_alignment), 2)]
-    ]
+    table_times = [round(mean(times[tool]), 2) for tool in times_tools]
+    table_alignments = [round(mean(alignments[tool]), 2) for tool in alignment_tools]
+    table_alignments.insert(0, "--")
+
+    table = [(tool, time, alignment) for tool, time, alignment in zip(times_tools, table_times, table_alignments)]
     table = tabulate.tabulate(table, tablefmt="pretty", headers=headers)
     print(table)

     results = {
-        "times": {
-            "pymupdf": mean(mu_times),
-            "pdftext": mean(pdftext_times),
-            "pdfplumber": mean(pdfplumber_times)
-        },
-        "alignments": {
-            "pdftext": pdftext_alignment,
-            "pdfplumber": pdfplumber_alignment
-        }
+        "times": times,
+        "alignments": alignments
     }

     result_path = args.result_path
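
Note on the `main()` refactor above: the per-tool variables are replaced by dictionaries keyed by tool name, and one loop times every inference callable. A minimal self-contained sketch of that pattern follows. `fake_extract` and the tool names are hypothetical stand-ins so the snippet runs without the PDF libraries, and `time.perf_counter()` is swapped in for the `time.time()` calls used in the commit.

```python
import time
from collections import defaultdict
from functools import partial
from statistics import mean


def fake_extract(pdf_path: str, suffix: str = "", model=None) -> list:
    """Hypothetical stand-in for an extraction function; returns one 'page'."""
    return [f"{pdf_path}{suffix}"]


def run_benchmark(pdf_paths, inference_funcs):
    """Time each tool on each document, collecting results per tool name."""
    times = defaultdict(list)
    pages_by_tool = defaultdict(list)
    for pdf_path in pdf_paths:
        for tool, func in inference_funcs.items():
            start = time.perf_counter()  # the commit uses time.time()
            pages = func(pdf_path)
            times[tool].append(time.perf_counter() - start)
            pages_by_tool[tool].append(pages)
    return times, pages_by_tool


tools = {
    "tool_a": fake_extract,
    # partial() pins extra arguments, as the commit does with model=model
    "tool_b": partial(fake_extract, suffix="!", model="preloaded-model"),
}
times, pages_by_tool = run_benchmark(["a.pdf", "b.pdf"], tools)
summary = {tool: round(mean(vals), 2) for tool, vals in times.items()}
```

Adding another tool then means adding one dictionary entry rather than a new set of variables and a new timing block.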
Binary file modified models/dt.joblib
Binary file not shown.
17 changes: 9 additions & 8 deletions pdftext/extraction.py
@@ -7,28 +7,29 @@
 from pdftext.postprocessing import merge_text, sort_blocks, postprocess_text


-def _get_pages(pdf_path):
-    model = get_model()
+def _get_pages(pdf_path, model=None):
+    if model is None:
+        model = get_model()
     text_chars = get_pdfium_chars(pdf_path)
     pages = inference(text_chars, model)
     return pages


-def plain_text_output(pdf_path, sort=False) -> str:
-    text = paginated_plain_text_output(pdf_path, sort=sort)
+def plain_text_output(pdf_path, sort=False, model=None) -> str:
+    text = paginated_plain_text_output(pdf_path, sort=sort, model=model)
     return "\n".join(text)


-def paginated_plain_text_output(pdf_path, sort=False) -> List[str]:
-    pages = _get_pages(pdf_path)
+def paginated_plain_text_output(pdf_path, sort=False, model=None) -> List[str]:
+    pages = _get_pages(pdf_path, model)
     text = []
     for page in pages:
         text.append(merge_text(page, sort=sort).strip())
     return text


-def dictionary_output(pdf_path, sort=False):
-    pages = _get_pages(pdf_path)
+def dictionary_output(pdf_path, sort=False, model=None):
+    pages = _get_pages(pdf_path, model)
     for page in pages:
         for block in page["blocks"]:
             bad_keys = [key for key in block.keys() if key not in ["lines", "bbox"]]
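
Note on the `model=None` parameter above: it lets a caller load the model once and reuse it across many documents, while callers that pass nothing still get a model loaded for them. The sketch below shows the effect with stubs. It assumes the real `get_model` does work on every call, which this view does not confirm; the stub `get_model`, the `get_pages` helper, and the load counter are all hypothetical.

```python
load_count = 0


def get_model():
    """Hypothetical stand-in for an expensive model load."""
    global load_count
    load_count += 1
    return {"name": "stub-model"}


def get_pages(pdf_path, model=None):
    # Load lazily only when the caller did not supply a model
    if model is None:
        model = get_model()
    return [f"{pdf_path}: page text via {model['name']}"]


paths = ["a.pdf", "b.pdf", "c.pdf"]

# Without reuse: one load per call
for path in paths:
    get_pages(path)
loads_without_reuse = load_count

# With reuse: load once, then pass the same object to every call
load_count = 0
shared = get_model()
for path in paths:
    get_pages(path, model=shared)
loads_with_reuse = load_count
```

In `benchmark.py` the same idea appears as `model = get_model()` before the loop, followed by `partial(paginated_plain_text_output, model=model)`.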
6 changes: 2 additions & 4 deletions pdftext/inference.py
@@ -19,7 +19,7 @@ def update_current(current, new_char):
     return current


-def create_training_row(char_info, prev_char, currblock, avg_x_gap, avg_y_gap):
+def create_training_row(char_info, prev_char, currblock):
     char = char_info["char"]
     char_center_x = (char_info["bbox"][2] + char_info["bbox"][0]) / 2
     char_center_y = (char_info["bbox"][3] + char_info["bbox"][1]) / 2
@@ -42,8 +42,6 @@ def create_training_row(char_info, prev_char, currblock, avg_x_gap, avg_y_gap):
         "font_match": font_match,
         "x_outer_gap": char_info["bbox"][2] - prev_char["bbox"][0],
         "y_outer_gap": char_info["bbox"][3] - prev_char["bbox"][1],
-        "x_gap_ratio": x_gap / avg_x_gap if avg_x_gap > 0 else 0,
-        "y_gap_ratio": y_gap / avg_y_gap if avg_y_gap > 0 else 0,
         "block_x_center_gap": char_center_x - currblock["center_x"],
         "block_y_center_gap": char_center_y - currblock["center_y"],
         "block_x_gap": char_info["bbox"][0] - currblock["bbox"][2],
@@ -82,7 +80,7 @@ def infer_single_page(text_chars):
     span = {"chars": []}
     for i, char_info in enumerate(text_chars["chars"]):
         if prev_char:
-            training_row = create_training_row(char_info, prev_char, block, text_chars["avg_x_gap"], text_chars["avg_y_gap"])
+            training_row = create_training_row(char_info, prev_char, block)
             training_row = [v for _, v in sorted(training_row.items())]

             prediction = yield training_row
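
Note on the feature rows above: `infer_single_page` turns each row dict into a list ordered by key name, so the set of keys fixes both the length and the column order of the model input. The sketch below shows how dropping `x_gap_ratio` and `y_gap_ratio` changes that vector, which is presumably why `models/dt.joblib` is modified in the same commit. The helper `to_feature_vector` and the sample values are made up for illustration.

```python
def to_feature_vector(row: dict) -> list:
    """Order values by feature name so the column order is stable."""
    return [value for _, value in sorted(row.items())]


# Illustrative subset of features with made-up values
old_row = {"font_match": 1, "x_gap_ratio": 0.5, "y_gap_ratio": 0.0, "x_outer_gap": 4.0}
new_row = {k: v for k, v in old_row.items() if k not in ("x_gap_ratio", "y_gap_ratio")}

old_vec = to_feature_vector(old_row)  # columns: font_match, x_gap_ratio, x_outer_gap, y_gap_ratio
new_vec = to_feature_vector(new_row)  # columns: font_match, x_outer_gap
```

A model fitted on the four-column layout would receive misaligned inputs from the two-column layout, so the feature set and the saved model need to change together.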
18 changes: 6 additions & 12 deletions pdftext/pdf/chars.py
@@ -15,6 +15,10 @@ def update_previous_fonts(text_chars: Dict, i: int, fontname: str, fontflags: in
     for j in range(min_update, i): # Goes from min_update to i - 1
         if regather_font_info:
             fontname, fontflags = get_fontname(text_page, j)
+
+            # If we hit the region with the previous fontname, we can bail out
+            if fontname == prev_fontname:
+                break
         text_chars["chars"][j]["font"]["name"] = fontname
         text_chars["chars"][j]["font"]["flags"] = fontflags

@@ -38,17 +42,14 @@ def get_pdfium_chars(pdf_path, fontname_sample_freq=settings.FONTNAME_SAMPLE_FRE
             "bbox": pdfium_page_bbox_to_device_bbox(page, bbox, page_width, page_height)
         }

-        prev_bbox = None
         fontname = None
         fontflags = None
-        x_gaps = decimal.Decimal(0)
-        y_gaps = decimal.Decimal(0)
         total_chars = text_page.count_chars()
         for i in range(total_chars):
             char = pdfium_c.FPDFText_GetUnicode(text_page, i)
             char = chr(char)
-            fontsize = pdfium_c.FPDFText_GetFontSize(text_page, i)
-            fontweight = pdfium_c.FPDFText_GetFontWeight(text_page, i)
+            fontsize = round(pdfium_c.FPDFText_GetFontSize(text_page, i), 1)
+            fontweight = round(pdfium_c.FPDFText_GetFontWeight(text_page, i), 1)
             if fontname is None or i % fontname_sample_freq == 0:
                 prev_fontname = fontname
                 fontname, fontflags = get_fontname(text_page, i)
@@ -73,13 +74,6 @@ def get_pdfium_chars(pdf_path, fontname_sample_freq=settings.FONTNAME_SAMPLE_FRE
             }
             text_chars["chars"].append(char_info)

-            if prev_bbox:
-                x_gaps += decimal.Decimal(device_coords[0] - prev_bbox[2])
-                y_gaps += decimal.Decimal(device_coords[1] - prev_bbox[3])
-            prev_bbox = device_coords
-
-        text_chars["avg_x_gap"] = float(x_gaps / total_chars) if total_chars > 0 else 0
-        text_chars["avg_y_gap"] = float(y_gaps / total_chars) if total_chars > 0 else 0
         text_chars["total_chars"] = total_chars
         blocks.append(text_chars)
     return blocks
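
Note on font sampling: the hunks above look up the font name only every `fontname_sample_freq` characters and correct earlier characters when the font changes, with a new early exit once the previous font's region is reached. The sketch below illustrates that general idea; it is not the repository's `update_previous_fonts`, whose full body is not visible here. In particular, this sketch walks backward from the sample point, while the loop in the hunk runs forward over `range(min_update, i)`. The function `assign_fonts` and its lookup counter are hypothetical, and the approach assumes at most one font change between two samples.

```python
def assign_fonts(true_fonts, sample_freq):
    """Assign a font to every character while sampling the expensive lookup.

    `true_fonts[i]` plays the role of a costly per-character font query.
    Returns (assigned_fonts, number_of_lookups).
    """
    lookups = 0

    def lookup(i):
        nonlocal lookups
        lookups += 1
        return true_fonts[i]

    assigned = []
    font = None
    for i in range(len(true_fonts)):
        if font is None or i % sample_freq == 0:
            prev_font = font
            font = lookup(i)
            if prev_font is not None and font != prev_font:
                # The font changed somewhere in the last window: walk backward,
                # re-check each character, and stop at the previous font's region.
                for j in range(i - 1, max(i - sample_freq, 0) - 1, -1):
                    actual = lookup(j)
                    if actual == prev_font:
                        break
                    assigned[j] = actual
        assigned.append(font)
    return assigned, lookups


fonts = ["A"] * 10 + ["B"] * 10
assigned, lookups = assign_fonts(fonts, sample_freq=4)
```

With 20 characters and a sample frequency of 4, this recovers every font using 8 lookups instead of 20; the early `break` is what keeps the correction pass from re-checking the whole window.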