PDF Read/Merge generates broken documents since update 2.10.6 #1344

Merinorus · 2022-09-12T14:38:44Z

Hello,

Since update 2.10.6, some PDF documents are not merged correctly. Same with version 2.10.7.
Previous versions (2.10.5 and below) behave correctly.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-Ubuntu-20.04-focal
Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-debian-11.2

$ python3 -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.6

Code + PDF

This is a minimal, complete example that shows the issue:

# requirements.txt
diff-pdf-visually==1.7.0
PyPDF2==2.10.6
pytest==7.1.3

# test_same_pdf.py
import io
import shutil
import tempfile
from PyPDF2 import PdfMerger, PdfReader
from diff_pdf_visually import pdf_similar
import os
import pytest
import logging


logger = logging.getLogger(__name__)

FILE_INPUT_URI = "arret_maladie.pdf"
FILE_OUTPUT_URI = "output.pdf"


def file_as_bytesio(filepath: str):
    """Open a file as BytesIO, read only."""
    with open(filepath, "rb") as f:
        return io.BytesIO(f.read())


def test_pdf_merger():

    # Open document and merge it into a temporary file
    merger = PdfMerger()

    merger.append(PdfReader(file_as_bytesio(FILE_INPUT_URI)))

    # Write the final merged document
    temp_file = tempfile.NamedTemporaryFile()
    merger.write(temp_file.name)

    temp_file_path = temp_file.name

    # Compare VISUALLY the content of the newly generated file with the expected content
    if not pdf_similar(temp_file_path, FILE_INPUT_URI):
        # If files don't match visually, the test fails.
        # Copy the newly generated file to the current directory to manually check what is wrong
        current_dir = os.path.dirname(__file__)
        new_file_path = os.path.join(current_dir, FILE_OUTPUT_URI)
        shutil.copy2(temp_file.name, new_file_path)
        logger.error(f"The newly merged file does not match with the intput file.")
        # Fail the test on purpose
        assert False

Here is the PDF that caused the issue:
input.pdf

Here is the output (simple PdfReader -> PdfMerger):
output.pdf

Traceback

This is the complete Traceback I see:

Converting each page of the PDFs to an image...
  PDFs have same number of pages. Checking each pair of converted images...
Min sig = 13.4442, significant?=True. The PDFs are different. The most different pages are: page 1 (sgf. 13.4442), page 2 (sgf. 13.6684), page 3 (sgf. 13.9239).
ERROR    test_same_pdf:test_same_pdf.py:44 The newly merged file does not match with the intput file.

Thank you for taking the time to investigate.
The document is a French official form, I guess it's fine for using it in automated tests, but not sure.

The text was updated successfully, but these errors were encountered:

pubpub-zz · 2022-09-12T17:27:21Z

Thanks to locating the issue,
Under analysis

pubpub-zz · 2022-09-12T21:24:24Z

The PDF has some fonts where the name includes some spaces (encoded with #20). The decoding takes that properly but not the writing leaving some invalides spaces leading to invalid fields.
Some cases with UTF-8 has been identified (typically for some chinese name fonts) : To anyone some PDF would be welcomed

else fix in progress

closes py-pdf#1344

pubpub-zz · 2022-09-13T06:50:52Z

@Merinorus
Have a look at the PR, If you want to test the mod on more changes

Merinorus · 2022-09-13T09:04:39Z

Hi @pubpub-zz,
I just tested your PR with more files, working perfectly.
That was fast! Thank you very much!

Three kinds of changes were made in this PR 1) _cmap.py : the str is coming from `/Encoding` which stores a NameObject : The conversion is already performed; no need to force it. 2) _page.py : Replaced obsolete call in _debug_for_extract() 3) _base.py : 3.1) unnumber : all `#xx` should be performed prior to conversion to str (using utf-8) to allow multi language text 3.2) read_from_stream : if utf-8 (normally the only one required) or gbk (kept to prevent regression) we will use charmap to get some sequence of chars 3.3) renumber : added to recode in #xx sequence. renumber will also be compatible with utf-8 chars Closes #1344

MartinThoma · 2022-09-14T04:02:36Z

The fixing PR was just merged. I will make a release to PyPI in a few minutes.

MartinThoma · 2022-09-14T04:03:35Z

@Merinorus Thank you for reporting the issue! We value issue reports and feedback if the fixes worked. If you want, I would add you as a contributor: https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

Merinorus · 2022-09-14T08:18:38Z

@MartinThoma That's very kind! I accept your offer. ;)

Merinorus changed the title ~~PDF Merging is broken since update 2.10.6~~ PDF Read/Merge generates broken documents since update 2.10.6 Sep 12, 2022

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 13, 2022

ROB : improve nameobject reading/writing

6253a01

closes py-pdf#1344

pubpub-zz mentioned this issue Sep 13, 2022

ROB : improve nameobject reading/writing #1345

Merged

MartinThoma closed this as completed in #1345 Sep 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PDF Read/Merge generates broken documents since update 2.10.6 #1344

PDF Read/Merge generates broken documents since update 2.10.6 #1344

Merinorus commented Sep 12, 2022 •

edited

Loading

pubpub-zz commented Sep 12, 2022

pubpub-zz commented Sep 12, 2022

pubpub-zz commented Sep 13, 2022

Merinorus commented Sep 13, 2022

MartinThoma commented Sep 14, 2022

MartinThoma commented Sep 14, 2022

Merinorus commented Sep 14, 2022

PDF Read/Merge generates broken documents since update 2.10.6 #1344

PDF Read/Merge generates broken documents since update 2.10.6 #1344

Comments

Merinorus commented Sep 12, 2022 • edited Loading

Environment

Code + PDF

Traceback

pubpub-zz commented Sep 12, 2022

pubpub-zz commented Sep 12, 2022

pubpub-zz commented Sep 13, 2022

Merinorus commented Sep 13, 2022

MartinThoma commented Sep 14, 2022

MartinThoma commented Sep 14, 2022

Merinorus commented Sep 14, 2022

Merinorus commented Sep 12, 2022 •

edited

Loading