custom RDF metadata for PDF/A, especially for Factur-X eInvoices #2338

Merged
merged 6 commits into from
Jan 24, 2025

Conversation

the-infinity
Contributor

@the-infinity the-infinity commented Dec 31, 2024

This MR adds a way to add custom RDF metadata to PDF/A documents. It's implemented using a custom generator function, because RDF extensions get pretty complex, and usually you want to write your own generator.

Background: EN 16931 / ZUGFeRD / Factur-X eInvoices require a custom RDF metadata extension to be valid, something like this: https://www.pdflib.com/fileadmin/pdf-knowledge-base/zugferd/Factur-X_extension_schema.xmp . This cannot be added with the existing mechanisms. So far, we monkey-patched weasyprint to generate valid eInvoices, which works, but is unstable and not the way it should be done. Therefore, this MR. :)

Larger background: Germany enforces eInvoices starting 01.01.2025 for B2B customers with a transition period of two years. Other EU countries have similar plans. Therefore, there will be a growing interest in proper FOSS tools. :)

I will write a blog article about how to use weasyprint for eInvoices as soon as I don't have to recommend monkeypatching. I would be fine with using parts of the blog article in weasyprint's documentation, too.

I hope the MR is ok for you as this is my first MR in this project. I tried to stay close to your code style, but added typing as I think it makes sense to do so "on the way". If you want any changes or additional documentation, feel free to request it.

@the-infinity the-infinity changed the title custom metadata custom RDF metadata for Factur-X eInvoices Dec 31, 2024
@the-infinity the-infinity changed the title custom RDF metadata for Factur-X eInvoices custom RDF metadata for PDF/A, especially for Factur-X eInvoices Dec 31, 2024
@liZe
Member

liZe commented Jan 2, 2025

Hi!

Thanks for your PR, supporting ZUGFeRD / Factur-X could be a really good thing for WeasyPrint!

Everything seems to be OK with this code, I’ve only changed a couple of minor things (mainly removed types that we don’t want to use in WeasyPrint). Could I have rights to push the commit to your branch? (Or I can start another branch from your commit and create a new PR if you prefer.)

Before merging, could you please share some code you could use to generate Factur-X documents? We’d like to be sure about the API we’ll have to maintain. 😄 If you already have some content written for your blog post, don’t hesitate to share, we could add some documentation in this PR too.

@liZe liZe added the feature New feature that should be supported label Jan 2, 2025
@liZe liZe added this to the 64.0 milestone Jan 2, 2025
@the-infinity
Contributor Author

the-infinity commented Jan 2, 2025

Just invited you to the repository.

May I ask why you don't use typing? That's a bit uncommon / unexpected :)

Wanted to start writing the blog post as soon as I got a positive reaction from this MR, so I will start now. The method I would set at rdf_metadata_generator() would be generate_metadata() from this generator class:


from lxml import builder, etree
from weasyprint import __version__ as weasyprint_version


class RdfMetadataGenerator:
    """
    Generates XMP / RDF metadata for PDF/A-3b documents.
    """

    # The xml: prefix is predefined by the XML specification, so it needs no nsmap entry.
    xml_ns = 'http://www.w3.org/XML/1998/namespace'

    nsmap: dict[str, str] = {
        'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
        'dc': 'http://purl.org/dc/elements/1.1/',
        'fx': 'urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#',
        'pdf': 'http://ns.adobe.com/pdf/1.3/',
        'pdfaid': 'http://www.aiim.org/pdfa/ns/id/',
        'pdfaExtension': 'http://www.aiim.org/pdfa/ns/extension/',
        'pdfaSchema': 'http://www.aiim.org/pdfa/ns/schema#',
        'pdfaProperty': 'http://www.aiim.org/pdfa/ns/property#',
    }

    # One ElementMaker per namespace; only the root maker carries the prefix map.
    em_rdf = builder.ElementMaker(namespace=nsmap['rdf'], nsmap=nsmap)
    em_dc = builder.ElementMaker(namespace=nsmap['dc'])
    em_fx = builder.ElementMaker(namespace=nsmap['fx'])
    em_pdf = builder.ElementMaker(namespace=nsmap['pdf'])
    em_pdfaid = builder.ElementMaker(namespace=nsmap['pdfaid'])
    em_extension = builder.ElementMaker(namespace=nsmap['pdfaExtension'])
    em_schema = builder.ElementMaker(namespace=nsmap['pdfaSchema'])
    em_property = builder.ElementMaker(namespace=nsmap['pdfaProperty'])

    def generate_metadata(self, title: str | None) -> bytes:
        # Each helper returns one rdf:Description; the title block is optional.
        root = self.em_rdf.RDF(
            self._get_producer(),
            *([] if title is None else [self._get_title(title)]),
            self._get_pdfa_details(),
            self._get_factur_x_values(),
            self._get_factur_x_schema(),
        )
        return etree.tostring(root)

    def _get_producer(self) -> etree.Element:
        return self.em_rdf.Description(
            self.em_pdf.Producer(f'WeasyPrint {weasyprint_version}'),
            {etree.QName(self.nsmap['rdf'], 'about'): ''},
        )

    def _get_title(self, title: str) -> etree.Element:
        return self.em_rdf.Description(
            self.em_dc.title(
                self.em_rdf.Alt(
                    self.em_rdf.li(
                        title,
                        {etree.QName(self.xml_ns, 'lang'): 'x-default'},
                    ),
                ),
            ),
            {etree.QName(self.nsmap['rdf'], 'about'): ''},
        )

    def _get_pdfa_details(self) -> etree.Element:
        return self.em_rdf.Description(
            self.em_pdfaid.part('3'),
            self.em_pdfaid.conformance('B'),
            {etree.QName(self.nsmap['rdf'], 'about'): ''},
        )

    def _get_factur_x_values(self) -> etree.Element:
        return self.em_rdf.Description(
            self.em_fx.DocumentType('INVOICE'),
            self.em_fx.DocumentFileName('factur-x.xml'),
            self.em_fx.Version('1.0'),
            self.em_fx.ConformanceLevel('EN 16931'),
            {etree.QName(self.nsmap['rdf'], 'about'): ''},
        )

    def _get_factur_x_schema(self) -> etree.Element:
        return self.em_rdf.Description(
            self.em_extension.schemas(
                self.em_rdf.Bag(
                    self.em_rdf.li(
                        self.em_schema.schema('Factur-X PDFA Extension Schema'),
                        self.em_schema.namespaceURI('urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#'),
                        self.em_schema.prefix('fx'),
                        self.em_schema.property(
                            self.em_rdf.Seq(
                                self._get_schema_property(
                                    name='DocumentFileName',
                                    description='Name of the embedded XML invoice file',
                                ),
                                self._get_schema_property(
                                    name='DocumentType',
                                    description='INVOICE',
                                ),
                                self._get_schema_property(
                                    name='Version',
                                    description='The actual version of the Factur-X XML schema',
                                ),
                                self._get_schema_property(
                                    name='ConformanceLevel',
                                    description='The conformance level of the embedded Factur-X data',
                                ),
                            ),
                        ),
                        {etree.QName(self.nsmap['rdf'], 'parseType'): 'Resource'},
                    ),
                ),
            ),
            {etree.QName(self.nsmap['rdf'], 'about'): ''},
        )

    def _get_schema_property(self, name: str, description: str) -> etree.Element:
        return self.em_rdf.li(
            self.em_property.name(name),
            self.em_property.valueType('Text'),
            self.em_property.category('external'),
            self.em_property.description(description),
            {etree.QName(self.nsmap['rdf'], 'parseType'): 'Resource'},
        )

I will write full examples in my blog post; in a few days I will be able to provide more :)
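One way to sanity-check what such a generator returns is to parse the bytes again and read back the fx: properties. The sketch below uses only the standard library. The sample RDF is hand-written to mirror the structure of the class above (it is not actual WeasyPrint output), and read_factur_x_values is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
FX_NS = 'urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#'

# Hand-written sample (assumption): shaped like the output of _get_factur_x_values().
SAMPLE_RDF = (
    f'<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:fx="{FX_NS}">'
    '<rdf:Description rdf:about="">'
    '<fx:DocumentType>INVOICE</fx:DocumentType>'
    '<fx:DocumentFileName>factur-x.xml</fx:DocumentFileName>'
    '<fx:Version>1.0</fx:Version>'
    '<fx:ConformanceLevel>EN 16931</fx:ConformanceLevel>'
    '</rdf:Description>'
    '</rdf:RDF>'
).encode()


def read_factur_x_values(rdf_bytes):
    """Collect every fx:* property found in any rdf:Description into a dict."""
    root = ET.fromstring(rdf_bytes)
    values = {}
    for description in root.iter(f'{{{RDF_NS}}}Description'):
        for child in description:
            if child.tag.startswith(f'{{{FX_NS}}}'):
                # Strip the '{namespace}' part and keep the local name.
                values[child.tag.split('}', 1)[1]] = child.text
    return values


values = read_factur_x_values(SAMPLE_RDF)
```

A check like this only confirms that the four properties are present; it says nothing about whether a PDF/A validator accepts the file.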

@liZe
Member

liZe commented Jan 2, 2025

Just invited you to the repository.

Thank you!

May I ask why you don't use typing? That's a bit uncommon / unexpected :)

Sure: Kozea/Pyphen#50 (comment)

Wanted to start writing the blog post as soon as I got a positive reaction from this MR, so I will start now. The method I would set at rdf_metadata_generator() would be generate_metadata() from this generator class:

Thanks a lot! I’ll take a look and see if there’s anything I’d like to change.

@the-infinity
Contributor Author

About typing: it really helps IDEs like PyCharm, and it's actually the same reason: PyCharm has pretty good autocomplete features and hints at where you are doing something wrong if you provide proper typing. So I guess that's another way of being lazy: the IDE can do a lot for you if you add the types, and you don't have to think about it that much any more. With Python 3.5 / 3.6, I was not convinced either, but starting with 3.8 and especially with 3.10, typing really makes sense - at least for me :)

And, maybe an obvious question about the code: why didn't I add the whole Factur-X extension RDF to weasyprint? Well, two reasons:

  1. Factur-X is not the only PDF/A extension that exists. These extensions are a generic framework, so adding just the Factur-X extension would be a very specific solution, preventing people with other PDF/A extension needs from using weasyprint.
  2. I like lxml's ElementMaker a lot because the code is way cleaner than the code required by the standard library. But adding lxml as a dependency to weasyprint felt wrong, because PDF/A extensions won't be used by most users, so for most users it would be an unused library.
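For comparison, this is roughly what the small pdfaid block looks like when built with only xml.etree.ElementTree. It is a sketch for illustration, not code from this MR, and build_pdfa_details is a hypothetical name.

```python
import xml.etree.ElementTree as ET

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
PDFAID_NS = 'http://www.aiim.org/pdfa/ns/id/'

# Register prefixes so the serializer writes rdf: / pdfaid: instead of ns0: / ns1:.
ET.register_namespace('rdf', RDF_NS)
ET.register_namespace('pdfaid', PDFAID_NS)


def build_pdfa_details(part='3', conformance='B'):
    """Build an rdf:RDF root holding one rdf:Description with the PDF/A id."""
    root = ET.Element(f'{{{RDF_NS}}}RDF')
    description = ET.SubElement(
        root, f'{{{RDF_NS}}}Description', {f'{{{RDF_NS}}}about': ''})
    # Every element needs its namespace spelled out in Clark notation.
    ET.SubElement(description, f'{{{PDFAID_NS}}}part').text = part
    ET.SubElement(description, f'{{{PDFAID_NS}}}conformance').text = conformance
    return ET.tostring(root)


rdf_bytes = build_pdfa_details()
```

The nested extension schema needs many more SubElement calls of this kind, which is where the ElementMaker version stays easier to read.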

@the-infinity
Contributor Author

Blog article is finished and published: https://binary-butterfly.de/artikel/factur-x-zugferd-e-invoices-with-python/ . Which parts do you think are interesting for the weasyprint docs, @liZe ? :)

@liZe
Member

liZe commented Jan 6, 2025

@the-infinity Thanks a lot for sharing. I think that I will adapt the code in "Use both XMLs to generate a PDF/A with attachment" as a small snippet, and add a link to your blog entry for more detailed information, if it’s OK for you.

About the API, we may want to change minor things:

  • If PDF/A doesn’t allow the title in metadata, we may want to set that in WeasyPrint’s code. Do you remember where you saw this requirement?
  • I think that the PDF identifier is automatically generated when you don’t provide one. Does it work for you?
  • I think that setting base_url for the attachment filename is more a side effect than a real feature. I’ll check that and see if it’s possible to add an explicit filename parameter (or something else).

@the-infinity
Contributor Author

the-infinity commented Jan 6, 2025

Thanks a lot for sharing. I think that I will adapt the code in "Use both XMLs to generate a PDF/A with attachment" as a small snippet, and add a link to your blog entry for more detailed information, if it’s OK for you.

Sounds good :)

If PDF/A doesn’t allow the title in metadata, we may want to set that in WeasyPrint’s code. Do you remember where you saw this requirement?

Ok, that was a question where I had to look deeper. Turns out: that was not true, it was just a wrong assumption I had at some point based on an issue I saw ... somewhere. So, removed it. Updated the demo code above.

I think that the PDF identifier is automatically generated when you don’t provide one. Does it work for you?

A requirement is that if you regenerate the same invoice, it should give the same identifier. This would be broken when generating a random identifier.
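A minimal sketch of how such a reproducible identifier could be derived, assuming the seller ID and invoice number together uniquely name an invoice. The function name and both inputs are hypothetical, and how the value is handed to the PDF writer is not shown here.

```python
import hashlib


def stable_pdf_identifier(seller_id: str, invoice_number: str) -> bytes:
    """Derive the same identifier every time from the same invoice data."""
    # The NUL separator is an assumption of this sketch; it keeps
    # ('AB', 'C') and ('A', 'BC') from hashing to the same value.
    key = f'{seller_id}\x00{invoice_number}'.encode('utf-8')
    return hashlib.sha256(key).hexdigest().encode('ascii')


first = stable_pdf_identifier('DE123456789', '2025-0001')
second = stable_pdf_identifier('DE123456789', '2025-0001')
other = stable_pdf_identifier('DE123456789', '2025-0002')
```

Regenerating the same invoice gives the same bytes, while a different invoice number gives a different identifier.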

I think that setting base_url for the attachment filename is more a side effect than a real feature. I’ll check that and see if it’s possible to add an explicit filename parameter (or something else).

Okay :)

@liZe liZe force-pushed the custom-metadata branch 2 times, most recently from 603295d to 8a7c72e Compare January 19, 2025 16:33
@liZe
Member

liZe commented Jan 19, 2025

Hi @the-infinity!

I think that this PR is almost ready to be merged. There are some minor changes required for your blog article (name and parameters of generate_rdf_metadata, using Attachment’s name parameter for example), but the overall logic is still the same.

I’ve included a full example in the documentation, tested with FNFE-MPE’s online validator (free account required).

Thanks a lot for your hard work!

@liZe liZe merged commit 51aea2e into Kozea:main Jan 24, 2025
6 checks passed