Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

define handling of plus and space #261

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

matt-phylum
Copy link
Contributor

@matt-phylum matt-phylum commented Nov 7, 2023

This is a big change (in terms of impact, not lines), and I'm not entirely sure its the correct change, but something needs to be done in this area.

Problem

The PURL spec describes qualifiers as being an & delimited sequenced of = delimited key value pairs where the value is percent encoded, and the section on encoding describes a minimal set of characters that are supposed to be percent encoded in different contexts. This looks a lot like x-www-formurlencoded, but x-www-formurlencoded encodes almost all characters besides the ascii alphanumeric set, and has a special behavior where ' ' is encoded as '+'.

I am aware of thirteen PURL implementations:

Implementation ' ' decodes as ' ' ' ' encodes as %20 '+' decodes as '+' '+' encodes as %2B
althonos/packageurl.rs ⚠️
anchore/packageurl-go ⚠️ implemented in prerelease
giterlizzi/perl-URI-PackageURL ☠️ parses incorrectly
maennchen/purl 💣 does not parse ⚠️ sometimes encodes as ' ' ☠️ '+' is encoded as '+' which decodes as ' '
package-url/packageurl-dotnet
package-url/packageurl-go
package-url/packageurl-java 💣 does not parse
package-url/packageurl-js
package-url/packageurl-php
package-url/packageurl-python
package-url/packageurl-ruby
package-url/packageurl-swift
phylum-dev/purl
sonatype/package-url-java ☠️ '+' is sometimes encoded as %20

That means if you're working with qualifiers that have spaces or plus signs in their values, it's fairly likely that software using a different implementation of PURL will interpret the PURLs differently. This may also happen if your PURLs have plus signs in their versions (deb) or spaces in their names (swid) and the implementation incorrectly decodes the name and version as if they were qualifier values (see below).

6/14 of the implementations are decoding '+' as ' ', so here's why I think it's better for the spec to specify '+' is '+':

  1. The spec doesn't say it should be x-www-formurlencoded, although it could be argued that it is implied. Using x-www-formurlencoded would make PURLs more compatible with regular URL parsers and utilities (but still not 100% compatible IIRC), at the expensive of making the specification more complicated.
  2. Three of those six implementations that encode and decode qualifier values as if they were x-www-formurlencoded also encode and decode other parts of the PURL, such as the version, as if it were an x-www-formurlencoded value. This is not standard URL behavior and it is not part of the PURL spec. This seems like an implementation mistake, incorrect regardless of how the qualifier values are encoded. (package-url/packageurl-dotnet, package-url/packageurl-ruby, sonatype/package-url-java)
  3. Having a different encoding scheme for the qualifiers vs the rest of the PURL, while common for regular URLs, can lead to confusion, both for implementors (see above) and for users when dealing with PURLs that contain plus signs in and out of qualifiers.

Proposal

  1. ' ' decodes as ' ' and '+' decodes as '+' (as in 8/14 implementations)
  2. In qualifiers, ' ' encodes as %20 and '+' encodes as %2B (as in 8/14 implementations) to avoid ambiguity when PURLs are parsed by implementations pre this change (unfortunately, there's nothing we can do about parsing PURLs created by implementations pre this change)
  3. In the rest of the PURL, ' ' encodes as %20 and '+' encodes as '+' because that's how the spec was written, it's not ambiguous that the rest of the PURL might be x-www-formurlencoded, and plus signs seem like they're common enough with some package types that there's benefit in keeping them unencoded.

The spec is updated to be specific, new tests are added to the test suite, and incorrectly escaped examples in the package types spec are updated to be consistent.

Unfortunately, this requires changes to at least 7/14 implementations to get everything aligned, but they should be minor changes.

@pombredanne
Copy link
Member

@matt-phylum this is more than nice research and work you did there! 🙇 Let me review this in details and come with feedback!

@matt-phylum
Copy link
Contributor Author

I found links to three more implementations in the readme file of this repository and updated the description to include them. packageurl-java merged some PRs that fixed most of the problems with spaces and plus signs so it's been updated here.

@matt-phylum
Copy link
Contributor Author

I've published a slightly cleaned up version of the code used to collect the data for the above table in https://github.com/phylum-dev/purl-survey. Hopefully it's useful for future cross-implementation testing like this.

@tiegz
Copy link

tiegz commented Jan 23, 2024

+1 to this.

I just discovered that syft will not encode + in Golang versions even though + is a valid character in SemVer, so PURL parsers of Syft-generated PURLs will decode as v1.2.3 incompatible instead of v1.2.3+incompatible.

I think you could probably add this https://github.com/anchore/packageurl-go library to this list since that's what syft is using.

@johnmhoran johnmhoran added the Ecma specification Work on the core specification label Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ecma specification Work on the core specification PURL core specification Format and syntax that define PURL (excludes PURL type definitions) PURL encoding PURL qualifiers component
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants