Please describe the bug
It appears that Nokogiri's HTML parser returns parse errors for valid img tags whose src URLs contain two or more query parameters.
Help us reproduce what you're seeing
Open irb and run the following script. Observe that parsing one_qp does not produce any errors, whereas parsing two_qp does.
To be valid XML, you would need to escape the ampersand as &amp;. Nokogiri currently relies on the libxml2 parser for HTML, which handles some, but not all, of the differences between HTML and XML.
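As a minimal sketch of the escaping mentioned above, assuming the tag is built by hand in Ruby: `escape_attr` is a hypothetical helper (it is not part of Nokogiri), and the URL is a placeholder standing in for the two_qp case from the report.

```ruby
# Hypothetical helper, not a Nokogiri API: escape the characters that are
# special inside a double-quoted attribute value.
ATTR_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

def escape_attr(value)
  # String#gsub with a Hash replaces each match with its mapped value.
  value.gsub(/[&<>"]/, ATTR_ESCAPES)
end

# Placeholder URL with two query parameters (assumed, not from the report).
src = "https://example.com/image.png?width=100&height=200"
two_qp = %(<img src="#{escape_attr(src)}">)

puts two_qp
```

With the ampersand written as an entity, the attribute value is unambiguous for an XML-oriented parser. Whether a given libxml2 version then parses the tag without reporting errors is something to check in your own environment.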
The plan is to merge nokogumbo into nokogiri, but you don't need to wait: you can use nokogumbo today.
The behavior you're describing is inherited from the underlying HTML library, libxml2. Here's the C code that controls URI-escaping of certain HTML attributes at serialization-time (when the document is printed):
Specifically, href, action, src, and name (but only within an anchor) are always escaped when generating HTML -- basically, anything that could be a URI reference.
Using Nokogumbo along with Nokogiri for HTML should be a better experience, and as Sam mentioned, we're working on improving the integration between the two libraries.
I see, thank you both for the detailed responses! I'll look into using Nokogumbo for now and keep an eye on the project merge tracking issue. It might be helpful to add a small temporary note to the documentation recommending Nokogumbo for HTML5 parsing (perhaps there already is one, but I didn't notice any when searching for information on this error). Closing this issue, good luck with the merge!
Expected behavior
Parsing a string containing a single image tag with two query parameters should not return any errors.
Environment