Hi, we noticed a change in the parser behavior regarding newlines/whitespace between inline/block elements when using Jsoup to create a document from HTML.
The difference is in what happens between the two p-tags. In previous versions, the whitespace between them was collapsed, leaving a single whitespace. Now that whitespace is removed entirely. We noticed this change of behavior inside table cells, but there might be more places where it happens.
Why we think the previous behavior was better:
The additional whitespace between actual block elements did not hurt. And as the Tag#isBlock method does not take display:inline styling into account, Jsoup can't be sure an element really is rendered as a block, so it would be better to leave the whitespace in and let the browser handle it.
The normal browser behavior is that two block elements following one another are each put on a new line, no matter whether the raw HTML has no whitespace, a single whitespace, or a linebreak between them.
Two inline elements are put on the same line, and it makes no difference whether there is a whitespace or a linebreak between them. But it does make a difference if there is no whitespace at all between them: the browser will not add one, and the text in the spans is printed directly next to each other.
So if one pastes the resulting HTML from above into an HTML file and opens it in the browser, there is a difference: the browser now sees two inline elements without a whitespace or linebreak between them and prints them just like that, whereas with the original HTML there would be a whitespace between them.
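To make the rule above concrete, here is a rough sketch in plain Java. It is not Jsoup code and not how Jsoup's pretty-printer decides anything; `WhitespaceSketch`, `rendersInline`, `whitespaceMatters` and the small `BLOCK_BY_DEFAULT` set are all made up for illustration. It only models the point that whitespace between two siblings is safe to drop unless both end up rendered inline, and that a `display` style can turn a nominal block tag into an inline one.

```java
import java.util.Locale;
import java.util.Set;

class WhitespaceSketch {
    // Hypothetical, deliberately incomplete list of tags that render as blocks by default.
    static final Set<String> BLOCK_BY_DEFAULT = Set.of("p", "div", "table", "ul", "ol", "li", "h1");

    // Decide whether an element is rendered inline. An explicit CSS display value
    // (for example "inline" or "inline-block") overrides the tag's default;
    // pass null when the element has no display style.
    static boolean rendersInline(String tag, String cssDisplay) {
        if (cssDisplay != null) {
            return cssDisplay.startsWith("inline");
        }
        return !BLOCK_BY_DEFAULT.contains(tag.toLowerCase(Locale.ROOT));
    }

    // Whitespace between two adjacent siblings only changes the rendered text
    // when both are inline: a block on either side forces a line break anyway.
    static boolean whitespaceMatters(boolean leftInline, boolean rightInline) {
        return leftInline && rightInline;
    }

    public static void main(String[] args) {
        // Two plain paragraphs: dropping the whitespace is harmless.
        System.out.println(whitespaceMatters(rendersInline("p", null), rendersInline("p", null)));       // false
        // Two spans: dropping the whitespace glues the words together.
        System.out.println(whitespaceMatters(rendersInline("span", null), rendersInline("span", null))); // true
        // Two paragraphs styled display:inline behave like the spans.
        System.out.println(whitespaceMatters(rendersInline("p", "inline"), rendersInline("p", "inline"))); // true
    }
}
```

The third case is the one we are worried about: a check based only on the tag name would treat it like the first case.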
Original in browser:
Parsed with version <15.4:
Parsed with version 15.4:
AnanasPizza changed the title from "Difference in jsoup version 15.4 to previous version" to "Difference of block element handling in jsoup version 15.4 to previous version" on Mar 28, 2023
Thanks for the detailed and clear report. I have fixed this by causing nested inlineable content elements (like TDs and Ps) to wrap.
The pretty-printer code is now unfortunately pretty gnarly. I think it will be useful to completely refactor the implementation to simplify it and make the output more customisable.