-
Notifications
You must be signed in to change notification settings - Fork 364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thread is stuck during HTML to PDF conversion with nested tables for fairly large HTML #551
Comments
hi @swillis12 , I'll have a look using your example (thank you!) in the profiler, maybe there is some obvious fix that can be done. |
Which version are you using? With 1.0.4, with the Using the following code:
I got 1218ms |
Thanks @syjer. Yes I am on 1.0.4. That is strange.. it is much slower on mine. The only difference is that I am reading the HTML from a String in memory rather than from a file like you are doing. Update: I tried it with your code and am seeing the same results. It is very fast (I got 2685ms on the larger HTML file)! This is good news, but now I have to figure out what is going on in my program. This is the code I'm using by the way:
|
It may be:
To be noted, I think we can still improve the performance :). |
This is very possible. Yes I currently have the SLF4J jar added -- How do you recommend testing what you mention about reducing the amount of "garbage" generated by the logging? Should I just exclude the SLF4J jar? Note: it doesn't seem to be making that many log statements, just what's below:
|
As a first start, I would add the following flag to the jvm: You will see during the execution some lines like:
If they appear too often and with quite a lot of time, then it could be the issue: maybe not enough memory is given to the java process or some kind of memory leak is happening. Alternatively you can use visualvm, for visualizing the gc activity. With it you can also do a first profiling of the application (cpu or memory) and try to pin down the main root cause. edit: for reducing the amount of garbage: well, first we need to identify what could be the issue in this library :) |
I'm reading the HTML content from a string in memory as well as writing the outputstream to memory (browser Response).. So when I get a chance I'll do a quick test and first write the HTML content to a file and then have the library write it to a file as well. Thanks for the help so far @syjer I'll try this out once I get a little more time. What do you recommend I do with this issue for now? |
@swillis12 , you can keep the issue open, maybe somebody else has/had the same issue and could provide additional feedback. I'm still thinking how it would be possible to have this much difference in execution time (1-2minutes vs few seconds), even in a case of GC issue, I don't think it would be that bad. To be noted, we have another issue of slow generation time: #506 but this one seems to be more font related |
Thanks again @syjer. I ran a quick test below and got the same slow results.. I'll have to do some GC profiling as you described. I wish I could help you reproduce it :). Yes I had scoured the issue tracker for mentions of slowness and did come across that one. I also want to note that I'm seeing the same behavior using FlyingSaucer (I swapped this library for FS to run a quick test). It seems to be equivalently slow.
|
btw, beware when using Better to delete in a (obviously, if it's a short running process, it's not an issue ;)) |
@syjer I was away last week but got another chance to look at this. My test case that I originally attached is not reproducing the actual issue. If you look at my update, the issue is seen only when the enclosing div has auto height. You can try it out and hopefully see what I am talking about. P.S. I tested with the latest code including your change #552. So it doesn't seemed to have improved it much. It seems maybe related to a memory leak in CSS calculations when the table height is auto. Notice the constant GC (purple line) since the new generation space is filling up over and over again: |
hi @swillis12 , thank you for updating the example 👍 . I'll have a look most likely friday. |
hi @swillis12 , with the new template, I'm able to reproduce the reported issue. Now, for finding the main issue, it will not be easy :). It does not seems to be a memory issue, but something in the layout algorithm that cause to spend quite a lot of time. |
the main culprit seems to be:
removing this css rule generate the pdf in 1.4~ seconds on my pc for the file edit: most likely the page break avoiding algorithm is sub optimal and spiral out of control when you have a lot of children elements. |
Looking at https://github.com/danfickle/openhtmltopdf/blob/open-dev-v1/openhtmltopdf-core/src/main/java/com/openhtmltopdf/layout/BlockBoxing.java#L40 , I've got the impression the rule is dropped too late, maybe due to the peculiarity of this page (deep hierarchy). Most likely @danfickle has a better idea than me :), I'll still try on my side to find a solution though. |
Even more interesting, the issue appear only when the first tbody has the css rule applied. If you apply the page-break-inside:avoid only to the most inner tbody, it work without problem :). I think I'll be able to condense the issue in a more compact html file. |
Good catch! Thanks for the workaround @syjer. Hopefully this helps others that may run into this too. |
@swillis12 it may have a role, but I think in the #506 issue, the main cause is more inside pdfbox. |
I've cut a little bit the problematic file: issue-551-page-break-inside-avoid.txt It's quite clear we have O(n²) algorithm, or maybe even exponential, as it depend on how deep it need to check. |
I'm struggling with PDF generation where it goes out of control and eventually locks things up using this:
Would that be the same culprit as |
P.S. There was an n squared algorithm in |
hi @danfickle , I've tried the |
Update 9/23/20: The first two HTML files I attached are misleading and work as expected. I found that the problem is actually occurring only when the height/max-height of the div enclosing the table is set to "auto":
@syjer Try this new HTML file attached, it should reproduce the issue for you as well. Sorry about the confusion with the previous test cases. FYI running this in my minimal test application using this HTML file was 90,087ms which is consistent with the results I see inside of my program.
smaller_test_auto_height.txt
Original Issue/info below, the stack trace thread dump info applies to the "smaller_test_auto_height.txt" test case since my proram was using this "auto" height all along (which I wasn't aware of 🤦 ):
Currently our HTML target is email, so we are using a lot of HTML table elements, nested and everything is inline styles. I suspect that I have a thread hanging due to the nested tables. Unfortunately I'm not sure what I can do to work around this issue. I've tried "table-layout: fixed" and assigning column widths as well.
Update: made sure that it is valid XHTML as well using https://validator.w3.org/: problem_html1.txt
Here is also a smaller test html that is still painfully slow (~1.5-2 minutes). It is the same content, just reduced the number of table rows to 71 for easier visibility:
smaller_test.txt
If I enable logging I see infinite messages as follow:
Thread dump shows this stack:
Question is -- what exactly is the issue and what can I do to work around this if I am pretty much stuck with the current layout? It may not be feasible to get rid of the table nesting.
The text was updated successfully, but these errors were encountered: