Make async runtime scale better on SMT machines #850
Comments
According to the benchmark from scheduler,
Thanks for the pointer! That benchmark focuses on speed, not scalability, so I'm sceptical that it'll make any difference. I don't have the energy to invest in this right now, but if you do, please try it and report back the results!
I investigated using
I am also experiencing this scaling issue. The increased userland CPU time when using more Profiling didn't reveal anything interesting, the profiles looking overwhelmingly similar for I'm beginning to wonder if this issue might be in the GHC RTS, rather than Hakyll...
@frasertweedale I did some investigation with ThreadScope afterwards that wasn't especially insightful, which is why I didn't mention it here, but it did show that some of the overhead was GC-related. When I minimised garbage collection using some of the suggestions here the observed performance did seem to scale better. I'm a relative novice when it comes to parallel Haskell so it's entirely possible that there's something simple that I'm missing.
@vaibhavsagar thanks for the additional info. It is always helpful to mention the dead ends in the investigation. That way, people will know it has been done, and won't waste their time doing the same thing :)
When using multiple capabilities, on GHC 8.8 I get the best results with There must be something about Hakyll's design that makes parallel GC particularly inefficient. When actually using multiple capabilities there was an improvement in wall time GCing the second generation, although productivity still decreases considerably. For the first generation, the parallel GC performance is quite terrible. I'd be interested to see how GHC 8.10+'s I'm suspending my investigation at this point. Single-threaded performance is good enough for me and even with
FWIW, I ran into severe performance problems apparently related to these changes when I recently upgraded Hakyll after a while. My writeup: https://groups.google.com/g/hakyll/c/5_evK9wCb7M/m/3oQYlX9PAAAJ
I would like to look into this during ZuriHac 2022, I'm not sure if I'll have time before that. My current suspicion is that the combination of an
@jaspervdj #903 is much more pressing, if you're in the mood to dig into hard issues :) Sadly I didn't have enough energy to do that, even though I promised. It looks like the
Yeah, I wonder if we should just roll back the concurrent runtime for now given these issues. Is the slight speedup for some sites worth the overhead for others? I’m not sure. Doing a concurrent runtime still seems doable and worthwhile, and I think we can get it with minimal overhead, but it requires a bit more investigation to update or remove some existing abstractions like Store.
I have an implementation in https://github.com/jaspervdj/hakyll/tree/async-scheduler which is a bit rough but should generally work and allow us to scale much better. A few things like error handling and checking for cyclic deps still need to be improved though.
Does #946 resolve this issue?
@vaibhavsagar Not really, see the benchmark results here: #946 (review)
Update: I've been using a fork all this time, as mentioned, and so haven't seen any effect of the new scheduler. My Threadripper workstation has died, so I can no longer test high core counts. I've been restarting on an Ubuntu 24 laptop with just 8 virtual cores (4 real, IIRC), and running with 5-7 threads has not shown any major issues with the 4.14.0.0 HEAD (GHC 9.4.7).
#844 added a new async runtime, but on SMT machines (e.g. Intel's Hyper-Threading), it doesn't scale well past the number of physical cores. Details are in #844 (comment), and there are some ideas further down the thread.
For now, the workaround is to use +RTS -Nx to limit the number of threads to the number of cores.
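As a rough illustration of that workaround, here is a small shell sketch that turns a logical CPU count into the matching RTS flags. The helper name rts_flags, the default of 2 hardware threads per core, and the ./site binary in the usage note are assumptions made for illustration, not part of Hakyll. The site binary also has to be linked with -threaded and must accept RTS options (for example, built with -rtsopts) for -N to have any effect.

```shell
# Hypothetical helper: print "+RTS -N<cores> -RTS" for a given logical CPU count.
#   $1: number of logical CPUs (hardware threads)
#   $2: hardware threads per physical core (assumed to be 2 if omitted)
rts_flags() {
  logical=$1
  threads_per_core=${2:-2}
  # Integer division gives an estimate of the physical core count.
  cores=$(( logical / threads_per_core ))
  # Never ask the RTS for fewer than one capability.
  if [ "$cores" -lt 1 ]; then
    cores=1
  fi
  printf '+RTS -N%s -RTS\n' "$cores"
}
```

On Linux, something like ./site build $(rts_flags "$(nproc)") would then run the build with one capability per estimated physical core. If the machine has no SMT, or more than two threads per core, pass the real threads-per-core value as the second argument.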