Handle OOM gracefully in PVF execution #767
Comments
Ask around if |
https://man7.org/linux/man-pages/man2/mlock.2.html is better, but it depends on the RLIMIT_MEMLOCK setting, which might not be configured at all and then falls back to the kernel default (64 KiB?). We can use it dynamically if the limit is high enough and advise operators to configure this limit properly to improve performance. MADV_WILLNEED is just a hint to the kernel and might do nothing under high memory pressure, but it is worth experimenting with. |
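A minimal sketch of the "use it dynamically if the limit is high enough" idea, in Python for brevity. The function names and the "mlock"/"madvise" labels are hypothetical; only `resource.getrlimit` and `RLIMIT_MEMLOCK` are real. It ignores the case where the process holds CAP_IPC_LOCK, which lifts the limit.

```python
import resource


def can_lock(required_bytes, limits=None):
    """Return True if `required_bytes` fits under the soft RLIMIT_MEMLOCK.

    `limits` is a (soft, hard) pair; when omitted, the limits of the
    current process are queried.
    """
    if limits is None:
        limits = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    soft, _hard = limits
    return soft == resource.RLIM_INFINITY or required_bytes <= soft


def choose_strategy(required_bytes, limits=None):
    # Hypothetical policy: lock the memory when the limit allows it,
    # otherwise fall back to the MADV_WILLNEED hint.
    return "mlock" if can_lock(required_bytes, limits) else "madvise"
```

With a 64 KiB limit, a worker that needs 1 GiB would fall back to the hint, which is where advising operators to raise the limit comes in.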
I'd think mlock would bring far more complexity, like more OOM crashes, etc. |
Yeah you are right, I was thinking more about performance rather than robustness. |
Something to keep in mind is whether this will increase latency for the one "special" worker in charge of retries. I suppose this happens rarely in practice, so it's probably not an issue. |
Well, except there could be other allocations besides the linear memory in the course of execution. Hence, "guarantee" is too strong a word here.
TBH, I am not sure about that. My mental model of the OOM killer is that it mostly operates based on memory, and it strives to sacrifice children. TBH, I've never bothered to dive into the sources to check how it actually works though. What are your sources?
To my knowledge, this can be dangerous and, counterintuitively, sometimes worsen the overall performance. Can you elaborate on this more? However, maybe we should move this discussion elsewhere since it seems tangential. |
Source is this, in particular:
About swap: Interested in hearing arguments here, for PVF execution, especially if we timeout based on wall clock time, swapping is to be avoided at all cost - otherwise required time can easily skyrocket - it is better to just crash in that case. Even if we use CPU time, we should not swap - because then it would not necessarily timeout, but we would run into a situation where wall clock time will be way longer than CPU time, which is also to be avoided if we want to only timeout based on CPU time. |
The problem with disabling swap is that, as far as I know, it does not fully solve the problem, because there is a second mechanism in place: the kernel simply drops things from memory that it already has on disk as well, like the code segment of a process. So yes, in some load scenarios swapping clearly unused memory will lead to better performance than forcing the kernel to evict code sections of running processes. But we will also OOM faster without swap, so the time the system runs with really bad performance is significantly shortened. With swap the system can keep running forever (if you have lots of swap), but it will be barely usable, and by "barely usable" I mean "not usable": it works at a speed at which, for all practical purposes, it is not working. For consensus it is better to just crash before running into that situation. |
I think this one is outdated. I was able to find this saying it all was greatly simplified. It is old, but seems to be in line with the current source code (link). Yes, I agree with your thinking regarding the swap. Thanks for the elaboration. I am worried about the second-order effects. Specifically, I am unsure how well our software behaves in the "let it crash" approach. One thing that crossed my mind is that we are clearing the PVF artifact compilation cache after restarting. But it is also a gut feeling since we don't do much testing to verify that nodes behave well. I don't have a good suggestion here though. |
One thing that's not clear to me is, would we want to disable swap on just the special worker or all workers? If the latter, should it perhaps be a separate issue independent of other changes? |
Can we actually disable swap per process? I was thinking that the node operator would not enable it at all. If we can specifically mark PVF execution workers as: "Don't swap (in any way)" that would actually be the best solution. |
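One possible answer, sketched under the assumption that the workers run in their own cgroup v2 group with the memory controller enabled: writing `0` to the group's `memory.swap.max` file stops anonymous memory of that group from being swapped out. The helper name and the directory argument are hypothetical. File-backed pages such as code can still be evicted, so this does not cover "in any way".

```python
from pathlib import Path


def disable_swap_for_cgroup(cgroup_dir):
    """Set the swap hard limit of a cgroup v2 group to zero.

    `cgroup_dir` is the directory of the group, for example a
    hypothetical /sys/fs/cgroup/pvf-workers. Returns the path written.
    """
    path = Path(cgroup_dir) / "memory.swap.max"
    path.write_text("0\n")
    return path
```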
Wow, from 2010 and it is more current than "the docs" ...and then people complain that our docs are outdated. Anyhow, thanks @pepyakin for checking it out. |
That’s exactly my initial suggestion but it might be a bad idea to do this when there is memory pressure. |
We can limit data size for a process I suppose. |
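A rough illustration of limiting the data size of a child process with RLIMIT_DATA, again in Python. The helper name is hypothetical, and a real worker would set the limit in its own spawn path. On Linux an allocation beyond the limit fails inside the child instead of growing until the OOM killer steps in.

```python
import resource
import subprocess
import sys


def run_with_data_limit(code, limit_bytes):
    """Run `code` in a child Python process with a capped data segment."""

    def apply_limit():
        # Runs in the child between fork and exec, so only the child
        # is affected by the limit.
        resource.setrlimit(resource.RLIMIT_DATA, (limit_bytes, limit_bytes))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_limit,
        capture_output=True,
        text=True,
    )
```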
Here are several pointers. First of all, First, there is Then, there is Hence, perhaps, it makes sense to look into cgroups. It allows a more fine-grained approach, but it is also harder to work with. Several caveats from the top of my head:
|
You're worried the OOM error cannot be distinguished from another error type? If so, your allocator could flag that it's doing an allocation somewhere, and then remove the flag afterwards. I'm kinda surprised if OOM errors are not distinguished somehow though. |
This should be prioritized after the sandboxing work since we have seen OOMs in the wild. See paritytech/polkadot#7155 (comment). Since we are restricted to Linux now for security (see #881), we can go ahead and explore using
We can explore this in conjunction with a global allocator wrapper. |
Would the system really crash, or would the OOM killer start sacrificing individual processes? And it could really be any process; we don't know what the behavior will be unless we test it. One scenario is polkadot may keep going, but be nonfunctional - e.g. worker processes may get repeatedly killed leading to time outs and no-shows. But as a thought exercise, one would think the kernel would first go after processes that use significant memory, but which haven't touched it in a while or are otherwise deemed unimportant. So we could instantiate a canary process with the lowest priority, that reserves some non-trivial amount of memory, locks it, and then goes to sleep. If polkadot detects that this process died, then we can safely initiate a shutdown including informing operators what happened. (I'm more comfortable with this kind of solution now that we are officially restricting to Linux so we don't have to implement/test for other platforms.) Edit: after reading the source linked above, the kernel mainly tries to find the process using the most memory which is not OOM-killer-disabled. Its exact strategy is described here. We can use the utility at that link to disable the OOM killer for polkadot and all child processes, and then re-enable it for the child canary process. |
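A sketch of how the per-process OOM-killer setting could be adjusted through procfs, assuming the standard `oom_score_adj` interface: -1000 exempts a task, 1000 makes it the preferred victim, and children inherit the value on fork. The helper and its `proc_root` parameter are hypothetical (the parameter exists only so the function can be tried against a scratch directory). Lowering the value requires CAP_SYS_RESOURCE.

```python
from pathlib import Path

OOM_SCORE_ADJ_MIN = -1000  # task is never selected by the OOM killer
OOM_SCORE_ADJ_MAX = 1000   # task is the preferred OOM victim


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write `value` to <proc_root>/<pid>/oom_score_adj and return the path."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj out of range: {value}")
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    path.write_text(f"{value}\n")
    return path
```

For the canary idea, the canary would get `OOM_SCORE_ADJ_MAX` so that it is sacrificed first, and the parent would watch for its death.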
I would not disable the OOM killer for Polkadot itself - that would make things worse. But if we can disable it for workers (which have an upper bound on memory) that would be perfect! |
This all sounds like a lot of complexity. I would go for a simpler approach like monitoring memory pressure on the system and handling that gracefully. Unless there is a mem leak in our code it should be the operator who has to ensure that the system has enough memory to run a validator node. |
In this case, assuming there are no other major processes running, the OOM killer would nuke polkadot, which wouldn't allow us to do any graceful exit (it sends a SIGKILL) - which may be fine?
That sounds good, can you elaborate on how you would do that? Do you mean from within polkadot? Thinking more about what we are trying to achieve, there are two separate problems both relating to determinism:
Disabling swap on the workers can help with 1. An OOM flag can help with 2. "Safely" crashing when memory is low seems to address both 1 and 2. It might actually simplify things in a sense, because we wouldn't have to think about memory anymore in discussions about determinism, i.e. there would be one less factor to consider. But maybe it's too extreme?
Wondering out loud, would aborting polkadot when memory went way up have helped in any recent incidents? |
The Polkadot process can make use of cgroups to detect the memory pressure:
I like this question, which kind of returns to the initial issue comments. AFAIK validator nodes are under less pressure on memory than on CPU, network or disk. I have seen some instances of OOM kills, possibly due to leaks in our code. Before the OOM killer steps in and does its thing, swap usage (if not disabled) will clearly degrade performance. If additional processes running on the system also eat a lot of memory, we can only protect ourselves by bailing out gracefully, which is probably better than getting slashed for voting against a valid candidate (not implemented now) and loading the system with disputes. |
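One way the pressure detection could look, assuming the kernel exposes pressure stall information (PSI) in `/proc/pressure/memory` or in a cgroup v2 `memory.pressure` file; both use the line format shown in `SAMPLE`. The function names and the 10 percent threshold are hypothetical.

```python
SAMPLE = (
    "some avg10=12.50 avg60=3.10 avg300=0.80 total=123456\n"
    "full avg10=4.00 avg60=1.00 avg300=0.20 total=65432\n"
)


def parse_pressure(text):
    """Parse PSI text into {"some": {"avg10": ...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=") for field in fields)
        }
    return result


def under_pressure(text, threshold=10.0):
    """True if the 10 s "some" stall average exceeds `threshold` percent."""
    return parse_pressure(text).get("some", {}).get("avg10", 0.0) > threshold
```

A node could poll the file periodically and feed `under_pressure` into whatever bail-out policy it settles on.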
Just an idea: instead of trying to prevent OOMs, which is a hard problem without controlling the full environment (kernel configuration, OS, other processes running on the system), have we explored the idea of a slim supervisor process that monitors our beefy executables and, when an OOM happens, just gracefully exits the network? |
My impression is that trying to prevent OOM by doing something before it happens is racy. If memory growth is happening too quickly, we might be too slow to react and the OOM happens anyway. This also seems hard to configure in a way that will not backfire. Having Polkadot killed by SIGKILL (or similar) is not ideal, and we should make sure things like database corruption cannot happen because of it, or if they can, we should at least be able to detect them. Other than that, it is better to kill it than to leave it in a state where it no longer functions correctly. |
Indeed it will be racy, but unfortunately it isn't possible to make it perfect. I don't think we should worry that much about the OOM killer, since it will usually target the largest consumer of memory, which is not the PVF workers but the Polkadot binary itself, so if the growth happens too quickly we'll quickly be killed, which solves the problem of not functioning optimally. If we gracefully exit the PVF workers on memory pressure, we should also notify the Polkadot process, and it can decide based on the frequency of these events whether to exit, so that intermittent spikes in memory usage do not backfire. Anyway, we should experiment with this and collect some data before we enable the functionality. |
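The "decide based on the frequency of these events" part could be sketched as a sliding-window counter. The class name, the thresholds, and the explicit `now` argument (seconds, passed in so the logic is easy to test) are all hypothetical.

```python
from collections import deque


class PressureTracker:
    """Counts memory-pressure events within a sliding time window."""

    def __init__(self, max_events, window_secs):
        self.max_events = max_events
        self.window_secs = window_secs
        self.events = deque()

    def record(self, now):
        """Record an event at time `now`; return True if the node should exit."""
        self.events.append(now)
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0] > self.window_secs:
            self.events.popleft()
        return len(self.events) > self.max_events
```

With `PressureTracker(2, 60)`, three events within a minute trigger an exit, while isolated spikes minutes apart do not.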
https://lwn.net/Articles/941614/
It seems in the future we may be able to control the OOM killer with BPF. |
The memory a wasm binary is allowed to use is already limited to my knowledge (1GiB?), so we should be able to have a PVF execution worker that pre-allocates enough memory on startup, so it will never allocate memory because of execution. I am talking about real memory here, not virtual memory. This should be enforceable by filling memory with random data. This might sound wasteful, but PVF execution is consensus critical so reserving memory for deterministic execution seems totally acceptable. Except of course for parallel execution.
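A small sketch of the pre-allocation step, assuming an anonymous mapping whose pages are written once at startup so that the kernel has to back them with real memory. The helper name is hypothetical, and it writes a fixed byte per page rather than random data. Without mlock or a swap limit the pages could still be swapped out later.

```python
import mmap


def reserve_resident(size_bytes):
    """Map anonymous memory and touch every page of it.

    Writing to a page forces the kernel to fault it in, so the mapping
    is backed by real memory rather than only by address space.
    """
    region = mmap.mmap(-1, size_bytes)
    for offset in range(0, size_bytes, mmap.PAGESIZE):
        region[offset] = 0xA5
    return region
```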
Optimistic flex mem execution
To minimize memory usage for parallel workers, I suggest reserving memory for only one worker, which is therefore special. All other parallel workers will allocate memory as needed. When a PVF gets executed, it gets scheduled on any worker just as it is now, but if it fails due to AmbiguousWorkerDeath we try again, this time guaranteed on the one special pre-scheduled worker. Only if it also fails there on the second try do we report Invalid.
Reasoning
With this setup unjustified disputes due to memory pressure should become extremely unlikely and trying a second time should also effectively fix other instances of unjustified disputes.
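The retry flow described above can be sketched as follows. The `Outcome` enum and the two callables are hypothetical stand-ins for the real worker interface; only the names AmbiguousWorkerDeath and Invalid come from the proposal.

```python
from enum import Enum


class Outcome(Enum):
    VALID = "valid"
    INVALID = "invalid"
    AMBIGUOUS_WORKER_DEATH = "ambiguous_worker_death"


def execute_with_retry(run_on_any_worker, run_on_reserved_worker):
    """Run once on a regular worker; on ambiguous death, retry on the
    worker with pre-reserved memory and treat a second death as invalid."""
    first = run_on_any_worker()
    if first is not Outcome.AMBIGUOUS_WORKER_DEATH:
        return first
    second = run_on_reserved_worker()
    if second is Outcome.AMBIGUOUS_WORKER_DEATH:
        return Outcome.INVALID
    return second
```

The reserved worker is only consulted after an ambiguous death, so in the common case it stays idle and adds no latency.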
Swap
Related to this, we should ensure memory won't get swapped out. In particular operators should be instructed to not enable swap. If a PVF worker gets swapped out to disk, we might run into timeout issues.
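A startup check for the swap recommendation could look like this, assuming the usual `/proc/meminfo` layout with a `SwapTotal:` line in kB. The function names and the sample text are hypothetical; a node might log a warning when `swap_enabled` returns True.

```python
SAMPLE_MEMINFO = (
    "MemTotal:       16316412 kB\n"
    "SwapTotal:       2097148 kB\n"
    "SwapFree:        2097148 kB\n"
)


def swap_total_kib(meminfo_text):
    """Return the SwapTotal value in KiB, or None if the line is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return None


def swap_enabled(meminfo_text):
    """True if the system reports a non-zero amount of swap."""
    total = swap_total_kib(meminfo_text)
    return total is not None and total > 0
```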