Prevent the audit event system from using too much memory #7329
Conversation
This seems useful and helpful by my reading. Thank you!
Can you please add some form of logging when the throttling takes place? I guess we'd need a mechanism to prevent flooding the logs when this happens.
I assume you did some manual testing to validate that all the events still make it under normal conditions, and that this behaves as expected during constrained conditions?
Thanks for the review!
Yeah, I did manual tests, sometimes forcing one of the threads to sleep a little to simulate it; it's not always easy to reproduce immediately. Something worth mentioning is that if osquery is able to use the CPU normally, you will still most likely see it die for using too much memory, even with this threshold. So really, this threshold comes into play in particular situations where the global amount of records over the scheduler time is not that high, but osquery internally reads them all very quickly and is then much slower in processing them.
I would've reserved this for VLOG with some limiting, but if we want to make it something that is normally visible (via LOG(), that is), I think this makes it slightly more complicated. For instance, if you are getting the message each and every time it happens, or if you're applying some low limit (1 message per second), it can be useful to correlate that event with something else. But if you're limiting it to, for instance, every 30 seconds or more to avoid flooding, then the relationship between what might've happened to osquery and the threshold being hit is most likely lost. So I guess the question is: if using VLOG(), how many times per second would you like to see it? And if it's using LOG()?
I should probably verify an assumption I'm making here: my assumption is that records not read from the netlink socket in time will be dropped. So this implies that when this throttling takes place in osquery, there may be process executions that are lost. Is that correct? If not, the rest of my commentary may not apply. In my understanding osquery users have two top priorities: (1) maintaining system performance, and (2) the integrity of the collected data. This PR helps to prioritize (1) by dropping some events to keep the system performing adequately. Because data integrity is such a priority, I think it's important that osquery alert the user that integrity was sacrificed in order to maintain performance. Practically, I think it should be logged at the default log level, and maybe once every minute? If it's a verbose-only log then it will be difficult for users to ever know their data integrity was sacrificed for perf. Please let me know if I'm misunderstanding anything or you disagree with my reasoning here.
I think the issue here is that there are several cases where the behavior changes wildly. The throttling mechanism here doesn't force any drop of events, but it may result in that, depending on how often the throttling happens and how continuous the high stream of events is (or the CPU constraint). We also have to take into account that there's the audit kernel backlog, which can be configured via a flag. At this point I think what could make sense is, as you say, a message emitted at most once per minute, which says that throttling happened and also how many times.
Force-pushed from 53f9895 to d1e59cc
@zwass I've added the messages. I went with showing the total amount of throttling time instead of a count, because I believe that's more indicative, since each throttling session might not have the same duration. I've also used two different messages, one as a WARNING and the other as a VLOG for debugging purposes, because, as the code comment suggests, the VLOG case might not cause further throttling or event loss at all.
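For illustration, here is a minimal sketch of how that kind of rate-limited reporting could be structured. It is not the actual osquery code: the class name, interval, and message wording are hypothetical; it only assumes the glog-style LOG()/VLOG() macros discussed above.

```cpp
#include <chrono>

#include <glog/logging.h>

// Hypothetical helper: accumulates how long a pipeline stage has been
// throttled and reports the total at most once per interval, so the log
// is not flooded when throttling happens continuously.
class ThrottleReporter {
 public:
  explicit ThrottleReporter(bool verbose_only) : verbose_only_(verbose_only) {}

  void addThrottledDuration(std::chrono::milliseconds duration) {
    // Illustrative interval: report at most once per minute.
    constexpr auto kReportInterval = std::chrono::seconds(60);

    accumulated_ += duration;

    const auto now = std::chrono::steady_clock::now();
    if (accumulated_.count() == 0 || now - last_report_ < kReportInterval) {
      return;
    }

    if (verbose_only_) {
      // Debug-only message: this throttle point may not imply event loss.
      VLOG(1) << "Audit record processing was throttled for "
              << accumulated_.count() << "ms since the last report";
    } else {
      // Visible by default: throttling here may have led to dropped events.
      LOG(WARNING) << "The audit publisher was throttled for "
                   << accumulated_.count() << "ms since the last report";
    }

    accumulated_ = std::chrono::milliseconds::zero();
    last_report_ = now;
  }

 private:
  const bool verbose_only_;
  std::chrono::milliseconds accumulated_{std::chrono::milliseconds::zero()};
  std::chrono::steady_clock::time_point last_report_{
      std::chrono::steady_clock::now()};
};
```

Usage would be one instance per throttle point, with `verbose_only` chosen based on whether that point can actually lead to event loss; which message maps to which point in the actual patch is up to the implementation, the sketch only shows the rate-limiting shape.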
The log levels make sense to me. Can we add a note about this to the documentation so that users can understand how to interpret those messages? Other than that I think it's good to merge.
Force-pushed from d1e59cc to 05629de
@zwass it seems like the feedback has been addressed, but was there anything else needed to approve this PR?
Looks great! Possibly a typo in the docs?
Force-pushed from 05629de to d155c8e
Force-pushed from d155c8e to 6d7001c
Looks like this was mostly approved prior, and fell by the wayside.
- Reduced the number of audit records read as a batch from 4096 to 1024, to reduce the average number of them that have to be kept in memory.
- Introduced a throttling mechanism between the netlink reader and the records processing, and then between the records processing and the audit publisher, which generates events. This ensures that it's not possible to have a backlog that keeps growing and becomes bigger than a certain amount.
- Prevented some unnecessary work from happening in the processing thread, by always sleeping for 1 second if there are no records to be processed.
We noticed that, in certain environments, osquery would sometimes start using an indefinitely increasing amount of memory.
This is caused by the presence of two internal backlog caches: one for records that have just been read from the netlink socket and still need to be processed, and another for records that have been processed but still need to be transformed into events by the publisher.
If the thread processing the records was slow enough, and continued to be so, then the intermediate cache between it and the netlink reader thread providing the records would grow indefinitely.
Although we haven't observed it, depending on the kernel thread scheduling, the same could happen between the publisher thread and the records processing thread.
That's why there are two points which attempt to throttle operations.
An additional note is that we've reduced the number of records read on each netlink reader loop from 4k to 1k, because this allows records to be processed sooner and keeps fewer of them in memory.
Moreover, if the threshold of records in the backlog is exceeded, the overshoot beyond the threshold is much smaller than before.
The reason we don't outright drop records once the threshold is exceeded is that we want to support spikes of events that may momentarily fill up the backlog; if the spike lasts for a short time, no events need to be dropped.
This fix should also slightly improve CPU usage in situations where the CPU load is temporarily high and the backlog is getting big. With the previous implementation, once the CPU load dropped and osquery could catch up, it might consume a high amount of CPU for long enough that the watchdog would decide to kill the worker process.
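To make the described mechanism more concrete, below is a minimal, self-contained sketch of the idea rather than the actual osquery implementation: the reader thread stops pulling new batches when the downstream backlog passes a threshold (leaving unread records in the kernel audit backlog), and the processing thread sleeps when it has nothing to do. All names, sizes, and sleep durations are illustrative.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Illustrative values only; they are not the ones used by osquery.
constexpr std::size_t kBatchSize = 1024;        // records read per loop
constexpr std::size_t kBacklogThreshold = 4096; // throttle above this size
constexpr auto kThrottleSleep = std::chrono::milliseconds(100);

struct AuditRecord {
  std::string raw;
};

std::mutex backlog_mutex;
std::deque<AuditRecord> backlog; // reader -> processor intermediate cache

// Stub standing in for a batched read from the netlink socket.
std::vector<AuditRecord> readFromNetlink(std::size_t max_records) {
  return {}; // a real implementation would receive up to max_records records
}

// Reader thread: before pulling another batch, check the downstream backlog.
// If it is above the threshold, sleep instead of reading, so the userspace
// cache stops growing and pending records stay in the kernel audit backlog.
// Nothing is dropped here; short spikes are simply absorbed later.
void readerLoop() {
  for (;;) {
    std::size_t pending = 0;
    {
      std::lock_guard<std::mutex> lock(backlog_mutex);
      pending = backlog.size();
    }

    if (pending >= kBacklogThreshold) {
      // Throttled: a real implementation would also accumulate this time
      // and report it periodically, as discussed above.
      std::this_thread::sleep_for(kThrottleSleep);
      continue;
    }

    auto batch = readFromNetlink(kBatchSize);

    std::lock_guard<std::mutex> lock(backlog_mutex);
    for (auto& record : batch) {
      backlog.push_back(std::move(record));
    }
  }
}

// Processing thread: drains the backlog; if there is nothing to process,
// sleep for a second instead of spinning.
void processorLoop() {
  for (;;) {
    std::deque<AuditRecord> work;
    {
      std::lock_guard<std::mutex> lock(backlog_mutex);
      work.swap(backlog);
    }

    if (work.empty()) {
      std::this_thread::sleep_for(std::chrono::seconds(1));
      continue;
    }

    // Parse the records and hand the results to the publisher, which
    // applies the same backlog check before accepting more work.
  }
}
```

The same threshold check sits between the processing thread and the publisher, which is the second throttling point mentioned above.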