Prevent the audit event system from using too much memory #7329
Conversation
This seems useful and helpful by my reading. Thank you!
Can you please add some form of logging when the throttling takes place? I guess we'd need a mechanism to prevent flooding the logs when this happens.
I assume you did some manual testing to validate that all the events still make it under normal conditions, and that this behaves as expected during constrained conditions?
Thanks for the review!
Yeah, I did manual tests, sometimes forcing one of the threads to sleep a little to simulate it; it's not always easy to reproduce immediately. Something worth mentioning is that if osquery is able to use the CPU normally, you will still most likely see it die for using too much memory, even with this threshold. So really, this threshold comes into play in particular situations where the global amount of records over the scheduler time is not that high, but osquery internally reads them all very quickly and is then much slower in processing them.
I would've reserved this for VLOG with some limiting, but if we want to make it something that is normally visible (via LOG(), that is), I think this makes it slightly more complicated. For instance, if you are getting the message each and every time it happens, or if you're applying some low limit (1 message per second), it can be useful to correlate that event with something else. But if you're limiting it to, for instance, every 30 seconds or more to avoid flooding, then the relationship between what might've happened to osquery and the threshold being hit is most likely lost. So I guess the question is: if using VLOG(), how many times per second would you like to see it? And if it's using LOG()?
I should probably verify an assumption I'm making here: my assumption is that records not read from the netlink socket in time will be dropped. So this implies that when this throttling takes place in osquery, there may be process executions that are lost. Is that correct? If not, the rest of my commentary may not apply. In my understanding osquery users have two top priorities: (1) maintaining system performance, and (2) the integrity of the collected data. This PR helps to prioritize (1) by dropping some events to keep the system performing adequately. Because data integrity is such a priority, I think it's important that osquery alert the user that integrity was sacrificed in order to maintain performance. Practically, I think it should be logged at the default log level, and maybe once every minute? If it's a verbose-only log then it will be difficult for users to ever know their data integrity was sacrificed for perf. Please let me know if I'm misunderstanding anything or you disagree with my reasoning here.
I think the issue here is that there are several cases where the behavior changes wildly. The throttling mechanism here doesn't force any drop of events, but it may result in that, depending on how often the throttling happens and how continuous the high stream of events is (or the CPU constraint). We also have to take into account that there's the audit kernel backlog, which can be configured via a flag. At this point I think what could make sense is, as you say, a message emitted at most once per minute, which says that throttling happened and also how many times.
Force-pushed from 53f9895 to d1e59cc
@zwass I've added the messages. I went with showing the total amount of throttling time instead of a count, because I believe that's more indicative, since each throttling session might not have the same duration. I've also used two different messages, one as a WARNING and the other as a VLOG for debugging purposes, because, as the code comment suggests, the VLOG case might not cause further throttling or event loss at all.
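For illustration, here is a minimal sketch of how that kind of rate-limited reporting could be structured. It is not the actual osquery code: the class name, interval, and message wording are hypothetical; it only assumes the glog-style LOG()/VLOG() macros discussed above.

```cpp
#include <chrono>

#include <glog/logging.h>

// Hypothetical helper: accumulates how long a pipeline stage has been
// throttled and reports the total at most once per interval, so the log
// is not flooded when throttling happens continuously.
class ThrottleReporter {
 public:
  explicit ThrottleReporter(bool verbose_only) : verbose_only_(verbose_only) {}

  void addThrottledDuration(std::chrono::milliseconds duration) {
    // Illustrative interval: report at most once per minute.
    constexpr auto kReportInterval = std::chrono::seconds(60);

    accumulated_ += duration;

    const auto now = std::chrono::steady_clock::now();
    if (accumulated_.count() == 0 || now - last_report_ < kReportInterval) {
      return;
    }

    if (verbose_only_) {
      // Debug-only message: this throttle point may not imply event loss.
      VLOG(1) << "Audit record processing was throttled for "
              << accumulated_.count() << "ms since the last report";
    } else {
      // Visible by default: throttling here may have led to dropped events.
      LOG(WARNING) << "The audit publisher was throttled for "
                   << accumulated_.count() << "ms since the last report";
    }

    accumulated_ = std::chrono::milliseconds::zero();
    last_report_ = now;
  }

 private:
  const bool verbose_only_;
  std::chrono::milliseconds accumulated_{std::chrono::milliseconds::zero()};
  std::chrono::steady_clock::time_point last_report_{
      std::chrono::steady_clock::now()};
};
```

Usage would be one instance per throttle point, with `verbose_only` chosen based on whether that point can actually lead to event loss; which message maps to which point in the actual patch is up to the implementation, the sketch only shows the rate-limiting shape.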
The log levels make sense to me. Can we add a note about this to the documentation so that users can understand how to interpret those messages? Other than that I think it's good to merge.
Force-pushed from d1e59cc to 05629de
@zwass it seems like the feedback has been addressed, but was there anything else needed to approve this PR?
Looks great! Possibly a typo in the docs?
Force-pushed from 05629de to d155c8e
Force-pushed from d155c8e to 6d7001c
Looks like this was mostly approved prior, and fell by the wayside.
- Reduced the number of audit records read as a batch from 4096 to 1024, to reduce the average number of them that have to be kept in memory.
- Introduced a throttling mechanism between the netlink reader and the records processing, and then between the records processing and the audit publisher, which generates events. This ensures that it's not possible to have a backlog that keeps growing and becomes bigger than a certain amount.
- Prevented some unnecessary work from happening in the processing thread, by always sleeping for 1 second if there are no records to be processed.
We noticed that, in certain environments, osquery would sometimes start using an indefinitely increasing amount of memory.
This is caused by the presence of two internal backlog caches: one for records that have just been read from the netlink socket and still need to be processed, and another for records that have been processed but still need to be transformed into events by the publisher.
If the thread processing the records was slow enough, and continued to be so, then the intermediate cache between it and the netlink reader thread providing the records would grow indefinitely.
Although we haven't observed it, depending on the kernel thread scheduling, the same could happen between the publisher thread and the records processing thread.
That's why there are two points which attempt to throttle operations.
An additional note is that we've reduced the number of records read on each netlink reader loop from 4k to 1k, because this allows records to be processed sooner and keeps fewer of them in memory.
Moreover, if the threshold of records in the backlog is exceeded, the overshoot beyond the threshold is much smaller than before.
The reason we don't outright drop records once the threshold is exceeded is that we want to support spikes of events that may momentarily fill up the backlog; if the spike lasts for a short time, no events need to be dropped.
This fix should also slightly improve CPU usage in situations where the CPU load is temporarily high and the backlog is getting big. With the previous implementation, once the CPU load dropped and osquery could catch up, it might consume a high amount of CPU for long enough that the watchdog would decide to kill the worker process.
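To make the described mechanism more concrete, below is a minimal, self-contained sketch of the idea rather than the actual osquery implementation: the reader thread stops pulling new batches when the downstream backlog passes a threshold (leaving unread records in the kernel audit backlog), and the processing thread sleeps when it has nothing to do. All names, sizes, and sleep durations are illustrative.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Illustrative values only; they are not the ones used by osquery.
constexpr std::size_t kBatchSize = 1024;        // records read per loop
constexpr std::size_t kBacklogThreshold = 4096; // throttle above this size
constexpr auto kThrottleSleep = std::chrono::milliseconds(100);

struct AuditRecord {
  std::string raw;
};

std::mutex backlog_mutex;
std::deque<AuditRecord> backlog; // reader -> processor intermediate cache

// Stub standing in for a batched read from the netlink socket.
std::vector<AuditRecord> readFromNetlink(std::size_t max_records) {
  return {}; // a real implementation would receive up to max_records records
}

// Reader thread: before pulling another batch, check the downstream backlog.
// If it is above the threshold, sleep instead of reading, so the userspace
// cache stops growing and pending records stay in the kernel audit backlog.
// Nothing is dropped here; short spikes are simply absorbed later.
void readerLoop() {
  for (;;) {
    std::size_t pending = 0;
    {
      std::lock_guard<std::mutex> lock(backlog_mutex);
      pending = backlog.size();
    }

    if (pending >= kBacklogThreshold) {
      // Throttled: a real implementation would also accumulate this time
      // and report it periodically, as discussed above.
      std::this_thread::sleep_for(kThrottleSleep);
      continue;
    }

    auto batch = readFromNetlink(kBatchSize);

    std::lock_guard<std::mutex> lock(backlog_mutex);
    for (auto& record : batch) {
      backlog.push_back(std::move(record));
    }
  }
}

// Processing thread: drains the backlog; if there is nothing to process,
// sleep for a second instead of spinning.
void processorLoop() {
  for (;;) {
    std::deque<AuditRecord> work;
    {
      std::lock_guard<std::mutex> lock(backlog_mutex);
      work.swap(backlog);
    }

    if (work.empty()) {
      std::this_thread::sleep_for(std::chrono::seconds(1));
      continue;
    }

    // Parse the records and hand the results to the publisher, which
    // applies the same backlog check before accepting more work.
  }
}
```

The same threshold check sits between the processing thread and the publisher, which is the second throttling point mentioned above.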