Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent the audit event system from using too much memory #7329

Conversation

Smjert
Copy link
Member

@Smjert Smjert commented Sep 29, 2021

  • Reduced the amount of audit records read as a batch
    from 4096 to 1024, to reduce the average number of them
    that has to be kept in memory.

  • Introduced a throttling mechanism between the netlink reader
    and the records processing, and then between the records processing
    and the audit publisher, which generates events.
    This is to ensure that it's not possible to have a backlog that's
    always increasing and that's bigger of a certain amount.

  • Prevent some unnecessary work happening in the processing thread,
    by always sleeping for 1 second if there's no records to be processed.

We noticed that in environments where:

  1. Audit is used and a continuous amount of audit records is captured
  2. There's a limited CPU availability due to few cores and high CPU load, or the osquery worker process CPU usage is limited through the use of cgroups

Then osquery would sometime start using an indefinitely increasing amount of memory.

This is caused by the presence of two internal backlog caches, one for records to be processed after being just read from the netlink socket, and the other for records that have been processed but that need to be transformed into events by the publisher.
If the thread processing the records was slow enough, and continued to be so, then the intermediate cache between it and the netlink reader thread which was providing the records, would indefinitely increase.
Although we haven't observed it, depending on the kernel thread scheduling, the same could happen between the publisher thread and the records processing thread.
That's why there are two points which attempt to throttle operations.

An additional note is that we've reduced the amount of records read on each netlink reader loop from 4k to 1k, because this permits to process records sooner and keep less of them in memory.
Moreover if the threshold of records in the backlog is passed, the amount beyond the threshold is much smaller than before.

The reason why we don't outright drop records when beyond the threshold, it's because we want to support spikes of events that may momentarily fill up the backlog, and in that situation if the spike lasts for a short time, then no events need to be dropped.

This fix should also slightly improve CPU usage in a situation where the CPU load is temporarily high and the backlog is getting big. In this situation with the previous implementation, when the CPU load drops and osquery can catch up, it might consume a high amount of CPU, for enough time for the watchdog to decide that the worker process has to be killed.

@Smjert Smjert added Linux events Related to osquery's evented tables or eventing subsystem performance labels Sep 29, 2021
@Smjert Smjert marked this pull request as ready for review November 8, 2021 18:36
@Smjert Smjert requested review from a team as code owners November 8, 2021 18:36
Copy link
Member

@zwass zwass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems useful and helpful by my reading. Thank you!

Can you please add some form of logging when the throttling takes place? I guess we'd need a mechanism to prevent flooding the logs when this happens.

I assume you did some manual testing to validate that all the events still make it under normal conditions, and that this behaves as expected during constrained conditions?

@Smjert
Copy link
Member Author

Smjert commented Nov 10, 2021

Thanks for the review!

I assume you did some manual testing to validate that all the events still make it under normal conditions, and that this behaves as expected during constrained conditions?

Yeah I did manual tests, sometimes forcing one of the threads to sleep a little to simulate it; it's not always immediate to reproduce.
Obviously more testing is appreciated!

Something to mention is that if osquery is able to use CPU normally you will still most likely see it die for too much memory used, even with this threshold.
This is because the threshold only limits the backlog records that exists in a specific instant. So supposing that one has a query in the schedule that queries process_events every 5 seconds, it can still happen that the rate of records that come in and are unprocessed is higher than the threshold in each moment, so the throttling happens, but the accumulated amount of events to create rows from is enough after 5 seconds to consume all the memory.

So really, this threshold comes in for particular situations where the global amount of records over the scheduler time is not that high, but osquery internally reads them all very quickly but then is much slower in processing them.
This again seems to normally happen when you limit osquery CPU with cgroups and you're also are using a VM with few cores (2 or similar).

Can you please add some form of logging when the throttling takes place? I guess we'd need a mechanism to prevent flooding the logs when this happens.

I would've reserved this for VLOG with some limiting, but if we want to make it something that is normally visible (via LOG() that is), I think this makes it slightly more complicated.
Like what is the case where you would like to know that the throttling is happening?

For instance, if you are getting the message each and everytime it happens, or if you're applying some low limit (1 message per second), it can be useful to correlate that event with maybe something else. But if you're limiting it to for instance every 30 seconds or more, to avoid flooding, then the relationship between what might've happened to osquery and the threshold being hit is most likely lost.

So I guess the question is, if using VLOG() how many times per seconds you would like to see it? And if it's using LOG()?

@zwass
Copy link
Member

zwass commented Nov 10, 2021

I should probably verify an assumption I'm making here:

My assumption is that records not read from the netlink socket in time will be dropped. So this implies that when this throttling takes place in osquery there may be process executions that are lost

Is that correct? If not, the rest of my commentary may not apply.

In my understanding osquery users have two top priorities:

(1) Maintaining system performance.
(2) Integrity of data collected.

This PR helps to prioritize (1) by dropping some events to keep the system performing adequately. Because the data integrity is such a priority, I think it's important that osquery alert the user that integrity was sacrificed in order to maintain performance.

Practically, I think it should be logged at the default log-level and maybe once every minute? If it's a verbose-only log then it will be difficult for users to ever know their data integrity was sacrificed for perf.

Please let me know if I'm misunderstanding anything or you disagree with my reasoning here.

@Smjert
Copy link
Member Author

Smjert commented Nov 10, 2021

I should probably verify an assumption I'm making here:

My assumption is that records not read from the netlink socket in time will be dropped. So this implies that when this throttling takes place in osquery there may be process executions that are lost

Is that correct? If not, the rest of my commentary may not apply.

In my understanding osquery users have two top priorities:

(1) Maintaining system performance. (2) Integrity of data collected.

This PR helps to prioritize (1) by dropping some events to keep the system performing adequately. Because the data integrity is such a priority, I think it's important that osquery alert the user that integrity was sacrificed in order to maintain performance.

Practically, I think it should be logged at the default log-level and maybe once every minute? If it's a verbose-only log then it will be difficult for users to ever know their data integrity was sacrificed for perf.

Please let me know if I'm misunderstanding anything or you disagree with my reasoning here.

I think the issue here is that there are several cases where behavior wildly changes. The throttling mechanism here doesn't force any drop of events, but it may result in that, depending how often the throttling happens and how continuous the high stream of events is (or the CPU constraint).

We have to take into account that there's also the audit kernel backlog (which can be configured via the flag audit_backlog_limit and is currently at 4096), which helps with the buffering of events.

At this point I think that what it could make sense is as you say a message that is given at most 1 minute, which says that throttling happened and also how many times.
The number of times would be an indicator of the throttling being possibly the cause of some event drops.

@Smjert Smjert force-pushed the stefano/improvement/audit-event-throttling branch from 53f9895 to d1e59cc Compare December 2, 2021 13:38
@Smjert
Copy link
Member Author

Smjert commented Dec 2, 2021

@zwass I've added the messages. I went with showing the total amount of throttling time instead of a count, because I believe that's more indicative, since each throttling session duration might not be the same.

I've also used two different messages, one as a WARNING, the other as a VLOG for debugging purposes, because as the comment suggests, the VLOG one might not cause further throttling and cause event loss at all.
Only the throttling of the Netlink reading could cause loss of events. Such throttling might also happen if the records processing throttling happens, but it's not guaranteed.

@Smjert Smjert closed this Jan 4, 2022
@Smjert Smjert reopened this Jan 4, 2022
Copy link
Member

@zwass zwass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log levels make sense to me. Can we add a note about this to the documentation so that users can understand how to interpret those messages? Other than that I think it's good to merge.

@Smjert Smjert force-pushed the stefano/improvement/audit-event-throttling branch from d1e59cc to 05629de Compare January 10, 2022 17:03
@Smjert Smjert added the ready for review Pull requests that are ready to be reviewed by a maintainer label Mar 15, 2022
@mike-myers-tob
Copy link
Member

@zwass it seems like the feedback has been addressed but was there anything else needed to approve this PR?

@mike-myers-tob mike-myers-tob added this to the 5.4.0 milestone May 5, 2022
zwass
zwass previously approved these changes May 6, 2022
Copy link
Member

@zwass zwass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Possibly a typo in the docs?

docs/wiki/deployment/process-auditing.md Outdated Show resolved Hide resolved
@Smjert Smjert force-pushed the stefano/improvement/audit-event-throttling branch from 05629de to d155c8e Compare May 6, 2022 07:38
- Reduced the amount of audit records read as a batch
  from 4096 to 1024, to reduce the average number of them
  that has to be kept in memory.

- Introduced a throttling mechanism between the netlink reader
  and the records processing, and then between the records processing
  and the audit publisher, which generates events.
  This is to ensure that it's not possible to have a backlog that's
  always increasing and that's bigger of a certain amount.

- Prevent some unnecessary work happening in the processing thread,
  by always sleeping for 1 second if there's no records to be processed.
@Smjert Smjert force-pushed the stefano/improvement/audit-event-throttling branch from d155c8e to 6d7001c Compare May 6, 2022 07:41
@Smjert Smjert modified the milestones: 5.4.0, 5.5.0 Jul 6, 2022
Copy link
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this was mostly approved prior, and fell by.

@mike-myers-tob mike-myers-tob merged commit db7358f into osquery:master Aug 4, 2022
@mike-myers-tob mike-myers-tob deleted the stefano/improvement/audit-event-throttling branch August 4, 2022 03:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
events Related to osquery's evented tables or eventing subsystem Linux performance ready for review Pull requests that are ready to be reviewed by a maintainer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants