[CWS] Make the ebpf rate-limiter generic and use it for ptrace #32398
Conversation
Force-pushed from 2af1c5f to a323ec4 (Compare)
[Fast Unit Tests Report] On pipeline 51667691 (CI Visibility). The following jobs did not run any unit tests:
If you modified Go files and expected unit tests to run in these jobs, please double-check the job logs. If you think tests should have been executed, reach out to #agent-devx-help.
Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM: inv aws.create-vm --pipeline-id=51667691 --os-family=ubuntu
Note: this applies to commit ba121a1
Uncompressed package size comparison
Comparison with ancestor
Diff per package
Decision: ✅ Passed
Regression Detector
Regression Detector Results
Metrics dashboard
Baseline: 395cb8e
Optimization Goals: ✅ Improvement(s) detected
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +1.14 | [+0.46, +1.83] | 1 | Logs |
➖ | quality_gate_idle | memory utilization | +0.39 | [+0.35, +0.42] | 1 | Logs bounds checks dashboard |
➖ | quality_gate_idle_all_features | memory utilization | +0.25 | [+0.16, +0.34] | 1 | Logs bounds checks dashboard |
➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.17 | [-0.62, +0.97] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | +0.08 | [-0.68, +0.85] | 1 | Logs |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.08 | [-0.68, +0.84] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | +0.05 | [-0.82, +0.91] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.04 | [-0.42, +0.51] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | +0.03 | [-0.85, +0.91] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.10, +0.12] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | -0.02 | [-0.65, +0.62] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | -0.03 | [-0.94, +0.87] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | -0.06 | [-0.13, +0.01] | 1 | Logs |
➖ | file_tree | memory utilization | -0.22 | [-0.35, -0.09] | 1 | Logs |
➖ | otel_to_otel_logs | ingress throughput | -0.48 | [-1.17, +0.22] | 1 | Logs |
✅ | quality_gate_logs | % cpu utilization | -7.82 | [-10.91, -4.72] | 1 | Logs |
Bounds Checks: ✅ Passed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | lost_bytes | 10/10 | |
✅ | quality_gate_logs | memory_usage | 10/10 | |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- only if all of the following criteria are true (a worked example follows the list):
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that, if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between the baseline and comparison variants.
- Its configuration does not mark it "erratic".
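As a worked example from the table above: quality_gate_logs has Δ mean % = -7.82 with a CI of [-10.91, -4.72]; |Δ mean %| ≥ 5.00% and the interval does not contain zero, so the change is flagged as a significant improvement (✅). By contrast, uds_dogstatsd_to_api_cpu (+1.14, CI [+0.46, +1.83]) has an interval that excludes zero but |Δ mean %| < 5.00%, so it is reported as no significant change (➖).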
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
@@ -41,7 +41,8 @@ BPF_HASH_MAP(secprofs_syscalls, u64, struct security_profile_syscalls_t, 1) // m
 BPF_HASH_MAP(auid_approvers, u32, struct event_mask_filter_t, 128)
 BPF_HASH_MAP(auid_range_approvers, u32, struct u32_range_filter_t, EVENT_MAX)

-BPF_LRU_MAP(activity_dump_rate_limiters, u64, struct activity_dump_rate_limiter_ctx, 1) // max entries will be overridden at runtime
+BPF_LRU_MAP(activity_dump_rate_limiters, u64, struct rate_limiter_ctx, 1) // max entries will be overridden at runtime
+BPF_LRU_MAP(rate_limiters, u32, struct rate_limiter_ctx, 1) // max entries will be overridden at runtime
see my comment about a common map and function
I've thought about it, and my take is that, since the use cases are a bit different, IMHO it makes sense to keep them separate:
- a first one with a 64-bit cookie key, sized to MaxTracedCgroupsCount
- a second one with a 32-bit pid key, sized like the other pid caches
This way, we ensure that AD entries are not evicted under heavy process activity.
Also, the rates should be different (number of syscalls made by a whole container vs. number of syscalls made by a single process).
WDYT?
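For illustration, here is a minimal, self-contained sketch (plain C, not the actual eBPF code) of how a single context struct and a common helper could be shared by both maps while keeping the maps themselves separate; the struct fields and the helper name are assumptions for this example, not the real implementation.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

// Shared context struct reused by both LRU maps (AD cookie-keyed and pid-keyed).
struct rate_limiter_ctx {
    uint64_t current_period; // start of the current 1-second window, in ns
    uint32_t counter;        // events already allowed in this window
    uint32_t padding;
};

// Returns 1 if the event is allowed, 0 if it should be dropped. Each caller
// passes its own rate: a per-container rate for activity dumps, a per-process
// rate for the generic limiter.
static int rate_limiter_allow(struct rate_limiter_ctx *ctx, uint64_t now, uint32_t rate) {
    if (now > ctx->current_period + NSEC_PER_SEC) {
        // New 1-second window: reset the counter.
        ctx->current_period = now;
        ctx->counter = 0;
    }
    if (ctx->counter >= rate) {
        return 0; // budget for this window is exhausted
    }
    ctx->counter++;
    return 1;
}
```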
Force-pushed from a323ec4 to efac26a (Compare)
/merge
Devflow running:
What does this PR do?
It makes the rate-limiter developed for activity dumps generic, letting us use it for other purposes.
The AD limiter keeps its own map and won't be impacted by the change.
For now, the generic one has a rate of 100 events/sec/pid (the rate is not configurable yet).
Ptrace is the first syscall to benefit from it: except for a given subset of requests we don't want to miss, all ptrace calls are now rate limited.
Also, I've simplified the rate-limiter algorithms down to a single basic one (having different ones no longer makes sense).
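To make the ptrace part concrete, here is a rough, self-contained sketch (plain C, hypothetical helper names, not the actual agent code) of the intended control flow: a small set of requests is always reported, and everything else goes through the generic per-pid limiter at 100 events/sec.

```c
#include <stdint.h>
#include <sys/ptrace.h>

#define RATE_LIMITER_DEFAULT_RATE 100 // events/sec/pid, not configurable yet

// Stand-ins for the real map lookup and event submission; names are hypothetical.
static int pid_rate_limiter_allow(uint32_t pid, uint32_t rate) { (void)pid; (void)rate; return 1; }
static int send_ptrace_event(uint32_t pid, uint32_t request) { (void)pid; (void)request; return 1; }

// Requests we never want to miss bypass the limiter; which requests belong here
// is an assumption for this example.
static int is_always_reported(uint32_t request) {
    return request == PTRACE_ATTACH || request == PTRACE_TRACEME;
}

static int handle_ptrace(uint32_t pid, uint32_t request) {
    if (is_always_reported(request)) {
        return send_ptrace_event(pid, request);
    }
    // Everything else goes through the generic per-pid rate limiter.
    if (!pid_rate_limiter_allow(pid, RATE_LIMITER_DEFAULT_RATE)) {
        return 0; // over the 100 events/sec budget for this pid: drop the event
    }
    return send_ptrace_event(pid, request);
}
```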
Motivation
Ptrace can be VERY noisy and have a huge impact on the perf buffer and the agent itself. This change will drastically reduce the ptrace pressure.
Describe how you validated your changes
Possible Drawbacks / Trade-offs
Additional Notes