-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CONTP-338] Agent sidecar securityContext readOnlyRootFilesystem default setup #31257
Conversation
d4c6af0
to
f75a640
Compare
Test changes on VMUse this command from test-infra-definitions to manually test this PR changes on a VM: inv aws.create-vm --pipeline-id=51334945 --os-family=ubuntu Note: This applies to commit 6bb8efe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with minor suggestions, thanks!
releasenotes/notes/agent-sidecar-security-cbfd5ea9f72124d0.yaml
Outdated
Show resolved
Hide resolved
releasenotes/notes/agent-sidecar-security-cbfd5ea9f72124d0.yaml
Outdated
Show resolved
Hide resolved
e6dc815
to
48a09b8
Compare
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 5d81b5d Optimization Goals: ✅ No significant changes detected
|
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | otel_to_otel_logs | ingress throughput | +1.09 | [+0.38, +1.81] | 1 | Logs |
➖ | quality_gate_logs | % cpu utilization | +0.98 | [-1.97, +3.92] | 1 | Logs |
➖ | file_tree | memory utilization | +0.94 | [+0.80, +1.08] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.33 | [-0.44, +1.10] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.27 | [-0.20, +0.73] | 1 | Logs |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +0.21 | [-0.51, +0.92] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | +0.13 | [-0.78, +1.05] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | +0.11 | [-0.66, +0.88] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | +0.07 | [-0.81, +0.94] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | +0.06 | [-0.58, +0.69] | 1 | Logs |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.05 | [-0.71, +0.81] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | +0.04 | [-0.79, +0.87] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.10, +0.10] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | -0.07 | [-0.14, -0.01] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | -0.31 | [-0.44, -0.18] | 1 | Logs bounds checks dashboard |
➖ | quality_gate_idle | memory utilization | -0.68 | [-0.73, -0.63] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | lost_bytes | 10/10 | |
✅ | quality_gate_logs | memory_usage | 10/10 |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go
Outdated
Show resolved
Hide resolved
return true | ||
} | ||
securityContext := (*w.profileOverrides)[0].SecurityContext | ||
return securityContext == nil || securityContext.ReadOnlyRootFilesystem == nil || *securityContext.ReadOnlyRootFilesystem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I'm following here.
If securityContext == nil
, why would we return true? If we have no security context then this is not a ReadOnlyFilteSystem by default
Same if securityContext.ReadOnlyFilesystem == nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user supplies a value in profileOverride.securityContext.ReadOnlyFilesystem, then we will use that value. Otherwise we will default to true. We need to check whether first profileOverrides
exists and only has 1 entry. Then we must check that securityContext
is defined and additionally if readOnlyRootFilesystem
was set. When none of those objects exist, we will default to true.
I can more explicitly state that we want the default value to be true in the function's comment description.
pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go
Outdated
Show resolved
Hide resolved
pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go
Outdated
Show resolved
Hide resolved
pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go
Outdated
Show resolved
Hide resolved
w.addDefaultSidecarSecurity(agentSidecarContainer) | ||
pod.Spec.Volumes = append(pod.Spec.Volumes, *w.getDefaultSidecarVolumeTemplate()) | ||
// Don't want to apply any overrides to the agent sidecar init container | ||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: why do we need a defer function here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add the initContainer to the structure here. Then the applyProviderOverrides
will affect the init container's definition. This specifically happens at applyProviderOverrides
in the fargate
case when applyFargateOverrides
calls common.InjectVolume(pod, volume, volumeMount)
and it adds a VolumeMount to that initContainers here. This initContainer does not need this VolumeMount for APM and DogStatsD sockets.
pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gabedos
I already left some comments.
But after reading all the PR, I think I understand that after this PR we will be setting the securityContext to ReadRootFileSystem by default.
I thought the goal was just to allow the user to set the security context to ReadRootFileSystem: true
when needed.
Any idea why we are setting this by default now?
Hi @adel121! Thanks for taking a look at my PR. The main goal of this PR was to enable this security feature by default on the sidecar agent. We want to limit the scope of the sidecar agent access to the root filesystem because it is best practice to minimize permissions. However, we still want to provide advanced users with the ability to modify the securityContext of the sidecar agent so this is why there is also lots of changes to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
20bd4f7
to
4a53e02
Compare
Package size comparisonComparison with ancestor Diff per package
Decision✅ Passed |
/trigger-ci --variable RUN_ALL_BUILDS=true --variable RUN_KITCHEN_TESTS=true --variable RUN_E2E_TESTS=on --variable RUN_UNIT_TESTS=on --variable RUN_KMT_TESTS=on |
Devflow running:
|
[Fast Unit Tests Report] On pipeline 51334945 (CI Visibility). The following jobs did not run any unit tests: Jobs:
If you modified Go files and expected unit tests to run in these jobs, please double check the job logs. If you think tests should have been executed reach out to #agent-devx-help |
Uncompressed package size comparisonComparison with ancestor Diff per package
Decision |
fe95db1
to
24f1073
Compare
/merge |
Devflow running:
|
What does this PR do?
Enables
readOnlyRootFilesystem:true
for sidecar agent and applies all relevant volumes and initContainers to support this feature. Also allows advanced users to disable this securityContext by overriding the profile by settingreadOnlyRootFilesystem:false
.Motivation
Support case ticket about customer want to configure this securityContext. Limiting the agent's scope is also best practice.
Describe how to test/QA your changes
A majority of the updates to the profile processing and ensuring the sidecar container is properly configured is covered by unit tests. However, we can manually verify this is working in a local minikube cluster with the sidecar agent injection.
values.yaml
k apply -f deploy.yaml -n fargate
etc/datadog
Possible Drawbacks / Trade-offs
An additional init container will run for each time the sidecar agent spins up. However, it uses the same agent image so there aren't concerns about time to load the image. This initContainer runs for ~0 sec.
Additional Notes
Note: you might need to restart your cluster-agent and make sure it's using the same token that you're providing to the node agent. Look into the pod manifest and search for secretKeyRef to see where they are being pulled from.