[CONTP-338] Agent sidecar securityContext readOnlyRootFilesystem default setup #31257

Merged: 16 commits into main from gabedos/readonly-agent-sidecar on Dec 19, 2024

@gabedos gabedos commented Nov 20, 2024

What does this PR do?

Enables readOnlyRootFilesystem: true for the sidecar agent and adds the volumes and initContainers needed to support it. Advanced users can still disable this securityContext setting by overriding the profile with readOnlyRootFilesystem: false.

Motivation

A support case in which a customer wanted to configure this securityContext. Limiting the agent's scope is also best practice.

Describe how to test/QA your changes

Most of the changes to profile processing, and the checks that the sidecar container is properly configured, are covered by unit tests. We can also manually verify the behavior in a local minikube cluster with sidecar agent injection enabled.

  1. Load the following cluster agent configuration (values.yaml):
datadog:
  kubelet:
    tlsVerify: false
  clusterName: <INSERT_CLUSTER_NAME>
agents:
  enabled: false
clusterAgent:
  image:
    repository: <INSERT_REPO>
    tag: <INSERT_TAG>
  enabled: true
  replicas: 1
  admissionController:
    agentSidecarInjection:
      enabled: true
      provider: fargate
  1. Set up the fake Fargate configuration to allow the agent to be deployed:
kubectl create namespace fargate
kubectl create secret generic datadog-secret -n datadog-agent-helm \
        --from-literal api-key=$DD_API_KEY --from-literal token=random32characterstringfortoken1
kubectl create secret generic datadog-secret -n fargate \
        --from-literal api-key=$DD_API_KEY --from-literal token=random32characterstringfortoken1
  1. Deploy a workload that will receive an agent sidecar (k apply -f deploy.yaml -n fargate):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
        agent.datadoghq.com/sidecar: fargate
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
  1. Confirm the agent is loaded and running properly:
k exec -it nginx-deployment-xxxxx -c datadog-agent-injected -n fargate -- agent status

Collector
============
    Running Checks
    ============
    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 32
      Metric Samples: Last Run: 9, Total: 281
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2024-11-20 19:29:47 UTC (1732130987000)
      Last Successful Execution Date : 2024-11-20 19:29:47 UTC (1732130987000)
  1. Confirm the configuration files are properly placed in /etc/datadog-agent:
k exec -it nginx-deployment-xxxxx -c datadog-agent-injected -n fargate -- /bin/bash
root@nginx-deployment-xxxxx:/# ls etc/datadog-agent/

auth_token    conf.d                datadog-docker.yaml      datadog.yaml          install_info                 selinux
checks.d      datadog-ci.yaml       datadog-ecs.yaml         datadog.yaml.example  runtime-security.d           system-probe.yaml.example
compliance.d  datadog-cluster.yaml  datadog-kubernetes.yaml  install.json          security-agent.yaml.example

Possible Drawbacks / Trade-offs

An additional init container runs each time the sidecar agent spins up. However, it uses the same agent image, so image pull time is not a concern. This initContainer runs for ~0 sec.

Additional Notes

Note: you might need to restart your cluster-agent and make sure it uses the same token that you provide to the node agent. Search the pod manifest for secretKeyRef to see where the tokens are pulled from.

@github-actions github-actions bot added medium review PR review might take time team/container-platform The Container Platform Team labels Nov 20, 2024
@gabedos gabedos force-pushed the gabedos/readonly-agent-sidecar branch from d4c6af0 to f75a640 Compare November 20, 2024 01:13

agent-platform-auto-pr bot commented Nov 20, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv aws.create-vm --pipeline-id=51334945 --os-family=ubuntu

Note: This applies to commit 6bb8efe

@gabedos gabedos changed the title Agent sidecar securityContext readOnlyRootFilesystem default setup [CONTP-388] Agent sidecar securityContext readOnlyRootFilesystem default setup Nov 20, 2024
@gabedos gabedos changed the title [CONTP-388] Agent sidecar securityContext readOnlyRootFilesystem default setup [CONTP-338] Agent sidecar securityContext readOnlyRootFilesystem default setup Nov 20, 2024

@jhgilbert jhgilbert left a comment

Approved with minor suggestions, thanks!

@gabedos gabedos force-pushed the gabedos/readonly-agent-sidecar branch from e6dc815 to 48a09b8 Compare November 20, 2024 20:17

cit-pr-commenter bot commented Nov 20, 2024

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 52154bd4-0ec3-4202-91a9-6f8d5f06b0f8

Baseline: 5d81b5d
Comparison: 6bb8efe
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| | otel_to_otel_logs | ingress throughput | +1.09 | [+0.38, +1.81] | 1 | Logs |
| | quality_gate_logs | % cpu utilization | +0.98 | [-1.97, +3.92] | 1 | Logs |
| | file_tree | memory utilization | +0.94 | [+0.80, +1.08] | 1 | Logs |
| | file_to_blackhole_1000ms_latency | egress throughput | +0.33 | [-0.44, +1.10] | 1 | Logs |
| | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.27 | [-0.20, +0.73] | 1 | Logs |
| | uds_dogstatsd_to_api_cpu | % cpu utilization | +0.21 | [-0.51, +0.92] | 1 | Logs |
| | file_to_blackhole_0ms_latency_http1 | egress throughput | +0.13 | [-0.78, +1.05] | 1 | Logs |
| | file_to_blackhole_500ms_latency | egress throughput | +0.11 | [-0.66, +0.88] | 1 | Logs |
| | file_to_blackhole_0ms_latency | egress throughput | +0.07 | [-0.81, +0.94] | 1 | Logs |
| | file_to_blackhole_300ms_latency | egress throughput | +0.06 | [-0.58, +0.69] | 1 | Logs |
| | file_to_blackhole_100ms_latency | egress throughput | +0.05 | [-0.71, +0.81] | 1 | Logs |
| | file_to_blackhole_0ms_latency_http2 | egress throughput | +0.04 | [-0.79, +0.87] | 1 | Logs |
| | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | 1 | Logs |
| | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.10, +0.10] | 1 | Logs |
| | tcp_syslog_to_blackhole | ingress throughput | -0.07 | [-0.14, -0.01] | 1 | Logs |
| | quality_gate_idle_all_features | memory utilization | -0.31 | [-0.44, -0.18] | 1 | Logs, bounds checks dashboard |
| | quality_gate_idle | memory utilization | -0.68 | [-0.73, -0.63] | 1 | Logs, bounds checks dashboard |

Bounds Checks: ✅ Passed

| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
| | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
| | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
| | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
| | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
| | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
| | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
| | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| | quality_gate_logs | lost_bytes | 10/10 | |
| | quality_gate_logs | memory_usage | 10/10 | |

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
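
The three criteria above can be sketched as a single predicate. This is only an illustration of the stated rule, not the detector's actual code: the Experiment type, its field names, and isRegression are hypothetical, and the 5.00% tolerance is the value quoted in this report.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment is a hypothetical record holding one row of the results table.
type Experiment struct {
	Name      string
	DeltaMean float64 // Δ mean %
	CILow     float64 // lower bound of the 90% CI
	CIHigh    float64 // upper bound of the 90% CI
	Erratic   bool    // true if the experiment's configuration marks it "erratic"
}

// effectSizeTolerance is the |Δ mean %| threshold quoted in the report.
const effectSizeTolerance = 5.0

// isRegression applies the three criteria: the effect is large enough,
// the confidence interval excludes zero, and the experiment is not erratic.
func isRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMean) >= effectSizeTolerance
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}

func main() {
	// Values taken from the otel_to_otel_logs row above.
	e := Experiment{Name: "otel_to_otel_logs", DeltaMean: 1.09, CILow: 0.38, CIHigh: 1.81}
	fmt.Println(e.Name, isRegression(e))
}
```

For the otel_to_otel_logs row, the interval excludes zero but the effect is below the tolerance, so the predicate reports no regression, which matches the "No significant changes detected" summary.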

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.

return true
}
securityContext := (*w.profileOverrides)[0].SecurityContext
return securityContext == nil || securityContext.ReadOnlyRootFilesystem == nil || *securityContext.ReadOnlyRootFilesystem

@adel121 adel121 Nov 21, 2024

I'm not sure I'm following here.

If securityContext == nil, why would we return true? If we have no security context, then this is not a read-only root filesystem by default.

The same applies if securityContext.ReadOnlyRootFilesystem == nil.

@gabedos gabedos Nov 21, 2024

If the user supplies a value in profileOverride.securityContext.readOnlyRootFilesystem, we use that value. Otherwise we default to true. We first check that profileOverrides exists and has exactly one entry. Then we check that securityContext is defined and that readOnlyRootFilesystem was set. If any of those is missing, we default to true.

I can state more explicitly in the function's doc comment that we want the default value to be true.
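
To make the defaulting rule concrete, here is a minimal standalone sketch. It assumes simplified stand-in types (SecurityContext and ProfileOverride here are not the real Kubernetes or agent types) and a hypothetical helper name, and it only mirrors the behavior described in this thread: use the user's value when exactly one override sets it, otherwise default to true.

```go
package main

import "fmt"

// SecurityContext is a simplified stand-in; the pointer distinguishes
// "not set" (nil) from an explicit false.
type SecurityContext struct {
	ReadOnlyRootFilesystem *bool
}

// ProfileOverride is a simplified stand-in for the profile override entry.
type ProfileOverride struct {
	SecurityContext *SecurityContext
}

// readOnlyRootFilesystem (hypothetical name) returns the user-supplied value
// when exactly one override defines it, and defaults to true otherwise.
func readOnlyRootFilesystem(overrides []ProfileOverride) bool {
	if len(overrides) != 1 {
		return true // no usable override: default to read-only
	}
	sc := overrides[0].SecurityContext
	if sc == nil || sc.ReadOnlyRootFilesystem == nil {
		return true // field not set by the user: default to read-only
	}
	return *sc.ReadOnlyRootFilesystem
}

func main() {
	off := false
	explicit := []ProfileOverride{{SecurityContext: &SecurityContext{ReadOnlyRootFilesystem: &off}}}

	fmt.Println(readOnlyRootFilesystem(nil))                   // no overrides
	fmt.Println(readOnlyRootFilesystem([]ProfileOverride{{}})) // override without securityContext
	fmt.Println(readOnlyRootFilesystem(explicit))              // user opted out
}
```

Only the last call returns false, because it is the only case where the user explicitly set the field.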

w.addDefaultSidecarSecurity(agentSidecarContainer)
pod.Spec.Volumes = append(pod.Spec.Volumes, *w.getDefaultSidecarVolumeTemplate())
// Don't want to apply any overrides to the agent sidecar init container
defer func() {
Contributor

Question: why do we need a defer function here?

Contributor Author

If we added the initContainer to the pod spec here, applyProviderOverrides would then modify the init container's definition. Specifically, in the fargate case, applyFargateOverrides calls common.InjectVolume(pod, volume, volumeMount), which adds a VolumeMount to the initContainers as well. This initContainer does not need the VolumeMount for the APM and DogStatsD sockets, so it is appended in a deferred function, after the overrides have been applied.
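
The ordering effect can be shown with a small standalone sketch. Everything here is a simplified assumption rather than the real webhook code: Pod and Container are toy types, injectVolume only imitates the described behavior of common.InjectVolume (mounting into every container and init container present), and the init container and volume names are made up.

```go
package main

import "fmt"

// Container and Pod are toy stand-ins for the Kubernetes types.
type Container struct {
	Name         string
	VolumeMounts []string
}

type Pod struct {
	Containers     []Container
	InitContainers []Container
}

// injectVolume imitates the described behavior: it mounts the volume into
// every container and init container that exists on the pod at call time.
func injectVolume(pod *Pod, mount string) {
	for i := range pod.Containers {
		pod.Containers[i].VolumeMounts = append(pod.Containers[i].VolumeMounts, mount)
	}
	for i := range pod.InitContainers {
		pod.InitContainers[i].VolumeMounts = append(pod.InitContainers[i].VolumeMounts, mount)
	}
}

// injectSidecar adds the sidecar, then applies the provider override.
// The init container is appended in a deferred function, so it is added
// only after the override has run and never receives the extra mount.
func injectSidecar(pod *Pod) {
	pod.Containers = append(pod.Containers, Container{Name: "datadog-agent-injected"})

	defer func() {
		// Hypothetical init container name.
		pod.InitContainers = append(pod.InitContainers, Container{Name: "init-copy-agent-config"})
	}()

	// Hypothetical volume name standing in for the APM/DogStatsD sockets volume.
	injectVolume(pod, "ddsockets")
}

func main() {
	pod := &Pod{Containers: []Container{{Name: "nginx"}}}
	injectSidecar(pod)
	fmt.Println("sidecar mounts:", pod.Containers[1].VolumeMounts)
	fmt.Println("init container mounts:", pod.InitContainers[0].VolumeMounts)
}
```

Appending the init container after the override call would have the same effect in this toy version; the defer simply guarantees it runs last on every return path.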

@github-actions github-actions bot added long review PR is complex, plan time to review it and removed medium review PR review might take time labels Nov 21, 2024
@adel121 adel121 left a comment

Thanks @gabedos

I already left some comments.

But after reading the whole PR, I understand that it sets readOnlyRootFilesystem in the securityContext by default.

I thought the goal was just to allow the user to set readOnlyRootFilesystem: true in the security context when needed.

Any idea why we are setting this by default now?

gabedos commented Nov 21, 2024

Hi @adel121! Thanks for taking a look at my PR. The main goal of this PR is to enable this security feature by default on the sidecar agent. We want to limit the sidecar agent's access to the root filesystem because minimizing permissions is best practice. However, we still want advanced users to be able to modify the sidecar agent's securityContext, which is why there are also many changes to the ProfileOverride struct and its parsing.
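
As a rough illustration of what parsing such an override could look like, here is a sketch using only the standard library. The JSON shape, the struct definitions, and parseProfiles are assumptions made for this example and may not match the actual ProfileOverride struct or the configuration format in the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SecurityContext is a simplified stand-in. A *bool lets the parser
// distinguish an omitted field (nil) from an explicit false.
type SecurityContext struct {
	ReadOnlyRootFilesystem *bool `json:"readOnlyRootFilesystem,omitempty"`
}

// ProfileOverride is a simplified stand-in with only the field discussed here.
type ProfileOverride struct {
	SecurityContext *SecurityContext `json:"securityContext,omitempty"`
}

// parseProfiles (hypothetical) decodes a JSON list of profile overrides.
func parseProfiles(raw string) ([]ProfileOverride, error) {
	var profiles []ProfileOverride
	if err := json.Unmarshal([]byte(raw), &profiles); err != nil {
		return nil, fmt.Errorf("parsing profile overrides: %w", err)
	}
	return profiles, nil
}

func main() {
	// Assumed input format: a user opting out of the read-only default.
	profiles, err := parseProfiles(`[{"securityContext": {"readOnlyRootFilesystem": false}}]`)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("explicit value:", *profiles[0].SecurityContext.ReadOnlyRootFilesystem)
}
```

Keeping the field as a pointer is what allows "not set" to fall back to the read-only default while an explicit false is still honored.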

@gabedos gabedos marked this pull request as ready for review November 21, 2024 18:55
@gabedos gabedos requested review from a team as code owners November 21, 2024 18:55
@adel121 adel121 left a comment

LGTM

@gabedos gabedos added the qa/rc-required Only for a PR that requires validation on the Release Candidate label Dec 2, 2024
@gabedos gabedos force-pushed the gabedos/readonly-agent-sidecar branch from 20bd4f7 to 4a53e02 Compare December 10, 2024 10:27

agent-platform-auto-pr bot commented Dec 10, 2024

Package size comparison

Comparison with ancestor a7ba9110c9023ca31f2c5e913a7ff98ab31a9a5b

Diff per package
| package | diff | status | size | ancestor | threshold |
|---|---|---|---|---|---|
| datadog-agent-amd64-deb | -1.41MB | | 1270.66MB | 1272.08MB | 140.00MB |
| datadog-iot-agent-amd64-deb | 0.00MB | | 113.20MB | 113.20MB | 10.00MB |
| datadog-dogstatsd-amd64-deb | 0.00MB | | 78.32MB | 78.32MB | 10.00MB |
| datadog-heroku-agent-amd64-deb | -1.40MB | | 526.45MB | 527.85MB | 70.00MB |
| datadog-agent-x86_64-rpm | -1.41MB | | 1279.90MB | 1281.31MB | 140.00MB |
| datadog-agent-x86_64-suse | -1.41MB | | 1279.90MB | 1281.31MB | 140.00MB |
| datadog-iot-agent-x86_64-rpm | 0.00MB | | 113.26MB | 113.26MB | 10.00MB |
| datadog-iot-agent-x86_64-suse | 0.00MB | | 113.26MB | 113.26MB | 10.00MB |
| datadog-dogstatsd-x86_64-rpm | 0.00MB | | 78.40MB | 78.40MB | 10.00MB |
| datadog-dogstatsd-x86_64-suse | 0.00MB | | 78.40MB | 78.40MB | 10.00MB |
| datadog-agent-arm64-deb | -0.01MB | | 1004.84MB | 1004.85MB | 140.00MB |
| datadog-iot-agent-arm64-deb | 0.00MB | | 108.67MB | 108.67MB | 10.00MB |
| datadog-dogstatsd-arm64-deb | 0.00MB | | 55.59MB | 55.59MB | 10.00MB |
| datadog-agent-aarch64-rpm | -0.00MB | | 1014.06MB | 1014.06MB | 140.00MB |
| datadog-iot-agent-aarch64-rpm | 0.00MB | | 108.74MB | 108.74MB | 10.00MB |

Decision

✅ Passed

@gabedos gabedos added this to the 7.62.0 milestone Dec 11, 2024

gabedos commented Dec 11, 2024

/trigger-ci --variable RUN_ALL_BUILDS=true --variable RUN_KITCHEN_TESTS=true --variable RUN_E2E_TESTS=on --variable RUN_UNIT_TESTS=on --variable RUN_KMT_TESTS=on


dd-devflow bot commented Dec 11, 2024

Devflow running: /trigger-ci --variable RUN_ALL_BUILDS=true --varia...

View all feedbacks in Devflow UI.


2024-12-11 13:39:36 UTC ℹ️ Gitlab pipeline started

Started pipeline #50831448


agent-platform-auto-pr bot commented Dec 16, 2024

[Fast Unit Tests Report]

On pipeline 51334945 (CI Visibility). The following jobs did not run any unit tests:

Jobs:
  • tests_flavor_dogstatsd_deb-x64
  • tests_flavor_heroku_deb-x64
  • tests_flavor_iot_deb-x64

If you modified Go files and expected unit tests to run in these jobs, please double check the job logs. If you think tests should have been executed reach out to #agent-devx-help


agent-platform-auto-pr bot commented Dec 16, 2024

Uncompressed package size comparison

Comparison with ancestor 5d81b5d4377ca3be9dac2134700683eedb516337

Diff per package
| package | diff | status | size | ancestor | threshold |
|---|---|---|---|---|---|
| datadog-agent-x86_64-rpm | 0.00MB | ⚠️ | 1197.01MB | 1197.01MB | 140.00MB |
| datadog-agent-x86_64-suse | 0.00MB | ⚠️ | 1197.01MB | 1197.01MB | 140.00MB |
| datadog-agent-aarch64-rpm | 0.00MB | ⚠️ | 943.00MB | 943.00MB | 140.00MB |
| datadog-agent-amd64-deb | 0.00MB | ⚠️ | 1187.77MB | 1187.77MB | 140.00MB |
| datadog-agent-arm64-deb | 0.00MB | ⚠️ | 933.78MB | 933.78MB | 140.00MB |
| datadog-heroku-agent-amd64-deb | 0.00MB | ⚠️ | 505.05MB | 505.05MB | 70.00MB |
| datadog-dogstatsd-amd64-deb | 0.00MB | | 78.58MB | 78.58MB | 10.00MB |
| datadog-dogstatsd-x86_64-rpm | 0.00MB | | 78.65MB | 78.65MB | 10.00MB |
| datadog-dogstatsd-x86_64-suse | 0.00MB | | 78.65MB | 78.65MB | 10.00MB |
| datadog-dogstatsd-arm64-deb | 0.00MB | | 55.78MB | 55.78MB | 10.00MB |
| datadog-iot-agent-amd64-deb | 0.00MB | | 113.31MB | 113.31MB | 10.00MB |
| datadog-iot-agent-x86_64-rpm | 0.00MB | | 113.38MB | 113.38MB | 10.00MB |
| datadog-iot-agent-x86_64-suse | 0.00MB | | 113.38MB | 113.38MB | 10.00MB |
| datadog-iot-agent-arm64-deb | 0.00MB | | 108.78MB | 108.78MB | 10.00MB |
| datadog-iot-agent-aarch64-rpm | 0.00MB | | 108.84MB | 108.84MB | 10.00MB |

Decision

⚠️ Warning

@gabedos gabedos force-pushed the gabedos/readonly-agent-sidecar branch from fe95db1 to 24f1073 Compare December 17, 2024 15:17

gabedos commented Dec 19, 2024

/merge


dd-devflow bot commented Dec 19, 2024

Devflow running: /merge

View all feedbacks in Devflow UI.


2024-12-19 14:20:39 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 27m.


2024-12-19 14:56:57 UTC ℹ️ MergeQueue: This merge request was merged

@dd-mergequeue dd-mergequeue bot merged commit 1d0d5f3 into main Dec 19, 2024
231 checks passed
@dd-mergequeue dd-mergequeue bot deleted the gabedos/readonly-agent-sidecar branch December 19, 2024 14:56