[CONTP-338] Agent sidecar securityContext readOnlyRootFilesystem default setup #31257

gabedos · 2024-11-20T01:12:17Z

What does this PR do?

Enables readOnlyRootFilesystem: true by default for the sidecar agent and adds the volumes and initContainers needed to support it. Advanced users can still disable this securityContext setting by overriding the profile with readOnlyRootFilesystem: false.
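
For illustration, a minimal sketch of what a single-profile override that opts out of the default might look like once serialized. The `profileOverride` and `securityContext` types below are hypothetical, trimmed-down stand-ins (not the real structs from this PR), and the JSON shape is an assumption modeled on the Kubernetes securityContext field naming.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// securityContext is a hypothetical stand-in that carries only the one
// field discussed in this PR.
type securityContext struct {
	ReadOnlyRootFilesystem *bool `json:"readOnlyRootFilesystem,omitempty"`
}

// profileOverride is a hypothetical, trimmed-down stand-in for the
// ProfileOverride struct mentioned in the discussion.
type profileOverride struct {
	SecurityContext *securityContext `json:"securityContext,omitempty"`
}

// disableReadOnlyRootFS renders a single-profile override that explicitly
// sets readOnlyRootFilesystem to false.
func disableReadOnlyRootFS() (string, error) {
	readOnly := false
	profiles := []profileOverride{
		{SecurityContext: &securityContext{ReadOnlyRootFilesystem: &readOnly}},
	}
	out, err := json.Marshal(profiles)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	s, err := disableReadOnlyRootFS()
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

The pointer field matters here: a nil pointer means "not set" (so the default applies), while a pointer to false is an explicit opt-out and is still emitted despite `omitempty`.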

Motivation

A customer support case asked for a way to configure this securityContext. Limiting the agent's scope is also a best practice.

Describe how to test/QA your changes

Most of the changes to profile processing, and to ensuring the sidecar container is properly configured, are covered by unit tests. However, we can also verify the behavior manually in a local minikube cluster with sidecar agent injection.

Load the following cluster agent configuration (values.yaml):

datadog:
  kubelet:
    tlsVerify: false
  clusterName: <INSERT_CLUSTER_NAME>
agents:
  enabled: false
clusterAgent:
  image:
    repository: <INSERT_REPO>
    tag: <INSERT_TAG>
  enabled: true
  replicas: 1
  admissionController:
    agentSidecarInjection:
      enabled: true
      provider: fargate

Set up the fake Fargate configuration to allow the agent to be deployed:

kubectl create namespace fargate
kubectl create secret generic datadog-secret -n datadog-agent-helm \
        --from-literal api-key=$DD_API_KEY --from-literal token=random32characterstringfortoken1
kubectl create secret generic datadog-secret -n fargate \
        --from-literal api-key=$DD_API_KEY --from-literal token=random32characterstringfortoken1

Deploy a workload that will receive an agent sidecar, using k apply -f deploy.yaml -n fargate:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
        agent.datadoghq.com/sidecar: fargate
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80

Confirm the agent is loaded and running properly:

k exec -it nginx-deployment-xxxxx -c datadog-agent-injected -n fargate -- agent status

Collector
============
    Running Checks
    ============
    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 32
      Metric Samples: Last Run: 9, Total: 281
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2024-11-20 19:29:47 UTC (1732130987000)
      Last Successful Execution Date : 2024-11-20 19:29:47 UTC (1732130987000)

Confirm the configuration files are properly placed in /etc/datadog-agent:

k exec -it nginx-deployment-xxxxx -c datadog-agent-injected -n fargate -- /bin/bash

root@nginx-deployment-xxxxx:/# ls etc/datadog-agent/

auth_token    conf.d                datadog-docker.yaml      datadog.yaml          install_info                 selinux
checks.d      datadog-ci.yaml       datadog-ecs.yaml         datadog.yaml.example  runtime-security.d           system-probe.yaml.example
compliance.d  datadog-cluster.yaml  datadog-kubernetes.yaml  install.json          security-agent.yaml.example

Possible Drawbacks / Trade-offs

An additional init container runs each time the sidecar agent spins up. However, it uses the same agent image, so there is no concern about the time needed to load the image. This initContainer runs for ~0 sec.

Additional Notes

Note: you might need to restart your cluster-agent and make sure it's using the same token that you're providing to the node agent. Look at the pod manifest and search for secretKeyRef to see where the values are being pulled from.

agent-platform-auto-pr · 2024-11-20T01:39:47Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv aws.create-vm --pipeline-id=51334945 --os-family=ubuntu

Note: This applies to commit 6bb8efe

jhgilbert

Approved with minor suggestions, thanks!

releasenotes/notes/agent-sidecar-security-cbfd5ea9f72124d0.yaml

cit-pr-commenter · 2024-11-20T21:16:14Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 52154bd4-0ec3-4202-91a9-6f8d5f06b0f8

Baseline: 5d81b5d
Comparison: 6bb8efe
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	otel_to_otel_logs	ingress throughput	+1.09	[+0.38, +1.81]	1	Logs
➖	quality_gate_logs	% cpu utilization	+0.98	[-1.97, +3.92]	1	Logs
➖	file_tree	memory utilization	+0.94	[+0.80, +1.08]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	+0.33	[-0.44, +1.10]	1	Logs
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	+0.27	[-0.20, +0.73]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+0.21	[-0.51, +0.92]	1	Logs
➖	file_to_blackhole_0ms_latency_http1	egress throughput	+0.13	[-0.78, +1.05]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.11	[-0.66, +0.88]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.07	[-0.81, +0.94]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	+0.06	[-0.58, +0.69]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	+0.05	[-0.71, +0.81]	1	Logs
➖	file_to_blackhole_0ms_latency_http2	egress throughput	+0.04	[-0.79, +0.87]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.01, +0.01]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.10, +0.10]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	-0.07	[-0.14, -0.01]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	-0.31	[-0.44, -0.18]	1	Logs bounds checks dashboard
➖	quality_gate_idle	memory utilization	-0.68	[-0.73, -0.63]	1	Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	links
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http1	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http1	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http2	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http2	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	lost_bytes	10/10
✅	quality_gate_logs	memory_usage	10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
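
The three criteria above can be sketched as a small predicate. This is only an illustration of the stated rule, not the Regression Detector's actual code; the `experiment` type and its field names are hypothetical, and the 5.00% tolerance is taken from the explanation above.

```go
package main

import (
	"fmt"
	"math"
)

// experiment holds the values reported for one row of the results table.
// The type and field names are hypothetical.
type experiment struct {
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the 90% confidence interval
	CIHigh       float64 // upper bound of the 90% confidence interval
	Erratic      bool    // whether the experiment's configuration marks it "erratic"
}

// effectSizeTolerancePct is the |Δ mean %| threshold quoted in the report.
const effectSizeTolerancePct = 5.0

// isRegression applies the three criteria: the change is large enough,
// the confidence interval excludes zero, and the experiment is not erratic.
func isRegression(e experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerancePct
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && ciExcludesZero && !e.Erratic
}

func main() {
	// file_tree row from the table: +0.94 [+0.80, +1.08].
	fileTree := experiment{DeltaMeanPct: 0.94, CILow: 0.80, CIHigh: 1.08}
	fmt.Println(isRegression(fileTree)) // false: the interval excludes zero, but the change is below 5%
}
```

This shows why rows such as file_tree are reported as "no significant change" even though their confidence interval does not contain zero: the effect size stays under the tolerance.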

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.

pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go

pkg/clusteragent/admission/mutate/agent_sidecar/profiles.go

adel121 · 2024-11-21T06:38:41Z

pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go

+		return true
+	}
+	securityContext := (*w.profileOverrides)[0].SecurityContext
+	return securityContext == nil || securityContext.ReadOnlyRootFilesystem == nil || *securityContext.ReadOnlyRootFilesystem


I'm not sure I'm following here.

If securityContext == nil, why would we return true? If there is no security context, then the root filesystem is not read-only by default.

Same if securityContext.ReadOnlyRootFilesystem == nil.

If the user supplies a value in profileOverride.securityContext.ReadOnlyRootFilesystem, then we will use that value. Otherwise we default to true. We first need to check that profileOverrides exists and has exactly one entry. Then we check that securityContext is defined and that readOnlyRootFilesystem was set. If any of those is missing, we default to true.

I can more explicitly state that we want the default value to be true in the function's comment description.

pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go

adel121 · 2024-11-21T06:54:49Z

pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go

+			w.addDefaultSidecarSecurity(agentSidecarContainer)
+			pod.Spec.Volumes = append(pod.Spec.Volumes, *w.getDefaultSidecarVolumeTemplate())
+			// Don't want to apply any overrides to the agent sidecar init container
+			defer func() {


Question: why do we need a defer function here?

If we add the initContainer to the pod spec at this point, applyProviderOverrides will also affect the init container's definition. This happens specifically in the fargate case, where applyFargateOverrides calls common.InjectVolume(pod, volume, volumeMount), which adds a VolumeMount to the init containers as well. This initContainer does not need the VolumeMount for the APM and DogStatsD sockets.
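
The ordering concern can be shown with a stripped-down sketch. Everything here is hypothetical (the `pod` and `container` types, the `applyOverrides` helper standing in for applyProviderOverrides, and the container and mount names); it only demonstrates why deferring the append keeps the init container untouched by the overrides.

```go
package main

import "fmt"

// container and pod are hypothetical, minimal stand-ins for the Kubernetes types.
type container struct {
	Name         string
	VolumeMounts []string
}

type pod struct {
	InitContainers []container
	Containers     []container
}

// applyOverrides is a hypothetical stand-in for applyProviderOverrides: it adds
// a socket volume mount to every container present in the pod when it runs,
// init containers included.
func applyOverrides(p *pod) {
	for i := range p.InitContainers {
		p.InitContainers[i].VolumeMounts = append(p.InitContainers[i].VolumeMounts, "ddsockets")
	}
	for i := range p.Containers {
		p.Containers[i].VolumeMounts = append(p.Containers[i].VolumeMounts, "ddsockets")
	}
}

// inject adds the sidecar, then defers adding the init container so that it is
// appended only after the overrides have been applied.
func inject(p *pod) {
	p.Containers = append(p.Containers, container{Name: "datadog-agent-injected"})
	defer func() {
		// Runs when inject returns, i.e. after applyOverrides below.
		p.InitContainers = append(p.InitContainers, container{Name: "init-copy-agent-config"})
	}()
	applyOverrides(p)
}

func main() {
	p := &pod{}
	inject(p)
	fmt.Printf("sidecar mounts: %d, init container mounts: %d\n",
		len(p.Containers[0].VolumeMounts), len(p.InitContainers[0].VolumeMounts))
}
```

An alternative would be to append the init container explicitly after the override call; the deferred form keeps the init container setup next to the rest of the sidecar security setup.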

pkg/clusteragent/admission/mutate/agent_sidecar/agent_sidecar.go

adel121

Thanks @gabedos

I already left some comments.

But after reading the whole PR, I think I understand that after this PR we will be setting readOnlyRootFilesystem in the securityContext by default.

I thought the goal was just to allow the user to set readOnlyRootFilesystem: true in the security context when needed.

Any idea why we are setting this by default now?

gabedos · 2024-11-21T16:23:03Z

Hi @adel121! Thanks for taking a look at my PR. The main goal of this PR is to enable this security feature by default on the sidecar agent. We want to limit the sidecar agent's access to the root filesystem because it is best practice to minimize permissions. However, we still want to give advanced users the ability to modify the sidecar agent's securityContext, which is why there are also many changes to the ProfileOverride struct and its parsing.

adel121

LGTM

agent-platform-auto-pr · 2024-12-10T13:19:36Z

Package size comparison

Comparison with ancestor a7ba9110c9023ca31f2c5e913a7ff98ab31a9a5b

Diff per package

package	diff	status	size	ancestor	threshold
datadog-agent-amd64-deb	-1.41MB	✅	1270.66MB	1272.08MB	140.00MB
datadog-iot-agent-amd64-deb	0.00MB	✅	113.20MB	113.20MB	10.00MB
datadog-dogstatsd-amd64-deb	0.00MB	✅	78.32MB	78.32MB	10.00MB
datadog-heroku-agent-amd64-deb	-1.40MB	✅	526.45MB	527.85MB	70.00MB
datadog-agent-x86_64-rpm	-1.41MB	✅	1279.90MB	1281.31MB	140.00MB
datadog-agent-x86_64-suse	-1.41MB	✅	1279.90MB	1281.31MB	140.00MB
datadog-iot-agent-x86_64-rpm	0.00MB	✅	113.26MB	113.26MB	10.00MB
datadog-iot-agent-x86_64-suse	0.00MB	✅	113.26MB	113.26MB	10.00MB
datadog-dogstatsd-x86_64-rpm	0.00MB	✅	78.40MB	78.40MB	10.00MB
datadog-dogstatsd-x86_64-suse	0.00MB	✅	78.40MB	78.40MB	10.00MB
datadog-agent-arm64-deb	-0.01MB	✅	1004.84MB	1004.85MB	140.00MB
datadog-iot-agent-arm64-deb	0.00MB	✅	108.67MB	108.67MB	10.00MB
datadog-dogstatsd-arm64-deb	0.00MB	✅	55.59MB	55.59MB	10.00MB
datadog-agent-aarch64-rpm	-0.00MB	✅	1014.06MB	1014.06MB	140.00MB
datadog-iot-agent-aarch64-rpm	0.00MB	✅	108.74MB	108.74MB	10.00MB

Decision

✅ Passed

gabedos · 2024-12-11T13:38:51Z

/trigger-ci --variable RUN_ALL_BUILDS=true --variable RUN_KITCHEN_TESTS=true --variable RUN_E2E_TESTS=on --variable RUN_UNIT_TESTS=on --variable RUN_KMT_TESTS=on

dd-devflow · 2024-12-11T13:39:36Z

Devflow running: `/trigger-ci --variable RUN_ALL_BUILDS=true --varia...`

2024-12-11 13:39:36 UTC ℹ️ Gitlab pipeline started

Started pipeline #50831448

agent-platform-auto-pr · 2024-12-16T21:46:00Z

[Fast Unit Tests Report]

On pipeline 51334945 (CI Visibility). The following jobs did not run any unit tests:

Jobs:

tests_flavor_dogstatsd_deb-x64
tests_flavor_heroku_deb-x64
tests_flavor_iot_deb-x64

If you modified Go files and expected unit tests to run in these jobs, please double check the job logs. If you think tests should have been executed reach out to #agent-devx-help

agent-platform-auto-pr · 2024-12-16T22:11:22Z

Uncompressed package size comparison

Comparison with ancestor 5d81b5d4377ca3be9dac2134700683eedb516337

Diff per package

package	diff	status	size	ancestor	threshold
datadog-agent-x86_64-rpm	0.00MB	⚠️	1197.01MB	1197.01MB	140.00MB
datadog-agent-x86_64-suse	0.00MB	⚠️	1197.01MB	1197.01MB	140.00MB
datadog-agent-aarch64-rpm	0.00MB	⚠️	943.00MB	943.00MB	140.00MB
datadog-agent-amd64-deb	0.00MB	⚠️	1187.77MB	1187.77MB	140.00MB
datadog-agent-arm64-deb	0.00MB	⚠️	933.78MB	933.78MB	140.00MB
datadog-heroku-agent-amd64-deb	0.00MB	⚠️	505.05MB	505.05MB	70.00MB
datadog-dogstatsd-amd64-deb	0.00MB	✅	78.58MB	78.58MB	10.00MB
datadog-dogstatsd-x86_64-rpm	0.00MB	✅	78.65MB	78.65MB	10.00MB
datadog-dogstatsd-x86_64-suse	0.00MB	✅	78.65MB	78.65MB	10.00MB
datadog-dogstatsd-arm64-deb	0.00MB	✅	55.78MB	55.78MB	10.00MB
datadog-iot-agent-amd64-deb	0.00MB	✅	113.31MB	113.31MB	10.00MB
datadog-iot-agent-x86_64-rpm	0.00MB	✅	113.38MB	113.38MB	10.00MB
datadog-iot-agent-x86_64-suse	0.00MB	✅	113.38MB	113.38MB	10.00MB
datadog-iot-agent-arm64-deb	0.00MB	✅	108.78MB	108.78MB	10.00MB
datadog-iot-agent-aarch64-rpm	0.00MB	✅	108.84MB	108.84MB	10.00MB

Decision

⚠️ Warning

gabedos · 2024-12-19T14:20:25Z

/merge

dd-devflow · 2024-12-19T14:20:39Z

Devflow running: `/merge`

2024-12-19 14:20:39 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 27m.

2024-12-19 14:56:57 UTC ℹ️ MergeQueue: This merge request was merged

Implementing default securityContext readOnlyFilesystem on sidecar

f75a640

github-actions bot added the medium review and team/container-platform labels Nov 20, 2024

gabedos force-pushed the gabedos/readonly-agent-sidecar branch from d4c6af0 to f75a640 November 20, 2024 01:13

gabedos changed the title ~~Agent sidecar securityContext readOnlyRootFilesystem default setup~~ [CONTP-388] Agent sidecar securityContext readOnlyRootFilesystem default setup Nov 20, 2024

gabedos changed the title ~~[CONTP-388] Agent sidecar securityContext readOnlyRootFilesystem default setup~~ [CONTP-338] Agent sidecar securityContext readOnlyRootFilesystem default setup Nov 20, 2024

jhgilbert approved these changes Nov 20, 2024

Release notes

48a09b8

gabedos force-pushed the gabedos/readonly-agent-sidecar branch from e6dc815 to 48a09b8 November 20, 2024 20:17

Merge branch 'main' into gabedos/readonly-agent-sidecar

68d09e5