Conflict between EKS Best Practices and latest AL2023 AMI. #2122
Description
While upgrading one of our EKS clusters to 1.31, we experienced an issue where the new 1.31 nodes would not join the cluster.
In the containerd logs on one of the new 1.31 nodes we saw the following:
Jan 23 03:55:48 ip-10-0-130-143.ec2.internal systemd[1]: Started containerd.service - containerd container runtime.
Jan 23 03:55:50 ip-10-0-130-143.ec2.internal containerd[4125]: time="2025-01-23T03:55:50.855808769Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:efs-csi-node-5fdcs,Uid:dc3211ea-b057-4c13-8d05-598cabb16988,Namespace:kube-system,Attempt:0,}"
Jan 23 03:55:50 ip-10-0-130-143.ec2.internal containerd[4125]: time="2025-01-23T03:55:50.859534903Z" level=info msg="trying next host" error="failed to do request: Head \"https://localhost/v2/kubernetes/pause/manifests/latest\": dial tcp 127.0.0.1:443: connect: connection refused" host=localhost
Jan 23 03:55:50 ip-10-0-130-143.ec2.internal containerd[4125]: time="2025-01-23T03:55:50.861665120Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:efs-csi-node-5fdcs,Uid:dc3211ea-b057-4c13-8d05-598cabb16988,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"localhost/kubernetes/pause\": failed to pull image \"l>
Jan 23 03:55:50 ip-10-0-130-143.ec2.internal containerd[4125]: time="2025-01-23T03:55:50.861694974Z" level=info msg="stop pulling image localhost/kubernetes/pause:latest: active requests=0, bytes read=0"
Checking the logs of one of the running 1.30 nodes, we found the following:
Jan 22 04:25:28 ip-10-0-128-200.ec2.internal systemd[1]: Started containerd.service - containerd container runtime.
Jan 22 04:25:30 ip-10-0-128-200.ec2.internal containerd[4224]: time="2025-01-22T04:25:30.245168728Z" level=info msg="PullImage \"602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5\""
Jan 22 04:25:30 ip-10-0-128-200.ec2.internal containerd[4224]: time="2025-01-22T04:25:30.759157615Z" level=info msg="ImageCreate event name:\"602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"} labels:{key:\"io.cri-containerd.pinned\" value:\"pinned\"}"
Jan 22 04:25:30 ip-10-0-128-200.ec2.internal containerd[4224]: time="2025-01-22T04:25:30.761025880Z" level=info msg="stop pulling image 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: active requests=0, bytes read=298689"
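The pause-related lines above can be pulled from the containerd unit logs on each node with something like:
sudo journalctl -u containerd --no-pager | grep -i pause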
We were not sure why the 1.31 node was trying to pull the pause image from localhost rather than from ECR as the 1.30 node did.
So, we checked /etc/containerd/config.toml on each node.
1.31 config:
cat /etc/containerd/config.toml
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
[grpc]
address = "/run/containerd/containerd.sock"
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
discard_unpacked_layers = true
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "localhost/kubernetes/pause"
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
base_runtime_spec = "/etc/containerd/base-runtime-spec.json"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = "/usr/sbin/runc"
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
1.30 config:
cat /etc/containerd/config.toml
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
[grpc]
address = "/run/containerd/containerd.sock"
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
discard_unpacked_layers = true
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
base_runtime_spec = "/etc/containerd/base-runtime-spec.json"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
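For a quick side-by-side, the relevant difference is the sandbox_image setting, which can be pulled out of each node's config with:
grep sandbox_image /etc/containerd/config.toml
On the 1.31 node this returns localhost/kubernetes/pause, while on the 1.30 node it returns the ECR image 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5.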
So, the al2023-1.31 AMI now configures containerd to pull the pause (sandbox) image from a local source rather than from ECR.
Searching through the awslabs/amazon-eks-ami GitHub repository, we found PR #2000.
The pause container image is now cached during the AMI build, so we now knew what the problem was.
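On a node where this cache is intact, the locally cached sandbox image should show up in containerd's k8s.io image store; something like the following can be used to confirm it is present (the exact ref and tag may differ):
sudo ctr --namespace k8s.io images list | grep pause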
We created our clusters using Terraform EKS-Blueprint. We also tried to follow some of the EKS Best Practices, particularly this one: Use multiple EBS volumes for containers.
That guidance advises using a second volume with /var/lib/containerd mounted on it. The following script is run as a preBootstrapCommand:
"systemctl stop containerd"
"mkfs -t ext4 /dev/nvme1n1"
"rm -rf /var/lib/containerd/*"
"mount /dev/nvme1n1 /var/lib/containerd/"
"systemctl start containerd"
One of the steps removes everything under /var/lib/containerd prior to mounting the volume. With the pause container image now cached in the AMI, it's likely that this cache was being deleted by the bootstrap command.
To test this, we removed the second volume and the preBootstrapCommand from our Terraform node group template. We ran the upgrade again, and all the new nodes started and joined the cluster as expected.
At this point we are not sure whether we need the second volume at all, as our applications rarely, if ever, write to disk, so disk quotas should not be an issue. We're doing some testing now to check the disk I/O on our nodes.
However, if we did need to use a second volume for containerd again, and given that the pause container is now cached and pulled locally, what would be a workaround in this scenario?
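One possible adjustment we are considering (untested, just a sketch; the temporary mount point /mnt/containerd-new is arbitrary) would be to copy the existing /var/lib/containerd contents, including the cached pause image, onto the new volume before mounting it in place, instead of deleting them:
systemctl stop containerd
mkfs -t ext4 /dev/nvme1n1
# stage the new volume at a temporary mount point
mkdir -p /mnt/containerd-new
mount /dev/nvme1n1 /mnt/containerd-new
# preserve the content cached in the AMI (e.g. localhost/kubernetes/pause)
cp -a /var/lib/containerd/. /mnt/containerd-new/
umount /mnt/containerd-new
# mount the volume in its final location and restart containerd
mount /dev/nvme1n1 /var/lib/containerd/
systemctl start containerd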
An update to the EKS Best Practices document may be in order as well.
Environment:
- AWS Region: us-east-1
- Instance Type(s): r6i.4xlarge
- Cluster Kubernetes version: 1.31
- Node Kubernetes version: 1.31
- AMI Version: amazon-eks-node-al2023-x86_64-standard-1.31-v20250116