Dockerd memory leak because the metadata database is corrupted #49074
Description
Probably due to a power outage or a sudden dockerd crash, I ended up with a corrupted metadata.db. When I tried to start dockerd again, it would not come up and printed no error messages. The logs showed it was stuck in the container-restore phase, so I checked dockerd with top and noticed something unusual: its RES was growing by more than 2 GB per minute and never stopped, until it triggered an OOM (out-of-memory) condition.
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.0 us, 6.0 sy, 0.0 ni, 85.3 id, 0.5 wa, 0.8 hi, 0.0 si, 0.4 st
MiB Mem : 11825.5 total, 8447.6 free, 2682.0 used, 696.0 buff/cache
MiB Swap: 6060.0 total, 6060.0 free, 0.0 used. 8856.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2914 root 20 0 5661616 2.4g 44128 R 169.9 20.9 1:08.17 dockerd
I used pprof to capture the heap status.
File: dockerd
Build ID: 5e83b85afd6620952da02350b526d90b1922d1e9
Type: inuse_space
Time: Dec 12, 2024 at 2:37pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2.07GB, 99.88% of 2.07GB total
Dropped 19 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 19
flat flat% sum% cum cum%
2.07GB 99.88% 99.88% 2.07GB 99.88% github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).first
0 0% 99.88% 2.07GB 99.88% github.com/docker/docker/daemon.(*Daemon).configureVolumes
0 0% 99.88% 2.07GB 99.88% github.com/docker/docker/daemon.NewDaemon
0 0% 99.88% 2.07GB 99.88% github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Bucket).ForEach
0 0% 99.88% 2.07GB 99.88% github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).Next
0 0% 99.88% 2.07GB 99.88% github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).next
0 0% 99.88% 2.07GB 99.88% github.com/docker/docker/vendor/github.com/boltdb/bolt.(*DB).View
0 0% 99.88% 2.07GB 99.91% github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).Execute
0 0% 99.88% 2.07GB 99.91% github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).ExecuteC
0 0% 99.88% 2.07GB 99.91% github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).execute
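For context, one way to capture such a heap profile from dockerd is through its pprof debug endpoints, which the daemon exposes on the API socket when debug mode is enabled. The sketch below is only an assumption about this setup (socket path, debug mode, and whether the endpoint is reachable while the daemon is still stuck in startup are not confirmed here):

```go
package main

import (
	"context"
	"io"
	"log"
	"net"
	"net/http"
	"os"
)

func main() {
	// Dial the Docker API unix socket; the host part of the URL is ignored.
	tr := &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
		},
	}
	resp, err := (&http.Client{Transport: tr}).Get("http://docker/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.out")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Inspect afterwards with: go tool pprof dockerd heap.out
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```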
Memory usage kept growing continuously while the daemon was reading the volumes database.
After further debugging, I found that while reading a key/value pair from the database, the computed pgid resolved to a non-existent leaf page whose flags field was 16 (0x10), a value that matches neither branchPageFlag (0x01) nor leafPageFlag (0x02); see moby/vendor/go.etcd.io/bbolt/cursor.go, lines 175 to 182, and moby/vendor/go.etcd.io/bbolt/page.go, lines 17 to 22, at commit 7faa4ec.
As a result, Cursor.stack kept growing without bound, reaching more than 1,600 entries.
Reproduce
Replace the volumes metadata.db with the corrupted file attached below, then start dockerd directly.
Corrupted metadata.db: metadata.tar.gz
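For completeness, here is a minimal Go sketch that reproduces the runaway iteration outside of dockerd. The bucket name "volumes" and the read-only open options are assumptions based on how moby's volume service stores its metadata:

```go
package main

import (
	"log"
	"runtime/debug"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open a local copy of the attached metadata.db read-only.
	db, err := bolt.Open("metadata.db", 0o600, &bolt.Options{ReadOnly: true, Timeout: time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Catch any panic bbolt raises once a corrupted page is dereferenced.
	defer func() {
		if r := recover(); r != nil {
			log.Fatalf("panic while iterating volumes bucket: %v\n%s", r, debug.Stack())
		}
	}()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("volumes")) // bucket name assumed from moby's volume service
		if b == nil {
			log.Fatal("volumes bucket not found")
		}
		n := 0
		// On the corrupted file this iteration never completes: the cursor keeps
		// descending into bogus pages and memory grows, matching the dockerd
		// behaviour described above.
		return b.ForEach(func(k, v []byte) error {
			n++
			if n%1000 == 0 {
				log.Printf("visited %d keys", n)
			}
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```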
Expected behavior
In the listMeta function (lines 77 to 84 at commit 7faa4ec), when the value v is empty I added a b.Bucket(k) call to check whether the key k is valid. On the corrupted database this call panics, and recovering from that panic lets the daemon stop cleanly instead of growing memory without bound. I'm still unsure whether the recovered panic should be reported further, or whether exiting here is acceptable:
b.ForEach(func(k, v []byte) error {
	if len(v) == 0 {
		// An empty value means k should name a nested bucket; on the corrupted
		// database, dereferencing it panics inside bbolt. Recover, log, and
		// exit instead of letting memory grow forever.
		defer func() {
			if r := recover(); r != nil {
				logrus.Errorf("Error while reading volume metadata, bucket is nil: %v", k)
				logrus.Errorf("List stack:\n%s", debug.Stack())
				fmt.Println("Error while reading volume metadata: bucket is nil")
				os.Exit(1)
			}
		}()
		if b.Bucket(k) == nil {
			logrus.Debugf("Error while reading volume metadata, bucket is nil: %v", k)
		}
		// don't try to unmarshal an empty value
		return nil
	}
	// ... unchanged: unmarshal non-empty values as before
	return nil
})
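As a design note, a possible alternative to calling os.Exit would be to convert the recovered panic into an error. This is only a sketch, not moby's actual code: it assumes listMeta could be changed to return an error to its caller, and it reuses the surrounding bucket handle b and the logrus import from that file:

```go
// Hypothetical variant: propagate an error instead of killing the process,
// so NewDaemon can fail with a clear message about the corrupted database.
err := b.ForEach(func(k, v []byte) (ferr error) {
	if len(v) == 0 {
		defer func() {
			if r := recover(); r != nil {
				// A panic here means the page backing key k is corrupted.
				ferr = fmt.Errorf("corrupted volume metadata for key %q: %v", k, r)
			}
		}()
		if b.Bucket(k) == nil {
			logrus.Debugf("volume metadata bucket is nil for key %q", k)
		}
		return nil // don't try to unmarshal an empty value
	}
	// non-empty values are unmarshalled as in the existing code
	return nil
})
if err != nil {
	return nil, err // assumed two-value return; surfaced by NewDaemon at startup
}
```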
Additionally, I have reported the issue to the bbolt project and suggested a way to handle the invalid page flags value of 16.
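Independently of the daemon-side change, the corruption could also be detected up front with bbolt's built-in consistency check before the volumes database is read. A rough sketch follows; the database path is assumed from the default Docker root dir, and depending on the nature of the corruption the check itself might need the same panic protection:

```go
package main

import (
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Path assumed: dockerd keeps the volumes database under its root dir.
	db, err := bolt.Open("/var/lib/docker/volumes/metadata.db", 0o600,
		&bolt.Options{ReadOnly: true, Timeout: time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		// Tx.Check walks the B+tree and streams any inconsistencies it finds,
		// such as pages with unexpected types, over the returned channel.
		for cerr := range tx.Check() {
			log.Printf("consistency error: %v", cerr)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```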
docker version
[root@localhost bin]# docker version
Client:
Version: 27.4.0
API version: 1.47
Go version: go1.22.10
Git commit: bde2b89
Built: Sat Dec 7 10:37:44 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 27.4.0
API version: 1.47 (minimum version 1.24)
Go version: go1.22.10
Git commit: 92a8393
Built: Sat Dec 7 10:39:05 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.6
GitCommit: d65fef44f398e66acb44a2f42b9a665f8be5d2ce
runc:
Version: 1.2.2
GitCommit: v1.2.2-0-g7cb3632
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Version: 27.4.0
Context: default
Debug Mode: false
Server:
Containers: 5
Running: 0
Paused: 0
Stopped: 5
Images: 4
Server Version: 27.4.0
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: false
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: d65fef44f398e66acb44a2f42b9a665f8be5d2ce
runc version: v1.2.2-0-g7cb3632
init version: de40ad0
Security Options:
seccomp
Profile: builtin
Kernel Version: 5.10.134-14.zncgsl6.x86_64
Operating System: Carrier Grade Server Linux 6.06
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 11.55GiB
Name: localhost.localdomain
ID: 45f1d3fd-7ca7-430e-b7c5-0625663afd1c
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Additional Info
No response