-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
High CPU Utilization Openshift process (Origin v1.4) #14428
Comments
/cc @smarterclayton |
Please file a bug so we can track it if you are using OCP. Many fixes are in OCP 3.4 that are not in Origin 1.4. You'd probably need to grab a profile from the node process to identify the issue. |
I added the following in
I do get: probably because the origin-node is running in a container, hence can't find the openshift binaries locally. |
Think I had to run pprof from within the origin node container. Please download the profile from the following URL: |
Thanks, will take a look |
@childsb @rootfs did we fix a CPU issue in 1.4/3.4 around cleanup of orphan pod dirs where we processed way too many? Basically we're spending all of our time in syscalls to read the volumes directory. @olsib can you count how many mounts, files, and symlinks you have under your openshift volumes directory? An abnormal number, or a recursive set of links could cause this. |
Below are recursive counts for what's under /var/lib/origin/openshift.local.volumes/pods dir:
Also 79 active mounts in total (returned by mount command) |
@olsib can you upload the node log? |
@rootfs please download below: |
@olsib this looks similar to kubernetes/kubernetes#28750 and kubernetes/kubernetes#45935 # grep "Error: UnmountVolume.TearDown failed for volume" messages |grep "device or resource busy" |wc -l
164854
# grep "Error: UnmountVolume.TearDown failed for volume" messages |grep "device or resource busy" |grep deleting~ |wc -l
159737 |
@rootfs Was this supposed to be fixed in 1.4? We're getting those errors on another cluster as well which is running on same version (1.4). |
@olsib "Please file a bug so we can track it if you are using OCP. Many fixes are in OCP 3.4 that are not in Origin 1.4." |
@rootfs we are using origin, does that mean that we do not get a fix for this? |
I'm not sure if we have an authoritative stance on this, but at this point Origin 1.4 updates will be limited to massive functionality issues and/or serious security issues. |
@stevekuznetsov is this particular issue fixed in any version of Origin? |
@childsb would know what changes introduced the fix, from which we can determine which releases of Origin contain it. |
The fix is here #14549 |
Thanks @childsb. @stevekuznetsov so it seems we will have to wait for v1.6 of origin in best scenario? Did you replicate this in OCP 3.4 as well? |
Yes, 1.6 will be the first Origin release with this fix, |
Is the fix for this pulled in origin v3.6.0? |
Looks like yes. |
Abnormal Openshift Node process utilization.
We're experiencing an unusual CPU utilization in few of our Openshift nodes, sometimes eating more then half of the node CPU (please refer to the stats provided below for more details).
There is no clear evidence as to where does the load come from within the origin node container. PODs are spread equally across nodes, so we would expect more or less same utilization across all nodes.
Our non-prod environments are running Openshift Origin 1.4.
Prod environment is running on OCP v3.4. We need to make sure this bug is not replicated in OCP before we do any deployments there, so your quick assistance will be appreciated.
Version
oc v1.4.0+208f053
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO
Steps To Reproduce
Current Result
Container name:
openshift/node:v1.4.0
Docker stats from Node experiencing issue:
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
edfa59a0981a 226.59% 661.9 MiB / 27.79 GiB 2.33% 0 B / 0 B 487.6 MB / 707.9 MB 31
Expected Result
Docker stats from a healthy node:
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
9c5325114fe1 10.04% 231.4 MiB / 27.79 GiB 0.81% 0 B / 0 B 261.5 MB / 11.51 GB 36
Additional Information
Deployment type: Containerized
Docker version: 1.12.5
OS: Redhat 7.3 (kernel 3.10.0-514.6.1.el7.x86_64)
Cluster Info: 3 Master + 12 Nodes
Node Resources: 4 CPU/30.5 GB of RAM
Average Pods/Node: ~50
Faulty node docker info:
docker info
Server Version: 1.12.5
Storage Driver: devicemapper
Pool Name: docker--vg-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 25.48 GB
Data Space Total: 85.46 GB
Data Space Available: 59.98 GB
Metadata Space Used: 11.42 MB
Metadata Space Total: 218.1 MB
Metadata Space Available: 206.7 MB
Thin Pool Minimum Free Space: 8.546 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
Volume: local
Network: host bridge null overlay
Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.16.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 4
Total Memory: 27.79 GiB
ID: W6NH:D2OP:JCBO:SLNW:Q4FX:FUOJ:2JIS:L5SU:IBNY:RZI2:ORCH:6KR2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.access.redhat.com/v1/
Insecure Registries:
172.30.0.0/16
127.0.0.0/8
Registries: registry.access.redhat.com (secure), docker.io (secure)
Healthy node docker info:
Server Version: 1.12.5
Storage Driver: devicemapper
Pool Name: docker--vg-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 41.87 GB
Data Space Total: 214.5 GB
Data Space Available: 172.6 GB
Metadata Space Used: 19.18 MB
Metadata Space Total: 109.1 MB
Metadata Space Available: 89.87 MB
Thin Pool Minimum Free Space: 21.45 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
Volume: local
Network: bridge null host overlay
Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.10.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 4
Total Memory: 27.79 GiB
ID: AJCW:VLRA:Z3PB:VKGN:RMJ3:DKD6:32EN:NDRC:EEIW:4KII:DK6H:STPG
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.access.redhat.com/v1/
Insecure Registries:
172.30.0.0/16
127.0.0.0/8
Registries: registry.access.redhat.com (secure), docker.io (secure)
The text was updated successfully, but these errors were encountered: