vmstorage pod restart causing cpu spike on sequentially higher vmstorage pod #8103
Description
Is your question request related to a specific component?
vmstorage, vminsert
Describe the question in detail
Hi,
We have run into an interesting issue in our VictoriaMetrics cluster and want to check whether it has been seen before. The issue is as follows:
We have ~30 vminsert and 54 vmstorage pods in a Kubernetes cluster. Since we enabled a replicationFactor of 2 to make reads more resilient to vmstorage pod restarts, we have noticed very odd behavior. When a vmstorage pod restarts, say vmstorage-10, the vmstorage pod one number higher (vmstorage-11) always sees a massive CPU spike while the restarted pod is coming back up; we don't set a CPU limit on the pods, so usage can jump from 3 cores to 30 cores. This pushes the storage connection saturation for vmstorage-11 over 1s, which then causes every vminsert pod to report maxing out its maxConcurrentInsert capacity, which in turn can drop the ingestion rate. The spike can last for ~5 minutes.
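For context, the relevant pieces of the setup look roughly like this; the flag names are the standard cluster flags, but the storage-node addresses and binary path are illustrative rather than our exact config:

```bash
# Illustrative vminsert flags (not our exact config). replicationFactor=2 is the
# change that preceded this behavior; -storageNode has one entry per vmstorage pod.
/vminsert-prod \
  -replicationFactor=2 \
  -storageNode=vmstorage-0.vmstorage:8400,vmstorage-1.vmstorage:8400,vmstorage-2.vmstorage:8400
```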
We have replicated this many times: whenever a vmstorage pod restarts, it is always the vmstorage pod one number higher that experiences the CPU / connection saturation spike. The behavior even wraps around: when we restarted vmstorage-54, vmstorage-0 ran into it.
Has this ever been reported, or does this behavior sound familiar? Looking at the rerouting metrics, rows from the restarting pod appear to be rerouted evenly across all the other vmstorage pods, so it doesn't make sense to us that only the pod one number higher is doing so much extra CPU processing.
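For what it's worth, this is roughly how we checked rerouting; the query goes through vmselect and uses the standard vminsert rerouting counters, with hostnames, tenant, and label names written from memory:

```bash
# Rows rerouted to each vmstorage node while another node is unavailable,
# queried via vmselect (default port 8481, tenant 0); hostnames are placeholders.
curl -s 'http://vmselect:8481/select/0/prometheus/api/v1/query' \
  --data-urlencode 'query=sum(rate(vm_rpc_rows_rerouted_to_here_total[1m])) by (addr)'
```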
We have also collected CPU profiles and traces from before and after the spike on the affected vmstorage pod; we are just waiting for approval to share them here, although I know the CPU profile / trace does not contain confidential information.
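For reference, the profile and trace were captured via the Go pprof endpoints exposed on vmstorage's HTTP port (default 8482); the pod name below is just an example:

```bash
# 30s CPU profile and execution trace from the affected vmstorage pod.
curl -s 'http://vmstorage-11:8482/debug/pprof/profile?seconds=30' > vmstorage-11-cpu.pprof
curl -s 'http://vmstorage-11:8482/debug/pprof/trace?seconds=30'   > vmstorage-11.trace
```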
We run cluster version v1.107.0 and deploy it on Kubernetes using the official Helm charts.
Any help on this would be fantastic,
Thanks