Kubectl stop rc <name> is flaky #4528
Comments
@davidopp, that is an issue in the etcd client library.
Yeah, Tim had made a similar comment. What is the best way to get more details about the error?
etcd has these flags. Dawn said that to enable etcd logging, change the salt config that starts it up.
To debug such an issue last time, I had to modify the etcd client library in Godeps/_workspace/src/github.com/coreos/go-etcd/etcd/ to add more logging too.
Thanks. And on the server side it looks like we want to add -vv to the command line in one or both of these files: cluster/saltbase/salt/etcd/{default,initd}
This is totally the wrong bug to be discussing this, but I'm not sure we should create another bug. Anyway, one thing that seems to be going on is that (at least in my cluster) etcd seems to keep crashing and getting restarted by monit. /var/log/monit.log looks like this:

[UTC Feb 19 05:17:31] info : monit: generated unique Monit id 0c489c0ae08f15498e0e67573ba7e33a and stored to '/var/lib/monit/id'

/var/log/etcd.log also shows it restarting at the corresponding times. This is surprising, though, since it would seem things would be more broken if etcd were really restarting frequently. BTW, apparently the -vv flag is no longer supported, so I'm going to retry with -v.
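For anyone else chasing this, a quick and purely illustrative way to check whether monit is bouncing etcd, assuming the log locations mentioned above:

```sh
# Show monit's entries that mention etcd (restarts, exec events, etc.)
grep -i etcd /var/log/monit.log

# Then compare timestamps against the most recent etcd startup messages
tail -n 50 /var/log/etcd.log
```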
It may be possible that it's not causing more visible problems because once every 2 minutes is not very often. My interpretation of the fact that I see this kind of error only when running things in parallel is that in normal circumstances there's a pretty low probability of having an API call in flight when etcd restarts, but when there are 50 of them it gets probable enough (especially because on a single-threaded master `stop` runs longer than 2 minutes).
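To put a rough, purely illustrative number on that intuition: if a single in-flight API call overlaps an etcd restart with probability p, then with 50 independent calls in flight the chance that at least one of them hits a restart is much larger, e.g.

```latex
P(\text{some call overlaps a restart}) = 1 - (1 - p)^{50},
\qquad p = 0.01 \;\Rightarrow\; 1 - 0.99^{50} \approx 0.39
```

(The value of p here is an assumption for illustration, not a measurement.)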
David's point was a good one. AFAICT PR #4587 fixes this problem, and it's possible that it'll help with other flakiness as well.
In addition to fixing the etcd problems, we should make the client more robust by retrying.
I'm closing this, given the retries for stop and better monitoring of etcd.
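For illustration only (not the actual fix that landed), a minimal sketch of the kind of client-side retry meant here, wrapping `kubectl stop rc` in a loop with backoff; the rc name is hypothetical:

```sh
#!/bin/bash
# Retry "kubectl stop rc" a few times with backoff, since a single attempt can
# fail transiently if etcd happens to restart while the call is in flight.
rc_name="my-rc"   # hypothetical replication controller name

for attempt in 1 2 3 4 5; do
  if kubectl stop rc "${rc_name}"; then
    echo "stopped ${rc_name} on attempt ${attempt}"
    break
  fi
  echo "attempt ${attempt} failed, retrying..." >&2
  sleep $((attempt * 2))   # simple linear backoff between attempts
done
```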
When playing with `get pods` performance under pressure I found that stopping replication controllers doesn't work reliably. Workflow:
create 50 pods, each with a different rc:
works reliably, but the reverse
doesn't. I get errors of this type:
which leaves the cluster in a state where the number of replicas under this rc may or may not be set to 0, and the rc itself is not deleted:
It seems that there are two kinds of errors: not deleting a pod before stopping the RC (which I guess should happen), and failing to delete the RC itself. Doing the same loop sequentially seems to work, which suggests a race condition. A rough sketch of the workflow is below.
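A loose reconstruction of the workflow described above (the rc config filenames and names are hypothetical, and the exact commands and error output from the original report are not reproduced here):

```sh
# Create 50 rcs (one pod each) in parallel -- this part works reliably.
for i in $(seq 1 50); do
  kubectl create -f "rc-${i}.json" &   # hypothetical per-rc config files
done
wait

# Stop them in parallel -- this is where the flaky failures show up.
for i in $(seq 1 50); do
  kubectl stop rc "rc-${i}" &
done
wait

# Stopping them sequentially instead seems to work, pointing at a race:
# for i in $(seq 1 50); do kubectl stop rc "rc-${i}"; done
```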