-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Daemon Restart: attempt to wait for container deps #18208
Conversation
// Best effort to wait for container deps to be up | ||
switch e := err.(type) { | ||
case errcode.Error: | ||
if e.ErrorCode() == derr.ErrorCodeLinkNotRunning { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also do e.ErrorCode() != derr.ErrorCodeLinkNotRunning { break }
and remove the conditional a few lines below 😸
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think we have a function to check those error codes without adding switch statements everywhere, but I cannot find it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@duglin ?
Also, I have the switch because I was also trying to fix another case where overlay networks aren't ready -- but there appears to be a deeper issue in libnetwork that needs to get fixed first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at one point I did have a util func to avoid making everyone write switch statements but it never made the cut.
c3f52e1
to
fa70c7f
Compare
} | ||
err = nil | ||
} | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not necessary anymore 😸
nevermind about the function, I might have dreamed about it |
fa70c7f
to
852604b
Compare
@calavera Updated |
852604b
to
833b149
Compare
// This file contains all of the errors that can be generated from the | ||
// docker engine but are not tied to any specific top-level component. | ||
|
||
const errGroup = "engine" | ||
|
||
// IsErr returns a bool indicating if the passed in error matches the expected error | ||
func IsErr(err error, expected errcode.ErrorCode) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a really bad func name if you think about it ;)
maybe something more descriptive but seeing that this is not used in this PR i would remove this file from the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used in this PR
833b149
to
ce39556
Compare
@crosbymichael updated the fn name. |
Edited: I was thinking about an alternative approach: We put the containers that fail to start in a queue instead of retrying. When all the other containers have finished, we try to restart the failed ones again sequentially. I don't know if it's better, what do you think? |
you can easily built a little dependency graph to fix the start order instead of thrashing the containers trying to start them. |
ce39556
to
593a44f
Compare
PTAL |
} | ||
|
||
for _, c := range children { | ||
if _, err := c.waitRunning(1 * time.Second); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just picked a duration here.
Going for something simple, but I think this might be too simple. |
b562af0
to
f572850
Compare
Ok, PTAL. |
This is closer but there is too much concurrency where it is not needed |
f572850
to
a310450
Compare
Ah, good point, updated to make waiting on children happen in one goroutine per container instead of for each child. |
@cpuguy83 needs rebase |
0f9315f
to
bd0c54b
Compare
rebased |
@@ -353,17 +353,43 @@ func (daemon *Daemon) restore() error { | |||
// The container register failed should not be started. | |||
return | |||
} | |||
}(c.container, c.registered) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this "upper" go routine need to exist still now that actual container start operations are in the "lower" go routine (added below) with the range restartContainers {
loop?
I tried to dig through daemon.Register
and daemon.generateNewName
--which seem to be the only external calls in the go routine now--to see if there is any reason for this to be necessarily handled per-container in a concurrent manner and nothing jumps out at me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, you are probably right...
Maybe restarting the daemon after unclean shutdown would be a bit slower (making a few syscalls to cleanup)...
bd0c54b
to
2f69ae0
Compare
@estesp Updated with your suggestion. |
This looks a lot better to me without two loops of go routines.. not sure if @crosbymichael wanted to take another look given the comment re: "too much concurrency"? |
ping @calavera @crosbymichael if you can have another look? |
This provides a best effort on daemon restarts to restart containers which have linked containers that are not up yet instead of failing. Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2f69ae0
to
19762da
Compare
Rebased |
LGTM |
// running before we try to start the container | ||
children, err := daemon.children(container.Name) | ||
if err != nil { | ||
logrus.Warnf("error getting children for %s: %v", container.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should have a return here or you will probably get a panic from a nil container
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we'll hit a panic here as we aren't doing anything with the values except checking if there if we have a start notifier for it later, which can be nil and be ok, and if children
is nil, it's ok because we are just ranging over it anyway.
I'd also prefer to not return here since that would make the container never start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i see that now. thanks
LGTM |
Daemon Restart: attempt to wait for container deps
This provides a best effort on daemon restarts to restart containers
which have linked containers that are not up yet instead of failing.