-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition for consuming podIP via downward API #13052
Conversation
return "", err | ||
} | ||
if inspectResult == nil { | ||
return "", nil // TODO: should be an error, probably |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make these errors, then the caller can choose to ignore the error. If you can't get a network IP it's better to continue with no status IP. Don't log in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if clause is bothering me; inspectContainer
does it but it seems like it would represent a bug in the docker API to get an (empty container, nil)
result from InspectContainer
. It would be great if there were a comment on the original code with a reference to an issue either in kube or docker, but oh well. I'm strongly tempted to take this clause out...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Write this code where you can tolerate not having the podIP, but you should
not ignore any errors in this function.
On Fri, Aug 21, 2015 at 4:42 PM, Paul Morie notifications@github.com
wrote:
In pkg/kubelet/dockertools/manager.go
#13052 (comment)
:@@ -276,6 +276,38 @@ type containerStatusResult struct {
err error
}+// determineInfraContainerIP inspects the given container and determines the IP address of that
+// container. The container is assumed to be the pod infra container; it is the responsibility
+// of the caller to ensure that the container ID passed is that of the infra container.
+func (dm *DockerManager) determineInfraContainerIP(dockerID string, pod *api.Pod) (string, error) {
- inspectResult, err := dm.client.InspectContainer(dockerID)
- if err != nil {
return "", err
- }
- if inspectResult == nil {
return "", nil // TODO: should be an error, probably
This if clause is bothering me; inspectContainer does it but it seems
like it would represent a bug in the docker API to get an (empty
container, nil) result from InspectContainer. It would be great if there
were a comment on the original code with a reference to an issue either in
kube or docker, but oh well. I'm strongly tempted to take this clause out...—
Reply to this email directly or view it on GitHub
https://github.com/kubernetes/kubernetes/pull/13052/files#r37676100.
Clayton Coleman | Lead Engineer, OpenShift
GCE e2e build/test passed for commit 82ba717ce35deb4db96d35c454342e8b2d18a632. |
if dm.networkPlugin.Name() != network.DefaultPluginName { | ||
netStatus, err := dm.networkPlugin.Status(pod.Namespace, pod.Name, kubeletTypes.DockerID(dockerID)) | ||
if err != nil { | ||
glog.Errorf("NetworkPlugin %s failed on the status hook for pod '%s' - %v", dm.networkPlugin.Name(), pod.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering to myself if this should be an error in this and inspectContainer
, which I cribbed this code from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, spawn an issue
On Fri, Aug 21, 2015 at 4:35 PM, Paul Morie notifications@github.com
wrote:
In pkg/kubelet/dockertools/manager.go
#13052 (comment)
:
- }
- if inspectResult == nil {
return "", nil // TODO: should be an error, probably
- }
- result := ""
- if inspectResult.State.Running {
if inspectResult.NetworkSettings != nil {
result = inspectResult.NetworkSettings.IPAddress
}
if dm.networkPlugin.Name() != network.DefaultPluginName {
netStatus, err := dm.networkPlugin.Status(pod.Namespace, pod.Name, kubeletTypes.DockerID(dockerID))
if err != nil {
glog.Errorf("NetworkPlugin %s failed on the status hook for pod '%s' - %v", dm.networkPlugin.Name(), pod.Name, err)
I'm wondering to myself if this should be an error in this and
inspectContainer, which I cribbed this code from.—
Reply to this email directly or view it on GitHub
https://github.com/kubernetes/kubernetes/pull/13052/files#r37675321.
Clayton Coleman | Lead Engineer, OpenShift
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spawned #13055
1f1c4da
to
2e92ea0
Compare
GCE e2e build/test failed for commit 2e92ea03d6048068c7207a01e2310970e9e9681d. |
GCE e2e build/test failed for commit 1f1c4da84887f8e2217824cfead1de0113797399. |
0b111a6
to
ab858a9
Compare
GCE e2e build/test failed for commit 0b111a621de9022b40e1297b223bb9f653889c7c. |
GCE e2e build/test failed for commit ab858a997bd7ac774e888845d969f6e78d1bb610. |
@@ -285,16 +285,53 @@ type containerStatusResult struct { | |||
err error | |||
} | |||
|
|||
const podIPDownwardAPISelector = "status.podIP" | |||
|
|||
func podDependsOnPodIP(pod *api.Pod) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
godoc
One nit, add this to the downward_api e2e test (please), and the existing e2e failures look like flakes caused by over scheduling. |
@smarterclayton Yep, I have been working on adding it to the e2e; an additional refactor was required for that; gonna split out another PR since it touches many files. |
GCE e2e build/test failed for commit 6b5321d8a95e32788d81df4b1582b5868fb3804a. |
Interesting, this appears to be a legit failure on GCE. Going to try to reproduce. |
GCE e2e build/test failed for commit b63215e3ae7f202707a95d29d28fa49725d1563d. |
I couldn't recreate the failure on GCE. Kicking the test runner -- I will need to reshuffle the commits in this branch before this goes in if the GCE run finishes successfully. |
GCE e2e build/test failed for commit fe96aa0d7b3bcb6d7f2842b09e88b8b1fb963888. |
GCE e2e build/test failed for commit 4ff66bd. |
@pmorie Thanks for heading up, for rkt side, as we haven't done much with the network plugin yet, will revisit later. |
@smarterclayton squashed |
Thanks @yifan-gu |
LGTM, e2e is flakes |
Fix race condition for consuming podIP via downward API
@mwielgus I would have been fine waiting for the bot -- I just didn't know how long to expect to have to wait since LGTM was added yesterday |
@pmorie @smarterclayton - sorry for jumping late to the party, but it seems that this PR completely broke pod startup time latency (our scalability tests are just logging it, so I didn't catch it before - I will extend tests today to check it). Before the 99th percentile of pod statup time was less than 5 seconds, after this PR it usually exceeds 30 seconds. |
What was the 90 and 95 percentiles? On Tue, Sep 8, 2015 at 3:31 AM, Wojciech Tyczynski <notifications@github.com
Clayton Coleman | Lead Engineer, OpenShift |
I didn't look for 95th percentile, but there was also significant difference (5-10x) in 50th and 90th percentile. |
So it wasn't only the long tail - it was just significantly worse at all percentiles. |
Ok, sounds like then we probably need to reuse the data we get from Docker On Tue, Sep 8, 2015 at 10:00 AM, Wojciech Tyczynski <
Clayton Coleman | Lead Engineer, OpenShift |
We are running scalability tests on GCE (without any additional plugins). |
@smarterclayton @wojtek-t It's extremely curious that this would have affected performance that much. This pad doesn't add any new inspect calls, @smarterclayton; it already uses the result of the docker inspect performed when we started the infra container. |
All I can imagine is that the network plugin is being called, or some error On Tue, Sep 8, 2015 at 2:11 PM, Paul Morie notifications@github.com wrote:
Clayton Coleman | Lead Engineer, OpenShift |
This reverts commit b85d055.
Fixes the race that can make status.podIP be blank when consumed via the downward API.
@thockin @smarterclayton PTAL