
Fix a data race in kubelet image manager #88915

Merged

Conversation

@roycaihw (Member) commented Mar 6, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
Let the image manager return a copy of the image list. Today it sorts the internal image list cache in place and returns it to the caller directly, which introduces a data race. Because the sort is in-place, we've observed the race causing static pod workers to panic during syncPod. However, given the way the kubelet recovers from goroutine panics, the kubelet kept running while the pod worker silently stopped working: all future work dispatched for that pod was ignored.

git log shows the race was introduced in 1.10. We should cherry-pick this fix to all supported minor releases.
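For illustration only (this is not code from the PR), a minimal, self-contained sketch of the racy pattern: a cache whose get() sorts the internal slice in place and returns that same slice, so callers read it without the lock while other callers re-sort it. Running it with go run -race flags the race.

package main

import (
	"sort"
	"sync"
)

// cache mimics the buggy imageCache: get() sorts the internal slice in place
// and returns it directly, so every caller shares the same backing array.
type cache struct {
	sync.Mutex
	images []int
}

func (c *cache) get() []int {
	c.Lock()
	defer c.Unlock()
	sort.Ints(c.images) // in-place sort of the shared slice
	return c.images     // the returned slice aliases the internal one
}

func main() {
	c := &cache{images: []int{3, 1, 2}}

	go func() {
		for {
			c.get() // keeps re-sorting the shared backing array under the lock
		}
	}()

	for i := 0; i < 10000; i++ {
		list := c.get()
		_ = list[0] // unsynchronized read races with the in-place sort above
	}
}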

Sample panic log from 1.13:

runtime.go:73] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 755 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x..., 0x...)
	...k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:69 +0x...
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x..., 0x..., 0x...)
	...k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51 +0x...
panic(0x..., 0x...)
	/usr/local/go/src/runtime/panic.go:522 +0x...
k8s.io/kubernetes/pkg/kubelet/nodestatus.Images.func1(0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/nodestatus/setters.go:413 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).setNodeStatus(0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet_node_status.go:480 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).initialNode(0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet_node_status.go:348 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).GetNode(0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet_getters.go:220 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncPod(0x..., 0x..., 0x..., 0x..., 0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet.go:1666 +0x...
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).managePodLoop.func1(0x..., 0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:174 +0x...
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).managePodLoop(0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:183 +0x...
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).UpdatePod.func1(0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:221 +0x...
created by k8s.io/kubernetes/pkg/kubelet.(*podWorkers).UpdatePod
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:219 +0x...

Which issue(s) this PR fixes:

Fixes #60443

Special notes for your reviewer:
I'm going to add a proposal to SIG Node about catching/mitigating this kind of programming error in the future. In this case the issue would have been mitigated if we had either 1) actually crashed the kubelet and let it restart, or 2) created a new goroutine when the old one panicked. Restarting the kubelet is preferred IMO because it guards against a wide range of programming errors, and it's usually safe to restart the kubelet as long as restarts aren't too frequent.
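To illustrate the failure mode (a hedged sketch only, not the kubelet's actual pod-worker code): a worker goroutine whose panic is caught by a deferred recover keeps the process alive, but the worker's loop has exited and nothing restarts it, so later updates are silently dropped.

package main

import (
	"fmt"
	"time"
)

// managePodLoop stands in for a per-pod worker: it processes updates until a
// panic escapes the loop, at which point the deferred recover logs the panic
// and the goroutine exits -- nothing restarts it.
func managePodLoop(updates <-chan string) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("observed a panic:", r)
		}
	}()
	for u := range updates {
		if u == "bad" {
			var p *int
			_ = *p // simulates the nil-pointer dereference seen in syncPod
		}
		fmt.Println("synced", u)
	}
}

func main() {
	updates := make(chan string, 8)
	go managePodLoop(updates)

	updates <- "ok"
	updates <- "bad"             // kills this pod's worker goroutine
	updates <- "ignored forever" // never processed; no error is surfaced

	time.Sleep(100 * time.Millisecond)
	fmt.Println("process is still running, but the worker is gone")
}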

Does this PR introduce a user-facing change?:

Fixed a data race in kubelet image manager that can cause static pod workers to silently stop working.

/cc @Random-Liu @cheftako
/priority critical-urgent
/sig node

@k8s-ci-robot k8s-ci-robot requested a review from cheftako March 6, 2020 18:30
@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 6, 2020
@k8s-ci-robot k8s-ci-robot requested a review from Random-Liu March 6, 2020 18:30
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/kubelet labels Mar 6, 2020
@roycaihw (Member Author) commented Mar 6, 2020

I'm not sure what additional approval we need given that we're in code freeze. I will prepare the cherry-picks once this is merged into master.

@@ -548,14 +550,19 @@ func TestValidateImageGCPolicy(t *testing.T) {
}
}

//
Member:

?

Member Author (roycaihw):

Ah, I was going to add a test description but decided not to at some point. I updated the test comments and renamed the variable to expect to make things clearer.

// Make a copy of testList because cache.set takes ownership of the list
testListCopy := append([]container.Image{}, testList...)
// Expect cache.get to sort the image list
sort.Sort(sliceutils.ByImageSize(testListCopy))
cache.set(testList)
Member:

We should also test that testList is not changed after get.

Member Author (roycaihw):

It is changed; that's why the test didn't catch the bug before. Please see the comment I added for imageCache.set().

That brings up another question: should imageCache.set() make a copy of the given list? E.g.:

func (i *imageCache) set(images []container.Image) {
	i.Lock()
	defer i.Unlock()
	imagesCopy := append([]container.Image{}, images...)
	i.images = imagesCopy
}

This way the caller of imageCache.set() can safely reference images afterwards. I think it's a reasonable change, but not as critical. Currently this goroutine is the only caller of imageCache.set(), and it doesn't reference images afterwards, so the change doesn't matter in practice. Since we are going to cherry-pick this fix, I prefer keeping the change to the minimum needed; that's why I didn't change the imageCache.set() behavior here.

Member Author (roycaihw):

I moved the sort logic outside of the image cache and tested that testList is not changed after get().

@Random-Liu Random-Liu modified the milestones: v1.19, v1.18 Mar 6, 2020
@Random-Liu (Member):

This is a critical bug fix! Mark 1.18.

@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch from d03a564 to 64cda98 Compare March 6, 2020 20:59
@roycaihw (Member Author) commented Mar 6, 2020

/test pull-kubernetes-e2e-kind

@roycaihw (Member Author) commented Mar 6, 2020

/test pull-kubernetes-integration

sort.Sort(sliceutils.ByImageSize(i.images))
return i.images
// Make a copy of image list to avoid data race
images := append([]container.Image{}, i.images...)
@liggitt (Member) commented Mar 7, 2020:

How frequently is this read? Should we sort on set and return the list without sorting/allocating?

Member Author (roycaihw):

That's a good point! It's invoked in different places, so it's hard to tell from the source alone. I searched for these log messages on a master node and found that the read happens every few (<5) seconds (see the log below). Compared to the write, which happens every 30s, I agree it makes sense to sort on set.

Regarding whether we should skip the allocation on return, I'm not sure it's safe. Currently we truncate the length of the returned list, which has no effect on the original list (https://play.golang.org/p/fc6JmArJtfW); a short sketch of that slicing behavior follows the log below. But I still feel it's safer to copy on return, to prevent the original list from being mutated in the future.

I0309 21:40:27.927125    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:35.971794    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:39.926950    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:43.582650    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:44.926879    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:45.926867    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:46.104704    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:52.927557    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:53.927034    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:56.239125    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:56.926903    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:06.404693    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:08.926829    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:11.926992    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:14.926864    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:16.541316    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:19.929088    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:20.926684    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:23.926947    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:23.927801    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:26.720531    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:28.926814    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:32.926843    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:32.927004    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:34.926880    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:36.884823    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:37.926909    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:38.926790    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:39.927096    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:43.582911    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:43.927033    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:47.013040    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:54.926907    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:55.927053    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:56.927574    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:57.161981    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:02.927006    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:04.926781    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:05.927033    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:07.335855    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:13.927023    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:17.502770    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:19.927029    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:21.926896    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:27.638850    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:32.926875    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
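A minimal sketch of the slicing behavior mentioned above (mirroring the point of the linked playground; not code from the PR): truncating the caller's own slice header leaves the cached list untouched, but writing through the shared backing array is visible to every holder of the slice.

package main

import "fmt"

func main() {
	cached := []string{"img-a", "img-b", "img-c"}

	// The caller re-slices its own copy of the slice header. Length and
	// capacity live in the header, so the cache's view is unaffected.
	view := cached
	view = view[:1]
	fmt.Println(len(cached), len(view)) // 3 1

	// Writing through the shared backing array is different: it is visible
	// to (and races with) every other holder of the slice.
	view[0] = "mutated"
	fmt.Println(cached[0]) // "mutated"
}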

@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch from 07d924e to d5f1d52 Compare March 10, 2020 00:32
// The image list needs to be sorted when it gets read and used in
// setNodeStatusImages. We sort the list on write instead of on read,
// because the image cache is more often read than written
sort.Sort(sliceutils.ByImageSize(images))
Member Author (roycaihw):

I was almost going to add another copy before this line... Perhaps it's better to put the copy and the sort inside set(), to make the code more contained? @liggitt WDYT?

Member:

moving copy and sort into set makes sense to me

@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch from d5f1d52 to b4cdcd4 Compare March 10, 2020 06:25
@roycaihw (Member Author):

@liggitt @Random-Liu Thanks for the review! I've updated the change. PTAL

@Random-Liu (Member) left a comment:

/lgtm
/approve

func (i *imageCache) set(images []container.Image) {
i.Lock()
defer i.Unlock()
// make a copy to avoid potential data race
Member:

I think we can't avoid the data race, because the image list is passed in, and the caller may concurrently do some operation on it.

However, I do agree that copying it here is better. :)

Member Author (roycaihw):

Ah, you're right. I updated the comment to document that the caller should not reference the list again (similar to the comments for the client-go cache and FIFO queue):

 // set sorts the input list and updates image cache.
+// 'i' takes ownership of the list, you should not reference the list again
+// after calling this function.
 func (i *imageCache) set(images []container.Image) {
        i.Lock()
        defer i.Unlock()
-       // make a copy to avoid potential data race
+       // Make a copy of the input list. This doesn't completely avoid
+       // potential data race. The caller should not reference the list again
+       // after calling this function.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 10, 2020
@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch from b4cdcd4 to 9117a9f Compare March 10, 2020 07:37
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2020
Comment on lines 122 to 124
// Make a copy of the input list. This doesn't completely avoid
// potential data race. The caller should not reference the list again
// after calling this function.
Member:

I think this comment is unnecessarily alarming... the set() method is private, and the only place we call it does not hold onto the list and change it further. The godoc on the method clarifying ownership is probably sufficient.

Member Author (roycaihw):

Thanks, removed the comment.

@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch 2 times, most recently from 235708e to 913122e Compare March 10, 2020 18:26
@liggitt (Member) commented Mar 10, 2020

Traced all setters/getters of the function, and none are modifying the slice. There is one place that is modifying slice members of the image items in the list:

names := append(image.RepoDigests, image.RepoTags...)

That should change to accumulate names in a new []string var
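For illustration (a hedged sketch of the gotcha and the suggested fix; the exact change made in the PR may differ): appending into a slice owned by the shared image object can reuse that slice's spare capacity, so the resulting names slice aliases shared data, while accumulating into a fresh slice does not.

package main

import "fmt"

func main() {
	// A RepoDigests slice with spare capacity, as a slice built by a runtime
	// client easily can have.
	repoDigests := make([]string, 1, 4)
	repoDigests[0] = "registry/app@sha256:abc"
	repoTags := []string{"registry/app:v1"}

	// Suggested pattern: accumulate into a new slice that owns its own backing
	// array, so neither the append nor later edits touch the shared fields.
	names := make([]string, 0, len(repoDigests)+len(repoTags))
	names = append(names, repoDigests...)
	names = append(names, repoTags...)
	names[0] = "safe to modify"
	fmt.Println(repoDigests[0]) // still "registry/app@sha256:abc"

	// Buggy pattern: because cap > len, append reuses repoDigests' backing
	// array, so the result aliases (and can clobber) the shared image data.
	aliased := append(repoDigests, repoTags...)
	aliased[0] = "clobbered"
	fmt.Println(repoDigests[0]) // now "clobbered"
}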

@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch from 913122e to 462b753 Compare March 10, 2020 22:34
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 10, 2020
@roycaihw (Member Author):

> That should change to accumulate names in a new []string var

Fixed that line.

For the record, I talked with @liggitt offline: making copies of the image list could be expensive given that it's read every few seconds, and I'm not sure how big these lists are (it seems they can exceed 50 entries). Since these functions are internal, for efficiency we decided not to make copies, and instead documented the functions to make it clear that callers should not reference or modify the input/output in ways that introduce a race. I also removed the unit test because get() no longer provides a "copy" guarantee (and the test didn't verify that anyway); we depend on the callers to behave well and avoid the race.
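Putting the pieces together, a sketch of what the resulting imageCache looks like, assembled from the snippets in this thread; the package name, import paths, struct definition, and exact wording of comments are assumptions and may differ from the merged code.

package images

import (
	"sort"
	"sync"

	// Assumed import paths; the real file may alias or locate these differently.
	"k8s.io/kubernetes/pkg/kubelet/container"
	"k8s.io/kubernetes/pkg/kubelet/util/sliceutils"
)

// imageCache caches the latest image list reported by the container runtime.
type imageCache struct {
	sync.Mutex
	images []container.Image
}

// set sorts the input list and updates the image cache.
// 'i' takes ownership of the list; the caller should not reference the list
// again after calling this function.
func (i *imageCache) set(images []container.Image) {
	i.Lock()
	defer i.Unlock()
	// The image list needs to be sorted when it gets read and used in
	// setNodeStatusImages. We sort the list on write instead of on read,
	// because the image cache is more often read than written.
	sort.Sort(sliceutils.ByImageSize(images))
	i.images = images
}

// get returns the cached image list. Callers must not mutate the returned
// slice (e.g. by sorting it in place); doing so would reintroduce the race.
func (i *imageCache) get() []container.Image {
	i.Lock()
	defer i.Unlock()
	return i.images
}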

@roycaihw (Member Author):

/retest

@liggitt (Member) commented Mar 11, 2020

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, Random-Liu, roycaihw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@liggitt (Member) commented Mar 11, 2020

see also #88779 for issues with the kubelet not exiting on panic

@liggitt (Member) commented Mar 11, 2020

/retest

1 similar comment
@roycaihw (Member Author):

/retest

@fejta-bot:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 562a420 into kubernetes:master Mar 11, 2020
@roycaihw roycaihw deleted the fix/image-manager-data-race branch March 11, 2020 22:36
k8s-ci-robot added a commit that referenced this pull request Mar 17, 2020
…15-upstream-release-1.15

Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid
k8s-ci-robot added a commit that referenced this pull request Mar 17, 2020
…15-upstream-release-1.16

Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid
k8s-ci-robot added a commit that referenced this pull request Mar 17, 2020
…15-upstream-release-1.17

Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid