
Fix a data race in kubelet image manager #88915

Merged

Conversation

@roycaihw (Member) commented Mar 6, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
Let the image manager return a copy of the image list. Today it sorts the internal image list cache in place and returns it to the caller directly, which introduces a data race. Because the sort is in-place, we've observed the race causing static pod workers to panic during syncPod. However, given the way the kubelet recovers from goroutine panics, the kubelet kept running while the pod worker silently stopped working: all future work dispatched for that pod was ignored.

git log shows the race was introduced in 1.10. We should cherry-pick this fix to all supported minor releases.
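For illustration only (this is not code from the PR), a minimal, self-contained sketch of the racy pattern: a cache whose get() sorts the internal slice in place and returns that same slice, so callers read it without the lock while other callers re-sort it. Running it with go run -race flags the race.

package main

import (
	"sort"
	"sync"
)

// cache mimics the buggy imageCache: get() sorts the internal slice in place
// and returns it directly, so every caller shares the same backing array.
type cache struct {
	sync.Mutex
	images []int
}

func (c *cache) get() []int {
	c.Lock()
	defer c.Unlock()
	sort.Ints(c.images) // in-place sort of the shared slice
	return c.images     // the returned slice aliases the internal one
}

func main() {
	c := &cache{images: []int{3, 1, 2}}

	go func() {
		for {
			c.get() // keeps re-sorting the shared backing array under the lock
		}
	}()

	for i := 0; i < 10000; i++ {
		list := c.get()
		_ = list[0] // unsynchronized read races with the in-place sort above
	}
}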

Sample panic log from 1.13:

runtime.go:73] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 755 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x..., 0x...)
	...k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:69 +0x...
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x..., 0x..., 0x...)
	...k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51 +0x...
panic(0x..., 0x...)
	/usr/local/go/src/runtime/panic.go:522 +0x...
k8s.io/kubernetes/pkg/kubelet/nodestatus.Images.func1(0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/nodestatus/setters.go:413 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).setNodeStatus(0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet_node_status.go:480 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).initialNode(0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet_node_status.go:348 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).GetNode(0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet_getters.go:220 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncPod(0x..., 0x..., 0x..., 0x..., 0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet.go:1666 +0x...
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).managePodLoop.func1(0x..., 0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:174 +0x...
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).managePodLoop(0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:183 +0x...
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).UpdatePod.func1(0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:221 +0x...
created by k8s.io/kubernetes/pkg/kubelet.(*podWorkers).UpdatePod
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:219 +0x...

Which issue(s) this PR fixes:

Fixes #60443

Special notes for your reviewer:
I'm going to add a proposal to SIG Node about catching/mitigating this kind of programming error in the future. In this case the issue would have been mitigated if we had either 1) actually crashed the kubelet and let it restart, or 2) created a new goroutine when the old one panicked. Restarting the kubelet is preferred IMO because it guards against a wide range of programming errors, and it's usually safe to restart the kubelet as long as restarts aren't too frequent.
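To illustrate the failure mode (a hedged sketch only, not the kubelet's actual pod-worker code): a worker goroutine whose panic is caught by a deferred recover keeps the process alive, but the worker's loop has exited and nothing restarts it, so later updates are silently dropped.

package main

import (
	"fmt"
	"time"
)

// managePodLoop stands in for a per-pod worker: it processes updates until a
// panic escapes the loop, at which point the deferred recover logs the panic
// and the goroutine exits -- nothing restarts it.
func managePodLoop(updates <-chan string) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("observed a panic:", r)
		}
	}()
	for u := range updates {
		if u == "bad" {
			var p *int
			_ = *p // simulates the nil-pointer dereference seen in syncPod
		}
		fmt.Println("synced", u)
	}
}

func main() {
	updates := make(chan string, 8)
	go managePodLoop(updates)

	updates <- "ok"
	updates <- "bad"             // kills this pod's worker goroutine
	updates <- "ignored forever" // never processed; no error is surfaced

	time.Sleep(100 * time.Millisecond)
	fmt.Println("process is still running, but the worker is gone")
}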

Does this PR introduce a user-facing change?:

Fixed a data race in kubelet image manager that can cause static pod workers to silently stop working.

/cc @Random-Liu @cheftako
/priority critical-urgent
/sig node

@k8s-ci-robot k8s-ci-robot requested a review from cheftako March 6, 2020 18:30
@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 6, 2020
@k8s-ci-robot k8s-ci-robot requested a review from Random-Liu March 6, 2020 18:30
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/kubelet labels Mar 6, 2020
@roycaihw (Member Author) commented Mar 6, 2020

I'm not sure what additional approval we need given that we're in code freeze. I will prepare the cherry-picks once this is merged into master.

@@ -548,14 +550,19 @@ func TestValidateImageGCPolicy(t *testing.T) {
}
}

//
Member:

?

Member Author (roycaihw):

Ah, I was going to add a test description but decided not to at some point. I updated the test comments and renamed the variable to expect to make things clearer.

// Make a copy of testList because cache.set takes ownership of the list
testListCopy := append([]container.Image{}, testList...)
// Expect cache.get to sort the image list
sort.Sort(sliceutils.ByImageSize(testListCopy))
cache.set(testList)
Member:

We should also test that testList is not changed after get.

Member Author (roycaihw):

It is changed; that's why the test didn't catch the bug before. Please see the comment I added for imageCache.set().

That brings up another question: should imageCache.set() make a copy of the given list? E.g.:

func (i *imageCache) set(images []container.Image) {
	i.Lock()
	defer i.Unlock()
	imagesCopy := append([]container.Image{}, images...)
	i.images = imagesCopy
}

This way the caller of imageCache.set() can safely reference images afterwards. I think it's a reasonable change, but not as critical. Currently this goroutine is the only caller of imageCache.set(), and it doesn't reference images afterwards, so the change doesn't matter in practice. Since we are going to cherry-pick this fix, I prefer keeping the change to the minimum needed; that's why I didn't change the imageCache.set() behavior here.

Member Author (roycaihw):

I moved the sort logic outside of the image cache and tested that testList is not changed after get().

@Random-Liu Random-Liu modified the milestones: v1.19, v1.18 Mar 6, 2020
@Random-Liu (Member):

This is a critical bug fix! Mark 1.18.

@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch from d03a564 to 64cda98 Compare March 6, 2020 20:59
@roycaihw (Member Author) commented Mar 6, 2020

/test pull-kubernetes-e2e-kind

@roycaihw (Member Author) commented Mar 6, 2020

/test pull-kubernetes-integration

sort.Sort(sliceutils.ByImageSize(i.images))
return i.images
// Make a copy of image list to avoid data race
images := append([]container.Image{}, i.images...)
@liggitt (Member) commented Mar 7, 2020:

How frequently is this read? Should we sort on set and return the list without sorting/allocating?

Member Author (roycaihw):

That's a good point! It's invoked in different places, so it's hard to tell from the source alone. I searched for these log messages on a master node and found that the read happens every few (<5) seconds (see the log below). Compared to the write, which happens every 30s, I agree it makes sense to sort on set.

Regarding whether we should skip the allocation on return, I'm not sure it's safe. Currently we truncate the length of the returned list, which has no effect on the original list (https://play.golang.org/p/fc6JmArJtfW); a short sketch of that slicing behavior follows the log below. But I still feel it's safer to copy on return, to prevent the original list from being mutated in the future.

I0309 21:40:27.927125    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:35.971794    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:39.926950    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:43.582650    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:44.926879    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:45.926867    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:46.104704    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:52.927557    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:53.927034    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:56.239125    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:56.926903    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:06.404693    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:08.926829    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:11.926992    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:14.926864    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:16.541316    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:19.929088    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:20.926684    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:23.926947    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:23.927801    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:26.720531    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:28.926814    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:32.926843    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:32.927004    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:34.926880    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:36.884823    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:37.926909    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:38.926790    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:39.927096    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:43.582911    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:43.927033    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:47.013040    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:54.926907    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:55.927053    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:56.927574    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:57.161981    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:02.927006    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:04.926781    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:05.927033    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:07.335855    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:13.927023    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:17.502770    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:19.927029    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:21.926896    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:27.638850    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:32.926875    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
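A minimal sketch of the slicing behavior mentioned above (mirroring the point of the linked playground; not code from the PR): truncating the caller's own slice header leaves the cached list untouched, but writing through the shared backing array is visible to every holder of the slice.

package main

import "fmt"

func main() {
	cached := []string{"img-a", "img-b", "img-c"}

	// The caller re-slices its own copy of the slice header. Length and
	// capacity live in the header, so the cache's view is unaffected.
	view := cached
	view = view[:1]
	fmt.Println(len(cached), len(view)) // 3 1

	// Writing through the shared backing array is different: it is visible
	// to (and races with) every other holder of the slice.
	view[0] = "mutated"
	fmt.Println(cached[0]) // "mutated"
}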

@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch from 07d924e to d5f1d52 Compare March 10, 2020 00:32
// The image list needs to be sorted when it gets read and used in
// setNodeStatusImages. We sort the list on write instead of on read,
// because the image cache is more often read than written
sort.Sort(sliceutils.ByImageSize(images))
Member Author (roycaihw):

I was almost going to add another copy before this line... Perhaps it's better to put the copy and the sort inside set(), to make the code more contained? @liggitt WDYT?

Member:

moving copy and sort into set makes sense to me

@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch from d5f1d52 to b4cdcd4 Compare March 10, 2020 06:25
@roycaihw (Member Author):

@liggitt @Random-Liu Thanks for the review! I've updated the change. PTAL

@Random-Liu (Member) left a comment:

/lgtm
/approve

func (i *imageCache) set(images []container.Image) {
i.Lock()
defer i.Unlock()
// make a copy to avoid potential data race
Member:

I think we can't avoid the data race, because the image list is passed in, and the caller may concurrently do some operation on it.

However, I do agree that copying it here is better. :)

Member Author (roycaihw):

Ah, you're right. I updated the comment to document that the caller should not reference the list again (similar to the comments for the client-go cache and FIFO queue):

 // set sorts the input list and updates image cache.
+// 'i' takes ownership of the list, you should not reference the list again
+// after calling this function.
 func (i *imageCache) set(images []container.Image) {
        i.Lock()
        defer i.Unlock()
-       // make a copy to avoid potential data race
+       // Make a copy of the input list. This doesn't completely avoid
+       // potential data race. The caller should not reference the list again
+       // after calling this function.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 10, 2020
@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch from b4cdcd4 to 9117a9f Compare March 10, 2020 07:37
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2020
Comment on lines 122 to 124
// Make a copy of the input list. This doesn't completely avoid
// potential data race. The caller should not reference the list again
// after calling this function.
Member:

I think this comment is unnecessarily alarming... the set() method is private, and the only place we call it does not hold onto the list and change it further. The godoc on the method clarifying ownership is probably sufficient.

Member Author (roycaihw):

Thanks, removed the comment.

@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch 2 times, most recently from 235708e to 913122e Compare March 10, 2020 18:26
@liggitt (Member) commented Mar 10, 2020

Traced all setters/getters of the function, and none are modifying the slice. There is one place that is modifying slice members of the image items in the list:

names := append(image.RepoDigests, image.RepoTags...)

That should change to accumulate names in a new []string var
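For illustration (a hedged sketch of the gotcha and the suggested fix; the exact change made in the PR may differ): appending into a slice owned by the shared image object can reuse that slice's spare capacity, so the resulting names slice aliases shared data, while accumulating into a fresh slice does not.

package main

import "fmt"

func main() {
	// A RepoDigests slice with spare capacity, as a slice built by a runtime
	// client easily can have.
	repoDigests := make([]string, 1, 4)
	repoDigests[0] = "registry/app@sha256:abc"
	repoTags := []string{"registry/app:v1"}

	// Suggested pattern: accumulate into a new slice that owns its own backing
	// array, so neither the append nor later edits touch the shared fields.
	names := make([]string, 0, len(repoDigests)+len(repoTags))
	names = append(names, repoDigests...)
	names = append(names, repoTags...)
	names[0] = "safe to modify"
	fmt.Println(repoDigests[0]) // still "registry/app@sha256:abc"

	// Buggy pattern: because cap > len, append reuses repoDigests' backing
	// array, so the result aliases (and can clobber) the shared image data.
	aliased := append(repoDigests, repoTags...)
	aliased[0] = "clobbered"
	fmt.Println(repoDigests[0]) // now "clobbered"
}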

@roycaihw roycaihw force-pushed the fix/image-manager-data-race branch from 913122e to 462b753 Compare March 10, 2020 22:34
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 10, 2020
@roycaihw (Member Author):

> That should change to accumulate names in a new []string var

Fixed that line.

For the record, I talked with @liggitt offline: making copies of the image list could be expensive given that it's read every few seconds, and I'm not sure how big these lists are (it seems they can exceed 50 entries). Since these functions are internal, for efficiency we decided not to make copies, and instead documented the functions to make it clear that callers should not reference or modify the input/output in ways that introduce a race. I also removed the unit test because get() no longer provides a "copy" guarantee (and the test didn't verify that anyway); we depend on the callers to behave well and avoid the race.
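Putting the pieces together, a sketch of what the resulting imageCache looks like, assembled from the snippets in this thread; the package name, import paths, struct definition, and exact wording of comments are assumptions and may differ from the merged code.

package images

import (
	"sort"
	"sync"

	// Assumed import paths; the real file may alias or locate these differently.
	"k8s.io/kubernetes/pkg/kubelet/container"
	"k8s.io/kubernetes/pkg/kubelet/util/sliceutils"
)

// imageCache caches the latest image list reported by the container runtime.
type imageCache struct {
	sync.Mutex
	images []container.Image
}

// set sorts the input list and updates the image cache.
// 'i' takes ownership of the list; the caller should not reference the list
// again after calling this function.
func (i *imageCache) set(images []container.Image) {
	i.Lock()
	defer i.Unlock()
	// The image list needs to be sorted when it gets read and used in
	// setNodeStatusImages. We sort the list on write instead of on read,
	// because the image cache is more often read than written.
	sort.Sort(sliceutils.ByImageSize(images))
	i.images = images
}

// get returns the cached image list. Callers must not mutate the returned
// slice (e.g. by sorting it in place); doing so would reintroduce the race.
func (i *imageCache) get() []container.Image {
	i.Lock()
	defer i.Unlock()
	return i.images
}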

@roycaihw (Member Author):

/retest

@liggitt (Member) commented Mar 11, 2020

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, Random-Liu, roycaihw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@liggitt (Member) commented Mar 11, 2020

see also #88779 for issues with the kubelet not exiting on panic

@liggitt (Member) commented Mar 11, 2020

/retest

1 similar comment
@roycaihw (Member Author):

/retest

@fejta-bot:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 562a420 into kubernetes:master Mar 11, 2020
@roycaihw roycaihw deleted the fix/image-manager-data-race branch March 11, 2020 22:36
k8s-ci-robot added a commit that referenced this pull request Mar 17, 2020
…15-upstream-release-1.15

Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid
k8s-ci-robot added a commit that referenced this pull request Mar 17, 2020
…15-upstream-release-1.16

Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid
k8s-ci-robot added a commit that referenced this pull request Mar 17, 2020
…15-upstream-release-1.17

Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid