Fix a data race in kubelet image manager #88915
Conversation
I'm not sure what additional approval we need given it's code freeze. I will prepare the cherrypicks once it's merged in master.
@@ -548,14 +550,19 @@ func TestValidateImageGCPolicy(t *testing.T) {
	}
}

//
?
Ah, I was going to add a test description but decided not to at some point. I updated the test comments and renamed the variable to `expect` to make things more clear.
// Make a copy of testList because cache.set takes ownership of the list
testListCopy := append([]container.Image{}, testList...)
// Expect cache.get to sort the image list
sort.Sort(sliceutils.ByImageSize(testListCopy))
cache.set(testList)
We should also test that `testList` is not changed after `get`.
It is changed. That's why the test didn't catch the bug before. Please see the comment I added for `imageCache.set()`.

It brings up another question: should `imageCache.set()` make a copy of the given list? E.g.

```go
func (i *imageCache) set(images []container.Image) {
	i.Lock()
	defer i.Unlock()
	imagesCopy := append([]container.Image{}, images...)
	i.images = imagesCopy
}
```

This way the caller of `imageCache.set()` can safely reference `images` afterwards. I think it's a reasonable change but not as critical. Currently this goroutine is the only caller of `imageCache.set()` and it doesn't reference `images` afterwards, so the change doesn't matter. Since we are going to cherrypick this fix, I prefer keeping the minimum change needed. That's why I didn't change the `imageCache.set()` behavior here.
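To make the failure mode concrete, here is a self-contained sketch of the pre-fix behavior. The types below (`Image`, `byImageSize`, `imageCache`) are simplified stand-ins for the real `kubecontainer.Image` and `sliceutils.ByImageSize`, not the actual kubelet code:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Image is a simplified stand-in for kubecontainer.Image.
type Image struct {
	ID   string
	Size int64
}

// byImageSize sorts images from largest to smallest, mimicking
// sliceutils.ByImageSize.
type byImageSize []Image

func (a byImageSize) Len() int           { return len(a) }
func (a byImageSize) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a byImageSize) Less(i, j int) bool { return a[i].Size > a[j].Size }

type imageCache struct {
	sync.Mutex
	images []Image
}

// set stores the caller's slice directly, as the original code did.
func (c *imageCache) set(images []Image) {
	c.Lock()
	defer c.Unlock()
	c.images = images
}

// get mirrors the buggy version: it sorts the internal slice in place
// and returns it directly to the caller.
func (c *imageCache) get() []Image {
	c.Lock()
	defer c.Unlock()
	sort.Sort(byImageSize(c.images))
	return c.images
}

func main() {
	testList := []Image{{"a", 1}, {"b", 3}, {"c", 2}}
	cache := &imageCache{}
	cache.set(testList)
	_ = cache.get()
	// get() sorted the shared backing array in place, so the caller's
	// testList was reordered too.
	fmt.Println(testList[0].ID) // prints "b"
}
```

Run under `go run -race` with a concurrent reader, this same sharing is what the race detector flags.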
I moved the sort logic outside of the image cache and tested that `testList` is not changed after `get`.
This is a critical bug fix! Mark 1.18.
/test pull-kubernetes-e2e-kind
/test pull-kubernetes-integration
sort.Sort(sliceutils.ByImageSize(i.images))
return i.images
// Make a copy of image list to avoid data race
images := append([]container.Image{}, i.images...)
How frequently is this read? Should we sort on set and return the list without sorting/allocating?
That's a good point! It's invoked in different places, so it's hard to tell from the source. I searched for this and this log message on a master node and found the read is invoked every few (<5) seconds (see the log below). Compared to the write getting invoked every 30s, I agree it makes sense to sort on set.

Regarding whether we should skip allocating on return, I'm not sure it's safe to do. Currently we truncate the length of the returned list, which has no effect on the original list (https://play.golang.org/p/fc6JmArJtfW). But I still feel it's safer if we copy on return, to prevent the original list getting mutated in future.
```
I0309 21:40:27.927125 1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach
I0309 21:40:35.971794 1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach
I0309 21:40:39.926950 1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach
I0309 21:40:43.582650 1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach
I0309 21:40:44.926879 1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach
I0309 21:40:45.926867 1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach
... (same message repeats every few seconds through 21:42:32)
I0309 21:42:32.926875 1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach
```
07d924e
to
d5f1d52
Compare
// The image list needs to be sorted when it gets read and used in
// setNodeStatusImages. We sort the list on write instead of on read,
// because the image cache is more often read than written.
sort.Sort(sliceutils.ByImageSize(images))
I was almost going to add another copy before this line... Perhaps it's better to put the copy and the sort inside `set()`, to make the code more contained? @liggitt WDYT?
moving copy and sort into set makes sense to me
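A runnable sketch of that shape, with the copy and sort contained in `set()` so that `get()` never sorts or allocates. The types here (`Image`, `byImageSize`) are simplified stand-ins for the real `kubecontainer.Image` and `sliceutils.ByImageSize`:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Image is a simplified stand-in for kubecontainer.Image.
type Image struct {
	ID   string
	Size int64
}

// byImageSize sorts images from largest to smallest, mimicking
// sliceutils.ByImageSize.
type byImageSize []Image

func (a byImageSize) Len() int           { return len(a) }
func (a byImageSize) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a byImageSize) Less(i, j int) bool { return a[i].Size > a[j].Size }

type imageCache struct {
	sync.Mutex
	images []Image
}

// set copies the input, sorts the copy once on write, and stores it.
// Readers then never need to sort or allocate, which fits a cache
// that is read far more often than it is written.
func (c *imageCache) set(images []Image) {
	c.Lock()
	defer c.Unlock()
	sorted := append([]Image{}, images...)
	sort.Sort(byImageSize(sorted))
	c.images = sorted
}

// get simply returns the already-sorted stored list.
func (c *imageCache) get() []Image {
	c.Lock()
	defer c.Unlock()
	return c.images
}

func main() {
	in := []Image{{"a", 1}, {"b", 3}, {"c", 2}}
	cache := &imageCache{}
	cache.set(in)
	got := cache.get()
	// The cache sorted its own copy; the caller's input is untouched.
	fmt.Println(got[0].ID, in[0].ID) // prints "b a"
}
```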
@liggitt @Random-Liu Thanks for the review! I've updated the change. PTAL
/lgtm
/approve
func (i *imageCache) set(images []container.Image) {
	i.Lock()
	defer i.Unlock()
	// make a copy to avoid potential data race
I think we can't avoid the data race, because the image list is passed in, and the caller may concurrently do some operation on it.
However, I do agree that copying it here is better. :)
Ah, you're right. I updated the comment to document that the caller should not reference the list again (similar to the comments for the client-go cache and fifo queue):

```diff
 // set sorts the input list and updates image cache.
+// 'i' takes ownership of the list, you should not reference the list again
+// after calling this function.
 func (i *imageCache) set(images []container.Image) {
 	i.Lock()
 	defer i.Unlock()
-	// make a copy to avoid potential data race
+	// Make a copy of the input list. This doesn't completely avoid
+	// potential data race. The caller should not reference the list again
+	// after calling this function.
```
// Make a copy of the input list. This doesn't completely avoid
// potential data race. The caller should not reference the list again
// after calling this function.
I think this comment is unnecessarily alarming... the set() method is private, and the only place we call it does not hold onto the list and change it further. The godoc on the method clarifying ownership is probably sufficient.
Thanks. Removed the comment.
Traced all setters/getters of the function, and none are modifying the slice. There is one place that is modifying slice members of the image items in the list: kubernetes/pkg/kubelet/nodestatus/setters.go, line 449 (at c4a7d3c).

That should change to accumulate names in a new
Fixed that line. For the record, I talked with @liggitt offline: making copies of the image list could be expensive given that it's done every few seconds, and I'm not sure how big these lists are (it seems they can be greater than 50). Since these functions are internal, for efficiency we decided not to make copies, and to document these functions to make it clear that callers should not reference/modify the input/output and introduce a race. I also removed the unit test because get() no longer provides a "copy" guarantee (and the test didn't ensure that); we depend on the callers to behave well and avoid the race.
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, Random-Liu, roycaihw

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
See also #88779 for issues with the kubelet not exiting on panic.
/retest
/retest
/retest
…15-upstream-release-1.15 Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid
…15-upstream-release-1.16 Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid
…15-upstream-release-1.17 Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid
What type of PR is this?
/kind bug
What this PR does / why we need it:
Let the image manager return a copy of the image list. Today it sorts and returns the internal image list cache to the caller directly, which introduces a data race. Because the sort is in-place, we've observed the race causing static pod workers to panic during syncPod. However, given the way kubelet recovers from goroutine panics, kubelet would keep running while the pod worker silently stopped working: all future work dispatched for this pod got ignored.
`git log` shows the race was introduced in 1.10. We should cherrypick this fix to all the supported minor releases.
Sample panic log from 1.13:
Which issue(s) this PR fixes:
Fixes #60443
Special notes for your reviewer:
I'm going to add a proposal to SIG Node about catching/mitigating this kind of programming error in the future. In this case the issue would have been mitigated if we either 1) actually crashed kubelet and let it restart, or 2) created a new goroutine when the old one panicked. Restarting kubelet is preferred IMO because it guards against a wide range of programming errors, and it's usually safe to restart kubelet as long as it's not too frequent.
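A toy illustration of the failure mode described above. This is a hypothetical worker, not kubelet's actual pod worker code: a deferred recover keeps the process alive, but the worker loop is gone afterwards, so later work is silently ignored:

```go
package main

import "fmt"

// worker processes items until one makes it panic. The deferred
// recover (analogous to kubelet's crash handling) stops the panic
// from killing the process, but the loop itself never resumes.
func worker(work chan string, done chan struct{}) {
	defer close(done)
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	for w := range work {
		if w == "bad" {
			panic("sort on shared slice corrupted state")
		}
		fmt.Println("handled:", w)
	}
}

func main() {
	work := make(chan string, 3)
	done := make(chan struct{})
	go worker(work, done)

	work <- "ok"
	work <- "bad"
	work <- "never handled" // stays in the channel: the worker is dead

	<-done
	// The process is still running, but one item was never processed.
	fmt.Println("process still running; unhandled items:", len(work))
}
```

The demo shows why option 1 (crash and restart) is attractive: the silent half-alive state is worse than a visible restart.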
Does this PR introduce a user-facing change?:
/cc @Random-Liu @cheftako
/priority critical-urgent
/sig node