Fix startup type error in initializeCaches #28002

Merged
merged 1 commit into from
Jun 25, 2016

asalkeld

The following error was getting logged:
PersistentVolumeController can't initialize caches, expected list of volumes, got:
&{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink:/api/v1/persistentvolumes ResourceVersion:11} Items:[]}

The tests make extensive use of NewFakeControllerSource, which returns api.List
instead of api.PersistentVolumeList. So use reflect to iterate over the
items, then assert the type of each item.
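
A minimal sketch of the reflect-based iteration described above, not the code from this patch. The types PersistentVolume, PersistentVolumeList and GenericList and the helper listItems are hypothetical stand-ins for api.PersistentVolume, api.PersistentVolumeList and api.List; the only assumption is that both list kinds expose an exported Items slice.

```go
package main

import (
	"fmt"
	"reflect"
)

// Hypothetical stand-ins for the real API types (assumptions for illustration).
type PersistentVolume struct{ Name string }

// Typed list: items are stored by value.
type PersistentVolumeList struct{ Items []PersistentVolume }

// Generic list: items are stored as interfaces (pointers in the fake source).
type GenericList struct{ Items []interface{} }

// listItems uses reflection to read the Items field of any list-like struct,
// so the caller does not need to know the concrete list type up front.
func listItems(listObj interface{}) ([]interface{}, error) {
	// Indirect dereferences a pointer; non-pointer values pass through unchanged.
	listVal := reflect.Indirect(reflect.ValueOf(listObj))
	if listVal.Kind() != reflect.Struct {
		return nil, fmt.Errorf("expected a list struct, got %T", listObj)
	}
	itemsVal := listVal.FieldByName("Items")
	if !itemsVal.IsValid() || itemsVal.Kind() != reflect.Slice {
		return nil, fmt.Errorf("%T has no Items slice", listObj)
	}
	items := make([]interface{}, 0, itemsVal.Len())
	for i := 0; i < itemsVal.Len(); i++ {
		items = append(items, itemsVal.Index(i).Interface())
	}
	return items, nil
}

func main() {
	fake := &GenericList{Items: []interface{}{&PersistentVolume{Name: "pv-a"}}}
	typed := &PersistentVolumeList{Items: []PersistentVolume{{Name: "pv-b"}}}
	for _, obj := range []interface{}{fake, typed} {
		items, err := listItems(obj)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		for _, item := range items {
			// The dynamic type differs between the two list kinds.
			fmt.Printf("%T\n", item)
		}
	}
}
```

The caller still has to assert the type of each returned item. In this sketch the generic list yields pointer items and the typed list yields value items, so a single type assertion cannot cover both.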

fixes #27757

@k8s-bot

k8s-bot commented Jun 24, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Jun 24, 2016
@saad-ali saad-ali assigned jsafrane and unassigned saad-ali Jun 24, 2016
@jsafrane
Member

@k8s-bot ok to test

@jsafrane jsafrane added sig/storage Categorizes an issue or PR as relevant to SIG Storage. team/cluster release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Jun 24, 2016
return
}
for _, volume := range volumeList.Items {
volumeListVal := reflect.Indirect(reflect.ValueOf(volumeListObj))
Member

Please add a comment explaining why reflect is needed: both List and PersistentVolumeList need to be processed.

@jsafrane
Member

@k8s-bot e2e test this issue: #IGNORE

@jsafrane
Member

LGTM

@jsafrane
Member

@k8s-bot e2e test this issue: #IGNORE

@jsafrane jsafrane added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 24, 2016
@jsafrane jsafrane added this to the v1.3 milestone Jun 24, 2016
@jsafrane
Member

jsafrane commented Jun 24, 2016

Labeling for 1.3: this fixes controller initialization after a crash. Without this patch the controller may release (and recycle!) bound volumes!

@jsafrane jsafrane removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 24, 2016
@jsafrane
Member

@asalkeld, please squash your commits into one and I'll merge it.

@k8s-bot

k8s-bot commented Jun 24, 2016

GCE e2e build/test passed for commit 035f9e6a916f5b80fa5fb5938aa2d301ea86302e.

@jsafrane
Member

Ok, so @asalkeld is probably sleeping by now. @saad-ali, can you please coordinate with him to either squash the patches or merge them as they are? I'd really like to have them in Kube 1.3. This patch won't break the volume controller any more than it already is.

Nobody noticed this error earlier because the volume sources behave differently in unit tests, and we don't really test controller startup with existing PVs/PVCs anywhere else. I'll write an integration test for this.

@idvoretskyi
Member

@jsafrane we have enough time (~7-8 hours) to wait for @asalkeld to squash the commits, and then we'll be ready to merge this item (I'm against merging as-is).

@jsafrane
Member

I'm testing the patch properly and it seems it does not work as expected. The problem is that volumeSource.List in unit tests returns api.List with *api.PersistentVolume items, while the real controller (and the informer below) returns api.PersistentVolumeList with api.PersistentVolume items, i.e. the items are not pointers.

volume, ok := volumeListItems.Index(i).Interface().(*api.PersistentVolume) will then fail.
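
One possible way to tolerate both item shapes is a type switch that accepts either a pointer or a value. This is only a sketch under stated assumptions, not necessarily how the problem was eventually fixed: PersistentVolume is a hypothetical stand-in for api.PersistentVolume and asVolume is a made-up helper name.

```go
package main

import "fmt"

// Hypothetical stand-in for api.PersistentVolume (assumption for illustration).
type PersistentVolume struct{ Name string }

// asVolume normalizes a list item to *PersistentVolume. It accepts the pointer
// items produced by a fake source as well as the value items of a typed list.
func asVolume(item interface{}) (*PersistentVolume, bool) {
	switch v := item.(type) {
	case *PersistentVolume:
		// Reject a typed nil pointer so callers never dereference nil.
		return v, v != nil
	case PersistentVolume:
		// v is a copy of the value; taking its address is safe.
		return &v, true
	default:
		return nil, false
	}
}

func main() {
	items := []interface{}{
		&PersistentVolume{Name: "from-fake-source"},
		PersistentVolume{Name: "from-informer"},
		"not-a-volume",
	}
	for _, item := range items {
		if volume, ok := asVolume(item); ok {
			fmt.Println("volume:", volume.Name)
		} else {
			fmt.Printf("skipping unexpected item of type %T\n", item)
		}
	}
}
```

A plain assertion to *PersistentVolume would succeed only for the first item here; the switch also handles the second and reports the third as unexpected instead of failing the whole initialization.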

@eparis eparis added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 24, 2016
@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented Jun 25, 2016

GCE e2e build/test passed for commit b4f7e67.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 1effc5a into kubernetes:master Jun 25, 2016
@lavalamp
Member

Would it be a good idea for the PVC controller, when it sees a dangling reference, to do a cache-bypassing read to confirm the dangling reference, before recycling?

Possibly. Need to understand the locking mechanism to say for sure.

@asalkeld asalkeld deleted the init-cache-error branch June 25, 2016 23:13
@jsafrane
Member

Since the PVCs and PVs are watched using separate watch operations, it seems possible that one watch could fall behind the other by an arbitrary amount of time, due to delays in various parts of the system. So, isn't it possible for the controller to get in this same kind of state, even after syncing?

It's not likely. Once the controller has a complete view of the world (i.e. after initial startup), it can receive events from PVs and PVCs at different paces, as long as the controller is the only one that binds volumes. Bad things can happen when a bound PV appears and the controller does not see the corresponding PVC. The PV may get recycled or deleted, which is exactly what happened in this bug.

@asalkeld, @lavalamp thanks for your patches and reviews!

@erictune
Member

If events from PVs and PVCs arrive at very different paces, doesn't that mean a bound PV could show up before the create for a PVC ever shows up?

@erictune erictune added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jun 27, 2016
@lavalamp
Member

@erictune no, because the controller must have observed the PVC in order to make the binding in the first place. (Assumption: there's only one controller in operation, and no other actor can cause a binding to occur.)

@jsafrane I recall discussing locking methods to make all of these operations safe, even with multiple controllers, with a group of Red Hatters and Googlers a few months ago. I take it that has not been implemented yet?

@eparis eparis mentioned this pull request Jun 27, 2016
@jsafrane
Member

@lavalamp, even with the locking suggested at the meeting, we were able to find races when two controllers were trying to bind two PVs/PVCs at the same time.

eparis pushed a commit to eparis/kubernetes that referenced this pull request Jun 29, 2016
Automatic merge from submit-queue

Fix startup type error in initializeCaches

The following error was getting logged:
PersistentVolumeController can't initialize caches, expected list of volumes, got:
&{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink:/api/v1/persistentvolumes ResourceVersion:11} Items:[]}

The tests make extensive use of NewFakeControllerSource which uses api.List
instead of api.PersistentVolumeList. So use reflect to help iterate over the
items then assert the item type.

fixes kubernetes#27757
(cherry picked from commit 1effc5a)
k8s-github-robot pushed a commit that referenced this pull request Jun 29, 2016
Automatic merge from submit-queue

Batch update for 1.3

#28030: Revert "Federation e2e supports aws"
#28026: Address outstanding review comments in #27999.
#28034: Adding lock files for kubeconfig updating
#28004: return nil from NewClientConfig instead of empty struct
#28032: Increase pod CPU/memory for fluentd, dns and kube-proxy.
#27208: Bump minimum API version for docker to 1.21
#28061: Remove extra double quotes in --federations.
#28060: rkt: Fix the 'privileged' check when stage1 annotation is provided.
#27996: Image GC logic should compensate for reserved blocks
#28044: rkt: Bump required rkt version to 1.9.1.
#28040: Tracked addition of federation, sed support in kube DNS
#28043: Set grace period to 0 when deleting namespaces after the test.
#28002: Fix startup type error in initializeCaches
#28087: Hotfix: Fixup the hyperkube dns manifest from a breaking federation PR
#28108: Fix initialization of volume controller caches.
#28056: Increase kube-dns requirements on CoreOS.
#28147: Fix error checks after cloning.
#28159: Use : as seccomp security option operator for Docker 1.10
#28165: Refactored, expanded and fixed federated-services e2e tests.
#28095: Kubelet should mark VolumeInUse before checking if it is Attached
#28172: Build: Add KUBE_GCS_RELEASE_BUCKET_MIRROR option to push-ci-build.sh
#28207: Bump cluster autoscaler to 0.2.2
#28160: Volume manager must verify containers terminated before deleting for ungracefully terminated pods
#28211: Fix federation e2e tests by correctly managing cluster clients
#27944: Fix pvc label selector validation error
#28186: federation: Upgrading the groupversion to v1beta1
@k8s-cherrypick-bot

Commit found in the "release-1.3" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.

jsafrane added a commit to jsafrane/kubernetes that referenced this pull request Jul 11, 2016
Tests kubernetes#28002 with real etcd (unit tests have a fake one with different
behavior).
k8s-github-robot pushed a commit that referenced this pull request Aug 8, 2016
Automatic merge from submit-queue

Add integration test for volume controller startup.

Tests #28002 with real etcd (unit tests have a fake one with different behavior).

@kubernetes/sig-storage
Labels
cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
release-note-none: Denotes a PR that doesn't merit a release note.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Error at startup "PersistentVolumeController can't initialize caches ..."