
thin_pool_watcher.go with thin_ls in cadvisor causes devicemapper to crash #30230

Closed
jsravn opened this issue Aug 8, 2016 · 7 comments
Labels
area/controller-manager needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one.

Comments

@jsravn
Contributor

jsravn commented Aug 8, 2016

After updating to Kubernetes 1.3 and picking up the latest RHEL7 updates, which include the thin_ls tool, we started experiencing devicemapper failures on our Kubernetes nodes:

Aug 08 10:13:43 ip-10-50-185-154.internal kernel: device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 113
Aug 08 10:13:54 ip-10-50-185-154.internal kernel: device-mapper: block manager: recursive lock detected in metadata
Aug 08 10:13:54 ip-10-50-185-154.internal kernel: device-mapper: block manager: recursive acquisition of block 3 requested.
Aug 08 10:13:54 ip-10-50-185-154.internal kernel: device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -22
Aug 08 10:13:54 ip-10-50-185-154.internal kernel: XFS (dm-5): metadata I/O error: block 0x6400500 ("xlog_iodone") error 5 numblks 128
Aug 08 10:13:54 ip-10-50-185-154.internal kernel: XFS (dm-5): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c.  Return address = 0xf
Aug 08 10:13:54 ip-10-50-185-154.internal kernel: XFS (dm-5): Log I/O Error Detected.  Shutting down filesystem
Aug 08 10:13:54 ip-10-50-185-154.internal kernel: XFS (dm-5): Please umount the filesystem and rectify the problem(s)
Aug 08 10:14:01 ip-10-50-185-154.internal kernel: XFS (dm-6): Unmounting Filesystem
Aug 08 10:14:01 ip-10-50-185-154.internal kernel: device-mapper: block manager: validator mismatch (old=index vs new=btree_node) for block 138
Aug 08 10:14:01 ip-10-50-185-154.internal kernel: device-mapper: block manager: validator mismatch (old=index vs new=btree_node) for block 138
Aug 08 10:14:01 ip-10-50-185-154.internal kernel: Buffer I/O error on device dm-6, logical block 26214384
Aug 08 10:14:01 ip-10-50-185-154.internal kernel: device-mapper: block manager: validator mismatch (old=index vs new=btree_node) for block 138

#25914 updated the cadvisor version, which added thin_pool_watcher.go for watching devicemapper thin pools.

Unfortunately, due to https://bugzilla.redhat.com/show_bug.cgi?id=1286500, this thin pool watching appears to trigger an underlying kernel bug in devicemapper (by invoking reserve_metadata_snap and then thin_ls). When the bug is hit, the thin pool goes into read-only mode, causing a complete outage on the node.
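For context, the cycle cadvisor's thin pool watcher drives can be sketched roughly as below. This is a dry-run sketch that only prints the commands; the pool device name and the exact thin_ls flags are assumptions based on the cadvisor code of that era, not values taken from this thread:

```shell
# Dry-run sketch of the ThinPoolWatcher cycle: reserve a metadata
# snapshot, run thin_ls against it, then release the snapshot.
# POOL is a hypothetical device name; run() prints instead of executing.
POOL=/dev/mapper/docker-253:0-1234-pool
run() { echo "+ $*"; }

run dmsetup message "$POOL" 0 reserve_metadata_snap
run thin_ls --no-headers -m -o DEV,EXCLUSIVE_BYTES "$POOL"
run dmsetup message "$POOL" 0 release_metadata_snap
```

The reserve/release pair is what interacts badly with the kernel bug in the Red Hat report above.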

This hasn't been fixed in RHEL7 yet, and from the Red Hat bug report, I'm not sure the root cause has been identified.

A workaround is to make thin_ls inaccessible to the kubelet so it won't try to watch the thin pools.

I'm guessing this isn't a big problem for most people yet, since thin_ls is missing on most platforms and was only recently added to RHEL7 (last week). E.g. #27935

@ravilr
Contributor

ravilr commented Aug 8, 2016

google/cadvisor#1411

@ncdc the above fix needs to be cherry-picked to release-1.3.

@sjenning
Contributor

sjenning commented Aug 8, 2016

@vishh @thockin i can do the work here. just need to know what needs to be done.

  1. cherry-pick the cadvisor commit into the kube 1.3 dep tree directly
  2. bump the cadvisor version in the Godeps for 1.3

@vishh
Contributor

vishh commented Aug 8, 2016

@sjenning I think you intended to mention @timstclair

cherry-pick the cadvisor commit into the kube 1.3 dep tree directly

Yes

bump the cadvisor version in the Godeps for 1.3

k8s master needs to include the fix before it can be cherry-picked into v1.3 branch.

@sjenning
Contributor

sjenning commented Aug 8, 2016

@vishh indeed. thanks. i'll get on it.

@sjenning
Contributor

sjenning commented Aug 8, 2016

@vishh actually i need some clarity here:

  • pick fix from cadvisor/master to cadvisor/v0.23
  • get new tag for v0.23
  • bump kube master to new cadvisor v0.23 tag
  • bump kube v1.3 cadvisor to new tag? or
  • pick commit into kube 1.3 vendor tree directly?
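A dry-run sketch of the flow those bullets describe (the repo path, the fix SHA, and the new tag name are all placeholders, not values from this thread):

```shell
# Dry-run: print each step of the pick/tag/bump flow instead of executing it.
run() { echo "+ $*"; }
FIX_SHA=abc1234                   # hypothetical cadvisor fix commit
run git -C cadvisor checkout v0.23
run git -C cadvisor cherry-pick "$FIX_SHA"
run git -C cadvisor tag v0.23.x   # hypothetical new tag on the v0.23 branch
run godep update github.com/google/cadvisor/...  # bump the vendored copy
```

The open question in the bullets is only the last step: whether the v1.3 branch takes the new cadvisor tag via its Godeps, or the commit is picked into its vendor tree directly.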

@spiffxp
Member

spiffxp commented Jun 23, 2017

/assign
closing, this was fixed by #30307

@spiffxp
Member

spiffxp commented Jun 23, 2017

/close


8 participants