[BUG] Single-node Harvester rancher-monitoring-prometheus enter "CrashLoopBackOff" due to "reloadBlocks: corrupted block" #2092
Describe the bug
On a single-node Harvester cluster, rancher-monitoring-prometheus enters "CrashLoopBackOff" due to "reloadBlocks: corrupted block".
The pod is in CrashLoopBackOff:
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 2/3 CrashLoopBackOff 7 (19s ago) 11m
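(For reference, the pod status above can be listed with a command along these lines; the exact invocation is an assumption, not taken from the original report.)
$ kubectl get pods --all-namespaces | grep prometheus-rancher-monitoring-prometheus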
Error message:
level=error ts=2022-03-27T10:13:16.342Z caller=main.go:917 err="opening storage failed: reloadBlocks: corrupted block 01FZ5AABWQWCVRSVJW5T3NSRN8: read TOC: read TOC: invalid checksum"
The pod log:
$ kubectl logs prometheus-rancher-monitoring-prometheus-0 -n cattle-monitoring-system
level=info ts=2022-03-27T10:13:16.301Z caller=main.go:443 msg="Starting Prometheus" version="(version=2.28.1, branch=HEAD, revision=b0944590a1c9a6b35dc5a696869f75f422b107a1)"
level=info ts=2022-03-27T10:13:16.301Z caller=main.go:448 build_context="(go=go1.16.5, user=root@2915dd495090, date=20210701-15:20:10)"
level=info ts=2022-03-27T10:13:16.301Z caller=main.go:449 host_details="(Linux 5.3.18-150300.59.54-default #1 SMP Sat Mar 5 10:00:50 UTC 2022 (1d0fa95) x86_64 prometheus-rancher-monitoring-prometheus-0 )"
level=info ts=2022-03-27T10:13:16.301Z caller=main.go:450 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2022-03-27T10:13:16.301Z caller=main.go:451 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-03-27T10:13:16.303Z caller=web.go:541 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-03-27T10:13:16.303Z caller=main.go:824 msg="Starting TSDB ..."
level=info ts=2022-03-27T10:13:16.303Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1647894096390 maxt=1647900000000 ulid=01FYRN5TFMFV7QVG8YM0CYRXHF
level=info ts=2022-03-27T10:13:16.303Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1647947475362 maxt=1647950400000 ulid=01FYS6FFMHN1MKV1F15P6RCVBH
level=info ts=2022-03-27T10:13:16.303Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1647871292136 maxt=1647885600000 ulid=01FYS6FHDE63ARAMNR0Y5STWCG
level=info ts=2022-03-27T10:13:16.303Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1647965617618 maxt=1647972000000 ulid=01FYV6T3Q72CPX30TQBP4VDDZA
level=info ts=2022-03-27T10:13:16.303Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648033064346 maxt=1648036800000 ulid=01FYVH3PCKEVGCR4F72YZWRFF3
level=info ts=2022-03-27T10:13:16.303Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648066035362 maxt=1648072800000 ulid=01FYYRE076DT1PV72EBMYXA1AN
level=info ts=2022-03-27T10:13:16.303Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648152215687 maxt=1648159200000 ulid=01FZ0B1V3C3YFCSZY3CVHFHZBR
level=info ts=2022-03-27T10:13:16.303Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648036800065 maxt=1648051200000 ulid=01FZ0B1WFMBD7WH76B5AMBYJZ4
level=info ts=2022-03-27T10:13:16.303Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648205293767 maxt=1648209600000 ulid=01FZ0NBEPVF461S7V8C08C7PR5
level=info ts=2022-03-27T10:13:16.304Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648209600000 maxt=1648216800000 ulid=01FZ106SPRX465W7ZSPGJRPX85
level=info ts=2022-03-27T10:13:16.304Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648216800107 maxt=1648224000000 ulid=01FZ10AJN1CRHGY4QHX56S4ETE
level=info ts=2022-03-27T10:13:16.304Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648224000000 maxt=1648231200000 ulid=01FZ5AABWQWCVRSVJW5T3NSRN8
level=info ts=2022-03-27T10:13:16.304Z caller=tls_config.go:227 component=web msg="TLS is disabled." http2=false
level=info ts=2022-03-27T10:13:16.341Z caller=main.go:697 msg="Stopping scrape discovery manager..."
level=info ts=2022-03-27T10:13:16.341Z caller=main.go:711 msg="Stopping notify discovery manager..."
level=info ts=2022-03-27T10:13:16.341Z caller=main.go:733 msg="Stopping scrape manager..."
level=info ts=2022-03-27T10:13:16.341Z caller=main.go:707 msg="Notify discovery manager stopped"
level=info ts=2022-03-27T10:13:16.342Z caller=main.go:693 msg="Scrape discovery manager stopped"
level=info ts=2022-03-27T10:13:16.342Z caller=main.go:727 msg="Scrape manager stopped"
level=info ts=2022-03-27T10:13:16.342Z caller=manager.go:934 component="rule manager" msg="Stopping rule manager..."
level=info ts=2022-03-27T10:13:16.342Z caller=manager.go:944 component="rule manager" msg="Rule manager stopped"
level=info ts=2022-03-27T10:13:16.342Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
level=info ts=2022-03-27T10:13:16.342Z caller=main.go:908 msg="Notifier manager stopped"
level=error ts=2022-03-27T10:13:16.342Z caller=main.go:917 err="opening storage failed: reloadBlocks: corrupted block 01FZ5AABWQWCVRSVJW5T3NSRN8: read TOC: read TOC: invalid checksum"
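For reference, the corrupted block named in the error is a directory on the Prometheus data volume whose name is the block's ULID. A minimal sketch for locating it, assuming the default /prometheus data path and the container name prometheus (exec may only succeed during the short window while the crashing container is up between restarts):
$ kubectl -n cattle-monitoring-system exec prometheus-rancher-monitoring-prometheus-0 -c prometheus -- ls -l /prometheus
$ kubectl -n cattle-monitoring-system exec prometheus-rancher-monitoring-prometheus-0 -c prometheus -- ls -l /prometheus/01FZ5AABWQWCVRSVJW5T3NSRN8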
To Reproduce
Steps to reproduce the behavior:
On a KVM-based single-node Harvester cluster (set up for debugging [BUG] unstable deployment of POD prometheus-0 #2013), the VM was power-cycled via "Shut Down -> Force Off" followed by "Run" from "Virtual Machine Manager" several times a day; a rough command-line equivalent is sketched below.
This issue was encountered once.
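A rough command-line equivalent of the GUI power cycle above, as a sketch only (<vm-name> is a placeholder for the libvirt domain of the Harvester node VM):
$ virsh destroy <vm-name>   # hard power-off, equivalent to "Force Off"
$ virsh start <vm-name>     # equivalent to "Run"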
Expected behavior
Not sure:
- Will this also happen in a multi-node Harvester cluster?
- If it does happen, will the system recover automatically? If not, how can this be improved, or how can it be recovered?
- When the user needs to recover manually, what are the steps? (A possible approach is sketched after this list.)
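One possible manual recovery path, as a sketch only: move or delete the corrupted block directory so Prometheus can start again, accepting that the samples in that block's time range are lost. The /prometheus data path and the prometheus container name are assumptions based on the default rancher-monitoring layout, and exec may only work while the crashing container is briefly running; otherwise the underlying volume would need to be mounted and cleaned some other way.
$ kubectl -n cattle-monitoring-system exec prometheus-rancher-monitoring-prometheus-0 -c prometheus -- rm -rf /prometheus/01FZ5AABWQWCVRSVJW5T3NSRN8
$ kubectl -n cattle-monitoring-system delete pod prometheus-rancher-monitoring-prometheus-0
After the pod is recreated, Prometheus should no longer see the corrupted block and should start normally.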
Support bundle
Environment:
- Harvester ISO version: V1.0.1 master-head
- Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): KVM-based single-node cluster
Additional context