Kernel Monitor: Add look back support and kernel panic handling #22

Random-Liu · 2016-06-24T06:33:11Z

This PR:

1) Adds lookback support to the kernel monitor. After it starts, the kernel monitor checks older log entries to detect problems that happened before the last node reboot.
2) Adds `lookback` and `startPattern` to the kernel monitor configuration.
   * `lookback` specifies how far back in time the kernel monitor should look.
   * `startPattern` specifies which log line indicates that the node has started. The kernel monitor clears all current node conditions once it finds a node start log, so that old problems don't change the node condition.
3) Adds support for kernel panic monitoring. Null pointer and divide-by-zero kernel panics are surfaced as events. The kernel monitor usually reports these events during the look-back phase.
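To illustrate the panic detection described above, here is a minimal sketch in Python (not the project's actual implementation). The reason name `KernelPanic` and the divide-error text are taken from the event quoted later in this thread; the NULL-pointer pattern, the `PANIC_RULES` table, and the `match_panic` helper are assumptions made for illustration.

```python
import re

# Hypothetical rule table. "KernelPanic" and the divide-error wording come
# from the event quoted in this thread; the NULL-pointer pattern is an
# assumption about how the kernel words that message.
PANIC_RULES = [
    ("KernelPanic", re.compile(r"BUG: unable to handle kernel NULL pointer dereference")),
    ("KernelPanic", re.compile(r"divide error: 0000 \[#\d+\] SMP")),
]


def match_panic(message):
    """Return (reason, message) if the log message matches a panic rule, else None."""
    for reason, pattern in PANIC_RULES:
        if pattern.search(message):
            return reason, message
    return None
```

A matching line would be surfaced as an event rather than as a node condition, since a panic describes something that already happened rather than a state the node is still in.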

I cut a v0.1 branch before this PR, and this PR bumps the image version to v0.2.
@dchen1107
/cc @kubernetes/sig-node

Random-Liu · 2016-08-19T21:25:51Z

I've updated and verified the PR. Since this PR adds look-back support, if there was a kernel panic during the previous run, node problem detector will report a node event after reboot similar to:

  10s   10s 1   {kernel-monitor e2e-test-lantaol-minion-group-29qf}     Warning KernelPanic divide error: 0000 [#1] SMP

During look back, node problem detector will not keep historical node conditions (since the node has rebooted), but will report historical node events.
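The look-back behavior described above can be sketched roughly as follows. This is an illustrative Python model, not the project's code: the `replay` function, the `(timestamp, message)` entry shape, and the example rule names and patterns are all assumptions.

```python
import re
from datetime import datetime, timedelta


def replay(entries, now, lookback, start_pattern, event_rules, condition_rules):
    """Replay old log entries and return (conditions, events).

    entries: iterable of (timestamp, message) in chronological order.
    event_rules / condition_rules: lists of (name, regex) pairs (hypothetical shape).
    """
    cutoff = now - lookback
    start_re = re.compile(start_pattern)
    conditions = {}
    events = []
    for timestamp, message in entries:
        if timestamp < cutoff:
            continue  # older than the look-back window
        if start_re.search(message):
            # Node start log: conditions from before the reboot no longer apply.
            conditions.clear()
            continue
        for reason, pattern in event_rules:
            if re.search(pattern, message):
                events.append((reason, message))  # events are kept across the reboot
        for condition, pattern in condition_rules:
            if re.search(pattern, message):
                conditions[condition] = message
    return conditions, events
```

In this model a panic logged before the node start line still shows up in `events`, while any condition set before that line is dropped.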

@dchen1107

dchen1107 · 2016-08-20T03:57:54Z

config/kernel-monitor.json

@@ -1,5 +1,7 @@
 {
 	"logPath": "/log/kern.log",
+	"lookback": "1h",


Looking back 1h might be too long for finding panic information. We only care about the most recent panic, right? Maybe change it to 5m or 10m?
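To make the trade-off between `1h`, `5m`, and `10m` concrete, here is a small Python sketch of how a `lookback` value could bound the scanned window. The `parse_lookback` and `within_lookback` helpers are hypothetical, and the accepted format (an integer followed by `s`, `m`, or `h`) is an assumption based on the values mentioned here.

```python
import re
from datetime import datetime, timedelta

# Assumed unit suffixes; only the forms that appear in this thread are covered.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_lookback(value):
    """Parse a lookback string such as "1h" or "5m" into a timedelta."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if match is None:
        raise ValueError(f"unsupported lookback value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def within_lookback(timestamp, now, lookback):
    """Return True if a log entry at `timestamp` falls inside the look-back window."""
    return timestamp >= now - parse_lookback(lookback)
```

Under these assumptions, a panic logged 7 minutes before the monitor starts would be picked up with `1h` or `10m` but missed with `5m`, so the value needs to cover at least the time from the panic to the monitor starting after reboot.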


Random-Liu · 2016-08-21T02:11:52Z

@dchen1107 Addressed comments.

dchen1107 · 2016-08-24T00:13:54Z

LGTM

Random-Liu added the enhancement label Jun 24, 2016

Random-Liu assigned dchen1107 Jun 24, 2016

Random-Liu mentioned this pull request Jun 24, 2016

Kernel panic when having a privileged container with docker >= 1.10 kubernetes/kubernetes#27885

Closed

Random-Liu mentioned this pull request Jul 6, 2016

Document known issues for v1.3.0 kubernetes/kubernetes#28403

Merged

dchen1107 mentioned this pull request Aug 19, 2016

Integrate node-problem-detector with e2e test infrastructure kubernetes/kubernetes#30811

Closed

Random-Liu force-pushed the add-look-back branch from f398ef0 to 43bd62d August 19, 2016 21:24

dchen1107 reviewed Aug 20, 2016

Random-Liu added this to the Kubernetes v1.4 milestone Aug 21, 2016

Random-Liu force-pushed the add-look-back branch from 43bd62d to 532f933 August 21, 2016 02:11

dchen1107 merged commit ea83111 into kubernetes:master Aug 24, 2016

Random-Liu deleted the add-look-back branch January 11, 2017 08:18
