Kernel Monitor: Add look back support and kernel panic handling #22

Merged
dchen1107 merged 1 commit into kubernetes:master from add-look-back
Aug 24, 2016


@Random-Liu Random-Liu commented Jun 24, 2016

This helps kubernetes/kubernetes#27885.

This PR:

  1. Adds lookback support to the kernel monitor. After it starts, the kernel monitor checks older logs to detect problems that happened before the last node reboot.

  2. Adds lookback and startPattern to the kernel monitor configuration.

  • lookback specifies how far back in time the kernel monitor should look.
  • startPattern specifies which log line indicates that the node has started. The kernel monitor clears all current node conditions once it finds a node start log, so old problems do not change the node condition.

  3. Adds support for kernel panic monitoring: null pointer and divide-by-zero kernel panics are surfaced as events. The kernel monitor will usually report these events during the lookback phase.
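To make the two new configuration fields concrete, here is a minimal sketch of how a config carrying them could be loaded. It is written in Python for brevity and is not the node problem detector's actual (Go) implementation. The helper names `parse_lookback` and `load_config` and the `startPattern` value `"Booting Linux"` are assumptions made for illustration; only `logPath` and `"lookback": "1h"` come from the diff in this PR.

```python
import json
import re
from datetime import timedelta

# Supported unit suffixes for this sketch (a small subset of Go-style durations).
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_lookback(text):
    """Parse a simple duration such as "1h", "10m" or "1h30m" into a timedelta."""
    parts = re.findall(r"(\d+)([smh])", text)
    # Reject anything that is not entirely made of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    total = timedelta()
    for number, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(number)})
    return total


def load_config(raw):
    """Hypothetical loader: returns the fields with lookback and startPattern parsed."""
    cfg = json.loads(raw)
    pattern = cfg.get("startPattern")
    return {
        "logPath": cfg["logPath"],
        "lookback": parse_lookback(cfg.get("lookback", "0s")),
        "startPattern": re.compile(pattern) if pattern else None,
    }


# The startPattern value below is an illustrative assumption, not the shipped one.
example = '{"logPath": "/log/kern.log", "lookback": "1h", "startPattern": "Booting Linux"}'
config = load_config(example)
```

In this sketch a missing `lookback` is treated as zero, meaning no old logs are replayed; whether the real monitor defaults the same way is not stated in this PR.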

I've cut a v0.1 branch from before this PR, and bumped the image version to v0.2 in this PR.
@dchen1107
/cc @kubernetes/sig-node

@Random-Liu

Random-Liu commented Aug 19, 2016

I've updated and verified the PR. Since this PR adds look back support, if there was a kernel panic in the last run, node problem detector will report a node event after reboot similar to:

  10s   10s 1   {kernel-monitor e2e-test-lantaol-minion-group-29qf}     Warning KernelPanic divide error: 0000 [#1] SMP

During look back, node problem detector will not keep historical node conditions (since the node has rebooted), but will report historical node events.
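The behavior described above (historical events are reported, historical conditions are dropped once a node start log is seen) can be sketched as follows. This is a Python illustration under stated assumptions, not the project's Go code: the function `replay`, the start pattern, the panic regexes, and the `KernelDeadlock` condition rule are all hypothetical stand-ins chosen for the example.

```python
import re
from datetime import datetime, timedelta

# Hypothetical patterns, for illustration only.
START_PATTERN = re.compile(r"Booting Linux")
PANIC_RULES = [
    ("KernelPanic", re.compile(r"divide error: 0000")),
    ("KernelPanic", re.compile(r"BUG: unable to handle kernel NULL pointer dereference")),
]
CONDITION_RULES = [
    ("KernelDeadlock", re.compile(r"task \S+ blocked for more than \d+ seconds")),
]


def replay(entries, now, lookback):
    """Replay (timestamp, message) log entries that fall inside the lookback window.

    Returns (events, conditions). Events accumulate across reboots;
    conditions are cleared whenever a node start log is found.
    """
    events = []
    conditions = set()
    cutoff = now - lookback
    for timestamp, message in entries:
        if timestamp < cutoff:
            continue  # older than the lookback window: ignored entirely
        if START_PATTERN.search(message):
            conditions.clear()  # node rebooted: old conditions no longer apply
            continue
        for reason, pattern in PANIC_RULES:
            if pattern.search(message):
                events.append((reason, message))
        for condition, pattern in CONDITION_RULES:
            if pattern.search(message):
                conditions.add(condition)
    return events, conditions


now = datetime(2016, 8, 19, 12, 0, 0)
entries = [
    (now - timedelta(hours=2), "divide error: 0000 [#1] SMP"),  # outside the window
    (now - timedelta(minutes=30), "task docker:1234 blocked for more than 120 seconds"),
    (now - timedelta(minutes=20), "divide error: 0000 [#1] SMP"),
    (now - timedelta(minutes=10), "Booting Linux on physical CPU 0"),
]
events, conditions = replay(entries, now, timedelta(hours=1))
```

With a one-hour lookback, the panic from 20 minutes ago is still surfaced as an event, while the condition set before the reboot is cleared by the start log.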

@dchen1107

@@ -1,5 +1,7 @@
{
"logPath": "/log/kern.log",
"lookback": "1h",

Looking back 1h might be too long to find panic information. We only care about the most recent, right? Maybe change it to 5m or 10m?

@Random-Liu Random-Liu added this to the Kubernetes v1.4 milestone Aug 21, 2016
1) Add lookback support in kernel monitor. After it starts, the kernel
monitor checks older logs to detect problems that happened before the
last node reboot.
2) Add `lookback` and `startPattern` in kernel monitor configuration.
  * `lookback` specifies how far back the kernel monitor should look.
  * `startPattern` specifies which log line indicates that the node has
  started. The kernel monitor clears all current node conditions once it
  finds a node start log. This makes sure that old problems won't change
  the node condition.
3) Add support for kernel panic monitoring: null pointer and
divide-by-zero kernel panics are surfaced as events. The kernel monitor
will usually report these events during the lookback phase.
@Random-Liu

@dchen1107 Addressed comments.

@dchen1107

LGTM

@dchen1107 dchen1107 merged commit ea83111 into kubernetes:master Aug 24, 2016
@Random-Liu Random-Liu deleted the add-look-back branch January 11, 2017 08:18