node-problem-detector

node-problem-detector aims to make various node problems visible to the upstream layers in the cluster management stack. It is a daemon that runs on each node, detects node problems, and reports them to the apiserver. node-problem-detector can run either as a DaemonSet or standalone. It currently runs as a Kubernetes addon that is enabled by default in GCE clusters.

Background

Many node problems can affect the pods running on a node, such as:

Hardware issues: bad CPU, memory, or disk;
Kernel issues: kernel deadlock, corrupted file system;
Container runtime issues: unresponsive runtime daemon;
...

Currently, these problems are invisible to the upstream layers in the cluster management stack, so Kubernetes continues scheduling pods onto bad nodes.

To solve this problem, we introduced node-problem-detector, a new daemon that collects node problems from various daemons and makes them visible to the upstream layers. Once the upstream layers have visibility into these problems, we can discuss a remedy system.

Problem API

node-problem-detector uses Event and NodeCondition to report problems to the apiserver.

NodeCondition: A permanent problem that makes the node unavailable for pods should be reported as a NodeCondition.
Event: A temporary problem that has limited impact on pods but is informative should be reported as an Event.
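The rule above decides which API object a problem becomes. A minimal Go sketch of that decision, assuming a hypothetical Problem type with a two-valued Severity; these names and the example reasons are illustrative only and are not node-problem-detector's actual types:

```go
package main

import "fmt"

// Severity is a hypothetical classification used only in this sketch.
type Severity int

const (
	// Temporary: limited impact on pods but informative, reported as an Event.
	Temporary Severity = iota
	// Permanent: makes the node unavailable for pods, reported as a NodeCondition.
	Permanent
)

// Problem is a hypothetical record produced by a problem daemon.
type Problem struct {
	Reason   string
	Message  string
	Severity Severity
}

// reportKind returns the apiserver object a problem should be reported as.
func reportKind(p Problem) string {
	if p.Severity == Permanent {
		return "NodeCondition"
	}
	return "Event"
}

func main() {
	// Illustrative problems; the reasons are made up for this example.
	problems := []Problem{
		{Reason: "TaskHung", Message: "a task was blocked for a while", Severity: Temporary},
		{Reason: "RuntimeUnresponsive", Message: "runtime daemon does not respond", Severity: Permanent},
	}
	for _, p := range problems {
		fmt.Printf("%s -> %s\n", p.Reason, reportKind(p))
	}
}
```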

Problem Daemon

A problem daemon is a sub-daemon of node-problem-detector. It monitors a specific kind of node problem and reports it to node-problem-detector.

A problem daemon could be:

A tiny daemon designed for a dedicated Kubernetes use case.
An existing node health monitoring daemon integrated with node-problem-detector.

Currently, each problem daemon runs as a goroutine in the node-problem-detector binary. In the future, we plan to separate node-problem-detector and the problem daemons into different containers and compose them with a pod specification.
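The goroutine-per-daemon layout can be pictured with a small Go sketch. It assumes a hypothetical ProblemDaemon interface whose implementations push Status values onto a shared channel. Real daemons run indefinitely; the fake daemon here stops after a fixed list of messages so the collector terminates. None of these names come from node-problem-detector itself.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical report sent from a problem daemon.
type Status struct {
	Source  string
	Message string
}

// ProblemDaemon is a hypothetical interface: monitor one kind of node
// problem and report findings on the given channel.
type ProblemDaemon interface {
	Run(out chan<- Status)
}

// fakeKernelMonitor stands in for a real monitor and reports fixed messages.
type fakeKernelMonitor struct{ messages []string }

func (m fakeKernelMonitor) Run(out chan<- Status) {
	for _, msg := range m.messages {
		out <- Status{Source: "kernel-monitor", Message: msg}
	}
}

// collect starts each daemon in its own goroutine and gathers every report.
func collect(daemons []ProblemDaemon) []Status {
	out := make(chan Status)
	var wg sync.WaitGroup
	for _, d := range daemons {
		wg.Add(1)
		go func(d ProblemDaemon) {
			defer wg.Done()
			d.Run(out)
		}(d)
	}
	// Close the channel once every daemon has finished reporting.
	go func() {
		wg.Wait()
		close(out)
	}()
	var reports []Status
	for s := range out {
		reports = append(reports, s)
	}
	return reports
}

func main() {
	reports := collect([]ProblemDaemon{
		fakeKernelMonitor{messages: []string{"task hung", "oom kill"}},
	})
	for _, r := range reports {
		fmt.Printf("%s: %s\n", r.Source, r.Message)
	}
}
```

The channel decouples the daemons from the reporter. Moving daemons into separate containers would replace it with some inter-process transport, which this sketch does not model.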

List of supported problem daemons:

Problem Daemon	NodeCondition	Description
KernelMonitor	KernelDeadlock	A system log monitor that watches the kernel log and reports problems according to predefined rules.
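To illustrate "reports problems according to predefined rules", here is a minimal Go sketch of regex rule matching over kernel log lines. The Rule struct, the reasons, and the patterns are assumptions for illustration. The real rules come from the monitor's configuration file (for example config/kernel-monitor.json), whose format is not reproduced here. In this sketch a rule with a non-empty Condition is treated as permanent (NodeCondition) and one without as temporary (Event).

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule is a hypothetical predefined rule: a log line matching Pattern
// yields Reason; a non-empty Condition marks a permanent problem.
type Rule struct {
	Condition string
	Reason    string
	Pattern   *regexp.Regexp
}

// exampleRules returns illustrative rules, not the shipped configuration.
func exampleRules() []Rule {
	return []Rule{
		{Reason: "TaskHung", Pattern: regexp.MustCompile(`task \S+ blocked for more than \d+ seconds`)},
		{Condition: "KernelDeadlock", Reason: "UnregisterNetDevice",
			Pattern: regexp.MustCompile(`unregister_netdevice: waiting for \w+ to become free`)},
	}
}

// match returns the first rule whose pattern matches the log line.
func match(rules []Rule, line string) (Rule, bool) {
	for _, r := range rules {
		if r.Pattern.MatchString(line) {
			return r, true
		}
	}
	return Rule{}, false
}

func main() {
	rules := exampleRules()
	lines := []string{
		"INFO: task docker:20744 blocked for more than 120 seconds.",
		"unregister_netdevice: waiting for lo to become free. Usage count = 1",
		"eth0: link becomes ready",
	}
	for _, line := range lines {
		r, ok := match(rules, line)
		switch {
		case !ok:
			fmt.Printf("no rule matched: %q\n", line)
		case r.Condition != "":
			fmt.Printf("NodeCondition %s (reason %s)\n", r.Condition, r.Reason)
		default:
			fmt.Printf("Event (reason %s)\n", r.Reason)
		}
	}
}
```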

Usage

Flags

--version: Print the current version of node-problem-detector.
--system-log-monitors: A comma-separated list of paths to system log monitor configuration files, e.g. config/kernel-monitor.json. node-problem-detector starts a separate log monitor for each configuration, so you can use different log monitors to watch different system logs.
--apiserver-override: A URI parameter used to customize how node-problem-detector connects to the apiserver. The format is the same as the source flag of Heapster. For example, to run without auth, use the following config:

http://APISERVER_IP:APISERVER_PORT?inClusterConfig=false

Refer to the Heapster docs for a complete list of available options.

--hostname-override: A custom node name that node-problem-detector uses to update conditions and emit events. node-problem-detector takes the node name first from --hostname-override, then from the NODE_NAME environment variable, and finally falls back to os.Hostname.
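Two of these flags describe small, well-defined behaviors: the node-name lookup order and the comma-separated monitor list. The Go sketch below shows both. resolveNodeName and splitMonitorConfigs are hypothetical helpers, not node-problem-detector's code. The environment and hostname lookups are passed in as functions so the precedence can be tested. Trimming whitespace and dropping empty entries are assumptions of this sketch and may differ from the real flag parsing. config/other-monitor.json is a made-up path.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveNodeName follows the documented order: --hostname-override,
// then the NODE_NAME environment variable, then os.Hostname.
func resolveNodeName(override string, getenv func(string) string, hostname func() (string, error)) (string, error) {
	if override != "" {
		return override, nil
	}
	if name := getenv("NODE_NAME"); name != "" {
		return name, nil
	}
	return hostname()
}

// splitMonitorConfigs turns a --system-log-monitors value into one config
// path per log monitor (assumption: whitespace trimmed, empties dropped).
func splitMonitorConfigs(flagValue string) []string {
	var paths []string
	for _, p := range strings.Split(flagValue, ",") {
		if p = strings.TrimSpace(p); p != "" {
			paths = append(paths, p)
		}
	}
	return paths
}

func main() {
	name, err := resolveNodeName("", os.Getenv, os.Hostname)
	if err != nil {
		fmt.Println("cannot determine node name:", err)
		return
	}
	fmt.Println("node name:", name)
	for _, p := range splitMonitorConfigs("config/kernel-monitor.json,config/other-monitor.json") {
		fmt.Println("would start a log monitor for", p)
	}
}
```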

Build Image

Run make in the top-level directory. It will:

Build the binary.
Build the docker image. The binary and config/ are copied into the docker image.
Upload the docker image to a registry. By default, the image is uploaded to gcr.io/google_containers. To push the image to another registry, modify the Makefile.

Start DaemonSet

Create a file node-problem-detector.yaml with the following YAML:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector
spec:
  template:
    spec:
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.2
        imagePullPolicy: Always
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: log
        # Config `log` to your system log directory
        hostPath:
          path: /var/log/
      - name: localtime
        hostPath:
          path: /etc/localtime

Edit node-problem-detector.yaml to fit your environment: set the log volume to your system log directory (used by SystemLogMonitor).
Create the DaemonSet with kubectl create -f node-problem-detector.yaml.
If needed, you can use a ConfigMap to overwrite config/.

Start Standalone

To run node-problem-detector standalone, set inClusterConfig to false and tell node-problem-detector how to access the apiserver with --apiserver-override.

To run node-problem-detector standalone with an insecure apiserver connection:

node-problem-detector --apiserver-override=http://APISERVER_IP:APISERVER_INSECURE_PORT?inClusterConfig=false
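The override value is an ordinary URI whose options travel as query parameters. This Go sketch shows how such a value can be decomposed with the standard net/url package. parseOverride is a hypothetical helper, not Heapster's parser. It reads only inClusterConfig and assumes a default of true when the parameter is absent, which is consistent with the instruction above to set it to false explicitly. 10.0.0.1:8080 stands in for APISERVER_IP:APISERVER_INSECURE_PORT.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// overrideConfig holds the parts of --apiserver-override this sketch reads.
type overrideConfig struct {
	Host            string // scheme://host:port of the apiserver
	InClusterConfig bool
}

// parseOverride is a hypothetical helper; inClusterConfig defaults to true
// here (assumption) and must parse as a boolean when present.
func parseOverride(raw string) (overrideConfig, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return overrideConfig{}, err
	}
	cfg := overrideConfig{Host: u.Scheme + "://" + u.Host, InClusterConfig: true}
	if v := u.Query().Get("inClusterConfig"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return overrideConfig{}, fmt.Errorf("invalid inClusterConfig %q: %w", v, err)
		}
		cfg.InClusterConfig = b
	}
	return cfg, nil
}

func main() {
	cfg, err := parseOverride("http://10.0.0.1:8080?inClusterConfig=false")
	if err != nil {
		fmt.Println("bad override:", err)
		return
	}
	fmt.Printf("apiserver=%s inClusterConfig=%t\n", cfg.Host, cfg.InClusterConfig)
}
```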

For more scenarios, see here
