Closed
NPD (node problem detector) was introduced in Kubernetes 1.3 as a default add-on in GCE clusters.
At that time, it mainly targeted the default GCE Kubernetes setup. However, over time some limitations were found, such as missing journald support, authentication issues, and scalability issues, which have held back adoption of NPD in many other environments.
In Kubernetes 1.6, we plan to invest some time to improve NPD, make it production ready, and roll it out in GKE.
Here are the work items and priorities:
- [P0] Journald support. Many important OS distros now use systemd, such as GCI, CoreOS, and CentOS. This is essential for NPD adoption. (Issue: KernelMonitor: Add journald support #14, PR: Journald support #39, add journal support #33, @adohe)
- [P0] Apiserver client option override. By default, NPD runs as a DaemonSet and uses `InClusterConfig` to access the apiserver. However, this does not work when a service account is not available. (Issue: Allow reading a kubeconfig file for master / auth info #27, does this tool work outside Google? #21). We should make the apiserver client options configurable, so that users can customize them based on their cluster setup. This is a prerequisite of Standalone mode. (PR: add support for running standalone #49, @andyxning)
- [P0] Standalone mode. Make it possible to run NPD standalone, possibly as a systemd service. A DaemonSet is easy to deploy and manage. However, docker still stops all containers when the docker daemon dies (`live-restore` is still in validation). Because of this, NPD may not be able to detect problems when docker is unresponsive. (Issue: Standalone NPD Support #76)
- [P1] Integrate NPD with K8s e2e framework. NPD is already running in e2e clusters, but the information it collects is not well surfaced by the test framework. We should make it visible by failing the test or by collecting it in a dashboard. (Issue: Integrate node-problem-detector with e2e test infrastructure kubernetes#30811)
- [P1] Scalability and performance. (Performance benchmark and optimization #85)
  - Some known performance issues need to be fixed in NPD, such as reducing apiserver access (Node problem detector should use apiserver cache. #37) and improving log parsing efficiency. (Update NPD to only do forcibly sync every 1 minutes. #79, Only change transition timestamp when condition is changed. #84)
  - More benchmarks to verify the performance of NPD, covering both NPD resource usage and the apiserver load introduced by NPD. (Feature request for a "hollow"-node-problem-detector having an empty list of conditions and rules inside kernel monitor config #50, @shyamjvs, Performance benchmark and optimization #85)
- [P2] Formalize the Project. Formalize the process of the project, including:
  - Add CHANGELOG.md for NPD repo. (Add change log. #66, We need a ChangeLog for NPD. #45) [P2]
  - Define release process. (Release instruction is needed. #67) [P2]
  - Add pre/post submit e2e test. (Add CI e2e test for NPD. #43) [P3]
- [P2] Docker problem detection. Although the kernel monitor could be extended to monitor other logs, it still needs some code changes to achieve that. We should clean up the code to make it easier to monitor other logs, and add clear documentation for it. (Generalize kernel monitor. #44) (PR: Add arbitray system log support #88, Generalize the kernel monitor code. #92, Add multiple system log monitor support #94)
- [P3] 3rd party problem daemon integration. The kernel monitor is designed to detect known kernel problems with minimum overhead; it is not expected to be a comprehensive solution. NPD should be extensible so that it can integrate with more small problem daemons or with more mature solutions. (How to hook up third-party daemons? #35)
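
To make the apiserver client option override item more concrete, here is a minimal Go sketch of the selection logic it implies: use an explicit apiserver address when the user supplies one, otherwise fall back to in-cluster access. It uses only the standard library and does not call client-go. The `apiserverAccess` type, the `resolveAccess` function, and the `NPD_APISERVER_OVERRIDE` environment variable are hypothetical names for illustration, not NPD's actual interface. The only real names are `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT`, which Kubernetes injects into pods and which in-cluster configuration depends on.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// apiserverAccess describes how a client should reach the apiserver.
// Hypothetical type, for illustration only.
type apiserverAccess struct {
	InCluster bool   // rely on the pod's service account
	Host      string // explicit apiserver address when InCluster is false
}

// resolveAccess prefers a user-supplied override and falls back to
// in-cluster access. getenv is injected so the logic is easy to test.
func resolveAccess(override string, getenv func(string) string) (apiserverAccess, error) {
	if override != "" {
		u, err := url.Parse(override)
		if err != nil {
			return apiserverAccess{}, fmt.Errorf("invalid apiserver override %q: %w", override, err)
		}
		if u.Scheme == "" || u.Host == "" {
			return apiserverAccess{}, fmt.Errorf("apiserver override %q needs a scheme and a host", override)
		}
		return apiserverAccess{Host: u.Scheme + "://" + u.Host}, nil
	}
	// In-cluster access only works when Kubernetes has injected these variables.
	if getenv("KUBERNETES_SERVICE_HOST") == "" || getenv("KUBERNETES_SERVICE_PORT") == "" {
		return apiserverAccess{}, fmt.Errorf("no override given and no in-cluster environment found")
	}
	return apiserverAccess{InCluster: true}, nil
}

func main() {
	// NPD_APISERVER_OVERRIDE is an assumed variable name for this sketch.
	access, err := resolveAccess(os.Getenv("NPD_APISERVER_OVERRIDE"), os.Getenv)
	if err != nil {
		fmt.Println("cannot reach apiserver:", err)
		return
	}
	fmt.Printf("in-cluster=%v host=%q\n", access.InCluster, access.Host)
}
```

A standalone NPD (for example, one run as a systemd service) would always take the override branch, which is why the override is listed as a prerequisite of Standalone mode.
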
Note that only P0s are release blockers.
@dchen1107 @fabioy @ajitak
/cc @kubernetes/sig-node-misc