Commit 5b5f8d8

Merge pull request openshift#43 from bpickard22/master
Add NorthboundStale and SouthboundStale runbooks
2 parents 65bf4f5 + d4dbb6b

2 files changed, 169 insertions(+), 0 deletions(-)

# Northbound Stale Alert

## Meaning

The alert `NorthboundStale` is triggered if the ovn-kube master process
is not functioning, if the northbound database is not functioning,
or if connectivity between the ovnkube master and the database is broken.

This alert will also fire if `NoRunningOvnMaster` or `NoOvnMasterLeader`
is firing, so check those alerts before proceeding here.

## Impact

The networking control plane is degraded, and networking configuration updates
applied to the cluster will not take effect.
Existing workloads should continue to have connectivity, but the
OVN-Kubernetes control plane and/or the OVN southbound database
may not be functional.

## Diagnosis

Investigate the causes that can trigger this alert.

1. Is the ovnkube-master process running, i.e. is the container running?
If it is not, check the logs and proceed from there.

You can check the container names on the ovn-kube master pods with this command:
```shell
oc get pods -n openshift-ovn-kubernetes -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.name}{", "}{end}{end}' | sort | grep master
```
You should also run
`oc get pods -n openshift-ovn-kubernetes | grep master`
to ensure that all containers are ready.
To see the pod events you can run `oc describe pod/<podname> -n <namespace>`
To see the logs of a specific container within the pod you can run
`oc logs <podname> -n <namespace> -c <containerName>`

2. Is the OVN Northbound Database functioning?
Check to see if the northbound database containers are running without errors.
If there are no errors, ensure there is an OVN Northbound Database leader.
If you have made it through the debug steps in step one, then you should
already know whether the containers are healthy on the master pod(s),
and you should now ensure that there is a northbound database leader.
To find the database leader, you can run this script:
```shell
LEADER="not found"; for MASTER_POD in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o=jsonpath='{.items[*].metadata.name}'); do RAFT_ROLE=$(oc exec -n openshift-ovn-kubernetes "${MASTER_POD}" -c nbdb -- bash -c "ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>&1 | grep \"^Role\""); if echo "${RAFT_ROLE}" | grep -q -i leader; then LEADER=$MASTER_POD; break; fi; done; echo "nbdb leader ${LEADER}"
```
If this does not work, exec into each nbdb container on the master pods with
`oc exec -it <ovnkube-master podname> -n openshift-ovn-kubernetes -c nbdb -- bash`
and then run
`ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound`
You should see a role that says either leader or follower.
A common cause of database leader issues is that one of the database
servers is unable to participate with the other RAFT peers because of
mismatched cluster IDs, which prevents the peers from electing a database
leader (a quick way to compare the IDs is sketched below).
Try restarting the ovn-kube master pods to resolve this issue.
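
Before restarting, you can compare the cluster and server IDs reported by each
nbdb server with a loop like the following (a sketch; it assumes the
`app=ovnkube-master` label and that `cluster/status` prints `Cluster ID` and
`Server ID` lines, as recent OVN releases do):
```shell
# Show the Cluster ID and Server ID reported by every nbdb server
for POD in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== ${POD} ==="
  oc exec -n openshift-ovn-kubernetes "${POD}" -c nbdb -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound | grep -iE 'cluster id|server id'
done
```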

3. Lastly, check that the connectivity between the ovnkube-master
leader and the OVN northbound database leader is healthy.

To determine which node the ovnkube-master leader is on, run
```shell
oc get cm -n openshift-ovn-kubernetes ovn-kubernetes-master -o json | jq '.metadata.annotations'
```
(see the sketch after this step for a way to pull the leader name directly).
Then get the logs of the ovnkube-master container in the ovnkube-master pod
on that node with
`oc logs <podname> -n <namespace> -c ovnkube-master`
If the master cannot connect to the database, you will see a message along
the lines of `"msg"="trying to connect" "database"="OVN_Northbound"
"endpoint"="tcp:172.18.0.4:6641"` in the logs.
A successful connection message will appear in the logs once the master
has connected to the database.
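
As a convenience, the leader's identity can usually be read straight out of
that ConfigMap. This is a sketch only: it assumes the standard client-go
leader-election annotation (`control-plane.alpha.kubernetes.io/leader`) is
what stores the record, which may differ between versions.
```shell
# Hypothetical helper: print the holder identity from the leader-election record
oc get cm -n openshift-ovn-kubernetes ovn-kubernetes-master -o json \
  | jq -r '.metadata.annotations["control-plane.alpha.kubernetes.io/leader"] | fromjson | .holderIdentity'
```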

## Mitigation

Mitigation will depend on what was found in the diagnosis section.
As a general fix, you can try restarting the ovn-kube master pods,
for example as sketched below.
If the issue persists, reach out to the SDN team on #forum-sdn.
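
A minimal sketch of that restart, assuming the master pods carry the
`app=ovnkube-master` label used by the scripts above and are recreated
automatically by their controller:
```shell
# Delete the ovnkube-master pods; their controller should recreate them
oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-master

# Watch the replacement pods come back up
oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -w
```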

# Southbound Stale Alert

## Meaning

The alert `SouthboundStale` is triggered when the OVN southbound database has
not been written to by OVN northd for 2 minutes or longer.
Therefore, networking control plane changes are not being reflected
in the southbound database for cluster resources to consume.
Make sure that the `NorthboundStale`, `NoOvnMasterLeader`, and `NoRunningOvnMaster`
alerts are not firing.
If they are, triage them before continuing here.

## Impact

The networking control plane is degraded.
Networking configuration updates applied to the cluster will not take effect.
Existing workloads should continue to have connectivity.

## Diagnosis

There are a few scenarios that can cause this alert to trigger.

1. The OVN SBDB is not functional.
Check to see if the SBDB is running without errors
and has a RAFT cluster leader.
Check the container names on the ovn-kube master pods with this command:
```shell
oc get pods -n openshift-ovn-kubernetes -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.name}{", "}{end}{end}' | sort | grep master
```
You should also run `oc get pods -n openshift-ovn-kubernetes | grep master`
to ensure that all containers on your master pods are running.
To see the pod events you can run
`oc describe pod/<podname> -n <namespace>`
To see the logs of a specific container within the pod you can run
`oc logs <podname> -n <namespace> -c <containerName>`
Find the RAFT leader.
You can run this script to do so:
```shell
LEADER="not found"; for MASTER_POD in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o=jsonpath='{.items[*].metadata.name}'); do RAFT_ROLE=$(oc exec -n openshift-ovn-kubernetes "${MASTER_POD}" -c sbdb -- bash -c "ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 2>&1 | grep \"^Role\""); if echo "${RAFT_ROLE}" | grep -q -i leader; then LEADER=$MASTER_POD; break; fi; done; echo "sbdb leader ${LEADER}"
```
If this does not work, exec into each sbdb container on the master pods with
`oc exec -it <ovnkube-master podname> -n openshift-ovn-kubernetes -c sbdb -- bash`
and then run
`ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound`
You should see a role that says either leader or follower.
A common cause of database leader issues is that one of the database
servers is unable to participate with the other RAFT peers because of
mismatched cluster IDs, which prevents the peers from electing a database leader.
Try restarting the ovn-kube master pods to resolve this issue.

2. OVN northd is not functional.
If northd is down, the logical flows in the northbound database will not
be translated into logical datapath flows in the southbound database.
If you have made it through the first step,
you will already know whether the container is running,
so you should check the northd container logs to see if there are any errors
(see the sketch below).
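
A quick way to pull recent northd logs from every master pod (a sketch; it
assumes the `app=ovnkube-master` label and the `northd` container name
referenced later in this runbook):
```shell
# Print the last 50 northd log lines from each master pod
for POD in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== ${POD} ==="
  oc logs -n openshift-ovn-kubernetes "${POD}" -c northd --tail=50
done
```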

3. The OVN northbound database is not functioning.
Check to see if the `NorthboundStale` alert is firing
(one way to query alert state is noted below).
If the nbdb container on the ovn-kubernetes master is not running,
check the container logs and proceed from there.
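
If you prefer to check alert state with a query, the console's Observe >
Metrics view can evaluate Prometheus' built-in `ALERTS` metric, for example
`ALERTS{alertname="NorthboundStale", alertstate="firing"}`
(a sketch; the metric and labels are standard Prometheus, not specific to this runbook).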

4. OVN northd cannot connect to one or both of the NBDB/SBDB leaders,
or OVN northd is overloaded.
You can check the logs of the northd container on the ovn-kube master pod
that hosts the active instance of ovn-northd.
To determine which pod that is, exec into the ovnkube-master container
on each ovnkube-master pod and run this command
(a loop that runs this across all master pods is sketched at the end of this step):
`curl 127.0.0.1:29105/metrics | grep ovn_northd_status`
This will return the status of northd on that ovnkube-master pod.
Once you find the active instance of ovn-northd,
you can check the logs of the northd container on that ovnkube-master pod.
If northd is overloaded, there will be logs in the container along the lines of
`dropped x number of log messages due to excessive rate`
or a message that contains `(99% CPU usage)`
or some other high CPU usage percentage.
You can also check the CPU usage of the container by logging into your
OpenShift cluster console.
In the Observe section of the sidebar, click Metrics, then run this query:
`container_cpu_usage_seconds_total{pod="$PODNAME", container="$CONTAINERNAME"}`
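
A sketch of that loop, assuming the `app=ovnkube-master` label and that `curl`
is available in the ovnkube-master container as described above:
```shell
# Query the northd status metric inside each ovnkube-master container
for POD in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== ${POD} ==="
  oc exec -n openshift-ovn-kubernetes "${POD}" -c ovnkube-master -- curl -s 127.0.0.1:29105/metrics | grep ovn_northd_status
done
```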

## Mitigation

Mitigation will depend on what was found in the diagnosis section.
As a general fix, you can try restarting the ovn-kube master pods.
If the issue persists, reach out to the SDN team on #forum-sdn.