# Southbound Stale Alert

## Meaning

The alert `SouthboundStale` is triggered when the OVN southbound database has
not been written to by OVN northd for 2 minutes or longer.
Therefore, networking control plane changes are not being written to
the southbound database for cluster resources to consume.
Make sure that the `NorthBoundStale`, `NoOvnMasterLeader`, and
`NoRunningOvnMaster` alerts are not firing.
If they are, triage them before continuing here.
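To check whether these related alerts are currently firing, you can use the
Alerting page in the cluster console, or query Alertmanager directly. The
sketch below is an assumption-heavy example: it assumes the default
`alertmanager-main-0` pod in `openshift-monitoring` with `amtool` available in
its container.
```shell
# Sketch: list currently firing alerts and filter for the related OVN alerts.
# Assumes the default Alertmanager deployment in openshift-monitoring.
oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- \
  amtool alert --alertmanager.url=http://localhost:9093 \
  | grep -E 'NorthBoundStale|NoOvnMasterLeader|NoRunningOvnMaster'
```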

## Impact

Networking control plane is degraded.
Networking configuration updates applied to the cluster will not take effect.
Existing workloads should continue to have connectivity.

## Diagnosis

There are a few scenarios that can cause this alert to trigger.

1. OVN SBDB is not functional.
Check to see if the SBDB is running without errors
and has a RAFT cluster leader.
Check the container names on the ovn-kube master pods with this command:
```shell
oc get pods -n=openshift-ovn-kubernetes -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.name}{", "}{end}{end}' | sort | grep master
```
You should also run `oc get pods -n=openshift-ovn-kubernetes | grep master`
to ensure that all containers on your master pods are running.
To see the pod events, you can run
`oc describe pod/<podname> -n <namespace>`.
To see the logs of a specific container within the pod, you can run
`oc logs <podname> -n <namespace> -c <containerName>`.
Find the RAFT leader.
You can run this script to do so:
```shell
LEADER="not found"; for MASTER_POD in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o=jsonpath='{.items[*].metadata.name}'); do RAFT_ROLE=$(oc exec -n openshift-ovn-kubernetes "${MASTER_POD}" -c sbdb -- bash -c "ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 2>&1 | grep \"^Role\""); if echo "${RAFT_ROLE}" | grep -q -i leader; then LEADER=$MASTER_POD; break; fi; done; echo "sbdb leader ${LEADER}"
```
If this does not work, exec into each sbdb container on the master pods with
`oc exec -it -n openshift-ovn-kubernetes <ovnkube-master podname> -c sbdb -- bash`
and then run
`ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound`.
The output includes a `Role` field that will say either leader or follower.
A common cause of database leader issues is that one of the database
servers is unable to participate with the other RAFT peers due to a mismatching
cluster ID. Because of this, the servers are unable to elect a database leader.
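A sketch for comparing the cluster ID reported by each SBDB member follows; a
pod reporting a different `Cluster ID` than its peers points to this failure
mode. The label selector and container name are the same ones used above.
```shell
# Sketch: print the SBDB RAFT cluster ID reported by each ovnkube-master pod.
# A pod whose Cluster ID differs from its peers cannot participate in the RAFT cluster.
for MASTER_POD in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o=jsonpath='{.items[*].metadata.name}'); do
  echo -n "${MASTER_POD}: "
  oc -n openshift-ovn-kubernetes exec "${MASTER_POD}" -c sbdb -- \
    ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 2>&1 | grep '^Cluster ID'
done
```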
Try restarting the ovn-kube master pods to resolve this issue.
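For example, deleting the master pods forces their controller to recreate
them; this sketch assumes the default `app=ovnkube-master` label used above.
```shell
# Sketch: restart the ovnkube-master pods by deleting them so they are recreated.
oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-master
```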

2. OVN northd is not functional.
If northd is down, the logical network configuration in the northbound database
will not be translated into logical flows in the southbound database.
If you have made it through the first step,
you will already know whether the container is running,
so you should check the logs to see if there are any errors.
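For example, a quick scan of the northd container logs for warnings and errors
(a sketch; the pod name is a placeholder):
```shell
# Sketch: scan the northd container logs on a master pod for warnings and errors.
oc -n openshift-ovn-kubernetes logs <ovnkube-master-podname> -c northd | grep -iE 'error|warn'
```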

3. OVN northbound database is not functioning.
Check to see if the `NorthBoundStale` alert is firing.
If the nbdb container on the ovn-kubernetes master is not running,
check the container logs and proceed from there.
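The northbound database can be inspected the same way as the southbound one; a
sketch, assuming the container is named `nbdb` and the control socket is
`/var/run/ovn/ovnnb_db.ctl`:
```shell
# Sketch: check the NBDB RAFT status on a master pod (analogous to the SBDB check above).
oc -n openshift-ovn-kubernetes exec <ovnkube-master-podname> -c nbdb -- \
  ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
```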

4. OVN northd cannot connect to the NBDB or SBDB leader, or OVN northd is
overloaded.
Check the logs of the northd container on the ovn-kube master pod that is
running the active instance of ovn-northd.
To determine which instance is active, exec into the ovnkube-master container
on each ovnkube-master pod and run this command:
`curl 127.0.0.1:29105/metrics | grep ovn_northd_status`
This returns the status of northd on that ovnkube-master pod;
a scripted version is sketched below.
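The following sketch runs the same check across every master pod in one pass;
it assumes the metrics endpoint on `127.0.0.1:29105` inside the
`ovnkube-master` container, as described above, and that `curl` is available
there.
```shell
# Sketch: report ovn_northd_status from every ovnkube-master pod.
for MASTER_POD in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o=jsonpath='{.items[*].metadata.name}'); do
  echo "${MASTER_POD}:"
  oc -n openshift-ovn-kubernetes exec "${MASTER_POD}" -c ovnkube-master -- \
    curl -s 127.0.0.1:29105/metrics | grep ovn_northd_status
done
```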
Once you find the active instance of ovn-northd,
check the logs of the northd container on that ovnkube-master pod.
If northd is overloaded, the container logs will contain messages along the
lines of `dropped x number of log messages due to excessive rate`
or a message that contains `(99% CPU usage)`
or some other high CPU usage percentage.
You can also check the CPU usage of the container from the OpenShift cluster
console.
In the Observe section of the sidebar, click Metrics, then run this query:
`container_cpu_usage_seconds_total{pod="$PODNAME", container="$CONTAINERNAME"}`
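Because this metric is a cumulative counter, a rate over a short window is
often easier to read, for example
`rate(container_cpu_usage_seconds_total{pod="$PODNAME", container="$CONTAINERNAME"}[5m])`
for the per-second CPU usage over the last five minutes.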

## Mitigation

Mitigation will depend on what was found in the diagnosis section.
As a general fix, you can try restarting the ovn-kube master pods.
If the issue persists, reach out to the SDN team on #forum-sdn.