
ingress-nginx stops processing backend changes once it gets SIGTERM #13215

Description

@grounded042

What happened:

I've discovered that, between the time ingress-nginx receives a SIGTERM and the time it stops proxying traffic, it does not update any backends (upstream IPs), even if they change in Kubernetes.

This causes errors such as

[error] 28#28: *144 upstream timed out (110: Operation timed out) while connecting to upstream 

because ingress-nginx can, in some cases, route to a pod IP that is no longer in use.

This is an issue when a connection to ingress-nginx is kept open to send multiple HTTP requests: the connection stays pinned to the ingress-nginx pod even after termination starts. If an upstream pod is terminated while the ingress-nginx pod is still serving requests, the upstream IP NGINX holds will be stale, because no upstream (backend) IPs have been updated since termination began.

This is more noticeable if --shutdown-grace-period is used.
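To make the failure mode concrete, here is a small standalone Go client (hypothetical, not part of this issue; hurl plays this role in the reproduction below) that re-uses one keep-alive connection for several requests. A connection like this stays pinned to the same ingress-nginx pod across the grace period, so it keeps being routed to whatever upstream IPs that pod last knew about. The hostname assumes the reproduction setup described later.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Go's default http.Client re-uses keep-alive connections, so all of the
	// requests below travel over one TCP connection to one ingress-nginx pod,
	// even after that pod receives SIGTERM and enters its grace period.
	client := &http.Client{}
	for i := 0; i < 8; i++ {
		resp, err := client.Get("http://one.wildcard.example.com/") // hostname from the repro setup
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		io.Copy(io.Discard, resp.Body) // drain the body so the connection can be re-used
		resp.Body.Close()
		fmt.Println("status:", resp.Status)
		time.Sleep(10 * time.Second) // shorter than the idle timeout, so the connection stays open
	}
}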

What you expected to happen:

ingress-nginx should update backends until it stops proxying traffic.

I've found a fix for the problem and will open a PR. Here are the details:

The root cause of the issue is a bug in how ingress-nginx handles graceful shutdown. When an ingress-nginx pod is terminated, several things happen (a condensed code sketch follows the list):

  1. The controller gets the SIGTERM and calls Stop() on the NGINXController struct (src).
  2. Inside of Stop() the isShuttingDown bool is set to true (src).
  3. Inside of Stop() the controller sleeps for the duration passed to --shutdown-grace-period, 350 seconds in my case (src).
  4. Next, Stop() sends a SIGQUIT to the NGINX process and waits for it to stop before returning (src).
  5. ingress-nginx waits one last time, via --post-shutdown-grace-period, and then exits.
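Here is the condensed Go sketch of that sequence. It is my approximation of the controller's Stop(), not a verbatim excerpt; the config type and the nginxIsRunning helper are hypothetical stand-ins for the real process handling:

package main

import (
	"os/exec"
	"time"
)

// Minimal stand-ins for the real controller types (hypothetical).
type config struct{ ShutdownGracePeriod int }

type NGINXController struct {
	isShuttingDown bool
	cfg            config
}

// nginxIsRunning is a hypothetical helper; the real controller checks the
// NGINX master process directly.
func nginxIsRunning() bool { return false }

// Stop sketches steps 2-4 of the shutdown sequence above.
func (n *NGINXController) Stop() error {
	n.isShuttingDown = true // step 2: from this point on, update events are ignored

	// Step 3: keep proxying traffic for --shutdown-grace-period seconds so
	// load balancers have time to take this pod out of rotation.
	time.Sleep(time.Duration(n.cfg.ShutdownGracePeriod) * time.Second)

	// Step 4: ask the NGINX master to finish in-flight requests and exit
	// ("nginx -s quit" sends SIGQUIT), then wait for it to stop.
	if err := exec.Command("nginx", "-s", "quit").Run(); err != nil {
		return err
	}
	for nginxIsRunning() {
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

func main() {}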

At a cursory glance that does not look like a problem, but there is one important detail: as soon as isShuttingDown is set to true in step 2, the goroutine that processes Kubernetes events to update backends stops processing them (src).
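Continuing the types from the sketch above, this is roughly the shape of that event loop; again an approximation with simplified channel types, not the controller's exact code:

// Approximate shape of the controller's event loop. Once isShuttingDown is
// true, update events from Kubernetes are discarded instead of being queued
// for a backend sync, so NGINX keeps the last upstream IPs it ever saw.
func (n *NGINXController) eventLoop(updateCh <-chan interface{}, stopCh <-chan struct{}) {
	for {
		select {
		case event := <-updateCh:
			if n.isShuttingDown {
				break // <- the bug: backend updates stop here for the whole grace period
			}
			n.enqueueSync(event) // queue a backend/config sync
		case <-stopCh:
			return
		}
	}
}

// enqueueSync is a hypothetical stand-in for the controller's sync queue.
func (n *NGINXController) enqueueSync(event interface{}) {}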

NGINX Ingress controller version (exec into the pod and run /nginx-ingress-controller --version):
I'm testing in kind using the latest commit from main. The commit is 8f2593b

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       1.0.0-dev
  Build:         git-8f2593bb8
  Repository:    https://github.com/kubernetes/ingress-nginx.git
  nginx version: nginx/1.27.1
-------------------------------------------------------------------------------

Kubernetes version (use kubectl version):

Client Version: v1.32.2
Kustomize Version: v5.5.0
Server Version: v1.32.3

Environment: kind via make dev-env

  • Cloud provider or hardware configuration: kind via make dev-env
  • OS (e.g. from /etc/os-release): kind via make dev-env
  • Kernel (e.g. uname -a): kind via make dev-env
  • Install tools: kind via make dev-env
  • Basic cluster related info:
    • kubectl version: see the "Kubernetes version" section above
    • kubectl get nodes -o wide:
NAME                              STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION     CONTAINER-RUNTIME
ingress-nginx-dev-control-plane   Ready    control-plane   7h25m   v1.32.3   172.18.0.2    <none>        Debian GNU/Linux 12 (bookworm)   6.10.14-linuxkit   containerd://2.0.3
  • How was the ingress-nginx-controller installed: make dev-env, which runs helm template. I used the following values:
controller:
  extraArgs:
    shutdown-grace-period: 350
  image:
    repository: ${REGISTRY}/controller
    tag: ${TAG}
    digest:
  config:
    worker-processes: "1"
  podLabels:
    deploy-date: "$(date +%s)"
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  hostPort:
    enabled: true
  terminationGracePeriodSeconds: 360
  service:
    type: NodePort
  • Current State of the controller:
    • kubectl describe ingressclasses
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.12.1
              helm.sh/chart=ingress-nginx-4.12.1
Annotations:  <none>
Controller:   k8s.io/ingress-nginx
Events:       <none>
  • kubectl -n <ingresscontrollernamespace> get all -A -o wide
NAMESPACE            NAME                                                       READY   STATUS      RESTARTS   AGE    IP            NODE                              NOMINATED NODE   READINESS GATES
httpd                pod/httpd-777868ddb6-m4gfb                                 1/1     Running     0          42m    10.244.0.12   ingress-nginx-dev-control-plane   <none>           <none>
ingress-nginx        pod/ingress-nginx-admission-create-xrhg8                   0/1     Completed   0          7h2m   10.244.0.4    ingress-nginx-dev-control-plane   <none>           <none>
ingress-nginx        pod/ingress-nginx-admission-patch-d9ktn                    0/1     Completed   1          7h2m   10.244.0.6    ingress-nginx-dev-control-plane   <none>           <none>
ingress-nginx        pod/ingress-nginx-controller-cd664468-x5mk9                1/1     Running     0          37m    10.244.0.13   ingress-nginx-dev-control-plane   <none>           <none>
kube-system          pod/coredns-668d6bf9bc-pxkbw                               1/1     Running     0          7h2m   10.244.0.5    ingress-nginx-dev-control-plane   <none>           <none>
kube-system          pod/coredns-668d6bf9bc-tjc6c                               1/1     Running     0          7h2m   10.244.0.2    ingress-nginx-dev-control-plane   <none>           <none>
kube-system          pod/etcd-ingress-nginx-dev-control-plane                   1/1     Running     0          7h2m   172.18.0.2    ingress-nginx-dev-control-plane   <none>           <none>
kube-system          pod/kindnet-86z7x                                          1/1     Running     0          7h2m   172.18.0.2    ingress-nginx-dev-control-plane   <none>           <none>
kube-system          pod/kube-apiserver-ingress-nginx-dev-control-plane         1/1     Running     0          7h2m   172.18.0.2    ingress-nginx-dev-control-plane   <none>           <none>
kube-system          pod/kube-controller-manager-ingress-nginx-dev-control-plane   1/1  Running     0          7h2m   172.18.0.2    ingress-nginx-dev-control-plane   <none>           <none>
kube-system          pod/kube-proxy-97sjv                                       1/1     Running     0          7h2m   172.18.0.2    ingress-nginx-dev-control-plane   <none>           <none>
kube-system          pod/kube-scheduler-ingress-nginx-dev-control-plane         1/1     Running     0          7h2m   172.18.0.2    ingress-nginx-dev-control-plane   <none>           <none>
local-path-storage   pod/local-path-provisioner-7dc846544d-z7nb8                1/1     Running     0          7h2m   10.244.0.3    ingress-nginx-dev-control-plane   <none>           <none>

NAMESPACE       NAME                                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE     SELECTOR
default         service/kubernetes                           ClusterIP   10.96.0.1       <none>        443/TCP                      7h2m    <none>
httpd           service/httpd                                ClusterIP   10.96.198.184   <none>        80/TCP                       6h54m   app=httpd
ingress-nginx   service/ingress-nginx-controller             NodePort    10.96.120.145   <none>        80:31510/TCP,443:30212/TCP   7h2m    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
ingress-nginx   service/ingress-nginx-controller-admission   ClusterIP   10.96.240.122   <none>        443/TCP                      7h2m    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
kube-system     service/kube-dns                             ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP       7h2m    k8s-app=kube-dns

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE    CONTAINERS    IMAGES                                          SELECTOR
kube-system   daemonset.apps/kindnet      1         1         1       1            1           kubernetes.io/os=linux   7h2m   kindnet-cni   docker.io/kindest/kindnetd:v20250214-acbabc1a   app=kindnet
kube-system   daemonset.apps/kube-proxy   1         1         1       1            1           kubernetes.io/os=linux   7h2m   kube-proxy    registry.k8s.io/kube-proxy:v1.32.3              k8s-app=kube-proxy

NAMESPACE            NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS               IMAGES                                                                              SELECTOR
httpd                deployment.apps/httpd                      1/1     1            1           6h54m   httpd                    httpd:alpine                                                                        app=httpd
ingress-nginx        deployment.apps/ingress-nginx-controller   1/1     1            1           7h2m    controller               us-central1-docker.pkg.dev/k8s-staging-images/ingress-nginx/controller:1.0.0-dev   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
kube-system          deployment.apps/coredns                    2/2     2            2           7h2m    coredns                  registry.k8s.io/coredns/coredns:v1.11.3                                             k8s-app=kube-dns
local-path-storage   deployment.apps/local-path-provisioner     1/1     1            1           7h2m    local-path-provisioner   docker.io/kindest/local-path-provisioner:v20250214-acbabc1a                         app=local-path-provisioner

NAMESPACE            NAME                                                  DESIRED   CURRENT   READY   AGE     CONTAINERS               IMAGES                                                                              SELECTOR
httpd                replicaset.apps/httpd-777868ddb6                      1         1         1       42m     httpd                    httpd:alpine                                                                        app=httpd,pod-template-hash=777868ddb6
httpd                replicaset.apps/httpd-798d447958                      0         0         0       6h54m   httpd                    httpd:alpine                                                                        app=httpd,pod-template-hash=798d447958
ingress-nginx        replicaset.apps/ingress-nginx-controller-69df88cb89   0         0         0       7h2m    controller               us-central1-docker.pkg.dev/k8s-staging-images/ingress-nginx/controller:1.0.0-dev   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx,pod-template-hash=69df88cb89
ingress-nginx        replicaset.apps/ingress-nginx-controller-778bbb8bf5   0         0         0       46m     controller               us-central1-docker.pkg.dev/k8s-staging-images/ingress-nginx/controller:1.0.0-dev   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx,pod-template-hash=778bbb8bf5
ingress-nginx        replicaset.apps/ingress-nginx-controller-867f4dc7b8   0         0         0       6h50m   controller               us-central1-docker.pkg.dev/k8s-staging-images/ingress-nginx/controller:1.0.0-dev   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx,pod-template-hash=867f4dc7b8
ingress-nginx        replicaset.apps/ingress-nginx-controller-cd664468     1         1         1       44m     controller               us-central1-docker.pkg.dev/k8s-staging-images/ingress-nginx/controller:1.0.0-dev   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx,pod-template-hash=cd664468
kube-system          replicaset.apps/coredns-668d6bf9bc                    2         2         2       7h2m    coredns                  registry.k8s.io/coredns/coredns:v1.11.3                                             k8s-app=kube-dns,pod-template-hash=668d6bf9bc
local-path-storage   replicaset.apps/local-path-provisioner-7dc846544d     1         1         1       7h2m    local-path-provisioner   docker.io/kindest/local-path-provisioner:v20250214-acbabc1a                         app=local-path-provisioner,pod-template-hash=7dc846544d

NAMESPACE       NAME                                       STATUS     COMPLETIONS   DURATION   AGE    CONTAINERS   IMAGES                                                                                                                              SELECTOR
ingress-nginx   job.batch/ingress-nginx-admission-create   Complete   1/1           23s        7h2m   create       registry.k8s.io/ingress-nginx/kube-webhook-certgen:v1.5.2@sha256:e8825994b7a2c7497375a9b945f386506ca6a3eda80b89b74ef2db743f66a5ea   batch.kubernetes.io/controller-uid=ef0717de-b89d-4a52-ad10-24d10bedb23e
ingress-nginx   job.batch/ingress-nginx-admission-patch    Complete   1/1           24s        7h2m   patch        registry.k8s.io/ingress-nginx/kube-webhook-certgen:v1.5.2@sha256:e8825994b7a2c7497375a9b945f386506ca6a3eda80b89b74ef2db743f66a5ea   batch.kubernetes.io/controller-uid=e57c790d-19a9-43eb-a9c0-3f603e893660
  • kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
Name:             ingress-nginx-controller-cd664468-x5mk9
Namespace:        ingress-nginx
Priority:         0
Service Account:  ingress-nginx
Node:             ingress-nginx-dev-control-plane/172.18.0.2
Start Time:       Mon, 14 Apr 2025 14:44:25 -0600
Labels:           app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=ingress-nginx
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=ingress-nginx
                  app.kubernetes.io/part-of=ingress-nginx
                  app.kubernetes.io/version=1.12.1
                  deploy-date=1744663026
                  helm.sh/chart=ingress-nginx-4.12.1
                  pod-template-hash=cd664468
Annotations:      kubectl.kubernetes.io/restartedAt: 2025-04-14T14:35:13-06:00
Status:           Running
IP:               10.244.0.13
IPs:
  IP:  10.244.0.13
Controlled By:  ReplicaSet/ingress-nginx-controller-cd664468
Containers:
  controller:
    Container ID:    containerd://6d5572d8452e0054fddc759096c727017a937974423e42798e19f502b5be38fc
    Image:           us-central1-docker.pkg.dev/k8s-staging-images/ingress-nginx/controller:1.0.0-dev
    Image ID:        sha256:9c6eeb4f00fee2013f0aea8cfcefae86094763728f1a2d0b2555fc319aea2183
    Ports:           80/TCP, 443/TCP, 8443/TCP
    Host Ports:      80/TCP, 443/TCP, 0/TCP
    SeccompProfile:  RuntimeDefault
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/ingress-nginx-controller
      --election-id=ingress-nginx-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
      --shutdown-grace-period=350
    State:          Running
      Started:      Mon, 14 Apr 2025 14:44:26 -0600
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   90Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       ingress-nginx-controller-cd664468-x5mk9 (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k2ltk (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-admission
    Optional:    false
  kube-api-access-k2ltk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From                      Message
  ----     ------            ----  ----                      -------
  Warning  FailedScheduling  39m   default-scheduler         0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Normal   Scheduled         38m   default-scheduler         Successfully assigned ingress-nginx/ingress-nginx-controller-cd664468-x5mk9 to ingress-nginx-dev-control-plane
  Normal   Pulled            38m   kubelet                   Container image "us-central1-docker.pkg.dev/k8s-staging-images/ingress-nginx/controller:1.0.0-dev" already present on machine
  Normal   Created           38m   kubelet                   Created container: controller
  Normal   Started           38m   kubelet                   Started container controller
  Normal   RELOAD            38m   nginx-ingress-controller  NGINX reload triggered due to a change in configuration
  • kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
Name:                     ingress-nginx-controller
Namespace:                ingress-nginx
Labels:                   app.kubernetes.io/component=controller
                          app.kubernetes.io/instance=ingress-nginx
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=ingress-nginx
                          app.kubernetes.io/part-of=ingress-nginx
                          app.kubernetes.io/version=1.12.1
                          helm.sh/chart=ingress-nginx-4.12.1
Annotations:              <none>
Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.96.120.145
IPs:                      10.96.120.145
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  31510/TCP
Endpoints:                10.244.0.13:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  30212/TCP
Endpoints:                10.244.0.13:443
Session Affinity:         None
External Traffic Policy:  Cluster
Internal Traffic Policy:  Cluster
Events:                   <none>
  • Current state of ingress object, if applicable:
    • kubectl -n <appnamespace> get all,ing -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP            NODE                              NOMINATED NODE   READINESS GATES
pod/httpd-777868ddb6-m4gfb   1/1     Running   0          44m   10.244.0.12   ingress-nginx-dev-control-plane   <none>           <none>

NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE     SELECTOR
service/httpd   ClusterIP   10.96.198.184   <none>        80/TCP    6h57m   app=httpd

NAME                    READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS   IMAGES         SELECTOR
deployment.apps/httpd   1/1     1            1           6h57m   httpd        httpd:alpine   app=httpd

NAME                               DESIRED   CURRENT   READY   AGE     CONTAINERS   IMAGES         SELECTOR
replicaset.apps/httpd-777868ddb6   1         1         1       44m     httpd        httpd:alpine   app=httpd,pod-template-hash=777868ddb6
replicaset.apps/httpd-798d447958   0         0         0       6h57m   httpd        httpd:alpine   app=httpd,pod-template-hash=798d447958

NAME                                 CLASS   HOSTS                    ADDRESS         PORTS   AGE
ingress.networking.k8s.io/wildcard   nginx   *.wildcard.example.com   10.96.120.145   80      6h57m
  • kubectl -n <appnamespace> describe ing <ingressname>
Name:             wildcard
Labels:           <none>
Namespace:        httpd
Address:          10.96.120.145
Ingress Class:    nginx
Default backend:  <default>
Rules:
  Host                    Path  Backends
  ----                    ----  --------
  *.wildcard.example.com
                          /   httpd:80 (10.244.0.12:80)
Annotations:              <none>
Events:
  Type    Reason  Age                From                      Message
  ----    ------  ----               ----                      -------
  Normal  Sync    49m                nginx-ingress-controller  Scheduled for sync
  Normal  Sync    46m (x2 over 47m)  nginx-ingress-controller  Scheduled for sync
  Normal  Sync    40m                nginx-ingress-controller  Scheduled for sync
  • If applicable, then, your complete and exact curl/grpcurl command (redacted if required) and the response to the curl/grpcurl command with the -v flag

I used hurl to simulate the re-use of a connection to send multiple requests. The config file is the graceful_shutdown.hurl file shown in the reproduction steps below.


How to reproduce this issue:

Setup

  1. make dev-env. Make sure controller.extraArgs.shutdown-grace-period is set to 350 and controller.terminationGracePeriodSeconds to 360. Any time period will work; you just need to keep the pod around long enough to show the issue.
  2. kubectl create namespace httpd
  3. kubectl create deployment httpd -n httpd --image=httpd:alpine
  4. kubectl expose deployment -n httpd httpd --port 80
  5. kubectl -n httpd create ingress wildcard --class nginx --rule "*.wildcard.example.com/*"=httpd:80

You now have a working cluster. Hit it with curl --connect-to ::127.0.0.1: "http://one.wildcard.example.com" to see that it works.

Set up a file graceful_shutdown.hurl with the following contents:

GET http://one.wildcard.example.com
HTTP 200

GET http://one.wildcard.example.com
[Options]
delay: 10s
HTTP 200

GET http://one.wildcard.example.com
[Options]
delay: 20s
HTTP 200

GET http://one.wildcard.example.com
[Options]
delay: 30s
HTTP 200

GET http://one.wildcard.example.com
[Options]
delay: 40s
HTTP 200

GET http://one.wildcard.example.com
[Options]
delay: 60s
HTTP 200

GET http://one.wildcard.example.com
[Options]
delay: 90s
HTTP 200

GET http://one.wildcard.example.com
[Options]
delay: 120s
HTTP 200

Reproduce the Issue

  1. In one terminal run hurl --verbose --connect-to ::127.0.0.1: graceful_shutdown.hurl to start the connection.
  2. In a new terminal follow the logs of the current ingress-nginx pod: kubectl logs ingress-nginx-controller-cd664468-x5mk9 -f
  3. In a new terminal, rollout restart ingress-nginx: kubectl rollout restart deploy/ingress-nginx-controller -n ingress-nginx
  4. Observe in the logs of the controller pod: Received SIGTERM, shutting down
  5. Wait for a request from hurl to go through and observe the log line in the terminating controller pod:
172.18.0.1 - - [14/Apr/2025:21:32:50 +0000] "GET / HTTP/1.1" 200 45 "-" "hurl/6.0.0" 87 0.002 [httpd-httpd-80] [] 10.244.0.12:80 45 0.002 200 33b4e103ac3ac7cf0955ba9a47f138bb 
  6. Get the IP of the current httpd pod: kubectl get pod -n httpd -o wide (10.244.0.12 in my case)
  7. Rollout restart httpd in order to get a new pod IP: kubectl rollout restart deploy/httpd -n httpd
  8. Get the IP of the new httpd pod: kubectl get pod -n httpd -o wide (10.244.0.15 in my case)
  9. Observe errors in the controller pod:
2025/04/14 21:33:25 [error] 81#81: *9182 upstream timed out (110: Operation timed out) while connecting to upstream, client: 172.18.0.1, server: ~^(?<subdomain>[\w-]+)\.wildcard\.example\.com$, request: "GET / HTTP/1.1", upstream: "http://10.244.0.12:80/", host: "one.wildcard.example.com"
2025/04/14 21:33:30 [error] 81#81: *9182 upstream timed out (110: Operation timed out) while connecting to upstream, client: 172.18.0.1, server: ~^(?<subdomain>[\w-]+)\.wildcard\.example\.com$, request: "GET / HTTP/1.1", upstream: "http://10.244.0.12:80/", host: "one.wildcard.example.com"
2025/04/14 21:33:35 [error] 81#81: *9182 upstream timed out (110: Operation timed out) while connecting to upstream, client: 172.18.0.1, server: ~^(?<subdomain>[\w-]+)\.wildcard\.example\.com$, request: "GET / HTTP/1.1", upstream: "http://10.244.0.12:80/", host: "one.wildcard.example.com"
  10. Note that the IP is that of the old httpd pod (10.244.0.12) and not of the new one (10.244.0.15).
