
kube-dns never resolves if a domain returns NOERROR with 0 answer records once #121

@ahmetb


tl;dr: If a nameserver answers a DNS A question with status=NOERROR and an empty answer section, kube-dns caches that result indefinitely. If the name later gets an A record, it still never resolves from Pods (I waited a few days), although it resolves just fine outside the cluster (e.g. on my laptop).

Repro steps

Prerequisites

  • Have a domain name (here, alp.im) whose nameservers point to CloudFlare.
  • Have nslookup/dig installed on your workstation.
  • Have a minikube cluster ready on your workstation
    • running kubernetes v1.6.0
    • kube-dns comes by default, running gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1

Step 1: Domain does not exist, query from your laptop

Note ANSWER: 0, and status: NOERROR

$ dig A z.alp.im
; <<>> DiG 9.8.3-P1 <<>> A z.alp.im
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64978
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;z.alp.im.	IN	A
;; AUTHORITY SECTION:
alp.im.	1799	IN	SOA	ivan.ns.cloudflare.com. dns.cloudflare.com. 2025042470 10000 2400 604800 3600
;; Query time: 196 msec
;; SERVER: 2401:fa00:fa::1#53(2401:fa00:fa::1)
;; WHEN: Thu Jun 29 10:51:35 2017
;; MSG SIZE rcvd: 99

Step 2: Domain does not exist, query from Pod on Kubernetes

Start a toolbelt/dig container with shell and run the same query:

⚠️ Do not exit this container as you will reuse it later.

Note the response is the same, ANSWER: 0 and NOERROR.

$ kubectl run -i -t --rm --image=toolbelt/dig dig --command -- sh
If you don't see a command prompt, try pressing enter.
/ # dig A z.alp.im
; <<>> DiG 9.11.1-P1 <<>> A z.alp.im
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11209
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;z.alp.im.	IN	A
;; AUTHORITY SECTION:
alp.im.	1724	IN	SOA	ivan.ns.cloudflare.com. dns.cloudflare.com. 2025042470 10000 2400 604800 3600
;; Query time: 74 msec
;; SERVER: 10.0.0.10#53(10.0.0.10)
;; WHEN: Thu Jun 29 17:55:46 UTC 2017
;; MSG SIZE rcvd: 99

(Also note SERVER: 10.0.0.10#53, which is kube-dns.)
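
For context on how long such an empty answer may legitimately be cached: RFC 2308 ties the lifetime of a negative (NOERROR, no data) cache entry to the SOA record in the authority section, in practice the smaller of the SOA's own TTL and its MINIMUM field. The helper below is a minimal sketch that reads that value off dig output. The function name negative_ttl is hypothetical, and it assumes the usual one-line SOA presentation that dig prints.

```shell
# negative_ttl: read dig output on stdin and print the RFC 2308
# negative-cache lifetime in seconds, i.e. min(SOA TTL, SOA MINIMUM).
# Hypothetical helper; assumes the usual one-line SOA record format:
#   name ttl class SOA mname rname serial refresh retry expire minimum
negative_ttl() {
  awk '$4 == "SOA" && $1 !~ /^;/ {
         ttl = $2 + 0      # TTL of the SOA record as returned
         min = $11 + 0     # SOA MINIMUM field
         print (ttl < min ? ttl : min)
         exit
       }'
}

# The SOA line returned through kube-dns in Step 2:
printf '%s\n' 'alp.im. 1724 IN SOA ivan.ns.cloudflare.com. dns.cloudflare.com. 2025042470 10000 2400 604800 3600' | negative_ttl
```

With the Step 2 values this prints 1724, so a cache that honors the SOA should drop the empty answer in under 30 minutes.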

Step 3: Create an A record for the domain

Here I use CloudFlare as it manages my DNS.

image

Step 4: Test DNS record from your laptop

Run dig on your laptop (note the ;; ANSWER SECTION: and the 8.8.8.8 answer):

$ dig A z.alp.im
; <<>> DiG 9.8.3-P1 <<>> A z.alp.im
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37570
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;z.alp.im.	IN	A
;; ANSWER SECTION:
z.alp.im.	299	IN	A	8.8.8.8
;; Query time: 196 msec
;; SERVER: 2401:fa00:fa::1#53(2401:fa00:fa::1)
;; WHEN: Thu Jun 29 10:54:44 2017
;; MSG SIZE rcvd: 53

Step 5: Test DNS record from Pod on Kubernetes

Run the same command again:

/ # dig A z.alp.im
; <<>> DiG 9.11.1-P1 <<>> A z.alp.im
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45420
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;z.alp.im.	IN	A
;; Query time: 0 msec
;; SERVER: 10.0.0.10#53(10.0.0.10)
;; WHEN: Thu Jun 29 18:00:24 UTC 2017
;; MSG SIZE rcvd: 37

Note the diff:

  • still ANSWER: 0 and status: NOERROR (but it resolves just fine outside the cluster)
  • ;; AUTHORITY SECTION: disappeared, and AUTHORITY: changed from 1 to 0 since the previous run.
  • ;; Query time: 0 msec (was 74 msec in Step 2). I assume this means it's just a cached response.
    • Query time stays as 0 ms no matter how many times I run the same command.
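
To make the fields in this diff easier to compare across runs, the following is a minimal sketch of a filter that condenses dig output to one line. The function name summarize_dig and the output format are hypothetical; it only relies on the header, flags, and Query time lines that dig prints in the transcripts above.

```shell
# summarize_dig: read dig output on stdin and print one summary line with
# the response status, ANSWER and AUTHORITY counts, and query time in ms.
# Hypothetical helper for comparing runs; not part of dig or kube-dns.
summarize_dig() {
  awk '
    /status:/ {
      for (i = 1; i < NF; i++)
        if ($i == "status:") { status = $(i + 1); sub(/,$/, "", status) }
    }
    /ANSWER:/ && /AUTHORITY:/ {
      for (i = 1; i < NF; i++) {
        if ($i == "ANSWER:")    { answer = $(i + 1);    sub(/,$/, "", answer) }
        if ($i == "AUTHORITY:") { authority = $(i + 1); sub(/,$/, "", authority) }
      }
    }
    /Query time:/ { ms = $4 }
    END {
      printf "status=%s answer=%s authority=%s query_ms=%s\n", status, answer, authority, ms
    }'
}

# The relevant lines of the Step 5 response:
summarize_dig <<'EOF'
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45420
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; Query time: 0 msec
EOF
```

For the Step 5 response this prints status=NOERROR answer=0 authority=0 query_ms=0. Inside the Pod, the live equivalent would be dig A z.alp.im | summarize_dig.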

What else I tried

  • Try it on GKE: I tried with k8s v1.5.x and v1.6.4. → Same issue. (cc: @bowei)

  • Query from a different pod on minikube: I started a new Pod and queried from there → Same issue.

  • Restart kube-dns Pod → This worked on GKE, but not on minikube.

    $ kubectl delete pods -n kube-system -l k8s-app=kube-dns
    pod "kube-dns-268032401-69xk5" deleted
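
One further check that could narrow down where the stale answer lives is to ask kube-dns and a server outside the cluster for the same name from inside the Pod. This is a sketch under assumptions: the function name compare_resolvers is hypothetical, and whether the Pod can reach an external server directly depends on the cluster's network setup.

```shell
# compare_resolvers NAME CLUSTER_DNS UPSTREAM
# Print the A records each resolver returns for NAME and exit 0 only if
# both agree. Hypothetical helper; meant to be run inside the Pod.
# "dig +short" prints only the answer data, so an empty result means
# no A record was returned.
compare_resolvers() {
  name=$1 cluster_dns=$2 upstream=$3
  via_cluster=$(dig "@$cluster_dns" "$name" A +short)
  via_upstream=$(dig "@$upstream" "$name" A +short)
  echo "cluster  ($cluster_dns): ${via_cluster:-(none)}"
  echo "upstream ($upstream): ${via_upstream:-(none)}"
  [ "$via_cluster" = "$via_upstream" ]
}
```

For example, compare_resolvers z.alp.im 10.0.0.10 ivan.ns.cloudflare.com queries kube-dns and the authoritative nameserver named in the SOA record above. A non-zero exit status means the two disagree.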

Impact

I am not sure why this has not been discovered before. I noticed this behavior while using kube-lego on GKE. When kube-lego applies for a TLS certificate, it polls the domain name of the service (e.g. example.com/.well-known/<token>) before asking Let's Encrypt to validate it. Until I create an Ingress with the kube-lego annotation, I don't have the external IP, so I can't configure the domain yet. The kube-lego Pod, however, picks up the Ingress right away and starts querying my domain in an endless loop. It never succeeds: the first time it looked up the hostname the A record didn't exist, and that result is cached forever. After I add the A record, the name still can't be resolved. The moment I delete the kube-dns Pods and they get recreated, the hostname resolves and the kube-lego challenge completes.
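
The failure mode can be pictured as a bounded polling loop against kube-dns. The sketch below is illustrative only: wait_for_a_record is a hypothetical helper, it polls DNS rather than the HTTP endpoint, and it is not how kube-lego itself is implemented.

```shell
# wait_for_a_record NAME SERVER TRIES DELAY
# Poll SERVER for an A record of NAME, up to TRIES times, sleeping DELAY
# seconds after each failed attempt. Prints the address and returns 0 on
# success; returns 1 if the name never resolved.
# Hypothetical helper, not taken from kube-lego.
wait_for_a_record() {
  name=$1 server=$2 tries=$3 delay=$4
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    addr=$(dig "@$server" "$name" A +short)
    if [ -n "$addr" ]; then
      echo "$addr"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "gave up on $name after $tries attempts" >&2
  return 1
}
```

A call such as wait_for_a_record z.alp.im 10.0.0.10 30 10 would retry for about five minutes. With the behavior reported here, it would run out of attempts even after the A record exists, because every lookup through kube-dns keeps returning the cached empty answer.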
