Skip to content

Conversation

@eplightning
Copy link
Contributor

@eplightning eplightning commented Sep 18, 2023

Pull Request Motivation

This PR fixes a race condition in Azure DNS that happens when issuing certificates concurrently in different clusters as described in #6229.

Similar issues were already fixed for AWS/GCP/Cloudflare in other PRs:

To make this work in Azure DNS, this PR makes use of optimistic concurrency control supported by Azure DNS API via standard If-Match, If-None-Match, Etag headers: https://learn.microsoft.com/en-us/rest/api/dns/record-sets/create-or-update?tabs=HTTP

Current version simply unconditionally overwrites any changes. New version introduced in this PR does the following for all updates:

  1. Try to fetch an already existing matching record. If found, Azure DNS will return the record set and its version (Etag).

  2. Modify record set as necessary: Append if doesn't exist on Present, remove if exists on CleanUp

  3. Call appropriate Azure API:

  • If after modifications no records remain, call Delete API. Etag will ensure we will only delete record set if no modifications happened concurrently. Call is skipped if step 1 didn't find a matching record set.

  • If there are records and a matching record set was not found: call CreateOrUpdate API with If-None-Match: *. This will ensure we only create and not update (otherwise we would overwrite other changes).

  • If there are records and we found a matching record set: call CreateOrUpdate API with If-Match: $Etag. Call will only succeed if no other updates happened concurrently.

  1. If any precondition fail, Azure DNS will return an error which will be eventually retried by cert-manager.

Kind

/kind bug

Release Note

Fix ACME issuer being stuck waiting for DNS propagation when using Azure DNS with multiple instances issuing for the same FQDN 
@jetstack-bot jetstack-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 22, 2023
@irbekrm
Copy link
Contributor

irbekrm commented Sep 26, 2023

/ok-to-test

@eplightning
Copy link
Contributor Author

/retest

1 similar comment
@eplightning
Copy link
Contributor Author

/retest

@jetstack-bot jetstack-bot added the dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. label Nov 13, 2023
@jetstack-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jakexks for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot added area/acme Indicates a PR directly modifies the ACME Issuer code area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code labels Nov 13, 2023
@eplightning
Copy link
Contributor Author

Alright rebasing seems to have fixed the bot and tests not passing

@jetstack-bot jetstack-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2024
@jetstack-bot
Copy link
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jetstack-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2024
@eplightning
Copy link
Contributor Author

/remove-lifecycle stale

I'll rebase this eventually ...

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2024
@oliverplann
Copy link

@eplightning @irbekrm @maelvls Sorry for tagging you, but is there any update on this whatsoever? The PR was created quite a long time ago and similar fixes have been merged across other DNS providers.

There has been quite a lot of mentions on this e.g.:

I can confirm this also happens on a single cluster attempting to issue either multiple certificates or multiple dnsNames using the same less-priviliged domain in Azure.

I'm not a GO expert, but I have Azure infrastructure available to help you test, if you instruct me how. Currently I'm installing cert-manager through the official helm chart, but I imagine I'll have to do something else in this case.

Furthermore, is there any known workaround as long as this issue persists?

// Oliver

Signed-off-by: Bartosz Slawianowski <bartosz.slawianowski@natzka.com>
Signed-off-by: Bartosz Slawianowski <bartosz.slawianowski@natzka.com>
@cert-manager-prow cert-manager-prow bot added dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. and removed dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 10, 2024
Signed-off-by: Bartosz Slawianowski <bartosz.slawianowski@natzka.com>
@cert-manager-prow cert-manager-prow bot added dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. and removed dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. labels May 10, 2024
Signed-off-by: Bartosz Slawianowski <bartosz.slawianowski@natzka.com>
@eplightning
Copy link
Contributor Author

eplightning commented May 10, 2024

@oliverplann I rebased the PR, so whatever happens next is up to this project's maintainers.

I'm not aware of any workarounds, whenever this issue happens either a manual DNS update is required or the certificate issuance needs to be somehow attempted again - probably by removing CertificateRequest? (I don't really remember at this point honestly).

@cert-manager-prow cert-manager-prow bot added dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. and removed dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. labels May 10, 2024
Signed-off-by: Bartosz Slawianowski <bartosz.slawianowski@natzka.com>
@cert-manager-prow cert-manager-prow bot added dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. and removed dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. labels May 10, 2024
@eplightning
Copy link
Contributor Author

/retest

@oliverplann
Copy link

@oliverplann I rebased the PR, so whatever happens next is up to this project's maintainers.

I'm not aware of any workarounds, whenever this issue happens either a manual DNS update is required or the certificate issuance needs to be somehow attempted again - probably by removing CertificateRequest? (I don't really remember at this point honestly).

Thank you for acting quick. I hope the maintainers will prioritize this...

@oliverplann
Copy link

oliverplann commented May 13, 2024

TLDR
I'll test this PR and have also found a suitable workaround:
Using Helm: Set the value maxConcurrentChallenges: 1 (Default: 60)
Using kubectl apply: Edit deployment.yaml and set --max-concurrent-challenges=1

Original
I've reached out to the maintainers on the Kubernetes slack and they've instructed me how to test this PR, which I will get done today or tomorrow.

Similarly to a multi-cluster setup, if you are using a delegation domain (source) to issue multiple certificates or dnsNames e.g. _acme-challenge.less-privileged.example.org cert-manager will reuse the same TXT record for every challenge in the cluster. This currently results in an edge case where each challenge will randomly overwrite the value of the TXT record, before the previous challenge succeeded. This in turn puts most, if not all of the challenges in a pending state.

As long as this bug exists, it's facilitated through the default value of maxConcurrentChallenges.

I've found a suitable workaround until this PR has been adressed. Sharing it here in case anyone else needs a working cert-manager without manually interacting with cert-manager resources or Azure DNS:

Using Helm: Set the value maxConcurrentChallenges: 1 (Default: 60)
Using kubectl apply: Edit deployment.yaml and set --max-concurrent-challenges=1

Keep in mind that It will take significantly longer for cert-manager to validate challenges and issue certificates this way, depending on the amount of certs and dnsNames in your environment, but it works!
Also this shouldn't create issues renewing existing certs, as cert-manager will wait until every challenge for a given certificate is Valid before it changes the Ready state of the certificate.

Here's an example of what challenges will look like with maxConcurrentChallenges set to 1, handling 1 at a time:

example-com-certificate-1-2696367434-4084983421           example.com         4m55s example-org-certificate-1-2696367434-410157576              k8s.example.org   4m55s example-org-certificate-1-2696367434-413199771              tst.k8s.example.org 4m55s example-net-certificate-1-1612508506-2239642643   pending   k8s.example.net  4m56s example-net-certificate-1-1612508506-2283501741   valid     example.net       4m56s example-net-certificate-1-1612508506-4097195913   valid     example.net        4m56s 
@oliverplann
Copy link

Hello @eplightning

I've tested your PR using AKS & Azure DNS and it works perfectly! Thank you very much for your contribution.
I reached out to @SgtCoDFish on Slack and he'll look into merging this when he has time.

My findings below.

Challenges are now adding separate entries/values to the _acme-challenge TXT record in Azure DNS, as opposed to overwriting the value each time:

txt-record

kubectl get challenges shows that multiple challenges are being evaluated concurrently and their state are steadily changing to Valid:

NAME STATE DOMAIN example-com-certificate-1-196084347-1516893050 pending tst.k8s.example.com example-com-certificate-1-196084347-2394518141 valid k8s.example.com example-com-certificate-1-196084347-3341303458 example.com example-org-certificate-1-196084347-3621474682 pending tst.k8s.example.org example-org-certificate-1-196084347-4044510119 pending k8s.example.org example-org-certificate-1-196084347-4127470729 valid example.org example-net-certificate-1-2696367434-2908286604 pending tst.k8s.example.net example-net-certificate-1-2696367434-3121967868 valid k8s.example.net example-net-certificate-1-2696367434-3152871262 valid example.net 

Secrets & certificates are created:

NAME TYPE DATA AGE example-com-tls-secret kubernetes.io/tls 2 93m example-net-tls-secret kubernetes.io/tls 2 93m example-org-tls-secret kubernetes.io/tls 2 93m NAME READY SECRET AGE example-com-certificate True example-com-tls-secret 93m example-net-certificate True example-net-tls-secret 93m example-org-certificate True example-org-tls-secret 93m 

Please let me know if anything is missing before we can merge this PR :-)

// Oliver

@SgtCoDFish SgtCoDFish added this to the 1.15 milestone May 14, 2024
Copy link
Member

@SgtCoDFish SgtCoDFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

We're not really able to easily test Azure changes so big thanks to @oliverplann for testing this! And of course, thank you to everyone else involved for your patience and your efforts here. This should make it into 1.15 for sure.

@cert-manager-prow cert-manager-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 14, 2024
@cert-manager-prow
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SgtCoDFish

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cert-manager-prow cert-manager-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2024
@inteon inteon modified the milestones: 1.15, 1.16 May 14, 2024
@cert-manager-prow cert-manager-prow bot merged commit 7db560c into cert-manager:master May 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code area/acme Indicates a PR directly modifies the ACME Issuer code dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. lgtm Indicates that a PR is ready to be merged. ok-to-test release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

6 participants