Skip to content

Conversation

@tyler-lloyd
Copy link
Contributor

@tyler-lloyd tyler-lloyd commented Dec 13, 2024

Reason for Change:

CNS should regularly check if it is able to reach the apiserver to mitigate the risk of losing access
and then failing silently. The issue linked below is a scenario where CNS was not reloading the
refreshed SA token from disk resulting in controller-runtime's failure to update its cache and watch
pods and NNCs. This unauthorized failure went unnoticed until workloads were scaled up on the node,
at which point CNS was unable to update the NNC to request for more IPs so the pods were stuck in
Pending.

For now, this will just add a checker when the ChannelMode is CRD. Any other modes will get the
default Ping checker built-in to controller-runtime.

Issue Fixed:

Follow up for Azure/AKS#4679

Requirements:

Notes:

@tyler-lloyd tyler-lloyd force-pushed the tyler-lloyd/cns-health-checker branch from b78005a to 20ed505 Compare December 13, 2024 18:57
@rbtr rbtr added enhancement cns Related to CNS. labels Dec 13, 2024
@tyler-lloyd tyler-lloyd marked this pull request as ready for review December 16, 2024 17:23
@tyler-lloyd tyler-lloyd requested a review from a team as a code owner December 16, 2024 17:23
@tyler-lloyd tyler-lloyd force-pushed the tyler-lloyd/cns-health-checker branch from 1dffb35 to e7602f7 Compare December 16, 2024 17:26
Copy link
Contributor

@ramiro-gamarra ramiro-gamarra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change should be behind a flag since it is otherwise a breaking change for non-k8s scenarios.

not every instance of CNS will need (or can) check NNCs. The `CRD` channel mode is used by AKS to indicate that CNS will be reading/watching NNCs. `AzureHost` is a newer mode that's used in nodesubnet where NNCs aren't used and therefore CNS has no reason to have its health depend on NNC access.
@tyler-lloyd tyler-lloyd force-pushed the tyler-lloyd/cns-health-checker branch from 5f5812c to bcc574f Compare December 18, 2024 14:42
@github-actions
Copy link

github-actions bot commented Jan 2, 2025

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Jan 2, 2025
@rbtr rbtr removed the stale Stale due to inactivity. label Jan 2, 2025
@github-actions
Copy link

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Jan 17, 2025
@rbtr rbtr removed the stale Stale due to inactivity. label Jan 20, 2025
@github-actions
Copy link

github-actions bot commented Feb 4, 2025

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Feb 4, 2025
@github-actions
Copy link

Pull request closed due to inactivity.

@github-actions github-actions bot closed this Feb 11, 2025
@rbtr rbtr reopened this Feb 11, 2025
@rbtr rbtr removed the stale Stale due to inactivity. label Feb 11, 2025
@github-actions
Copy link

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Feb 26, 2025
@rbtr rbtr removed the stale Stale due to inactivity. label Feb 26, 2025
Copy link
Collaborator

@rbtr rbtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stalebot reminded me of this 😟 looks good today, i'm happy to take it

@rbtr
Copy link
Collaborator

rbtr commented Feb 26, 2025

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).
@rbtr rbtr enabled auto-merge February 26, 2025 00:18
@rbtr rbtr dismissed ramiro-gamarra’s stale review February 26, 2025 00:19

ramiro is out of office, but has already resolved their comments

@github-actions
Copy link

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Mar 13, 2025
@rbtr rbtr removed the stale Stale due to inactivity. label Mar 13, 2025
@rbtr rbtr added this pull request to the merge queue Mar 17, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 17, 2025
@rbtr rbtr added this pull request to the merge queue Mar 18, 2025
Merged via the queue into Azure:master with commit d1409b3 Mar 18, 2025
22 checks passed
sivakami-projects pushed a commit that referenced this pull request Oct 23, 2025
* feat: CNS checks apiserver in healthz * chore: only check NNCs if `ChannelMode` is `CRD` not every instance of CNS will need (or can) check NNCs. The `CRD` channel mode is used by AKS to indicate that CNS will be reading/watching NNCs. `AzureHost` is a newer mode that's used in nodesubnet where NNCs aren't used and therefore CNS has no reason to have its health depend on NNC access. * test: add unit tests * refactor: return error from NewHealthzHandlerWithChecks instead of panicking * chore: address lint errors * refactor: only get kubeConfig when in CRD mode * chore: fix lint errors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cns Related to CNS. enhancement

5 participants