Retina can break connectivity of pods to the Kubernetes Cluster IP in clusters using Cilium #252

@andreev-io

This issue is seen in both AKS and GCP. See notes for AKS at #252 (comment)

Describe the bug
Upon installation of Retina, connectivity can be lost for pods in a GKE cluster using managed Cilium.

To Reproduce

  1. Go to create a standard GKE cluster.
  2. Select the Standard: You manage your cluster option (see screenshot 1).
  3. Select No channel in the release channel selector and specify GKE version 1.26.11-gke.105500 (see screenshot 2). We suspect the issue occurs with other versions too, but we pinned a specific one for reproducibility.
  4. [Optional] Configure the cluster to run in one AZ with fewer nodes than the default to manage cost.
  5. [Important] In the Networking configuration tab for the entire cluster, select Enable Dataplane V2 to enable managed Cilium-powered networking.
  6. Create the cluster and wait for all default pods in the cluster to come up.
  7. Install Retina and wait for the agent pods to start.
> VERSION=$( curl -sL https://api.github.com/repos/microsoft/retina/releases/latest | jq -r .name)
helm install retina oci://ghcr.io/microsoft/retina/charts/retina \
    --set namespace=kube-system \
    --version $VERSION \
    --namespace kube-system \
    --set image.tag=$VERSION \
    --set operator.tag=$VERSION \
    --set image.pullPolicy=Always \
    --set logLevel=info \
    --set operator.enabled=true \
    --set operator.enableRetinaEndpoint=true \
    --set enabledPlugin_linux="\[packetparser\]" \
    --set enablePodLevel=true \
    --set remoteContext=true

Note: if you are running a cluster with small nodes, you might need to manually edit the retina-agent DaemonSet to lower resource requests. Wait until retina-agent pods start.

  8. Identify metrics-server running in the kube-system namespace and check its logs. You will see error logs such as
E0409 15:21:23.378785 1 webhook.go:202] Failed to make webhook authorizer request: Post "https://10.114.192.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": context canceled
E0409 15:21:23.378851 1 errors.go:77] Post "https://10.114.192.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": context canceled
  9. Identify the cluster IP and the endpoint IP:
> kubectl get service
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.114.192.1   <none>        443/TCP   45m
> kubectl get ep
NAME         ENDPOINTS        AGE
kubernetes   10.128.0.7:443   45m
  10. Connect to another pod and check connectivity to both addresses. You'll see that there is connectivity to the endpoint IP but not to the service IP.
> kubectl debug -ti --image="nixery.dev/shell/curl" kube-dns-ff4bbcc87-tvzm7 -n kube-system
bash-5.2# curl https://10.114.192.1 -v -k
...
bash-5.2# curl https://10.128.0.7 -v -k
* Trying 10.128.0.7:443...
* Connected to 10.128.0.7 (10.128.0.7) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN: server accepted h2
* Server certificate:
* subject: CN=34.173.138.225
* start date: Apr 9 14:52:44 2024 GMT
* expire date: Apr 8 14:54:44 2029 GMT
* issuer: CN=ca353e3b-048b-4feb-aa93-19a7c8a6aa89
* SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://10.128.0.7/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: 10.128.0.7]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: 10.128.0.7
> User-Agent: curl/8.4.0
> Accept: */*
>
* received GOAWAY, error=0, last_stream=1
< HTTP/2 403
< audit-id: 2c7f6280-d595-4ddf-850f-abf1cadd85d8
< cache-control: no-cache, private
< content-type: application/json
< x-content-type-options: nosniff
< x-kubernetes-pf-flowschema-uid: 759447f6-3823-412a-86a3-09c764ef91eb
< x-kubernetes-pf-prioritylevel-uid: 2707b41b-d15c-402a-a039-b0df8aff1c2d
< content-length: 217
< date: Tue, 09 Apr 2024 15:45:36 GMT
<
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
* Closing connection
* TLSv1.3 (OUT), TLS alert, close notify (256):
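
The service IP versus endpoint IP comparison above can be scripted so it is easy to repeat before and after installing Retina. The following is a minimal sketch and not part of the original report: it assumes it runs inside a debug container that has `curl` (such as the `nixery.dev/shell/curl` image used above), and the function names `probe`, `classify`, and `check_api_paths` are hypothetical. It treats any completed HTTPS exchange, including the 403 response shown above, as proof of connectivity.

```shell
#!/bin/sh
# Sketch: compare reachability of the kubernetes Service ClusterIP and the
# API server endpoint IP from inside a pod. All function names are hypothetical.

# probe IP: succeed if an HTTPS exchange with IP:443 completes.
# An HTTP 403 still counts as success, because without -f curl only fails
# on transport errors, e.g. exit 7 (connect failed) or 28 (timeout).
probe() {
  curl -sk -o /dev/null --connect-timeout 5 --max-time 10 "https://$1:443/"
}

# classify SERVICE_RC ENDPOINT_RC: turn two curl exit codes into a verdict.
classify() {
  if [ "$1" -eq 0 ] && [ "$2" -eq 0 ]; then
    echo "ok"
  elif [ "$2" -eq 0 ]; then
    echo "service-ip-unreachable"   # the symptom described in this issue
  elif [ "$1" -eq 0 ]; then
    echo "endpoint-ip-unreachable"
  else
    echo "both-unreachable"
  fi
}

# check_api_paths SERVICE_IP ENDPOINT_IP: probe both and print the verdict.
check_api_paths() {
  svc_rc=0
  probe "$1" || svc_rc=$?
  ep_rc=0
  probe "$2" || ep_rc=$?
  classify "$svc_rc" "$ep_rc"
}

# Run only when both IPs are passed as arguments.
if [ "$#" -eq 2 ]; then
  check_api_paths "$1" "$2"
fi
```

Saved under an arbitrary name such as `check.sh`, it would be invoked with the addresses from this report as `sh check.sh 10.114.192.1 10.128.0.7`. On an affected cluster it should print `service-ip-unreachable`.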

Expected behaviour
No connectivity impact when installing Retina.

Screenshots
Step (2). Select Standard: You manage your cluster.
image

Step (3). Select No channel, then specify version 1.26.11-gke.1055000.
image

Step (5). Select Enable Dataplane V2 in the cluster network configuration tab.
image

Platform (please complete the following information):
See steps to reproduce.

Additional context
N/A
