This page describes the specialized Network Function Kubernetes operator that Google Distributed Cloud ships with. This operator implements a set of CustomResourceDefinitions (CRDs) that allow Distributed Cloud to execute high-performance workloads.
Note: The Network Function operator and SR-IOV functionality are not available on Distributed Cloud Servers.
The Network Function operator lets you do the following:
- Poll for existing network devices on a node.
- Query the IP address and physical link state for each network device on a node.
- Provision additional network interfaces on a node.
- Configure low-level system features on the node's physical machine required to support high-performance workloads.
- Use single-root input/output virtualization (SR-IOV) on PCI Express network interfaces to virtualize them into multiple virtual interfaces. You can then configure your Distributed Cloud workloads to use those virtual network interfaces.
Distributed Cloud's support for SR-IOV is based on open source projects, including the SR-IOV Network Operator and the SR-IOV CNI plug-in.
## Prerequisites
The Network Function operator fetches network configuration from the Distributed Cloud Edge Network API. To allow this, you must grant the Network Function operator service account the Edge Network Viewer role (`roles/edgenetwork.viewer`) using the following command:

```shell
gcloud projects add-iam-policy-binding PROJECT_ID \
    --role roles/edgenetwork.viewer \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[nf-operator/nf-angautomator-sa]"
```

Replace `PROJECT_ID` with the ID of the target Google Cloud project.
## Network Function operator resources
The Distributed Cloud Network Function operator implements the following Kubernetes CRDs:
- `Network`. Defines a virtual network that Pods can use to communicate with internal and external resources. You must create the corresponding VLAN using the Distributed Cloud Edge Network API before specifying it in this resource. For instructions, see Create a network.
- `NetworkInterfaceState`. Enables the discovery of network interface states and lets you query a network interface for its link state and IP address.
- `NodeSystemConfigUpdate`. Enables the configuration of low-level system features such as kernel options and `Kubelet` flags.
- `SriovNetworkNodePolicy`. Selects a group of SR-IOV virtualized network interfaces and instantiates the group as a Kubernetes resource. You can use this resource in a `NetworkAttachmentDefinition` resource.
- `SriovNetworkNodeState`. Lets you query the provisioning state of the `SriovNetworkNodePolicy` resource on a Distributed Cloud node.
- `NetworkAttachmentDefinition`. Lets you attach Distributed Cloud Pods to one or more logical or physical networks on your Distributed Cloud node. You must create the corresponding VLAN using the Distributed Cloud Edge Network API before specifying it in this resource. For instructions, see Create a network.
The Network Function operator also lets you define secondary network interfaces that do not use SR-IOV virtual functions.
### `Network` resource
The `Network` resource defines a virtual network within the Distributed Cloud rack that Pods within your Distributed Cloud cluster can use to communicate with internal and external resources.
The `Network` resource provides the following configurable parameters for the network interface, exposed as writable fields:

- `spec.type`: specifies the network transport layer for this network. The only valid value is `L2`. You must also specify a `nodeInterfaceMatcher.interfaceName` value.
- `spec.nodeInterfaceMatcher.interfaceName`: the name of the physical network interface on the target Distributed Cloud node to use with this network.
- `spec.gateway4`: the IP address of the network gateway for this network.
- `spec.l2NetworkConfig.prefixLength4`: specifies the CIDR prefix length for this network.
The following example illustrates the structure of the resource:
```yaml
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vlan200-network
  annotations:
    networking.gke.io/gdce-vlan-id: 200
    networking.gke.io/gdce-vlan-mtu: 1500
spec:
  type: L2
  nodeInterfaceMatcher:
    interfaceName: gdcenet0.200
  gateway4: 10.53.0.1
```
### `NetworkInterfaceState` resource
The `NetworkInterfaceState` resource is a read-only resource that lets you discover physical network interfaces on the node and collect runtime statistics on the network traffic flowing through those interfaces. Distributed Cloud creates a `NetworkInterfaceState` resource for each node in a cluster.
The default configuration of Distributed Cloud machines includes a bonded network interface on the Rack Select Network Daughter Card (rNDC) named `gdcenet0`. This interface bonds the `eno1np0` and `eno2np1` network interfaces, each of which is connected to one of the two Distributed Cloud ToR switches.
The `NetworkInterfaceState` resource provides the following categories of network interface information, exposed as read-only status fields.
General information:
- `status.interfaces.ifname`: the name of the target network interface.
- `status.lastReportTime`: the time and date of the last status report for the target interface.
IP address configuration information:
- `status.interfaces.interfaceinfo.address`: the IP address assigned to the target interface.
- `status.interfaces.interfaceinfo.dns`: the IP address of the DNS server assigned to the target interface.
- `status.interfaces.interfaceinfo.gateway`: the IP address of the network gateway serving the target interface.
- `status.interfaces.interfaceinfo.prefixlen`: the length of the IP address prefix.
Hardware information:
- `status.interfaces.linkinfo.broadcast`: the broadcast MAC address of the target interface.
- `status.interfaces.linkinfo.businfo`: the PCIe device path in `bus:slot.function` format.
- `status.interfaces.linkinfo.flags`: the interface flags, for example `BROADCAST`.
- `status.interfaces.linkinfo.macAddress`: the unicast MAC address of the target interface.
- `status.interfaces.linkinfo.mtu`: the MTU value for the target interface.
Reception statistics:
- `status.interfaces.statistics.rx.bytes`: the total bytes received by the target interface.
- `status.interfaces.statistics.rx.dropped`: the total packets dropped by the target interface.
- `status.interfaces.statistics.rx.errors`: the total packet receive errors for the target interface.
- `status.interfaces.statistics.rx.multicast`: the total multicast packets received by the target interface.
- `status.interfaces.statistics.rx.overErrors`: the total packet receive overrun errors for the target interface.
- `status.interfaces.statistics.rx.packets`: the total packets received by the target interface.
Transmission statistics:
- `status.interfaces.statistics.tx.bytes`: the total bytes transmitted by the target interface.
- `status.interfaces.statistics.tx.carrierErrors`: the total carrier errors encountered by the target interface.
- `status.interfaces.statistics.tx.collisions`: the total packet collisions encountered by the target interface.
- `status.interfaces.statistics.tx.dropped`: the total packets dropped by the target interface.
- `status.interfaces.statistics.tx.errors`: the total transmission errors for the target interface.
- `status.interfaces.statistics.tx.packets`: the total packets transmitted by the target interface.
The following example illustrates the structure of the resource:
```yaml
apiVersion: networking.gke.io/v1
kind: NetworkInterfaceState
metadata:
  name: MyNode1
  nodeName: MyNode1
status:
  interfaces:
  - ifname: eno1np0
    linkinfo:
      businfo: 0000:1a:00.0
      flags: up|broadcast|multicast
      macAddress: ba:16:03:9e:9c:87
      mtu: 9000
    statistics:
      rx:
        bytes: 1098522811
        errors: 2
        multicast: 190926
        packets: 4988200
      tx:
        bytes: 62157709961
        packets: 169847139
  - ifname: eno2np1
    linkinfo:
      businfo: 0000:1a:00.1
      flags: up|broadcast|multicast
      macAddress: ba:16:03:9e:9c:87
      mtu: 9000
    statistics:
      rx:
        bytes: 33061895405
        multicast: 110203
        packets: 110447356
      tx:
        bytes: 2370516278
        packets: 11324730
  - ifname: enp95s0f0np0
    interfaceinfo:
    - address: fe80::63f:72ff:fec4:2bf4
      prefixlen: 64
    linkinfo:
      businfo: 0000:5f:00.0
      flags: up|broadcast|multicast
      macAddress: 04:3f:72:c4:2b:f4
      mtu: 9000
    statistics:
      rx:
        bytes: 37858381
        multicast: 205645
        packets: 205645
      tx:
        bytes: 1207334
        packets: 6542
  - ifname: enp95s0f1np1
    interfaceinfo:
    - address: fe80::63f:72ff:fec4:2bf5
      prefixlen: 64
    linkinfo:
      businfo: 0000:5f:00.1
      flags: up|broadcast|multicast
      macAddress: 04:3f:72:c4:2b:f5
      mtu: 9000
    statistics:
      rx:
        bytes: 37852406
        multicast: 205607
        packets: 205607
      tx:
        bytes: 1207872
        packets: 6545
  - ifname: enp134s0f0np0
    interfaceinfo:
    - address: fe80::63f:72ff:fec4:2b6c
      prefixlen: 64
    linkinfo:
      businfo: 0000:86:00.0
      flags: up|broadcast|multicast
      macAddress: 04:3f:72:c4:2b:6c
      mtu: 9000
    statistics:
      rx:
        bytes: 37988773
        multicast: 205584
        packets: 205584
      tx:
        bytes: 1212385
        packets: 6546
  - ifname: enp134s0f1np1
    interfaceinfo:
    - address: fe80::63f:72ff:fec4:2b6d
      prefixlen: 64
    linkinfo:
      businfo: 0000:86:00.1
      flags: up|broadcast|multicast
      macAddress: 04:3f:72:c4:2b:6d
      mtu: 9000
    statistics:
      rx:
        bytes: 37980702
        multicast: 205548
        packets: 205548
      tx:
        bytes: 1212297
        packets: 6548
  - ifname: gdcenet0
    interfaceinfo:
    - address: 208.117.254.36
      prefixlen: 28
    - address: fe80::b816:3ff:fe9e:9c87
      prefixlen: 64
    linkinfo:
      flags: up|broadcast|multicast
      macAddress: ba:16:03:9e:9c:87
      mtu: 9000
    statistics:
      rx:
        bytes: 34160422968
        errors: 2
        multicast: 301129
        packets: 115435591
      tx:
        bytes: 64528301111
        packets: 181171964
  # ... remaining interfaces omitted ...
  lastReportTime: "2022-03-30T07:35:44Z"
```
### `NodeSystemConfigUpdate` resource
The `NodeSystemConfigUpdate` resource lets you change the node's operating system configuration and modify `Kubelet` flags. Changes other than `sysctl` changes require a node reboot.
When instantiating this resource, you must specify the target nodes in the `nodeSelector` field, and you must include all key-value pairs for each target node in that field. When you specify more than one target node in this field, the target nodes are updated one node at a time.
CAUTION: The `nodeName` field is deprecated. Using it immediately reboots the target nodes, including local control plane nodes, which can halt critical workloads.
The `NodeSystemConfigUpdate` resource provides the following configuration fields specific to Distributed Cloud:
- `spec.containerRuntimeDNSConfig.ip`: specifies a list of IP addresses for private image registries.
- `spec.containerRuntimeDNSConfig`: specifies a list of custom DNS entries used by the Container Runtime Environment on each Distributed Cloud node. Each entry consists of the following fields:
  - `ip`: specifies the target IPv4 address.
  - `domain`: specifies the corresponding domain.
  - `interface`: specifies the network egress interface through which the IP address specified in the `ip` field is reachable. You can specify an interface defined through the following resources: `CustomNetworkInterfaceConfig`, `Network` (by annotation), or `NetworkAttachmentDefinition` (by annotation). This is a preview-level feature.
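As an illustrative sketch only, a `containerRuntimeDNSConfig` entry might look like the following. The field nesting is inferred from the field descriptions above, and the registry IP address, domain, and interface name are placeholder assumptions, not values from this page:

```yaml
# Sketch only: nesting inferred from the field list above;
# all values are placeholders.
apiVersion: networking.gke.io/v1
kind: NodeSystemConfigUpdate
metadata:
  name: private-registry-dns
spec:
  nodeSelector:
    baremetal.cluster.gke.io/node-pool: node-pool-1
  containerRuntimeDNSConfig:
  - ip: 192.0.2.10                # target IPv4 address of the registry
    domain: registry.example.com  # corresponding domain
    interface: gdcenet0.200       # egress interface through which the IP is reachable
```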
- `spec.kubeletConfig.cpuManagerPolicy`: specifies the Kubernetes CPU Manager policy. Valid values are `None` and `Static`.
- `spec.kubeletConfig.topologyManagerPolicy`: specifies the Kubernetes Topology Manager policy. Valid values are `None`, `BestEffort`, `Restricted`, and `SingleNumaMode`.
- `spec.osConfig.hugePagesConfig`: specifies the huge page configuration per NUMA node. Valid values are `2MB` and `1GB`. The number of huge pages requested is evenly distributed across both NUMA nodes in the system. For example, if you allocate 16 huge pages at 1 GB each, then each node receives a pre-allocation of 8 GB.
- `spec.osConfig.isolatedCpusPerSocket`: specifies the number of isolated CPUs per socket. Required if `cpuManagerPolicy` is set to `Static`. The number of isolated CPUs must be fewer than 80% of the total CPUs on the node.
- `spec.osConfig.cpuIsolationPolicy`: specifies the CPU isolation policy. The `Default` policy only isolates `systemd` tasks from CPUs reserved for workloads. The `Kernel` policy marks the CPUs as `isolcpus` and sets the `rcu_nocbs`, `nohz_full`, and `rcu_nocb_poll` flags on each CPU.
- `spec.sysctls.nodeLevel`: specifies the `sysctl` parameters that you can configure globally on a node by using the Network Function operator. The configurable parameters are as follows:
  - `fs.inotify.max_user_instances`
  - `fs.inotify.max_user_watches`
  - `kernel.sched_rt_runtime_us`
  - `kernel.core_pattern`
  - `net.ipv4.tcp_wmem`
  - `net.ipv4.tcp_rmem`
  - `net.ipv4.tcp_slow_start_after_idle`
  - `net.ipv4.udp_rmem_min`
  - `net.ipv4.udp_wmem_min`
  - `net.ipv4.udp_mem`
  - `net.ipv4.tcp_mem`
  - `net.core.rmem_max`
  - `net.core.wmem_max`
  - `net.core.rmem_default`
  - `net.core.wmem_default`
  - `net.netfilter.nf_conntrack_tcp_timeout_unacknowledged`
  - `net.netfilter.nf_conntrack_tcp_timeout_max_retrans`
  - `net.sctp.auth_enable`
  - `net.sctp.sctp_mem`
  - `vm.max_map_count`
You can also scope both safe and unsafe `sysctl` parameters to a specific Pod or namespace by using the `tuning` Container Networking Interface (CNI) plug-in.
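For example, the following hypothetical `NetworkAttachmentDefinition` chains the `tuning` plug-in after a `macvlan` interface so that the listed `sysctl` value applies only to Pods attached to this network. The interface name, subnet, and chosen `sysctl` are illustrative assumptions, not values from this page:

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: tuned-net
spec:
  config: '{
    "cniVersion": "0.3.1",
    "name": "tuned-net",
    "plugins": [
      {
        "type": "macvlan",
        "master": "gdcenet0.400",
        "ipam": { "type": "host-local", "subnet": "192.0.2.0/27" }
      },
      {
        "type": "tuning",
        "sysctl": { "net.ipv4.tcp_slow_start_after_idle": "0" }
      }
    ]
  }'
```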
The `NodeSystemConfigUpdate` resource provides the following read-only general status fields:
- `status.lastReportTime`: the most recent time that status was reported for the target interface.
- `status.conditions.lastTransitionTime`: the most recent time that the condition of the interface changed.
- `status.conditions.observedGeneration`: denotes the `.metadata.generation` value on which the initial condition was based.
- `status.conditions.message`: an informative message describing the change in the interface's condition.
- `status.conditions.reason`: a programmatic identifier denoting the reason for the last change in the interface's condition.
- `status.conditions.status`: the status descriptor of the condition. Valid values are `True`, `False`, and `Unknown`.
- `status.conditions.type`: the condition type, in CamelCase.
The following example illustrates the structure of the resource:
```yaml
apiVersion: networking.gke.io/v1
kind: NodeSystemConfigUpdate
metadata:
  name: node-pool-1-config
  namespace: default
spec:
  nodeSelector:
    baremetal.cluster.gke.io/node-pool: node-pool-1
    networking.gke.io/worker-network-sriov.capable: "true"
  sysctls:
    nodeLevel:
      "net.ipv4.udp_mem": "12348035 16464042 24696060"
  kubeletConfig:
    topologyManagerPolicy: BestEffort
    cpuManagerPolicy: Static
  osConfig:
    hugePagesConfig:
      "TWO_MB": 0
      "ONE_GB": 16
    isolatedCpusPerSocket:
      "0": 10
      "1": 10
```
### `SriovNetworkNodePolicy` resource
The `SriovNetworkNodePolicy` resource lets you allocate a group of SR-IOV virtual functions (VFs) on a Distributed Cloud physical machine and instantiate that group as a Kubernetes resource. You can then use this resource in a `NetworkAttachmentDefinition` resource.
You can select each target VF by its PCIe vendor and device ID, its PCIe device addresses, or by its Linux enumerated device name. The SR-IOV Network Operator configures each physical network interface to provision the target VFs. This includes updating the network interface firmware, configuring the Linux kernel driver, and rebooting the Distributed Cloud machine, if necessary.
To discover the network interfaces available on your node, look up the `NetworkInterfaceState` resources for that node in the `nf-operator` namespace.
The following example illustrates the structure of the resource:
```yaml
apiVersion: sriovnetwork.k8s.cni.cncf.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnx6-p2-sriov-en2
  namespace: sriov-network-operator
spec:
  deviceType: netdevice
  isRdma: true
  mtu: 9000
  nicSelector:
    pfNames:
    - enp134s0f1np1
  nodeSelector:
    edgecontainer.googleapis.com/network-sriov.capable: "true"
  numVfs: 31
  priority: 99
  resourceName: mlnx6_p2_sriov_en2
```
The preceding example creates a maximum of 31 VFs on the network interface named `enp134s0f1np1` (the NIC's second port) with an MTU value of `9000` (the maximum allowed value). Use the node selector label `edgecontainer.googleapis.com/network-sriov.capable`, which is present on all Distributed Cloud nodes capable of SR-IOV.
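As noted earlier, a policy can also select the target interface by PCIe vendor and device ID or by PCIe device address instead of by name. The following `nicSelector` fragment is a sketch: the vendor and device IDs are taken from the `SriovNetworkNodeState` example later on this page, and the `rootDevices` field name follows the upstream SR-IOV Network Operator API, so treat both as assumptions to verify against your hardware:

```yaml
# Alternative nicSelector forms (sketch; values are illustrative):
nicSelector:
  vendor: "15b3"       # PCIe vendor ID
  deviceID: "1015"     # PCIe device ID
  rootDevices:
  - 0000:86:00.1       # PCIe device address of the physical function
```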
For information about using this resource, see `SriovNetworkNodeState`.
### `SriovNetworkNodeState` resource
The `SriovNetworkNodeState` read-only resource lets you query the provisioning state of the `SriovNetworkNodePolicy` resource on a Distributed Cloud node. It returns the complete configuration of the `SriovNetworkNodePolicy` resource on the node as well as a list of active VFs on the node. The `status.syncStatus` field indicates whether all `SriovNetworkNodePolicy` resources defined for the node have been properly applied.
The following example illustrates the structure of the resource:
```yaml
apiVersion: sriovnetwork.k8s.cni.cncf.io/v1
kind: SriovNetworkNodeState
metadata:
  name: MyNode1
  namespace: sriov-network-operator
spec:
  dpConfigVersion: "1969684"
  interfaces:
  - mtu: 9000
    name: enp134s0f1np1
    numVfs: 31
    pciAddress: 0000:86:00.1
    vfGroups:
    - deviceType: netdevice
      mtu: 9000
      policyName: mlnx6-p2-sriov-en2
      resourceName: mlnx6_p2_sriov_en2
      vfRange: 0-30
status:
  interfaces:
  - deviceID: "1015"
    driver: mlx5_core
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: ba:16:03:9e:9c:87
    mtu: 9000
    name: eno1np0
    pciAddress: 0000:1a:00.0
    vendor: "15b3"
  - deviceID: "1015"
    driver: mlx5_core
    linkSpeed: 25000 Mb/s
    linkType: ETH
    mac: ba:16:03:9e:9c:87
    mtu: 9000
    name: eno2np1
    pciAddress: 0000:1a:00.1
    vendor: "15b3"
  - Vfs:  # parent interface fields omitted in this excerpt
    - deviceID: 101e
      driver: mlx5_core
      mac: c2:80:29:b5:63:55
      mtu: 9000
      name: enp134s0f1v0
      pciAddress: 0000:86:04.1
      vendor: 15b3
      vfID: 0
    - deviceID: 101e
      driver: mlx5_core
      mac: 7e:36:0c:82:d4:20
      mtu: 9000
      name: enp134s0f1v1
      pciAddress: 0000:86:04.2
      vendor: 15b3
      vfID: 1
    # ... 29 other VFs omitted ...
  syncStatus: Succeeded
```
For information about using this resource, see `SriovNetworkNodeState`.
### `NetworkAttachmentDefinition` resource
The `NetworkAttachmentDefinition` resource lets you attach Distributed Cloud Pods to one or more logical or physical networks on your Distributed Cloud node. It leverages the Multus-CNI framework and the SRIOV-CNI plugin.
Use an annotation to reference the name of the appropriate `SriovNetworkNodePolicy` resource. When you create this annotation, do the following:
- Use the key `k8s.v1.cni.cncf.io/resourceName`.
- Use the prefix `gke.io/` in its value, followed by the name of the target `SriovNetworkNodePolicy` resource.
The following example illustrates the structure of the resource:
```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  namespace: mynamespace
  annotations:
    k8s.v1.cni.cncf.io/resourceName: gke.io/mlnx6_p2_sriov_en2
spec:
  config: '{
    "type": "sriov",
    "cniVersion": "0.3.1",
    "name": "sriov-network",
    "ipam": {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "routes": [{ "dst": "0.0.0.0/0" }],
      "gateway": "10.56.217.1"
    }
  }'
```
### Upgrade `NetworkAttachmentDefinition` resources to Distributed Cloud 1.4.0
Distributed Cloud version 1.4.0 replaces the `bond0` interface with a new interface named `gdcenet0`. The `gdcenet0` interface lets you use the host management network interface card (NIC) in each Distributed Cloud machine in your rack for your workloads while keeping the Distributed Cloud management and control plane network traffic completely separated. To take advantage of this functionality, complete the steps in this section to reconfigure your `NetworkAttachmentDefinition` resources, and then follow the instructions in Configure Distributed Cloud networking to provision the appropriate networks and subnetworks.
For each Distributed Cloud cluster on which you have deployed one or more `NetworkAttachmentDefinition` resources, the following migration rules apply:
- For each new `NetworkAttachmentDefinition` resource, use `gdcenet0` instead of `bond0` as the value of the `master` field. If you apply a resource that uses `bond0` or an empty value for this field, Distributed Cloud replaces the value with `gdcenet0`, and then stores and applies the resource to the cluster.
- For each existing `NetworkAttachmentDefinition` resource, replace `bond0` with `gdcenet0` as the value of the `master` field, and then re-apply the resource to the cluster to restore full network connectivity to the affected Pods.
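For an existing resource, the migration amounts to changing one value in the CNI config. The following excerpt is an illustrative sketch; the plug-in type and IPAM settings shown are placeholder assumptions, and only the `master` value is the point of the change:

```yaml
# Excerpt of a migrated NetworkAttachmentDefinition config.
# Before Distributed Cloud 1.4.0, the "master" value was "bond0".
spec:
  config: '{
    "type": "macvlan",
    "master": "gdcenet0",
    "ipam": { "type": "host-local", "subnet": "192.0.2.0/27" }
  }'
```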
For information about using this resource, see `NetworkAttachmentDefinition`.
## Configure a secondary interface on a Pod using SR-IOV VFs
After you configure a `SriovNetworkNodePolicy` resource and a corresponding `NetworkAttachmentDefinition` resource, you can configure a secondary network interface on a Distributed Cloud Pod by using SR-IOV virtual functions.
To do so, add an annotation to your Distributed Cloud Pod definition as follows:
- Key: `k8s.v1.cni.cncf.io/networks`
- Value: `nameSpace/NetworkAttachmentDefinition1,nameSpace/NetworkAttachmentDefinition2...`
The following example illustrates this annotation:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriovpod
  annotations:
    k8s.v1.cni.cncf.io/networks: mynamespace/sriov-net1
spec:
  containers:
  - name: sleeppodsriov
    command: ["sh", "-c", "trap : TERM INT; sleep infinity & wait"]
    image: alpine
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
```
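To attach a Pod to more than one network, list several `NetworkAttachmentDefinition` references in the same annotation, separated by commas. In the following sketch, `sriov-net2` is a hypothetical second `NetworkAttachmentDefinition` in the same namespace, not one defined on this page:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriovpod-multi
  annotations:
    # sriov-net2 is a hypothetical second NetworkAttachmentDefinition.
    k8s.v1.cni.cncf.io/networks: mynamespace/sriov-net1,mynamespace/sriov-net2
spec:
  containers:
  - name: sleeppodsriov
    command: ["sh", "-c", "trap : TERM INT; sleep infinity & wait"]
    image: alpine
```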
## Configure a secondary interface on a Pod using the MacVLAN driver
Distributed Cloud also supports creating a secondary network interface on a Pod by using the MacVLAN driver. Only the `gdcenet0` interface supports this configuration, and only on Pods that run containerized workloads.
To configure an interface to use the MacVLAN driver:
1. Configure a `NetworkAttachmentDefinition` resource as shown in the following example:

   ```yaml
   apiVersion: "k8s.cni.cncf.io/v1"
   kind: NetworkAttachmentDefinition
   metadata:
     name: macvlan-b400-1
     annotations:
       networking.gke.io/gdce-vlan-id: 400
   spec:
     config: '{
       "type": "macvlan",
       "master": "gdcenet0.400",
       "ipam": {
         "type": "static",
         "addresses": [
           {
             "address": "192.168.100.20/27",
             "gateway": "192.168.100.1"
           }
         ]
         ...
       }
     }'
   ```
2. Add an annotation to your Distributed Cloud Pod definition as follows:

   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: macvlan-testpod1
     annotations:
       k8s.v1.cni.cncf.io/networks: macvlan-b400-1
   ```
## Configure a secondary interface on a Pod using Distributed Cloud multi-networking
Distributed Cloud supports creating a secondary network interface on a Pod by using its multi-network feature. To do so, complete the following steps:
1. Configure a `Network` resource. For example:

   ```yaml
   apiVersion: networking.gke.io/v1
   kind: Network
   metadata:
     name: vlan200-network
   spec:
     type: L2
     nodeInterfaceMatcher:
       interfaceName: vlan200-interface
     gateway4: 10.53.0.1
   ```
2. Add an annotation to your Distributed Cloud Pod definition as follows:

   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: myPod
     annotations:
       networking.gke.io/interfaces: '[{"interfaceName":"eth1","network":"vlan200-network"}]'
       networking.gke.io/default-interface: eth1
   ...
   ```
## What's next