Introduction
To improve network performance for virtual machines, we can use SR-IOV and tune the network settings. SR-IOV (Single Root I/O Virtualization) is a feature that lets a single PCI Express (PCIe) device be attached directly to virtual machines. This improves throughput and reduces latency by bypassing the hypervisor.
Modern network adapters support virtual functions (VFs) that can be used by virtual machines. With SR-IOV, the adapter can create multiple virtual network adapters that are assigned directly to virtual machines.
This guide shows how to enable SR-IOV and VFs to improve network performance in Proxmox with NVIDIA ConnectX network adapters. We will use Mellanox ConnectX-6 Lx 25GbE NICs and Linux bonding (link aggregation) to get the best reliability and speed.
Combining bonded interfaces with SR-IOV can be tricky. However, Mellanox supports handling the bond at the hardware level: all virtual functions connect to the embedded switch on the adapter, which sits behind the bonded ports.
To the virtual machines, it looks like they are using a single network adapter.
Requirements
I assume you have a Proxmox server already installed and running.
Check the network adapter. In this example, we use a Mellanox Technologies MT2894 (ConnectX-6 Lx) 25GbE NIC: one network adapter with two ports.
lspci -nn | grep Ethernet
81:00.0 Ethernet controller [0200]: Mellanox Technologies MT2894 Family [ConnectX-6 Lx] [15b3:101f]
81:00.1 Ethernet controller [0200]: Mellanox Technologies MT2894 Family [ConnectX-6 Lx] [15b3:101f]
Find the network interface name.
dmesg | grep 81:00.0 | grep renamed
mlx5_core 0000:81:00.0 enp129s0f0np0: renamed from eth2
We need to find the switchid of the network adapter enp129s0f0np0; in this example the result is 3264160003bd70c4. It is used (as phys_switch_id) to identify the network adapter in the udev rules that configure the offloaded switch later in this guide.
ip -d link show enp129s0f0np0
5: enp129s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether c4:70:bd:16:64:32 brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0 minmtu 68 maxmtu 9978
    addrgenmode eui64 numtxqueues 768 numrxqueues 63 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536
    portname p0 switchid 3264160003bd70c4 parentbus pci parentdev 0000:81:00.0
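If you only need the switchid value itself, you can extract it directly (a small convenience one-liner, not part of the original output above):
ip -d link show enp129s0f0np0 | grep -o 'switchid [0-9a-f]*'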
Enable SR-IOV
Enable SR-IOV in the BIOS.
Enter the BIOS and enable SR-IOV. The options may be in different locations depending on the motherboard manufacturer, and some of the options below apply only to AMD processors.
- CPU Virtualization (SVM Mode) -> Enabled
- Chipset -> NBIO Common Options -> IOMMU -> Enabled
- Chipset -> NBIO Common Options -> ACS Enable -> Enabled
- Chipset -> NBIO Common Options -> PCIe ARI Support -> Enabled
Check the SR-IOV status.
dmesg | grep -i -e DMAR -e IOMMU
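If the output shows no IOMMU/DMAR messages at all, you may also need to enable the IOMMU on the kernel command line. A minimal sketch for GRUB on an AMD system (on recent kernels amd_iommu is usually enabled by default once the BIOS options above are set, so adjust for your platform and boot loader):
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
Then apply the change and reboot:
update-grub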
Check the IOMMU groups.
for g in $(find /sys/kernel/iommu_groups/* -maxdepth 0 -type d | sort -V); do
  echo "IOMMU Group ${g##*/}:"
  for d in $g/devices/*; do
    echo -e "\t$(lspci -nns ${d##*/})"
  done
done
Or with the Proxmox command:
pvesh get /nodes/`hostname -s`/hardware/pci --pci-class-blacklist ""
The result should list many IOMMU groups.
Configure Open vSwitch
Open vSwitch can offload bonding to the hardware, allowing the network adapter to handle the aggregated traffic. It can also create a virtual switch on the network adapter, connecting the virtual functions to the network.
Install Open vSwitch.
apt install openvswitch-switch ifupdown2 patch
Configure the network adapter in switchdev mode and set the number of virtual functions to 4.
vi /etc/udev/rules.d/70-persistent-net-vf.rules
# Ingress bond interface
KERNELS=="0000:81:00.0", DRIVERS=="mlx5_core", SUBSYSTEMS=="pci", ACTION=="add", ATTR{sriov_totalvfs}=="?*", RUN+="/usr/sbin/devlink dev eswitch set pci/0000:81:00.0 mode switchdev", ATTR{sriov_numvfs}="0"
KERNELS=="0000:81:00.1", DRIVERS=="mlx5_core", SUBSYSTEMS=="pci", ACTION=="add", ATTR{sriov_totalvfs}=="?*", RUN+="/usr/sbin/devlink dev eswitch set pci/0000:81:00.1 mode switchdev", ATTR{sriov_numvfs}="0"

# Set the number of virtual functions to 4
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="52ac120003bd70c4", ATTR{phys_port_name}=="p0", ATTR{device/sriov_totalvfs}=="?*", ATTR{device/sriov_numvfs}=="0", ATTR{device/sriov_numvfs}="4"

# Rename the virtual network adapters to ovs-sw1pf0vf0, ovs-sw1pf0vf1, ovs-sw1pf0vf2, ovs-sw1pf0vf3
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="52ac120003bd70c4", ATTR{phys_port_name}!="p[0-9]*", ATTR{phys_port_name}!="", NAME="ovs-sw1$attr{phys_port_name}"
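After a reboot (or after re-triggering udev), you can verify that the embedded switch really changed to switchdev mode with devlink; this check is an addition to the original steps:
devlink dev eswitch show pci/0000:81:00.0
# should report: mode switchdev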
We need to patch Open vSwitch to enable hardware offload. Add the following lines to the ovs-ctl script. After the change, disable automatic updates of the openvswitch-switch package, otherwise the change will be reverted (see the example after the patch).
patch -d/ -p0 --ignore-whitespace <<'EOF'
--- /usr/share/openvswitch/scripts/ovs-ctl.diff	2024-10-16 01:25:28.369482552 +0000
+++ /usr/share/openvswitch/scripts/ovs-ctl	2024-10-16 01:27:32.740490528 +0000
@@ -162,6 +162,8 @@
     # Initialize database settings.
     ovs_vsctl -- init -- set Open_vSwitch . db-version="$schemaver" \
         || return 1
+    ovs_vsctl -- set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=skip_sw ||:
+    ovs_vsctl -- set Open_vSwitch . other_config:lacp-fallback-ab=true ||:
     set_system_ids || return 1
     if test X"$DELETE_BRIDGES" = Xyes; then
         for bridge in `ovs_vsctl list-br`; do
EOF
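One way to keep the patched ovs-ctl from being overwritten by a package upgrade is to put the package on hold (an example approach, adjust to your update workflow):
apt-mark hold openvswitch-switch
apt-mark showhold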
After the changes, restart the server.
When the server is up, check the Open vSwitch configuration and offload status.
ovs-vsctl show
ovs-vsctl get Open_vSwitch . other_config
Make sure hw-offload=true is set.
{hw-offload="true", lacp-fallback-ab="true", tc-policy=skip_sw}
And that the network adapter is in switchdev mode.
ip -d link show enp129s0f0np0
The output should look like this.
4: enp129s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether c4:70:bd:16:64:32 brd ff:ff:ff:ff:ff:ff promiscuity 1 allmulti 0 minmtu 68 maxmtu 9978
    openvswitch_slave addrgenmode none numtxqueues 768 numrxqueues 63 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536
    portname p0 switchid 52ac120003bd70c4 parentbus pci parentdev 0000:81:00.0
    vf 0     link/ether c4:70:ff:ff:ff:e0 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 1     link/ether c4:70:ff:ff:ff:e1 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 2     link/ether c4:70:ff:ff:ff:e2 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 3     link/ether c4:70:ff:ff:ff:e3 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
vf 0, vf 1, vf 2, vf 3 are the virtual functions.
Configure the bond interface
Add the network bond interface, where enp129s0f0np0 and enp129s0f1np1 are the physical network adapter ports.
- vi /etc/network/interfaces
auto enp129s0f0np0
iface enp129s0f0np0 inet manual

auto enp129s0f1np1
iface enp129s0f1np1 inet manual

auto vmbr1
iface vmbr1 inet static
    ovs_type OVSBridge
    ovs_ports bond1
    ovs_mtu 9000
    address 192.168.1.2/24

auto bond1
iface bond1 inet manual
    ovs_type OVSBond
    ovs_bonds enp129s0f0np0 enp129s0f1np1
    ovs_bridge vmbr1
    ovs_mtu 9000
    ovs_options lacp=active bond_mode=balance-tcp
Reboot the server.
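Once the server is back up, it is worth checking that the bond came up and negotiated LACP with the physical switch, for example:
ovs-appctl bond/show bond1
ovs-appctl lacp/show bond1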
Add the virtual functions to Open vSwitch
- vi /etc/network/interfaces.d/ovs-sw1.conf
# Add VFs to the offloaded switch
auto ovs-sw1pf0vf0
iface ovs-sw1pf0vf0 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1
    ovs_mtu 9000

auto ovs-sw1pf0vf1
iface ovs-sw1pf0vf1 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1
    ovs_mtu 9000

auto ovs-sw1pf0vf2
iface ovs-sw1pf0vf2 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1
    ovs_mtu 9000

auto ovs-sw1pf0vf3
iface ovs-sw1pf0vf3 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1
    ovs_mtu 9000
After the changes, restart the server to make sure they are applied.
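Alternatively, since ifupdown2 is installed, the new interface definitions can usually be applied without a full reboot (a reboot remains the safest way to confirm the udev rules and OVS configuration together):
ifreload -a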
Check the virtual functions:
ip -d link show | grep ovs-sw
The output should look like this.
14: ovs-sw1pf0vf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
15: ovs-sw1pf0vf1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
16: ovs-sw1pf0vf2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
17: ovs-sw1pf0vf3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
Now we have the bonding interface bond1 and the virtual functions ovs-sw1pf0vf0, ovs-sw1pf0vf1, ovs-sw1pf0vf2, and ovs-sw1pf0vf3 connected to the virtual switch vmbr1, all offloaded to the network hardware. We can use these interfaces in the virtual machine configuration, attached as PCI devices 0000:81:00.2 - 0000:81:00.5.
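You can confirm the VF PCI addresses on the host before mapping them, for example:
lspci -nn | grep -i 'virtual function'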
Configure Proxmox
Let's create a resource mapping for the virtual functions.
Go to the Proxmox web interface, Datacenter -> Resource Mappings -> PCI Devices -> Add.
- Name: network
- Check all virtual functions from 0000:81:00.2 to 0000:81:00.5
The official documentation can be found here: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#resource_mapping
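The same mapping can also be created from the command line through the API. This is only a rough sketch: it assumes Proxmox VE 8+, a node named pve1, and that 15b3:101e is the VF vendor/device ID shown by lspci -nn on your host; check pvesh usage /cluster/mapping/pci for the exact parameters on your version.
pvesh create /cluster/mapping/pci --id network \
  --map node=pve1,path=0000:81:00.2,id=15b3:101e \
  --map node=pve1,path=0000:81:00.3,id=15b3:101e \
  --map node=pve1,path=0000:81:00.4,id=15b3:101e \
  --map node=pve1,path=0000:81:00.5,id=15b3:101e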
Configure the virtual machine
I assume you already have a virtual machine created.
Add the PCI devices to the virtual machine.
Go to the Proxmox web interface, select your virtual machine -> Hardware -> Add -> PCI Device.
- Mapped Device: network
- Leave all other options at their defaults
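If you prefer the command line, the same can be done with qm (a sketch, assuming VM ID 100 and the mapping name network created above):
qm set 100 --hostpci0 mapping=network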
After the changes, start the virtual machine.
Check the Virtual Machine
We've launched the virtual machine with 16 vCPUs; eth1 is the virtual function ovs-sw1pf0vf0, and the guest runs Linux kernel 6.1.82.
# lscpu
...
Virtualization features:
  Virtualization:        AMD-V
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1 MiB (16 instances)
  L1i:                   1 MiB (16 instances)
  L2:                    8 MiB (16 instances)
  L3:                    256 MiB (16 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
# ip -d link show eth1
9: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether c4:70:ff:ff:ff:e0 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9978
    addrgenmode eui64 numtxqueues 96 numrxqueues 11 gso_max_size 65536 gso_max_segs 65535
Driver version:
# ethtool -i eth1
driver: mlx5_core
version: 6.1.82-talos
firmware-version: 26.36.1010 (MT_0000000547)
expansion-rom-version:
bus-info: 0000:00:10.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
Network settings:
# ethtool eth1
Settings for eth1:
    Supported ports: [ Backplane ]
    Supported link modes:   1000baseT/Full    10000baseT/Full
                            1000baseKX/Full   10000baseKR/Full
                            10000baseR_FEC
                            25000baseCR/Full  25000baseKR/Full  25000baseSR/Full
                            1000baseX/Full    10000baseCR/Full
                            10000baseSR/Full  10000baseLR/Full  10000baseER/Full
    Supported pause frame use: Symmetric
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes:  1000baseT/Full    10000baseT/Full
                            1000baseKX/Full   10000baseKR/Full
                            10000baseR_FEC
                            25000baseCR/Full  25000baseKR/Full  25000baseSR/Full
                            1000baseX/Full    10000baseCR/Full
                            10000baseSR/Full  10000baseLR/Full  10000baseER/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Speed: 25000Mb/s
    Duplex: Full
    Auto-negotiation: on
    Port: Direct Attach Copper
    PHYAD: 0
    Transceiver: internal
    Supports Wake-on: d
    Wake-on: d
    Current message level: 0x00000004 (4)
                           link
    Link detected: yes
Network features:
# ethtool -k eth1 | grep " on"
rx-checksumming: on
tx-checksumming: on
tx-checksum-ip-generic: on
scatter-gather: on
tx-scatter-gather: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-gso-partial: on
tx-udp-segmentation: on
tx-vlan-stag-hw-insert: on
rx-vlan-stag-filter: on [fixed]
Troubleshooting
ovs-vsctl show
ovs-vsctl get Open_vSwitch . other_config
ovs-appctl bond/show bond1
ovs-appctl lacp/show bond1
ovs-dpctl dump-flows -m
ovs-appctl dpctl/dump-flows --names type=offloaded
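To double-check from the kernel side that flows are really installed in hardware, tc can also list the filters on the uplink interface (an extra check beyond the list above; offloaded rules are marked in_hw):
tc -s filter show dev enp129s0f0np0 ingress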
Top comments (3)
Thank you for this excellent guide. Unfortunately, in this configuration each virtual function is only bridged to one physical port. So the VM loses its network when the physical port to which it is connected stops working. The bond only works for Proxmox access, but not for the virtual functions.
As far as I understand the NVIDIA documentation, to configure VF-LAG, which would offer a bond also for VMs, one has to use a Linux bond instead of an OVS bond. I have tried it, however it seems that Proxmox/Debian is not able to set up VF-LAG correctly. This has been fixed in Ubuntu some time ago, but obviously not in Proxmox/Debian. The log entry is "failed to activate vf lag".
I don’t have full access to the server, so I cannot physically detach the cable from the network device.
However, based on the documentation I’ve reviewed, I assume that the bond interface is created on the network card itself, which is then connected to the virtual switch inside the card. All virtual functions (VFs) connect through this virtual switch, so the setup should continue to work properly even if one of the bonded ports is disconnected.
I have only one cable connected to the switch. All the VMs that use virtual functions from the port that is not connected have no network connectivity when using OVS to set up the bond. Zero incoming packets. Only Proxmox is accessible on both ports. Maybe you have the possibility to disable one port on your switch to test it.
In the NVIDIA documentation, which has been last updated 4 days ago, in the chapter "SR-IOV VF LAG" they first create a normal Linux bond. This seems to instruct the driver to create VF LAG on the hardware. After that they connect that Linux bond to the OVS bridge:
docs.nvidia.com/doca/archive/2-8-0...
As I have mentioned, when I try to use the Linux bond, it fails on Proxmox. This has been fixed for Ubuntu but not for the Debian used by Proxmox:
bugs.launchpad.net/ubuntu/+source/...