
Cross-posted from Super User, as I was unable to resolve this issue there and haven't gotten any replies yet. Maybe this community is better suited to help? Thank you!


I have a Fedora Server VM running on macOS Sonoma under UTM. UTM is configured to use a shared network, which creates a VLAN that the VM joins and through which the host OS can discover it.

Intermittently, all network connections from the macOS host to the VM drop and remain broken: SSH connections time out and become unresponsive, and pings to the VM fail. The only workaround is to reboot the VM, which returns everything to a working state.

Could someone point me in the right direction here? Some facts below.

The VM hostname is work and its IPv4 address is 192.168.64.2. I don't think it's a problem with the DHCP lease, because the VM retains its IP address even after the connections fail.
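For what it's worth, this is roughly how the address and route can be verified on the guest after the connections drop (enp0s1 is the VM's interface, as shown further down):

# On the Fedora guest, after the connections from the host have dropped
$ ip -4 addr show enp0s1     # still shows 192.168.64.2/24
$ ip route show default      # default route should still point at 192.168.64.1 (the host side of the bridge)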

On the macOS host, ping and traceroute to the VM fail:

$ traceroute 192.168.64.2
traceroute to 192.168.64.2 (192.168.64.2), 64 hops max, 40 byte packets
 1  * * *
 2  *^C

$ ping 192.168.64.2
PING 192.168.64.2 (192.168.64.2): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3

The routing table looks OK though, AFAICT:

$ netstat -rn -f inet
Routing tables

Internet:
Destination          Gateway            Flags      Netif      Expire
default              192.168.178.1      UGScg      en0
default              link#24            UCSIg      bridge100  !
127                  127.0.0.1          UCS        lo0
127.0.0.1            127.0.0.1          UH         lo0
169.254              link#15            UCS        en0        !
192.168.64           link#24            UC         bridge100  !
192.168.64.1         62.3e.5f.73.14.64  UHLWI      lo0
192.168.64.2         8a.c3.94.a.86.e2   UHLWIi     bridge100  1199
192.168.178          link#15            UCS        en0        !
192.168.178.1/32     link#15            UCS        en0        !
192.168.178.1        dc:15:c8:ef:b8:1d  UHLWIir    en0        1198
192.168.178.52/32    link#15            UCS        en0        !
224.0.0/4            link#15            UmCS       en0        !
224.0.0.251          1:0:5e:0:0:fb      UHmLWI     en0
224.0.0.251          1:0:5e:0:0:fb      UHmLWIg    bridge100
255.255.255.255/32   link#15            UCS        en0        !

And when I drill into this specific route:

$ route get 192.168.64.2
   route to: work
destination: work
  interface: bridge100
      flags: <UP,HOST,DONE,LLINFO,WASCLONED,IFSCOPE,IFREF>
 recvpipe  sendpipe  ssthresh  rtt,msec    rttvar  hopcount      mtu     expire
       0         0         0         0         0         0      1500       1200

The bridge network interface on the host (created by UTM as part of the shared network, I believe):

$ ifconfig bridge100
bridge100: flags=8a63<UP,BROADCAST,SMART,RUNNING,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
        options=3<RXCSUM,TXCSUM>
        ether 62:3e:5f:73:14:64
        inet 192.168.64.1 netmask 0xffffff00 broadcast 192.168.64.255
        inet6 fe80::603e:5fff:fe73:1464%bridge100 prefixlen 64 scopeid 0x18
        inet6 fd5d:9cb:9b9e:3946:141b:1dc6:2328:9f96 prefixlen 64 autoconf secured
        Configuration:
                id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0
                maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200
                root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0
                ipfilter disabled flags 0x0
        member: vmenet0 flags=3<LEARNING,DISCOVER>
                ifmaxaddr 0 port 23 priority 0 path cost 0
        Address cache:
                8a:c3:94:a:86:e2 Vlan1 vmenet0 1199 flags=0<>
        nd6 options=201<PERFORMNUD,DAD>
        media: autoselect
        status: active

ARP lookups seem to work:

$ arp work
work (192.168.64.2) at 8a:c3:94:a:86:e2 on bridge100 ifscope [bridge]
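For completeness, when the drop happens the host-side entry can also be cleared to force a fresh lookup; a rough sketch:

# On the macOS host, while the connection is broken
$ arp -an | grep 192.168.64.2    # is the cached entry still complete?
$ sudo arp -d 192.168.64.2       # drop it and force a fresh ARP request
$ ping -c 3 192.168.64.2         # does connectivity come back?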

On the guest, I couldn't see anything out of the ordinary.

[screenshot: ifconfig output on the guest]

So bridge100 on the host corresponds to enp0s1 on the guest, and that interface is UP.

I also started looking through the NetworkManager entries in journalctl, but since I don't really know what I'm looking for, I wasn't sure what to focus on.
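For reference, roughly what I was looking through on the guest (the time window is arbitrary):

# On the Fedora guest, around the timestamp of a disconnect
$ journalctl -u NetworkManager --since "1 hour ago" --no-pager
$ journalctl -u systemd-resolved --since "1 hour ago" --no-pager
$ ip neigh show dev enp0s1    # guest-side neighbour (ARP) cache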

I'd appreciate any help.


UPDATE 1:

As suggested in the comments, I ran a TCP capture in Wireshark. Around the time of the connection loss, I see 3 RST packets on 3 different TCP streams:

[screenshot: Wireshark TCP capture showing the three RST packets]

Given that pings are broken thereafter, this is probably a symptom, not a cause?
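As an aside, the same kind of capture can also be taken from a terminal with tcpdump and opened in Wireshark later; a minimal sketch (the filter is just an example):

# Capture ARP, ICMP and SSH traffic on the UTM bridge and save it for later analysis
$ sudo tcpdump -i bridge100 -n -w bridge100.pcap 'arp or icmp or tcp port 22'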

UPDATE 2:

I have now captured all traffic on bridge100, and after the disconnect I see a pattern in the ARP requests. Before the disruption, 192.168.64.1 (the macOS host) keeps asking who has 192.168.64.2 (the VM) and receives a reply each time:

[screenshot: ARP requests from the host answered by the VM]

After the disruption, however, the host starts broadcasting the same question instead of addressing it directly to the VM's network interface (8a:c3:94:a:86:e2 is the VM's MAC address). In addition, the VM now starts asking who has 192.168.64.1, i.e. the host, which it did not do before:

[screenshot: unanswered broadcast ARP requests after the disruption]

This seems to imply:

  1. The host is not getting an answer to its ARP lookups; however, running arp work at a command prompt still shows the MAC address, as mentioned above. Not sure if this is just another symptom rather than a cause?

  2. The VM does not have a MAC address associated with 192.168.64.1 (the host) in its ARP cache: on the VM, arp 192.168.64.1 returns (incomplete) for HWaddress. This would at least explain why it is now impossible for the VM to reply to any network packets received from the host (see the commands below).
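To check (and crudely repair) the guest-side neighbour entry, something like this should work on the VM; 62:3e:5f:73:14:64 is bridge100's MAC address from the ifconfig output above:

# On the Fedora guest: inspect the neighbour entry for the host
$ ip neigh show 192.168.64.1 dev enp0s1

# Crude workarounds: flush the cache and let it re-resolve ...
$ sudo ip neigh flush dev enp0s1

# ... or pin the entry to the host's bridge100 MAC address
$ sudo ip neigh replace 192.168.64.1 lladdr 62:3e:5f:73:14:64 dev enp0s1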

UPDATE 3:

Around the time it happened again, I was seeing spurious retransmission errors in TCP streams:

[screenshot: Wireshark capture with spurious retransmission errors]

On the VM itself, I looked for a similar timestamp in journalctl and noticed that systemd-resolved (Fedora's primary DNS resolver) started to fail over into alternative DNS resolution schemes:

[screenshot: journalctl output showing systemd-resolved DNS failures]
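For reference, the resolver state on the guest can be inspected with roughly the following; a failing test query here would match the journal entries above:

# On the Fedora guest: current resolver configuration and a test query
$ resolvectl status enp0s1
$ resolvectl query fedoraproject.org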

  • The first thing is to perform a packet capture to determine which end is sending a RST. Commented Oct 18, 2024 at 10:13
  • That's a great idea @GregAskew -- running Wireshark now on bridge100. Would you just capture all the traffic for now and filter later? Commented Oct 18, 2024 at 11:05
  • @GregAskew It just happened again while I was recording TCP exchanges, and no one sent a reset. So I suspect this means it is not an issue with TCP client/server hanging up on each other, but something else. I suspected routing at first, but as I mention in the question, routing table looks OK? Commented Oct 21, 2024 at 14:30
  • Come to think of it, if even pings fail this is unlikely to be a problem with the transport layer? Doesn't ping use ICMP (and ARP) to find the destination? Commented Oct 21, 2024 at 18:09
  • My bad -- there were 3 RST messages, I am not sure why they first didn't show when filtering. But the payload is encrypted (likely this is from SSH streams) so I am not sure what this is telling me. But as far as who hung up on whom, it was Host to VM. So it is the client-side (macOS) that is closing the stream. Isn't this just a symptom though, not a cause, since whenever this happens, I can not even ping the VM anymore, which shouldn't rely on TCP? Commented Oct 22, 2024 at 6:46

2 Answers


I never got to the bottom of these issues, but I worked around them by:

  1. Moving from UTM/QEMU to VMware Fusion, which is now free for all uses.
  2. Using NAT network mode instead of a shared VLAN.

Without NAT, I ran into similar issues with Fusion, with the added "convenience" that it lets you hot-swap network devices while the VM is running (i.e. disabling and re-enabling the network adapter in the VM settings), which also fixed the problem temporarily.

I've had zero issues since moving to NAT. Even suspend/resume of the laptop works fine now.


I found a kind of temporary workaround: a continuous ping to the VM. In a separate terminal window, run this:

ping -i 5 <YOUR_VM_IP_ADDRESS> 
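If you would rather not keep a terminal window open, the same keepalive can run in the background, e.g.:

# Run the keepalive ping in the background and discard its output
$ nohup ping -i 5 <YOUR_VM_IP_ADDRESS> > /dev/null 2>&1 &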

My setup is standard: macOS host, Ubuntu Server guest, and UTM using its "Shared Network" mode.

And crucially, for my use case, NAT was simply not an option. I needed direct, routable access within my local network segment.

The issue seems to stem from how the virtual network bridge within UTM (and potentially macOS itself) handles ARP (Address Resolution Protocol) cache entries for idle connections.

Every 5 seconds, a small ICMP (ping) packet is sent to the VM. To send this packet, the host must know the VM's MAC address; if the ARP entry is missing or stale, the host is forced to perform a new ARP request, and the VM dutifully replies. This constant, low-level chatter keeps the ARP entry fresh, effectively reminding the host that the VM is still reachable.
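An untested alternative along the same lines would be to pin a static ARP entry for the VM on the macOS host, so the entry cannot go stale in the first place (the MAC address below is the one from the question and is only an example; it changes if the VM's virtual NIC changes):

# On the macOS host: replace any existing ARP entry for the VM with a permanent one
$ sudo arp -S 192.168.64.2 8a:c3:94:a:86:e2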

A more elegant solution might be to configure ServerAliveInterval in your SSH config.
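For reference, that would look roughly like this in ~/.ssh/config on the host (the Host alias and the interval are just examples):

# ~/.ssh/config on the macOS host
Host work
    HostName 192.168.64.2
    ServerAliveInterval 15
    ServerAliveCountMax 3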
