just a hypothesis. accept() triggers a context switch. Context switches involves MMU commands, and can be slower in xen if it is being emulated in software (type2 hypervisor). This is much slower than a native OS doing a context switch and the MMU operations are executed at the CPU level.
Real hardware (and 32bit paravirtualized Xen) don't require any MMU changes during a syscall. This is because on real hardware, there is no VMM (Virtual Machine Monitor) to protect, and on 32bit paravirt, Xen can use x86 segments to protect the monitor.
For better or for worse (most would say better), segment limit checking is disabled in 64bit mode, so 64bit paravirtual Xen has to use the MMU to protect it's monitor. Basically, both the kernel and userspace actually run in ring3, but on different page tables. This means that expensive MMU updates (and TLB flushes) are required both on the way into and out of the kernel for every syscall.