On my machine (Ryzen 3900X, Ubuntu 22), a basic C++ TCP server app that only sends 64K packets, paired with a basic C++ receiver that pulls 100 GB of these packets and merely copies them into its internal buffer, single-threaded (a sketch of the receive loop appears below the results):
achieves ~30-33 Gbit/sec over a TCP connection (~4.0 GB/sec) (Gbit, not Mbit),
and ~55-58 Gbit/sec over a Unix domain socket connection (~7.3 GB/sec),
and ~492 Gbit/sec for an in-process memcpy (~61 GB/sec),
and, of course, it would be faster still with a "shared memory" approach where you do no copying at all and merely pass a reference plus some sort of synchronisation mechanism.
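For illustration, here is a rough sketch of that zero-copy handshake using POSIX shared memory and a pair of process-shared semaphores; the segment name, the struct layout and the single-buffer scheme are my own placeholders, not taken from the benchmark:

```cpp
#include <fcntl.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

// Layout of the shared segment. The producer creates it with shm_open(O_CREAT),
// ftruncate()s it to sizeof(Shared) and initialises both semaphores with
// sem_init(&sem, /*pshared=*/1, 0). Name, sizes and field names are placeholders.
struct Shared {
    sem_t  ready;              // posted by the producer when payload is valid
    sem_t  consumed;           // posted by the consumer when it is done reading
    size_t len;                // number of valid bytes in payload
    char   payload[64 * 1024]; // the data itself -- never copied
};

int main() {
    // Consumer side: attach to the segment the producer already created.
    int fd = shm_open("/bench_shm", O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    void* mem = mmap(nullptr, sizeof(Shared), PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }
    auto* shm = static_cast<Shared*>(mem);

    sem_wait(&shm->ready);       // wait until the producer publishes a buffer
    // Consume shm->payload[0 .. shm->len) in place: no memcpy at all,
    // only the "ready"/"consumed" handshake crosses between the processes.
    std::printf("consumed %zu bytes in place\n", shm->len);
    sem_post(&shm->consumed);    // hand the buffer back to the producer

    munmap(mem, sizeof(Shared));
    close(fd);
    return 0;
}
```

A real implementation would typically use a ring of such slots rather than a single buffer, so the producer does not stall waiting for each handshake.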
The exact numbers above vary depending on whether I move the mouse and windows around during the test, but there is clearly a substantial difference in overhead between the approaches, and it can be in the range of 2x depending on the details.
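For reference, the receiver side described above boils down to a loop of roughly this shape; this is a minimal sketch, with the port, address and buffer handling as my own placeholders rather than the original test code:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    constexpr size_t   kChunk = 64 * 1024;        // 64K per recv(), as in the test
    constexpr uint64_t kTotal = 100ull << 30;     // stop after ~100 GB

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(5000);                // placeholder port
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        perror("connect");
        return 1;
    }

    std::vector<char> chunk(kChunk);      // receive buffer
    std::vector<char> internal(kChunk);   // "internal buffer" the data is copied into
    uint64_t received = 0;
    while (received < kTotal) {
        ssize_t n = recv(fd, chunk.data(), chunk.size(), 0);
        if (n <= 0) break;                // sender closed or error
        std::memcpy(internal.data(), chunk.data(), static_cast<size_t>(n));
        received += static_cast<uint64_t>(n);
    }
    close(fd);
    std::printf("received %llu bytes\n", static_cast<unsigned long long>(received));
    return 0;
}
```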
So, for a high-throughput, resource-constrained embedded setup, like Mosquitto on a Raspberry Pi, there will be a benefit in using a Unix domain socket, and a benefit in not splitting the overall application into overly small modules.
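Switching the transport from TCP to a Unix domain socket is mostly a change of address family and address structure; a hedged sketch of the connect side, with the socket path as a placeholder:

```cpp
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// Connecting over a Unix domain socket instead of TCP: only the address family
// and address structure change; the recv()/memcpy loop stays identical.
int connect_unix(const char* path) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);      // AF_UNIX instead of AF_INET
    if (fd < 0) { perror("socket"); return -1; }

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        perror("connect");
        close(fd);
        return -1;
    }
    return fd;
}

int main() {
    int fd = connect_unix("/tmp/bench.sock");      // placeholder path
    if (fd >= 0) close(fd);
    return 0;
}
```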
Of course, if your application only ever serves database search replies or a web site, and your response time is in the multi-millisecond range, then the processing time will dominate the network latency. But for tiny computers that are stressed to the peak and, for example, use multiple microservices or multiple processing steps, this can make a difference in the overall application latency.
When developing for ultimate performance, it is worth knowing that Linux console output is relatively costly -- so if, in benchmark/production mode, you can do away with printing to the console, you may get a substantial performance boost.
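One simple way to do that is to compile the per-packet logging away in benchmark/production builds; a small sketch, where the VERBOSE_LOG macro name is just my own convention:

```cpp
#include <cstdint>
#include <cstdio>

// Per-chunk logging compiles to nothing unless the build defines VERBOSE_LOG,
// so the hot loop never touches the console in benchmark/production mode.
#ifdef VERBOSE_LOG
#define LOG(...) std::fprintf(stderr, __VA_ARGS__)
#else
#define LOG(...) ((void)0)
#endif

void on_chunk(uint64_t received) {
    LOG("received %llu bytes so far\n",
        static_cast<unsigned long long>(received));
}

int main() {
    on_chunk(64 * 1024);   // silent unless compiled with -DVERBOSE_LOG
    return 0;
}
```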