
When transferring large amounts (several GB) of data, the various file transfer protocols I use, such as FTP, SFTP, NFS and Samba, all suffer from the same issue: many small files drag speeds down to MB/s or even KB/s at times, even over a 10 Gbps link.

However, if I zip, tar or rar the entire folder before transferring, the network link gets fully saturated.

  • What is it that causes this effect?

  • What can be done to improve the performance of large transfers with many small individual files over a network?

  • Out of the available file transfer protocols, which is best suited for this?

I have full administrative control over the network, so all configuration options are available, such as setting MTU and buffer sizes on network interfaces, or turning off async and encryption in the file server configuration, to name a couple of throwaway ideas.
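For reference, the archive-first approach that does saturate the link looks roughly like this (paths and hostnames are placeholders; the streamed variant skips the intermediate archive on disk):

# Archive first, then one big sequential transfer.
$ tar -cf /tmp/bigdump.tar /data/smallfiles
$ scp /tmp/bigdump.tar user@remote:/dest/

# Streamed variant: no intermediate archive, still one sequential stream.
$ tar -cf - /data/smallfiles | ssh user@remote 'tar -xf - -C /dest'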

  • It also depends on processor speed. Commented Nov 6, 2020 at 11:49
  • I'm aware of that, but that ties into my question: if it is CPU speed, why does a single sequential file always saturate the link while multiple small files hamper it? I've also tweaked the network interfaces by putting TCP and UDP checksum offloads onto the NICs, maximising the buffer sizes and disabling interrupt moderation. It still doesn't fully load the CPU on either side. Commented Nov 6, 2020 at 12:04
  • Maybe you have to transfer them programmatically one by one so that memory is not fully used, i.e. loading the files not all at once but one after another for maximum efficiency; this could be done with Python multiprocessing or multithreading. Commented Nov 6, 2020 at 12:20

2 Answers


File system metadata. The overhead needed to make files possible is underappreciated by sysadmins, until they try to deal with many small files.

Say you have a million small 4 KB files, decently fast storage with 8 drive spindles, and a 10 Gb link that the array can sometimes saturate with sequential reads. Further assume 100 IOPS per spindle and one IO per file (this is oversimplified, but it illustrates the point).

$ units "1e6 / (8 * 100 per sec)" "sec" * 1250 / 0.0008 

21 minutes! Instead, assume the million files are in one archive file and a sequential transfer can saturate the 10 Gb link, at 80% useful throughput after the data is wrapped in IP and Ethernet.

$ units "(1e6 * 4 * 1024 * 8 bits) / (1e10 bits per second * .8)" "sec" * 4.096 / 0.24414062 

4 seconds is quite a bit faster.

If the underlying storage holds many small files, any file transfer protocol will have a problem with them. When the IOPS of the array are the bottleneck, the file serving protocol running on top of it doesn't really help.

Fastest would be copying one big archive or disk image: mostly sequential IO and the least file system metadata.
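As a rough sketch (device, host and path names are placeholders, and this assumes the source filesystem is unmounted or otherwise quiesced), a whole disk image can be streamed as a single sequential transfer:

# One long sequential read on the source, one sequential write on the target.
$ dd if=/dev/sdX bs=1M status=progress | ssh user@remote 'cat > /dest/disk.img'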

Maybe with file serving protocols you don't have to copy everything: mount the remote share and access just the files you need. However, accessing directories with a very large number of files, or copying them all, is still slow. (And beware: NFS servers going away unexpectedly can cause clients to hang, stuck in IO forever.)
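For example, a plain NFS mount (server name, export path and mount point are placeholders) lets you read individual files on demand instead of copying the whole tree:

# Mount the remote export read-only and pull only the files you need.
$ mkdir -p /mnt/smallfiles
$ mount -t nfs -o ro server:/export/smallfiles /mnt/smallfiles
$ cp /mnt/smallfiles/some/file.txt /tmp/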


Each individual file transfer is a transaction, and each transaction has overhead associated with it. A rough example:

  1. Client tells server: "I want to send a file, the filename is example.txt, size is 100 bytes".
  2. Server tells client: "OK, I am ready to receive".
  3. Client sends 100 bytes of file data to server.
  4. Server acknowledges to the client that the file was received, and closes its local file handle.

In steps 1, 2 and 4, there is an additional round trip between client and server, which reduces throughput. The information sent in these steps also adds to the overall data to transmit: if the metadata is 20 bytes, that is 20% overhead for a 100 byte file.
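To put a rough number on the round trips alone, assume (these figures are illustrative) a 1 ms round-trip time and about three extra round trips per file; for a million files, in the style of the other answer's units calculations:

$ units "1e6 * 3 * 1 ms" "min"
        * 50
        / 0.02

That is 50 minutes spent waiting on per-file chatter before any file data is even counted.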

There is no way to avoid this per-file overhead in these protocols.
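The per-file overhead cannot be removed, but it can be overlapped: running several transfer sessions concurrently keeps the link busy while each individual session waits on its round trips. A minimal sketch, assuming rsync over SSH is available and that /data/smallfiles and user@remote:/dest/ are placeholders:

# One rsync per top-level subdirectory, eight running at a time.
$ find /data/smallfiles -mindepth 1 -maxdepth 1 -type d -print0 \
    | xargs -0 -P 8 -I{} rsync -a {} user@remote:/dest/

This does not reduce the chatter per file; it only hides the latency of one session behind the others, and the IOPS limit described in the other answer still applies.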
