
Thomas Jaskiewicz for Netguru


HowTo: Working with large files in Ruby efficiently.

How can we read files in Ruby?

The test file was generated by running the following command:

❯ openssl req -newkey rsa:2048 -new -nodes -x509 -days 3650 -keyout key.pem -out cert.pem 

It has a clearly defined beginning and end, which will be useful while reading the file.

1. File.read() which is actually IO.read():

> file = File.read("cert.pem")
=> "-----BEGIN CERTIFICATE-----\nMIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\nTDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\nzqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\nTgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\naigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\nraNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\nGNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\naeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\nlybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n-----END CERTIFICATE-----\n"
> file.bytesize
=> 956
> file.class
=> String

The read method reads the entire file's content and returns it as a single String.

2. File.new() and its synonym File.open():

> file = File.new("cert.pem")
=> #<File:cert.pem>
> lines = file.readlines
=> ["-----BEGIN CERTIFICATE-----\n", "MIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\n", "TDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\n", "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\n", "zqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n", "1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\n", "TgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n", "7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\n", "aigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n", "4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\n", "raNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n", "9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\n", "GNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\n", "aeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\n", "lybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n", "-----END CERTIFICATE-----\n"]
> lines.class
=> Array

The new and open methods return an instance of the File class, on which we can call the readlines method. It reads the entire file's content, splits it line by line and returns an Array of Strings, where each element is one line from the file.

3. File.readlines() which is actually IO.readlines():

> lines = File.readlines("cert.pem")
=> ["-----BEGIN CERTIFICATE-----\n", "MIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\n", "TDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\n", "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\n", "zqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n", "1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\n", "TgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n", "7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\n", "aigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n", "4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\n", "raNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n", "9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\n", "GNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\n", "aeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\n", "lybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n", "-----END CERTIFICATE-----\n"]
> lines.class
=> Array

Here, we get the same output as in the previous example by calling just the class method readlines on the File class.

4. File.foreach() which is actually IO.foreach():

> file = File.foreach("./cert.pem")
=> #<Enumerator: ...>
> file.entries
=> ["-----BEGIN CERTIFICATE-----\n", "MIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\n", "TDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\n", "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\n", "zqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n", "1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\n", "TgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n", "7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\n", "aigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n", "4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\n", "raNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n", "9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\n", "GNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\n", "aeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\n", "lybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n", "-----END CERTIFICATE-----\n"]
> file.entries.class
=> Array

The foreach method returns an Enumerator instance, on which we call entries, which returns an Array of Strings; again, each element is a line from the file.

As we can see above, there are many methods that allow us to read a file. However, which one should we use, and why? Let's create a large file and check those methods again!

Which methods should we use to read large files?

Generating our test file

First, let's generate a large file with randomized data inside:

require 'securerandom'

one_megabyte = 1024 * 1024
name = "large_1G"
size = 1000

File.open("./#{name}.txt", 'wb') do |file|
  size.times do
    file.write(SecureRandom.random_bytes(one_megabyte))
  end
end
  • w - Write-only; truncates an existing file to zero length or creates a new file for writing.
  • b - Binary file mode. Suppresses EOL <-> CRLF conversion on Windows and sets the external encoding to ASCII-8BIT unless explicitly specified.

As a result, we generated a 1GB file:

❯ ls -lah
...
-rw-r--r-- 1 user user 1.0G Aug 31 22:10 large_1G.txt

Defining our metrics and profilers

There are two main metrics that we would like to track in our experiment:

  • Time - How long does it take to open and read the file?
  • Memory - How much memory does it take to open and read the file?

There will also be one additional metric describing how many objects were freed by the Garbage Collector.

We can prepare simple profiling methods:

# ./helpers.rb
require 'benchmark'

def profile_memory
  memory_usage_before = `ps -o rss= -p #{Process.pid}`.to_i
  yield
  memory_usage_after = `ps -o rss= -p #{Process.pid}`.to_i
  used_memory = ((memory_usage_after - memory_usage_before) / 1024.0).round(2)
  puts "Memory usage: #{used_memory} MB"
end

def profile_time
  time_elapsed = Benchmark.realtime do
    yield
  end
  puts "Time: #{time_elapsed.round(2)} seconds"
end

def profile_gc
  GC.start
  before = GC.stat(:total_freed_objects)
  yield
  GC.start
  after = GC.stat(:total_freed_objects)
  puts "Objects Freed: #{after - before}"
end

def profile
  profile_memory do
    profile_time do
      profile_gc do
        yield
      end
    end
  end
end

Testing our methods for reading files

  • .read
file = nil
profile do
  file = File.read("large_1G.txt")
end

Objects Freed: 39
Time: 0.52 seconds
Memory usage: 1000.05 MB
  • .new + #readlines
file = nil
profile do
  file = File.new("large_1G.txt").readlines
end

Objects Freed: 39
Time: 4.19 seconds
Memory usage: 1298.4 MB
  • .readlines
file = nil
profile do
  file = File.readlines("large_1G.txt")
end

Objects Freed: 39
Time: 4.24 seconds
Memory usage: 1284.61 MB
  • .foreach
file = nil
profile do
  file = File.foreach("large_1G.txt").to_a
end

Objects Freed: 40
Time: 4.42 seconds
Memory usage: 1284.31 MB

The examples above allowed us to read the whole file and store it in memory as one String or as an Array of Strings (each line of the file as one element of the Array).

As we can see, it requires at least as much memory as the size of the file:

  • one String - a 1GB file requires 1GB of memory.
  • an Array of Strings - 1GB of memory for the file's content plus additional memory for the Array (around 300MB here). This approach has one advantage: we can access any line of the file, as long as we know which line it is.
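The random-access advantage of an Array of lines can be sketched with a small example (the file name and content here are made up for illustration):

```ruby
# Hypothetical small file just for the sketch.
File.write("sample.txt", "line 1\nline 2\nline 3\n")

lines = File.readlines("sample.txt")
# readlines keeps the trailing newline on each element;
# any line is reachable directly by its zero-based index.
second = lines[1]
puts second
```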

At this point we can see that the methods we tested are not really efficient. The bigger the file, the more memory we need. In the longer term this approach might lead to serious consequences, even killing the application.

Now, we need to ask ourselves a question: can we process our files line by line? If so, then we can read them in a different way:

  • .new + #each
file = nil
profile do
  file = File.new("large_1G.txt")
  file.each { |line| line }
end

Objects Freed: 4100808
Time: 2.08 seconds
Memory usage: 57.68 MB
  • .new + #advise + #each
file = nil
profile do
  file = File.new("large_1G.txt")
  file.advise(:sequential)
  file.each { |line| line }
end

Objects Freed: 4100808
Time: 2.22 seconds
Memory usage: 55.71 MB

Calling the #advise method announces an intention to access data from the file in a specific pattern. There is no major improvement here from using #advise.

  • .new + #read - reading chunk by chunk
file = nil
chunk_size = 4096
buf = ""
profile do
  file = File.new("large_1G.txt")
  while buf = file.read(chunk_size)
    buf.tap { |b| b }
  end
end

Objects Freed: 256037
Time: 1.27 seconds
Memory usage: 131.64 MB

We defined the chunk size as 4096 bytes and read our file chunk by chunk. Depending on the structure of your file, this approach might be useful.

  • .foreach + #each_entry
file = nil
profile do
  file = File.foreach("large_1G.txt")
  file.each_entry { |line| line }
end

Objects Freed: 4100809
Time: 2.22 seconds
Memory usage: 53.02 MB

Here we create an Enumerator instance as file and read the file line by line using the each_entry method.

The first thing we can notice is that memory usage is way lower. The main reason is that we read the file line by line, and once a line is processed it gets garbage collected. We can see that in the Objects Freed count, which is quite high.

We also tried the #advise method, with which we can tell the system how we intend to process our file. More about IO#advise can be found in the documentation. Unfortunately, it didn't help us out here.

Besides the IO#each method, there are also similar methods like IO#each_byte (reading byte by byte), IO#each_char (reading char by char) and IO#each_codepoint.
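As a quick sketch of the byte- and char-wise variants (the file name and content here are assumptions for the example):

```ruby
# Hypothetical file containing three ASCII characters.
File.write("sample.txt", "abc")

bytes = []
File.open("sample.txt", "rb").each_byte { |b| bytes << b }  # each_byte yields Integers

chars = []
File.open("sample.txt").each_char { |c| chars << c }        # each_char yields one-character Strings
```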

In the example with reading by chunks (IO#read), memory usage will vary depending on the chunk size. If you find this approach useful, you can experiment with the chunk size.
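Such an experiment could look like the following sketch; the file name, file size and the candidate chunk sizes are all assumptions chosen just for illustration:

```ruby
require 'benchmark'

# Hypothetical 1MB file standing in for the real large file.
File.write("sample.txt", "x" * (1024 * 1024))

results = {}
[1024, 4096, 65_536].each do |chunk_size|
  total = 0
  time = Benchmark.realtime do
    File.open("sample.txt", "rb") do |f|
      # Read until read returns nil at EOF, counting bytes seen.
      while (buf = f.read(chunk_size))
        total += buf.bytesize
      end
    end
  end
  results[chunk_size] = total
  puts "chunk #{chunk_size}: #{time.round(4)}s"
end
```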

When using IO.foreach we operate on an Enumerator, which gives us a few more methods like each_entry, each_slice and each_cons. There is also a lazy method, which returns an Enumerator::Lazy. A lazy Enumerator has a few additional methods which enumerate values only on an as-needed basis. If you don't need to read the entire file but are, for example, looking for a particular line containing a given expression, then it might be worth checking out.
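A minimal sketch of that lazy lookup (the file name and the searched pattern are made up for the example):

```ruby
# Hypothetical file; in practice this would be the large file.
File.write("sample.txt", "alpha\nbeta\ngamma\n")

# lazy + select + first stops enumerating lines as soon as a match is found,
# so the rest of the file is never read.
match = File.foreach("sample.txt").lazy.select { |line| line.include?("beta") }.first
```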

I could finish the article at this point, but what if before we even start reading the file we need to decrypt it? Let's move further to the example.

Decrypting large file and processing it line by line

Prerequisites

Before we can decrypt the file, we need to encrypt our generated large file. We are going to use AES with a 256-bit key in Cipher Block Chaining (CBC) mode.

require 'openssl'

cipher = OpenSSL::Cipher::AES256.new(:CBC)
cipher.encrypt
KEY = cipher.random_key
IV = cipher.random_iv

Now, let's encrypt our file:

cipher = OpenSSL::Cipher::AES256.new(:CBC)
cipher.encrypt
cipher.key = KEY
cipher.iv = IV

file = nil
enc_file = nil
profile do
  file = File.read("large_1G.txt")
  enc_file = File.open("large_1G.txt.enc", "wb")
  enc_file << cipher.update(file)
  enc_file << cipher.final
end
enc_file.close

Objects Freed: 12
Time: 3.6 seconds
Memory usage: 1000.02 MB

It seems encrypting is also quite a memory-consuming task. Let's adjust the algorithm a little bit:

cipher = OpenSSL::Cipher::AES256.new(:CBC)
cipher.encrypt
cipher.key = KEY
cipher.iv = IV

file = nil
enc_file = nil
profile do
  buf = ""
  file = File.open("large_1G.txt", "rb")
  enc_file = File.open("large_1G.txt.enc", "wb")
  while buf = file.read(4096)
    enc_file << cipher.update(buf)
  end
  enc_file << cipher.final
end
file.close
enc_file.close

Objects Freed: 768048
Time: 5.05 seconds
Memory usage: 145.93 MB

Changing the algorithm to read and encrypt the file chunk by chunk made the task much less memory consuming.

Decrypt

All right, let's try to decrypt it now:

decipher = OpenSSL::Cipher::AES256.new(:CBC)
decipher.decrypt
decipher.key = KEY
decipher.iv = IV

dec_file = nil
enc_file = nil
profile do
  buf = ""
  enc_file = File.open("large_1G.txt.enc", "rb")
  dec_file = File.open("large_1G.txt.dec", "wb")
  while buf = enc_file.read(4096)
    dec_file << decipher.update(buf)
  end
  dec_file << decipher.final
end
dec_file.close
enc_file.close

Objects Freed: 768050
Time: 3.5 seconds
Memory usage: 152.12 MB

Now, let's compare the files to check whether we encrypted and decrypted them properly:

❯ diff large_1G.txt large_1G.txt.dec 

No differences were found. We are good here!
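The same check can be done from Ruby with FileUtils.compare_file; the small stand-in files below are assumptions for the sketch, taking the place of large_1G.txt and large_1G.txt.dec:

```ruby
require 'fileutils'

# Hypothetical files with identical content, standing in for the
# original and the decrypted file.
File.write("a.txt", "same content\n")
File.write("b.txt", "same content\n")

# compare_file returns true when both files have identical content.
identical = FileUtils.compare_file("a.txt", "b.txt")
```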

We managed to lower the memory usage quite significantly. That's great!

Treat this article as a toolset that you can use in your specific case.

This article was originally posted on my personal dev blog: https://tjay.dev/

Photo by Erwan Hesry on Unsplash
