@fatkodima fatkodima commented Apr 8, 2023

Example:

    Person.pluck_in_batches(:name, :email) do |batch|
      jobs = batch.map { |name, email| PartyReminderJob.new(name, email) }
      ActiveJob.perform_all_later(jobs)
    end

    Person.pluck_each(:email) do |email|
      PartyMailer.with(email: email).welcome_email.deliver_later
    end

Plucking in batches is a very popular feature that I have seen many projects reimplement themselves to gain some performance.
I have seen it in two of my previous projects, in OSS projects (I was able to find it in Mastodon), and in a few popular gems.

Benchmarks

Tested on a table with 50M records and compared against the recently introduced optimization for range batching (use_ranges: true).

    CREATE TABLE users (id bigserial PRIMARY KEY, val integer);

    INSERT INTO users (val)
    SELECT floor(random() * 30 + 1)::int
    FROM generate_series(1, 50000000) AS i;

    ANALYZE users;

Whole table batching

Using ranges:

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    User.in_batches(use_ranges: true) do |batch|
      batch.pluck(:id, :val)
    end
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    puts "Elapsed: #{elapsed}s"

Elapsed: 209.20533800008707s

Plucking in batches:

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    User.pluck_in_batches(:id, :val) { }
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    puts "Elapsed: #{elapsed}s"

Elapsed: 113.7704949999461s 🔥

Batching with conditions

Using ranges:

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    User.where("val = 21").in_batches(use_ranges: true) do |batch|
      batch.pluck(:id, :val)
    end
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    puts "Elapsed: #{elapsed}s"

Elapsed: 28.136486999923363s

No ranges:

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    User.where("val = 21").in_batches do |batch|
      batch.pluck(:id, :val)
    end
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    puts "Elapsed: #{elapsed}s"

Elapsed: 39.96518399997149s

Plucking in batches:

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    User.where("val = 21").pluck_in_batches(:id, :val) do |batch|
    end
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    puts "Elapsed: #{elapsed}s"

Elapsed: 16.415813000057824s 🔥

These numbers are for the database on my local machine. The improvement should be much larger in production, because the queries are simpler and the number of SQL queries is cut in half.
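To put the halving of SQL queries in concrete terms, here is some rough arithmetic for the 50M-row table used above. It assumes the default batch size of 1000 and ignores any final probe query. It also assumes that in_batches followed by pluck costs two queries per batch (one to locate the batch, one to pluck its columns). This is an illustrative estimate, not a measurement.

```ruby
# Rough query-count estimate. Assumptions: default batch size of 1000,
# no final empty-batch probe, two queries per batch for in_batches + pluck.
TOTAL_ROWS = 50_000_000
BATCH_SIZE = 1_000

batches = (TOTAL_ROWS.to_f / BATCH_SIZE).ceil

# in_batches + pluck: one query to locate the batch, one to pluck its columns.
in_batches_queries = batches * 2

# pluck_in_batches: a single query per batch returns the columns directly.
pluck_in_batches_queries = batches

puts "in_batches + pluck: #{in_batches_queries} queries"
puts "pluck_in_batches:   #{pluck_in_batches_queries} queries"
```

Under these assumptions the whole-table run drops from about 100,000 queries to about 50,000.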

Also, implementing this feature would make #47466 unneeded.

The logic in pluck_in_batches looks similar to in_batches, but trying to DRY it up (extracting the shared logic into helper methods, or reusing pluck_in_batches inside in_batches) would make the code more complex and harder to understand.
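For readers unfamiliar with the batching loop both methods share, here is a minimal sketch of primary-key (keyset) batching over an in-memory array. It is not the code from this PR: ROWS, fetch_page, and pluck_in_batches_sketch are hypothetical names, and fetch_page merely stands in for a single SQL query.

```ruby
# Hypothetical in-memory stand-in for a table; each row is [id, val].
ROWS = (1..10).map { |id| [id, id * 10] }.freeze

# Stand-in for: SELECT id, val FROM users WHERE id > ? ORDER BY id LIMIT ?
def fetch_page(after_id, limit)
  ROWS.select { |id, _val| id > after_id }.first(limit)
end

# Yields arrays of plucked rows, issuing one "query" per batch.
# Assumes positive integer ids, so 0 is a safe starting cursor.
def pluck_in_batches_sketch(batch_size:)
  last_id = 0
  loop do
    batch = fetch_page(last_id, batch_size)
    break if batch.empty?

    yield batch

    break if batch.size < batch_size # a short batch means the table is exhausted
    last_id = batch.last.first       # resume after the last primary key seen
  end
end

batches = []
pluck_in_batches_sketch(batch_size: 4) { |batch| batches << batch }
puts batches.map(&:size).inspect
```

Each iteration returns the requested columns directly and uses the last id it saw as the cursor for the next one, so no separate query is needed to find the batch boundaries.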

cc @nvasilevski (as we discussed it in https://discuss.rubyonrails.org/t/yield-record-ids-to-in-batches-block/81102)

@fatkodima

For anyone interested: this is currently available as a gem (https://github.com/fatkodima/pluck_in_batches).

marckohlbrugge commented May 22, 2024

Another common use case for this is when generating sitemaps.

Sitemaps need to be regenerated regularly and typically require fetching many database records, but only a handful of columns are needed to build the URLs. Model object functionality is typically not needed.
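A minimal sketch of that sitemap case follows. The URL scheme, the Post model, and the slug and updated_at columns are all assumed names for illustration, and the sample batch stands in for what a pluck_in_batches block would receive.

```ruby
# Hypothetical URL scheme, used only for illustration.
BASE_URL = "https://example.com/posts"

# Turns one batch of plucked [slug, updated_at] pairs into <url> entries.
# No XML escaping is done here; real slugs may need it.
def sitemap_entries(batch)
  batch.map do |slug, updated_at|
    "<url><loc>#{BASE_URL}/#{slug}</loc><lastmod>#{updated_at}</lastmod></url>"
  end
end

# With the API proposed in this PR, batches would arrive via something like:
#   Post.pluck_in_batches(:slug, :updated_at) { |batch| io.puts(sitemap_entries(batch)) }
# (Post, slug and updated_at are assumed names.)
sample_batch = [["hello-world", "2024-05-01"], ["second-post", "2024-05-20"]]
entries = sitemap_entries(sample_batch)
puts entries
```

Only plain arrays of column values are handled, so no model objects are instantiated per row.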

I'm using @fatkodima's gem now, but this would be a welcome addition to Rails core.
