Rails 2.3: Batch Finding
Monday, February 23, 2009 at 1:52AM
If you've ever worked with a huge number of Active Record objects and watched your server memory at the same time, you may have noticed considerable bloat in your Rails processes. That's because Active Record doesn't support database cursors (alas!), so all of those records come into memory at once.
In Rails 2.3, ActiveRecord::Base is adding two methods to help with this problem: find_in_batches and each (renamed find_each before the final 2.3 release). Both methods fetch records in groups of 1000 by default, allowing you to process one group before proceeding to the next and keeping memory pressure down.
find_in_batches is the basic method here:
[sourcecode language='ruby']
Account.find_in_batches(:conditions => {:credit => true}) do |accts|
  accts.each { |account| account.create_daily_charges! }
end
[/sourcecode]
find_in_batches takes most of the options that find does, with the exception of :order and :limit. Records will always be returned in order of ascending primary key (and the primary key must be an integer). To change the number of records in a batch from the default 1000, use the :batch_size option.
The find_each method (called each in pre-release versions of 2.3) provides a wrapper around find_in_batches that yields individual records:
[sourcecode language='ruby']
Account.find_each do |account|
  account.create_daily_charges!
end
[/sourcecode]
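To make the batching idea concrete, here is a minimal plain-Ruby sketch of keyset-style batching. It is not the Rails implementation, only an illustration under stated assumptions: an in-memory array of hashes stands in for the table, and the names ROWS, fetch_batch, each_batch and each_row are hypothetical. Each batch is fetched as "rows with an id greater than the last id seen, ordered by id, limited to the batch size", which suggests why the order is fixed to the ascending primary key.

```ruby
# Stand-in for a database table: rows keyed by a positive integer primary key.
ROWS = (1..10).map { |id| { id: id, credit: id.even? } }

# Hypothetical helper that emulates
#   SELECT ... WHERE id > last_id ORDER BY id ASC LIMIT batch_size
def fetch_batch(rows, last_id, batch_size)
  rows.select { |row| row[:id] > last_id }
      .sort_by { |row| row[:id] }
      .first(batch_size)
end

# Yields arrays of at most batch_size rows, one batch at a time.
def each_batch(rows, batch_size = 1000)
  last_id = 0 # assumes ids start above zero
  loop do
    batch = fetch_batch(rows, last_id, batch_size)
    break if batch.empty?
    yield batch
    last_id = batch.last[:id] # the next batch starts after this key
  end
end

# Yields single rows by wrapping the batch iterator.
def each_row(rows, batch_size = 1000)
  each_batch(rows, batch_size) { |batch| batch.each { |row| yield row } }
end

batch_sizes = []
each_batch(ROWS, 4) { |batch| batch_sizes << batch.size }
puts batch_sizes.inspect # prints [4, 4, 2]

ids = []
each_row(ROWS, 4) { |row| ids << row[:id] }
puts ids.inspect # prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Only one batch is held at a time, so peak memory depends on the batch size rather than on the size of the table.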
A couple of caveats: first, if you're trying to loop through fewer than 1000 or so records, you should avoid the overhead of batches and use something like Account.all.each or a regular finder to get the records. Second, if the table is very active (i.e., has a constant stream of inserts and deletes), the batch methods may miss records because the table changes between batches, whereas finding all records at once will at least give you a complete point-in-time snapshot.
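The second caveat can be illustrated with a small plain-Ruby simulation. This is only a sketch under stated assumptions: the table is a hypothetical in-memory array, and the "concurrent" delete and insert are simulated by mutating it after the first batch. It compares a point-in-time snapshot with a batched walk over the same data.

```ruby
# Hypothetical in-memory table with ids 1 through 6.
table = (1..6).map { |id| { id: id } }

# Point-in-time snapshot: every row is loaded before processing begins.
snapshot_ids = table.map { |row| row[:id] }

# Batched walk (batch size 2) while the table changes underneath it.
seen_ids = []
last_id = 0
first_batch = true
loop do
  batch = table.select { |row| row[:id] > last_id }
               .sort_by { |row| row[:id] }
               .first(2)
  break if batch.empty?
  seen_ids.concat(batch.map { |row| row[:id] })
  last_id = batch.last[:id]
  if first_batch
    table.reject! { |row| row[:id] == 5 } # simulated concurrent delete
    table << { id: 7 }                    # simulated concurrent insert
    first_batch = false
  end
end

puts snapshot_ids.inspect # prints [1, 2, 3, 4, 5, 6]
puts seen_ids.inspect     # prints [1, 2, 3, 4, 6, 7]
```

The batched walk never sees row 5 and does see row 7, so its result matches neither the table as it was at the start nor a single consistent moment in between.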

Reader Comments (5)
Thanks, it's an interesting feature.
Will Paginate (http://wiki.github.com/mislav/will_paginate) has a very handy paginated_each method that functions similarly. It's nice to see this key functionality getting baked right in.
OK I just have to wonder... was the batch processing AT ALL inspired by my plugin (released on googlecode Apr'08, now on github):
http://github.com/pboling/boling_for_batches/tree/master
which does the same thing (and more still!).
Of course I realize that there is the distinct possibility that core rails folks hacked this up without a wisp of knowledge of my plugin, but still... if you'd like some ideas for other enhancements to batch processing take a look at it.
To enable stopping and starting of batch processing, you have the option of using :first_batch and :last_batch.
Also why can't you honor the :order param? My plugin handles it fine...
With my batch plugin you can do this:
batch = BolingForBatches::Batch.new(:klass => Payment, :select => "DISTINCT transaction_id", :batch_size => 50, :order => 'transaction_id DESC', :first_batch => 10, :last_batch => 20)
batch.run(:check_status, false, true, true) #extra params are sent to method
I guess it was done this way to minimise changes to the existing AR code and adapters, but it's a long way from ideal.
With the OCI8 (Oracle) library, response rows are received from the server as a stream. So it would be really cool if ActiveRecord were able to build and yield each object as it was received, instead of adding them to an array and returning the array at the end. Then you would have a single query, no need to set a batch size, and no consistency problems.
I'm not sure how well this would work with other databases. For example, if you do a MySQL SELECT which returns a million rows, does the MySQL client library buffer the entire response before returning it to the application? I'd hope not...
You might want to note in your post that .each has been changed to find_each to resolve some conflicts.
Thanks!