A Fresh Cup is Mike Gunderloy's software development weblog, covering Ruby on Rails and whatever else I find interesting in the universe of software. I'm a full-time software developer: most of my time in recent years has been spent writing Rails, though I've dabbled in many other things and like most people who have been writing code for decades I can learn new stuff as needed.

Currently I'm unemployed and starting to look around for my next opportunity as a senior manager, team lead, or lead developer. Drop me a comment if you're interested or email MikeG1 [at] larkfarm.com.


Rails 2.3: Batch Finding

If you've ever worked with a huge number of Active Record objects while watching your server's memory, you may have noticed considerable bloat in your Rails processes. That's because Active Record doesn't support database cursors (alas!), so all of those records come into memory at once.

In Rails 2.3, ActiveRecord::Base is adding two methods to help with this problem: find_in_batches and each. Both of these methods retrieve records from the database in batches of 1000, allowing you to process one batch before fetching the next and keeping the memory pressure down.

find_in_batches is the basic method here:

[sourcecode language='ruby']
Account.find_in_batches(:conditions => {:credit => true}) do |accts|
  accts.each { |account| account.create_daily_charges! }
end
[/sourcecode]

find_in_batches takes most of the options that find does, with the exception of :order and :limit. Records will always be returned in order of ascending primary key (and the primary key must be an integer). To change the number of records in a batch from the default 1000, use the :batch_size option.
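The ascending-primary-key requirement falls out of how the batching works: each batch is fetched with a query for ids greater than the last id already seen. Here's a toy sketch of that strategy in plain Ruby (no database; the Record struct and in-memory ROWS array are stand-ins I've made up, not anything from Active Record):

```ruby
# Toy sketch of the paging strategy: repeatedly fetch the rows whose id
# exceeds the last id seen, in ascending id order, batch_size at a time.
Record = Struct.new(:id, :name)
ROWS = (1..10).map { |i| Record.new(i, "row-#{i}") }

def find_in_batches_sketch(rows, batch_size)
  last_id = 0
  loop do
    # Roughly: SELECT * FROM rows WHERE id > last_id ORDER BY id LIMIT batch_size
    batch = rows.select { |r| r.id > last_id }.sort_by(&:id).first(batch_size)
    break if batch.empty?
    yield batch
    last_id = batch.last.id
  end
end

batch_sizes = []
find_in_batches_sketch(ROWS, 3) { |batch| batch_sizes << batch.size }
# batch_sizes => [3, 3, 3, 1]
```

Since the cursor is just "the last id I saw," an :order option would break the paging, which is why it isn't honored.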

The each method provides a wrapper around find_in_batches that yields individual records:

[sourcecode language='ruby']
Account.each do |account|
  account.create_daily_charges!
end
[/sourcecode]

A couple of caveats: first, if you're looping through fewer than 1000 or so records, skip the overhead of batching and use something like Account.all.each or a regular finder to get the records. Second, if the table is very active (i.e., has a constant stream of inserts and deletes), the batch methods may miss records because the table changes between batches (whereas finding all records at once will at least give you a complete point-in-time snapshot).
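That second caveat is easy to see with a toy version of the id-based paging (plain Ruby, no database; the integers stand in for primary keys):

```ruby
# Each batch is a separate query, so a row deleted while you iterate is
# simply missed; the iteration is not a point-in-time snapshot.
table = [1, 2, 3, 4]  # stand-in for a table of integer primary keys

seen = []
last_id = 0
loop do
  batch = table.select { |id| id > last_id }.sort.first(2)  # WHERE id > ? LIMIT 2
  break if batch.empty?
  table.delete(4) if batch.include?(1)  # another process deletes row 4 mid-run
  seen.concat(batch)
  last_id = batch.last
end
# seen => [1, 2, 3]; fetching everything in a single query up front would
# have returned [1, 2, 3, 4]
```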

Reader Comments (5)

Thanks, it's an interesting feature.

February 23, 2009 | Unregistered Commenterktulhu

Will Paginate (http://wiki.github.com/mislav/will_paginate) has a very handy paginated_each method that functions similarly. It's nice to see this key functionality getting baked right in.

February 23, 2009 | Unregistered CommenterW. Andrew Loe III

OK I just have to wonder... was the batch processing AT ALL inspired by my plugin (released on googlecode Apr'08, now on github):


which does the same thing (and more still!).

Of course I realize that there is the distinct possibility that core rails folks hacked this up without a wisp of knowledge of my plugin, but still... if you'd like some ideas for other enhancements to batch processing take a look at it.

To enable stopping and starting of batch processing you have the option of using:

Also why can't you honor the :order param? My plugin handles it fine...

With my batch plugin you can do this:
batch = BolingForBatches::Batch.new(:klass => Payment, :select => "DISTINCT transaction_id", :batch_size => 50, :order => 'transaction_id DESC', :first_batch => 10, :last_batch => 20)
batch.run(:check_status, false, true, true) #extra params are sent to method

March 10, 2009 | Unregistered CommenterPeter Boling

I guess it was done this way to minimise changes to the existing AR code and adapters, but it's a long way from ideal.

With the OCI8 (Oracle) library, response rows are received from the server as a stream. So it would be really cool if ActiveRecord were able to build and yield each object as it was received, instead of adding them to an array and returning the array at the end. Then you would have a single query, no need to set a batch size, and no consistency problems.

I'm not sure how well this would work with other databases. For example, if you do a MySQL SELECT which returns a million rows, does the MySQL client library buffer the entire response before returning it to the application? I'd hope not...

March 13, 2009 | Unregistered CommenterBrian Candler

You might want to note in your post that .each has been renamed to find_each to resolve some conflicts.


May 24, 2009 | Unregistered CommenterMario
