Make ActiveRecord Faster

ActiveRecord is a wrapper around your data. By definition that should take memory, and oh indeed it does. It turns out the overhead is quite significant, in both the number of objects and in raw memory.

To see the overhead, let’s create a database table with 10 string columns and fill it with 10,000 rows, each row containing 10 strings of 100 chars.

chp3/app/db/migrate/20140722140429_large_tables.rb
 
class LargeTables < ActiveRecord::Migration
  def up
    create_table :things do |t|
      10.times do |i|
        t.string "col#{i}"
      end
    end

    execute <<-END
      insert into things(col0, col1, col2, col3, col4,
                         col5, col6, col7, col8, col9) (
        select
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x')
        from generate_series(1, 10000)
      );
    END
  end

  def down
    drop_table :things
  end
end

This migration creates 10 million bytes of data (10,000 * 10 * 100), approximately 9.5 MB. A database is quite efficient at storing that. For example, my PostgreSQL installation uses just 11 MB:

 
$ psql app_development
app_development=# select pg_size_pretty(pg_relation_size('things'));
 pg_size_pretty
----------------
 11 MB

Let’s see how memory-efficient ActiveRecord is. We’ll need to create a Thing model:

chp3/app/app/models/thing.rb
 
class Thing < ActiveRecord::Base
end

And we’ll need to adapt our wrapper.rb measurement helper from the previous chapter to Rails:

chp3/app/lib/measure.rb
 
require 'benchmark'
require 'json'

class Measure
  def self.run(options = {gc: :enable})
    if options[:gc] == :disable
      GC.disable
    elsif options[:gc] == :enable
      # collect memory allocated during library loading
      # and our own code before the measurement
      GC.start
    end

    memory_before = `ps -o rss= -p #{Process.pid}`.to_i / 1024
    gc_stat_before = GC.stat
    time = Benchmark.realtime do
      yield
    end
    gc_stat_after = GC.stat
    GC.start if options[:gc] == :enable
    memory_after = `ps -o rss= -p #{Process.pid}`.to_i / 1024

    puts({
      RUBY_VERSION => {
        gc: options[:gc],
        time: time.round(2),
        gc_count: gc_stat_after[:count].to_i - gc_stat_before[:count].to_i,
        memory: "%d MB" % (memory_after - memory_before)
      }
    }.to_json)
  end
end

For this to work, add the lib directory to Rails’ autoload_paths in config/application.rb.

chp3/app/config/application.rb
 
config.autoload_paths << Rails.root.join('lib')

Got that? Good. Now we can run our migration and measure the memory usage. Note that this needs to be done in production mode to make sure we do not include any of Rails development mode’s side effects.

 
$ RAILS_ENV=production bundle exec rake db:create
$ RAILS_ENV=production bundle exec rake db:migrate
$ RAILS_ENV=production bundle exec rails console
2.2.0 :001 > Measure.run { Thing.all.load }
{"2.2.0":{"gc":"enable","time":0.32,"gc_count":1,"memory":"33 MB"}}
 => nil

ActiveRecord uses 3.5 times more memory than the size of the data. It also triggers one garbage collection during loading.
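To get a feel for where per-row overhead comes from, you can measure raw object sizes with objspace from Ruby's standard library. This is a plain-Ruby sketch with no ActiveRecord involved: one row of our table, rebuilt as an array of 10 strings of 100 chars, already takes noticeably more heap space than its 1,000 bytes of payload.

```ruby
require 'objspace'

# One table row as plain Ruby data: 10 strings of 100 chars
row = Array.new(10) { 'x' * 100 }
payload = 10 * 100 # 1,000 bytes of actual data

# Actual heap bytes: the array object itself plus each string object
actual = ObjectSpace.memsize_of(row) +
         row.sum { |s| ObjectSpace.memsize_of(s) }

puts "payload: #{payload} bytes, in memory: #{actual} bytes"
```

And that is before ActiveRecord adds its own attribute bookkeeping on top of every row.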

ActiveRecord is convenient, but that convenience comes at a steep price. I realize I’m not going to convince you to avoid ActiveRecord. But you do need to understand the consequences of using it. In 80% of cases, the speed of development is worth more than the cost in execution speed. In the remaining 20% of cases, you have other options. Let me show them to you.

Load Only the Attributes You Need

Your first option is to load only the data you intend to use. Rails makes this very easy to do, like this:

 
$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)
2.2.0 :001 > Measure.run { Thing.all.select([:id, :col1, :col5]).load }
{"2.2.0":{"gc":"enable","time":0.21,"gc_count":1,"memory":"7 MB"}}
 => nil

This uses 5 times less memory and runs 1.5 times faster than Thing.all.load. The more columns you have, the more it makes sense to add select into the query, especially if you join tables.

Preload Aggressively

Another best practice is preloading. Every time you iterate over a has_many or belongs_to association, preload it.

For example, let’s add a has_many relationship call to our Thing. We’ll need to set up the migration and ActiveRecord model.

chp3/app/db/migrate/20140724142101_minions.rb
 
class Minions < ActiveRecord::Migration
  def up
    create_table :minions do |t|
      t.references :thing
      10.times do |i|
        t.string "mcol#{i}"
      end
    end

    execute <<-END
      insert into minions(thing_id,
                          mcol0, mcol1, mcol2, mcol3, mcol4,
                          mcol5, mcol6, mcol7, mcol8, mcol9) (
        select
          things.id,
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x'), rpad('x', 100, 'x'), rpad('x', 100, 'x'),
          rpad('x', 100, 'x')
        from things, generate_series(1, 10)
      );
    END
  end

  def down
    drop_table :minions
  end
end

chp3/app/app/models/minion.rb

class Minion < ActiveRecord::Base
  belongs_to :thing
end

chp3/app/app/models/thing.rb

class Thing < ActiveRecord::Base
  has_many :minions
end

Run the migration with RAILS_ENV=production bundle exec rake db:migrate and you will get 10 Minions for each Thing in the database.

Iterating over that data without preloading is not such a good idea.

 
$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)
2.2.0 :001 > Measure.run { Thing.all.each { |thing| thing.minions.load } }
{"2.2.0":{"gc":"enable","time":272.93,"gc_count":16,"memory":"478 MB"}}
 => nil

Good luck waiting for this one line of code to finish. It needs not only to load everything into memory, but also to execute 10,000 queries against the database to fetch the minions for each thing.

Preloading is the better way.

 
$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)
2.2.0 :001 > Measure.run { Thing.all.includes(:minions).load }
{"2.2.0":{"gc":"enable","time":11.59,"gc_count":19,"memory":"518 MB"}}
 => nil

Depending on the Rails version, this might be slightly less memory efficient. But the code finishes 25 times faster because Rails performs only two database queries—one to load things, and another to load minions.
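Under the hood, preloading is essentially a hash join done in Ruby: after the second query, Rails groups all the child rows by their foreign key and attaches them to the parents in memory. A minimal plain-Ruby sketch of that grouping step, with hashes standing in for Thing and Minion records (no database involved):

```ruby
# Parent and child rows as plain hashes standing in for Thing and Minion
things  = (1..3).map { |id| { id: id } }
minions = things.flat_map do |t|
  (1..2).map { |i| { thing_id: t[:id], name: "minion-#{t[:id]}-#{i}" } }
end

# One pass over all child rows instead of one query per parent
minions_by_thing = minions.group_by { |m| m[:thing_id] }
things.each { |t| t[:minions] = minions_by_thing.fetch(t[:id], []) }
```

This is why preloading issues only two queries no matter how many parents you load.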

Combine Selective Attribute Loading and Preloading

Even better is to take my advice from the Load Only the Attributes You Need section and select only the columns we need. But there’s a catch. Rails does not have a convenient way of selecting a subset of columns from the dependent model. For example, this will fail:

 
Thing.all.includes(:minions).select("col1", "minions.mcol4").load

It fails because includes(:minions) runs an additional query to fetch minions for the things it selected, and Rails is not smart enough to figure out which of the selected columns belong to the minions table.

If we queried from the side of the belongs_to association, we would use joins.

 
Minion.where(id: 1).joins(:thing).select("things.col1", "minions.mcol4")

From the has_many side joins will return duplicates of the same Thing object, 10 duplicates in our case. To combat that, we can use the PostgreSQL-specific array_agg feature that aggregates an array of columns from the joined table.

 
$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)
2.2.0 :001 > query = "select id, col1, array_agg(mcol4) from things
2.2.0 :002">   inner join
2.2.0 :003">     (select thing_id, mcol4 from minions) minions
2.2.0 :004">   on (things.id = minions.thing_id)
2.2.0 :005">   group by id, col1"
 => "select id, col1, array_agg(mcol4) from things\n  inner join\n    (select thing_id, mcol4 from minions) minions\n  on (things.id = minions.thing_id)\n  group by id, col1"
2.2.0 :006 > Measure.run { Thing.find_by_sql(query) }
{"2.2.0":{"gc":"enable","time":0.62,"gc_count":1,"memory":"8 MB"}}
 => nil

Just look at the memory consumption: 8 MB instead of 518 MB from a full select with preloading. As a bonus, this runs 20 times faster.

Restricting the number of columns you select can save you seconds of execution time and hundreds of megabytes of memory.

Use the Each! Pattern for Rails with find_each and find_in_batches

It is expensive to instantiate a lot of ActiveRecord models. Rails developers knew that and added two functions to loop through large datasets in batches: find_each and find_in_batches. Both load 1,000 objects at a time by default; the former yields them one by one, the latter yields the whole batch at once. You can ask for smaller or larger batches with the :batch_size option.

find_each and find_in_batches will still have to load all the objects in memory. So how do they improve performance? The effect is the same as with the each! pattern from Use the Each! Pattern. Once you’re done with the batch, GC can collect it. Let’s see how that works.

 
$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)
2.2.0 :001 > ObjectSpace.each_object(Thing).count
 => 0
2.2.0 :002 > Thing.find_in_batches { |batch|
2.2.0 :003?>   GC.start
2.2.0 :004?>   puts ObjectSpace.each_object(Thing).count
2.2.0 :005?> }
1000
2000
… 6 lines elided
2000
2000
 => nil
2.2.0 :006 > GC.start
 => nil
2.2.0 :007 > ObjectSpace.each_object(Thing).count
 => 0

GC indeed collects objects from previous batches, so no more than two batches are in memory during the iteration. Compare this with the regular each iterator over the list of objects returned by Thing.all.

 
$ RAILS_ENV=production bundle exec rails console
Loading production environment (Rails 4.1.4)
2.2.0 :001 > ObjectSpace.each_object(Thing).count
 => 0
2.2.0 :002 > Thing.all.each_with_index { |thing, i|
2.2.0 :003?>   if i % 1000 == 0
2.2.0 :004?>     GC.start
2.2.0 :005?>     puts ObjectSpace.each_object(Thing).count
2.2.0 :006?>   end
2.2.0 :007?> }; nil
10000
10000
… 6 lines elided
10000
10000
 => nil

Here we keep 10,000 objects for the whole duration of the each loop. This increases both total memory consumption and GC time. It also increases the risk of running out of memory if the dataset is too big (remember, ActiveRecord needs 3.5 times more space to store your data).
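The batching idea itself is plain Ruby: process one slice, let it go out of scope, and GC can reclaim it before you touch the next one. A sketch with each_slice standing in for find_in_batches, and strings standing in for instantiated models (no database involved):

```ruby
# 10,000 ids standing in for the rows of our table
ids = (1..10_000).to_a

batches = 0
ids.each_slice(1_000) do |batch|
  rows = batch.map { |id| "row-#{id}" } # stand-in for instantiated models
  batches += 1
  # rows goes out of scope at the end of each iteration,
  # so GC can reclaim the previous batch before the next one is built
end

puts batches # => 10
```

find_each and find_in_batches add one thing this sketch cannot: they fetch each slice from the database on demand, so the full result set never has to exist in memory at all.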

Use ActiveRecord without Instantiating Models

If all you need is to run a database query or update a column in the table, consider using the following ActiveRecord functions that do not instantiate models.

  • ActiveRecord::Base.connection.execute("select * from things")

    This function executes the query and returns its result unparsed.

  • ActiveRecord::Base.connection.select_values("select col5 from things")

    Similar to the previous function, but returns an array of values only from the first column of the query result.

  • Thing.all.pluck(:col1, :col5)

    A variation of the previous two functions. Returns an array of values that contains either the whole row or just the columns you pass as arguments to pluck.

  • Thing.where("id < 10").update_all(col1: 'something')

    Updates columns in the table.

These not only save you memory, but also run faster because they neither instantiate models nor execute before/after filters. All they do is run plain SQL queries and, in some cases, return arrays as the result.
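Conceptually, pluck is just a map over the raw result rows: values go straight from the adapter into arrays, and no model object, attribute tracking, or callback ever happens. A plain-Ruby sketch of that idea, with hashes standing in for the rows a database adapter would return:

```ruby
# Raw result rows as a database adapter might hand them back
rows = [
  { 'id' => 1, 'col1' => 'a', 'col5' => 'e' },
  { 'id' => 2, 'col1' => 'b', 'col5' => 'f' }
]

# What pluck(:col1, :col5) does conceptually: extract values only,
# skipping model instantiation and callbacks entirely
values = rows.map { |r| r.values_at('col1', 'col5') }
# => [["a", "e"], ["b", "f"]]
```

That is the whole trick: arrays of values are far cheaper than ActiveRecord objects, both to build and for GC to clean up.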
