Tuesday, April 22, 2008

Rails' distributed asset hosts

Chad and others have written about the Rails 2.0.2 (or was it 2.0.1?) feature of distributed asset hosts. Here's how I made it work with Amazon's S3 service. To enable the feature, you add something like this to your environment setup files:

# Enable serving of images, stylesheets, and javascripts from an asset server
config.action_controller.asset_host = "http://asset%d.mytripscrapbook.com"

You'd have four asset hosts set up: asset0..., asset1..., asset2..., and asset3.mytripscrapbook.com. On S3 I have these set up as four buckets. Good article for doing this here.

In the compute_asset_host method of ActionView::Helpers, Rails takes the hash of the file source modulo 4 to decide which asset host to use in the asset tag. For example, /javascripts/prototype.js gets hashed and assigned to, say, asset3.mytripscrapbook.com, and its URL becomes http://asset3.mytripscrapbook.com/javascripts/prototype.js
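As a rough sketch of what that selection amounts to (the `%d` substitution and modulo-4 behavior are as described above; the exact bucket index depends on your Ruby build's String#hash):

```ruby
# Minimal sketch of Rails' host selection: fill the %d placeholder with
# the source string's hash modulo 4. (Ruby's % with a positive divisor
# always yields 0..3, even for a negative hash value.)
asset_host = "http://asset%d.mytripscrapbook.com"
source     = "/javascripts/prototype.js"

index = source.hash % 4   # deterministic within one Ruby build
host  = asset_host % index
puts "#{host}#{source}"
```

The index is stable for a given string within one Ruby process, which is what makes the whole scheme work.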

The easy way is to just replicate all resources across all asset hosts. Or, in other words, spend quadruple on file storage, since you're storing four copies of everything. I couldn't stand that, because so many of the resources in MyTripScrapbook are uploaded by users, and each save would take four times as long (or I'd have to spin off threads to do the duplicate writes). It would be nice if you could use the same hash function to know in advance which asset host to place a resource in. But there's a problem.

Rails' compute_asset_host method computes the hash on the source of the resource, and my example above was too simplistic. Typically a source looks like this: /javascripts/prototype.js?2398238734. And, yes, every time the cache-busting timestamp changes, you get a different hash result (well, 75% of the time). So you'd have to replicate all resources across all asset hosts, since you'd never know which one a file would be fetched from.
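You can see both the problem and the eventual fix in plain Ruby (a sketch; the timestamps here are made up):

```ruby
# Two sources for the same file, differing only in the cache-buster:
a = "/javascripts/prototype.js?2398238734"
b = "/javascripts/prototype.js?2398238799"

# Hashing the full source usually lands on different hosts:
puts a.hash % 4
puts b.hash % 4

# But stripping the query string leaves a stable key:
puts File.basename(a).split('?').first   # prototype.js
puts File.basename(b).split('?').first   # prototype.js
```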

To fix this, asset_host also accepts a Proc. Here's what mine looks like in production.rb:

config.action_controller.asset_host = Proc.new do |source|
  host = "http://asset%d.mytripscrapbook.com"
  filename = File.basename(source).split('?').first
  host % (filename.hash % 4)
end

All it does is strip the path from the source of the resource and strip off any cache-buster on the end, leaving the raw file name, which is repeatable. Rails adds the cache-buster and the path back on when it generates the asset tag anyway, but at least it's now looking on a repeatable asset host.
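Pulled out of Rails, the Proc is easy to sanity-check on its own (note one side effect of hashing only the basename: identically named files in different folders would also share a bucket):

```ruby
# The same Proc as above, exercised outside of Rails:
asset_host_proc = Proc.new do |source|
  host = "http://asset%d.mytripscrapbook.com"
  filename = File.basename(source).split('?').first
  host % (filename.hash % 4)
end

# Same file, different cache-busters: same host both times.
h1 = asset_host_proc.call("/javascripts/prototype.js?2398238734")
h2 = asset_host_proc.call("/javascripts/prototype.js?2398238799")
puts h1 == h2   # true
```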

Using this script, I ran through all the existing resources in the existing single S3 bucket and wrote them to the four buckets. Then I deleted the single bucket.

#!/usr/bin/env ruby

require File.dirname(__FILE__) + '/../config/boot'
require 'aws/s3'
require 'yaml'

bucket = "asset%d.mytripscrapbook.com"

access_keys = YAML.load_file("#{RAILS_ROOT}/config/amazon_s3.yml")['production']
AWS::S3::Base.establish_connection!(
  :access_key_id     => access_keys['access_key_id'],
  :secret_access_key => access_keys['secret_access_key'])

last_item_read = ''
loop do
  # S3 listings are paginated; :marker resumes the listing after the
  # last key we read on the previous page
  assets = AWS::S3::Bucket.find("assets.mytripscrapbook.com", :marker => last_item_read)
  assets.objects.each do |item|
    # Same hash as the asset_host Proc: strip the path and any cache-buster
    source = File.basename(item.key).split('?').first
    host = bucket % (source.hash % 4)
    puts "uploading #{source} as #{item.key} into #{host}"
    AWS::S3::S3Object.store item.key, item.value, host,
      :access        => :public_read,
      :cache_control => "max-age=252460800",
      :expires       => "Fri, 22 Apr 2016 02:13:54 GMT"
    last_item_read = item.key
  end
  break unless assets.is_truncated
end

I was too lazy to pull the cache-control and expiry headers out into proper variables. I had a couple thousand resources in production, and the script took about an hour to run.

Within the application, I made sure I used the same hash function to write new resources to the correct bucket. There's no magic to using modulo 4 to assign to four asset hosts; you could use 5 or 10 or whatever. I'm OK relying on the Rails core team's selection of four as an optimal number.
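A hypothetical helper for that write side (my naming, not actual app code) would just mirror the Proc, so uploads land in the bucket Rails will later read from:

```ruby
# Assumed constants mirroring the asset_host configuration:
BUCKET_PATTERN = "asset%d.mytripscrapbook.com"
HOST_COUNT = 4   # match however many asset hosts you configured

# Which bucket a new resource belongs in, given its key or source path.
# Must use the exact same strip-and-hash logic as the asset_host Proc.
def bucket_for(key)
  filename = File.basename(key).split('?').first
  BUCKET_PATTERN % (filename.hash % HOST_COUNT)
end

puts bucket_for("/images/users/42/photo.jpg")
```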

I must admit it took me a while to grok that S3 doesn't have folders or paths. Every resource is stored by a unique key string; the keys just happen to have slashes in them and look like paths. In retrospect I'd design my key names better for S3 than simply reusing the folder paths from my original file-system implementation. But it works.

If Ruby's implementation of the hash function changes (always on the cards, I guess), or perhaps you switch over to JRuby at some point, you'd need to re-hash the whole collection (or duplicate the original bit-shifting C function), but at least it would be a one-time cost.
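One way to sidestep that risk entirely (my suggestion, not what the app actually does): hash with something implementation-independent like CRC32, which yields the same bucket index on any Ruby:

```ruby
require 'zlib'

# Zlib.crc32 implements the standard CRC-32 checksum, so the resulting
# index is stable across MRI versions and JRuby, unlike String#hash.
def stable_bucket_index(filename, count = 4)
  Zlib.crc32(filename) % count
end

puts stable_bucket_index("prototype.js")
```

Switching schemes would still mean re-hashing the collection once, but never again after that.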

One thing Chad mentions didn't work for me: S3 doesn't allow you to use wildcard DNS to associate four asset host names with a single bucket. You actually have to set up four separate buckets.

I'm sure this approach only makes sense when initial writes of resources are expensive and syncing methods don't seem practical. Disk storage really is cheap, even at S3; most applications would be much better off replicating all resources across all asset hosts.

Was it worth it? MyTripScrapbook is a very graphics-intensive application, because we try to make the pages so pretty. Implementing multiple distributed asset hosts literally halved our page-loading times: the average page went from 6 seconds to 3 seconds. There's lots of other performance tuning (really, lots) to do, but this was a clear win and worth the effort.
