Wednesday, April 30, 2008

Tiny MCE and XSS

Awfully obvious but documented here in case you have a brain-fart like I did.

javascript_include_tag "#{request.protocol}#{}#{request.port_string}/javascripts/tiny_mce/#{RAILS_ENV == 'development' ? 'tiny_mce_src.js' : 'tiny_mce.js'}"

Tuesday, April 22, 2008

Rails' distributed asset hosts

Chad and others have written about the rails 2.0.2 (or was it 2.0.1?) feature of distributed asset hosts. Here's how I made it work for Amazon's S3 service. To enable the feature you add something similar to this in your environment setup files:

# Enable serving of images, stylesheets, and javascripts from an asset server
config.action_controller.asset_host = ""

You'd have four asset hosts set up, like asset0..., asset1..., asset2..., On S3 I have these set up as 4 buckets. Good article for doing this here.

In the compute_asset_host method of ActionView::Helpers, rails takes the modulus 4 of the hash function of the file source to calculate which asset host to use in the asset tag. For example, /javascripts/prototype.js gets hashed and assigned to asset: and its url becomes

The easy way is to just replicate all resources across all asset hosts. Or in other words spend quadruple for file storage since you're storing four copies of everything. I couldn't stand that because so many of the resources in MyTripScrapbook are uploaded by users and each save would take four times as long (or I'd have to spin off threads to do the duplicate writes). Be nice if you could use the same hash function to know which asset host to place the resource in advance. But there's a problem.

Rails' compute_asset_host method computes the hash on the source of the resource. My example above was too simplistic. Typically, this is what such a source looks like: /javascripts/prototype.js?2398238734. And, yes, every time the cache-busting timestamp is changed, you'd get a different hash result (well, 75% of the time). So you'd have to replicate all resources across all assets since you'd never know where it would be fetched from.

To fix this, the asset_host method also accepts a Proc argument. Here's what my method looks like in production.rb:

config.action_controller.asset_host = do |source|
host = ""
filename = File.basename(source).split('?').first
host % (filename.hash % 4)

All it does is to strip the path from the source of the resource and strip off any cache-busters on the end. Leaving the raw file name which is repeatable. Rails adds the cache-baster and the path stuff back on when it generates the asset tag anyway but at least we have it looking on a repeatable asset host.

Using this script, I ran through all the existing resources in the existing single S3 bucket and wrote them to the four buckets. Then I deleted the single bucket.

#!/usr/bin/env ruby

require File.dirname(__FILE__) + '/../config/boot'
PUBLICDIR = File.expand_path("public", RAILS_ROOT)
require 'aws/s3'
require 'yaml'

bucket = ""

access_keys = YAML.load_file("#{RAILS_ROOT}/config/amazon_s3.yml")['production']
AWS::S3::Base.establish_connection! :access_key_id => access_keys['access_key_id'], :secret_access_key => access_keys['secret_access_key']
last_item_read = ''
while 1 == 1
assets = AWS::S3::Bucket.find("", :marker=>last_item_read)
assets.objects.each do |item|

source = File.basename(item.key).split('?').first
host = bucket % (source.hash % 4 )
puts "uploading #{source} as #{item.key} into #{host}" item.key, item.value, host, {:access=>:public_read, :cache_control=>"max-age=252460800", :expires=>"Fri, 22 Apr 2016 02:13:54 GMT"}
last_item_read = item.key
break unless assets.is_truncated

I was too lazy to insert better variables for the cache-control and expiry headers. I had a couple thousand resources in production and the script took about an hour to run.

Within the application I made sure I used similar hash function to write new resources to the correct bucket. There's no magic to using modulus 4 to assign to 4 asset hosts. You could use 5 or 10 or whatever. I am ok relying on the rails core team selection of four as an optimal number.

I must admit it took a while for me to grok that S3 doesn't have folders or paths. Every resource is stored by a unique key string. The keys just happen to have slashes in them and look like paths. In retrospect I'd design my key names better for S3 than simply using the existing folders in my original file system implementation. But it works.

If ruby's implementation of the hash function changes (always on the cards, I guess) (or perhaps you switch over to JRuby at some point) then you'd need to re-hash the whole collection (or duplicate the original bit-shifting C function) but at least it would be a one-time cost.

One thing Chad mentions didn't work for me. S3 doesn't allow you to use wildcard DNS to associate 4 asset host names to a single bucket. You have to actually have 4 separate buckets set up.

I'm sure this approach only makes sense when you have a high cost for initial writes of the resources and synching methods don't seem practical. Disk storage really is cheap, even at S3. Most applications would be much better off replicating all resources across all asset hosts.

Was it worth it? MyTripScrapbook is a very graphics intensive application because we try to make the pages so pretty. Implementing multiple distributed asset hosts literally halved our page loading times. Average page went from 6 seconds to 3 seconds. There's lots of other performance tuning (really, lots) to do but this was a clear win and worth the effort.

Sunday, April 20, 2008

Moving to morph: postgres migration.

This is the first in a series of posts about moving from a perfectly good VPS ( to the morph exchange platform. We love being on morph but it was not trivial for us to make the move.

Bear in mind, we have a production application with some 200MB of data over 6 months of operation. Not big, by any means, but significant.

Before anything else, we had to move from mysql to postgres. Morph only uses postgres (for now) because it supports almost-real-time database backups to an S3 bucket. We love it because it removes our 10 minute transaction log backups and nightly complete dumps.

First install postgres 8.3 on the mac.
Then install phppgadmin for admin features because that's what morph gives you to interact with your production database.
Doing the data migration turned out to be a headache.
First, we did a mysqldump from production, FTP'd it to a dev machine, and inserted it into a clean copy of the application and database.
We set up a different (clean) mac with only postgres and rails so that we'd be sure to get exactly the right pieces installed.
Then we followed this recipe to get the data into postgres.
For some reason the rake db:schema_dump made some errors with foreign key columns (making them :string instead of :integer). So we had to edit those by hand.
Turns out our data had quite a few rows with embedded line breaks (\n and \r) and we had to edit these by hand.

Application implications for postgres.
Many times we inadvertently relied on mysql retuning rows from an ActiveRecord fetch in id order. Postgres will not. Need to add specific :order hashes to every fetch.
Mysql sorts on :boolean fields differently than postgres does. We wanted NULL values first, which mysql does. Postgres, of course, puts them last.
Beware of mysql-specific functions in your sql. For example Postgres does not support the CONCAT function. I don't know if it's an ANSI/SQL92 function, but postgres uses the "||" operator to concat string columns.

Migration. Getting the data into the morph db when the app is first initialized requires setting up the schema and then inserting all the current production rows. We tried using pgphpadmin's SQL dialogs to do this remotely but we failed each time. We like rails, so we did it in the migration. To do it, we dumped the 200 MB of SQL into 4 separate files at the rails root. Then in the migration we called them via:

%w(tp1.sql tp2.sql tp3.sql tp4.sql).each do |sql_file| do |file|


Postgres keeps its serial columns' next sequential ids in so-called sequential tables. The inserts performed in the migration will not have incremented these because we wanted to use the existing id values from the production db. However, for new rows to be inserted in those tables, you'll need to update the sequential tables. We used this in the migration after the bulk load. Notice how we exclude tables that do not use sequential ids and empty tables.

# Gather tables that use sequential ids.
set_id_tables =
ActiveRecord::Base.connection.tables.sort.reject! do |tbl|
['schema_info', 'bookings_users', 'friendships', 'journal_entries_users', 'roles_users', 'sessions', 'slide_desk_images'].include?(tbl)

# Set the sequence to the highest of inserted records.
set_id_tables.each do |tbl|
execute "select setval('#{tbl}_id_seq', (select max(id) from #{tbl}));"

I'll add references and links to this post over the next few days.

Thursday, April 10, 2008

morph - a new deployment platform

I met Guy Naor, the CTO and one of the founders of morph exchange, at the Acts_as_conference in Orlando in March 2008. He spoke very eloquently about his new deployment platform for rails (and other) applications. They basically provide Amazon EC2 instances with an automated deployment system. You can choose how many instances to run on a sliding scale of price. The lowest level is free and the next lowest level (running two mongrels, one each on two EC2 instances) is around $30 per month. Their price list runs out at 12 mongrels (on 12 instances) at around $340 per month. I'm sure they'll also support more than that.

This means you can upload a rails app and scale it to reasonably large deployments with literally no effort.

For startups (like mine) that's gold. We've been using a variety of server deployments for MyTripScrapbook, most recently a VPS with We hosted out own apache2, mysql and mongrels on debian linux. We had no problem keeping 8 mongrels running in production with 512 MB RAM and this worked great with our Beta test customers and internal staff. However we knew that our first big customer would add 10,000 users in no time. 10,000 users who all wanted to use the application every day for sessions that are about an hour long.

Morph seemed like the best offering available. But it was a different direction than we'd been planning. The biggest change for us was the architecture change to replicated app servers. In later posts I'll describe some morph-specific changes we had to make, but for this post, I'll focus on moving all static assets to Amazon S3.

Our application is asset-heavy. We pack around 720KB of images on a typical page and another 170KB of javascript files - altogether around 45 resource requests. In order to optimize this we used Amazon's S3 service as a poor-man's Content Delivery Network. We split those static resources out over 4 S3 buckets for the bulk of the images plus another 2 buckets for javascript, css files, and the images used in css backgrounds. We wrote a hashing algorithm to determine which assets are stored and fetched from which asset host. This allows the typical browser to pipeline around 10 resource requests simultaneously. Amazon's network and availability also ensures very speedy delivery of our static content. Our pages typically load in half the time than before. Please see an earlier post for the rails-specific details.

Tuesday, April 1, 2008

TinyMCE and using Rails asset hosts.

Got a weird problem with running the TinyMCE editor from an asset host. It loads ok, but in IE (and IE only: 6 and 7) it fails to load the replaced contents of the textarea form field into its own rich text display. The rich text display is an iframe that it creates on the fly and fills by DOM manipulation.

I am almost certain what's going on that IE's XSS protection is different from FF, Safari, and Opera's and it's preventing the iframe from receiving the data from the javascript served from the asset host.

So I made the javascript tag for that js file only into a fully qualified URL. The rails asset tag won't try to fetch that from the asset host. It works but I don't like it.