Monday, May 19, 2008

morph - support

I mentioned earlier that I had met Guy Naor, CTO and co-founder of morph at a conference earlier this year. I didn't realize he was also CTO and co-founder at Famundo. I had been reading and learning from his rails blog for at least a year! 

Moving to a new platform and worse, to a platform that only just exited Beta, is a hairy experience. We'd deployed a cut-down version of MyTripScrapbook to morph a couple times to make sure we knew what the process was and what issues we'd run into. 

As is typical for a start-up we received word on a Friday that a very promising customer wanted to start trails in earnest early the following week - Monday, actually. We knew that morph's EC2 instances were serving our site pages faster than our VPS and we really wanted the trial to go well. So we decided to move our entire production platform to morph over the weekend. Talk about insane.

Also I was traveling so it was done from a succession of hotel rooms across Colorado.

It took the whole weekend and there were plenty of hiccups. Most were problems with our giant script to move the production database from MySQL to postgres, one was issues with DNS changes,  some were our lack of understanding with morph, a very few were morph issues.

I'll post in more detail about some of these, but a constant thread through all our problems was the availability and helpfulness of the morph staff. My first Help! email was answered in about 20 minutes and they seemed to have a team responding to our issues over the whole weekend, day and night. From the get-go the support staff cc'd Guy and his support manager Ken on every issue we had.

I could not ask for a more helpful support experience. And it has continued ever since then.

We were ready for our new client on Monday morning and the test went well. Not only that but most of our pages get served in less than half the time than before. We do a lot of image processing (Imagemagick) on many pages and the EC2 instances burn through complex compositing faster than my new iMac! and easily twice as fast as our old VPS. 

We are looking forward to scaling with morph as far as they will go.

Wednesday, April 30, 2008

Tiny MCE and XSS

Awfully obvious but documented here in case you have a brain-fart like I did.



javascript_include_tag "#{request.protocol}#{request.host}#{request.port_string}/javascripts/tiny_mce/#{RAILS_ENV == 'development' ? 'tiny_mce_src.js' : 'tiny_mce.js'}"

Tuesday, April 22, 2008

Rails' distributed asset hosts

Chad and others have written about the rails 2.0.2 (or was it 2.0.1?) feature of distributed asset hosts. Here's how I made it work for Amazon's S3 service. To enable the feature you add something similar to this in your environment setup files:

# Enable serving of images, stylesheets, and javascripts from an asset server
config.action_controller.asset_host = "http://asset%d.mytripscrapbook.com"

You'd have four asset hosts set up, like asset0..., asset1..., asset2..., asset3.mytripscrapbook.com. On S3 I have these set up as 4 buckets. Good article for doing this here.

In the compute_asset_host method of ActionView::Helpers, rails takes the modulus 4 of the hash function of the file source to calculate which asset host to use in the asset tag. For example, /javascripts/prototype.js gets hashed and assigned to asset: asset3.mytripscrapbook.com and its url becomes http://asset3.mytripscrapbook.com/javascripts/prototype.js

The easy way is to just replicate all resources across all asset hosts. Or in other words spend quadruple for file storage since you're storing four copies of everything. I couldn't stand that because so many of the resources in MyTripScrapbook are uploaded by users and each save would take four times as long (or I'd have to spin off threads to do the duplicate writes). Be nice if you could use the same hash function to know which asset host to place the resource in advance. But there's a problem.

Rails' compute_asset_host method computes the hash on the source of the resource. My example above was too simplistic. Typically, this is what such a source looks like: /javascripts/prototype.js?2398238734. And, yes, every time the cache-busting timestamp is changed, you'd get a different hash result (well, 75% of the time). So you'd have to replicate all resources across all assets since you'd never know where it would be fetched from.

To fix this, the asset_host method also accepts a Proc argument. Here's what my method looks like in production.rb:

config.action_controller.asset_host = Proc.new do |source|
host = "http://asset%d.mytripscrapbook.com"
filename = File.basename(source).split('?').first
host % (filename.hash % 4)
end

All it does is to strip the path from the source of the resource and strip off any cache-busters on the end. Leaving the raw file name which is repeatable. Rails adds the cache-baster and the path stuff back on when it generates the asset tag anyway but at least we have it looking on a repeatable asset host.

Using this script, I ran through all the existing resources in the existing single S3 bucket and wrote them to the four buckets. Then I deleted the single bucket.

#!/usr/bin/env ruby

require File.dirname(__FILE__) + '/../config/boot'
PUBLICDIR = File.expand_path("public", RAILS_ROOT)
require 'aws/s3'
require 'yaml'

bucket = "asset%d.mytripscrapbook.com"

access_keys = YAML.load_file("#{RAILS_ROOT}/config/amazon_s3.yml")['production']
AWS::S3::Base.establish_connection! :access_key_id => access_keys['access_key_id'], :secret_access_key => access_keys['secret_access_key']
last_item_read = ''
while 1 == 1
assets = AWS::S3::Bucket.find("assets.mytripscrapbook.com", :marker=>last_item_read)
assets.objects.each do |item|

source = File.basename(item.key).split('?').first
host = bucket % (source.hash % 4 )
puts "uploading #{source} as #{item.key} into #{host}"
AWS::S3::S3Object.store item.key, item.value, host, {:access=>:public_read, :cache_control=>"max-age=252460800", :expires=>"Fri, 22 Apr 2016 02:13:54 GMT"}
last_item_read = item.key
break unless assets.is_truncated
end

I was too lazy to insert better variables for the cache-control and expiry headers. I had a couple thousand resources in production and the script took about an hour to run.

Within the application I made sure I used similar hash function to write new resources to the correct bucket. There's no magic to using modulus 4 to assign to 4 asset hosts. You could use 5 or 10 or whatever. I am ok relying on the rails core team selection of four as an optimal number.

I must admit it took a while for me to grok that S3 doesn't have folders or paths. Every resource is stored by a unique key string. The keys just happen to have slashes in them and look like paths. In retrospect I'd design my key names better for S3 than simply using the existing folders in my original file system implementation. But it works.

If ruby's implementation of the hash function changes (always on the cards, I guess) (or perhaps you switch over to JRuby at some point) then you'd need to re-hash the whole collection (or duplicate the original bit-shifting C function) but at least it would be a one-time cost.

One thing Chad mentions didn't work for me. S3 doesn't allow you to use wildcard DNS to associate 4 asset host names to a single bucket. You have to actually have 4 separate buckets set up.

I'm sure this approach only makes sense when you have a high cost for initial writes of the resources and synching methods don't seem practical. Disk storage really is cheap, even at S3. Most applications would be much better off replicating all resources across all asset hosts.

Was it worth it? MyTripScrapbook is a very graphics intensive application because we try to make the pages so pretty. Implementing multiple distributed asset hosts literally halved our page loading times. Average page went from 6 seconds to 3 seconds. There's lots of other performance tuning (really, lots) to do but this was a clear win and worth the effort.

Sunday, April 20, 2008

Moving to morph: postgres migration.

This is the first in a series of posts about moving from a perfectly good VPS (RapidVPS.com) to the morph exchange platform. We love being on morph but it was not trivial for us to make the move.

Bear in mind, we have a production application with some 200MB of data over 6 months of operation. Not big, by any means, but significant.

Before anything else, we had to move from mysql to postgres. Morph only uses postgres (for now) because it supports almost-real-time database backups to an S3 bucket. We love it because it removes our 10 minute transaction log backups and nightly complete dumps.

First install postgres 8.3 on the mac.
Then install phppgadmin for admin features because that's what morph gives you to interact with your production database.
Doing the data migration turned out to be a headache.
First, we did a mysqldump from production, FTP'd it to a dev machine, and inserted it into a clean copy of the application and database.
We set up a different (clean) mac with only postgres and rails so that we'd be sure to get exactly the right pieces installed.
Then we followed this recipe to get the data into postgres.
For some reason the rake db:schema_dump made some errors with foreign key columns (making them :string instead of :integer). So we had to edit those by hand.
Turns out our data had quite a few rows with embedded line breaks (\n and \r) and we had to edit these by hand.

Application implications for postgres.
Many times we inadvertently relied on mysql retuning rows from an ActiveRecord fetch in id order. Postgres will not. Need to add specific :order hashes to every fetch.
Mysql sorts on :boolean fields differently than postgres does. We wanted NULL values first, which mysql does. Postgres, of course, puts them last.
Beware of mysql-specific functions in your sql. For example Postgres does not support the CONCAT function. I don't know if it's an ANSI/SQL92 function, but postgres uses the "||" operator to concat string columns.

Migration. Getting the data into the morph db when the app is first initialized requires setting up the schema and then inserting all the current production rows. We tried using pgphpadmin's SQL dialogs to do this remotely but we failed each time. We like rails, so we did it in the migration. To do it, we dumped the 200 MB of SQL into 4 separate files at the rails root. Then in the migration we called them via:

%w(tp1.sql tp2.sql tp3.sql tp4.sql).each do |sql_file|
File.open(sql_file) do |file|

execute file.read
end
end

Postgres keeps its serial columns' next sequential ids in so-called sequential tables. The inserts performed in the migration will not have incremented these because we wanted to use the existing id values from the production db. However, for new rows to be inserted in those tables, you'll need to update the sequential tables. We used this in the migration after the bulk load. Notice how we exclude tables that do not use sequential ids and empty tables.

# Gather tables that use sequential ids.
set_id_tables =
ActiveRecord::Base.connection.tables.sort.reject! do |tbl|
['schema_info', 'bookings_users', 'friendships', 'journal_entries_users', 'roles_users', 'sessions', 'slide_desk_images'].include?(tbl)
end

# Set the sequence to the highest of inserted records.
set_id_tables.each do |tbl|
execute "select setval('#{tbl}_id_seq', (select max(id) from #{tbl}));"
end

I'll add references and links to this post over the next few days.

Thursday, April 10, 2008

morph - a new deployment platform

I met Guy Naor, the CTO and one of the founders of morph exchange, at the Acts_as_conference in Orlando in March 2008. He spoke very eloquently about his new deployment platform for rails (and other) applications. They basically provide Amazon EC2 instances with an automated deployment system. You can choose how many instances to run on a sliding scale of price. The lowest level is free and the next lowest level (running two mongrels, one each on two EC2 instances) is around $30 per month. Their price list runs out at 12 mongrels (on 12 instances) at around $340 per month. I'm sure they'll also support more than that.

This means you can upload a rails app and scale it to reasonably large deployments with literally no effort.

For startups (like mine) that's gold. We've been using a variety of server deployments for MyTripScrapbook, most recently a VPS with RapidVPS.com. We hosted out own apache2, mysql and mongrels on debian linux. We had no problem keeping 8 mongrels running in production with 512 MB RAM and this worked great with our Beta test customers and internal staff. However we knew that our first big customer would add 10,000 users in no time. 10,000 users who all wanted to use the application every day for sessions that are about an hour long.

Morph seemed like the best offering available. But it was a different direction than we'd been planning. The biggest change for us was the architecture change to replicated app servers. In later posts I'll describe some morph-specific changes we had to make, but for this post, I'll focus on moving all static assets to Amazon S3.

Our application is asset-heavy. We pack around 720KB of images on a typical page and another 170KB of javascript files - altogether around 45 resource requests. In order to optimize this we used Amazon's S3 service as a poor-man's Content Delivery Network. We split those static resources out over 4 S3 buckets for the bulk of the images plus another 2 buckets for javascript, css files, and the images used in css backgrounds. We wrote a hashing algorithm to determine which assets are stored and fetched from which asset host. This allows the typical browser to pipeline around 10 resource requests simultaneously. Amazon's network and availability also ensures very speedy delivery of our static content. Our pages typically load in half the time than before. Please see an earlier post for the rails-specific details.


Tuesday, April 1, 2008

TinyMCE and using Rails asset hosts.

Got a weird problem with running the TinyMCE editor from an asset host. It loads ok, but in IE (and IE only: 6 and 7) it fails to load the replaced contents of the textarea form field into its own rich text display. The rich text display is an iframe that it creates on the fly and fills by DOM manipulation.

I am almost certain what's going on that IE's XSS protection is different from FF, Safari, and Opera's and it's preventing the iframe from receiving the data from the javascript served from the asset host.

So I made the javascript tag for that js file only into a fully qualified URL. The rails asset tag won't try to fetch that from the asset host. It works but I don't like it.

Saturday, March 15, 2008

Calling helpers in the rails console.

Trying to debug some asset tag helper issues and I found this...

Did you know you can call helpers directly in the rails console? Just use the helper object...

>> helper.link_to "this", "that"
=> <a href="that">this</a>


You can also call custom helpers (from your app), but I haven't tried it:

>> helper :my_custom_helper

Wednesday, March 12, 2008

UTF-8 and hpricot

I needed to take the text tagged in an XML document and make url strings out of it. Used hpricot to parse the XML. Like this:

doc = Hpricot.XML(itinerary_day_description)

and then used xpath to find the text within the <cite> tags that will form the basis of the url I need:

activities = doc/("cite")
activities.each do |activity|
title = activity.innerText
link = "<a href=\" '/redboxes/activity/#{title}\"
etc - you get the idea...

Hit the problem when the text contained non-ASCII UTF-8 characters (ñ, é, etc).

Hpricot conveniently converted them to HTML entities. And then innerText converted them into a meaningless character.

Not only does hpricot perform the HTML entity encoding in the initial XML document, but it performs it again every time the XML document gets processed.

Here's what I had to do to make this work.
  1. Use innerHTML instead of innerText. It preserves the HTML entity encoding that innerText didn't.
  2. Use the awesome HTMLEntities module from Paul Battley. I simply converted the title from an HTML entity back to native UTF-8 characters.
  3. Use CGI.escape for URL encoding.
So the final code snippet looks like:

doc = Hpricot.XML(itinerary_day_description)
coder = HTMLEntities.new
activities = doc/("cite")
activities.each do |activity|
title = coder.decode(activity.innerHTML)
link = "<a href=\" '/redboxes/activity/#{CGI.escape(title)}\" etc

Friday, February 29, 2008

you are being redirected

Just built a sitemap generator for MyTripScrapbook.com, a fun exercise to let search engines discover travelers' journals that are not able to be crawled. They can't be crawled because journals are dynamically displayed and there are no links to them - an interested friend needs to know the screen name of their traveler or enter the traveler's name in a search box. If the friend enters a url directly (like MyTripScrapbook.com/JaneDoe) for a traveler that does not exist, we simply redirect to the Home page where the search box lives.

Building the sitemap was fun because it will get large quickly (each traveler has around 10 pages of journal to be indexed, so with 100,000 travelers there's a million entries in the sitemap file, although we are not at that size yet) and I don't want mongrel timing out on generating a lengthy XML file. But that's not the purpose of this post.

It wasn't until I submitted the sitemap to Google that I discovered a real problem. Google refused to accept the site validation of uploading a coded file to the web server root because my rails app was returning a soft 404. Remember I said we redirect unfulfillable requests to the Home page? Well, the status in the http header was being set to 302 (moved, found), which is the rails default for a redirect_to instead of a 404 (not found). Google rightly refused to validate the coded file because the site never returned a 404 error.

The answer seemed simple: add a :status => 404 to the redirect_to statement. But no, this resulted in rails generating the strange "you are being redirected" page. Yes, that's the only content of the page and only when the user clicks on the link will they be taken to the redirected url. I don't understand why this is, but it is certainly not the desired user experience.

The answer was to drop the redirect_to and use plain old render :action=>"index", :status => 404 instead.

In a few months when the traveler numbers swell with the summer travel season, I'll need to partition the journal space using a siteindex file which, in turn, will point to multiple sitemaps, each of which cannot contain more than 50,000 entries.

Friday, February 8, 2008

Evan Phoenix at acts_as_conference

We're sitting here at acts_as_conference getting our learnin' on. The star of the show so far is Evan Phoenix. He actually used "je ne sais quoi" in his presentation in reference to code refactoring. We've collectively decided that Evan is the Anti-DHH. What a great guy. His entire speech was built around creating inclusiveness in Rubinius. Smart, and fun. Makes us want to commit some patches.

Friday, January 11, 2008

RubyWorks and Gem Sources

On one project, we used the awesome RubyWorks packages to install the production environment on our RedHat EL 5 servers. Then one day when upgrading some gems, we got the error "Could not find gemnamehere (> 0) in any repository". It turns out that those RubyWorks kids thoughtfully removed the original RubyForge gem source from your gem sources list. To see if this has happened to you type:

gem sources

If there is no value for gems.RubyForge.org in there, you might want to change it back. It worked like this for us:

sudo gem sources -r http://rubyworks.rubyforge.org/redhat/GEMS/i386
sudo gem sources -a http://gems.rubyforge.org

The best part is that we figured this out by reading the amazingly well documented gem source code under /usr/lib/ruby/1.8/rubygems/commands

Hope this helps someone else who can't update their gems!

Thursday, January 3, 2008

PDF/Writer

Decided to use PDF/Writer to generate a chunky report. Before I say how much I enjoyed using it, let me list its shortcomings:
  • No UTF-8 support yet. So those European characters are garbled.
  • No built-in support for text wrapping images or widow/orphan control.
  • No explicit support for headers, footers, TOC, and other common document elements.
  • No support for HTML tags, CSS-like layout, or any layout, really.
  • It's slow.
But it's ruby so there are some nice usage of blocks and other rubyisms that made it the best choice for me. I thought I'd include a couple chunks of code to illustrate how you can use PDF/Writer. I'm sure there are better ways but I couldn't find any code examples outside the demos on the site. So hoping this helps someone...

Here's my main page footer that is three lines of text below a horizontal rule including date, page numbering and some stuff:

pdf.start_page_numbering pdf.absolute_right_margin, 44, 9, :right, "Page of ."
pdf.open_object do |footer|

pdf.save_state
center = pdf.margin_x_middle
t1 = "#{tour_code.name} (#{tour_code.identifier})"
w1 = pdf.text_width(t1, 10) / 2.0
t2 = "Printed: #{Time.now.to_date}"
w2 = pdf.text_width(t2, 9)
t3 = "Copyright for brochure text and overview map belongs to #{tour_code.tour_operator.name}."
w3 = pdf.text_width(t3, 9) / 2.0
t4 = "Copyright for photos belongs to the respective owners. All else Copyright by MyTripScrapbook.com."
w4 = pdf.text_width(t4, 9) / 2.0
pdf.add_text_wrap(x-w1, 44, w1*2, t1, 10)
pdf.add_text_wrap(pdf.absolute_left_margin, 44, w2, t2, 9)
pdf.add_text_wrap(x-w3, 32, w3*3, t3, 9, :left)
pdf.add_text_wrap(x-w4, 22, w4*2, t4, 9)
pdf.line(pdf.absolute_left_margin, 60, pdf.absolute_right_margin , 60).stroke
pdf.restore_state
pdf.close_object
pdf.add_object(footer, :all_pages)
end


You'll see the single line that sets up the page numbering string pattern. Then the block that creates an arbitrary chunk of repeating text - in this case a footer. pdf is the pdf object that had been created earlier. Nothing remarkable about this except to notice the use of some dimensions using method calls (pdf.absolute_left_margin) and some absolute coordinates.

PDF/Writer allows you to place text or images in two ways: automatically at the current insertion point allowing line wrap to happen, or at a specified point on the page. In order to create a text-wrap effect around an image (in this case a map), I temporarily expanded the left margin by the width of the map, inserted the text at the current insertion point allowing line wrap to happen, and inserted the image into the empty wide left margin. Then checked to see which of the image and the text column was the longest, and advanced the insertion point to there plus a bit of padding.

marker = pdf.y
overview = tour_code.tour_code_overview
# Since we're using the thumb (200px wide) version of the map, we need to adjust the dimensions from the full size.
overview.height = overview.height*200/overview.width
overview.width = 200
pdf.left_margin += overview.width + 40
description = remove_pesky_tags(overview.description)
pdf.text description, :justification=>:full, :font_size=>10
text_height = (marker - pdf.y)
pdf.add_image_from_file RAILS_ROOT + overview.public_filename(:t), 72, marker - overview.height - 10
if overview.height > text_height
pdf.move_pointer(overview.height - text_height + 20)
end
pdf.left_margin += -(overview.width + 40)


Prior to starting a new heading, I added a primitive widow control. Fortunately I already knew the height of the text block (90 units):

if (pdf.y - 90 <>
pdf.start_new_page(true)
end


PDF/Writer also includes a handy way to measure the height of a text block without actually printing it. I'll use that for cases when I don't know the height in advance.

For a final example, it was useful to display two images side by side without having to explicitly measure each one and use coordinate positioning. So I used two columns. Here I print a text block, start 2 column mode, print the first image, start new column and print the second, then go back to single column mode for the final text block.

pdf.text activity.title + blank_line, :justification=>:center, :font_size=>12
pdf.start_columns 2, 2 # Start 2 columns.
pdf.image RAILS_ROOT+activity.public_filename(1, 300), :pad=>0, :justification=>:center
pdf.start_new_page # Starts next column.
pdf.image RAILS_ROOT+activity.public_filename(2, 300), :pad=>0, :justification=>:center if activity.image2_width
pdf.stop_columns # Back to single column.
pdf.text remove_pesky_tags(activity.description) + blank_line, :justification=>:full, :font_size=>10


You'll notice the remove_pesky_tags method which helps deal with (only some of) the issues of HTML tags and HTML entities.

def self.remove_pesky_tags(text)
new_text = text.gsub(/<\/p>

/,"\n\n")
new_text.gsub!(/

/,"")
new_text.gsub!(/<\/p>/,"")
new_text.gsub!(/"/,"'")
new_text.gsub!(//,"")
new_text.gsub!(/<\/cite>/,"
")
new_text.gsub!(/ /,"")
new_text.gsub!(/á/,"a")
new_text.gsub!(/é/,"e")
new_text.gsub!(/
/,"")

new_text.gsub!(//,"")
new_text.gsub!(/<\/sup>/,"")
new_text
end

The was only one gotcha for me: images are placed using coordinates relative to their bottom left corner, not their top left corner as you might expect. Also text_width returned inconsistent results for me.

With lots of reporting to be done, this was only the first day's taste of using PDF/Writer. I'm sure there'll be lots more helper methods to write and subtleties to understand. High on the list is figuring a way to use the QuickRef class and the techbook application. QuickRef provides a DSL for working with brochures. Techbook implements a markup language suitable for manual-like documents that goes a long way towards widow/orphan control, text tags, and other useful features.

Many thanks to Austin Ziegler, the original author, and the current band of workers who are working to update this terrific library.