Friday, February 29, 2008

you are being redirected

Just built a sitemap generator for MyTripScrapbook.com, a fun exercise that lets search engines discover travelers' journals that otherwise cannot be crawled. They can't be crawled because journals are displayed dynamically and nothing links to them: an interested friend needs to know the traveler's screen name or enter the traveler's name in a search box. If the friend enters a URL directly (like MyTripScrapbook.com/JaneDoe) for a traveler that does not exist, we simply redirect to the Home page where the search box lives.

Building the sitemap was fun because it will get large quickly (each traveler has around 10 journal pages to be indexed, so with 100,000 travelers there are a million entries in the sitemap file, although we are not at that size yet) and I don't want Mongrel timing out while generating a lengthy XML file. But that's not the purpose of this post.

It wasn't until I submitted the sitemap to Google that I discovered a real problem. Google refused to verify the site by the usual method of uploading a coded file to the web server root, because my Rails app was returning a soft 404. Remember I said we redirect unfulfillable requests to the Home page? Well, the status in the HTTP header was being set to 302 (Found, a temporary redirect), which is the Rails default for a redirect_to, instead of 404 (Not Found). Google rightly refused to validate the coded file because the site never returned a 404 error.

The answer seemed simple: add a :status => 404 to the redirect_to statement. But no, this resulted in Rails generating the strange "you are being redirected" page. Yes, that's the only content of the page, and only when the user clicks on the link will they be taken to the redirected URL. As far as I can tell, this is because browsers only follow the Location header automatically for 3xx statuses, so with a 404 they just display the stub body that Rails attaches to every redirect. Either way, it is certainly not the desired user experience.

The answer was to drop the redirect_to and use plain old render :action => "index", :status => 404 instead, so the Home page is served directly with a 404 status.
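To make the difference concrete, here is a minimal plain-Ruby sketch of what ends up on the wire after the fix. It is not the app's actual controller: TRAVELERS, HOME_PAGE_HTML and journal_response are hypothetical names, and the response is modeled as a Rack-style [status, headers, body] triple rather than produced by Rails.

```ruby
# Hypothetical stand-ins for the app's data; not taken from the real site.
TRAVELERS = { "JaneDoe" => "<html><body>Jane's journal</body></html>" }.freeze
HOME_PAGE_HTML = "<html><body>Find a traveler: [search box]</body></html>".freeze

# Returns a Rack-style [status, headers, body] triple for a screen name.
def journal_response(screen_name)
  journal = TRAVELERS[screen_name]
  if journal
    [200, { "Content-Type" => "text/html" }, [journal]]
  else
    # Same body as the Home page, but with an honest 404 and no Location
    # header: visitors still get the search box, crawlers see "not found".
    [404, { "Content-Type" => "text/html" }, [HOME_PAGE_HTML]]
  end
end

status, headers, _body = journal_response("NoSuchTraveler")
puts status                    # prints 404
puts headers.key?("Location")  # prints false
```

The point of the sketch is that the visitor sees the same Home page either way; only the status line changes, and that is the part Google checks.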

In a few months, when the traveler numbers swell with the summer travel season, I'll need to partition the journal space using a sitemap index file which, in turn, will point to multiple sitemaps, each of which cannot contain more than 50,000 entries.
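The partitioning could look something like the sketch below. This is only an illustration under assumptions: the helper names, the sitemapN.xml file naming and the sample URLs are made up, and the XML is built with plain strings rather than whatever the real generator uses.

```ruby
MAX_URLS_PER_SITEMAP = 50_000 # per-file limit from the sitemap protocol

XML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;", "'" => "&apos;"
}.freeze

# Escapes the five characters that are special in XML text.
def xml_escape(text)
  text.gsub(/[&<>"']/, XML_ESCAPES)
end

# Splits urls into sitemap documents of at most `limit` entries each.
def build_sitemaps(urls, limit = MAX_URLS_PER_SITEMAP)
  urls.each_slice(limit).map do |chunk|
    entries = chunk.map { |url| "  <url><loc>#{xml_escape(url)}</loc></url>" }
    ['<?xml version="1.0" encoding="UTF-8"?>',
     '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
     *entries,
     "</urlset>"].join("\n")
  end
end

# Builds the index document that points at each sitemap file.
# The sitemapN.xml naming is an assumption made for this example.
def build_sitemap_index(base_url, count)
  entries = (1..count).map do |i|
    loc = xml_escape("#{base_url}/sitemap#{i}.xml")
    "  <sitemap><loc>#{loc}</loc></sitemap>"
  end
  ['<?xml version="1.0" encoding="UTF-8"?>',
   '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
   *entries,
   "</sitemapindex>"].join("\n")
end

# Small demo with a limit of 3 so the split is easy to see.
urls = (1..7).map { |n| "http://MyTripScrapbook.com/Traveler#{n}" }
sitemaps = build_sitemaps(urls, 3)
puts sitemaps.size # prints 3
puts build_sitemap_index("http://MyTripScrapbook.com", sitemaps.size)
```

With the default limit, a million journal URLs would come out as 20 sitemap files plus one index.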
