Wednesday, March 12, 2008

UTF-8 and hpricot

I needed to take the text tagged in an XML document and build URL strings from it. I used Hpricot to parse the XML, like this:

doc = Hpricot.XML(itinerary_day_description)

and then used XPath to find the <cite> tags whose text forms the basis of the URL I need:

activities = doc/("cite")
activities.each do |activity|
  title = activity.innerText
  link = "<a href=\"/redboxes/activity/#{title}\""
  # etc - you get the idea...
end

I hit a problem when the text contained non-ASCII UTF-8 characters (ñ, é, etc.).

Hpricot conveniently converted them to HTML entities, and innerText then turned those entities into a meaningless character.

Not only does Hpricot apply the HTML entity encoding to the initial XML document, it applies it again every time the document gets processed.

Here's what I had to do to make this work.
  1. Use innerHTML instead of innerText. It preserves the HTML entity encoding that innerText didn't.
  2. Use the awesome HTMLEntities library (the htmlentities gem) from Paul Battley. I simply decoded the title from HTML entities back to native UTF-8 characters.
  3. Use CGI.escape for URL encoding.
So the final code snippet looks like:

require 'hpricot'
require 'htmlentities'
require 'cgi'

doc = Hpricot.XML(itinerary_day_description)
coder = HTMLEntities.new
activities = doc/("cite")
activities.each do |activity|
  title = coder.decode(activity.innerHTML)
  link = "<a href=\"/redboxes/activity/#{CGI.escape(title)}\"" # etc
end
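If you just want to see what steps 2 and 3 do without Hpricot or the htmlentities gem installed, here is a rough sketch using only Ruby's standard library. It rests on a few assumptions: the helper name activity_href and the sample titles are made up for illustration, and the entities are assumed to be numeric (like &#241;). CGI.unescapeHTML only decodes numeric entities plus a handful of named ones (&amp;, &lt;, &gt;, &quot;), so it is not a full replacement for HTMLEntities.

```ruby
require 'cgi'

# Hypothetical helper (not part of the original code): takes an
# entity-encoded title and returns a URL-safe path for the href.
def activity_href(encoded_title)
  # Step 2: decode numeric entities such as &#241; back to UTF-8 characters.
  title = CGI.unescapeHTML(encoded_title)
  # Step 3: percent-encode the UTF-8 bytes (CGI.escape turns spaces into "+").
  "/redboxes/activity/#{CGI.escape(title)}"
end

puts activity_href("Pe&#241;a Blanca")  # prints /redboxes/activity/Pe%C3%B1a+Blanca
```

The order matters: decode the entities first, then URL-encode. Otherwise the "&", "#" and ";" of the entity itself get percent-encoded and end up in the link.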
