doc = Hpricot.XML(itinerary_day_description)
and then used XPath to find the text within the <cite> tags that will form the basis of the URL I need:
activities = doc/("cite")
activities.each do |activity|
  title = activity.innerText
  link = "<a href=\"/redboxes/activity/#{title}\"" # etc - you get the idea...
end
I hit a problem when the text contained non-ASCII UTF-8 characters (ñ, é, etc.).
Hpricot conveniently converted them to HTML entities, and innerText then turned each entity into a meaningless character.
Not only does Hpricot perform the HTML entity encoding on the initial XML document, it performs it again every time the document gets processed.
Here's what I had to do to make this work.
- Use innerHTML instead of innerText. It preserves the HTML entity encoding that innerText didn't.
- Use the awesome HTMLEntities library from Paul Battley. I used it to decode the HTML entities in the title back to native UTF-8 characters.
- Use CGI.escape for URL encoding.
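require 'hpricot'
require 'htmlentities'
require 'cgi'
doc = Hpricot.XML(itinerary_day_description)
coder = HTMLEntities.new
activities = doc/("cite")
activities.each do |activity|
  # innerHTML keeps the entities intact; decode them back to UTF-8
  title = coder.decode(activity.innerHTML)
  link = "<a href=\"/redboxes/activity/#{CGI.escape(title)}\"" # etc
end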
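To see what the decode and escape steps do to a single title without Hpricot in the picture, here is a small sketch that uses only the standard library. The sample title is made up, and CGI.unescapeHTML stands in for HTMLEntities#decode purely for illustration: it handles numeric entities like &#241; but only a handful of named ones, so it is not a replacement for the HTMLEntities library.

```ruby
require 'cgi'

# Hypothetical sample title, with "ñ" written as a numeric HTML entity.
encoded_title = "Vi&#241;a del Mar"

# Stand-in for HTMLEntities#decode (assumption: numeric entities only).
title = CGI.unescapeHTML(encoded_title)

# Escaping the decoded title percent-encodes the UTF-8 bytes of "ñ".
good_path = "/redboxes/activity/#{CGI.escape(title)}"

# Escaping the still-encoded title percent-encodes the entity itself,
# which produces a URL for a different string.
bad_path = "/redboxes/activity/#{CGI.escape(encoded_title)}"

puts good_path # /redboxes/activity/Vi%C3%B1a+del+Mar
puts bad_path  # /redboxes/activity/Vi%26%23241%3Ba+del+Mar
```

The order matters: decode first, then escape. Escaping the entity form gives a link that points at the literal text "&#241;" rather than at "ñ".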
1 comment:
This really helped me, I had a similar problem. Thanks a lot. It's a real timesaver. :)