Thursday, July 26, 2007

Inner_text and HTML entities

Again using the wonderful Hpricot to parse another tour operator Web site. Part of the page text contained Ch&acirc;teau, which of course the browser rendered as Château. The page declared Latin-1 character encoding, but by using HTML entities the authors made sure the browser did the right thing.

However, when Hpricot was pulling the text using element.inner_text, I kept getting Ch?teau as the result. I guess Hpricot is converting to some character set (pure speculation) and failing. Testing in irb, I made sure I'd set $KCODE='u' so the console session could display Unicode characters (actually I tried every setting for $KCODE), but it kept producing Ch?teau.
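To make the speculation above concrete, here is a minimal sketch of how a lossy character set conversion can turn â into a question mark. It is only an illustration, not Hpricot's actual code, and it assumes a Ruby with String#encode (1.9 or later, so no $KCODE). The helper name convert_lossy is made up for the example.

```ruby
# Illustration only: this is not Hpricot's code, and it assumes a Ruby
# with String#encode (1.9 or later).

# Hypothetical helper: convert text to a target charset, replacing any
# character the target cannot represent.
def convert_lossy(text, charset)
  text.encode(charset, undef: :replace)
end

decoded = "Ch\u00E2teau" # what Ch&acirc;teau decodes to, as UTF-8

# US-ASCII has no "â", so the character is replaced with "?".
puts convert_lossy(decoded, "US-ASCII")

# Latin-1 does have "â", so the same call keeps it as the single byte 0xE2.
puts convert_lossy(decoded, "ISO-8859-1").bytes.map { |b| format("%02X", b) }.join(" ")
```

The first call prints Ch?teau, which matches what I was seeing. Whether that is what actually happens inside inner_text is still a guess.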

The solution was to use inner_html instead. It returns Ch&acirc;teau, which I can deal with.
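As a rough sketch of what "dealing with it" can look like once you have the raw markup: decode the entities yourself. The code below is a tiny hand-rolled stand-in, not a real library. The helper name decode_entities and the entity table are assumptions made for the example, and the table only covers a handful of entities. It also assumes Ruby 1.9 or later.

```ruby
# Minimal stand-in for an entity decoder. The table is deliberately tiny
# (an assumption for this example); a real decoder knows far more names.
NAMED_ENTITIES = {
  "acirc" => "\u00E2",
  "nbsp"  => "\u00A0",
  "mdash" => "\u2014",
  "pound" => "\u00A3",
  "acute" => "\u00B4",
  "amp"   => "&",
  "lt"    => "<",
  "gt"    => ">",
  "quot"  => "\""
}.freeze

# decode_entities is a hypothetical helper name, not part of Hpricot.
def decode_entities(html)
  html.gsub(/&(#[xX][0-9a-fA-F]+|#\d+|\w+);/) do |match|
    ref = Regexp.last_match(1)
    if ref.start_with?("#x", "#X")
      [ref[2..-1].to_i(16)].pack("U")   # hexadecimal reference, e.g. &#xE2;
    elsif ref.start_with?("#")
      [ref[1..-1].to_i].pack("U")       # decimal reference, e.g. &#226;
    else
      NAMED_ENTITIES.fetch(ref, match)  # unknown names are left untouched
    end
  end
end

puts decode_entities("Ch&acirc;teau")
```

Unknown names pass through unchanged, so a bad entity in the source stays visible instead of silently disappearing.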

Update: Hi Andrew,

I don't know if Hpricot has changed since, but this is how I deal with this class of problem now. I use a nice plugin called HTMLEntities (Google is your friend).

# Tauck tours.
# Note the biggest problem is that Tauck formats their title line differently for different tours.
# Sometimes it's "Day 2 &endash; Historic Rouen", other times "Day 1: Welcome to Rome".
# Not worth the brain bruising of a regex to fix it. Just plan on modifying source when we work on Tauck.
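def self.scrape_4(url)
  logger = Logger.new("log/development.log")
  coder = HTMLEntities.new
  doc = Hpricot(open(url))
  # Grab the itinerary pane, still full of HTML entities.
  encoded_mess = doc/("#paneitinerary")
  # Decode the entities first, then re-parse the cleaned-up markup.
  itinerary_portion = Hpricot(coder.decode(encoded_mess.to_html))
  itin =   # (right-hand side missing in the original)
  day = 1
  itinerary_portion.search("span.days").each do |item|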


Anonymous said...

You are right and this has changed since Hpricot 0.5 where inner_text used to return things like  , —, £, ´. I used to gsub them but now all I get is "?" which is no help at all!

Anonymous said...

By which I mean the unconverted tags, not the ones you can see above!
So nbsp, mdash, pound, acute, all bounded by their & and ;.

Andrew Grimm said...

Has there been any progress with this?