However, when Hpricot pulled the text using element.inner_text, I kept getting Ch?teau as the result. My guess (pure speculation) is that Hpricot converts the text to some other character set and fails on the accented character. Testing in irb, I set $KCODE='u' so the console session could display Unicode characters (in fact I tried every setting for $KCODE), but it kept producing Ch?teau.
The solution was to use inner_html instead. It returns Ch&acirc;teau, which I can deal with.
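"Dealing with it" here means turning the entity back into the character it stands for. The snippet below is only a minimal sketch of that step using the standard library: decode_entities and NAMED_ENTITIES are hypothetical names made up for illustration, and the table is deliberately tiny (a real decoder knows the full HTML entity set).

```ruby
# Hypothetical, deliberately tiny entity table (name => Unicode code point).
# This is an illustration only; it is not part of Hpricot or any library.
NAMED_ENTITIES = {
  "acirc"  => 0x00E2, # a with circumflex
  "eacute" => 0x00E9, # e with acute accent
  "ndash"  => 0x2013, # en dash
  "amp"    => 0x0026  # ampersand
}.freeze

# Replace named (&acirc;), decimal (&#226;) and hex (&#xE2;) references
# with the UTF-8 character they stand for. Unknown names are left untouched.
def decode_entities(text)
  text.gsub(/&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));/) do |match|
    hex, dec, name = $1, $2, $3
    codepoint =
      if hex
        hex.to_i(16)
      elsif dec
        dec.to_i
      else
        NAMED_ENTITIES[name]
      end
    # pack("U") encodes a code point as a UTF-8 string.
    codepoint ? [codepoint].pack("U") : match
  end
end
```

Calling decode_entities on the string returned by inner_html (for example "Ch&acirc;teau") gives back the accented text, provided the entity is in the table.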
Update: Hi Andrew,
I don't know if Hpricot has changed since, but this is how I deal with this class of problems now: I use a nice plugin called HTMLEntities (Google is your friend).
# Tauck tours.
# Note the biggest problem is that Tauck formats their title line differently for different tours.
# Sometimes it's "Day 2 &endash; Historic Rouen", other times "Day 1: Welcome to Rome".
# Not worth the brain bruising of a regex to fix it. Just plan on modifying source when we work on Tauck.
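def self.scrape_4(url)
  logger = Logger.new("log/development.log")
  coder = HTMLEntities.new
  doc = Hpricot(open(url))
  # Grab the itinerary pane; its HTML still contains entities such as &acirc;.
  encoded_mess = doc/("#paneitinerary")
  # Decode the entities into real characters, then re-parse the cleaned HTML.
  itinerary_portion = Hpricot(coder.decode(encoded_mess.to_html))
  itin = Array.new
  day = 1
  itinerary_portion.search("span.days").each do |item|
    etc...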