Monday, May 28, 2007

hpricot and .NET sites

Need to scrape tour operator sites to extract itinerary information. Some of them provide a Web service, but since they're mostly SOAP, I'd rather just scrape the site to get the data. That way, we'll have a similar tool to use for all tour operators.

Using _why's hpricot tool for the scraping. Had great success with it scraping Blogger sites in the past. However, I got this error message on opening the URL:
Hpricot::ParseError (ran out of buffer space on element <input>...)
Tried other pages on the same site, and pages on other sites, just to make sure I hadn't messed up the open call. No problems. Glanced at the page source in Firefox's source display - no evident errors.

First thought was to run the tour operator page through a validator to see if there were missing tag closures or weird tags. Found lots of errors (56), but none seemed likely to cause a buffer overrun on initial parsing. So I pasted the page into TextMate and started removing the errors one at a time to identify the culprit.
Whoa! That hidden input tag used by .NET to track state - viewstate - is huge! No wonder it blows the attribute buffer. It wasn't evident in the Firefox view without word wrap.
Sure enough, _why has provided a way to increase the buffer size if you run into .NET pages like this. Simply set it before you try to open such a URL:

require 'hpricot'
require 'open-uri'

Hpricot.buffer_size = 262144
doc = Hpricot(open(""))
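If you want to see how big that viewstate value really is before picking a buffer size, here is a rough sketch in plain Ruby. The helper name longest_attribute_value and the sample markup are made up for illustration, and the regex is only a crude approximation of attribute parsing, not what hpricot does internally. I'm assuming the buffer just needs to be comfortably larger than the biggest attribute value on the page.

```ruby
# Hypothetical helper (not part of hpricot): find the longest quoted
# attribute value in an HTML string and return [attribute_name, byte_size].
# Returns nil when no quoted attributes are found.
def longest_attribute_value(html)
  # Rough match for name="value" or name='value'; not a real HTML parser.
  pairs = html.scan(/([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/)
  pairs.map { |name, dq, sq| [name, (dq || sq).bytesize] }
       .max_by { |_, size| size }
end

# Made-up sample standing in for a .NET page with a large hidden field.
sample = %(<input type="hidden" name="__VIEWSTATE" value="#{'A' * 20_000}" />)

name, size = longest_attribute_value(sample)
puts "largest attribute: #{name} (#{size} bytes)"
```

Run it against the saved page source instead of the sample string, and set the buffer size to something well above the number it reports.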

Thanks _why!


Sam said...

You ripper! Exactly what I needed!

danbaatar said...

When I try this

Hpricot.buffer_size = 262144

I get an error:

undefined method `buffer_size=' for Hpricot:Module

Am I doing something wrong?

Jim said...

Dan, stupid question from me: you have a

require 'hpricot'

before attempting to change the buffer size, right? Jim.

coderrr said...

this patch fixes hpricot to dynamically allocate more memory as needed, so you never get these errors...