Friday, June 8, 2007

HTML Screen Scraping With Groovy

Recently, I wrote a script to scrape UV Index data from the Tropospheric Emission Monitoring Internet Service (TEMIS) and then upload that data to Google Calendar as "Web Content". Here's how I went about scraping the data.

I used Groovy's XmlSlurper to do the heavy lifting. XmlSlurper uses SAX underneath and, importantly, lets you supply a different SAX parser. As I wanted to parse HTML and not XML, I used TagSoup as my SAX parser.


slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
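To see what TagSoup buys you, here is a minimal sketch that parses a deliberately sloppy HTML fragment from a string rather than from a URL. The fragment and its values are made up for illustration, and the @Grab coordinates assume TagSoup 1.2.1 from Maven Central (the first run needs network access to fetch it). The import is for Groovy 3 and later; on older versions XmlSlurper lives in groovy.util and needs no import.

```groovy
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper   // Groovy 3+; drop this import on Groovy 2.x

// Hypothetical sloppy HTML: no <html> or <body>, and unclosed <td> and <tr> tags
def sloppy = '<table><tr><td>8 Jun 2007<td>6.1</table>'

def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
def html = slurper.parseText(sloppy)

// TagSoup supplies the missing html and body elements, so the usual path works
def cells = html.body.table.tr.td.list()*.text()
println cells
```

A plain XmlSlurper would reject this input as malformed XML, which is the reason for swapping the parser.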


Every time I run the script (with cron, overnight), I want to use the latest data, so I created a URL object that holds the URL for the UV Index data for London, and then used Groovy's "withReader" enhancement to read and parse the data:


url = new URL("http://blah/blah?blah")

url.withReader { reader ->
    html = slurper.parse(reader)

    // we should now have a parsed document

    // ... scraping code ...
}


As the data I was looking for was conveniently located in a table, all I had to do was find the path to the table (Firebug comes in really handy here):


tbl = html.body.table.tr.td.dl.dd.table
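A hard-coded path like this breaks as soon as the page layout shifts. As an alternative (not what my script does), you can search the whole tree with depthFirst(). The sketch below runs on a made-up, well-formed sample, so it uses a plain XmlSlurper; the id attribute is hypothetical and the real page may offer nothing so convenient to match on.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; drop this import on Groovy 2.x

// Hypothetical markup: a data table nested inside a layout table
def sample = '''<html><body>
  <table><tr><td>
    <table id="data"><tr><td>8 Jun 2007</td><td>6.1</td></tr></table>
  </td></tr></table>
</body></html>'''

def html = new XmlSlurper().parseText(sample)

// Walk every node and keep the first table carrying the (hypothetical) id
def tbl = html.depthFirst().find { node ->
    node.name() == 'table' && node.@id.text() == 'data'
}

println tbl.tr.td[1].text()
```

The trade-off is that the search depends on some stable feature of the target table, where the explicit path depends on the whole chain of ancestors.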


That gives us the table, and we can use a closure to iterate over its rows:



tbl.tr.list().each { row ->
    // ... row parsing code ...
}



Each row has a td list, so any particular cell of a row can be accessed as row.td[X]. To get a cell's contents as a String, you'll need to call toString() (so, to get the data of the first cell as a String, it's row.td[0].toString()).

I came across an interesting issue with the trim function when I was trying to parse the first column into a DateTime (using the Joda Time library). There were some non-breaking spaces in the String, and trim doesn't remove non-breaking spaces, so I had to run a quick regular expression on the String to get rid of them:
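As a small sketch of cell access, the snippet below parses a single made-up row (the values are invented, and the sample is well-formed, so a plain XmlSlurper is enough) and reads its cells by index. It uses text(), which returns the same cell contents as a String.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; drop this import on Groovy 2.x

// Hypothetical row: date, UV index, ozone column
def row = new XmlSlurper().parseText(
    '<tr><td>8 Jun 2007</td><td>6.1</td><td>330.5</td></tr>')

def date  = row.td[0].text()            // first cell as a String
def uvi   = row.td[1].text().toFloat()  // second cell converted to a Float
def count = row.td.size()               // number of cells in the row

println "${date}: UV index ${uvi} (${count} cells)"
```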


ds = row.td[0].toString().replaceAll(/\xA0/, "")
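The behaviour is easy to reproduce without any HTML at all. The sketch below uses a made-up date string padded with non-breaking spaces (U+00A0): trim() only strips characters up to U+0020, so it leaves them in place, while the replacement removes them.

```groovy
// Hypothetical cell text with a non-breaking space on each side
def raw = '\u00A08 Jun 2007\u00A0'

// trim() leaves U+00A0 alone, so nothing is removed here
def trimmed = raw.trim()

// Strip the non-breaking spaces first, then trim ordinary whitespace
def cleaned = raw.replaceAll('\u00A0', '').trim()

println "raw=${raw.length()} trimmed=${trimmed.length()} cleaned=${cleaned.length()}"
```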


So, putting it all together (though without the Google API code to do the uploading):



slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())

url = new URL('http://www.temis.nl/uvradiation/nrt/uvindex.php?lon=-0.07&lat=51.30')

url.withReader { reader ->
    html = slurper.parse(reader)

    // path to the data table, found with Firebug
    tbl = html.body.table.tr.td.dl.dd.table

    tbl.tr.list().each { row ->
        // only data rows have three cells (date, UV index, ozone column)
        if (row.td.size() == 3) {
            // trim doesn't work on a non-breaking space
            ds = row.td[0].toString().replaceAll(/\xA0/, "").trim()
            uvi = row.td[1].toString().toFloat()

            // now do something with the date and UV index
        }
    }
}



There is one thing to note here: the script does a quick check to make sure that there are 3 columns (date, UV index and ozone column). This is because there will be an extra row at the start of the table that contains the city name, if TEMIS knows the city name for a set of co-ordinates.


I ran into a couple of niggles with the Google side of things, but that's probably best left to a different post.
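To illustrate the column check in isolation, here is a sketch against a made-up table. The single-cell city row and all the values are assumptions about what such a table might look like, not a copy of the TEMIS page, and the sample is well-formed so a plain XmlSlurper will do.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; drop this import on Groovy 2.x

// Hypothetical table: one city-name row followed by two data rows
def tbl = new XmlSlurper().parseText('''<table>
  <tr><td>London</td></tr>
  <tr><td>8 Jun 2007</td><td>6.1</td><td>330.5</td></tr>
  <tr><td>9 Jun 2007</td><td>5.8</td><td>335.0</td></tr>
</table>''')

// Keep only rows with exactly three cells, which drops the city-name row
def dataRows = tbl.tr.list().findAll { row -> row.td.size() == 3 }

// Turn each remaining row into a small map of date and UV index
def readings = dataRows.collect { row ->
    [date: row.td[0].text(), uvi: row.td[1].text().toFloat()]
}

readings.each { println "${it.date}: ${it.uvi}" }
```

Filtering on the cell count keeps the parsing code free of special cases for whatever the extra row happens to contain.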

8 comments:

MAHESH said...

Can you please describe in detail how you wrote the code?

I also want to parse the contents of an HTML table.

Robert O'Connor said...

To make your code easier to read, and to add syntax highlighting, check this out.

Cheers!

Scott Heaberlin said...

Thanks for this; I keep coming back to it as a refresher.

One thing that bit me and at least one other person I know was that Firebug often (always?) adds a nonexistent <tbody> in between <table> and <tr>. If you try to use a GPath like "dom.body.table.tbody.tr.td", it's likely not to work... you have to ignore the tbody and just go with "dom.body.table.tr.td" and you're good to go.

Thanks again!

alexd said...

Just wanted to say thanks for this post - I used it to help me write my own scraper. I've put example code from my scraper up on my blog, I hope it's useful.

Sheldon said...

Can you explain in detail how to write code for you

paul said...

Suffice it to say, thanks for this post. I used it to help me write my own scraper, and I have put example code up on my blog; I hope it is useful.

rakeback said...

This code was really helpful. Thanks a lot.

Armu said...

http://half-wit4u.blogspot.com/2011/01/web-scraping-using-java-api.html